Danny Jesus Diaz, PhD
I am a computational protein engineer.
My research consists of developing sequence- and structure-based machine learning frameworks
for identifying stabilizing and functional mutations in proteins. I collaborate extensively with
experimental protein engineers to improve the developability and functionality of proteins for
biotechnology applications.
I received my PhD in Chemistry under Dr. Andrew Ellington
and Dr. Eric Anslyn at the
University of Texas at Austin.
During my PhD, I was the primary developer of MutCompute:
a machine-learning-as-a-service tool for structure-based, ML-guided protein engineering.
Currently, I lead the Deep Proteins Group at
the Institute for Foundations of Machine Learning (IFML).
I co-founded Intelligent Proteins, LLC, where we use machine learning-guided protein engineering
to develop protein-based biotechnologies for nutraceutical, therapeutic, and biomanufacturing
applications.
Email / CV / Bio / LinkedIn / Google Scholar / Twitter / Github
|
Research
I'm interested in protein engineering, machine learning, computer vision, biocatalysis,
cancer metabolism, rare metabolic diseases, automation, and startups/entrepreneurship.
My research consists of developing machine learning frameworks that leverage sequence, structure, and/or functional data for enzyme discovery and protein engineering applications.
Representative papers are highlighted.
|
Machine Learning Papers
|
A Systematic Evaluation of The Language-of-Viral-Escape Model Using Multiple Machine Learning Frameworks
Brent Allman,
Luiz Vieira,
Daniel J Diaz,
Claus O Wilke,
bioRxiv, 2024
It is critical to rapidly identify mutations with the potential for immune escape or
increased disease burden (variants of concern). A recent study
proposed that viral variants-of-concern can be identified using two quantities extracted from protein language models: grammaticality and semantic change.
Grammaticality is intended to be a measure of whether a viral protein variant is viable, and semantic change is
intended to be a measure of the variant's potential for immune escape. Here, we systematically test this hypothesis, taking advantage of
several high-throughput datasets that have become available since the original study, and also evaluating additional machine learning models for
calculating the grammaticality and semantic change metrics. We find that grammaticality correlates with protein viability, though the
more traditional metric, ΔΔG, appears to be more effective. By contrast, we do not find compelling evidence that the semantic change metric
can effectively identify immune escape mutations.
|
Evolution-Inspired Loss Functions for Protein Representation Learning
Chengyue Gong,
Adam Klivans,
James Madigan Loy,
Tianlong Chen,
Qiang Liu,
Daniel J Diaz
International Conference on Machine Learning, 2024
Current protein representation learning methods primarily rely on BERT- or GPT-style self-supervised learning
and use wildtype accuracy as the primary training/validation metric. Wildtype accuracy, however, does not align
with the primary goal of protein engineering: to suggest beneficial mutations rather than to identify
what already appears in nature. To address this gap between the pre-training objectives and protein engineering downstream tasks,
we present Evolutionary Ranking (EvoRank): a training objective that incorporates evolutionary information derived from
multiple sequence alignments (MSAs) to learn protein representations specific for protein engineering applications.
Across a variety of phenotypes and datasets, we demonstrate that an EvoRank pre-trained graph-transformer (MutRank) results
in significant zero-shot performance improvements that are competitive with ML frameworks
fine-tuned on experimental data. This is particularly important in protein engineering, where it is expensive to obtain data for fine-tuning.
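For intuition, here is a minimal sketch of one way an MSA-derived ranking objective could look; the loss form, tensor shapes, and margin below are illustrative assumptions of mine, not the exact EvoRank formulation from the paper.

```python
# Sketch (assumption, not the paper's exact loss): a pairwise ranking objective
# that pushes a model's per-residue amino-acid logits to agree with the
# ordering implied by MSA column frequencies.
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(logits, msa_freqs, margin=0.5):
    """
    logits:    (L, 20) model scores for each amino acid at each position
    msa_freqs: (L, 20) amino-acid frequencies from each MSA column
    """
    # Pairwise score and frequency differences for all amino-acid pairs (i, j)
    diff_scores = logits.unsqueeze(2) - logits.unsqueeze(1)       # (L, 20, 20)
    diff_freqs = msa_freqs.unsqueeze(2) - msa_freqs.unsqueeze(1)  # (L, 20, 20)

    # Only penalize pairs that are ranked opposite to the MSA-derived ordering
    target = (diff_freqs > 0).float()
    loss = F.relu(margin - diff_scores) * target
    return loss.sum() / target.sum().clamp(min=1)

# Example with random tensors for a 50-residue protein
logits = torch.randn(50, 20)
msa_freqs = torch.softmax(torch.randn(50, 20), dim=-1)
loss = pairwise_ranking_loss(logits, msa_freqs)
```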
|
Stability Oracle: A Structure-Based Graph-Transformer for Identifying Stabilizing Mutations
Daniel J Diaz,
Chengyue Gong,
Jeffrey Ouyang-Zhang,
James M Loy,
Jordan Wells,
David Yang,
Andrew D Ellington,
Alexandros G Dimakis,
Adam R Klivans
Nature Communications, 2024
Stability Oracle is a structure-based graph-transformer framework that is first pre-trained with BERT-style self-supervision on
the MutComputeX dataset and then fine-tuned on a curated subset of
the megascale cDNA-display proteolysis dataset.
Here, we introduce several innovations to overcome well-known challenges in data scarcity
and bias, generalization, and computation time, such as: Thermodynamic Permutations for data augmentation,
structural amino acid embeddings to model a mutation with a single structure,
and a protein structure-specific attention-bias mechanism that makes graph transformers a viable alternative
to graph neural networks.
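As an illustration of the Thermodynamic Permutations idea (the data layout and function below are my own simplification, not the released code): because ΔΔG is a state function, measurements taken from a common wildtype reference at one position can be recombined into additional thermodynamically valid mutation pairs.

```python
# Sketch of Thermodynamic Permutations-style data augmentation: derive reverse
# and mutant-to-mutant ddG values from mutations measured against the wildtype.
from itertools import permutations

def thermodynamic_permutations(ddg_from_wt, wt_aa):
    """
    ddg_from_wt: dict mapping mutant amino acid -> measured ddG (wt -> mutant)
    wt_aa:       the wildtype amino acid at this position
    Returns (from_aa, to_aa, ddG) tuples for every ordered pair of states.
    """
    states = {wt_aa: 0.0, **ddg_from_wt}  # energies relative to wildtype
    return [(a, b, states[b] - states[a]) for a, b in permutations(states, 2)]

# Example: two measured mutations at one site yield 3 * 2 = 6 usable pairs.
pairs = thermodynamic_permutations({"A": 1.2, "G": -0.4}, wt_aa="L")
```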
|
Binding Oracle: Fine-Tuning From Stability to Binding Free Energy
Chengyue Gong,
Adam R Klivans,
Jordan Wells,
James Loy,
Qiang Liu,
Alexandros G Dimakis,
Daniel J Diaz,
NeurIPS GenBio Workshop Spotlight, 2023
Fine-tuning machine learning frameworks to a small experimental dataset is prone to overfitting.
Here, we present Binding Oracle: a Graph-Transformer framework that fine-tunes Stability Oracle
to ΔΔG of binding for protein-protein interfaces (PPI) via a technique we call Selective LoRA.
Selective LoRA uses the gradient norms of each layer to select the subset most sensitive to the
fine-tuning dataset, here a curated subset of SKEMPI 2.0 (B1816), and then fine-tunes the
selected layers with LoRA. By applying Selective LoRA to Stability Oracle, we achieve
SOTA performance on the S487 PPI test set and generalization between different types of PPI interfaces.
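A rough sketch of the layer-selection step as I understand it (the function names, selection granularity, and LoRA wrapper below are illustrative assumptions, not the authors' implementation):

```python
# Sketch of Selective LoRA: rank parameters by gradient norm on the
# fine-tuning set, then attach low-rank adapters only to the most sensitive.
import torch
import torch.nn as nn

def rank_layers_by_grad_norm(model, loss_fn, dataloader, top_k=4):
    """Accumulate gradient norms for each named parameter over the fine-tuning
    set and return the names of the top_k most sensitive parameters."""
    model.zero_grad()
    for batch in dataloader:
        loss = loss_fn(model, batch)   # user-supplied loss closure (assumption)
        loss.backward()                # gradients accumulate across batches
    norms = {name: p.grad.norm().item()
             for name, p in model.named_parameters() if p.grad is not None}
    model.zero_grad()
    return sorted(norms, key=norms.get, reverse=True)[:top_k]

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```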
|
Predicting a Protein's Stability under a Million Mutations
Jeffrey Ouyang-Zhang,
Daniel J Diaz,
Adam R Klivans,
Philipp Krahenbuhl
NeurIPS, 2023
The mutate everything framework allows the fine-tuning of sequence-based (ESM2) and MSA-based
(AlphaFold2) protein foundation models on phenotype data with parallel decoding. Here, we
demonstrate how their
representations can be fine-tuned on the
cDNA-display proteolysis
dataset with the mutate everything framework to predict the thermodynamic impact of
single point mutations (ΔΔG). The AlphaFold2 fine-tuned model, StabilityFold, achieves
similar results to Stability Oracle on a variety of test sets. More importantly, the mutate everything
framework allows for parallel decoding of single and higher-order amino acid substitutions into ΔΔG
predictions.
This capability not only enables rapid DMS inference for proteins but also makes double-mutant
DMS inference computationally tractable.
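To illustrate the parallel-decoding idea, here is a simplified sketch under my own assumptions about the head architecture (not the paper's exact decoder): a light head maps every residue embedding to 20 ΔΔG predictions at once, so a full single-mutant DMS needs only one forward pass through the backbone.

```python
# Illustrative parallel ddG head over backbone residue embeddings (assumption).
import torch.nn as nn

class ParallelDDGHead(nn.Module):
    def __init__(self, d_model: int, n_aa: int = 20):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, n_aa)
        )

    def forward(self, residue_embeddings):
        # residue_embeddings: (L, d_model) from a sequence or structure model
        return self.proj(residue_embeddings)  # (L, 20) ddG per substitution
```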
|
HotProtein: A novel framework for protein thermostability prediction and editing
Tianlong Chen,
Chengyue Gong,
Daniel J Diaz,
Xuxi Chen,
Jordan Tyler Wells,
Zhangyang Wang,
Andrew Ellington,
Alexandros G Dimakis,
Adam Klivans
ICLR, 2023
We curated an organism-based temperature dataset (HotProtein) for distinguishing proteins
with varying thermostability (cryophiles, psychrophiles, mesophiles, thermophiles, and
hyperthermophiles).
We proposed structure-aware pretraining (SAP) and factorized sparse tuning (FST) to
fine-tune the representations of the sequence-based transformer ESM-1b into a classifier and
a regressor that predict a protein's organism class or growth temperature.
|
Two sequence- and two structure-based ML models have learned different aspects of protein
biochemistry
Anastasiya V Kulikova,
Daniel J Diaz,
Tianlong Chen,
Jeffrey Cole,
Andrew D Ellington,
Claus O Wilke
Scientific Reports, 2023
We compare and contrast self-supervised sequence-based transformer and structure-based 3DCNN
models.
We find that there is a bias-variance tradeoff between the two protein modalities:
convolutions provide an inductive bias for protein structures, whereas the
more powerful sequence-based transformers demonstrate increased variance.
|
Learning the Local Landscape of Protein Structures with Convolutional Neural Networks
Anastasiya V Kulikova,
Daniel J Diaz,
James M Loy,
Andrew D Ellington,
Claus O Wilke
Journal of Biological Physics, 2021
We compare how self-supervised 3DCNNs learn the local mutational landscape of proteins
against evolution via multiple sequence alignments (MSAs). We find that structure-based 3DCNN
amino acid likelihoods correlate weakly with MSAs and that wildtype confidence
depends on the structural position of the residue,
with core residues being predicted more confidently.
|
Protein Papers
|
Engineering a photoenzyme to use red light
Jose M. Carceller,
Bhumika Jayee,
Claire G. Page,
Daniel G. Oblinsky,
Gustavo Mondragón-Solórzano,
Nithin Chintala,
Jingzhe Cao,
Zayed Alassad,
Zheyu Zhang,
Nathaniel White,
Daniel J. Diaz,
Andrew D. Ellington,
Gregory D. Scholes,
Sijia S. Dong,
Todd K. Hyster
Cell Chem, 2024
Previously, we engineered an ene-reductase (ERED) photoenzyme capable of
asymmetric synthesis of α-chloroamides under blue light using directed evolution
and machine learning-guided protein engineering with MutComputeX.
Here, we conduct allosteric tuning of the electronic structure of the cofactor-substrate complex using
directed evolution and MutComputeX to dramatically increase catalysis under red light (99% yields).
Computational studies show a different electron transition for cyan and red light and
how mutations at the protein surface allosterically tune the active site complex.
|
Biosensor and machine learning-aided engineering of an amaryllidaceae enzyme
Simon d'Oelsnitz,
Daniel J Diaz,
Daniel J Acosta,
Tyler L Dangerfield,
Mason W Schechter,
Matthew B Minus,
James R Howard,
Hannah Do,
James Loy,
Hal Alper,
Andrew D Ellington
Nature Communications, 2024
We engineered a transcription factor and a methyltransferase to improve the
regioselectivity and titer for production of 4O-methyl-norbelladine.
To engineer the 4O-methyltransferase enzyme, we developed MutComputeX:
a self-supervised 3DResNet trained to generalize to protein-ligand, -nucleotide, and -protein
interfaces.
MutComputeX was used to design mutations on a computational ternary structure of
the AlphaFold-modeled methyltransferase
with SAH and norbelladine docked
with Gnina.
This is the first time three machine learning models (AlphaFold, Gnina, MutComputeX) were
synergized to engineer the surface and active site of an enzyme and combined with an
engineered transcription factor for high-throughput screening.
|
Asymmetric Synthesis of α-Chloroamides via Photoenzymatic Hydroalkylation of Olefins
Yi Liu,
Sophie G Bender,
Damien Sorigue,
Daniel J Diaz,
Andrew D Ellington,
Greg Mann,
Simon Allmendinger,
Todd K Hyster
Journal of the American Chemical Society, 2024
We engineered an ene-reductase photoenzyme capable of asymmetric synthesis of α-chloroamides under blue light.
MutComputeX was used to identify several mutations that improved the activity and stereoselectivity
observed under blue light.
|
Machine learning-aided engineering of hydrolases for PET depolymerization
Hongyuan Lu,
Daniel J Diaz,
Natalie J Czarnecki,
Congzhi Zhu,
Wantae Kim,
Raghav Shroff,
Daniel J Acosta,
Bradley R Alexander,
Hannah O Cole,
Yan Zhang,
Nathaniel A Lynd,
Andrew D Ellington,
Hal S Alper
Nature, 2022
We utilized MutCompute to guide the engineering of mesophilic and thermophilic PET hydrolases.
We examined the ability of the mesophilic PET hydrolase (FAST-PETase) to depolymerize post-consumer
PET waste.
FAST-PETase was capable of depolymerizing ~50 post-consumer PET products within 2-4 days. Furthermore,
the ML designs
increased the depolymerization capacity of the thermophilic PET hydrolase (ICCM) by 100%.
Here is a time-lapse video of the depolymerization of a
full PET container from Walmart.
|
Improved Bst DNA Polymerase Variants Derived via a Machine Learning Approach
Inyup Paik,
Phuoc HT Ngo,
Raghav Shroff,
Daniel J Diaz,
Andre C Maranhao,
David JF Walker,
Sanchita Bhadra,
Andrew D Ellington
Biochemistry, 2021
Bst Polymerase was stabilized via ML-guided protein engineering in order to shorten the diagnostic
time of LAMP-OSD assays during the height of the COVID-19 pandemic. LAMP-OSD is an isothermal DNA
amplification technique that enables field diagnostics of COVID-19 in resource-poor settings.
|
GroovDB: A database of ligand-inducible transcription factors
Simon d'Oelsnitz,
Joshua D Love,
Daniel J Diaz,
Andrew D Ellington
ACS SynBio, 2022
A database of ligand-inducible transcription factors.
These transcription factors serve as starting points for engineering genetic biosensors for
high-throughput screening.
|
Discovery of novel gain-of-function mutations guided by structure-based deep learning
Raghav Shroff,
Austin W Cole,
Daniel J Diaz,
Barrett R Morrow,
Isaac Donnell,
Ankur Annapareddy,
Jimmy Gollihar,
Andrew D Ellington,
Ross Thyer
ACS SynBio, 2020
The development and initial experimental validation of the MutCompute framework: a self-supervised
3DCNN trained on the local chemistry surrounding an amino acid. The model was experimentally
characterized for its ability to identify residues where the wildtype amino acid is incongruent with
its surrounding chemical environment (protein only) and primed for gain-of-function.
Here, BFP, phosphomannose isomerase, and beta-lactamase were engineered via machine learning.
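For readers unfamiliar with the setup, here is a minimal sketch of a self-supervised 3D CNN over a voxelized residue microenvironment; the channel count, architecture, and grid size are illustrative assumptions of mine rather than the published MutCompute model.

```python
# Sketch: mask a residue, voxelize the surrounding chemistry, and train a 3D
# CNN to predict the masked amino-acid identity from its local environment.
import torch.nn as nn

class Local3DCNN(nn.Module):
    def __init__(self, n_channels: int = 4, n_aa: int = 20):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(n_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.classifier = nn.Linear(64, n_aa)

    def forward(self, voxels):
        # voxels: (batch, channels, x, y, z) grid of the local microenvironment
        h = self.features(voxels).flatten(1)
        return self.classifier(h)  # logits over the 20 amino acids
```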
|
Reviews
|
Using machine learning to predict the effects and consequences of mutations in proteins
Daniel J Diaz,
Anastasiya V Kulikova,
Andrew D Ellington,
Claus O Wilke
Current Opinion in Structural Biology, 2023
A review of the state-of-the-art machine learning frameworks (as of July 2022) for characterizing
the
functional and stability effects of point mutations on proteins.
|
Pushing Differential Sensing Further: The Next Steps in Design and Analysis of Bio-Inspired
Cross-Reactive Arrays
Hazel A Fargher,
Simon d'Oelsnitz,
Daniel J Diaz,
Eric V Anslyn
Analysis & Sensing, 2023
A perspective on the future technology developments of differential sensing.
|
Patents
|
Mutations for improving activity and thermostability of PETase enzymes
Hongyuan Lu,
Daniel J Diaz,
Hannah Cole,
Raghav Shroff,
Andrew D Ellington,
Hal Alper
WO Patent WO2022076380A2
ML-engineered PETase variants, including the FAST-PETase variant. Discovery of the N233K ML-design
that replaces Ca2+ dependence for stability and activity with a lysine cation.
|
Leaf-branch compost cutinase mutants
Hongyuan Lu,
Daniel J Diaz,
Andrew D Ellington,
Hal Alper
WO Patent WO2023154690A2
ML-engineered Cutinase variant, which makes use of the N233K ML-design
that replaces Ca2+ dependence for stability and activity with a lysine cation.
|
Engineered human serine dehydratase enzymes and methods for treating cancer
Everett Stone,
Ebru Cayir,
Daniel J Diaz,
Raghav Shroff
WO Patent WO2024006973A1
ML-engineered variants of the human serine dehydratase enzyme for the treatment of luminal breast cancer.
|
Recombinant proteins with increased solubility and stability
Andrew Ellington,
Inyup Paik,
Andre Maranhao,
Sanchita Bhadra,
David Walker,
Daniel J Diaz,
Ngo Phuoc
US Patent US20240011000A1
ML-engineered Bst DNA Polymerase variants with increased solubility and stability for LAMP-OSD assays for isothermal COVID-19 diagnostics.
|
Methods and compositions related to modified methyltransferases and engineered biosensors
Andrew Ellington,
Daniel J Diaz,
Simon d'Oelsnitz
US Patent App. 63/493,065
ML-engineered Norbelladine 4O-MethylTransferase variants for biomanufacturing of the galantamine intermediate 4O-methyl-norbelladine.
This is the first ML-engineered enzyme where the ML-designs were conditioned on the active-site ligand and cofactor.
ML-designs were generated using a computational ternary structure: protein (AlphaFold2), SAH cofactor (GNINA-docking),
and norbelladine substrate (GNINA-docking).