Danny Jesus Diaz
I am a computational protein engineer.
My research consists of developing sequence- and structure-based machine learning frameworks
for identifying stabilizing and functional mutations in proteins. I collaborate extensively with
experimental protein engineers to accelerate the developability and functionality of proteins for
biotechnology applications.
I received my PhD in Chemistry under Dr Andrew Ellington
and Dr Eric Anslyn at the
University of Texas at Austin.
I was an NSF GRFP honorable mention and an IFML fellow.
During my PhD, I was the primary developer of MutCompute:
a machine-learning-as-a-service tool for structure-based ML-guided protein engineering.
Currently, under the co-directors Dr Adam Klivans
and Dr Alex Dimakis, I lead the
Deep Proteins Group at the Institute for Foundations of Machine Learning (IFML). Recently,
I co-founded Intelligent Proteins, LLC, where we use machine learning-guided protein engineering
to develop protein-based biotechnologies for therapeutic (biologics) and biomanufacturing
applications.
Email / CV / Bio / LinkedIn / Google Scholar / Twitter / GitHub
|
Research
I'm interested in protein engineering, machine learning, computer vision,
biocatalysis, cancer metabolism, rare metabolic diseases, and startup/entrepreneurship.
My research consists of training machine learning algorithms to understand how molecular
interactions manifest as protein and cellular phenotypes.
Representative papers are highlighted.
|
Machine Learning Papers
|
Binding Oracle: Fine-Tuning From Stability to Binding Free Energy
Chengyue Gong,
Adam R Klivans,
Jordan Wells,
James Loy,
Qiang Liu,
Alexandros G. Dimakis,
Daniel J Diaz
NeurIPS GenBio Workshop Spotlight, 2023
Fine-tuning machine learning frameworks on a small experimental dataset is prone to overfitting.
Here, we present Binding Oracle: a Graph-Transformer framework that fine-tunes Stability Oracle
to predict the ΔΔG of binding for protein-protein interfaces (PPIs) via a technique we call Selective LoRA.
Selective LoRA uses the gradient norms of each layer to select the subset of layers most sensitive to the
fine-tuning dataset (here, a curated subset of Skempi2.0, B1816) and then fine-tunes the
selected layers with LoRA. By applying Selective LoRA to Stability Oracle, we achieve
state-of-the-art performance on the S487 PPI test set and generalize across different types of PPIs.
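The two steps of Selective LoRA described above (rank layers by gradient norm, then adapt only the chosen layers with a low-rank update) can be illustrated with a toy, dependency-free sketch. It is not the Binding Oracle implementation: the top-k selection rule, the layer names, the toy gradients, and the helper names `grad_norm`, `select_layers`, and `lora_weight` are all assumptions made for illustration.

```python
import math


def grad_norm(grad):
    """L2 (Frobenius) norm of a gradient stored as a list of rows."""
    return math.sqrt(sum(g * g for row in grad for g in row))


def select_layers(layer_grads, k):
    """Return the names of the k layers with the largest gradient norm.

    Assumption: "most sensitive" is read here as "largest gradient norm".
    """
    ranked = sorted(layer_grads, key=lambda name: grad_norm(layer_grads[name]), reverse=True)
    return ranked[:k]


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, a, b, alpha):
    """Effective weight W + (alpha / r) * B @ A, where r is the LoRA rank.

    W stays frozen; only the low-rank factors A (r x in) and B (out x r) train.
    """
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)] for w_row, d_row in zip(w, delta)]


# Hypothetical per-layer gradients collected on the fine-tuning dataset.
layer_grads = {
    "layer0": [[0.1, 0.0], [0.0, 0.1]],
    "layer1": [[2.0, 1.0], [0.5, 0.0]],
    "layer2": [[0.3, 0.4], [0.0, 0.0]],
}
selected = select_layers(layer_grads, k=2)

# Rank-1 LoRA factors for one selected layer with a 2 x 2 frozen weight.
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[0.5, 0.5]]                  # r x in
b_zero = [[0.0], [0.0]]           # out x r, zero-initialised: no change to W
b_trained = [[1.0], [0.0]]        # hypothetical values after some training

print(selected)
print(lora_weight(w, a, b_zero, alpha=2.0))
print(lora_weight(w, a, b_trained, alpha=2.0))
```

Zero-initialising `b_zero` means the adapted layer starts out identical to the pre-trained one, which is the usual LoRA convention; layers that are not selected are simply left frozen.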
|
Predicting a Protein's Stability under a Million Mutations
Jeffrey Ouyang-Zhang,
Daniel J Diaz,
Adam R Klivans,
Philipp Krahenbuhl
NeurIPS, 2023
The Mutate Everything framework allows the fine-tuning of sequence-based (ESM2) and MSA-based
(AlphaFold2) protein foundation models on phenotype data with parallel decoding. Here, we
demonstrate how their representations can be fine-tuned on the cDNA-display proteolysis
dataset with the Mutate Everything framework to predict the thermodynamic impact of
single point mutations (ΔΔG). The AlphaFold2 fine-tuned model, StabilityFold, achieves
results similar to Stability Oracle on a variety of test sets. More importantly, the Mutate Everything
framework allows for parallel decoding of single and higher-order amino acid substitutions into ΔΔG
predictions.
This capability not only enables rapid DMS inference for proteins but also makes double-mutant
DMS inference computationally tractable.
|
Stability Oracle: A Structure-Based Graph-Transformer for Identifying Stabilizing Mutations
Daniel J Diaz,
Chengyue Gong,
Jeffrey Ouyang-Zhang,
James M Loy,
Jordan Wells,
David Yang,
Andrew D Ellington,
Alexandros G Dimakis,
Adam R Klivans
BioRxiv, 2023
Stability Oracle is a Graph-Transformer framework that is first pre-trained with self-supervision on
the MutComputeX dataset and then fine-tuned on a curated subset of
the cDNA-display proteolysis dataset.
We also present Thermodynamic Permutations: a thermodynamically valid data augmentation technique
that balances mutation type sampling and the ΔΔG distribution for training and test sets.
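One way to read "thermodynamically valid" is that ΔΔG is a state function, so two measurements from the same wildtype residue imply the ΔΔG between the two mutants. The sketch below illustrates only that arithmetic. It assumes the augmentation works this way, and the function name, input layout, and ΔΔG values are hypothetical, not taken from the paper.

```python
from itertools import permutations


def thermodynamic_permutations(measured):
    """Derive mutant-to-mutant ddG values at a single residue position.

    measured maps each mutant amino acid to ddG(wildtype -> mutant).
    Because ddG is a state function (assumption used here):
        ddG(src -> dst) = ddG(wildtype -> dst) - ddG(wildtype -> src)
    """
    augmented = {}
    for src, dst in permutations(measured, 2):
        augmented[(src, dst)] = measured[dst] - measured[src]
    return augmented


# Hypothetical measurements at one position (arbitrary units and sign convention).
measured = {"A": 1.0, "L": -0.5, "K": 2.0}
augmented = thermodynamic_permutations(measured)

for (src, dst), ddg in sorted(augmented.items()):
    print(f"{src} -> {dst}: {ddg:+.1f}")
```

With n measured mutants at a position, this yields n(n - 1) derived ordered pairs. Every derived pair has a reverse with the opposite sign, which is one way such an augmentation could even out both the mutation types seen during training and the sign of the ΔΔG distribution.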
|
Hotprotein: A novel framework for protein thermostability prediction and editing
Tianlong Chen,
Chengyue Gong,
Daniel J Diaz,
Xuxi Chen,
Jordan Tyler Wells,
Zhangyang Wang,
Andrew Ellington,
Alex Dimakis,
Adam Klivans
ICLR, 2023
We curated an organism-based temperature dataset (HotProteins) for distinguishing proteins
with varying thermostability (cryophiles, psychrophiles, mesophiles, thermophiles, and
hyperthermophiles).
We proposed structure-aware pretraining (SAP) and factorized sparse tuning (FST) to
fine-tune the representations of the sequence-based transformer ESM-1b and build a classifier and a
regressor that predict a protein's organism class or growth temperature.
|
Two sequence- and two structure-based ML models have learned different aspects of protein
biochemistry
Anastasiya V Kulikova,
Daniel J Diaz,
Tianlong Chen,
Jeffrey Cole,
Andrew D Ellington,
Claus O Wilke
Scientific Reports, 2023
We compare and contrast self-supervised sequence-based transformers and structure-based 3DCNN
models.
We find that there is a bias-variance tradeoff between the two protein modalities.
Convolutions provide an inductive bias for protein structures, whereas the
more powerful sequence-based transformers demonstrate increased variance.
|
Learning the Local Landscape of Protein Structures with Convolutional Neural Networks
Anastasiya V Kulikova,
Daniel J Diaz,
James M Loy,
Andrew D Ellington,
Claus O Wilke
Journal of Biological Physics, 2021
We compare how self-supervised 3DCNNs learn the local mutational landscape of proteins
against evolution as captured by Multiple Sequence Alignments (MSAs). We find that the amino acid
likelihoods of structure-based 3DCNNs correlate weakly with MSAs and that their wildtype confidence
depends on the structural position of the residue,
with core residues being more confidently predicted.
|
Protein Papers
|
Synthetic microbial sensing and biosynthesis of amaryllidaceae alkaloids
Simon d'Oelsnitz,
Daniel J Diaz,
Daniel J Acosta,
Mason W Schechter,
Matthew B Minus,
James R Howard,
Hannah Do,
James Loy,
Hal Alper,
Andrew D Ellington
BioRxiv, 2023
We engineered a transcription factor and a methyltransferase to improve the
regioselectivity and titer for production of 4O-methyl-norbelladine.
To engineer the 4O-methyltransferase enzyme, we developed MutComputeX:
a self-supervised 3DResNet trained to generalize to protein-ligand, protein-nucleotide, and protein-protein
interfaces.
MutComputeX was used to design mutations on a computational ternary structure of
the AlphaFold-predicted methyltransferase with SAH and norbelladine docked with Gnina.
This is the first time three machine learning models (AlphaFold, Gnina, MutComputeX) were
combined to engineer the surface and active site of an enzyme and paired with an
engineered transcription factor for high-throughput screening.
|
Machine learning-aided engineering of hydrolases for PET depolymerization
Hongyuan Lu,
Daniel J Diaz,
Natalie J Czarnecki,
Congzhi Zhu,
Wantae Kim,
Raghav Shroff,
Daniel J Acosta,
Bradley R Alexander,
Hannah O Cole,
Yan Zhang,
Nathaniel A Lynd,
Andrew D Ellington,
Hal S Alper
Nature, 2022
We utilized MutCompute to guide the engineering of a mesophilic and a thermophilic PET hydrolase.
We examined the ability of the mesophilic PET hydrolase (FAST-PETase) to depolymerize post-consumer
PET waste.
FAST-PETase was capable of depolymerizing ~50 post-consumer PET waste products within 2-4 days. Furthermore,
the ML designs
increased the depolymerization capacity of the thermophilic PET hydrolase (ICCM) by 100%.
Here is a time-lapse video of the depolymerization of a
full PET container from Walmart.
|
Improved Bst DNA Polymerase Variants Derived via a Machine Learning Approach
Inyup Paik,
Phuoc HT Ngo,
Raghav Shroff,
Daniel J Diaz,
Andre C Maranhao,
David JF Walker,
Sanchita Bhadra,
Andrew D Ellington
Biochemistry, 2021
Bst polymerase was stabilized via ML-guided protein engineering in order to shorten the diagnostic
time of LAMP-OSD assays during the height of the COVID-19 pandemic. LAMP-OSD is an isothermal DNA
amplification technique that enables field diagnosis of COVID-19 in resource-poor settings.
|
GroovDB: A database of ligand-inducible transcription factors
Simon d'Oelsnitz,
Joshua D Love,
Daniel J Diaz,
Andrew D Ellington
ACS SynBio, 2022
A database of ligand-inducible transcription factors.
These transcription factors serve as starting points for genetic biosensors used in
high-throughput screening.
|
Using machine learning to predict the effects and consequences of mutations in proteins
Daniel J Diaz,
Anastasiya V Kulikova,
Andrew D Ellington,
Claus O Wilke
Current Opinion in Structural Biology, 2023
A review of state-of-the-art machine learning frameworks (as of July 2022) for characterizing
the functional and stability effects of point mutations on proteins.
|
Discovery of novel gain-of-function mutations guided by structure-based deep learning
Raghav Shroff,
Austin W Cole,
Daniel J Diaz,
Barrett R Morrow,
Isaac Donnell,
Ankur Annapareddy,
Jimmy Gollihar,
Andrew D Ellington,
Ross Thyer
ACS SynBio, 2020
The development and initial experimental validation of the MutCompute framework: a self-supervised
3DCNN trained on the local chemistry surrounding an amino acid. The model was experimentally
characterized for its ability to identify residues where the wildtype amino acid is incongruent with
its surrounding chemical environment (protein only) and primed for gain-of-function.
Here, BFP, phosphomannose isomerase, and beta-lactamase were engineered via machine learning.
|
Other Papers
|
Pushing Differential Sensing Further: The Next Steps in Design and Analysis of Bio-Inspired
Cross-Reactive Arrays
Hazel A Fargher,
Simon d'Oelsnitz,
Daniel J Diaz,
Eric V Anslyn
Analysis & Sensing, 2023
A perspective on the future technology developments of differential sensing.
|