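Brian L. Hie's group at the Arc Institute in the United States has reported a new advance. Their study uses Evo 2 for genome modelling and design across all domains of life. The paper was published in Nature on 4 March 2026.
Although tools for genome sequencing, synthesis and editing have transformed biological research, researchers still lack sufficient understanding of the immense complexity encoded by genomes to predict the effects of many classes of genomic changes or to intelligently compose new biological systems. Artificial intelligence models that learn from genomic sequences across diverse organisms have increasingly advanced prediction and design capabilities.
In this work, the team introduces Evo 2, a biological foundation model trained on 9 trillion DNA base pairs from a highly curated genomic atlas spanning all domains of life, with a 1 million token context window at single-nucleotide resolution. Evo 2 learns to accurately predict the functional impacts of genetic variation, from noncoding pathogenic mutations to clinically significant BRCA1 variants, without task-specific fine-tuning. Mechanistic interpretability analyses show that Evo 2 learns representations associated with biological features, including exon-intron boundaries, transcription factor binding sites, protein structural elements and prophage genomic regions. The generative abilities of Evo 2 produce mitochondrial, prokaryotic and eukaryotic sequences at genome scale with greater naturalness and coherence than previous methods. When guided by predictive models and inference-time search, Evo 2 also generates experimentally validated chromatin accessibility patterns. The team has made Evo 2 fully open, including model parameters, training code, inference code and the OpenGenome2 dataset, to accelerate the exploration and design of biological complexity.
As background, all of life encodes information with DNA.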
Appendix: original English text
Title: Genome modelling and design across all domains of life with Evo 2
Author: Brixi, Garyk; Durrant, Matthew G.; Ku, Jerome; Naghipourfar, Mohsen; Poli, Michael; Sun, Gwanggyu; Brockman, Greg; Chang, Daniel; Fanton, Alison; Gonzalez, Gabriel A.; King, Samuel H.; Li, David B.; Merchant, Aditi T.; Nguyen, Eric; Ricci-Tam, Chiara; Romero, David W.; Schmok, Jonathan C.; Taghibakhshi, Ali; Vorontsov, Anton; Yang, Brandon; Deng, Myra; Gorton, Liv; Nguyen, Nam; Wang, Nicholas K.; Pearce, Michael T.; Simon, Elana; Adams, Etowah; Amador, Zachary J.; Ashley, Euan A.; Baccus, Stephen A.; Dai, Haoyu; Dillmann, Steven; Ermon, Stefano; Guo, Daniel; Herschl, Michael H.; Ilango, Rajesh; Janik, Ken; Lu, Amy X.; Mehta, Reshma; Mofrad, Mohammad R. K.; Ng, Madelena Y.; Pannu, Jaspreet; Ré, Christopher; St. John, John; Sullivan, Jeremy; Tey, Joseph; Viggiano, Ben; Zhu, Kevin; Zynda, Greg; Balsam, Daniel; Collison, Patrick; Costa, Anthony B.; Hernandez-Boussard, Tina; Ho, Eric; Liu, Ming-Yu; McGrath, Thomas; Powell, Kimberly; Pinglay, Sudarshan; Burke, Dave P.
Issue&Volume: 2026-03-04
Abstract: All of life encodes information with DNA. Although tools for genome sequencing, synthesis and editing have transformed biological research, we still lack sufficient understanding of the immense complexity encoded by genomes to predict the effects of many classes of genomic changes or to intelligently compose new biological systems. Artificial intelligence models that learn information from genomic sequences across diverse organisms have increasingly advanced prediction and design capabilities1,2. Here we introduce Evo 2, a biological foundation model trained on 9 trillion DNA base pairs from a highly curated genomic atlas spanning all domains of life to have a 1 million token context window with single-nucleotide resolution. Evo 2 learns to accurately predict the functional impacts of genetic variation—from noncoding pathogenic mutations to clinically significant BRCA1 variants—without task-specific fine-tuning. Mechanistic interpretability analyses reveal that Evo 2 learns representations associated with biological features, including exon–intron boundaries, transcription factor binding sites, protein structural elements and prophage genomic regions. The generative abilities of Evo 2 produce mitochondrial, prokaryotic and eukaryotic sequences at genome scale with greater naturalness and coherence than previous methods. Evo 2 also generates experimentally validated chromatin accessibility patterns when guided by predictive models3,4 and inference-time search. We have made Evo 2 fully open, including model parameters, training code5, inference code and the OpenGenome2 dataset, to accelerate the exploration and design of biological complexity.
DOI: 10.1038/s41586-026-10176-5
Source: https://www.nature.com/articles/s41586-026-10176-5
Nature: founded in 1869 and published by the Springer Nature group. Latest impact factor (IF): 69.504
Official website: http://www.nature.com/
Submission link: http://www.nature.com/authors/submit_manuscript.html
