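Thomas Pierrot and colleagues at the UK company InstaDeep have built and evaluated robust foundation models for human genomics. The paper was published online on 28 November 2024 in the international journal Nature Methods.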
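The researchers present an extensive study of foundation models pre-trained on DNA sequences, named Nucleotide Transformer. The models range from 50 million to 2.5 billion parameters and integrate information from 3,202 human genomes and 850 genomes from diverse species. These transformer models yield context-specific representations of nucleotide sequences, which allow accurate predictions even in low-data settings.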
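The researchers show that the developed models can be fine-tuned at low cost to solve a variety of genomics applications. Despite receiving no supervision, the models learned to focus attention on key genomic elements and can be used to improve the prioritization of genetic variants. Training and applying foundation models in genomics provides a widely applicable approach for accurately predicting molecular phenotypes from DNA sequence.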
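According to the authors, predicting molecular phenotypes from DNA sequences remains a longstanding challenge in genomics, often limited by scarce annotated data and the inability to transfer learning between tasks.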
Appendix: original English text
Title: Nucleotide Transformer: building and evaluating robust foundation models for human genomics
Author: Dalla-Torre, Hugo, Gonzalez, Liam, Mendoza-Revilla, Javier, Lopez Carranza, Nicolas, Grzywaczewski, Adam Henryk, Oteri, Francesco, Dallago, Christian, Trop, Evan, de Almeida, Bernardo P., Sirelkhatim, Hassan, Richard, Guillaume, Skwark, Marcin, Beguir, Karim, Lopez, Marie, Pierrot, Thomas
Issue&Volume: 2024-11-28
Abstract: The prediction of molecular phenotypes from DNA sequences remains a longstanding challenge in genomics, often driven by limited annotated data and the inability to transfer learnings between tasks. Here, we present an extensive study of foundation models pre-trained on DNA sequences, named Nucleotide Transformer, ranging from 50 million up to 2.5 billion parameters and integrating information from 3,202 human genomes and 850 genomes from diverse species. These transformer models yield context-specific representations of nucleotide sequences, which allow for accurate predictions even in low-data settings. We show that the developed models can be fine-tuned at low cost to solve a variety of genomics applications. Despite no supervision, the models learned to focus attention on key genomic elements and can be used to improve the prioritization of genetic variants. The training and application of foundational models in genomics provides a widely applicable approach for accurate molecular phenotype prediction from DNA sequence.
DOI: 10.1038/s41592-024-02523-z
Source: https://www.nature.com/articles/s41592-024-02523-z
Nature Methods: launched in 2004 and published by Springer Nature. Latest IF: 47.99
Official website: https://www.nature.com/nmeth/
Submission link: https://mts-nmeth.nature.com/cgi-bin/main.plex