Despite the vast accumulation of genomic data, the RNA regulatory code remains poorly understood. Genomic foundation models, pre-trained on large datasets, can adapt RNA representations for biological prediction tasks. However, current models rely on training strategies such as masked language modeling and next-token prediction, which are borrowed from domains such as text and vision and lack biological insight. Experimental methods like eCLIP and ribosome profiling help study RNA regulation but are expensive and time-consuming. Machine learning models trained on genetic sequences provide an efficient, cost-effective alternative for predicting essential cellular processes like alternative splicing and RNA degradation.
Recent research proposes using foundation models in genomics, employing self-supervised learning (SSL) to train on unlabeled data so that the models generalize well across tasks with fewer labeled samples. Genomic sequences present challenges because evolutionary constraint leaves them with low diversity and high mutual information. Consequently, SSL models often reconstruct non-informative parts of the genome, leading to ineffective representations for RNA prediction tasks. Despite improvements in model scaling, the performance gap between SSL-based approaches and supervised learning remains wide, indicating the need for better strategies in genomic modeling.
Researchers from institutions including the Vector Institute and the University of Toronto have introduced Orthrus, an RNA foundation model pre-trained using a contrastive learning objective with biological augmentations. Orthrus maximizes the similarity between RNA transcripts from splice isoforms and orthologous genes across species, using data from 10 model organisms and over 400 mammalian species in the Zoonomia Project. By leveraging functional and evolutionary relationships, Orthrus significantly outperforms existing genomic models on mRNA property prediction tasks. The model excels in low-data environments, requiring minimal fine-tuning to achieve state-of-the-art performance in RNA property predictions.
The study applies contrastive learning to RNA splicing and orthology using a modified InfoNCE loss. RNA isoforms and orthologous sequences are paired as functionally similar examples, and the model is trained to minimize the loss over these pairs. The research introduces four augmentations: alternative splicing across species, orthologous transcripts from over 400 species, gene identity-based orthology, and masked sequence inputs. The Mamba encoder, a state-space model optimized for long sequences, is used to learn from RNA data. Evaluation tasks include RNA half-life, ribosome load, protein localization, and gene ontology classification, using various datasets for performance comparison.
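To make the pairing idea concrete, here is a minimal sketch of how positive pairs might be sampled from groups of related transcripts and then masked. It is an illustration under stated assumptions, not the authors' pipeline: the function names, the toy sequences, the 15% mask rate, and the use of "N" as a mask token are all hypothetical.

```python
import random


def mask_sequence(seq, mask_rate, rng, mask_token="N"):
    """Replace each nucleotide with mask_token with probability mask_rate."""
    return "".join(mask_token if rng.random() < mask_rate else base for base in seq)


def sample_positive_pair(groups, rng, mask_rate=0.15):
    """Pick a gene group, then two distinct transcripts from it as a positive pair.

    `groups` maps a gene identifier to transcripts assumed to be functionally
    related (splice isoforms or orthologs). The mask rate is an illustrative
    value, not one taken from the paper.
    """
    # Only groups with at least two transcripts can yield a positive pair.
    eligible = [gene for gene, transcripts in groups.items() if len(transcripts) >= 2]
    gene = rng.choice(eligible)
    anchor, positive = rng.sample(groups[gene], 2)
    return gene, mask_sequence(anchor, mask_rate, rng), mask_sequence(positive, mask_rate, rng)


# Toy data: made-up sequences standing in for isoforms and orthologs of each gene.
groups = {
    "GENE_A": ["AUGGCCAUUGUAAUGGGCC", "AUGGCCAUUGGGCC", "AUGGCUAUUGUAAUGGGCC"],
    "GENE_B": ["AUGCCCGGGAAAUUU", "AUGCCCGGGUUU"],
    "GENE_C": ["AUGAAACCC"],  # a single transcript cannot form a positive pair
}

rng = random.Random(0)
gene, view_a, view_b = sample_positive_pair(groups, rng)
print(gene, view_a, view_b)
```

In this sketch, transcripts from other genes in the same batch would serve as the negatives for each sampled pair.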
Orthrus employs contrastive learning to build a structured representation of RNA transcripts, increasing the similarity between functionally related sequences while decreasing it for unrelated ones. The training dataset is constructed by pairing transcripts based on alternative splicing and orthologous relationships, on the assumption that these pairs are functionally closer than random ones. Orthrus processes RNA sequences through the Mamba encoder and applies a decoupled contrastive learning (DCL) loss to distinguish between related and unrelated pairs. Results show Orthrus outperforms other self-supervised models in predicting RNA properties, demonstrating its effectiveness in tasks like RNA half-life prediction and gene classification.
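The difference between a standard InfoNCE loss and a decoupled variant can be sketched in a few lines. The version below follows the general DCL formulation, in which the positive pair is dropped from the denominator. The exact loss used in Orthrus may differ, and the cosine similarity, the temperature of 0.1, and the toy two-dimensional embeddings are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def contrastive_loss(anchor, positive, negatives, temperature=0.1, decoupled=True):
    """Per-anchor contrastive loss.

    decoupled=False: InfoNCE, the positive term appears in the denominator.
    decoupled=True:  DCL-style, the denominator sums over negatives only.
    """
    pos = cosine(anchor, positive) / temperature
    neg = [cosine(anchor, n) / temperature for n in negatives]
    terms = neg if decoupled else neg + [pos]
    # Numerically stable log-sum-exp over the denominator terms.
    m = max(terms)
    log_denominator = m + math.log(sum(math.exp(t - m) for t in terms))
    return -pos + log_denominator


# Toy embeddings (hypothetical): the positive is close to the anchor, the negatives are not.
anchor = [1.0, 0.0]
positive = [0.9, 0.1]
negatives = [[0.0, 1.0], [-1.0, 0.0]]

dcl = contrastive_loss(anchor, positive, negatives, decoupled=True)
infonce = contrastive_loss(anchor, positive, negatives, decoupled=False)
print(f"DCL: {dcl:.3f}  InfoNCE: {infonce:.3f}")
```

Both variants reward a high anchor-positive similarity and penalize high anchor-negative similarity. For the same inputs the InfoNCE value is always the larger of the two, because its denominator contains the extra positive term.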
In conclusion, Orthrus leverages an evolutionary and functional perspective to capture RNA diversity by using contrastive learning to model sequence similarities from speciation and alternative splicing events. Unlike prior self-supervised models focused on token prediction, Orthrus effectively pre-trains on evolutionarily related sequences, reducing reliance on genetic diversity. This approach enables strong RNA property predictions like half-life and ribosome load, even in low-data scenarios. While the method excels in capturing shared functional regions, potential limitations arise in cases where isoform variation minimally impacts certain RNA properties. Orthrus demonstrates superior performance over reconstruction-based methods, paving the way for improved RNA representation learning.
Check out the Paper, Model on HF, and GitHub. All credit for this research goes to the researchers of this project.
The post Orthrus: A Mamba-based RNA Foundation Model Designed to Push the Boundaries of RNA Property Prediction appeared first on MarkTechPost.