Project Area A

Projects of the CRC 1768

Tools for nucleotide sequences and regulation

Project Area A focuses on the development of bioinformatic tools for nucleic acid sequences and for regulation by nucleotide sequences. Understanding a virus is not possible without sequence-level information. During the last decade and, in particular, during the COVID-19 pandemic, publicly available sequence information of viruses from high-throughput sequencing (HTS) has grown rapidly, urgently requiring tools to analyse sequence data at the petabase scale. When applied to virus sequence data, general-purpose tools for sequence analysis may yield unsatisfactory results because of the various unique features of viruses, potentially compromising the accuracy and consistency of virus classification. In particular, we lack a data structure to accurately represent the different haplotypes of a quasispecies, including their co-occurring mutations. We also face problems when generating simple genomic multiple sequence alignments (gMSAs). Although virus genomes are relatively short, their variety is enormous. Calculating and visualising an alignment of all virus sequences simultaneously is not feasible due to computational time and memory limitations. This raises the question of which sequences to select for gMSAs.

Even the annotation of virus genomes cannot be adequately performed with off-the-shelf tools because databases are largely incomplete. Bioinformatic analysis of the host response is established at the transcriptome level. However, the signalling pathways affected by individual viruses still have to be revealed. During the COVID-19 pandemic, the vast amount of available sequence data prompted efforts to predict the severity of virus infections from sequence. Yet these efforts were largely futile, especially those based on machine learning. Beyond the threat of virus infections, bacteriophages may serve as medical treatments: the “One Health” approach emphasises the need for scientific, economic, policy, and legal measures to promote their use, as demonstrated by vibriophage N4. A potential explanation for the unsatisfactory performance of current bioinformatic tools is that developers tried to leapfrog an understanding of the virus: machine learning models were trained to jump directly from sequence to diagnosis.

As part of this CRC, we will develop computational methods that approach virus sequence data in a more nuanced way and at a “more understanding” pace. Project A01 will develop a bioinformatic tool to monitor virus development using deep sequencing data, which allows the reconstruction of virus quasispecies by haplotype analysis. This approach will also lead to a fundamental understanding of the host range of viruses in correlation with their quasispecies size. The underlying novel data structure may also hold annotation features beyond single-gene information, such as transcription start sites (e.g. of vibriophage N4) or RNA secondary structures (e.g. of SARS-CoV-2). These can be used directly in project A02 for a more accurate phylogenetic tree reconstruction tool for virus quasispecies or ancient viruses (more than 2,000 years old). A03 will provide computational methods for the analysis of transcriptomic data of the heavily infected host cell. Here, data analysis is complicated by the fact that we can only analyse the mixture of virus and host transcriptomes, which requires in silico deconvolution of the data. Finally, A04 will blend in vitro experiments and machine learning, enabling a neural network to choose the direction of RNA-driven antiviral wet-lab experiments.

Do viruses exploit their quasispecies for host range evolution?

Viruses exist as dynamic populations of closely related virus genomes arising from mutations, known as quasispecies. We hypothesise that viruses use their quasispecies to expand their evolutionary potential, making them critical for adaptation to new hosts and for resistance to host defences or immunity. Yet, the evolutionary trajectories of viruses cannot be fully understood without considering their ecological context. Host range and environmental conditions act as powerful filters and drivers of virus diversification, raising fundamental research questions at the interface of virus ecology and evolution: How do host interactions and environmental factors shape the emergence, stability, and adaptability of virus quasispecies? If the genetic diversity of the quasispecies reflects the evolutionary potential and ecological interactions of the virus, by which molecular mechanisms do viruses exploit their quasispecies for host range evolution?

To address these fundamental questions at the core of our project, we will develop and apply a novel suite of computational tools based on Sequence Variation Graphs (SVGs). SVGs are increasingly utilised for population structure analysis in higher organisms, but their application in virology is limited due to the high mutation rates and genomic diversity of viruses. Nevertheless, they offer potential for analysing data from both genomic and metagenomic samples. In work package WP 1, we will build a quasispecies Sequence Variation Graph (qs-SVG) toolkit that can store sequencing data of virus populations, and we will use and further improve the tool in the remaining work packages. Once the tool is built, we will first apply it to an ideal case where abundant data is available, i.e., SARS-CoV-2 and influenza viruses, before taking on a more challenging case, i.e., bacteriophages found in environmental metagenomes. By combining these two study systems in our project, we will be able to test different functionalities of the qs-SVG toolkit and develop an optimal bioinformatic solution.

In WP 2, we will exploit new and existing data on the quasispecies of human viruses and bacteriophages with broad and narrow host ranges to test the specific hypothesis that viruses with a broad host range also have a large quasispecies. In WP 3, we will investigate under what conditions viruses evolve their quasispecies. We will examine the relationship between quasispecies, host diversity, and the environment, using both in vitro data from isolates and in situ data obtained by screening metagenomic data sets derived from environmental samples. Finally, in WP 4 we will focus on the underlying mutational mechanisms. How does quasispecies sequence variation arise? Can viruses exploit mutation to expand their host range, and what molecular mechanisms enable this?

The qs-SVG toolkit will provide immediate access to mutations and genetic functions, enabling us to explore the association of these features with variable sites. We will also chart the distribution of the mutational mechanisms across ecosystems and host types (G3). Thus, our new tool suite will facilitate describing, quantifying, and understanding emerging viruses (G1) and offer new perspectives for studying their evolution (G2) in the context of host range.
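
To make the idea of a graph that stores haplotypes together with their co-occurring mutations more concrete, the following is a minimal sketch and not the qs-SVG toolkit itself. It assumes pre-aligned, equal-length haplotypes without insertions or deletions; the function names and the toy sequences are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_variation_graph(haplotypes):
    """Build a toy variation graph from pre-aligned, equal-length haplotypes.

    Nodes are (position, base) pairs and edges connect consecutive nodes.
    Both record the names of the haplotypes that pass through them.
    """
    lengths = {len(seq) for seq in haplotypes.values()}
    if len(lengths) != 1:
        raise ValueError("haplotypes must be aligned to the same length")
    nodes = defaultdict(set)
    edges = defaultdict(set)
    for name, seq in haplotypes.items():
        for pos, base in enumerate(seq):
            nodes[(pos, base)].add(name)
            if pos > 0:
                edges[((pos - 1, seq[pos - 1]), (pos, base))].add(name)
    return nodes, edges


def variable_sites(nodes):
    """Return the sorted positions at which more than one base is observed."""
    bases_at = defaultdict(set)
    for pos, base in nodes:
        bases_at[pos].add(base)
    return sorted(pos for pos, bases in bases_at.items() if len(bases) > 1)


def co_occurring(nodes, variant_a, variant_b):
    """Return the haplotypes that carry both (position, base) variants."""
    return nodes.get(variant_a, set()) & nodes.get(variant_b, set())


# Hypothetical toy quasispecies with three haplotypes.
toy_haplotypes = {"h1": "ACGTAC", "h2": "ACTTAC", "h3": "ACTTGC"}
toy_nodes, toy_edges = build_variation_graph(toy_haplotypes)
toy_sites = variable_sites(toy_nodes)
toy_linked = co_occurring(toy_nodes, (2, "T"), (4, "G"))
```

Because every node keeps the set of haplotypes that traverse it, variable sites and linked mutations can be read off the graph directly, which is the kind of immediate access described above. A real toolkit would additionally need to handle indels, read-level evidence, and abundances.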

Project Leaders

Prof. Dr. Bas E. Dutilh

Institute of Biodiversity, Ecology, and Evolution,

Friedrich Schiller University Jena

Prof. Dr. Kirsten Küsel

Institute of Biodiversity, Ecology, and Evolution,

Friedrich Schiller University Jena

Phylogeny of functional sequence elements in virus genomes

Virus phylogenies are commonly built from selected open reading frames (ORFs) or genes and ignore recent discoveries on virus genome complexity from functional genomics (omics) studies of virus-infected cells. These omics studies use RNA-seq, Ribo-seq, SHAPE-seq, or other sequencing-based assays and have vastly extended our knowledge of virus genomes by detecting numerous novel functional sequence elements (FSEs). However, these studies commonly ignore one fundamental question: Are these novel FSEs conserved during virus evolution and thus likely to play an important role in the virus life cycle?

FSEs identified in omics studies include short ORFs (sORFs) of fewer than 100 nucleotides, e.g., upstream ORFs (uORFs) within 5’ untranslated regions (UTRs) of other ORFs, as well as alternative proteins generated from the same locus through programmed ribosomal frameshifting or alternative splicing (Fig. A02.1). In addition, novel virus non-coding RNAs such as circular RNAs (circRNAs) and microRNAs (miRNAs) have been discovered. Furthermore, binding sites of host RNA- and DNA-binding proteins in virus DNA or RNA can now be determined at large scale. These FSEs cannot be predicted from sequence alone, and some FSEs have to form specific RNA structures to be functional.

To date, no standardised, comprehensive tool is available to detect different types of virus FSEs from omics data and analyse their conservation; existing phylogenetic approaches focus only on protein-coding genes. In this project, we will close this gap by developing tools to identify FSEs that are conserved in sequence and/or structure for (1) reconstructing their evolutionary histories; (2) incorporating them into robust virus phylogenies; and (3) predicting potential functional roles. As recombination is an important evolutionary process that affects many viruses, we will implement a method for recombination-aware reconstruction of phylogenies. We will thereby contribute to the central goals G1, G2, and G3 of the CRC VirusREvolution. Our tools will initially be developed for SARS-CoV-2, vibriophage N4, hepatitis B virus (HBV), and herpes simplex virus 1 (HSV-1) and will be generalised to other viruses in subsequent funding phases. The inclusion of ancient HBV and HSV-1 genomes and of recombination events will also enable us to describe the evolutionary histories of viruses spanning several thousand years. Genome annotations extended with conserved FSEs will be incorporated into VirJenDB within NFDI4Microbiota.
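
The question of whether an FSE is conserved in sequence can be illustrated with a small sketch. It is not the project's method: it assumes that the genomes are already aligned, that the FSE is given as a half-open column interval, and that gaps are treated like any other character. Structural conservation is not covered. The function name and the toy alignment are hypothetical.

```python
from collections import Counter


def column_conservation(alignment, start, end):
    """Mean frequency of the most common character per alignment column.

    `alignment` is a list of equal-length aligned sequences and the element
    of interest spans the half-open column interval [start, end).
    A value of 1.0 means that every column in the interval is invariant.
    """
    if not alignment:
        raise ValueError("alignment must contain at least one sequence")
    if not 0 <= start < end <= len(alignment[0]):
        raise ValueError("interval must lie within the alignment")
    scores = []
    for col in range(start, end):
        column = [seq[col] for seq in alignment]
        top_count = Counter(column).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return sum(scores) / len(scores)


# Hypothetical toy alignment of one short ORF in four genomes.
toy_alignment = ["ATGGCATAA", "ATGGCTTAA", "ATGACATAA", "ATGGCATAA"]
start_codon_score = column_conservation(toy_alignment, 0, 3)
whole_orf_score = column_conservation(toy_alignment, 0, 9)
```

Comparing the score of an element with that of flanking regions would give a first indication of conservation; phylogeny-aware and structure-aware measures would be needed for a robust answer.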

Project Leaders

Prof. Dr. Caroline Friedel

Institute for Informatics,

Ludwig-Maximilians-University Munich

Dr. Denise Kühnert

Centre for Artificial Intelligence in Public Health,

Robert Koch Institute

Detecting time-resolved and virulence-associated host responses to virus infection

Understanding the cell’s transcriptional response to virus infections is crucial for comprehending the host’s molecular defence, the pathogen’s strategies to circumvent these mechanisms, and hence the virus’s capability to cause severe disease (i.e. its virulence). However, currently available methods are not sufficient to accurately determine the cellular transcriptional response, especially for viruses that cause a global host cell shutoff. In these cases, available computational read normalisation strategies, which typically assume that the expression of most genes is unchanged, prevent an accurate analysis of differential gene expression. Moreover, current off-the-shelf methods cannot assess the expression of all relevant transcript classes, in particular the highly repetitive transposable elements (TEs) that have been reported to trigger innate immunity in virus-infected cells. Currently used analysis pipelines do not yet enable the assessment of individual TE copy expression.

Thus, we aim to develop tools for (i) normalisation of reads to accurately measure the host transcription shutoff imposed by viruses and (ii) mapping the expression of individual TE copies. Moreover, we aim to (iii) combine these methods into a tool to systematically analyse and cluster expression time series in order to characterise expression trajectories and infer regulatory interactions during virus infections. We expect that the development and implementation of the proposed software will subsequently enable us to improve the transcriptome-based prediction of a pathogen’s virulence.
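
Why a global host shutoff is problematic for read normalisation can be shown with a toy calculation. The sketch below is an illustration under stated assumptions and not the normalisation method to be developed in this project: it contrasts scaling by library size with scaling by an external spike-in control, assuming that the same amount of spike-in was added to both samples. All counts and names are invented for the example.

```python
import math


def library_size_normalise(counts):
    """Scale host counts to counts per million host reads."""
    total = sum(counts.values())
    return {gene: count / total * 1e6 for gene, count in counts.items()}


def spike_in_normalise(counts, spike_in_reads):
    """Scale host counts by the reads of an external spike-in control."""
    return {gene: count / spike_in_reads for gene, count in counts.items()}


def log2_fold_changes(treated, control):
    """Per-gene log2 ratio of treated over control expression."""
    return {gene: math.log2(treated[gene] / control[gene]) for gene in control}


# Invented counts: every host gene drops to a quarter after infection,
# while the spike-in control yields the same number of reads in both samples.
mock = {"geneA": 1000, "geneB": 2000}
infected = {"geneA": 250, "geneB": 500}
mock_spike, infected_spike = 100, 100

cpm_lfc = log2_fold_changes(
    library_size_normalise(infected), library_size_normalise(mock)
)
spike_lfc = log2_fold_changes(
    spike_in_normalise(infected, infected_spike),
    spike_in_normalise(mock, mock_spike),
)
```

In this toy case, library-size scaling reports a log2 fold change of 0 for both genes, so the shutoff is invisible, whereas the spike-in reference recovers the fourfold reduction (log2 fold change of -2).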

Project Leaders

Prof. Dr. Steve Hoffmann

Faculty of Biological Sciences,

Friedrich Schiller University Jena,

Leibniz Institute on Aging, Fritz Lipmann Institute

Prof. Dr. Friedemann Weber

Institute of Virology,

Veterinary Medicine,

Justus Liebig University Giessen

Harnessing synthetic small RNAs to probe, decode, and optimise phage-host interactions

Regulatory RNAs have emerged as powerful tools in synthetic biology due to their programmability and ability to modulate gene expression with high specificity. Among these, small RNAs (sRNAs) that act through base-pairing interactions offer a versatile platform for controlling molecular processes in both prokaryotic and eukaryotic systems. Indeed, synthetic regulatory RNAs have already shown potential in metabolic engineering, gene regulation, and diagnostics. However, despite their broad regulatory utility, synthetic regulatory RNAs have not yet been broadly applied to antiviral strategies, especially those targeting RNA-RNA interactions relevant during virus infection. In viruses, RNA structures and RNA-mediated gene regulation are closely linked to replication and host manipulation, making them attractive targets for RNA-based interference. However, rationally designing effective synthetic RNAs remains a major challenge due to the complexity of RNA folding, target recognition, and the dynamic nature of virus-host interactions. Recent advances in the design of synthetic regulatory RNAs and in machine learning, particularly neural networks (NNs), now offer a path towards predictive modelling of these interactions. Furthermore, integrating experimental feedback into model training holds promise for accelerating the design-test-learn cycle of the synthetic biology toolbox.

In this project, we aim to close this gap by systematically and adaptively optimising antiviral RNAs that target virus and/or host RNAs required for virus infection and replication. Specifically, we will develop a neural network-based tool that integrates predictive modelling of RNA-RNA interactions with experimental feedback to optimise synthetic antiviral RNAs. The tool will target both coding and non-coding features of virus and host RNAs to disrupt virus entry, replication, and exit. Our initial work will concentrate on bacteriophages, with future expansion to eukaryotic viruses.

The tool will learn from wet-lab data to improve predictions of functional RNA interactions and guide the identification of more potent RNA molecules in iterative cycles. By modelling both interacting and non-interacting RNA pairs, the system will distinguish functional mechanisms from background noise, increasing design accuracy. Through collaborations within the CRC VirusREvolution consortium, the neural network will be enhanced with diverse datasets, improving its generalisability across different phage-host systems in line with the overall goals G2 and G3.

The resulting platform will provide the foundation for programmable RNA therapeutics with high specificity, adaptability, and reduced likelihood of resistance. It will also establish general principles for RNA-mediated antiviral defence that can be leveraged across different organisms and virus families. From a broader perspective, this project bridges computational and experimental biology to tackle one of the central challenges in virology and synthetic biology: how to rationally design molecules that can interfere with evolving virus systems. This strategy goes beyond classical design principles and opens avenues for responsive, data-driven synthetic biology. Taken together, the proposed work will (a) advance our understanding of virus adsorption, entry, replication, and escape; (b) support the long-term goal of intelligent, programmable, and adaptive biological interventions; and (c) provide novel intervention strategies targeting viruses at the RNA level. This aligns closely with the overarching research goals of the CRC: understanding of virus evolution (G2), virus-host interactions (G3), and the mechanisms of virus infection (G4).
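
As a simple illustration of the base-pairing interactions on which such sRNAs rely, the sketch below computes the longest contiguous antiparallel Watson-Crick duplex between an sRNA and a target RNA. This is only a hand-crafted toy feature, not the neural network approach proposed here, and it ignores G-U wobble pairs, RNA folding, and accessibility. All sequences and function names are hypothetical.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Return the reverse complement of an RNA sequence (A, C, G, U only)."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def longest_pairing_stretch(srna, target):
    """Length of the longest contiguous Watson-Crick duplex of srna and target.

    A stretch of the target that equals a stretch of the sRNA's reverse
    complement can pair with the sRNA in antiparallel orientation, so the
    problem reduces to the longest common substring of the two strings.
    """
    probe = reverse_complement(srna)
    best = 0
    previous = [0] * (len(target) + 1)
    for i in range(1, len(probe) + 1):
        current = [0] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            if probe[i - 1] == target[j - 1]:
                current[j] = previous[j - 1] + 1
                best = max(best, current[j])
        previous = current
    return best


# Hypothetical sequences: one target with a full binding site, one without.
toy_srna = "GGAUCCUU"
interacting_target = "CCAAGGAUCCGG"
non_interacting_target = "CCCCCCCCCCCC"
paired = longest_pairing_stretch(toy_srna, interacting_target)
unpaired = longest_pairing_stretch(toy_srna, non_interacting_target)
```

A learned model could take such features for both interacting and non-interacting pairs as one of many inputs; the contrast between the two toy targets shows the kind of signal that separates them.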

Project Leaders

Prof. Dr. Manja Marz

Institute of Computer Science,

Friedrich Schiller University Jena

Prof. Dr. Kai Papenfort

Institute for Microbiology