viruses in silico | the EVBC lecture series » Past lectures

Endogenous bornavirus-like elements
19 June 2023 | 10 am CEST
Masayuki Horie, Osaka Metropolitan University, Japan

Endogenous viral elements (EVEs) are inheritable sequences present in the genomes of eukaryotes originating from viruses. Although bornaviruses are non-retroviral RNA viruses, we and others previously discovered EVEs derived from bornaviruses in vertebrate genomes. These findings have expanded our knowledge of virology, cellular biology, evolutionary biology, and more.

Here, I will cover endogenous bornavirus-like elements (EBLs) from three aspects: (i) the biological significance of EBLs, (ii) the paleovirology of bornaviruses, and (iii) a novel strategy to detect EBLs.



Computational identification of highly conserved, structured functional modules in long RNA molecules.
27. March 2023 | 04 pm CET
Tomasz Krzysztof Wirecki, International Institute of Molecular and Cell Biology in Warsaw, Warsaw, Poland

The majority of viruses that infect humans and other organisms have ribonucleic acid (RNA) genomes. The structural elements of mRNA molecules have regulatory functions in both the untranslated (UTR) and coding (CDS) regions of viral genomes. Local secondary structures that are evolutionarily conserved in CDSs have been shown to be functional, and evidence suggests that these structures constitute an evolutionarily conserved level of functional organization in viruses.
We propose to identify highly conserved structured functional modules in positive-sense viral RNA (+ssRNA) genome sequences, including both UTRs and CDSs, and group them based on their sequence-structure relationship using theoretical studies and structural biology. We have developed a novel computational approach to search for highly conserved, structured functional modules in long RNA molecules.
Our prototype program uses a sliding window approach to carry out secondary structure predictions and utilizes various structure predictors to generate local structures. The procedure allows for the determination of stably structured regions, and finally, all robustly folded structures for the whole sequence are reported. We present the software benchmark results based on structures we previously



Network-based approaches for the identification of viral-mediated pathogenic mechanisms and of candidate anti-viral drugs
20. February 2023 | 04 pm CET
George Spyrou, The Cyprus Institute of Neurology & Genetics, Cyprus

Virus-host protein-protein interactions (PPIs) are critical to the effect a viral infection may have on human health. These inter-species interactions can lead to viral-mediated perturbations of the human interactome, giving rise to various complex diseases. Network-based approaches can highlight the possible relationship between a viral infection and other diseases. From another perspective, similarities between viruses at both the viral protein and host protein levels may provide the appropriate framework to identify and rank candidate drugs to be used against a specific virus. Specifically, combining information obtained from: (a) ranked path


Virus genome resource comparison
30. January 2023 | 04 pm CET
Noriko Cassman & Muriel Ritsch, Friedrich Schiller University Jena, Germany


Computational Tools for Deeper Mining of Viromes and Bacteriophage Genomes
28. November 2022 | 06 pm CET / 09 am PST
Anca Segall, San Diego State University, USA

Given the numerical dominance of bacteriophages, their central roles in all microbiomes, and their diversity, it is clear we need a better understanding of the genetic and functional potential of viruses. In recent years, the resurgence and greater spread of phage therapy has raised the stakes even higher: the better we know our phages, the more safely we can use them as surrogate antibiotics. I will describe our most recently developed, machine learning-based tools for more thorough annotation of bacteriophage genomes.



The viruses within: endogenous viral elements as genomic fossils to explore the deep-evolutionary history of viruses
24. October 2022 | 04–05 pm CEST
Sebastian Lequime, University of Groningen, The Netherlands

Endogenous viral elements (EVEs) are integrations of full or partial viral genomes in their host nuclear genome. Endogenous retroviruses are widespread in eukaryote genomes, but other non-retroviral viruses have been shown to generate EVEs, albeit more rarely. These genomic imprints can shed light on the deep evolutionary history of viruses, ancestral host ranges, and ancient viral-host interactions. EVEs are thus ‘genomic fossils’ that we can use to compensate for the absence of physical fossil traces of viruses. 

In this presentation, I will use viruses from the Flaviviridae family as an example. This viral family contains important members, including well-known human pathogens, such as Zika, dengue, and hepatitis C viruses. The first EVEs derived from Flaviviridae have been identified in mosquitoes, but recent studies have extended the host range to include other invertebrates and some vertebrates such as fish.

However, none had been detected in mammals, even though the family encompasses numerous mammal-infecting members. Through a comprehensive in silico screening of a large dataset of available mammalian genomes, our study identified two novel Flaviviridae-like EVEs, derived from the genus Pestivirus, in the reference genome of the Indochinese shrew (Crocidura indochinensis), a first in mammals. Homologs of these novel EVEs were subsequently detected in other shrew species, especially within the Crocidurinae subfamily and one in the Soricinae subfamily on different continents. Based on this wide distribution, we estimate that the integration event occurred before the last common ancestor of the subfamily, about 10.8 million years ago, attesting to an ancient origin of pestiviruses and supporting a deep-evolutionary history of flavivirids in general.



Phylogenetics and Transmission Studies in HIV-1
26. September 2022 | 04–05 pm CEST
Ana Abecasis, Universidade Nova de Lisboa, Portugal



RNA phages: more than meets the eye?
22. August 2022 | 06–07 pm CEST / 09–10 am PDT
Simon Roux, DOE Joint Genome Institute, USA

The vast majority of bacteriophage diversity seems to be found in the Caudoviricetes class, i.e. double-stranded DNA phages with head-tail virion morphology. Across sample types and environments, these head-tail viruses are almost systematically the most diverse and abundant type of phage detected and reported, often alongside a handful of other groups of less abundant DNA phages. In contrast, RNA phages are typically not considered as relevant components of environmental phage communities. Here, we analyzed 330k novel RNA-dependent RNA polymerases (RdRP) mined from 5,150 diverse metatranscriptomes, which represented a five-fold increase of RNA virus diversity compared to current datasets. Combined genome annotation and phylogenetic analyses revealed a transformed picture of global RNA phage diversity, with a dramatically expanded levivirus clade now representing about a third of the global RNA virus diversity and routinely identified as the most abundant type of RNA viruses in wastewater, soil, and rhizosphere samples, as well as 8 proposed new families and genera of RNA phages distinct from the known leviviruses and cystoviruses and spanning several phyla in a global RNA phylogeny. Among these, a new clade of segmented Partiti-like phages was consistently identified across multiple years in several Yellowstone hot spring biofilms and predicted to infect a highly abundant Roseiflexus strain, suggesting an important ecological role for these new viruses in this microbiome. Overall, this global RNA virosphere analysis unveiled an unprecedented and so-far-uncharacterized richness of RNA phages, highlighted key environments such as soil and hot spring biofilms where a large portion of this novel RNA phage diversity resides, and indicated that RNA phages must be considered when investigating phage impact on microbiomes.



SARS-CoV-2 Lineages: Lessons Learned and Perspectives for Future Epidemic Tracking
25. July 2022 | 04–05 pm CEST
Áine O’Toole, University of Edinburgh, UK

The scale of data produced throughout the SARS-CoV-2 pandemic has presented novel opportunities and challenges for the field of genomic epidemiology. The Pango lineage nomenclature system for SARS-CoV-2, first proposed in April 2020, has been a key tool in tracking SARS-CoV-2 throughout this pandemic. The lineage system is a set of rules that defines epidemiological lineages of SARS-CoV-2. As the pandemic unfolded, we adapted the lineage system and implementation to respond to the changing scales of data and the shifting mutational and epidemiological landscape of SARS-CoV-2.

During this presentation, two years on from the inception of the lineage system, I will discuss lessons learned in its execution, including the limitations and challenges, and reflect on whether such a system has been an effective tool. I will discuss the challenges faced at the science-media interface that led to the creation of the WHO’s Greek-letter based variant names but also the importance of horizon-scanning and monitoring of nascent lineages and variants. I will provide a perspective on how Pango adapted to the ever-increasing scale of data, the software and systems built to support this system thus far, and the challenge of automation going forward. I consider the longevity of the Pango lineage system and the infrastructure needed for the management, maintenance and improvement of this approach to tracking virus epidemics.



Identifying and prioritising poorly-characterised viruses with zoonotic potential
27. June 2022 | 04–05 pm CEST
Nardus Mollentze, MRC-University of Glasgow Centre for Virus Research, UK

The rate at which novel viruses are being discovered vastly outpaces our capacity for phenotypic characterisation. For most viruses, we lack data on key characteristics such as their ability to infect humans, and therefore fail to rationally prioritise viruses for further research or vaccine development. In this talk, I will show how virus genomes – often the first and only data available for newly-discovered viruses – can be used to identify and prioritise potential human-infecting viruses, and how host phylogeny can be used to optimise surveillance by identifying other potential hosts. Finally, I will show how the host range data resulting from wildlife surveillance can be combined with virus genomes to allow even more accurate identification of potential zoonotic viruses. Combined, these approaches allow evidence-driven virus surveillance and increase the feasibility of downstream biological and ecological characterisation of viruses relevant to human health.


Nanopores – is there anything they can’t do?
23. May 2022 | 04–05 pm CEST
Daniel Depledge, Hannover Medical School, Germany

Decoding viral transcriptomes by conventional RNA sequencing approaches is complicated by high gene density, overlapping reading frames, and complex splicing patterns. To overcome this, we have made extensive use of nanopore direct RNA sequencing (DRS) in viral contexts – a process in which native polyadenylated RNAs isolated directly from the cellular environment are sequenced, without the recoding and amplification biases inherent to other sequencing methodologies.

By sequencing full-length mRNAs, we are able to precisely map 5’ transcription start sites along with 3’ cleavage and polyadenylation sites. This has enabled us to produce ultra-high resolution annotations of important human viruses including VZV, SVV, HSV-1, adenovirus, and SARS-CoV-2. Moreover, we and others have also shown that nanopore DRS datasets can be interrogated to detect sites at which ribonucleotides are modified, and that poly(A) tail length distributions can be estimated, further enhancing the utility of these datasets.

During this presentation, I will highlight several of our annotation studies and computational tools (DRUMMER & NAGATA) and describe why these have been crucial to our ongoing studies of (i) herpesviral latency and (ii) the role of N6-methyladenosine in regulating diverse viral infections. 


Open Virome: Foundations for discovery of the first 100 million RNA viruses
25. April 2022 | 04–05 pm CEST
Artem Babaian, University of Cambridge, UK

Rapid growth in nucleic acid sequencing is illuminating Earth’s vast genetic diversity. This exponential growth in available sequencing data is driving a hyper-exponential increase in the number of known viruses. Recently, I analyzed 5.7 million sequencing samples (10.2 petabases) for the RNA virus hallmark gene RNA-dependent RNA polymerase (RdRP), which revealed the equivalent of 130,000 novel RNA viruses (compared to 15,000 RNA viruses in all public databases). All of this data is freely explorable online at www.serratus.io or via the new RdRP-interface palmID, www.serratus.io/palmid.

Yet this near order-of-magnitude increase in the RNA virome is superficial, a starting point to a more fundamental question: How are we going to interpret the ~100 million RNA viruses which will be found by the end of the decade?

I will share an overview of how to navigate and use Open Virome project data to aid in RNA virus discovery, and open an (opinionated) discussion on what RNA virus discovery will mean in the exabase era.



The promise of mass spectrometry-based virus proteomics: taking a peek at current bioinformatics applications and limitations
28. February 2022 | 03–04 pm CET
Thilo Muth, Bundesanstalt für Materialforschung und -prüfung, Germany

Driven by recent technological advances and the need for improved viral diagnostic applications, mass spectrometry-based proteomics comes into play for detecting viral pathogens accurately and efficiently. However, the lack of specific algorithms and software tools presents a major bottleneck for analyzing data from host-virus samples. For example, accurate species- and strain-level classification of a priori unidentified organisms remains a very challenging task in the setting of large search databases. Another prominent issue is protein inference, which affects many existing solutions and is aggravated because many homologous proteins are present across multiple species. One of the contributing factors is that existing bioinformatic algorithms have been developed mainly for single-species proteomics applications for model organisms or human samples. In addition, a statistically sound framework to accurately assign peptide identifications to viral taxa has been lacking. In this presentation, an overview is given of current bioinformatics developments that aim to overcome the above-mentioned issues using algorithmic and statistical methods. The presented methods and software tools aim to provide tailored solutions for both discovery-driven and targeted proteomics for viral diagnostics and taxonomic sample profiling. Furthermore, an outlook is provided on how the bioinformatic developments might serve as a generic toolbox, which can be transferred to other research questions, such as metaproteomics for profiling microbiomes and identifying bacterial pathogens.

Download slides.


Behind the scenes: Estimating infectiousness throughout SARS-CoV-2 infection course
24. January 2022 | 04-05 pm CET
Terry Jones, Charité – Universitätsmedizin Berlin, Germany

Although post facto studies have revealed the importance of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) transmission from presymptomatic, asymptomatic, and mildly symptomatic (PAMS) cases, the virological basis of their infectiousness remains largely unquantified. The reasons for the rapid spread of variant lineages of concern, such as B.1.1.7, have yet to be fully determined. Two elementary parameters for quantifying viral infection and shedding are viral load and whether samples yield a replicating virus isolate in cell culture. 

In this talk, Terry Jones will tell the back story of this paper and the challenges associated with processing and interpreting messy data. 



Expect the unexpected: Lessons for virus bioinformatics from HSV-1 infection
15. December 2021 | 04-05 pm CET
Caroline Friedel, University of Munich, Germany

RNA-seq of virus-infected cells (dual RNA-seq) provides unique opportunities to simultaneously study host and viral transcription. While standard RNA-seq analysis pipelines can be easily adapted for this purpose, unique pitfalls arise in dual RNA-seq analysis due to the manifold ways in which viruses modulate host transcription and RNA levels. In this talk, I will provide an overview of how such modulations in HSV-1 infection impact and bias RNA-seq analysis if not properly taken into account. This includes in particular: (1) Disruption of transcription termination, resulting in poly(A) read-through into the intergenic space. This can be mistaken for alternative polyadenylation, upregulation of downstream genes, and inhibition of splicing. (2) Degradation of host mRNAs by the viral vhs protein, which confounds differential gene expression analysis and leads to an accumulation of circular RNAs (circRNAs) that can furthermore be mistaken for alternative exon usage and differential gene expression.



Deciphering viral RNA structure with ViennaRNA
22. November 2021 | 04-05 pm CET
Michael T. Wolfinger, University of Vienna, Austria

The ability of RNA to fold into intricate structures in 2D and 3D is exploited by viruses to mediate tropism. Therefore, a deep understanding of RNA structure formation is crucial for the emerging field of virus bioinformatics. The ViennaRNA Package comes with a rich portfolio of tools to analyze RNA folding and the traits that are associated with particular folds. Single sequence modeling from thermodynamic principles and comparative studies of RNA structure provide a powerful approach to assess evolutionary conservation of viral genomes. This presentation introduces the ViennaRNA Package and examines how it can help virologists address RNA structuredness, using the example of a recent project focused on characterizing alternative RNA structure conservation in different subtypes of tick-borne encephalitis virus (TBEV).

View the slides.



Viromics: from virus discovery to viral ecology
25. October 2021 | 04-05 pm CEST
Bas Dutilh, Utrecht University, Netherlands

The greatest terra incognita in biology is the world of viruses. In recent years, tens of thousands of viruses have been discovered using viromics, but for most of them, we have no idea of their ecological roles. I will use the story of crAssphage to illustrate the developments in the viromics field.



Machine Learning methods for Immune Repertoires
12. August 2021 | 04-05 pm CEST
Günter Klambauer, Johannes Kepler University, Austria

A central mechanism in machine learning is to identify, store, and recognize patterns. How to learn, access, and retrieve such patterns is crucial in Hopfield networks and the more recent transformer architectures. We show that the attention mechanism of transformer architectures is actually the update rule of modern Hopfield networks that can store exponentially many patterns. We exploit this high storage capacity of modern Hopfield networks to solve a challenging multiple instance learning (MIL) problem in computational biology: immune repertoire classification. Accurate and interpretable machine learning methods solving this problem could pave the way towards new vaccines and therapies, which is currently a very relevant research topic intensified by the COVID-19 crisis. Immune repertoire classification based on the vast number of immunosequences of an individual is a MIL problem with an unprecedentedly massive number of instances, two orders of magnitude larger than currently considered problems, and with an extremely low witness rate. In this work, we present our novel method DeepRC that integrates transformer-like attention, or equivalently modern Hopfield networks, into deep learning architectures for massive MIL such as immune repertoire classification. We demonstrate that DeepRC outperforms all other methods with respect to predictive performance on large-scale experiments, including simulated and real-world virus infection data, and enables the extraction of sequence motifs that are connected to a given disease class.
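
The stated equivalence between transformer attention and the update rule of modern Hopfield networks can be illustrated with a minimal sketch. It assumes the commonly cited form of the update, in which the new state is a softmax-weighted average of the stored patterns, with weights given by the scaled dot products between the current state and each pattern. This is plain Python written for illustration only and is not the DeepRC implementation; the function names, the `beta` parameter name, and the toy patterns are assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hopfield_update(patterns, query, beta=1.0):
    """One update step of a modern (continuous) Hopfield network.

    patterns: list of stored pattern vectors (each a list of floats)
    query:    current state vector
    beta:     inverse temperature; larger values sharpen retrieval
    """
    # Scaled dot product between the query and every stored pattern.
    scores = [beta * sum(p_i * q_i for p_i, q_i in zip(p, query)) for p in patterns]
    # Attention weights over the stored patterns.
    weights = softmax(scores)
    # New state: attention-weighted average of the stored patterns.
    return [sum(w * p[d] for w, p in zip(weights, patterns)) for d in range(len(query))]

# Toy example (hypothetical data): three stored patterns and a noisy query.
stored = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
noisy = [0.9, 0.2, 0.1]
retrieved = hopfield_update(stored, noisy, beta=8.0)
```

With a large `beta`, the retrieved state lies very close to the stored pattern most similar to the query. In attention terms, the query plays the role of Q and the stored patterns play the roles of both K and V.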



Phylodynamic modelling of modern and ancient infectious disease dynamics
20. July 2021 | 03-04 pm CEST
Denise Kühnert, Max Planck Institute for the Science of Human History, Germany

Recent advances in ancient DNA research have had a great impact on the study of infectious diseases: we can now recover full genome sequences from pathogenic organisms that caused deadly pandemics hundreds or thousands of years ago. In this talk, I will discuss how Bayesian phylodynamic methods can be used for the analysis of modern and ancient pathogens.


Viral reference database as a critical factor for clinical metagenomics: a review using Virosaurus as an example.
14. June 2021 | 04-05 pm CEST
Philippe Le Mercier, SIB Swiss Institute of Bioinformatics, Switzerland


High-Throughput Sequencing (HTS) technology can detect all genetic entities in a clinical sample. A key element in identifying pathogens in HTS output is the reference database. However, building a comprehensive viral database is challenging: there are 1,561 known vertebrate viral species. Moreover, sequence conservation within a viral species can be as low as 50%. Whereas a single human reference genome is sufficient to identify all human reads, hundreds of rotavirus references are needed to identify all circulating viruses of that species. Therefore, a viral reference database must cover all viral species and their inherent variability to be effective for analysis of clinical samples.

In this talk, I will present the challenges and pitfalls of viral genomics: why it is complex to represent viral diversity in a viral reference database. There are more than 10 datasets available, all developed from different viewpoints: from simple queries on GenBank to datasets processed using all public sequences. I will discuss the process used to generate Virosaurus: a database representing all publicly available vertebrate/plant/fungal viral genomes; and the particular case of segmented viruses. 



Genomic surveillance of SARS-CoV-2 at the Robert Koch Institute: an overview and bioinformatics edge cases
17. May 2021 | 04-05 pm CEST
Martin Hölzer, Robert-Koch-Institute, Germany

It has been more than a year since the WHO declared a pandemic on March 11, 2020, as a novel coronavirus, SARS-CoV-2, spread worldwide. A significantly increased sequencing effort is currently underway to track ongoing viral evolution and spread while monitoring mutations in the virus genome in response to the pandemic. Genomic sequences of the virus and their structured and continuous analysis form the basis for these critical molecular investigations.

In this talk, I will present the ongoing genomic surveillance efforts of SARS-CoV-2 sequences at the Robert Koch Institute (RKI), Germany’s national public health institute. In mid-January, the Federal Ministry of Health issued a decree requiring laboratories to submit reconstructed SARS-CoV-2 genomes to the RKI to improve genomic surveillance. The number of sequences increased tremendously. Reconstructed genomes of virus-positive samples sequenced directly at RKI further enrich this set. I will discuss the various bioinformatics tasks and challenges involved in the daily reconstruction, quality control, annotation, profiling, and clustering of these sequences. In addition, I will discuss important bioinformatics edge cases we have discovered that, if not addressed appropriately, may even lead to misclassification of variants or virus lineages of concern.



CIAlign, a tool to clean, interpret and visualise multiple sequence alignments, and its application to virus discovery.
26. April 2021 | 02-03 pm CEST
Katy Brown, Cambridge University, UK

Many applications of multiple sequence alignments (MSA) involve working with sequences which are not ideal. Sequences can be incomplete, contain errors or be highly divergent with many mismatches. This is very common when working with high throughput sequencing data, for example in alignments based on de novo assembled transcripts, long read sequencing reads or mixed metagenomic datasets. MSAs based on these sequences often contain many gaps and areas of low quality alignment. It’s still common to manually edit MSAs before performing further analysis but this method is time-consuming and not easily reproducible.

In the first part of this seminar I will discuss a new command-line tool, CIAlign, which we have developed to automatically solve some of the most common problems with MSAs. CIAlign targets four common features of complex MSAs: low quality or incomplete ends of sequences leading to gaps and mismatches, insertions in a minority of sequences dominating the alignment, unexpectedly divergent sequences and very short sequences. It also provides a new type of alignment visualisation, which shows the whole alignment in a single, publication ready image. I will then discuss how I use CIAlign on a day-to-day basis, as part of a computational pipeline developed to identify, classify and characterise novel RNA viruses and cross-species transmissions in honey bees and the other arthropods they interact with, such as native pollinators, ants and mites.
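
Two of the four MSA features named above, insertions present in only a minority of sequences and very short sequences, can be illustrated with a small sketch. This is not CIAlign's code, command-line interface, or default thresholds; the function names, the `min_fraction` and `min_residues` parameters, and the toy alignment are assumptions made purely for illustration.

```python
def drop_minority_insertions(alignment, min_fraction=0.5):
    """Remove columns in which fewer than min_fraction of sequences have a residue.

    alignment: dict mapping sequence name -> aligned sequence
               (all sequences the same length, '-' marks a gap)
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    keep = []
    for col in range(length):
        # Count the sequences that have a residue (not a gap) in this column.
        residues = sum(1 for n in names if alignment[n][col] != "-")
        if residues / len(names) >= min_fraction:
            keep.append(col)
    return {n: "".join(alignment[n][c] for c in keep) for n in names}

def drop_short_sequences(alignment, min_residues=5):
    """Remove sequences with fewer than min_residues non-gap characters."""
    return {n: s for n, s in alignment.items()
            if len(s) - s.count("-") >= min_residues}

# Toy alignment (hypothetical data): seq3 carries a 3-column insertion,
# seq4 is a very short fragment.
msa = {
    "seq1": "ATG---CGTA",
    "seq2": "ATG---CGTA",
    "seq3": "ATGTTTCGTA",
    "seq4": "------CG--",
}
cleaned = drop_short_sequences(drop_minority_insertions(msa))
```

In this toy case the three columns supported only by seq3 are removed first, and the fragment seq4 is then discarded, leaving three identical, gap-free sequences. Automating such steps is what makes the cleaning reproducible compared with manual editing.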


Jakub Bartoszewicz [Picture: Kay Herschelmann]

DeePaC-Live: Predicting pathogenic potentials of short DNA reads with reverse-complement deep neural networks.
22. March 2021 | 04–05 pm CET
Jakub Bartoszewicz, Hasso-Plattner-Institut, Germany

Viruses evolve quickly and may emerge rapidly. Next-generation sequencing is the state of the art in open-view pathogen detection, and one of the few methods available at the earliest stages of an epidemic, even when the biological threat is unknown. We show that deep neural architectures can accurately predict if raw, unassembled sequencing reads originate from novel, human-infecting agents, cutting the error rates in half compared to alternative approaches and generalizing to taxonomic units distant from those presented during training. To gain insight into the inner workings of the trained models, we visualize the learned features and the contributions of individual nucleotides to the final classification decision. Nucleotide-resolution maps of the learned associations between pathogen genomes and the infectious phenotype can be used to detect virulence-related genes in novel agents. As analyzing the samples as the sequencer is running can greatly reduce the turnaround time, we extend the approach to classify incomplete Illumina and Nanopore reads in real time. The resulting models show strongly improved performance compared to existing real-time mapping approaches for both sequencing technologies.
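
The motivation for reverse-complement awareness is that a read and its reverse complement describe the same DNA molecule, so a classifier should ideally give both the same score. The sketch below does not reproduce the reverse-complement network architectures of the talk. It uses a simpler, generic technique, averaging a strand-dependent score over both orientations, to show what strand invariance means. The `toy_score` function is a hypothetical stand-in for a trained model, and all names are assumptions.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(read):
    """Return the reverse complement of a DNA read (uppercase A, C, G, T, N)."""
    return read.translate(COMPLEMENT)[::-1]

def toy_score(read):
    """Hypothetical stand-in for a trained model: the fraction of 'A' bases.

    It is deliberately strand-dependent, because A on one strand is T on the other.
    """
    return read.count("A") / len(read)

def strand_invariant_score(read, score_fn):
    """Average a score over the read and its reverse complement."""
    return 0.5 * (score_fn(read) + score_fn(reverse_complement(read)))

read = "AAACGT"
forward = strand_invariant_score(read, toy_score)
reverse = strand_invariant_score(reverse_complement(read), toy_score)
```

Here `toy_score` differs between the two orientations of the read, whereas `forward` and `reverse` are identical. Architectures with built-in reverse-complement symmetry aim for the same property without having to evaluate the model twice.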