International Virus Bioinformatics Meeting 2023 » Oral Presentation Abstracts

SESSION 1: Phages

The promise and pitfalls of prophages

Robert Edwards | Flinders University, Adelaide, Australia

Temperate phages provide unique growth advantages to their host through lysogenic conversion, and are responsible for many of the subtle differences between bacterial strains. However, the microbes also endure a cost to maintain those phages: additional DNA to replicate and proteins to transcribe and translate. The benefit of carrying prophages in bacterial genomes has been proposed to outweigh the costs but has never been quantified. Here, we identified over five million prophages from almost one million bacterial genome assemblies. Analysis of the whole dataset and a representative subset of taxonomically diverse bacterial genomes demonstrated that the normalized prophage density was uniform across all bacterial genomes above 2 Mbp. We identified a constant carrying capacity of phage DNA per bacterial DNA and estimate that each prophage provides cellular services equivalent to approximately 2.1% of the energy of the cell or 0.8 ATP per bp per hour. We demonstrate taxonomic, geographic, and temporal disparities in the identification of prophages in bacterial genomes that provide novel targets for the identification of new phages. We anticipate that the energetics involved in supporting prophages is balanced by the benefits bacteria accrue from their presence. Furthermore, our data will provide a new framework for the identification of phages in environmental datasets, in diverse bacterial phyla, and from different locations.

Jaeger: A Deep Learning Approach for Predicting Bacteriophage Sequences in Metagenomic Data

Rajitha Yasas Wijesekara | University Medicine Greifswald, Institute for Bioinformatics, Greifswald, Germany

Microbial communities are complex admixtures containing vastly different organisms representing the three domains of life and their viruses. Bacteriophages, the viruses that infect bacteria, are ubiquitous in almost every environment and play a crucial role in shaping the ecological and evolutionary processes of ecosystems by controlling bacterial abundances. They also influence bacterial phenotype in the virocell state by altering bacterial metabolism, and they drive global nutrient flow. Detecting phages in metagenomic datasets requires specialized bioinformatic tools. Here, we present Jaeger, a novel artificial intelligence (AI) tool for predicting bacteriophage sequences in metagenomic data that can be applied to individual reads (70% accuracy), assembled contigs (90% accuracy), and bins (93% accuracy). Additionally, Jaeger can detect prophages (74% accuracy) and identify other sequence categories such as eukaryotic, bacterial, and archaeal genomes. Jaeger utilizes a deep learning model with dilated convolutions and residual connections that learns feature representations from nucleotide sequences, which are subsequently used for classification. We demonstrate that our novel neural architecture performs better than other available methods. Specifically, we compared the performance of Jaeger to PPRMeta, DeepVirFinder, Seeker, and VirSorter2 on the phages in the IMGVR (v4) database and real metagenomic datasets from three different biomes, showing a 10-35% decrease in false positive rate without compromising on sensitivity. Together, Jaeger adds a new AI-powered tool to the metagenomics toolbox that will help to understand the composition of complex communities from metagenomic sequencing.

SESSION 2: Virus discovery & classification

Illuminating the RNA Virome through ultra-massive sequence analysis

Artem Babaian | University of Toronto, Molecular Genetics, Toronto, Canada

Transcriptomic/metatranscriptomic sequencing is revolutionizing the exploration of Earth’s virome. Yet analysis methods are inefficient and don’t scale to the available data. The global biology community has freely shared >30 petabases (3×10^16 nt) of sequence data from 10+ million biological samples. Painstakingly collected over 15 years, public data encompass all continents, oceans, and thousands of animal, plant, and fungal species, and are valued at $3.6-14.9 billion in direct sequencing cost.

To uncover the total diversity of RNA viruses we developed a cloud-based sequence alignment platform called Serratus, with which we analyzed 7.4 million public sequencing datasets for the RNA viral hallmark gene, RNA-dependent RNA polymerase. We identified the equivalent of >300,000 novel RNA viruses, which is over an order of magnitude increase in known virus diversity, including at least nine new species of coronaviruses (CoV).

Planetary-scale informatics provides us with unprecedented depth of insights with which to describe virus evolution and ecology. Seven of the novel CoV identified are encoded on segmented genomes (of monophyletic origin). The majority of novel nidoviruses were identified in samples from aquatic vertebrates (axolotl, leopard frog, seahorse, fugu fish, etc.), supporting the existence of a vast uncharacterized reservoir of marine nidoviruses. With a high-sensitivity search, we detect genome fragments from >40 distinct uncharacterized nidoviruses. For both novel and known nidoviruses, we expand virus host ranges, geographic distributions, and ecological niches, and identify potential hidden reservoirs in an unbiased manner, which at times challenges our preconceptions about Nidovirales.

The next decade of virology will be illuminated by computational virology and requires a shift in our conceptualization of virus discovery and its applications to pandemic surveillance. What can be learned today (with scalable methods) from Nidovirales lays the foundation for how we will explore the 100+ million virus species we project to discover by 2030.

RNA virus discovery using HMM of large-scale RNA-dependent RNA polymerase sequence data: NeoRdRp 2.0

Shoichi Sakaguchi | Osaka Medical and Pharmaceutical University, Osaka, Japan

RNA-dependent RNA polymerase (RdRp) is unique to RNA viruses and is used as a marker in the search for RNA viruses in RNA sequencing data. However, detecting RNA viruses based on RdRp sequence similarity is known to be limited when similar RdRp sequences are unavailable in the database used in the analysis. Therefore, we developed an analysis pipeline using a hidden Markov model (HMM), which is better at detecting sequences with low similarity.

We obtained the following RdRp-containing sequences: (1) reported RdRp amino acid sequences (4,620 sequences by Wolf et al., 14,680 sequences by Edgar et al., and 209,588 sequences by Zayed et al.); (2) 18,790 sequences from RNA viruses registered in the NCBI Virus database and 10,515,570 RNA virus sequences by Uri et al.; and (3) 565,928 sequences registered in UniProtKB. First, we performed clustering based on sequence similarity for (1), followed by alignment for each cluster. We defined a domain as a region in which at least 75% of the entire sequence was aligned, and we created an HMM profile for each domain. Then, we used the HMM profiles for hmmsearch and detected RdRp sequences from (2). Finally, we created the HMM profile “NeoRdRp” from the RdRp domain data set obtained from (1) and (2).
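The domain definition above (a region in which at least 75% of the sequences are aligned) can be illustrated with a minimal sketch. It assumes one particular reading of that criterion, namely per-column occupancy in a gapped multiple alignment; the function name, the `min_length` parameter, and the toy alignment are hypothetical and are not taken from the NeoRdRp pipeline.

```python
def conserved_regions(alignment, min_occupancy=0.75, min_length=1):
    """Return half-open (start, end) column ranges in which at least
    `min_occupancy` of the aligned sequences carry a residue, not a gap.

    `alignment` is a list of equal-length gapped strings ('-' marks a gap).
    """
    n_seqs = len(alignment)
    n_cols = len(alignment[0])
    regions = []
    start = None  # first column of the region currently being extended
    for col in range(n_cols):
        occupied = sum(1 for seq in alignment if seq[col] != "-")
        if occupied / n_seqs >= min_occupancy:
            if start is None:
                start = col
        else:
            if start is not None and col - start >= min_length:
                regions.append((start, col))
            start = None
    # Close a region that runs to the end of the alignment.
    if start is not None and n_cols - start >= min_length:
        regions.append((start, n_cols))
    return regions


# Hypothetical toy alignment of four sequences, 12 columns each.
toy_alignment = [
    "--MKTLDGDD--",
    "A-MKSLDGDDQ-",
    "--MRTL-GDD--",
    "---KTLDGDDQK",
]
print(conserved_regions(toy_alignment, min_length=3))
```

Each region returned this way could then be cut out of the alignment and passed to a profile-building tool; that step is omitted here.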

NeoRdRp yielded 24,995 HMM profiles, which contained known RdRp motifs. Using these HMM profiles, we performed a hmmsearch of (3) and detected 832 out of 836 RdRp sequences. The accuracy and specificity were 99.5% and 79.9%, respectively. We will continue to improve the system by updating and searching RNA sequencing data and utilizing the system to detect novel RNA viruses. Version 1.1 has been reported in a paper, and the NeoRdRp dataset is available at

Automated classification of giant virus genomes using protein family barcodes

Anh Ha | Virginia Tech, Biological Sciences, Blacksburg, USA

Large DNA viruses of the phylum Nucleocytoviricota, or “giant viruses”, are ubiquitous in the environment and play important roles in shaping the dynamics of global ecosystems. Due to the large phylogenetic breadth of this viral group and the highly complex, chimeric nature of their genomes, taxonomic classification of giant viruses, particularly of incomplete metagenome-assembled genomes (MAGs), can be challenging. Here we utilized a machine learning approach to predict the taxonomic classification of novel giant virus MAGs based on profiles of protein family content, which we refer to as protein family barcodes. We applied random forests to a training set of 960 quality-checked, phylogenetically diverse Nucleocytoviricota genomes using a pre-selected set of giant virus orthologous groups (GVOGs). The classification model was predictive of giant viruses’ taxonomic group with a cross-validation accuracy of 98% to the Order level and 95% to the Family level. We observed that no individual GVOGs or genome features were critical to the algorithm’s performance and the model’s predictions, suggesting that classification predictions were based on broad genomic signatures, which lessened the necessity of a fixed set of marker genes for taxonomic assignment. Our classification model was validated with an independent test set of 410 giant virus genomes with varied genomic completeness and predicted taxonomy with 97% and 94% accuracy to the Order and Family level, respectively. Our results provide a fast and accurate method for the classification of giant viruses that could easily be adapted to other viral groups.

Using gb2seq to work with unannotated viral genomes based on a GenBank reference

Terry Jones | Charité Universitätsmedizin, Institute of Virology, Berlin, Germany

I will present gb2seq, a Python library and associated command-line scripts that derive information regarding unannotated viral genomes (e.g., a consensus called from a BAM file following alignment with bwa or bowtie), based on annotations in a GenBank reference. The library provides for: extraction of aligned features as nucleotide or amino acid sequences; retrieving information about what is at a site (features, nucleotide, amino acid, codon, frame, etc); translation of offsets between the reference and the unannotated genome with offsets that are absolute or relative to a feature; detailed alignment information for features; use of different aligners (MAFFT and edlib, currently); JSON annotations of the features of an unannotated genome; and convenience methods for checking genomes for sets of expected nucleotide or amino acid values. The command-line scripts provide a simple interface to the library functions, e.g., to extract a translated feature from a set of genomes and check for how many expected substitutions are present.
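The idea of translating offsets between a reference and an unannotated genome can be sketched in a few lines. This is not gb2seq's API: the function names and the toy sequences below are hypothetical, and the sketch assumes that a gapped pairwise alignment of the two sequences (for example from MAFFT or edlib) is already available as two equal-length strings.

```python
def build_offset_map(aligned_ref, aligned_genome):
    """Map 0-based reference offsets to 0-based genome offsets.

    Both inputs are rows of one pairwise alignment ('-' marks a gap).
    A reference position aligned to a gap in the genome maps to None.
    """
    if len(aligned_ref) != len(aligned_genome):
        raise ValueError("aligned sequences must have equal length")
    mapping = {}
    ref_pos = genome_pos = 0
    for ref_char, genome_char in zip(aligned_ref, aligned_genome):
        if ref_char != "-":
            mapping[ref_pos] = genome_pos if genome_char != "-" else None
            ref_pos += 1
        if genome_char != "-":
            genome_pos += 1
    return mapping


def extract_feature(aligned_ref, aligned_genome, start, stop):
    """Return the genome nucleotides aligned to reference offsets [start, stop).

    Genome insertions relative to the reference are skipped, and reference
    positions deleted in the genome contribute nothing.
    """
    mapping = build_offset_map(aligned_ref, aligned_genome)
    genome = aligned_genome.replace("-", "")
    return "".join(
        genome[mapping[i]] for i in range(start, stop) if mapping[i] is not None
    )


# Hypothetical toy alignment: one deletion and one insertion in the genome.
ref_row = "ATGCC-GTA"
genome_row = "ATG-CAGTA"
offsets = build_offset_map(ref_row, genome_row)
print(offsets[4], offsets[3])
print(extract_feature(ref_row, genome_row, 2, 6))
```

A feature's reference coordinates, as read from a GenBank record, would play the role of `start` and `stop` here.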

SESSION 3: Virus visualization

Advanced optical microscopy of virus-cell interactions: challenges and potentials

Christian Eggeling | Institute of Applied Optics and Biophysics, Friedrich‐Schiller‐University & Leibniz Institute of Photonic Technology e.V., Jena, Germany

Molecular interactions are key in cellular signalling, including virus-cell interactions. They are usually governed by the organization and mobility of the involved molecules. Unfortunately, the direct and non-invasive observation of these interactions in the living cell membrane is often impeded by fundamental limitations of conventional far-field optical microscopes, for example their limited spatio-temporal resolution. Taking HIV-1 and SARS-CoV-2 as examples, we here present how advanced optical microscopy helps to highlight novel aspects of virus-membrane interactions and what challenges we still face.

SESSION 4: Viral infection

SARS-CoV-2-host interactions at the single-cell level: a dynamical complex systems approach.

Santiago F. Elena | Instituto de Biología Integrativa de Sistemas, CSIC-Universitat de València, Valencia, Spain

Single-cell RNA sequencing (scRNA-seq) has been used to characterize the progression of SARS-CoV-2 infection under the assumption that virus accumulation serves as a proxy of infection time. Here, we compared data from in vitro inoculation studies of human bronchial epithelial cells, colon, and ileum organoids, using scaled transcriptional profiles of individual genes. Our analysis revealed that about 90% of genes exhibited a transcriptional response characterized by a triphasic pattern comprising two down-regulatory phases and one intermediate up-regulatory phase. We used Taylor’s power law coupled with a ranking system to analyze the longitudinal stability of transcripts in infected cells. For all three cell types, we found that, on top of the triphasic pattern, instability increases at later stages of infection. The top 25% of the most stable genes showed the highest similarity across cell types, followed by the quartile with the least stable genes. Gene clusters were built based on their rank dynamics. This analysis suggests that the temporal dynamics of some interferon-responsive genes are cell type-specific. Additionally, a correlation network analysis revealed a distinct correlated response of mitochondrially encoded genes and genes involved in translation. Finally, we used the transcriptional time profiles of genes to model the progression of infected cells using a high-confidence network of protein-protein interactions.
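Taylor’s power law relates the variance of a quantity to its mean as variance = a × mean^b, so a and b can be estimated by a straight-line fit on log-log axes. The sketch below shows only that generic fit; how the law was coupled to a ranking system in this study is not described in the abstract. The function name and the synthetic profiles are hypothetical.

```python
import math
from statistics import mean, variance


def fit_taylor_power_law(profiles):
    """Fit variance = a * mean**b by least squares on log-log axes.

    `profiles` maps a gene name to a list of expression values.
    Genes with non-positive mean or variance are skipped, because their
    logarithm is undefined. Returns the tuple (a, b).
    """
    log_means, log_vars = [], []
    for values in profiles.values():
        m, v = mean(values), variance(values)
        if m > 0 and v > 0:
            log_means.append(math.log(m))
            log_vars.append(math.log(v))
    x_bar, y_bar = mean(log_means), mean(log_vars)
    sxx = sum((x - x_bar) ** 2 for x in log_means)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(log_means, log_vars))
    b = sxy / sxx
    a = math.exp(y_bar - b * x_bar)
    return a, b


def synthetic_profile(m, a=0.5, b=1.5):
    """Two values with mean m and sample variance exactly a * m**b."""
    spread = math.sqrt(a * m ** b / 2)
    return [m - spread, m + spread]


# Hypothetical genes whose variance follows the law with a=0.5, b=1.5.
toy_profiles = {f"gene{i}": synthetic_profile(m)
                for i, m in enumerate([1.0, 4.0, 10.0, 100.0])}
a_hat, b_hat = fit_taylor_power_law(toy_profiles)
print(round(a_hat, 3), round(b_hat, 3))
```

On real data the points scatter around the fitted line, and the residual of each gene from that line is one possible ingredient for a stability ranking.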

Metabolic labeling, time series, and single cells: a multifaceted approach to studying infection

Lygeri Sakellaridi | University of Würzburg, Würzburg, Germany

Viral infection is a dynamic biological process, facilitated by complex interactions between host and viral genes. On a cellular level, infection may result in different outcomes, e.g., lytic or latent. The molecular mechanisms that determine the infection outcome remain poorly understood. In order to elucidate these underlying mechanisms, we propose an approach that combines single-cell RNA sequencing with metabolic RNA labeling.

Metabolic labeling makes it possible to distinguish between RNA that was synthesized before and after the start of labeling. We first introduce grandR, a computational package that facilitates the analysis of such data and addresses the complex challenges inherent in it. Furthermore, metabolic labeling of single cells makes it possible to acquire two snapshots of expression per cell in a population, corresponding to the transcriptomic state before (past state) and after (current state) the end of labeling. Heterogeneity in the past state can be used to make causal inferences about the current state: e.g., anti-correlation of the expression of a given gene in the past state with the total viral gene expression in the current state would indicate an anti-viral gene.
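The anti-correlation argument can be made concrete with a plain Pearson correlation across cells. This is a minimal sketch and not grandR code: the function names, gene names, and numbers are hypothetical, and a real analysis would also need normalization and significance testing.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def rank_candidate_antiviral_genes(past_expression, current_viral_load):
    """Sort genes by the correlation between their past-state expression
    and current-state viral expression, most negative first.

    `past_expression` maps gene -> per-cell values (same cell order as
    `current_viral_load`).
    """
    scores = {gene: pearson(values, current_viral_load)
              for gene, values in past_expression.items()}
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical data for five cells.
viral_load = [10.0, 8.0, 6.0, 4.0, 2.0]
past_state = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0, 5.0],  # high before -> low virus later
    "GENE_B": [5.0, 4.0, 3.0, 2.0, 1.0],  # high before -> high virus later
    "GENE_C": [3.0, 3.0, 3.0, 3.0, 3.0],  # no heterogeneity
}
ranked = rank_candidate_antiviral_genes(past_state, viral_load)
print([gene for gene, _ in ranked])
```

Genes at the top of such a ranking are candidates only; the temporal order of the two snapshots is what supports a causal reading rather than the correlation itself.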

Extending this reasoning, time courses make it possible to follow the fate of cells over extended periods of time. We introduce a novel trajectory inference approach for connecting cells from consecutive time points based on the similarity between their past and current transcriptomic profiles. We envision the data as a network where cells are connected across time, and their changing expression trajectories “flow” through the network. Mathematically, this is equivalent to the minimum-cost maximum-flow problem and can be solved by a well-established optimization algorithm. Solving the network results in expression trajectories that reflect the different paths a biological process may follow across time. By constructing such trajectories in a population of infected cells, it becomes possible to determine factors that influence the infection outcome.
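As an illustration of the minimum-cost maximum-flow formulation, the sketch below links each cell at one time point to one cell at the next. The network layout is an assumption, since the abstract does not specify it: unit capacities (one-to-one links), squared distance between the current profile of the earlier cell and the past profile of the later cell as edge cost, and integer counts as profiles. The solver is a textbook successive-shortest-paths implementation; all names are hypothetical.

```python
from collections import deque


class MinCostFlow:
    """Successive shortest paths with a queue-based Bellman-Ford search."""

    def __init__(self, n_nodes):
        self.n = n_nodes
        # Each edge is [target, residual capacity, cost, index of reverse edge].
        self.graph = [[] for _ in range(n_nodes)]

    def add_edge(self, u, v, capacity, cost):
        self.graph[u].append([v, capacity, cost, len(self.graph[v])])
        self.graph[v].append([u, 0, -cost, len(self.graph[u]) - 1])

    def solve(self, source, sink):
        """Push as much flow as possible at minimum cost; return (flow, cost)."""
        flow = cost = 0
        while True:
            dist = [float("inf")] * self.n
            prev = [None] * self.n  # (previous node, edge index) on the path
            in_queue = [False] * self.n
            dist[source] = 0
            queue = deque([source])
            while queue:
                u = queue.popleft()
                in_queue[u] = False
                for i, (v, cap, c, _) in enumerate(self.graph[u]):
                    if cap > 0 and dist[u] + c < dist[v]:
                        dist[v] = dist[u] + c
                        prev[v] = (u, i)
                        if not in_queue[v]:
                            queue.append(v)
                            in_queue[v] = True
            if prev[sink] is None:  # no augmenting path is left
                return flow, cost
            push, v = float("inf"), sink
            while v != source:  # bottleneck capacity along the path
                u, i = prev[v]
                push = min(push, self.graph[u][i][1])
                v = u
            v = sink
            while v != source:  # update residual capacities
                u, i = prev[v]
                edge = self.graph[u][i]
                edge[1] -= push
                self.graph[v][edge[3]][1] += push
                v = u
            flow += push
            cost += push * dist[sink]


def link_cells(current_t1, past_t2):
    """Link cells of time point 1 to cells of time point 2 (one-to-one)."""
    n1, n2 = len(current_t1), len(past_t2)
    source, sink = 0, n1 + n2 + 1
    network = MinCostFlow(n1 + n2 + 2)
    for i in range(n1):
        network.add_edge(source, 1 + i, 1, 0)
    for j in range(n2):
        network.add_edge(1 + n1 + j, sink, 1, 0)
    for i, a in enumerate(current_t1):
        for j, b in enumerate(past_t2):
            distance = sum((x - y) ** 2 for x, y in zip(a, b))
            network.add_edge(1 + i, 1 + n1 + j, 1, distance)
    _, total_cost = network.solve(source, sink)
    links = []
    for i in range(n1):
        for v, cap, _, _ in network.graph[1 + i]:
            if n1 < v < sink and cap == 0:  # saturated edge to a later cell
                links.append((i, v - 1 - n1))
    return links, total_cost


# Hypothetical two-gene integer profiles for two cells per time point.
links, total_cost = link_cells([(10, 0), (0, 10)], [(1, 9), (9, 1)])
print(links, total_cost)
```

Integer costs are used so that the shortest-path search terminates without floating-point tolerance checks; chaining such links over consecutive time points would give one trajectory per cell lineage under these assumptions.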

SESSION 5: Viromics

Ancient virome analyses using metagenomic data from ancient individuals

Luca Nishimura | National Institute of Genetics, Mishima, Japan

Ancient DNA has been recovered from various kinds of archeological samples such as bones and mummified tissues. Those ancient samples contain ancient viral genomes that existed in the bodies of ancient organisms. Those ancient viral genomes are useful for elucidating past pandemic events and long-term viral evolution. For instance, human pathogenic viruses such as influenza A virus and hepatitis B virus have been analyzed since 1997. However, the number of ancient viruses identified thus far is small, and most are human pathogenic viruses. Here, we analyzed whole-genome sequencing (WGS) data from ancient people to discover ancient viruses more comprehensively. We utilized genomic data from 36 ancient individuals who dwelled in the Japanese archipelago, as well as more than 300 publicly available datasets. First, we conducted de novo assembly of non-human reads to obtain longer contigs. To find viral genomes, those contigs were used for homology searches against modern viral reference genomes and for other approaches such as machine-learning methods and bacterial immunological memories. As a result, we detected more than 70,000 candidate ancient viral contigs, including about 100 high-quality ancient viral genomes. To characterize the ancient virome, we analyzed ORF components and phylogenetic relationships based on those contigs. For example, we obtained the nearly complete sequence of the Siphovirus contig89 (CT89), which exists in the human oral environment. We estimated the relationships between the modern CT89 and the most recent common ancestor by phylogenetic analyses. We then compared the viral components in each sample, showing differences between ancient and modern samples which might reflect different dietary characteristics. These results suggest that metagenomic data from ancient samples are useful for elucidating ancient viral characteristics and long-term viral evolution.

One’s Trash is Another’s Treasure – Mining viromics datasets for traces of EV mediated horizontal gene transfer

Dominik Lücking | Max Planck Institute for Marine Microbiology, Bremen, Germany

Marine environmental viral metagenomes, commonly referred to as “viromes”, are typically generated by physically separating virus-like particles (VLPs) from the microbial fraction based on their size and mass. However, most of the methods used simultaneously enrich extracellular vesicles (EVs) and gene transfer agents (GTAs). Consequently, the sequence space traditionally referred to as a “virome” contains host-associated sequences, transported via EVs or GTAs. We therefore propose to call the genetic material isolated from size-fractionated (0.22 µm) and DNase-treated samples protected environmental DNA (peDNA). This sequence space contains viral genomes, DNA transduced by proviruses, and DNA transported in EVs and GTAs. Since there is currently no definitive genetic signature for EV-transported DNA, scientists rely on the successful removal of remaining contaminating cellular and free DNA when analyzing peDNA. Using marine samples collected from the North Sea, we generated a thoroughly purified peDNA dataset and developed a bioinformatic pipeline to determine the potential origin of the purified DNA. This pipeline was applied to our dataset as well as existing global marine “viromes”, enabling the identification of known GTA and EV producers, as well as organisms with actively transducing proviruses, as the source of the peDNA, thus confirming the reliability of our approach. Additionally, we identified novel and widespread EV producers and found quantitative evidence suggesting that EV-mediated gene transfer plays a significant role in driving horizontal gene transfer (HGT) in the world’s oceans.

SESSION 6: Phylodynamics

HIV-1 transmission studies using phylogenetics: can evolution help guide public health decisions?

Ana Abecasis | Universidade Nova de Lisboa, Lisbon, Portugal

Since the 1990s, HIV-1 transmission chain reconstruction has been used for different purposes. Its first use was in court cases of HIV-1 transmission, more specifically, in the famous Florida dentist case. Later on, HIV-1 transmission chain reconstruction started to be combined with socio-demographic and behavioral data in the context of public health epidemiological studies.

By combining socio-demographic, behavioral and clinical data, we reconstructed transmission chains in different contexts to better understand the most important determinants of transmission of HIV-1 infection in each scenario. Our results indicate the importance of its use to complement classical epidemiological approaches. We address potential pitfalls and methodological constraints, and present the main challenges, implications and applicability to future studies.

Molecular epidemiological approaches to investigate the dispersal dynamic of viruses and the environmental factors impacting it

Simon Dellicour | Université Libre de Bruxelles, Bruxelles, Belgium

Recent advances in genomics, mathematical modelling and computational biology have enabled molecular approaches to become key methods to investigate the spread of viral infectious diseases. In the emerging field of molecular epidemiology, genetic analyses of pathogens are used to complement traditional epidemiological methods in various ways. For instance, genetic analyses offer the possibility to infer linkages between infections that are not evident without analysing viral genomes. In particular, the development of phylogeographic methods has made it possible to reconstruct the dispersal history of epidemics in discretised or continuous space, using only a relatively limited number of viral sequences sampled from known locations and times. At the Spatial Epidemiology Lab (SpELL, ULB), we develop and apply new analytical approaches exploiting such phylogeographic reconstructions to test epidemiological hypotheses about the external and environmental factors impacting the dispersal history and dynamics of viral epidemics.

Phylodynamic analysis of A(H5N1) highly pathogenic avian influenza viruses provides insight into movement dynamics and host specificity

Will Harvey | University of Edinburgh, Roslin Institute, Edinburgh, United Kingdom

Since 2021, highly pathogenic avian influenza (HPAI) viruses of subtype A(H5N1) have caused a panzootic of unprecedented scale. This has affected both wild and domestic birds, with high-mortality outbreaks in atypical host species such as shorebirds. Spill-overs into a diverse range of mammalian species, including die-offs involving thousands of pinnipeds, have raised concerns over zoonotic potential. Observational data suggest changes in the seasonality of recent HPAI A(H5N1) viruses and a relaxation of host specificity; however, virological or ecological explanations remain elusive.

Using phylodynamic approaches, we reconstructed spatio-temporally resolved phylogenetic trees and examined trends relative to pre-panzootic A(H5N8) viruses. We examine patterns in the times and locations of reassortment events and specific movements such as between Great Britain and continental Europe and from Europe to North America. To explore recent change in host preference or specificity, we used phylogenetic tree shape-based metrics to infer fitness in hosts belonging to different avian orders and compared genotypes. Furthermore, we used supervised machine learning to identify genetic features associated with host species which included amino acid variants such as PB2 E627K, previously identified as adaptations to transmission in mammals.

We discuss these results in the context of recent experimental data, consequences for surveillance and the opportunities for further analyses that incorporate data on wild bird distributions and movements.

SESSION 7: RNA virus evolution

Viral RNA secondary structures: canonical and beyond

Kevin Lamkiewicz & Sandra Triebel | Friedrich Schiller University Jena, RNA Bioinformatics and High-Throughput Analysis, Jena, Germany

In recent years, RNA biology has evolved around identifying and annotating functional (non-)coding RNAs in organisms from all domains of life. Especially for RNA viruses, it is conceivable that genomic regions and transcripts serve additional functions essential for viral replication induced by their RNA secondary structures. In the canonical RNA structure model, base-pairing of complementary nucleotides will fold the RNA into local and global structural elements, such as the prominent stem-loop hairpin structure.

Despite well-characterized RNA structures in untranslated regions of human-infecting RNA viruses, structures within coding sequences remain unclear. For example, it is hypothesized that RNA-RNA interactions play an essential role in the discontinuous transcription mechanism of coronaviruses (CoVs). Our study proposes stable RNA secondary structures upstream of canonical CoV ORFs that may facilitate specific gene transcription. We include evolutionary information in our predictions by employing structure-guided alignment approaches of related CoVs. We further outline the wet lab experiment set-up to validate our predictions and confirm the functionality of our proposed RNA structures.

In general, RNA sequence and the underlying RNA structure sometimes evolve incongruently, which severely impacts state-of-the-art RNA structure prediction approaches, since these assume congruent evolution of sequence and structure. For example, in Filoviridae, we identified conserved structural elements in which homologous nucleotides do not fold into homologous base pairs, i.e., sequence and structure are shifted relative to each other. We used the pairwise alignment distances calculated by BiAlign to generate neighbor-joining trees to assess the evolutionary scenario of such elements. To this end, we evaluated the correlation of incongruently evolving structures with the phylogenetic tree defined by the ICTV and, if applicable, the corresponding gene trees.

The incongruent evolution model and other non-canonical structural elements merit further research into the role of functional RNAs in viruses and the establishment of novel tools.

Recombination and Modular Evolution of Positive-strand RNA Viruses: Similar, but not the Same

Yulia Vakulenko | Martsinovsky Institute of Medical Parasitology, Tropical and Vector Borne Diseases, Sechenov First Moscow State Medical University, Moscow, Russia

Recombination is very common in positive-strand RNA viruses. Along with a high mutation rate, it is one of the major forces generating genetic diversity. We systematically analyzed patterns of natural recombination in four (+)RNA virus families: Astroviridae, Caliciviridae, Picornaviridae and Coronaviridae, using both classical recombination detection methods and a comparison of the correspondence of genetic distances in different genome regions. A common (and generally known) feature of these virus families was frequent recombination between genome regions encoding nonstructural and structural proteins. However, the recombination profiles within these genome regions were contrasting. In picornaviruses, there was frequent recombination within the nonstructural genome region with no prominent hotspots, and recombination was almost absent within the structural genome region. Caliciviruses routinely exchanged full structural and nonstructural blocks of the genome, but had few, if any, recombination events within these regions. In astroviruses, moderate recombination was observed within both structural and nonstructural genomic regions. In coronaviruses, the spike gene, but not the other structural protein genes (E, M, N), was most commonly exchanged. Recombination within the spike gene occurred more frequently than within the nonstructural region, and more commonly involved entire domains of the spike protein. Therefore, these (+)RNA viruses with very different genome organization and realization had a common general recombination pattern, which could effectively provide independent evolutionary trajectories for structural and nonstructural proteins. Protein function was the major factor defining their relative mobility by recombination. On the other hand, viruses of closely related families could have contrasting recombination patterns within these major genome blocks.

RNAswarm: A Modular Pipeline for Differential RRI Analysis in Influenza A Virus

Gabriel Lencioni Lovate | Friedrich Schiller University Jena, Jena, Germany

RNA proximity ligation methods, such as PARIS, SPLASH, and 2CIMPL, provide an experimental approach to detect RNA-RNA interactions (RRIs) on a large scale. Although these methodologies can probe RRIs with a high throughput, there is currently no established bioinformatics pipeline to statistically compare the frequency of RNA interactions across strains or experimental settings with high throughput. Therefore, we developed RNAswarm, a modular and reproducible Nextflow pipeline for differential RRI analysis. We apply RNAswarm to SPLASH datasets of influenza A virus, a significant global threat due to its ability to cause severe morbidity and mortality, unraveling differentially structured regions in its segmented RNA genome.

With RNAswarm, we identified strain-specific RRI sites in different IAV strains and validated previous findings while quantifying variations across strains. Our pipeline first processes raw reads from RNA proximity ligation experiments, then identifies RRIs, and finally uses DESeq2 to analyze differential RRI representation across different IAV strains or experimental conditions. Our tool enables de novo annotation of discrete interactions by generating pairwise matrices of chimeric reads and fitting Gaussian Mixture Models to identify normally distributed potential interactions.

RNAswarm identifies important RRIs across replicates and highly conserved or flexible interaction sites in different organisms, and provides a reliable and automated approach for prioritizing and comparing RRIs across different conditions or organisms. Our tool can potentially discover new RNA-RNA interactions in viruses and other biological systems. Moreover, the modularity and reproducibility of RNAswarm make it a valuable tool for researchers investigating RNA-RNA interactions in diverse contexts.

SESSION 8: Viral sequence analysis

Embedding segmented viral genomes for visualisation, search, and clustering

Udo Gieraths | Charité – Universitätsmedizin Berlin, Virologie, Berlin

Important human and animal pathogens like influenza and rotaviruses have segmented genomes. Such viruses can reassort during co-infection of a cell with different viral strains, in which case a new viral genome is created via the exchange of segments. In such cases, the evolutionary history, and therefore the phylogenetic trees of each segment, may differ considerably. This property hinders classical phylogenetic analysis and simple searches for similar genomes. 

We present an approach to embed segmented viral genomes in a mathematical space that allows efficient search and clustering. In the context of clustering, outliers that do not fit into any cluster are easily identified. These outliers represent rare reassortment events of particular interest, as reassortant viral genomes can give rise to highly pathogenic variants. Using the example of the influenza A virus, we show various applications of our developed segmented viral genome embedding in the context of search, clustering, and outlier detection.

Hyper-EINS: A tool for automated identification of insertions in the hepatitis E virus hypervariable region

Maximilian Nocke | Ruhr University Bochum, Molecular & Medical Virology, Bochum, Germany

Introduction: Hepatitis E virus (HEV) infections are usually asymptomatic and self-limiting, while immunocompromised patients or those in other risk groups may develop chronic courses. The hypervariable region (HVR) within HEV’s ORF1 is known to integrate sequence snippets of human and HEV origin. Those insertions are associated with replication fitness and chronicity.

Objectives: With Hyper-EINS, we aim to provide a time-efficient tool that automates the identification and validation of insertions in high-throughput sequencing (HTS) data, while offering an easy-to-use graphical user interface (GUI) to improve accessibility.

Methods: During Hyper-EINS development, we implemented and tested a broad variety of bioinformatic tools and programming languages to reduce runtime and storage usage. A first, command-line-only version of the analysis tool was written in Python 3 and established the general workflow. Blastn was integrated in several variations to reduce runtime. To assemble full insertion sequences, Trinity was introduced into the pipeline.

To further speed up the analysis, Hyper-EINS was reimplemented in Julia. The integrated just-in-time compiler provides a runtime advantage, especially when processing multiple datasets. To reduce runtime even further, MMseqs2 was integrated into the Julia scripts for gene assignment in place of blastn.

Results: During development, runtime was drastically reduced and the need for extended command line usage was removed by implementing a GUI. Hyper-EINS was validated by monitoring dynamic rearrangements in the HVR of a chronically HEV-infected patient over time. Here, we identified insertions with a critical impact on viral fitness and gained further insights into the content and distribution of insertions and duplications.

Conclusion: Hyper-EINS has been designed as a user-friendly tool for detecting insertions of human or viral origin in the HVR of HEV from HTS data using computers with limited RAM and processing power. It includes a GUI to help users interpret and validate the output.

Magnipore: Predicting differential single nucleotide changes in Oxford Nanopore Technologies sequencing signal in SARS-CoV-2

Jannes Spangenberg | Friedrich Schiller University Jena, RNA Bioinformatics, Jena, Germany

Oxford Nanopore Technologies (ONT) sequencing enables direct ribonucleic acid (RNA) sequencing with the opportunity to detect RNA modifications. In ONT sequencing, nucleotides are pulled through nanopores that are under an internal ion flow. An interruption of this ion flow, the so-called ONT signal, is specific for the respective nucleotides in the pore. RNA modifications influence this signal. So far, tools for detecting such changes are trained to identify only a small number of selected RNA modifications. Another way to identify RNA modifications is to compare two samples for the presence of different RNA modifications. However, such tools lack the ability to differentiate between significant signal shifts caused by mutations and those caused by differential modifications. We present Magnipore, a novel tool that searches for significant signal shifts between samples of Oxford Nanopore data from similar or related species and classifies them into mutations and potential modifications. We used Magnipore to compare sequences of the entire SARS-CoV-2 genome obtained by direct RNA sequencing with the corresponding sequence obtained by reverse transcription and PCR amplification. Included were representatives of the early 2020s Pango lineages (n=6) and samples from Pango lineages B.1.1.7 (n=2, Alpha), B.1.617.2 (n=1, Delta), and B.1.1.529 (n=7, Omicron). Magnipore utilizes position-wise Gaussian distribution models and a comprehensible significance threshold to find differential signals. We will demonstrate the difficulties that unevenly distributed coverage and inaccurate signal segmentation pose for calculating significant signal differences. In the case of the previous variants of concern Alpha and Delta, we highlight a set of 55 detected mutations and 15 additional positions hinting at differential modifications. In addition, we identified potential virus-variant and variant-group specific differential modifications.
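To illustrate the idea of comparing position-wise Gaussian signal models, the following minimal sketch scores each aligned position by the symmetric Kullback-Leibler divergence between the two samples' Gaussians and labels positions above a cut-off. The choice of divergence, the threshold value, the input layout, and the function names are assumptions made for this example; the abstract does not specify Magnipore's actual score or threshold.

```python
from math import log


def kl_gauss(m1, s1, m2, s2):
    """KL divergence KL(N(m1, s1^2) || N(m2, s2^2)) of two Gaussians."""
    return log(s2 / s1) + (s1 ** 2 + (m1 - m2) ** 2) / (2 * s2 ** 2) - 0.5


def symmetric_kl(m1, s1, m2, s2):
    """Symmetrised divergence, so the order of the two samples is irrelevant."""
    return kl_gauss(m1, s1, m2, s2) + kl_gauss(m2, s2, m1, s1)


def classify_positions(sample_a, sample_b, threshold=1.0):
    """Label positions whose signal distributions differ between two samples.

    sample_a, sample_b: lists of (base, signal_mean, signal_std), one entry
    per aligned position. The threshold is a hypothetical value.
    """
    calls = []
    for pos, (a, b) in enumerate(zip(sample_a, sample_b)):
        base_a, mean_a, std_a = a
        base_b, mean_b, std_b = b
        if symmetric_kl(mean_a, std_a, mean_b, std_b) < threshold:
            continue  # no significant signal shift at this position
        # A differing base explains the shift as a mutation; otherwise the
        # shift hints at a differential modification.
        label = "mutation" if base_a != base_b else "potential modification"
        calls.append((pos, label))
    return calls


# Toy signal models for three aligned positions (values are made up).
sample_a = [("A", 100.0, 2.0), ("C", 90.0, 2.0), ("G", 80.0, 2.0)]
sample_b = [("A", 100.5, 2.0), ("U", 110.0, 2.0), ("G", 95.0, 2.0)]
calls = classify_positions(sample_a, sample_b)
```

Position 0 falls below the cut-off, position 1 is called a mutation because the bases differ, and position 2 is reported as a potential modification.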

Magnipore contributes to advancing RNA modification analysis in the context of viruses and virus variants.

VILOCA: Sequencing quality-aware haplotype reconstruction and mutation calling for short- and long-read data

Lara Fuhrmann | ETH Zurich, D-BSSE, Basel, Switzerland

RNA viruses have high mutation rates, short generation times, and large population sizes that result in diverse within-host viral populations, which can affect disease progression and treatment outcome. It is therefore essential to characterize and quantitatively assess within-host viral diversity. Next-generation sequencing (NGS) can inform about the genetic composition of within-host viral populations on different spatial levels ranging from single-nucleotide variants to local or global haplotypes. However, distinguishing between sequencing errors and true biological mutations is difficult. We propose VILOCA, a method for mutation calling and reconstruction of local haplotypes from both short- and long-read NGS data. Local haplotypes refer to local regions that have roughly the length of the input reads. VILOCA recovers local haplotypes, even at low frequencies, by using a Dirichlet mixture model to cluster reads around their (unobserved) haplotypes. To evaluate the performance of VILOCA, we developed a benchmarking framework for comparison to other methods. It allows the flexible generation of haplotype population mixtures and read simulation. We also integrated three real experimental virus samples from Illumina and Nanopore read sequencing data. We compared the performance of VILOCA to several other tools based on mutation calls and haplotype reconstruction on the simulated samples as well as the real experimental samples. On the simulated mixtures from long-read technologies, we found that VILOCA outperforms other methods in reliable mutation calling and can even recover global haplotypes if enough long reads are present. For our integrated experimental sample of Illumina reads, VILOCA performs similarly to other methods but shows improved performance in detecting real biological deletions.

In summary, VILOCA is a novel method for mutation calling and reconstruction of local and global haplotypes, which is especially useful for mixed long-read samples.

SESSION 9: Machine Learning in Viral Surveillance

From High-Throughput Testing to Genomic Surveillance and Public Health Data Integration

Bernhard Renard | Hasso-Plattner-Institut, Potsdam, Germany

Genome sequencing plays an increasing role in infectious disease diagnostics as well as in public health surveillance programs. Facilitated by algorithmic and machine learning approaches for signal processing and information aggregation, rapid sequencing procedures are arriving in clinical infectious disease diagnostics. At the same time, we see an increase in genomic surveillance, which allows early detection of outbreaks and infection chains. We have introduced platforms to learn and predict movement and spreading patterns in a population as well as to predict genomic risk patterns. We show how graph neural networks then allow integration across heterogeneous sources as well as prediction and explanation of spreading patterns.

BLOODVIR: Virus surveillance system for plasma pools based on high-throughput sequencing and machine learning

Martin Machyna | Paul-Ehrlich-Institut, Langen, Germany

The threat posed by novel and re-emerging viruses has increased. Many of these viruses are bloodborne and therefore pose an immediate risk to recipients of blood donations or blood-derived products. It is therefore vital to establish surveillance systems that are capable of detecting viruses in an unbiased manner before they have a chance to spread in the human population. Here we report our efforts on the development of a virus detection system for continuous monitoring of infection risks in blood plasma using agnostic high-throughput sequencing and machine learning.

Our preliminary data from comparing k-mer-based and alignment-based metagenomic tools for virus detection identified alignment-based tools as more suitable. While k-mer approaches tend to identify most of the true positives (sensitivity around 1.0), this comes at the expense of a higher number of false positives. Alignment-based methods not only performed better in terms of precision, but also yielded appropriate results for all the remaining quality metrics. This also holds true when applying the methods to mutated datasets. Hence, we used alignment-based methods and built a pipeline that accepts reads in FASTQ format and outputs an HTML report indicating identified species and quality statistics.
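The trade-off described above can be made concrete with the standard definitions of sensitivity and precision. The sketch below computes both from a set of reported species and a set of truly present species; the function name and the toy species labels are hypothetical and do not come from the BLOODVIR evaluation.

```python
def detection_metrics(predicted, truth):
    """Sensitivity and precision of a set of reported species.

    predicted: species reported by a tool; truth: species actually present.
    """
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)   # correctly reported species
    fp = len(predicted - truth)   # reported but not present
    fn = len(truth - predicted)   # present but missed
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return {"sensitivity": sensitivity, "precision": precision}


# Toy example with made-up labels: both tools find every true species,
# but the first one also reports two spurious hits.
truth = {"virus_1", "virus_2"}
permissive_tool = detection_metrics(
    {"virus_1", "virus_2", "virus_3", "virus_4"}, truth
)
strict_tool = detection_metrics({"virus_1", "virus_2"}, truth)
```

Both toy tools reach a sensitivity of 1.0, but the permissive one only reaches a precision of 0.5, which mirrors the pattern of full sensitivity bought with extra false positives.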

In order to improve our predictions of known and novel viruses from high-throughput sequencing data, we performed hyperparameter tuning of a deep neural network (DNN) model designed for metagenomic classification. We trained the model with HIV-1 B-subtype genome fragments that were generated in silico using a range of mutation rates. Evaluating models on sequences from HIV subtypes A, C-L, N, O, P and U revealed a dramatic improvement in the ability to detect these “unseen” HIV variants compared to a model trained with non-mutated sequences. In conclusion, our system shows great potential for viral risk monitoring.
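A minimal sketch of how mutated training fragments could be generated in silico is shown below. It assumes substitution-only mutations applied independently per base and fixed-length sliding-window fragments; the fragment length, step size, mutation rates, and function names are illustrative assumptions rather than the settings used in the study.

```python
import random

BASES = "ACGT"


def mutate(seq, rate, rng):
    """Substitute each base with a different one with probability `rate`."""
    out = []
    for base in seq:
        if base in BASES and rng.random() < rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def fragments(seq, length, step):
    """Cut a sequence into overlapping fragments of a fixed length."""
    return [seq[i:i + length] for i in range(0, len(seq) - length + 1, step)]


def training_fragments(genome, rates, length=150, step=50, seed=0):
    """Return (rate, fragment) pairs for each mutation rate in `rates`."""
    rng = random.Random(seed)  # fixed seed keeps the data set reproducible
    data = []
    for rate in rates:
        mutated = mutate(genome, rate, rng)
        data.extend((rate, frag) for frag in fragments(mutated, length, step))
    return data


# Toy 400-nt "genome"; a real run would read a reference sequence instead.
genome = "ACGT" * 100
dataset = training_fragments(genome, rates=[0.0, 0.05, 0.2])
```

Each mutation rate contributes the same number of fragments, so a model sees both the original sequence and increasingly diverged versions of it.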

Modelling the zoonotic capabilities of avian influenza via genomic machine learning

Liam Brierley | University of Liverpool, Liverpool, UK

Avian influenza is currently a high-risk threat in Europe. The 2021/22 outbreak was the largest yet observed, and several countries stringently controlled domestic birds in response. In addition, recent zoonotic bird-to-human transmissions in the UK and Russia have increased concerns about further human spillover and adaptation.

Several seminal modelling studies have demonstrated that machine learning algorithms can be trained directly on genome sequence data to make suitable predictions about which virus species may represent future zoonoses. However, few models of zoonotic potential have addressed the wide intra-species variation within avian influenza A viruses.

We use NCBI GenBank and GISAID to source over 20,000 whole genome sequences of avian influenza, including 628 sequences of zoonotic influenza sampled from humans. We then train machine learning algorithms (such as random forests and gradient boosted models) to predict zoonotic status based on genomic and proteomic traits. Features included genome composition of dinucleotides and codons within each gene and physicochemical properties of each protein (e.g., polarity, hydrophobicity). Training sets and cross-validation procedures excluded closely-related sequences to avoid inflated performance and phylogenetic bias from certain subtypes being heavily sampled (and sharing the same zoonotic status). Loss functions were also used to prioritise correct prediction of the zoonotic class.
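As an illustration of the kind of genome composition features described above, the sketch below computes dinucleotide and codon frequencies for a coding sequence, plus inverse-frequency class weights as one simple way to make errors on a rare class count more. The function names and the weighting scheme are assumptions for this example; the abstract does not state which loss weighting was used.

```python
from collections import Counter
from itertools import product

DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]


def dinucleotide_freqs(seq):
    """Relative frequency of each of the 16 dinucleotides in a sequence."""
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(counts[d] for d in DINUCLEOTIDES) or 1
    return {d: counts[d] / total for d in DINUCLEOTIDES}


def codon_freqs(cds):
    """Relative codon frequencies of an in-frame coding sequence."""
    usable = len(cds) - len(cds) % 3  # ignore an incomplete trailing codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if set(c) <= set("ACGT"))
    total = sum(counts.values()) or 1
    return {codon: n / total for codon, n in counts.items()}


def class_weights(labels):
    """Inverse-frequency weights: rarer classes receive larger weights."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}


# Toy inputs: one short coding sequence and an imbalanced label vector.
features = dinucleotide_freqs("ATGAAAATG")
codons = codon_freqs("ATGAAAATG")
weights = class_weights(["zoonotic"] + ["avian"] * 3)
```

The resulting per-gene feature dictionaries could be concatenated into one row per genome, and the weights passed to any learner that accepts per-class weights.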

Our developed models are tailored to identify which lineages of avian influenza currently circulating in wild birds have potential to become zoonotic in future. By extracting influential model features, we identify the most important nucleotide and protein regions associated with human infectivity. If trained appropriately, computational learning frameworks can inform more reactive strategies to prevent zoonotic infection by offering key genomic sites to monitor viral evolution in wildlife and early risk estimates of newly identified viruses as soon as sequences are available.

SESSION 10: Viral Pathogenesis

Sex differences in avian influenza virus infections

Gülsah Gabriel | Leibniz Institute of Virology, Hamburg, Germany

SESSION 11: Metagenomics for Identifying and Tracking Potential Zoonotic Viruses

Metagenomic next generation sequencing as a tool for identifying undiagnosed pathogens

Emma Thomson | University of Glasgow, Glasgow, United Kingdom

Discovering and tracking potential zoonotic species from metagenomic samples with a capture-based oriented pipeline

Maria Tarradas-Alemany | Universitat de Barcelona, Genetics, Microbiology and Statistics, Barcelona, Spain

Since the dawn of next-generation sequencing (NGS) technologies, these strategies have become crucial in the study of microbial communities from environmental samples. However, there are still challenges to overcome, from both biological and computational perspectives, in characterizing their virome composition. Viral metagenomics has to deal with low-quality sequences, possible sample biases (due to chemical inhibitors, degradation, etc.), challenging data analysis, and more specifically the lack of standardized regions for classification, the arduous purification of enough biomass for sequencing, and the limited completeness of the available virus databases. In addition, most of the viral particles found in environmental samples correspond to bacteriophages, which further complicates the detection of specific taxa or viral families.

To overcome some of these issues, a bioinformatic protocol is being implemented to spot and characterize viral species with zoonotic potential in environmental samples processed using capture-based high-throughput sequencing. The Nextflow pipeline under development includes read clean-up, assembly, virome annotation, and viral discovery. A set of viral samples obtained from sewage and bat guano has already been analyzed with this pipeline; moreover, sequences obtained by whole-genome shotgun and probe-based viral capture approaches have also been considered to assess the performance of both the capture kit and the pipeline.

The results show an increased number of assigned viral contigs in the capture approach (using the RVDB database), which also achieves higher coverage and similarity to reference sequences of potentially zoonotic viruses.