International Virus Bioinformatics Meeting 2022 » Oral Presentation Abstracts

Wednesday, 23rd March: Satellite Meeting on SARS-CoV-2

Morning Session
The changing landscape of SARS-CoV-2 genetic diversity | Francois Balloux
No abstract available.
covSonar: Efficient and rapid genome profiling of SARS-CoV-2 sequences | Alice Wittig
With the pandemic spread of SARS-CoV-2, genomic surveillance has become an indispensable instrument for rapid detection and continuous tracking of new and emerging virus variants in near real time. With over 7 million publicly available genome sequences by January 2022, SARS-CoV-2 is now by far the most sequenced pathogen. In Germany alone, more than 500,000 genome sequences have been submitted to the National Public Health Institute since the beginning of 2021.

Here, we present covSonar, an efficient, database-driven system for automatic acquisition of genomic profiles, making them easily accessible, searchable and linked to selected metadata. As a proof-of-concept, we show that the system is capable of rapidly tracking any genomic mutation pattern of SARS-CoV-2 and especially those that provide protection against neutralizing antibodies based on in vitro data (Greaney et al., PMID: 33259788). We integrate several publicly available data resources that are essential for effective genomic surveillance, such as the German electronic sequence data hub (DESH), which includes both random and targeted sequences, and deep mutational scanning data. Integrating this information, we show a positive trend of genomic sequences from randomly drawn samples from Germany mainly carrying Spike protein mutations, which confer protection against class 2 monoclonal antibodies in vitro. In contrast, we did not detect alarming trends for mutations facilitating escape from polyclonal antibodies obtained from patient sera after Moderna vaccination. However, as the human population becomes more and more immunized, the selection pressure on the virus also changes, which could favor more advantageous mutations. Thus, an efficient and sensitive surveillance system for critical mutations of SARS-CoV-2, such as covSonar, is crucial to enable targeted risk assessment.

The emergence of SARS-CoV-2 variants of concern is driven by acceleration of the substitution rate | Sebastian Duchene
The ongoing SARS-CoV-2 pandemic has seen an unprecedented amount of rapidly generated genome data. These data have revealed the emergence of lineages with mutations associated with transmissibility and antigenicity, known as variants of concern (VOCs).

A striking aspect of VOCs is that many of them involve an unusually large number of defining mutations. Current phylogenetic estimates of the substitution rate of SARS-CoV-2 suggest that its genome accrues around 2 mutations per month. However, VOCs can have 15 or more defining mutations and it is hypothesised that they emerged over the course of a few months, implying that they must have evolved faster for a period of time.
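As a back-of-the-envelope illustration of the reasoning above, the sketch below computes how long 15 mutations would take to accrue at the quoted background rate of about 2 per month, and the fold increase implied if they accrued in a shorter window. The 3-month window is an illustrative assumption only, not a figure from the study, and the calculation is far cruder than the molecular clock analyses presented in the talk.

```python
# Figures quoted in the abstract
BACKGROUND_RATE = 2.0     # mutations per month (approximate background estimate)
DEFINING_MUTATIONS = 15   # lower bound on defining mutations of a VOC


def months_at_rate(n_mutations, rate_per_month):
    """Time needed to accrue n_mutations at a constant rate."""
    return n_mutations / rate_per_month


def implied_fold_increase(n_mutations, window_months, background_rate):
    """Rate needed to accrue n_mutations within window_months,
    expressed as a multiple of the background rate."""
    return (n_mutations / window_months) / background_rate


# At the background rate, 15 mutations take 7.5 months.
expected_months = months_at_rate(DEFINING_MUTATIONS, BACKGROUND_RATE)

# Hypothetical 3-month emergence window (assumption for illustration).
fold = implied_fold_increase(DEFINING_MUTATIONS, 3.0, BACKGROUND_RATE)
```

Under these assumptions `expected_months` is 7.5 and `fold` is 2.5; more defining mutations or a shorter window would imply a larger episodic increase.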

In this talk I will present detailed molecular clock analyses of genome sequence data from the GISAID database to assess whether the emergence of VOCs can be attributed to changes in the substitution rate of the virus.

Our results indicate that the emergence of VOCs is driven by an episodic increase in the substitution rate of around 4-fold the background phylogenetic rate estimate that may have lasted several weeks or months. This outcome stands in contrast with the notion that the virus has overall increased its mutation rate. In sum, this study underscores the importance of monitoring the molecular evolution of the virus as a means of understanding the circumstances under which VOCs may emerge.

VOCAL: An early warning system to detect concerning new SARS-CoV-2 variants from sequencing data | Hugues Richard
The evolution of the SARS-CoV-2 virus has led to the emergence of waves of variants that exhibit more worrying phenotypes, e.g., resulting in higher antibody escape or transmissibility. New variants are monitored and annotated as Variants Of Interest or Concern (VOI/C), e.g., by the WHO. Molecular typing can be combined with epidemiological and experimental data to detect the appearance of new concerning variants. However, delays in sequencing and case reporting, combined with limited sampling capacity, can cause their identification to lag weeks behind their emergence in the population. Automated systems that can score concerning samples based on their sequence information alone are urgently needed.

Based on the extent of convergent evolution observed in SARS-CoV-2, one can devise a system that could generalize from previous examples to rank and identify potential concerning samples based on their amino acid (AA) profile. VOCAL, the Variant Of Concern ALert system, starts from full genome sequences and categorizes the AA changes appearing in the spike protein depending on the type of mutation present and their overlap with known antibody binding sites, epitope regions and sites under positive selection. VOCAL then detects the potentially concerning samples and ranks them according to three tiers of alert level.
We evaluated VOCAL retrospectively by considering all German sequences during two scenarios of emerging VOCs in 2021: the Delta variant (April) and the recent Omicron (December). All of the VOC samples were correctly detected as high concern (Delta: 30/30 (100%), Omicron: 3372/3446 (97%)). For Delta, an additional set of 21 samples was detected, mainly assigned to lineages B.1.617.1 and B.1.617.3 (which have also been reported as concerning). In summary, VOCAL is a specialized tool for the early detection of potentially concerning variants from large collections of SARS-CoV-2 genomes.

Afternoon Session
SARS-CoV-2 genomic epidemiology: Bayesian phylodynamic reconstruction and characterization of antigenic evolution | Philippe Lemey
As the COVID-19 pandemic unfolded, viral genomic data was produced at an unprecedented scale, making it possible to track the SARS-CoV-2 evolutionary and epidemiological dynamics and providing important insights for intervention strategies. Here, I will highlight a number of developments in a Bayesian statistical framework in support of SARS-CoV-2 phylodynamic reconstructions, including the integration of individual travel history and mobility data, and their applications to track the early introduction and spread of the virus and the contribution of persistence and introductions in subsequent waves. Finally, I will illustrate how genomic epidemiology helped to inform vaccine developments at the Rega Institute.
ImpuSARS: A new tool for whole-genome imputation of SARS-CoV-2 | Francisco Ortuño
Viral whole-genome sequencing has proved widely helpful for surveillance throughout the current SARS-CoV-2 pandemic. Consequently, unprecedented efforts have been made worldwide to produce and characterize hundreds of thousands of viral sequences. However, the high demand and urgency of results are leading to the production of incomplete or low-quality sequences that cannot be used. Under these circumstances, an imputation method would clearly be beneficial: by predicting missing regions and rebuilding the original sequences, many sequences can be recovered.

Therefore, we present our recently designed impuSARS tool, which allows the imputation of SARS-CoV-2 genomes by taking advantage of a reference panel with an enormous number of SARS-CoV-2 sequences (over 230k sequences). impuSARS has been developed to be freely distributed and is either encapsulated with Docker or available in a conda environment in Python. Consequently, the application can be integrated in any primary data processing pipeline for SARS-CoV-2 whole-genome sequencing. Also, this tool can be adapted to impute any other viral genome by customizing its own reference panel.

impuSARS has been validated under several simulated conditions of missing regions (continuous fragments, whole amplicons or sparse single positions) as well as on real sequencing samples with low-coverage genomes. Results showed high accuracy when predicting the original sequences, recovering lineages with 100% precision for almost all lineages, even with poorly covered genomes (less than 20%). Hence, impuSARS has been shown to accurately recover many incomplete or low-quality sequences that would otherwise be discarded.

National-scale surveillance of emerging SARS-CoV-2 variants in wastewater | Fabian Amman
Monitoring the evolution of SARS-CoV-2 in the human population is crucial for the identification of variants with altered properties such as transmissibility, pathogenicity and/or immune escape. During the ongoing pandemic, wastewater-based epidemiology (WBE) garnered renewed attention as a non-invasive and complementary approach to SARS-CoV-2 sequencing from infected individuals. Yet, data analysis remains a challenge and national WBE surveillance programs have not been widely implemented.

Here, we present the results of a national-scale SARS-CoV-2 WBE program in Austria, which aggregated 2,093 samples from December 2020 to September 2021 collected at 95 sewage plants covering over 57% of the total population. Using our Variant Quantification in Sewage pipeline designed for Robustness (vaquero), we reproducibly deduced SARS-CoV-2 variant abundance from complex wastewater samples. This was validated by the epidemiological integration of over 130,000 individual variant-genotyped cases of the respective catchment areas. Compared to conventional epidemiological surveillance data, our WBE analyses accurately recapitulated the emergence of the dominant Alpha and Delta variants across the country and delineated large regional clusters of other variants of concern. Finally, we provide a framework to infer variant-specific reproduction numbers from wastewater and predict emerging variants de novo.

Our study demonstrates the power of national-scale WBE. Such non-invasive surveillance programs are likely to play increasingly important roles for tracking the dynamics of new SARS-CoV-2 variants and may prove particularly useful for pandemic management in countries without dense individual surveillance programs.

The power of SARS-CoV-2 genotyping and SNP-based clustering for contextual outbreak assessment | Denis Beslic
The COVID-19 pandemic has triggered an unprecedented increase in viral genome sequencing for molecular surveillance. Datasets of sequenced SARS-CoV-2 genomes are ideally suited for potential outbreak identification but also to enrich and better understand local outbreak events. Using the genetic distance of different samples to analyze their epidemiological relatedness has become an essential method for monitoring transmissions of various pathogens [1,2]. However, existing approaches are computationally costly and impractical given the current amount of data [3].

To quickly identify putative outbreaks and transmission clusters, we developed BREAKFAST, a tool for rapid sequence clustering in the specific context of SARS-CoV-2, and applied it to German and international sequences. Our approach, which derives transmission clusters from SNP occurrences, is motivated by the low mutation rate of SARS-CoV-2. Here, the pairwise genetic distance between multiple sequences is computed via a constructed sparse matrix of alignment-based genomic profiles.
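The idea of deriving clusters from SNP occurrences can be sketched in a few lines. This is a toy illustration, not BREAKFAST's implementation: Python sets stand in for the rows of the sparse profile matrix, the distance is assumed to be the number of mutations found in exactly one of two profiles, clusters are formed by single linkage under a hypothetical `max_dist` threshold, and the sequence names and mutation labels are invented for the example. It also compares all pairs, which a tool built for large datasets would avoid.

```python
from itertools import combinations


def snp_distance(profile_a, profile_b):
    """Number of mutations present in exactly one of the two profiles."""
    return len(profile_a ^ profile_b)


def cluster_profiles(profiles, max_dist=1):
    """Single-linkage clustering: sequences whose SNP distance is at most
    max_dist end up in the same cluster (directly or via a chain)."""
    parent = {name: name for name in profiles}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(profiles, 2):
        if snp_distance(profiles[a], profiles[b]) <= max_dist:
            parent[find(a)] = find(b)

    clusters = {}
    for name in profiles:
        clusters.setdefault(find(name), set()).add(name)
    # Sort clusters by their smallest member name for a stable order.
    return sorted(clusters.values(), key=lambda c: sorted(c))


# Illustrative mutation profiles relative to a reference genome.
profiles = {
    "seq1": frozenset({"C241T", "A23403G"}),
    "seq2": frozenset({"C241T", "A23403G", "G28881A"}),
    "seq3": frozenset({"C241T", "A23403G", "G28881A", "C3037T"}),
    "seq4": frozenset({"T445C", "C6286T", "G29645T"}),
}
clusters = cluster_profiles(profiles, max_dist=1)
```

Here `seq1`, `seq2` and `seq3` form one cluster because each neighbouring pair differs by a single SNP, while `seq4` remains a singleton.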

Using pre-computed mutation profiles, we clustered 114,042 sequences in 1.5 minutes using 80 cores and a peak of 1.45 GB of RAM. Its efficiency and intuitive parameters make BREAKFAST suitable for monitoring fast-growing clusters and analyzing potential outbreaks on a daily basis. Computationally intensive phylogenetic tools can be applied to a smaller set of sequences of interest based on the clustering results. To verify the performance of our method, we compared BREAKFAST with recently developed tools for clustering DNA sequences on different viral datasets.

These results demonstrate that targeted methods, which leverage a pathogen’s specific properties, can be used in conjunction with large datasets to provide key insights into the ongoing COVID-19 pandemic. Our approach was applied to add individuals to already known outbreaks, and trigger follow-up epidemiological investigations of transmission clusters.

BREAKFAST is freely available via GitHub.

References
1 Campbell et al., PMID: 29029156
2 Andre et al., PMID: 17018825
3 Harper et al., PMID: 33626040

Thursday, 24th March

Session 1: Viral emergence and surveillance
Real-time to Real-life: Phylogenetics, Pandemics, and What Comes Next | Emma Hodcroft
Since the announcement of the first variant of concern (VoC) in December 2020, the COVID-19 pandemic has been increasingly shaped not only by viral spread, restrictions, and immunity, but also by variants with increased transmission and immune evasion. Detecting and tracking these emerging variants – and deciding how to react to them – has been no small challenge. With over 7 million publicly available sequences, and millions of unique clusters of sequences, identifying those with mutations of interest and determining if they might be the next VoC is far from straightforward.

As the pandemic progresses, heterogeneity in immune history, through infections, vaccinations, and boosters, also means increasing heterogeneity in how ‘concerning’ a VoC may be: the impact of Omicron varied widely across countries. In turn, future variants on the ‘road to endemicity’ may pose different risks to different populations.

Though it’s impossible to predict what future variants may mean for how much SARS-CoV-2 continues to impact society, the return of pathogens that were suppressed during the restrictions of 2020 and early 2021 is a reminder of the common disparity in data and understanding between SARS-CoV-2 and the world of viruses we live in. How do we pivot our real-time test of the role that sequencing, modelling, and immunity panels can play in public health to a sustainable real-life integration of research and healthcare for a better understanding of human viruses overall?

Can genomics help prevent viral emergence? | Daniel Streicker
Virus transmission from wildlife to humans is responsible for a variety of emerging infections, from SARS-CoV-2 to Ebola. The notorious difficulty of anticipating viral emergence means that mitigation efforts predominately initiate only after viruses are circulating in new host species. This reactive approach guarantees a sustained public health burden from re-emerging viruses and preserves opportunities for novel viruses to establish in human populations. Viral genomic data is increasingly accessible but tends to be used to guide outbreak response rather than outbreak prevention. I will present two examples which combine viral sequence data with new biological and computational technologies to inform interventions that seek to forecast and prevent viral emergence. First, focusing on vampire bat rabies, a persistently re-emerging zoonosis in Latin America, I will show how metagenomic and deep sequencing can guide the selection and deployment of virally-vectored self-disseminating vaccines that aim to protect human and animal health by suppressing viral circulation within wildlife reservoirs. Second, I will show how machine learning can identify high risk zoonoses from sequence data alone, enabling data-driven prioritization of viruses for laboratory experiments and field or clinical surveillance. These studies illustrate how genomics, ecology and evolutionary biology can be integrated to reveal fundamental insights into natural host-virus interactions while forging the path from reactive to preventive management.
Genomic Surveillance of the Rift Valley fever: from Sequencing to Lineage assignment | John Juma
The evolutionary history of Rift Valley fever virus (RVFV) is complex and has been greatly influenced by dramatic environmental changes throughout Africa in the past few decades. Over this time period, RVFV gene flow has been impacted on various levels such as geographic dispersal and reassortment events. Overall, there are 15 lineages, designated A to O. On numerous occasions, viruses from these lineages have been detected outside enzootic regions through probable movement of infected animals and/or mosquitoes. This has led to large outbreaks in countries where the disease had not been previously reported. Genomic surveillance of the virus diversity is crucial in developing intervention strategies. Therefore, we have developed a user-friendly computational tool for rapidly classifying and assigning lineages of partial or whole genome sequences of the virus using the glycoprotein Gn/G2 gene within the M-segment. The computational method is presented both as a command line tool and a web application. A user can provide up to 4000 multi-FASTA sequences. Validation of the tool has been performed on a large dataset comprising partial and whole genome sequences obtained from public databases. The Rift Valley Virus typing tool was able to correctly classify all 129 RVFV sequences at species level with 100% specificity, sensitivity and accuracy. All the sequences in lineages A (n = 13), B (n = 1), C (n = 44), D (n = 1), E (n = 7), F (n = 1), I (n = 2), J (n = 1), M (n = 2), N (n = 13) and O (n = 2) were correctly classified at phylogenetic level, with accuracy, sensitivity and specificity of 100%. We further validated our tool using genomic data we obtained through sequencing following RVF outbreaks. The tool is useful in tracing the origin of outbreaks and supporting surveillance efforts.
Session 2: Virus-host interactions
Diverse anti-interferon strategies by members of the genus phlebovirus | Friedemann Weber
The genus Phlebovirus (order Bunyavirales, tri-segmented negative strand RNA genome) contains species covering a wide spectrum of virulence. Rift Valley fever virus (RVFV), for example, is highly pathogenic, whereas the Sandfly fever Sicilian virus (SFSV) displays an intermediate level of virulence. Although the importance of the mosquito-borne phleboviruses is increasingly recognized, we are only beginning to understand their mechanisms of pathogenicity.

A key virulence factor of phleboviruses is the non-structural protein NSs, an inhibitor of the antiviral type I interferon (IFN) system. Our group has identified the mechanisms by which the NSs proteins of both RVFV and SFSV (i) inhibit the transactivation of the IFN genes and (ii) abrogate the antiviral protein kinase R (PKR) pathway. For RVFV, the NSs was found to recruit several E3 ubiquitin ligases of the F-Box type in order to destroy the general host cell transcription factor TF-IIH as well as PKR, an antiviral mRNA translation inhibitor. For SFSV, by contrast, the NSs is occluding the DNA-binding domain of the IFN transcription factor IRF-3 to inhibit IFN induction, and NSs also binds and reprograms the translation initiation factor eIF2B to immunize the ribosomal machinery against PKR signaling.

Thus, our investigations have shown two surprisingly different IFN escape strategies by these related phleboviruses. While the highly virulent RVFV destroys key host factors of innate immunity, the more benign SFSV only sequesters them.

Staying below the radar and exploiting the host – A toolbox for studying RNA virus – host factor interactions | Andreas Gruber
As viruses require their host cell to reproduce, they have evolved various mechanisms to interact with host factors, including RNA binding proteins (RBPs). Previous studies have demonstrated that virus – host RBP interactions can have pro- or antiviral effects. Also, the sequestration of host RBPs by RNA virus genomes was reported to cause changes in host pre-mRNA splicing and polyadenylation as well as mRNA stability, which suggests that virus – host factor interactions can broadly impact the host cell. However, the incidence of such virus – host interactions and the RBP interactomes of RNA virus genomes are largely unknown. To facilitate the study of RNA virus – host factor interactions, we have developed a toolbox that identifies RBP binding motifs that are enriched or depleted in RNA viral genome sequences. Applying it to 197 single-stranded RNA (ssRNA) virus genomes, we provide a comprehensive overview of potential RNA virus – RBP interactions. Also, we have developed an approach that infers sequence motifs that explain global changes in host pre-mRNA splicing and/or polyadenylation. Our tools and analysis provide insights into the RNA virus – host RBP interaction landscape and aim to support future studies that will ultimately feed into better treatments.
Multi-omics reveals principles of gene regulation and pervasive non-productive transcription in the human cytomegalovirus genome | Christopher Jürges
For decades, HCMV was thought to express ~200 viral proteins during lytic infection. However, in recent years, systems biology approaches suggested a more complex gene expression, comprising over 7,000 distinct viral transcripts encoding over 1,000 viral ORFs that are expressed with distinct kinetics. These studies provided large amounts of data on transcription or translation, but the link between both fundamental processes in the viral genome and the underlying principles of viral gene regulation remained unclear. Here, we provide a unifying model which integrates these surprising and partially conflicting findings by temporally resolved profiling of stable viral transcripts and their kinetics using two TiSS-profiling approaches combined with metabolic RNA labelling and integrative analysis of our new and the previously published data. We generated a detailed, accurate and time-resolved map of transcriptional activity, mRNA levels, translation and protein levels for HCMV and provide tools for the scientific community to access and use this comprehensive data. These data show that HCMV expresses over 2,600 stable transcripts with distinct kinetics which are governed by core promoter elements. Integrative analysis of previously published PRO-seq, Ribo-seq and proteomics data revealed an even greater complexity for the temporal kinetics of viral proteins, and that this complexity emerges from the combined effects of translation of incoming virion-associated viral mRNAs, combined transcriptional output of multiple TSS expressed with distinct kinetics per viral ORF, activation of late-gene promoters before the onset of genome replication and subsequent signal amplification, and differences in viral RNA and protein stability. Most importantly, we identify pervasive transcription from the HCMV genome that does not result in stable mRNAs and consequently does not contribute to the viral translatome. This explains the vast amount of viral transcription initiation events previously identified and highlights a fundamental new cellular mechanism involved in herpesvirus gene expression.
Session 3: Viral Sequence analysis
Origins and implications of the quasispecies concept | Esteban Domingo
Viral quasispecies refers to the complex and dynamic collections of mutants present in individual samples of RNA (and many DNA) viruses. Mutant input is fueled by error rates during template copying that are nearly one million-fold larger than those exhibited by the replicative DNA polymerases of their host organisms. Discovered in the pre-nucleotide sequencing times, the extent of the complexity of mutant swarms in viral populations has been fully manifested with application of deep sequencing methodologies. Mutant ensembles are generated within individual infected cells, and then they become the substrate for further evolutionary events within hosts and between hosts. Mutant ensembles may behave as units of selection, and virus adaptation is presently viewed as the replacement of mutant subpopulations by others that are better fit to respond to an environmental change. Positive and negative selection are integrated with random drift prompted by bottleneck events within infected cells, organisms and during viral transmission. Quasispecies dynamics can be regarded as a paradigm of the pervasive diversity and complexity of the biosphere increasingly evidenced by meta-genome and single cell analyses.

Following an introduction to the origins of the theoretical and experimental quasispecies concepts, the presentation will describe results with hepatitis C virus (HCV) on how viral fitness can influence resistance to antiviral agents, and on new procedures to visualize HCV diversification based on deep sequencing data. Also, current views on differences between mutant spectra of SARS-CoV-2 and those of other RNA viruses will be presented. Finally, prospects for future developments in the quasispecies field will be outlined.

A guidance to store your virus sequence and knowledge | Muriel Ritsch
Currently, virus genome sequences are stored either in NCBI or specific databases, such as ViPR, the HIV database, or GISAID [Sharma et al., 2014; Kuiken et al., 2003]. These databases contain a fraction of errors, which can appear before submission (sample contamination or assembly mistakes), during submission (misclassification), or even years after submission (taxonomy adjustment).

NCBI and many other general databases do not reliably check whether all uploaded data are correct. Most new entries in these databases are compared by sequence similarity to existing ones, and the mistakes in the databases can cascade. Large-scale, downstream, and evolutionary analyses are hardly possible. Even with much effort and time, filtering true from false entries is not always possible. Good scientific research using these public virus genome databases is further complicated when the metadata or sequences are only partially correct, especially if one extrapolates the growth of viral data [Paez-Espino et al., 2016; Roux et al., 2018].

To prevent the problem of false-positive sequences in the databases, we propose a guideline for uploading sequences. Here, we present the most common mistakes made in NCBI and other databases and present some main steps that should be followed when uploading virus sequences. We further provide examples of how database entries not following these steps can lead to a false conclusion and even jeopardize complete studies.

We propose the usage of alignments and quality checks (ideally done by the database) to predict whether the entire sequence is correct. Such alignments should be built with other known viruses of the same taxon. Additionally, we tackle the problem of legal issues related to virus databases. We envision a future database containing an easy-to-use interface, quality check, a private workspace, and tools for assembly, alignment, and phylogeny analysis with SOPs in the field.

Exploring the dinucleotide composition of the Flaviviridae with DinuQ | Spyros Lytras
Distinct biases in dinucleotide representation are characteristic of many viral genomes. Some of these biases, for example CpG under-representation, are now attributable to host-specific immune mechanisms that select against these patterns. To advance systematic analysis of genomic dinucleotide compositions, we have developed a novel statistical framework for quantifying dinucleotide representation. The corrected synonymous dinucleotide usage (SDUc) accounts for the amino acid composition of a given coding sequence and allows for statistical assessment of the bias compared to the null expectation of each dinucleotide’s representation. We present a Python3 package, DinuQ, for calculating SDUc and other relevant metrics and an online tool for easy visualisation of results. We applied our framework to a set of Flaviviridae genomes, a diverse group of viruses infecting a wide range of both mammalian and insect hosts. The two dinucleotides that were under significant bias in most genomes were CpG and UpA. Mapping the dinucleotide representation onto the Flaviviridae phylogeny reveals multiple convergent shifts in CpG representation correlating with ancestral host switches, while shifts in UpA representation are less pronounced and seem to be clade- rather than host-specific. Further exploring this short genomic signature can give us unique insights into ancestral and recent interactions of viruses with their hosts.
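
To make the notion of dinucleotide under- or over-representation concrete, the sketch below computes the simple observed/expected odds ratio, i.e. the dinucleotide frequency divided by the product of its two single-base frequencies. This is not SDUc and does not use the DinuQ package: unlike SDUc, it ignores amino acid composition and codon position. The function name and the toy sequences are assumptions made for illustration.

```python
from collections import Counter


def dinucleotide_odds_ratio(seq, dinuc):
    """Observed/expected ratio for one dinucleotide.

    Values below 1 suggest under-representation, values above 1
    over-representation, relative to the base composition alone.
    """
    seq = seq.upper()
    dinuc = dinuc.upper()
    if len(seq) < 2 or len(dinuc) != 2:
        raise ValueError("need a sequence of length >= 2 and a 2-letter dinucleotide")

    base_counts = Counter(seq)
    # Count overlapping dinucleotides along the sequence.
    pair_counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))

    f_xy = pair_counts[dinuc] / (len(seq) - 1)
    f_x = base_counts[dinuc[0]] / len(seq)
    f_y = base_counts[dinuc[1]] / len(seq)
    if f_x == 0 or f_y == 0:
        return 0.0  # dinucleotide cannot occur if a base is absent
    return f_xy / (f_x * f_y)


# Toy sequences: the first is saturated with CpG, the second has none.
cpg_rich = dinucleotide_odds_ratio("CGCG", "CG")
cpg_absent = dinucleotide_odds_ratio("AATT", "CG")
```

For real genomes the input would be a full coding sequence, and a statistic such as SDUc is preferable because it separates dinucleotide bias from amino acid usage.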
Genotype-based classification of IAV to unravel reassortment candidates | Alexander Henoch
Influenza A Virus (IAV) is a segmented RNA virus responsible for global pandemics with millions of fatalities in the past. In the case of a co-infection with two different strains, complete segments can be exchanged between these viruses. This exchange is called reassortment and can lead to novel IAV variants and even zoonotic events. It is further known that reassortment is selective and not completely random. Due to the fast evolution and potential recombination of IAV, the human population must rely on annually changing vaccines. However, the mechanisms of reassortment are not entirely understood, complicating the preparation of seasonal vaccines, and decreasing their efficiency. Therefore, a deep understanding of the interactions of all eight segments with each other is needed. However, the current IAV classification into subtypes is based on the serological properties of the two surface protein-coding segments only and thus not sufficient for reassortment analyses.

Here we propose a fast and scalable method to cluster all sequences of IAV with high dimensional k-mer vectors and a novel clustering approach combining hierarchical and density-based methods. Overall, we clustered over 400,000 different IAV sequences without computationally expensive all-versus-all pairwise comparisons. Our genotype-based classification highly agrees with the serological classification for segments 4 and 6, established by the community, with even higher resolution. We further extend our classification to all other segments of IAV.
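
A k-mer vector of the kind mentioned above can be sketched as follows. This only illustrates the vectorization step, under assumptions that are not taken from the abstract: the value of k, the normalization to relative frequencies, and the use of cosine similarity to compare two vectors are all illustrative choices, and the clustering approach itself is not reproduced here.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_vector(seq, k=3, alphabet="ACGT"):
    """Relative frequency of every possible k-mer, in a fixed order.

    The vector has len(alphabet) ** k entries; k-mers containing
    characters outside the alphabet (e.g. N) are ignored.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[km] for km in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


def cosine_similarity(u, v):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy sequences standing in for two segment sequences.
vec_a = kmer_vector("ACGTACGT", k=2)
vec_b = kmer_vector("ACGTACGA", k=2)
similarity = cosine_similarity(vec_a, vec_b)
```

Because every sequence is mapped to a fixed-length vector, sequences can be fed to standard clustering methods without aligning each pair.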

For each IAV strain in the data set, we connect the clusters into which its respective segments have been grouped. Highly-connected clusters, therefore, describe sequence combinations that are often observed in viable viruses and potential reassortment candidates. By only considering our reassortment candidates, we decrease the complexity of possible segment to segment interactions by a considerable amount. Extensive analysis of the resulting interaction-patterns might then lead to novel insights into the reassortment process.

Friday, 25th March

Session 4: Virus identification and annotation
VirSorter2: a multi-classifier, expert-guided approach to detect diverse DNA and RNA viruses | Jiarong Guo
Viruses are a significant player in many biosphere and human ecosystems, but most signals remain “hidden” in metagenomic/metatranscriptomic sequence datasets due to the lack of universal gene markers and database representatives, and to insufficiently advanced identification tools.

Here, we introduce VirSorter2, a DNA and RNA virus identification tool that leverages genome-informed database advances across a collection of customized automatic classifiers to improve the accuracy and range of virus sequence detection. When benchmarked against genomes from both isolated and uncultivated viruses, VirSorter2 uniquely performed consistently with high accuracy (F1-score over 0.8) across viral diversity, while all other tools under-detected viruses outside of the group most represented in reference databases (i.e., those in the order Caudovirales). Among the tools evaluated, VirSorter2 was also uniquely able to minimize errors associated with atypical cellular sequences including eukaryotic genomes and plasmids. Finally, as the virosphere exploration unravels novel viral sequences, VirSorter2’s modular design makes it inherently able to expand to new types of viruses via the design of new classifiers to maintain maximal sensitivity and specificity.

With multi-classifier and modular design, VirSorter2 demonstrates higher overall accuracy across major viral groups and will advance our knowledge of virus evolution, diversity, and virus-microbe interaction in various ecosystems. Source code of VirSorter2 is freely available, and VirSorter2 is also available both on bioconda and as an iVirus app on CyVerse. To best serve the research community, we maintain a “live protocol” (dx.doi.org/10.17504/protocols.io.bwm5pc86) for using VirSorter2 for virus sequence identification, including curating less well-studied viruses and mobile genetic elements, and establishing bona fide virus-encoded auxiliary metabolic genes.

Evaluation of gene-calling programs for viral genome annotation | Enrique Gonzalez-Tortuero
The number of newly available viral genomes and metagenomes has increased exponentially due to the development of high throughput sequencing platforms and genome analysis tools. Genome annotation pipelines are mainly based on gene-calling software, which identifies genes independently of the sequence taxonomical background. Although gene-calling programs provide a rapid genome annotation, they can misidentify genes and start codons, perpetuating errors over time. This study evaluated the performance of multiple gene-calling programs for viral genome annotation against the complete RefSeq viral database. Prodigal and FragGeneScan were the most accurate programs for DNA and RNA viruses, respectively, according to the number of coding genes. When the coordinates of the coding genes were considered, Prodigal scored high for DNA viruses, while GeneMarkS generated the most reliable results for RNA viruses. Overall, the quality of the coordinates predicted for RNA viruses was poorer than for DNA viruses, suggesting the need for improved gene-calling programs to deal with RNA viruses. Moreover, none of the gene-calling programs reached 90% accuracy for annotation of DNA viruses. Manual curation should improve any automatic annotation, especially when the presence of genes is validated with wet-lab experiments. Our evaluation of the current gene-calling programs is expected to be helpful for the improvement of viral genome annotation pipelines and highlights the need for more expression data to improve the rigor of reference genomes.
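
One way to picture the coordinate-based comparison is to score each predicted gene against the reference annotation: an exact match shares strand, start and stop, whereas a prediction with the correct stop but a different start is a typical start-codon misidentification. The sketch below assumes this scoring scheme purely for illustration; the `Gene` layout and the function name `compare_calls` are hypothetical and are not taken from the study.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    start: int   # position of the first base of the start codon
    stop: int    # position of the last base of the stop codon
    strand: str  # "+" or "-"


def compare_calls(reference: list[Gene], predicted: list[Gene]) -> dict[str, int]:
    """Count exact matches, wrong starts, spurious calls and missed genes."""
    ref_exact = set(reference)
    ref_stops = {(g.stop, g.strand) for g in reference}
    counts = {"exact": 0, "wrong_start": 0, "spurious": 0, "missed": 0}
    for gene in predicted:
        if gene in ref_exact:
            counts["exact"] += 1
        elif (gene.stop, gene.strand) in ref_stops:
            # Same stop codon and strand, but a different start codon.
            counts["wrong_start"] += 1
        else:
            counts["spurious"] += 1
    # A reference gene is missed when no prediction shares its stop and strand.
    pred_stops = {(g.stop, g.strand) for g in predicted}
    counts["missed"] = sum(
        (g.stop, g.strand) not in pred_stops for g in reference
    )
    return counts


# Toy annotation, not real RefSeq data.
reference = [Gene(100, 400, "+"), Gene(900, 500, "-"), Gene(1000, 1300, "+")]
predicted = [Gene(100, 400, "+"), Gene(870, 500, "-"), Gene(2000, 2300, "+")]
summary = compare_calls(reference, predicted)
```

Counting genes alone would rate this toy prediction as perfect (three genes predicted, three annotated), which is why the coordinate-level view can rank programs differently.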
Session 5: Phages
Phages in the human gut: a taxonomist’s perspective | Evelien Adriaenssens
Bacteriophages have long been an underexplored component in microbiome studies, but no more. Interest in phages has boomed, and research groups all over the world have shown the importance of phages in a wide range of environments. Advances in sequencing technology and bioinformatics are now enabling us to reconstruct whole phage genomes and to investigate phage diversity at the strain level. But as a community, we are very fond of using families to group our phages into boxes, an approach that – with the current taxonomic framework and nomenclature – misses crucial information.
Here, I will discuss the recent advances in taxonomy and their implications for microbiome studies and show that ultimately, taxonomy is the language that allows us to understand each other’s research.
Dual identification of novel phage receptor-binding proteins based on protein domains and machine learning | Dimitri Boeckaerts
Recent progress in synthetic biology enables the precise engineering of phages and their specificities towards bacterial hosts. The modification or swapping of receptor-binding proteins (RBPs) between phages makes it possible to tune their narrow host specificity, which can avoid the need to discover and cultivate new phages in the lab. In addition, the ever-increasing amount of publicly available sequence data enables the rapid and automated identification of RBP sequences using probabilistic and predictive models. However, applying such models is hampered by the non-standardized or missing annotation of many phage proteins. Recently developed tools have started to bridge this gap but have not specifically focused on RBP sequences, for which many different annotations exist. We have developed two parallel approaches to overcome the complex identification of RBP sequences in publicly available phage genome data. Firstly, we have constructed a collection of both RBP-related Hidden Markov Models (HMMs) from the Pfam database and custom-developed HMMs to identify phage RBPs based on conserved protein domains. Secondly, we have developed a comprehensive processing pipeline to identify annotated RBP sequences in phage genomic data. These annotated RBP sequences were subsequently used to train an Extreme Gradient Boosting classifier that can accurately discriminate between phage RBPs and other phage proteins. Our classifier reaches a cross-validated F1 score of 88.9%, while our domain-based approach can iteratively improve itself by constructing additional HMMs from new predictions. This allows the identification of an increasing number of domains, which could be used as building blocks for protein engineering. In summary, our methods contribute to an increased abundance of RBP sequences available to phage engineering efforts.
We aim to publish this work in the ‘Virus Bioinformatics 2022’ issue of Viruses and open source our code and database for the research community to continue building upon.
Predicting viral capsid architectures from metagenomes | Antoni Luque
Most viruses protect their genome in capsids made of multiple copies of the same protein. These shells span two orders of magnitude in size and thousands of different architectures. However, the landscape and evolution of capsids across ecosystems remain poorly understood. The main challenge is obtaining empirical evidence of both the genomic and associated molecular structure of viruses from the environment. In this talk, I will outline my lab’s approach to tackling this problem for the case of icosahedral capsids, which are the most frequent in the virosphere. First, I will show how to use the generalized geometrical theory of icosahedral capsids in ChimeraX as a framework to quantify distinctive physical features in viruses from different structural lineages. Second, I will show how we combined biophysical models, bioinformatics, and statistical learning to predict the capsid architecture of tailed phages (HK97-fold structural lineage) from metagenomically assembled circular genomes and the major capsid protein sequence. This part of the talk will include a demo of how to use our open-source software and adapt it to other viral lineages. Third, I will share our findings analyzing tailed phages in various ecosystems. Our two main observations were the existence of putative mini-tailed-phages that have never been isolated and a relatively high presence of putative jumbo phages. Finally, I will discuss how this biophysical-computational approach opens a route to monitoring the ongoing evolution and selection of viral capsids across ecosystems.
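
As background for the geometrical framework mentioned above, the classic Caspar-Klug construction, which the generalized theory extends, indexes icosahedral capsids by a triangulation number T = h² + hk + k², with 60T quasi-equivalent subunit positions per shell. The sketch below covers only this classic case; it is not the lab's software, and the function names are illustrative.

```python
def triangulation_number(h: int, k: int) -> int:
    """Classic Caspar-Klug triangulation number T = h^2 + h*k + k^2."""
    if h < 0 or k < 0 or (h == 0 and k == 0):
        raise ValueError("h and k must be non-negative and not both zero")
    return h * h + h * k + k * k


def subunit_positions(h: int, k: int) -> int:
    """Number of quasi-equivalent subunit positions in the shell (60 * T)."""
    return 60 * triangulation_number(h, k)


# Smallest lattice steps and their T-numbers: (1,0) -> 1, (1,1) -> 3,
# (2,0) -> 4, (2,1) -> 7.
small_capsids = {(h, k): triangulation_number(h, k)
                 for h, k in [(1, 0), (1, 1), (2, 0), (2, 1)]}
```

Larger T-numbers correspond to larger shells, which is what links capsid architecture to the genome length a capsid can package.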
Session 6: Viral diversity
Ocean viruses: Patterns, processes, and paradigms on a planetary scale | Matthew Sullivan
Microbes have recently been recognized as driving the energy and nutrient transformations that fuel Earth’s ecosystems in soils, oceans and humans. Where studied, viruses appear to modulate these microbial impacts in ways ranging from mortality and nutrient recycling to extensive metabolic reprogramming during infection. As environmental virology strives to get a handle on the global virosphere (the diversity of viruses in nature), we face challenges to organize this ‘sequence space’ (create a sequence-based viral taxonomy), link these viruses to their natural hosts (who infects whom), and establish how virus populations are structured (ecological drivers) and impact natural ecosystems (their impacts). Here I will share current thinking on how to study viruses in complex communities and how these efforts are revealing new biology, with particular focus on the patterns, processes and paradigms emerging from studying the Tara Oceans global datasets. These advances in viral ecogenomics provide fundamental information critical for bringing viruses into ecosystem models, and the new capabilities are empowering a new generation of eco-systems biologists.
Community typing as a way to explore virome compositional changes in IBD patients | Daan Jansen
Inflammatory bowel diseases (IBD) are a group of chronic inflammatory diseases of the gut. The pathophysiology is unknown; however, it is thought to result from an inappropriate immune reaction to the commensal gut microbiota. Community-typing is a common practice in bacteriome analysis allowing for the stratification of individuals based on their gut microbiome (e.g., ‘enterotyping’). A viral counterpart of these enterotypes might allow stratification of individuals based on their gut virome. The aim of the present study is to use community-typing as a tool to explore virome compositional changes in IBD patients. Fecal samples were selected from 181 patients undergoing immunomodulatory therapy, and a baseline (pre-intervention) and a primary endpoint (post-intervention) sample were collected for each patient. Viral metagenomics and deep sequencing were performed following viral enrichment with the NetoVIR protocol. We were able to condense the gut virota into 2 community-types, cluster 1 and 2. Cluster 1 showed a low alpha-diversity and a high relative abundance of Caudoviricetes [non-CrAss] phages. Cluster 2 showed a high alpha-diversity and high relative abundance of Caudoviricetes [CrAss] and Malgrandaviricetes phages. Distance-based redundancy analysis allowed us to determine the metadata affecting the virome composition. The composition was explained by several factors: patients’ individuality (75.8%), disease location (based on Montreal classification; bacteriome analysis in revision in Gastroenterology) (1.4%), age (0.5%) and moisture (0.3%). Interestingly, virome composition was better explained by disease location than by diagnosis (Ulcerative colitis/Crohn’s disease). Moreover, virome composition was associated with therapeutic response (0.46%) in post-intervention samples. Next, we associated community-types with explanatory metadata, and found a high percentage of samples of responding patients in cluster 2.
These findings suggest that viral community-typing allows for stratification of IBD patients based on their gut virome composition and might be a valuable tool to better understand IBD subtypes or as a potential future biomarker.
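
The abstract does not state which alpha-diversity metric was used. As an illustration only, the Shannon index is one common choice, and the sketch below computes it from per-taxon read counts. The counts are invented to mimic a low-diversity and a high-diversity community and are not data from this study.

```python
import math


def shannon_index(counts: list[int]) -> float:
    """Shannon diversity H = -sum(p_i * ln(p_i)) over taxa with non-zero counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("at least one positive count is required")
    return -sum(
        (c / total) * math.log(c / total) for c in counts if c > 0
    )


# Hypothetical viral read counts per taxon in two samples.
dominated_sample = [970, 10, 10, 10]   # one taxon dominates: low diversity
even_sample = [250, 250, 250, 250]     # evenly spread: high diversity
low_h = shannon_index(dominated_sample)
high_h = shannon_index(even_sample)
```

For a fixed number of taxa the index peaks when abundances are even, so a sample dominated by a single phage group scores lower than one in which several groups are equally abundant.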
vAMPirus: An automated virus amplicon sequence analysis program to support investigations of viral community ecology | Alex Veglia
Amplicon sequencing is an effective and relatively economical approach for studying virus population dynamics in situ or ex situ. Because viruses lack a universal marker gene, standardized approaches for clustering environmental virus sequence diversity are uncommon. Therefore, the field of environmental virology would benefit from a bioinformatics tool developed to support the standardization of comprehensive and informed analyses for non-model/novel virus amplicon datasets. We present vAMPirus, an automated and complete DNA or RNA virus amplicon sequencing analysis program. The program takes raw read libraries as input and produces HTML reports detailing results (e.g., data quality, relative abundance plots, community diversity metrics) with interactive figures and tables for the user to review. vAMPirus is integrated with the Nextflow workflow manager, allowing users to easily scale and standardize analyses across datasets. We showcase the utility of vAMPirus by leveraging an RNA virus (dinoRNAV) amplicon sequencing dataset. dinoRNAVs are hypothesized to infect symbiotic dinoflagellates (Family Symbiodiniaceae) that inhabit coral holobionts. While dinoRNAVs have been identified in diverse coral species, we lack insight into how dinoRNAVs are transmitted between coral colonies. Recent work demonstrates that fish predators of corals (corallivores) disperse high abundances of live Symbiodiniaceae cells across reefs in their feces. To test the extent to which fish feces are a reservoir of dinoRNAV diversity, targeted gene PCR and amplicon sequencing of the major capsid protein gene were used to investigate the presence and diversity of dinoRNAVs associated with coral and coral-eating fishes. vAMPirus analyses revealed that coral-eating fish feces are a significant reservoir of dinoRNAV diversity on the reef, with higher dinoRNAV richness relative to coral tissues (Wilcoxon Rank Sum Test, p < 0.05).
Overall, we consider vAMPirus a useful and accessible analytical tool that will make future studies more comparable and thus enhance our ability to decipher the global virome and its impact in diverse systems.