Proteome and RNAome of RNA viruses

Computational methods for identifying functional RNA structure features

Irmtraud Meyer, Max Delbrück Center for Molecular Medicine, Bioinformatics of RNA Structure and Transcriptome Regulation, Berlin, Germany
Thursday, October 8th, 11.30 am

Most viral genomes share one common functional constraint and that is to keep their genome rather short for the sake of efficient packaging and swift replication. On the other hand, different viruses have to fulfil a range of diverse functional roles that reflect the variety of in-vivo constraints they face throughout their different stages of infection in different hosts. All viral genomes thus also have a pressing need to encode a variety of functional information that they require at different life stages into their rather short genomes. The emerging view, however, is that the functional range of any virus is not only determined by its encoded protein-coding genes, but also by other types of functional features that are harder to discern, e.g. functional RNA structure features.

Due to the degeneracy of the genetic code, any protein-coding region can also encode overlapping RNA structure information. This even applies to overlapping open reading frames such as those encountered in HIV. As the Meyer group has shown in the past, however, it is vital to correctly capture the known protein context when trying to computationally detect any (partially or completely) overlapping RNA structure features. This can only be achieved via dedicated (and typically complex) models of RNA structure features that are able to integrate prior information on protein-coding regions into a joint, mathematically principled predictive framework. This approach has recently allowed the group to computationally identify local RNA structure features that are key for regulating functionally important alternative splicing in influenza A.

On the computational side, RNA structure prediction, especially in the context of pathogens, is further complicated by the fact (1) that different life stages of the pathogen require different functional signals and (2) that the expression of any particular RNA structure feature in vivo typically depends strongly on its complex molecular context, e.g. RNA-binding proteins, other trans interaction partners and also the kinetics of RNA sequence synthesis. To tackle these conceptual challenges, the Meyer group has developed computational methods that make it possible to identify and visualise functionally relevant RNA structures beyond the one-sequence-one-structure dogma. These methods focus on identifying the RNA structure(s) that have been conserved throughout well-chosen periods of evolution rather than predicting the thermodynamically most stable RNA structure, which may only be relevant in a more artificial in-vitro setting.

The last few years have seen an exciting range of new experimental high-throughput methods for probing the RNA structurome and RNA-RNA interactome by detecting so-called duplexes in vivo. The raw data generated by these methods currently require fairly sophisticated computational analyses for pre-processing and interpretation in terms of distinct RNA structures and trans RNA-RNA interactions and also come with a range of experimental biases that need to be accounted for. In the near future, especially when combined with established SHAPE-probing methods, these novel methods should give us significant new biological insight into functionally relevant RNA structures and trans RNA-RNA interactions in the context of pathogen-host interactions. The Meyer group has recently updated its well-known R-Chie visualisation web-server to now cater not only for RNA structures, but also for trans RNA-RNA interactions and genome interactions.

Expanding Diversity and Molecular Biology of RNA Viruses

Ingrida Olendraite, University of Cambridge, Department of Pathology, Cambridge, United Kingdom
Thursday, October 8th, 12.45 pm

RNA viruses are very diverse, have high mutational rates, and employ an enormous variety of molecular strategies to transcribe and translate their genes. While human-infecting viruses have been well-characterized, such viruses make up only a small proportion of natural RNA virus diversity. To better understand the evolution of RNA viruses, and the molecular mechanisms that they employ, we are bioinformatically exploring the RNA viromes of diverse host organisms. We have been exploiting Hidden Markov Model profile-based searches to identify viral RNA-dependent RNA polymerases (RdRps) in transcriptomic datasets. The polymerase is the only protein which is common to all RNA viruses. Therefore, we can at least partly uncover global RNA virus diversity by finding their RdRp sequences [1]. When very divergent viruses are identified, we can propose new virus families and potentially make predictions about their gene expression mechanisms [2]. I will discuss the identification and analysis of over 10,000 RdRps (half of which are new). These RdRps reveal family-level evolutionary relationships, and enriched diversity within numerous virus families. I will also discuss various novel and unusual viral sequences, and the potential host diversity of new and some existing RNA virus groups.

[1] Wolf et al. (2018) Origins and evolution of the global RNA virome. PMID 30482837.
[2] Olendraite et al. (2017) Polycipiviridae: a proposed new family of polycistronic picorna-like RNA viruses. PMID 28857036.

Using ribosome profiling (RiboSeq) as a tool to analyse virus gene expression

Georgia Cook, University of Cambridge, Pathology, Cambridge, United Kingdom
Thursday, October 8th, 01.05 pm

Ribosome profiling (RiboSeq) is a next-generation sequencing-based technique which allows the positions of translating ribosomes to be mapped to a genome with sub-codon precision.

We carried out this analysis on samples derived from an infection of MARC-145 cells with the economically important arterivirus, porcine reproductive and respiratory syndrome virus (PRRSV), harvested over a timecourse of infection. The PRRSV genome is unusual in that it contains two programmed ribosomal frameshift (PRF) sites, which promote slippage of a proportion of translating ribosomes by 1 or 2 nt backwards, after which decoding continues in alternative reading frames. The distance between the 5′ end of RiboSeq reads and the first nucleotide of the ribosomal P site was determined, allowing reads to be mapped to the genome with sub-codon resolution and thus the frame of translation inferred. This permitted visualisation of changes in reading frame downstream of both PRF sites on the PRRSV genome, which became more conspicuous when read counts on the WT viral genome were normalised by those of a frameshift-defective mutant.
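The frame assignment described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' pipeline: the 12 nt offset, the function names and the toy coordinates are all hypothetical, and in practice the offset is estimated per library and per read length.

```python
from collections import Counter

def read_frame(five_prime_pos, p_site_offset, orf_start):
    """Return the reading frame (0, 1 or 2) of one read relative to an ORF.

    five_prime_pos: 0-based genome coordinate of the read's 5' end.
    p_site_offset:  distance from the 5' end to the first P-site nucleotide
                    (assumed value; determined empirically in a real analysis).
    orf_start:      0-based coordinate of the first nucleotide of the ORF.
    """
    p_site = five_prime_pos + p_site_offset
    return (p_site - orf_start) % 3

def frame_fractions(five_prime_positions, p_site_offset, orf_start):
    """Fraction of reads assigned to frames 0, 1 and 2 within a region."""
    counts = Counter(
        read_frame(pos, p_site_offset, orf_start) for pos in five_prime_positions
    )
    total = sum(counts.values())
    if total == 0:
        return [0.0, 0.0, 0.0]
    return [counts.get(frame, 0) / total for frame in range(3)]

# Toy example (invented coordinates): three reads in the original frame and
# one read whose P site sits 1 nt "behind" the original frame.
fractions = frame_fractions([88, 91, 94, 120], p_site_offset=12, orf_start=100)
```

Comparing such per-frame fractions in windows upstream and downstream of a PRF site is one way to make a change of reading frame visible.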

Analysis of the cardiovirus Theiler’s murine encephalomyelitis virus (TMEV), which also utilises PRF to regulate viral gene expression, demonstrated a decreased read density in the region downstream of the PRF site, as a result of ribosomes terminating translation upon encountering an early stop codon in the alternative reading frame. Comparison of read density upstream and downstream of the frameshift site revealed that the frameshift efficiency is ~85%, the most efficient known natural example of -1 PRF.
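The density comparison can be turned into a number with a simple ratio. The sketch below assumes that ribosomes which frameshift terminate shortly after the site and therefore no longer contribute to downstream read density, and that the two regions are otherwise comparable. The function names and read counts are hypothetical and are not data from the study.

```python
def mean_density(read_counts):
    """Mean number of reads per position over a region."""
    if not read_counts:
        raise ValueError("region must contain at least one position")
    return sum(read_counts) / len(read_counts)

def frameshift_efficiency(upstream_density, downstream_density):
    """Estimate the fraction of ribosomes that frameshift.

    Assumes frameshifted ribosomes terminate soon after the site, so the
    downstream density reflects only ribosomes that did NOT frameshift.
    """
    if upstream_density <= 0:
        raise ValueError("upstream density must be positive")
    return 1.0 - downstream_density / upstream_density

# Invented per-position read counts, for illustration only.
upstream = mean_density([210, 190, 200, 200])   # 200 reads per position
downstream = mean_density([28, 32, 30, 30])     # 30 reads per position
efficiency = frameshift_efficiency(upstream, downstream)
```

A real analysis would also normalise for library size and exclude positions close to the frameshift site and to start and stop codons, where ribosome density is atypical.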

Calculations for PRRSV revealed that -1 PRF efficiency increases as infection progresses, apparently overturning the common assumption that PRF stimulated by RNA secondary structures occurs at a fixed efficiency.

Ribosome profiling can also reveal novel features of the viral translatome. For example, we discovered a short but conserved and highly expressed upstream ORF in the 5′UTR of the PRRSV genome.

Authors: Georgia M. Cook, Katherine Brown, Pengcheng Shang, Yanhua Li, Sawsan Napthine, Adam Dinan, Ying Fang, Andrew E. Firth, Ian Brierley

Viral metagenomics and ecology

Viral ecogenomics: exploring viral diversity and virus-host interactions from metagenomes

Simon Roux, DOE Joint Genome Institute, Lawrence Berkeley National Laboratory
Thursday, October 8th, 05.00 pm

Microbes are recognized to play key roles in all ecosystems, driving nutrient and energy transfers, and directly influencing human health and disease. While microbes are the principal and most-studied components of microbiomes, all microbial processes are strongly constrained and altered by viruses. In terms of numbers alone, virus-like particles seemingly outnumber microbial cells in every ecosystem. The world’s oceans, for example, harbor an estimated 10³⁰ virus particles, with an estimated 1 cell out of 3 infected at any given time. The most intuitive impact of these many viral infections is virus-induced mortality of microbial cells, which can trigger large-scale reshuffling of microbial communities. However, viruses can also modify their host cell metabolism and alter host cell fitness, including during latent and/or chronic infections. Understanding these different virus-host interactions and their associated ecological and evolutionary drivers is thus critical to fully comprehend microbiome dynamics.

Thus far, technical challenges have limited our ability to even catalog the global virosphere, leading to the denomination of these seen-but-uncharacterized viruses as “dark matter of the biological universe”. In the past three years alone, however, metagenomic approaches have increased viral genome databases more than 200-fold and enabled comparative genomics studies which have already revealed ≥ 900 new candidate viral genera. While still incomplete and not yet evenly representing the true extent of viral diversity in nature, this comprehensive catalog of uncultivated viral genomes represents an invaluable resource to evaluate ecological and evolutionary patterns in the viral world. In addition, exploring the functional potential of uncultivated viruses can suggest new putative mechanisms by which viruses can manipulate microbial processes.

Our current work involves the development of new approaches to maximize the recovery of viral genomes from metagenomes and to make these bioinformatic tools available to the broader community of researchers. As part of the growing viral ecogenomics community, we recently outlined the current promises and pitfalls of these analyses and established the first standards to report viral genomes assembled from metagenomes (“Minimum Information about an Uncultivated Virus Genome (MIUViG)”). We also recently demonstrated how customized machine-learning-based techniques could reveal an extensive viral diversity “hidden” in publicly available genomic and metagenomic datasets, by vastly expanding the genome diversity and host range of a family of bacteriophages (Inoviridae). Finally, we are currently exploring the use of targeted metagenomics and time-series analysis to understand the eco-evolutionary drivers and constraints on virus-host dynamics in nature. We envision that a full viral ecogenomics toolkit will soon empower researchers to scrutinize viral communities and virus-host interactions with an unprecedented level of detail and resolution, enabling us to revisit long-standing biological questions and possibly inspiring new technologies for microbiome manipulation.

Unsupervised clustering of nanopore reads produces thousands of complete phage genomes from marine samples

John Beaulaurier, Oxford Nanopore Technologies, Applications, San Francisco, United States
Thursday, October 8th, 04.40 pm

Phages are the most abundant biological entity on Earth and play key roles in host ecology, evolution, and horizontal gene transfer. A growing body of research suggests that phages, especially in marine environments, represent a massive reservoir of unexplored genetic diversity. Study of these phages was previously limited to those whose infected hosts could be grown in culture. The application of metagenomic sequencing and assembly to uncultured marine microbes has made great strides toward uncovering the diversity of marine phages, but important challenges remain. The inherent genetic complexity of phage populations poses technical difficulties for recovering complete phage genomes from natural assemblages. Specifically, sequence repeats, microdiversity, and shared gene content across multiple genomes often result in phage populations that are represented by thousands of short and fragmented contigs.

To address these challenges, we developed an assembly-free nanopore sequencing approach enabling recovery of complete dsDNA tailed phage genome sequences from marine samples. Full-length nanopore reads were decomposed into vectors of k-mer counts and dimensionality reduction enabled automated bin calling. Following an alignment-based bin refinement step, our method produced thousands of polished, high-quality draft genomes that were not recovered using short-read assembly. Additionally, by resolving the terminal repeat sequences of these linear genomes, our analyses discriminated between populations whose genomes had identical direct terminal repeats versus those with circularly permuted terminal repeats. This distinction provides new insights into native phage reproduction and genome packaging strategies. Unexpectedly, novel concatemeric DNA sequences were discovered whose repeat structures, gene content, and concatemer lengths suggest they are phage-inducible chromosomal islands. These mobile elements are packaged as concatemers in phage particles, with lengths that match the size ranges of co-occurring phage genomes. Our novel phage sequencing and analysis strategy can provide information about the genome structures, population biology, and ecology of naturally occurring phages and phage parasites.
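The first step of this approach, turning each read into a vector of k-mer counts, can be sketched as follows. This is a minimal illustration rather than the published method: the value of k, the function name and the toy read are assumptions, and the dimensionality reduction and bin calling that follow are not shown.

```python
from itertools import product

def kmer_count_vector(sequence, k=5):
    """Count every DNA k-mer in a read and return the counts as a fixed-length vector.

    The vector has 4**k entries, ordered lexicographically over A, C, G, T.
    k-mers containing other characters (e.g. N) are skipped.
    """
    index = {"".join(kmer): i for i, kmer in enumerate(product("ACGT", repeat=k))}
    vector = [0] * len(index)
    seq = sequence.upper()
    for start in range(len(seq) - k + 1):
        position = index.get(seq[start:start + k])
        if position is not None:
            vector[position] += 1
    return vector

# Toy read; k=2 keeps the vector small enough to inspect by eye.
vector = kmer_count_vector("ACGTAC", k=2)
```

Because every read maps to a vector of the same length, reads from the same genome are expected to lie close together in this space, which is what makes embedding and automated binning possible without assembly.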

Metagenomics analyses of West Nile Virus Outbreak samples from Germany

Pauline Dianne Santos, Friedrich Loeffler Institute, Institute of Diagnostic Virology, Greifswald Insel Riems, Germany
Thursday, October 8th, 03.00 pm

Genetic analyses of disease outbreaks with known etiology are focused mostly on full-genome sequencing of the known pathogen, while the search for possible co-infecting pathogens or symbionts is commonly overlooked. Metagenomic workflows based on next-generation sequencing (NGS) enable the acquisition of full-genome sequences of the pathogen of interest as well as the execution of metagenomic analyses from the same sequence dataset. Our workflow was applied to selected West Nile virus (WNV) outbreak samples from birds in Germany from 2018-2019. Very few of the samples were also positive for Usutu virus (UsuV) RNA by PCR. WNV full genomes were successfully acquired in 30 out of 34 outbreak samples. The Reliable Information Extraction from Metagenomic Sequence datasets (RIEMS) analysis workflow also detected the presence of UsuV RNA in two bird samples. Furthermore, it detected sequence reads closely related to Umatilla virus species (69-92% nucleotide identity) in two great tit (Parus major) samples, and to Orthobunyavirus species (67-89% nucleotide identity) in two snowy owl (Bubo scandiacus) samples. Complete coding sequences of the two viruses revealed that they are putative novel viruses. Apart from Flaviviridae, 23 other virus families, including bacteriophages, were found within the overall datasets. A few reads annotated as Retroviridae and Arenaviridae occurred frequently in several samples. However, these annotations must be interpreted carefully, since these reads may come from contaminants in the field and laboratory (reagents or handling) or from erroneous annotations in the database. This study shows the importance of direct sequencing of outbreak samples, not only for full-genome sequence acquisition but also for unravelling putative new viruses. Phylogenetic analyses, RT-qPCR-based screening, and isolation of these viruses, as well as confirmation of other annotated viruses, are ongoing.

viromeBrowser: A Shiny app for browsing virome sequencing analysis results

David Nieuwenhuijse, ErasmusMC, Viroscience, Rotterdam, Netherlands
Thursday, October 8th, 03.20 pm

Highly complex virome sequencing data are difficult to visualize and unpack for people without programming experience. After raw sequencing data have been processed by next-generation sequencing (NGS) workflows, the results usually consist of contigs in FASTA format coupled to an annotation file linking the contigs to a reference sequence or taxonomic identifier. The next step in the analysis is to visually inspect the annotations, filter out any misannotations by hand, and extract sequences of interest that can be used in subsequent analyses.

Our interactive tool facilitates browsing through and visualization of contig annotations and metadata from multiple samples, allows filtering of annotations by custom thresholds and makes it easy for users to extract the sequences of interest from a contig file. Predefined thresholds are provided to explore annotation quality for different downstream uses (detection, complete genome isolation, virus discovery) and metadata are used to make hypothesis-based selections for exploration of the data. The app facilitates browsing of annotations across multiple files using an interactive heatmap with a link to the metadata. Various annotation quality thresholds can be set to filter contigs from the annotation files. Further inspection of selected contigs can be done in the form of automatic open reading frame (ORF) detection. Based on metadata and other selection criteria, filtered contigs and/or predicted ORFs can be saved in nucleotide or amino acid format for further analysis.
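The filter-then-extract step described above can be illustrated generically. The sketch below is not viromeBrowser code (the app itself is written in R Shiny). The field names (`contig_id`, `identity`, `length`, `evalue`), the threshold values and the toy records are all hypothetical stand-ins for whatever columns an annotation file actually provides.

```python
def filter_annotations(annotations, min_identity=0.0, min_length=0,
                       max_evalue=float("inf")):
    """Keep annotation records that pass every user-defined threshold."""
    return [
        record for record in annotations
        if record["identity"] >= min_identity
        and record["length"] >= min_length
        and record["evalue"] <= max_evalue
    ]

def extract_sequences(contigs, kept_annotations):
    """Return only the contig sequences referenced by the kept annotations."""
    wanted = {record["contig_id"] for record in kept_annotations}
    return {cid: seq for cid, seq in contigs.items() if cid in wanted}

# Hypothetical annotation table and contig file contents.
annotations = [
    {"contig_id": "c1", "identity": 98.5, "length": 900, "evalue": 1e-50},
    {"contig_id": "c2", "identity": 71.0, "length": 150, "evalue": 1e-3},
    {"contig_id": "c3", "identity": 92.0, "length": 400, "evalue": 1e-20},
]
contigs = {"c1": "ATGAAACCC", "c2": "GGGTTT", "c3": "ATGCCCGGG"}

# Example of a strict preset, e.g. for detection of known viruses.
kept = filter_annotations(annotations, min_identity=90.0, min_length=300,
                          max_evalue=1e-10)
selected = extract_sequences(contigs, kept)
```

Looser thresholds would suit virus discovery, where divergent hits are of interest, while stricter ones suit detection of known viruses.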

The viromeBrowser enables scientists with no programming experience to interactively browse through virome sequence data analysis results and flexibly set thresholds depending on their needs. Another benefit of the viromeBrowser is that metadata can be added to the sequences. The viromeBrowser is written in the open-source R Shiny framework, making it free and easy for other Shiny developers to expand.

Viral infections and immunology

Machine learning approach to predicting host taxonomic information from viral genomes: combining feature representations

Francesca Young, University of Glasgow, MRC-University of Glasgow Centre for Virus Research, Glasgow, United Kingdom
Friday, October 9th, 03.15 pm

Gaining knowledge of the host species of newly identified viruses is an important challenge, whether it is identifying the source of a newly emerged pathogen or understanding the impact that bacteriophage-host relationships have within microbiomes. The majority of viruses identified through metagenomics lack taxonomic information about their host species. To address this gap in our knowledge, there is a need for fast, accurate computational methods for assigning putative host taxonomic information. Machine learning offers an ideal reference- and alignment-free approach, but to maximise prediction accuracy the viral genomes need to be represented in a format that makes the discriminative information available to the machine learning algorithm. This study is based on the premise that the host-specific signature embedded in viral genomes by the process of virus-host coevolution will be reflected in different levels of biological information. Our goal was to investigate the predictive potential of features generated from different levels of viral genome representation: nucleotide, amino acid, amino acid properties and protein domains. We compiled over a hundred binary datasets of infecting/non-infecting viruses at all taxonomic ranks of host. Twenty feature sets were generated from these viral genomes by extracting k-mer compositions at different levels of sequence representation. SVM classifiers were trained and tested to compare the predictive capacity of each of these feature sets for each dataset. Our results demonstrate that the accuracy of virus-host prediction can be improved by combining kernels from a broader range of features that encapsulate the multiple layers of information held within viral genomes. This will enable researchers to make higher-confidence assignments of host information for the growing numbers of metagenomically derived viruses and, for example, to potentially identify the reservoir source of a spillover event.
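The idea of combining kernels from several feature representations can be sketched as follows. The abstract does not say which kernel or weighting was used, so the choices here are assumptions made for illustration: a cosine kernel on k-mer frequency profiles, an unweighted average of the per-representation kernels, and invented toy sequences.

```python
import math
from collections import Counter

def kmer_profile(sequence, k):
    """Relative k-mer frequencies of a nucleotide or amino acid sequence."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}

def cosine_kernel(p, q):
    """Cosine similarity between two sparse k-mer profiles."""
    dot = sum(value * q.get(kmer, 0.0) for kmer, value in p.items())
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

def kernel_matrix(profiles):
    """Pairwise kernel values for one feature representation."""
    return [[cosine_kernel(p, q) for q in profiles] for p in profiles]

def combine_kernels(matrices):
    """Unweighted average of kernel matrices (an assumed combination rule)."""
    n = len(matrices[0])
    return [[sum(m[i][j] for m in matrices) / len(matrices) for j in range(n)]
            for i in range(n)]

# Toy genomes: nucleotide sequences and their (hypothetical) protein products.
nucleotide = ["ATGGCGTTTAAA", "ATGGCGTTCAAA", "GGGCCCGGGCCC"]
protein = ["MAFK", "MAFK", "GPGP"]

combined = combine_kernels([
    kernel_matrix([kmer_profile(s, 3) for s in nucleotide]),
    kernel_matrix([kmer_profile(s, 2) for s in protein]),
])
```

An average of valid kernels is itself a valid kernel, so a matrix like `combined` can be passed to any SVM implementation that accepts precomputed kernels, for example scikit-learn's `SVC(kernel="precomputed")`.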

Single cell molecular dynamics in mice infected with West Nile virus

Neta Zuckerman, Sheba Hospital, National Virology Laboratory, Israel Ministry of Health, Ramat Gan, Israel
Friday, October 9th, 03.55 pm

West Nile virus (WNV) is a vector-borne neurotropic flavivirus spread by mosquitos and birds. WNV can lead to encephalitis or meningitis, causing permanent damage to the central nervous system or death in ~1% of infected individuals. The virus infects ~10,000 individuals annually in Israel, resulting in ~80 hospitalizations due to neurological complications and ~10 deaths. Despite its importance to public health, studies investigating WNV infection in the brain are limited and do not include a thorough investigation of in-vivo brain cell dynamics following infection. In this study, we utilized WNV as a neurotropic prototype virus to systematically characterize the response of brain cell populations in vivo to a neuronal infection at the single-cell level. Mice were infected with WNV, brains were harvested and dissociated into single cells at different time points following infection, and 10x Genomics single-cell RNA sequencing was performed. Cell types were identified and differential gene expression analysis was carried out using the Seurat R package. Results show changes in cell type composition during infection and up-regulation of interferon and antigen presentation pathways in all cell types as infection progresses. In addition, differences in response were observed between mouse replicates. Inference-based analysis of bystander vs. infected cells, based on genes up-regulated in WNV-infected cells, suggests that infected cells up-regulate antiviral, antigen presentation and toxicity pathways while bystander cells up-regulate maintenance pathways. Results will be validated via qRT-PCR in combination with protein assay analyses. These discoveries will set the basis for targeting and/or attenuating the clinical manifestations caused by infection with WNV and other neurotropic viruses.

Virus evolution and classification

Leveraging high-throughput sequencing data to investigate viral diversity

Niko Beerenwinkel, ETH Zürich, Department of Biosystems Science and Engineering, Basel, Switzerland; SIB Swiss Institute of Bioinformatics, Switzerland
Friday, October 9th, 10.00 am

High-throughput sequencing (HTS) technologies are becoming increasingly relevant not only in viral genomics but also in clinical surveillance and diagnostics. These technologies facilitate the assessment of the genetic diversity in intra-host virus populations, which impacts pathogenesis, virulence, and disease progression. However, there are two major hurdles in analyzing viral diversity. First, the existence of technical errors confounds the identification of true biological variants and requires the usage of sophisticated statistical models. Second, the large data volumes represent a computational challenge calling for highly performant software. To support HTS-based viral genomics, we develop a bioinformatics pipeline called V-pipe. The pipeline integrates various computational tools for the automated analysis of viral high-throughput sequencing data. An important component of our workflow is the alignment of HTS reads. We developed a novel read mapper called ngshmmalign, tailored to small and highly diverse genomes. The tool uses profile hidden Markov models enabling us to account for position-specific structural variants that are commonly found in the envelope proteins of viruses such as HIV-1 and HCV.

To provide the flexibility required for adapting to the continuously changing technology landscape, V-pipe offers a modular and extensible framework. Users can design custom workflows with the available tools or introduce their own methods. In addition, V-pipe provides a benchmarking suite, enabling the automated testing of different workflows in a standardized environment.

V-pipe uses the workflow management system Snakemake, which enables the execution of the workflow components in local environments as well as on high-performance computing clusters. To ease the deployment of our tool, we provide conda environments for automatically downloading and installing the required tools. V-pipe is freely available for download from our project repository, https://github.com/cbg-ethz/V-pipe.

Parallel and scalable workflow for the identification and analysis of Phages in sequencing data

Mike Marquet, University Hospital Jena, Jena, Germany
Friday, October 9th, 10.30 am

Phages will be increasingly used as platforms for antigen display, in pathogen detection, as vaccines, and as an alternative for the treatment of multiresistant bacteria. Long-read sequencing technologies, such as nanopore sequencing, allow complete phage genomes to be sequenced at any time with comparatively low investment. Sequencing data, in general, allow us to study the occurrence, spread and type of bacteriophages, which in turn increases the demand for automated pipelines. Numerous phage prediction tools have been introduced. They differ in their dependencies, programming language, usability, output data and more, which makes them considerably harder for a broader audience to use.

The tool “What-the-Phage” (WtP) is written in Nextflow and utilizes Docker containers for simple workflow execution in any Linux environment. All containers are written, tested and stored on hub.docker.com/u/multifractal. WtP uses these containers automatically, which in turn means that the installation of specific tools is no longer needed. All dependencies and databases are downloaded automatically. WtP is freely available at github.com/replikation/What_the_Phage, which includes a simple installation routine for Nextflow/Docker.

We established a reproducible, scalable and easy-to-use workflow for phage identification and analysis. Our tool currently combines six established phage identification tools: Virfinder, PPR-Meta, Virsorter, Deepvirfinder, Metaphinder, and MARVEL. WtP analyses and summarizes the results gathered from sequencing data or direct nanopore raw reads. For this, the results for each sample are combined into heatmaps for convenient interpretation. Moreover, multiple samples are computed in parallel, and if fewer hardware resources are available, WtP automatically reduces the degree of parallelization.

Our ongoing effort is to incorporate additional identification tools, more downstream analysis, and visual phage annotation. WtP is a highly robust and stable pipeline for the identification and analysis of phages which can easily handle both single and multi-sample inputs.

Re-assessing the diversity of negative strand RNA viruses in insects

Sofia Paraskevopoulou, Charité-Universitätsmedizin Berlin, Institute of Virology, Berlin, Germany
Friday, October 9th, 10.50 am

The spectrum of viruses in insects is important for subjects as diverse as public health, veterinary medicine, food production, and biodiversity conservation. The traditional interest in vector-borne diseases of humans and livestock has drawn the attention of virus studies to hematophagous insect species. However, these represent only a tiny fraction of the broad diversity of Hexapoda, the most speciose group of animals. In this talk, I will present results on computational assessment of the diversity of negative strand RNA viruses within the largest and most representative collection of insect transcriptomes, from samples representing all 34 extant orders of Hexapoda and several outgroups, altogether representing 1243 species. Based on profile hidden Markov models, 488 viral RNA-directed RNA polymerase sequences were detected, with similarity to negative strand RNA viruses. These were identified in individuals of 324 arthropod species, showing similarity to genomes of viruses classified in Bunyavirales, Articulavirales, and several orders within Haploviricotina. Coding-complete genomes or nearly complete subgenomic assemblies were obtained in 61 cases. Based on phylogenetic topology and the availability of coding-complete genomes, we estimate that at least 20 novel viral genera in seven families need to be defined, only two of them monospecific. Seven additional viral clades emerge when adding sequences from the present study to formerly monospecific lineages, potentially requiring the taxonomic assignment of up to seven additional genera. For segmented viruses, cophylogenies between genome segments were generally improved by the inclusion of viruses from the present study, suggesting that in silico misassembly of segmented genomes is rare or absent. Contrary to previous assessments, significant virus-host codivergence is identified in major phylogenetic lineages based on two different approaches of codivergence analysis in a hypothesis-testing framework.
All in all, these results indicate that basing taxonomic decisions on genome information alone is challenging due to technical uncertainties.

Reducing haystacks to needles: Comparative genomics based on viral clusters

Kevin Lamkiewicz, Friedrich Schiller University Jena, Jena, Germany
Friday, October 9th, 11.10 am

Most viruses are still unknown. However, for some species like Influenza A virus or HIV, we are faced with an enormous amount of data, with databases holding up to millions of (sometimes redundant) sequence entries.

To gain reliable insight into conserved elements of viral genomes, researchers usually base their analyses on multiple sequence alignments (MSAs) of closely related genomes. However, building MSAs on hundreds of thousands of sequences is not practicable in terms of computational time, memory or storage, whereas picking representative sequences is not trivial.

Here, we present a bioinformatics workflow that is able to deal with a vast amount of viral sequences and assigns each virus to a cluster. The assignments are based on sequence information (k-mer frequencies) of the viruses. The definition of what belongs to a cluster is crucial. Therefore, we rely on state-of-the-art machine-learning techniques like UMAP and HDBSCAN. For each defined cluster of viruses, one representative genome is picked, resulting in a set consisting of around 15 viruses instead of several thousand.
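The final step, picking one representative per cluster, can be sketched as follows. The abstract does not state how the representative is chosen, so the medoid rule used here (the member with the smallest summed distance to all other members) is an assumption, as are the function names and toy vectors. The cluster labels are taken as given, with -1 marking unclustered points, which is the convention HDBSCAN implementations use for noise.

```python
import math
from collections import defaultdict

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pick_representatives(vectors, labels):
    """Return {cluster_label: index of the medoid genome} for every cluster.

    vectors: one feature vector per genome (e.g. embedded k-mer frequencies).
    labels:  one cluster label per genome; -1 means noise and is ignored.
    """
    clusters = defaultdict(list)
    for index, label in enumerate(labels):
        if label != -1:
            clusters[label].append(index)

    representatives = {}
    for label, members in clusters.items():
        # Medoid: the member closest, in total, to all other members.
        representatives[label] = min(
            members,
            key=lambda i: sum(euclidean(vectors[i], vectors[j]) for j in members),
        )
    return representatives

# Toy 2-D embedding of six genomes: two clusters and one noise point.
vectors = [[0, 0], [1, 0], [2, 0], [10, 10], [10, 11], [50, 50]]
labels = [0, 0, 0, 1, 1, -1]
representatives = pick_representatives(vectors, labels)
```

Unlike a centroid, a medoid is always an actual genome from the input, so it can be used directly in a downstream alignment.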

We show that our approach to cluster the viruses and determine representative sequences agrees with phylogenetic trees and thus the evolutionary relation between given viral genera and species (Flaviviruses, Coronaviruses, Influenza A viruses). We further outline the potential of clustering viral families. Moreover, considering conserved RNA secondary structures, we provide an example of how representative sequences can help in identifying and understanding important regions of the viral genome for Dengue viruses and Flaviviruses in general.

With our in silico pipeline, we are able to determine a small set of representative viruses based on millions of different strains and species. Thus, the process of selecting genomes for multiple sequence alignments and further downstream analyses is made easily accessible and may improve comparative genomics for viral genomes.

International Virus Bioinformatics Meeting 2020 » Oral presentations