CN112863599B

CN112863599B - Automatic analysis method and system for virus sequencing sequence

Info

Publication number: CN112863599B
Application number: CN202110271331.9A
Authority: CN
Inventors: 刘健; 孙嘉良; 陈娇
Original assignee: Nankai University
Current assignee: Nankai University
Priority date: 2021-03-12
Filing date: 2021-03-12
Publication date: 2022-10-14
Anticipated expiration: 2041-03-12
Also published as: CN112863599A

Abstract

The invention discloses an automatic analysis method and system for a virus sequencing sequence, comprising the following steps: performing quality control and sequence assembly on the virus sequencing sequence to obtain a virus genome long sequence; encoding the virus genome long sequence and then identifying its type with a pre-trained deep learning network model; and annotating the virus sequencing sequence based on the sequence alignment of the virus genome long sequence with the reference genome. To address the rapidly growing volume of virus sequencing data and the large amount of hard disk space that alignment databases occupy, the invention introduces deep learning to construct an identification model, and provides a virus annotation function in addition to virus type identification.

Description

Automatic analysis method and system for virus sequencing sequence

Technical Field

The invention relates to the technical field of gene sequencing analysis, in particular to an automatic analysis method and system of a virus sequencing sequence.

Background

The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.

Several new viruses capable of causing large-scale human fatalities, such as SARS (severe acute respiratory syndrome), influenza A virus H1N1, MERS (Middle East respiratory syndrome) and Ebola virus, have emerged over the last two decades, yet research on virus identification remains insufficient. Existing virus identification tools usually rely on BLAST alignment against genome or protein databases. As virus data grows by multiples or even exponentially, this approach becomes progressively slower, so existing approaches can no longer meet virus identification requirements in the face of the growing volume of virus sequencing data. In addition, because of this rapid growth, the databases used by alignment-based methods occupy more and more hard disk space.

Disclosure of Invention

To address the rapidly growing volume of virus sequencing data and the large amount of hard disk space occupied by alignment databases, the invention introduces deep learning to construct an identification model, which realizes virus type identification, and also provides a virus annotation function.

In order to achieve the purpose, the invention adopts the following technical scheme:

In a first aspect, the present invention provides a method for automated analysis of viral sequencing sequences, comprising:

performing quality control and sequence assembly on the virus sequencing sequence to obtain a virus genome long sequence;

after coding the virus genome long sequence, adopting a pre-trained deep learning network model to carry out type identification;

annotating the virus sequencing sequence based on the sequence alignment of the virus genome long sequence with the reference genome.

In a second aspect, the present invention provides an automated analysis system for virus sequencing sequences, comprising:

the data preprocessing module is configured to perform quality control and sequence assembly on the virus sequencing sequence to obtain a virus genome long sequence;

the identification module is configured to encode the virus genome long sequence and then carry out type identification by adopting a pre-trained deep learning network model;

an annotation module configured to perform annotation of the viral sequencing sequence according to the sequence alignment of the viral genome long sequence and the reference genome.

In a third aspect, the invention provides computer readable instructions which, when executed by a processor, perform the method of the first aspect.

Compared with the prior art, the invention has the following beneficial effects:

For the problem of identifying the species of a single-species sequencing sequence, the invention provides a deep learning-based multi-classification classifier. To cope with the rapidly growing volume of virus sequencing data, a deep learning method is introduced to identify virus types; compared with traditional identification methods, which must align against a large number of virus genomes, the invention can greatly improve identification speed.

The invention replaces the large virus databases that occupy hard disk space with an identification model trained by a deep learning method, so that the required hard disk space is significantly reduced.

The invention not only identifies virus species through deep learning but also provides a virus annotation function comprising evolutionary tree analysis, traceability prediction, mutation detection and protein function annotation.

The invention introduces a deep learning identification and classification method whose speed does not slow down noticeably as the data in real-world databases grows. Because the model abstracts the features of the virus data, it solves the problem that the databases required by existing methods occupy a large amount of hard disk space, and it significantly improves the efficiency of virus identification.

Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention and not to limit it.

FIG. 1 is a flowchart of the method for automated analysis of a virus sequencing sequence provided in embodiment 1 of the present invention;

FIG. 2 is a diagram of the deep learning network model structure provided in embodiment 1 of the present invention;

FIG. 3 is a branch flowchart of the network model provided in embodiment 1 of the present invention.

Detailed Description

The invention is further described with reference to the following figures and examples.

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and it should be understood that the terms "comprises" and "comprising", and any variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.

Example 1

As shown in fig. 1, this example provides a method for automated analysis of viral sequencing sequences, comprising:

S1: performing quality control and sequence assembly on the virus sequencing sequence to obtain a virus genome long sequence;

S2: after coding the virus genome long sequence, adopting a pre-trained deep learning network model to carry out type identification;

S3: annotating the virus sequencing sequence based on the sequence alignment of the virus genome long sequence with the reference genome.

Because both the accuracy of sequence similarity calculation and the effectiveness of deep learning depend heavily on an accurate and complete data set, this example first builds a high-quality data set: 9569 virus genome sequences, belonging to 137 families, were downloaded from the NCBI (National Center for Biotechnology Information) FTP site.

Step S1 specifically includes:

S1-1: quality control aims to filter out low-quality sequences, i.e., sequences that may contain erroneous bases. The quality of the obtained genome data is therefore evaluated, a quality evaluation report containing indexes such as the proportion of high-quality bases, the average quality and the GC content is generated, and adapter and primer sequences are removed from the virus sequencing sequences;

preferably, external software such as fastp, FastQC, Trimmomatic, Cutadapt, and simple is used for the quality control operations.
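By way of illustration only, the quality indexes named in step S1-1 could be computed along the lines of the following minimal Python sketch. It assumes Phred+33-encoded quality strings and a high-quality threshold of Q30; the function name `quality_report` and the sample reads are hypothetical and are not part of the disclosed method, which relies on the external tools listed above.

```python
# Hypothetical helper: summarise read quality from (sequence, quality string) pairs.
# Assumption: quality strings use Phred+33 encoding, as in standard FASTQ files.

def quality_report(reads, min_quality=30):
    """Return the high-quality base proportion, mean quality and GC content."""
    total_bases = 0
    high_quality_bases = 0
    quality_sum = 0
    gc_bases = 0
    for sequence, quality in reads:
        for base, symbol in zip(sequence.upper(), quality):
            score = ord(symbol) - 33  # decode one Phred+33 character
            total_bases += 1
            quality_sum += score
            if score >= min_quality:
                high_quality_bases += 1
            if base in "GC":
                gc_bases += 1
    if total_bases == 0:
        raise ValueError("no bases to evaluate")
    return {
        "high_quality_ratio": high_quality_bases / total_bases,
        "mean_quality": quality_sum / total_bases,
        "gc_content": gc_bases / total_bases,
    }


# Made-up reads: 'I' encodes Q40 and '#' encodes Q2.
reads = [("ACGT", "IIII"), ("GGCC", "II##")]
report = quality_report(reads)
print(report)
```

A report of this kind only describes the data; removal of adapters, primers and low-quality reads would still be delegated to the quality control software.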

S1-2: sequence assembly assembles the short sequences that passed quality control into virus genome long sequences (contigs);

preferably, external assembly software such as MEGAHIT, Velvet, SPAdes and Canu is adopted for sequence assembly;

preferably, after the assembled long sequences (contigs) are obtained, the present embodiment may further assess the assembly quality using evaluation indexes such as the total contig length, N50 and the average contig length, so as to determine how reliable the contigs are.
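As a non-limiting sketch, the assembly evaluation indexes mentioned above (total contig length, N50 and average contig length) can be derived from the list of contig lengths as follows. The function name `assembly_stats` and the example lengths are illustrative assumptions.

```python
def assembly_stats(contig_lengths):
    """Return the total length, N50 and mean contig length of an assembly."""
    if not contig_lengths:
        raise ValueError("no contigs")
    total = sum(contig_lengths)
    running = 0
    n50 = 0
    # N50: the length of the contig at which the cumulative length of contigs,
    # sorted from longest to shortest, first reaches half of the total length.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return {
        "total_length": total,
        "n50": n50,
        "mean_length": total / len(contig_lengths),
    }


# Made-up contig lengths for demonstration.
stats = assembly_stats([100, 200, 300, 400, 500])
print(stats)
```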

In this embodiment, for the training set of the deep learning network model, one thousand sequences of length 5000 were randomly selected from the 137 families; the base sequences of the viral genomes were encoded with one-hot encoding and input into the deep learning network model.
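The one-hot encoding step can be pictured with the following minimal sketch. The channel order A, C, G, T, the all-zero encoding of ambiguous bases such as N, and the zero-padding of sequences shorter than the target length are assumptions made for illustration; the embodiment does not specify them.

```python
# Assumed channel order: A, C, G, T.
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}


def one_hot_encode(sequence, length=5000):
    """Encode a base sequence as a length x 4 matrix, truncating or zero-padding."""
    rows = []
    for base in sequence.upper()[:length]:
        # Ambiguous bases (e.g. N) become an all-zero row (assumption).
        rows.append(list(ONE_HOT.get(base, [0, 0, 0, 0])))
    while len(rows) < length:
        rows.append([0, 0, 0, 0])  # pad short sequences with zeros (assumption)
    return rows


encoded = one_hot_encode("ACGTN", length=6)
print(encoded)
```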

Preferably, the present embodiment proposes a new identification method based on deep learning: a multi-classification convolutional neural network model comprising multiple parallel branch networks is constructed from a multi-classification model of the convolutional neural network (CNN) and a residual network. As shown in FIGS. 2 to 3, the whole model is composed of different parallel branches; each branch resembles a small independent network and uses a different architecture, which helps the neural network learn richer features of the genome sequence.

The specific structure of the multi-classification convolutional neural network model is as follows:

(1) The main branch is given more layers than the other branches, making it the deepest, so that the training result is more accurate; the activation functions in all the convolutional layers are set to ReLU;

(2) To alleviate overfitting, the present embodiment sets the regularizer parameter in the hidden layers to 0.001;

(3) To counteract the vanishing gradient problem caused by excessive depth, the present embodiment adds a residual connection on the main branch;

(4) This example selects Nadam as the optimizer for model training; Nadam is essentially Adam (RMSprop with momentum) using Nesterov momentum;

(5) The present embodiment selects categorical cross-entropy as the loss function;

(6) At the top of all the branches, the present embodiment merges the outputs of all the branches with a concatenation layer, then passes the result through two fully connected layers (dense layers), and finally outputs 137 scores through a softmax layer, with softmax as the activation function, representing the final classification result.
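The data flow of structure items (1), (3) and (6) can be pictured with the following dependency-free toy forward pass. It is a sketch under stated assumptions, not the disclosed model: the weights are random and untrained, the number of branches (two), the layer counts, the kernel widths, the filter count and the global max pooling before concatenation are all illustrative choices, and the regularizer, Nadam optimizer and loss function of items (2), (4) and (5) are not shown because no training takes place.

```python
import math
import random


def relu(value):
    return max(0.0, value)


def conv1d(x, kernels):
    """Same-padded 1D convolution followed by ReLU.

    x is a list of positions, each a list of channel values; kernels is a list
    of filters, each a list (kernel width) of per-channel weight lists.
    """
    width = len(kernels[0])
    pad = width // 2
    channels = len(x[0])
    zero = [0.0] * channels
    padded = [zero] * pad + x + [zero] * pad
    output = []
    for i in range(len(x)):
        row = []
        for kernel in kernels:
            total = 0.0
            for k in range(width):
                for c in range(channels):
                    total += kernel[k][c] * padded[i + k][c]
            row.append(relu(total))
        output.append(row)
    return output


def random_weights(rng, *shape):
    """Nested lists of random weights (untrained placeholders)."""
    if len(shape) == 1:
        return [rng.uniform(-0.5, 0.5) for _ in range(shape[0])]
    return [random_weights(rng, *shape[1:]) for _ in range(shape[0])]


def global_max_pool(x):
    return [max(row[c] for row in x) for c in range(len(x[0]))]


def dense(vector, weights, use_relu=False):
    outputs = [sum(w * v for w, v in zip(row, vector)) for row in weights]
    return [relu(o) for o in outputs] if use_relu else outputs


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, n_classes=137, filters=4, seed=0):
    rng = random.Random(seed)
    channels = len(x[0])
    # Main (deepest) branch: three convolutions, with a residual connection
    # adding the first layer's output to the third layer's output.
    first = conv1d(x, random_weights(rng, filters, 3, channels))
    main = conv1d(first, random_weights(rng, filters, 3, filters))
    main = conv1d(main, random_weights(rng, filters, 3, filters))
    main = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(main, first)]
    # Shallower side branch with a different kernel width.
    side = conv1d(x, random_weights(rng, filters, 5, channels))
    # Concatenate the pooled branch outputs, then two dense layers and softmax.
    merged = global_max_pool(main) + global_max_pool(side)
    hidden = dense(merged, random_weights(rng, 16, len(merged)), use_relu=True)
    logits = dense(hidden, random_weights(rng, n_classes, len(hidden)))
    return softmax(logits)


# Twenty one-hot encoded bases (ACGT repeated five times) as a toy input.
toy_input = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]] * 5
probabilities = forward(toy_input)
print(len(probabilities), round(sum(probabilities), 6))
```

In practice such a model would be built and trained with a deep learning framework; the pure-Python form is used here only so that the branch, residual and concatenation steps are visible without external dependencies.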

Preferably, the training of the deep learning network model comprises: constructing a training set after performing feature engineering on the reference genome, and training the network model with the training set.

Preferably, the virus sequencing sequence is identified according to the trained deep learning network model, the probability that the virus sequencing sequence belongs to each family (biological classification level) is output, and the family with the highest probability is taken as the type of the virus sequencing sequence.
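Selecting the family with the highest output probability might look like the following sketch. The function name `rank_families`, the three family names and the probability values are made up for illustration; a real model would output one probability for each of the 137 families.

```python
def rank_families(probabilities, families):
    """Pair each family with its probability and sort from most to least likely."""
    if len(probabilities) != len(families):
        raise ValueError("one probability is required per family")
    return sorted(zip(families, probabilities), key=lambda item: item[1], reverse=True)


# Made-up example with three families instead of 137.
families = ["Coronaviridae", "Orthomyxoviridae", "Filoviridae"]
ranked = rank_families([0.2, 0.7, 0.1], families)
predicted_family, predicted_probability = ranked[0]
print(predicted_family, predicted_probability)
```

Keeping the full ranking, rather than only the top entry, makes it possible to report the runner-up families alongside the prediction.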

When the virus sequencing sequence is identified with the multi-classification convolutional neural network model, deep learning identification can be integrated with a traditional sequence alignment method: virus identification is also performed with the sequence alignment of the traditional BLAST software, and the two approaches complement each other, thereby improving the final identification accuracy. For the sequence alignment method, this example builds a sequence index of the reference genomes for use by the sequence alignment function of the BLAST software;

preferably, BLAST alignment against the database yields a result file containing fields such as query id, subject id, % identity and alignment length, from which the type of the virus sequencing sequence is predicted according to the matched reference genome and the class to which that reference genome belongs.

In this embodiment, the similarity (% identity) and alignment length of each reference genome (subject acc.ver) are evaluated with the traditional sequence alignment method, and the product of the alignment length and the similarity is used as the evaluation score, i.e., the sequence similarity between the virus sequencing sequence and the reference genome;

preferably, the evaluation score obtained by multiplying the alignment length by the similarity is:

$$\text{score}_{\text{accession.version}} = \text{identity} \times \text{alignment length}$$

wherein identity is the sequence similarity, alignment length is the length over which the contigs align with the reference genome, and accession.version is the sequence number of the reference genome aligned with the input sequence.
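A minimal sketch of this scoring, assuming the BLAST result file uses the default tabular column order (query id, subject id, % identity, alignment length, followed by further columns), is given below. Keeping only the best-scoring hit per reference genome is an assumption of this sketch, and the identifiers `ref_A` and `ref_B` are placeholders rather than real accession numbers.

```python
def score_hits(blast_lines):
    """Map each subject id to its best identity x alignment-length score."""
    scores = {}
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        subject_id = fields[1]
        identity = float(fields[2])        # % identity
        alignment_length = int(fields[3])  # alignment length
        score = identity * alignment_length
        # Assumption: keep the best-scoring hit for each reference genome.
        scores[subject_id] = max(score, scores.get(subject_id, 0.0))
    return scores


def top_references(scores, n=5):
    """Return the n reference genomes with the highest scores."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:n]


# Made-up tabular hits for demonstration.
lines = [
    "contig_1\tref_A\t98.5\t1000\t15\t0\t1\t1000\t1\t1000\t0.0\t1800",
    "contig_1\tref_B\t90.0\t1200\t120\t0\t1\t1200\t1\t1200\t0.0\t1500",
    "contig_1\tref_A\t99.0\t200\t2\t0\t1\t200\t1\t200\t1e-90\t350",
]
ranking = top_references(score_hits(lines), n=2)
print(ranking)
```

In this made-up example the longer alignment to `ref_B` outranks the more similar but shorter alignment to `ref_A`, which shows how the product weighs alignment length against identity.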

It is understood that the similarity calculation can also be performed by using a sequence alignment method.

In step S3, the annotation of the virus sequencing sequence includes the construction of a phylogenetic tree, specifically: according to the sequence similarity between the virus sequencing sequence and each reference gene sequence, the top N reference gene sequences ranked by sequence similarity are selected, and a phylogenetic tree is constructed;

preferably, the genetic distances between the reference genomic sequences are calculated with the MEGA software, and the phylogenetic tree is drawn from these distances using the ete module for Python.
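To make the notion of genetic distance concrete, the following sketch computes a simple p-distance (the proportion of differing sites) between pre-aligned sequences. This is a simplified stand-in chosen for illustration: the embodiment delegates the calculation to MEGA, whose distance model is not specified here, and the sequences and names below are invented.

```python
def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences, ignoring gap columns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # skip columns containing a gap
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def distance_matrix(aligned):
    """Pairwise p-distances for a dict mapping names to aligned sequences."""
    names = list(aligned)
    return {
        (x, y): p_distance(aligned[x], aligned[y])
        for i, x in enumerate(names)
        for y in names[i + 1:]
    }


# Made-up aligned sequences for demonstration.
aligned = {"ref_A": "ACGTACGT", "ref_B": "ACGTACGA", "ref_C": "AC-TTCGA"}
matrix = distance_matrix(aligned)
print(matrix)
```

A distance matrix of this form is the usual input to distance-based tree-building methods, after which the resulting tree can be rendered with a plotting module.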

In this embodiment, the annotation of the virus sequencing sequence further includes mutation detection, specifically: the reference gene sequence with the highest sequence similarity is obtained, and the long sequences (contigs) assembled from the virus sequencing sequence are aligned with it; possible genetic variation of the virus sequencing sequence relative to the reference genome is then determined from the positions of the differing bases in the alignment result.

Preferably, the positions of the differing bases are extracted from the alignment result using FreeBayes to generate a VCF file.
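The idea of reading variants off the positions of differing bases can be sketched as follows. This is a simplified stand-in for what FreeBayes does, not its algorithm: it assumes a pre-computed pairwise alignment with `-` as the gap character, reports substitutions only (insertions and deletions are skipped), and uses 1-based coordinates on the ungapped reference. The function name, the `chrom` label and the sequences are hypothetical.

```python
def call_substitutions(reference, contig, chrom="reference"):
    """List substitutions between an aligned reference and contig as VCF-like tuples."""
    if len(reference) != len(contig):
        raise ValueError("sequences must be aligned to the same length")
    variants = []
    ref_position = 0
    for ref_base, alt_base in zip(reference.upper(), contig.upper()):
        if ref_base != "-":
            ref_position += 1  # 1-based coordinate on the ungapped reference
        if "-" in (ref_base, alt_base):
            continue  # insertions and deletions are skipped in this sketch
        if ref_base != alt_base:
            variants.append((chrom, ref_position, ref_base, alt_base))
    return variants


# Made-up alignment: the contig carries an insertion after reference position 4.
variants = call_substitutions("ACGT-ACGT", "ACCTTACGA")
for chrom, pos, ref, alt in variants:
    print(chrom, pos, ref, alt, sep="\t")
```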

In this embodiment, the annotation of the virus sequencing sequence further includes traceability prediction. Because sequences with higher similarity are more likely to share the same host and origin, this example infers the host and origin of the virus sequencing sequence from the host and origin information of reference genomes obtained from NCBI; specifically, the top N reference gene sequences ranked by sequence similarity are selected, and their host and origin information is retrieved from a locally collected traceability information data set, giving the traceability prediction result.
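One possible way to turn the top N references into a traceability prediction is sketched below. The majority vote is an assumption of this sketch, since the embodiment does not state how the information from several references is combined, and the metadata dictionary stands in for the locally collected traceability data set; all identifiers and values are invented.

```python
from collections import Counter


def predict_origin(ranked_references, metadata, n=3):
    """Vote on host and origin among the top-n most similar references."""
    top = [ref for ref, _score in ranked_references[:n] if ref in metadata]
    if not top:
        raise ValueError("no traceability information for the top references")
    hosts = Counter(metadata[ref]["host"] for ref in top)
    origins = Counter(metadata[ref]["origin"] for ref in top)
    # Assumption: the most frequent host and origin are taken as the prediction.
    return hosts.most_common(1)[0][0], origins.most_common(1)[0][0]


# Made-up ranking (reference id, similarity score) and traceability records.
ranked = [("ref_B", 108000.0), ("ref_A", 98500.0), ("ref_C", 50000.0)]
metadata = {
    "ref_A": {"host": "Homo sapiens", "origin": "country X"},
    "ref_B": {"host": "Homo sapiens", "origin": "country Y"},
    "ref_C": {"host": "bat", "origin": "country Y"},
}
host, origin = predict_origin(ranked, metadata)
print(host, origin)
```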

In this embodiment, the annotation of the viral sequencing sequence further includes protein function annotation, specifically: the assembled genome long sequences (contigs) are combined with information in KEGG and Gene Ontology to generate protein annotation information for the contigs, wherein the protein annotation information comprises annotation information such as the retrieved gene name, the best-matching protein and the predicted gene name;

preferably, the present embodiment integrates the eggNOG-mapper software as a component of the annotation function.

In this embodiment, virus species are identified through deep learning, and several annotation functions, namely evolutionary tree analysis, traceability prediction, mutation detection and protein function annotation, are also implemented.

Example 2

The present embodiment provides an automated analysis system for virus sequencing sequences, comprising:

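the data preprocessing module is configured to perform quality control and sequence assembly on the virus sequencing sequence to obtain a virus genome long sequence;

the identification module is configured to encode the virus genome long sequence and then carry out type identification by adopting a pre-trained deep learning network model;

an annotation module configured to perform annotation of the viral sequencing sequence according to the sequence alignment of the viral genome long sequence and the reference genome.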

It should be noted that the above modules correspond to the steps described in embodiment 1; the examples and application scenarios implemented by the modules are the same as those of the corresponding steps, but are not limited to the disclosure of embodiment 1. It should also be noted that the modules described above, as part of a system, may be implemented in a computer system, for example as a set of computer-executable instructions.

In further embodiments, there is also provided:

computer readable instructions which, when executed by a processor, perform the method of embodiment 1.

Those of ordinary skill in the art will appreciate that the illustrative units and algorithm steps described in connection with the embodiments disclosed herein may be implemented as a combination of computer software and electronic hardware. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Although the embodiments of the present invention have been described with reference to the accompanying drawings, this description is not intended to limit the scope of the present invention, and it should be understood by those skilled in the art that various modifications and variations can be made without inventive effort on the basis of the technical solution of the present invention.

Claims

1. A method for automated analysis of a viral sequencing sequence, comprising:

obtaining a virus genome long sequence after performing quality control and sequence assembly on a virus sequencing sequence;

integrating a deep learning network model and a traditional sequence comparison method, identifying the virus type by using the traditional sequence comparison method, and realizing complementation by combining the deep learning network model, thereby improving the final identification accuracy;

in the traditional sequence comparison method, the similarity and the alignment length of a reference genome sequence are evaluated, and the sequence similarity of a virus genome long sequence and the reference genome sequence is obtained according to the product of the alignment length and the similarity; predicting the type of the virus sequencing sequence according to the reference genome sequence and the category to which the reference genome sequence belongs;

annotating the virus sequencing sequence according to the sequence similarity obtained by comparing the sequence of the virus genome long sequence with the sequence of the reference genome sequence;

the deep learning network model is a multi-classification convolutional neural network model which is constructed by using a multi-classification model of a convolutional neural network and a residual network and comprises a plurality of parallel branch networks;

in the multi-classification convolutional neural network model:

one branch network in the multiple parallel branch networks has a depth larger than that of other branch networks, and residual connection is added on a main branch network with the deepest depth;

and at the top of all the branch networks, combining the outputs of all the branch networks by using a concatenation layer, then passing through two fully connected layers, and finally outputting the classification result through a softmax layer.

2. The method of claim 1, wherein the quality control is performed by performing de-adaptor and de-primer operations on the virus sequencing sequence.

3. The method of claim 1, wherein the sequence assembly is performed by assembling short sequences into long sequences to obtain long sequences of the viral genome.

4. The method of claim 1, wherein the base sequence of the long sequence of the genome of the virus is encoded.

5. The method of claim 1, wherein the reference genome sequence is subjected to feature engineering to construct a training set, and the deep learning network model is trained by using the training set.

6. The method of claim 1, wherein the type identification comprises: identifying the virus sequencing sequence according to a pre-trained deep learning network model, outputting the probability that the virus sequencing sequence belongs to each family, and taking the family with the highest probability as the type of the virus sequencing sequence.

7. The method of claim 1, wherein the annotation of the viral sequencing sequence comprises obtaining the top N reference genomic sequences ranked by sequence similarity, and calculating the genetic distance between the reference genomic sequences to construct the phylogenetic tree.

8. The method of claim 1, wherein the annotation of the viral sequencing sequence comprises obtaining a reference genomic sequence with the highest sequence similarity, comparing the viral sequencing sequence with the reference genomic sequence, and determining the genetic variation information of the viral sequencing sequence relative to the reference genomic sequence according to the positions of different bases in the comparison result.

9. The method of claim 1, wherein the annotation of the viral sequencing sequence comprises a protein function annotation comprising a retrieved gene name, a best-matching protein, and a predicted gene name.

10. An automated analysis system for viral sequencing sequences, comprising:

in the multi-classification convolutional neural network model:

combining the outputs of all the branch networks by using a concatenation layer at the tops of all the branch networks, then passing through two fully connected layers, and finally outputting a classification result through a softmax layer;

an annotation module configured to perform annotation of the viral sequencing sequence according to sequence similarity obtained from sequence alignment of the viral genome long sequence and the reference genome sequence.

11. Computer readable instructions, wherein said computer readable instructions, when executed by a processor, perform the method of any of claims 1-9.