CN114334010B - Automatic identification method and system for classification of bunyas related virus species - Google Patents
- Publication number
- CN114334010B (CN202111610163.8A)
- Authority
- CN
- China
- Prior art keywords
- rdrp
- sequence
- automatic identification
- conserved domain
- tool
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Abstract
The invention relates to an automatic identification method and system for species classification of Bunyavirales-related viruses. The method comprises: obtaining FASTA format files of all protein sequences of Bunyavirales-related viruses from the GenBank database linked to the Taxonomy database; batch-annotating the FASTA format files with the CDD tool based on NCBI accession numbers and generating a Bunyavirales RdRp core conserved domain sequence data packet; installing the BLAST tool and the CDD tool and building an automatic identification web page; acquiring a virus sequence to be identified that a user submits on the automatic identification web page; screening the virus sequence to be identified with the CDD tool and extracting the Bunyavirales RdRp core conserved domain sequence from it; and using the BLAST tool to obtain classification information of a plurality of similar sequences from a BLAST local library. The method allows virus classification information to be obtained rapidly.
Description
Technical Field
The invention belongs to the technical field of virus classification, and particularly relates to an automatic identification method and system for species classification of Bunyavirales-related viruses.
Background
The order Bunyavirales comprises 12 families and 54 genera. Severe fever with thrombocytopenia syndrome virus, hantavirus, Crimean-Congo hemorrhagic fever virus and Lassa virus, which belong to the families Phenuiviridae, Hantaviridae, Nairoviridae and Arenaviridae respectively, can infect humans and can be fatal in severe cases. Rapidly obtaining classification information for a bunyavirus and clarifying its origin is therefore important for prevention and control. RNA-dependent RNA polymerase (RdRp), a protein encoded by a conserved gene, has been widely used in the classification of RNA viruses and is a reasonable choice for determining classification information. The International Committee on Taxonomy of Viruses (ICTV) database, the National Center for Biotechnology Information (NCBI) Taxonomy database and ViralZone are currently the widely used virus classification databases. The classification information that can be obtained directly from these databases covers only known viruses. For a newly discovered Bunyavirales-related virus, the Conserved Domain Database (CDD) tool must be used to obtain the core conserved domain, the Basic Local Alignment Search Tool (BLAST) must then be used to obtain highly similar sequences, and the Taxonomy database must be consulted to obtain the corresponding classification information. This analysis flow demands considerable expertise, the result processing is cumbersome, and classification information cannot be obtained directly by submitting a genome sequence.
Disclosure of Invention
The invention aims to provide an automatic identification method and system for species classification of Bunyavirales-related viruses, in order to solve the problems that the analysis flow for newly discovered Bunyavirales-related viruses demands considerable expertise, that result processing is cumbersome, and that classification information cannot be obtained directly by submitting a genome sequence. The technical problem is solved by the following technical solutions:
In one aspect, the present invention provides an automatic identification method for species classification of Bunyavirales-related viruses, comprising:
obtaining FASTA format files of all protein sequences of Bunyavirales-related viruses from the GenBank database linked to the Taxonomy database;
batch-annotating the FASTA format files with the CDD tool based on NCBI accession numbers and generating a Bunyavirales RdRp core conserved domain sequence data packet;
installing the BLAST tool and the CDD tool, building an automatic identification web page, and using the Bunyavirales RdRp core conserved domain sequence data packet as the BLAST local library of the automatic identification web page;
acquiring a virus sequence to be identified that a user submits on the automatic identification web page;
screening the virus sequence to be identified with the CDD tool and, if the screening result indicates a bunyavirus, extracting the Bunyavirales RdRp core conserved domain sequence to be identified from the virus sequence to be identified;
and using the BLAST tool to obtain classification information of a plurality of similar sequences from the BLAST local library according to the Bunyavirales RdRp core conserved domain sequence to be identified.
Preferably, the step of obtaining FASTA format files of all protein sequences of Bunyavirales-related viruses from the GenBank database linked to the Taxonomy database comprises:
searching for "bunyavirales" in the Taxonomy database to obtain a list containing N protein sequences;
following the link from the Taxonomy database to the GenBank database and downloading the FASTA format file of each protein sequence in the list.
Preferably, the step of batch-annotating the FASTA format files with the CDD tool based on NCBI accession numbers and generating the Bunyavirales RdRp core conserved domain sequence data packet comprises:
screening out standard RdRp sequences and potential RdRp sequences from the FASTA format files of the N protein sequences based on the sequence names in the FASTA format files;
batch-annotating the standard RdRp sequences and the potential RdRp sequences with the CDD tool based on NCBI accession numbers to obtain annotation results, and finding the corresponding Bunyavirales RdRp core conserved domain intervals according to the annotation results;
and processing the corresponding Bunyavirales RdRp core conserved domain intervals with a Python script to obtain the Bunyavirales RdRp core conserved domain sequence data packet.
Preferably, the step of batch-annotating the standard RdRp sequences and the potential RdRp sequences with the CDD tool based on NCBI accession numbers to obtain annotation results, and finding the corresponding Bunyavirales RdRp core conserved domain intervals according to the annotation results, further comprises:
if no corresponding Bunyavirales RdRp core conserved domain interval is found for a standard RdRp sequence, obtaining the NCBI-annotated RdRp coding region interval of that standard RdRp sequence;
obtaining the corresponding Bunyavirales RdRp core conserved domain interval from the NCBI-annotated RdRp coding region interval of the standard RdRp sequence by using the BLAST tool or by alignment with sequences of the same virus family.
Preferably, the step of processing the corresponding Bunyavirales RdRp core conserved domain intervals with the Python script to obtain the Bunyavirales RdRp core conserved domain sequence data packet is further followed by:
extracting the sequence IDs in the Bunyavirales RdRp core conserved domain sequence data packet, wherein each sequence ID is an NCBI accession number;
batch-downloading the corresponding GenPept format files with NCBI Batch Entrez according to the sequence IDs;
extracting, from the NCBI annotation information of each downloaded GenPept format file, the classification information to which the corresponding sequence belongs;
and integrating the extracted classification information with the Python script.
Preferably, the step of installing the BLAST tool and the CDD tool and building the automatic identification web page comprises:
installing the BLAST tool in an Ubuntu system with the command "apt install ncbi-blast+", wherein the CDD tool is embedded in the BLAST tool;
and building the automatic identification web page with Python Flask, using PSSM matrix data downloaded from NCBI as the CDD local library of the automatic identification web page, and using the Bunyavirales RdRp core conserved domain sequence data packet as the BLAST local library of the automatic identification web page.
Preferably, the classification information of the similar sequences comprises the NCBI accession number, the similarity, the genome coverage, and the classification information to which each sequence belongs.
In another aspect, the present invention also provides an automatic identification system for species classification of Bunyavirales-related viruses, comprising:
a data acquisition module configured to obtain FASTA format files of all protein sequences of Bunyavirales-related viruses from the GenBank database linked to the Taxonomy database;
a data packet generation module configured to batch-annotate the FASTA format files with the CDD tool based on NCBI accession numbers and generate a Bunyavirales RdRp core conserved domain sequence data packet;
an identification web page construction module configured to install the BLAST tool and the CDD tool, build an automatic identification web page, and use the Bunyavirales RdRp core conserved domain sequence data packet as the BLAST local library of the automatic identification web page;
a user input module configured to acquire a virus sequence to be identified that a user submits on the automatic identification web page;
a preliminary screening module configured to screen the virus sequence to be identified with the CDD tool and, if the screening result indicates a bunyavirus, extract the Bunyavirales RdRp core conserved domain sequence to be identified from the virus sequence to be identified;
and an identification module configured to use the BLAST tool to obtain classification information of a plurality of similar sequences from the BLAST local library according to the Bunyavirales RdRp core conserved domain sequence to be identified.
In still another aspect, the present invention further provides an electronic device, including a processor and a memory storing computer readable instructions which, when executed by the processor, implement the automatic identification method for species classification of Bunyavirales-related viruses described above.
In yet another aspect, the present invention also provides a computer readable storage medium storing a computer program which, when executed by a processor, implements the automatic identification method for species classification of Bunyavirales-related viruses described above.
In the automatic identification method of the invention, FASTA format files of all protein sequences of Bunyavirales-related viruses are first obtained from existing databases and batch-annotated with the CDD tool to generate the Bunyavirales RdRp core conserved domain sequence data packet, which completes the data preparation. The BLAST tool and the CDD tool are then installed and the automatic identification web page is built, yielding a rapid automatic identification tool for species classification within the order Bunyavirales. Finally, a user can obtain virus classification information within 1 min by submitting a genome sequence. The method takes a database of Bunyavirales RdRp core conserved domains as its entry point, integrates GenBank and Taxonomy database information and builds a dedicated web platform, which reduces the complexity of the analysis, greatly shortens the time needed for virus classification and improves identification efficiency. In addition, because the method is based on the widely used BLAST and CDD tools of NCBI, its output is highly reliable. It is suitable for use in scientific research institutions, centers for disease control and prevention and similar organizations, helps their staff determine the taxonomic position of a bunyavirus quickly and clearly, and provides technical support for the prevention and control of bunyavirus-related diseases in China.
Drawings
FIG. 1 is a flow chart of some embodiments of the automatic identification method for species classification of Bunyavirales-related viruses of the present invention;
FIG. 2 is a flow chart of some embodiments of step 100 of the automatic identification method of the present invention;
FIG. 3 is a flow chart of some embodiments of step 200 of the automatic identification method of the present invention;
FIG. 4 is a flow chart of some embodiments of step 300 of the automatic identification method of the present invention;
FIG. 5 is a flow chart of the generation of the Bunyavirales RdRp core conserved domain sequence data packet in an embodiment of the present invention;
FIG. 6 is a flow chart of the construction of the automatic identification web page in an embodiment of the present invention;
FIG. 7 is a flow chart of the functional implementation of the automatic identification web page in an embodiment of the present invention;
FIG. 8 is a block diagram of some embodiments of the automatic identification system for species classification of Bunyavirales-related viruses of the present invention.
Detailed Description
It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other. The invention will be described in detail below with reference to the drawings in connection with embodiments.
Referring to FIG. 1, an embodiment of the present invention provides an automatic identification method for species classification of Bunyavirales-related viruses, including:
Step 100: obtaining FASTA format files of all protein sequences of Bunyavirales-related viruses from the GenBank database linked to the Taxonomy database;
In this embodiment, all protein sequences of Bunyavirales-related viruses are located through the existing National Center for Biotechnology Information (NCBI) Taxonomy database, and the corresponding FASTA format files are downloaded from the GenBank database. FASTA is a text-based format for representing nucleic acid or amino acid sequences.
Step 200: batch-annotating the FASTA format files with the CDD tool based on NCBI accession numbers and generating a Bunyavirales RdRp core conserved domain sequence data packet;
In this embodiment, the obtained protein sequences are batch-annotated by NCBI accession number with the NCBI CDD tool, and the Bunyavirales RdRp core conserved domain intervals are then obtained from the annotation results.
Steps 100 and 200 complete the data preparation for the automatic identification method in this embodiment.
Step 300: installing the BLAST tool and the CDD tool, building an automatic identification web page, and using the Bunyavirales RdRp core conserved domain sequence data packet as the BLAST local library of the automatic identification web page;
This step installs the CDD tool and the BLAST tool locally, integrates them, and establishes the automatic identification web page.
Step 400: acquiring a virus sequence to be identified that a user submits on the automatic identification web page;
A user who needs an identification only has to submit the virus sequence to be identified on the front-end interface, that is, the automatic identification web page.
Step 500: screening the virus sequence to be identified with the CDD tool and, if the screening result indicates a bunyavirus, extracting the Bunyavirales RdRp core conserved domain sequence to be identified from the virus sequence to be identified;
In this embodiment, the back end of the automatic identification web page performs a preliminary screening with the CDD tool to determine whether the genome contains a Bunyavirales RdRp core conserved domain (that is, the Bunya_RdRp core conserved domain).
Step 600: using the BLAST tool to obtain classification information of a plurality of similar sequences from the BLAST local library according to the Bunyavirales RdRp core conserved domain sequence to be identified.
In this embodiment, the obtained Bunya_RdRp core conserved domain sequence is searched with BLAST against the local library built from the Bunyavirales RdRp core conserved domain sequence data packet to obtain classification information of a plurality of similar sequences.
In the automatic identification method of this embodiment, FASTA format files of all protein sequences of Bunyavirales-related viruses are first obtained from existing databases and batch-annotated with the CDD tool to generate the Bunyavirales RdRp core conserved domain sequence data packet, which completes the data preparation. The BLAST tool and the CDD tool are then installed and the automatic identification web page is built, yielding a rapid automatic identification tool for species classification within the order Bunyavirales. Finally, a user can obtain virus classification information within 1 min by submitting a genome sequence. The method of this embodiment takes a database of Bunyavirales RdRp core conserved domains as its entry point, integrates GenBank and Taxonomy database information and builds a dedicated web platform, which reduces the complexity of the analysis, greatly shortens the time needed for virus classification and improves identification efficiency. In addition, because the method is based on the widely used BLAST and CDD tools of NCBI, its output is highly reliable. It is suitable for use in scientific research institutions, centers for disease control and prevention and similar organizations, helps their staff determine the taxonomic position of a bunyavirus quickly and clearly, and provides technical support for the prevention and control of bunyavirus-related diseases in China.
In some embodiments, referring to FIG. 2 and FIG. 5, step 100 of the automatic identification method for species classification of Bunyavirales-related viruses of the present invention includes:
Step 101: searching for "bunyavirales" in the Taxonomy database to obtain a list containing N protein sequences;
Step 102: following the link from the Taxonomy database to the GenBank database and downloading the FASTA format file of each protein sequence in the list.
Specifically, in this embodiment, searching the NCBI Taxonomy database with "bunyavirales" as the keyword shows N = 53507 Bunyavirales protein sequences. Clicking the Protein entry under "Subtree links" opens the linked GenBank database, from which the FASTA format files of all protein sequences in the result list are downloaded.
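The embodiment performs this download through the NCBI web interface. As a hedged alternative sketch only, the same search and FASTA download could be scripted against the NCBI E-utilities. The query term `Bunyavirales[Organism]`, the function names and the page size below are assumptions made for illustration and are not part of the described method; the sketch only builds the request URLs and does not contact NCBI.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities (esearch / efetch endpoints).
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term="Bunyavirales[Organism]", retmax=100, retstart=0):
    """Build an esearch URL listing protein records that match `term`.

    The default term is an assumption; the patent text only says that
    "bunyavirales" is used as the search keyword.
    """
    params = {
        "db": "protein",
        "term": term,
        "retmax": retmax,      # page size (hypothetical choice)
        "retstart": retstart,  # offset, used to page through all N hits
    }
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def efetch_fasta_url(ids):
    """Build an efetch URL returning the given protein records as FASTA."""
    params = {
        "db": "protein",
        "id": ",".join(ids),
        "rettype": "fasta",
        "retmode": "text",
    }
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    print(esearch_url())
    print(efetch_fasta_url(["YP_009666929.1"]))
```

The URLs can be fetched with any HTTP client; for a result list of tens of thousands of records the search would have to be paged with `retstart`.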
In some embodiments, referring to FIG. 3 and FIG. 5, step 200 of the automatic identification method for species classification of Bunyavirales-related viruses of the present invention includes:
Step 201: screening out standard RdRp sequences and potential RdRp sequences from the FASTA format files of the N protein sequences based on the sequence names in the FASTA format files;
Step 202: batch-annotating the standard RdRp sequences and the potential RdRp sequences with the CDD tool based on NCBI accession numbers to obtain annotation results, and finding the corresponding Bunyavirales RdRp core conserved domain intervals according to the annotation results;
Step 203: processing the corresponding Bunyavirales RdRp core conserved domain intervals with a Python script to obtain the Bunyavirales RdRp core conserved domain sequence data packet.
Specifically, in this embodiment, standard RdRp and potential RdRp sequences are selected by sequence title, where the sequence title is the sequence name in the FASTA format file downloaded from GenBank and includes the protein name. Sequences whose names explicitly annotate them as RdRp proteins are defined as standard RdRp. Proteins that may contain the RdRp core conserved region, for example "L protein" or "polymerase", are defined as potential RdRp (in the order Bunyavirales RdRp is mainly encoded by the L segment, and the literature was consulted to determine which of these proteins may contain RdRp). Manual screening may be used here, for example by opening the protein data packet in the BioEdit software and using "Find in title" to locate the standard RdRp and potential RdRp sequences.
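The title-based screening of step 201 is described as a manual operation in BioEdit. A minimal sketch of how the same rule could be scripted is given below; the keyword patterns and the function names are assumptions chosen to mirror the examples in the text ("RdRp", "L protein", "polymerase") and would need to be extended to match real GenBank titles.

```python
import re


def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Assumed keyword rules; word boundaries stop "small protein" from
# matching the "L protein" pattern.
STANDARD = re.compile(r"rna-dependent rna polymerase|\brdrp\b", re.I)
POTENTIAL = re.compile(r"\bl[ -]protein\b|polymerase", re.I)


def classify_title(header):
    """Return 'standard', 'potential' or None for a FASTA sequence title."""
    if STANDARD.search(header):
        return "standard"
    if POTENTIAL.search(header):
        return "potential"
    return None


def screen(fasta_text):
    """Split FASTA records into standard and potential RdRp candidates."""
    result = {"standard": {}, "potential": {}}
    for header, seq in read_fasta(fasta_text):
        label = classify_title(header)
        if label:
            accession = header.split()[0]  # first token is the accession
            result[label][accession] = seq
    return result
```

Checking the standard pattern first means a title such as "RNA-dependent RNA polymerase" is not demoted to the potential group merely because it also contains the word "polymerase".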
In this embodiment, batch annotation with the CDD tool means using the CDD tool to find the RdRp core conserved domain in the standard RdRp and potential RdRp sequences. Even proteins explicitly annotated as RdRp differ considerably in length and have poor sequence similarity, so to ensure the sequence conservation of the final RdRp data packet, an additional CDD step is applied to all standard RdRp and potential RdRp sequences to locate the RdRp core conserved domain. For the batch annotation, the sequence information of the standard RdRp and potential RdRp sequences (NCBI accession number, sequence length and protein name) is combined into an Excel table, the CDD tool is run in batch on the NCBI accession numbers, and Bunya_RdRp is searched for among the conserved domains reported in the CDD results. A hit indicates that the RdRp core conserved domain interval can be obtained.
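The text does not show the Python script of step 203. The following is a sketch under stated assumptions: the CDD result is taken to be a tab-separated hit table with columns named `Query`, `From`, `To` and `Short name`, and the `Query` column is taken to hold the bare accession number. Both the column layout and the function names are hypothetical and chosen for illustration.

```python
import csv
import io


def bunya_rdrp_intervals(hit_table_text, domain="Bunya_RdRp"):
    """Map accession -> (start, end) for rows that hit the given domain.

    Coordinates are 1-based and inclusive, as is usual for protein
    annotation tables (an assumption about the input).
    """
    intervals = {}
    reader = csv.DictReader(io.StringIO(hit_table_text), delimiter="\t")
    for row in reader:
        if row["Short name"] == domain:
            intervals[row["Query"]] = (int(row["From"]), int(row["To"]))
    return intervals


def extract_core(sequences, intervals):
    """Cut the core conserved domain out of each full-length sequence."""
    core = {}
    for accession, (start, end) in intervals.items():
        if accession in sequences:
            # Convert 1-based inclusive coordinates to a Python slice.
            core[accession] = sequences[accession][start - 1:end]
    return core


def to_fasta(core):
    """Serialize the extracted domains as a FASTA data packet."""
    return "".join(f">{acc}\n{seq}\n" for acc, seq in core.items())
```

Sequences without a Bunya_RdRp hit are simply absent from the result, which makes it easy to list the standard RdRp sequences that need the fallback handling of the coding region interval.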
In some embodiments, referring to FIG. 5, step 202 of the automatic identification method for species classification of Bunyavirales-related viruses of the present invention further includes:
if no corresponding Bunyavirales RdRp core conserved domain interval is found for a standard RdRp sequence, obtaining the NCBI-annotated RdRp coding region interval of that standard RdRp sequence;
obtaining the corresponding Bunyavirales RdRp core conserved domain interval from the NCBI-annotated RdRp coding region interval of the standard RdRp sequence by using the BLAST tool or by alignment with sequences of the same virus family.
In some embodiments, referring to FIG. 3, the automatic identification method for species classification of Bunyavirales-related viruses of the present invention further includes, after step 203:
Step 204: extracting the sequence IDs in the Bunyavirales RdRp core conserved domain sequence data packet, wherein each sequence ID is an NCBI accession number;
Step 205: batch-downloading the corresponding GenPept format files with NCBI Batch Entrez according to the sequence IDs;
Step 206: extracting, from the NCBI annotation information of each downloaded GenPept format file, the classification information to which the corresponding sequence belongs;
Step 207: integrating the extracted classification information with the Python script.
Specifically, in this embodiment, the sequence accession IDs in the Bunya_RdRp core conserved domain sequence data packet are extracted, where each sequence ID is an NCBI accession number such as YP_009666929.1. The corresponding GenPept (full) format files are downloaded in batch with NCBI Batch Entrez, and the classification information to which each sequence belongs is integrated with the Python script based on the NCBI annotation information provided in the GenPept (full) format files.
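Steps 206 and 207 can be illustrated with a small parser that reads the ORGANISM block of each GenPept flat-file record. This is a sketch, not the script used in the embodiment: the function name is hypothetical, and it assumes the usual flat-file layout in which the lineage follows the `ORGANISM` line on continuation lines indented by twelve spaces and records end with `//`.

```python
def parse_genpept(text):
    """Return {accession.version: {"organism": str, "lineage": [str, ...]}}.

    Only the VERSION and ORGANISM fields are read; everything else in the
    record is ignored.
    """
    records = {}
    for block in text.split("\n//"):
        accession, organism, lineage_lines = None, None, []
        in_organism = False
        for line in block.splitlines():
            if line.startswith("VERSION"):
                accession = line.split()[1]
                in_organism = False
            elif line.startswith("  ORGANISM"):
                organism = line[12:].strip()
                in_organism = True
            elif in_organism and line.startswith(" " * 12):
                # Continuation line holding part of the taxonomic lineage.
                lineage_lines.append(line.strip())
            else:
                in_organism = False
        if accession:
            joined = " ".join(lineage_lines).rstrip(".")
            taxa = [t.strip() for t in joined.split(";") if t.strip()]
            records[accession] = {"organism": organism, "lineage": taxa}
    return records
```

The resulting dictionary is keyed by accession number, so it can be joined directly to the sequence IDs of the data packet.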
In some embodiments, referring to FIG. 4 and FIG. 6, step 300 of the automatic identification method for species classification of Bunyavirales-related viruses of the present invention includes:
Step 301: installing the BLAST tool in an Ubuntu system with the command "apt install ncbi-blast+", wherein the CDD tool is embedded in the BLAST tool;
Step 302: building the automatic identification web page with Python Flask, using PSSM matrix data downloaded from NCBI as the CDD local library of the automatic identification web page, and using the Bunyavirales RdRp core conserved domain sequence data packet as the BLAST local library of the automatic identification web page.
Specifically, in this embodiment, the BLAST tool is first installed in the local Ubuntu system with the command "apt install ncbi-blast+"; the CDD tool is embedded in the BLAST tool and is run with the "rpsblast+" command. The CDD tool relies on a conserved domain library, which requires the PSSM files provided by CDD; a PSSM (position-specific scoring matrix) is a matrix-format file used to search for conserved regions based on aligned protein sequences. The automatic identification web page is then built with Python Flask. The PSSM matrix data used by CDD is downloaded from NCBI and defined as the CDD local library of the web page, and the Bunyavirales RdRp core conserved domain sequence data packet is defined as the BLAST local library of the automatic identification web page, with MySQL used to store the data packet.
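To make the local setup more concrete, the sketch below assembles the command lines that a back end might pass to `subprocess.run`. It only builds argument lists and runs nothing. All file and database names are hypothetical, the E-value threshold is an arbitrary assumption, and a protein query is assumed. The executable name `rpsblast+` follows the text above; the upstream NCBI distribution names the same program `rpsblast`, so the name is left configurable.

```python
def makeblastdb_cmd(fasta_path, db_name):
    """Command that turns the RdRp data packet into a protein BLAST library."""
    return ["makeblastdb", "-in", fasta_path, "-dbtype", "prot", "-out", db_name]


def rpsblast_cmd(query_path, cdd_db, evalue=0.01, exe="rpsblast+"):
    """Command that screens a query against the local CDD (PSSM) library.

    -outfmt 6 requests tab-separated output, which is easy to parse.
    """
    return [exe, "-query", query_path, "-db", cdd_db,
            "-evalue", str(evalue), "-outfmt", "6"]


def blastp_cmd(query_path, db_name, top_n=10):
    """Command that searches the extracted core domain in the local library."""
    return ["blastp", "-query", query_path, "-db", db_name,
            "-outfmt", "6", "-max_target_seqs", str(top_n)]


if __name__ == "__main__":
    # Hypothetical paths, printed rather than executed.
    print(" ".join(makeblastdb_cmd("bunya_rdrp_core.fasta", "bunya_rdrp")))
    print(" ".join(rpsblast_cmd("query.fasta", "Cdd")))
    print(" ".join(blastp_cmd("query_core.fasta", "bunya_rdrp")))
```

Passing argument lists rather than a shell string avoids quoting problems when the query file name comes from a user upload.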
Optionally, in the automatic identification method of the embodiment of the present invention, the classification information of the similar sequences includes the NCBI accession number, the similarity, the genome coverage, and the classification information to which each sequence belongs.
FIG. 7 shows the functional implementation process, which includes the following steps:
(1) The user submits a virus sequence through the web page, and the back end performs a preliminary screening with the CDD tool to determine whether the genome contains a Bunya_RdRp core conserved domain. If the preliminary screening result indicates a bunyavirus, the Bunya_RdRp core conserved domain sequence is obtained.
(2) The obtained Bunya_RdRp core conserved domain sequence is searched with BLAST against the local library built from the Bunyavirales RdRp core conserved domain sequence data packet to obtain the sequence ID, similarity, genome coverage and corresponding classification information of the 10 sequences with the highest sequence similarity.
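The result handling in step (2) can be sketched as follows, assuming the BLAST search was run with the default tabular output (`-outfmt 6`), whose twelve columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue and bitscore. The function name, the coverage formula (aligned query span divided by query length) and the `taxonomy` lookup table are assumptions made for illustration; the text does not specify how coverage is computed.

```python
def top_hits(blast_tsv, query_len, taxonomy, n=10):
    """Return the n most similar subject sequences with their annotations.

    `taxonomy` maps accession -> classification string, for example the
    lineage collected from the GenPept files.
    """
    best = {}
    for line in blast_tsv.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        accession = fields[1]
        identity = float(fields[2])
        qstart, qend = int(fields[6]), int(fields[7])
        coverage = 100.0 * (abs(qend - qstart) + 1) / query_len
        hit = {
            "accession": accession,
            "identity": identity,
            "coverage": coverage,
            "classification": taxonomy.get(accession, "unknown"),
        }
        # Keep only the most similar alignment for each subject sequence.
        if accession not in best or identity > best[accession]["identity"]:
            best[accession] = hit
    ranked = sorted(best.values(), key=lambda h: h["identity"], reverse=True)
    return ranked[:n]
```

Ranking by percent identity follows the wording "highest sequence similarity"; ranking by bit score or E-value would be an equally plausible reading and needs only a different sort key.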
Optionally, the automatic identification method of the embodiment of the present invention further includes a front-end interface with two main parts, data upload and result output:
(1) Web page name: Bunyavirus Classifying Tool;
(2) Data upload: a sequence entry box and a "select file" link, so that the user can submit a sequence either by pasting it or by uploading a file;
a "Submit" link, which starts the run;
(3) Results interface: the genome RdRp annotation map, the RdRp interval, and the ten sequences with the highest similarity together with their annotation information.
In another aspect, referring to Fig. 8, an embodiment of the present invention further provides an automatic identification system 1 for genus and species classification of Bunyavirales-related viruses, including:
a data acquisition module 10 configured to acquire FASTA format files of all protein genome sequences of Bunyavirales-related viruses from a GenBank database linked to the Taxonomy database;
a data packet generation module 20 configured to batch annotate the FASTA format files with the CDD tool based on NCBI accession numbers and to generate a Bunyavirales RdRp core conserved domain sequence data packet;
an identification web page construction module 30 configured to install a BLAST tool and the CDD tool and to build an automatic identification web page, taking the Bunyavirales RdRp core conserved domain sequence data packet as the BLAST local library of the automatic identification web page;
a user input module 40 configured to obtain a virus sequence to be identified submitted by a user on the automatic identification web page;
a preliminary screening module 50 configured to screen the virus sequence to be identified using the CDD tool and, if the screening result is a bunyavirus, to extract a Bunyavirales RdRp core conserved domain sequence to be identified from the virus sequence to be identified;
and an identification module 60 configured to obtain classification information of a plurality of similar sequences from the BLAST local library using the BLAST tool according to the Bunyavirales RdRp core conserved domain sequence to be identified.
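As a rough illustration of how modules 50 and 60 might call the installed tools, the Python sketch below builds BLAST+ command lines and slices a domain from its reported interval. It assumes a protein query (hence `rpsblast` and `blastp`), an E-value cutoff of 0.01, and default tabular output; the function names, file names, and database names are hypothetical and are not specified by the embodiment.

```python
from typing import List


def rpsblast_command(query_fasta: str, cdd_db: str, out_path: str) -> List[str]:
    """Command line for the CDD screen, assuming a protein query."""
    return ["rpsblast", "-query", query_fasta, "-db", cdd_db,
            "-evalue", "0.01", "-outfmt", "6", "-out", out_path]


def blastp_command(query_fasta: str, local_db: str, out_path: str) -> List[str]:
    """Command line for the similarity search against the local library."""
    return ["blastp", "-query", query_fasta, "-db", local_db,
            "-outfmt", "6", "-max_target_seqs", "10", "-out", out_path]


def extract_domain(sequence: str, start: int, end: int) -> str:
    """Slice a domain given 1-based, inclusive coordinates as BLAST reports them."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("interval lies outside the sequence")
    return sequence[start - 1:end]
```

Each command list can be passed to `subprocess.run(cmd, check=True)`. Building the commands as lists rather than shell strings avoids quoting problems with user-supplied file names.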
The specific details of each module of the above automatic identification system for genus and species classification of Bunyavirales-related viruses have already been described in the corresponding automatic identification method, and are therefore not repeated here.
In still another aspect, an embodiment of the present invention further provides an electronic device, including a processor and a memory, wherein the memory stores computer-readable instructions which, when executed by the processor, implement the automatic identification method for genus and species classification of Bunyavirales-related viruses according to the above embodiment.
In particular, the memory and the processor can be a general-purpose memory and processor, which are not limited herein; when the processor executes the computer-readable instructions stored in the memory, the automatic identification method for genus and species classification of Bunyavirales-related viruses described in the above embodiments is carried out.
In yet another aspect, an embodiment of the present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the automatic identification method for genus and species classification of Bunyavirales-related viruses according to the above embodiment.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may include a flash disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disk, and the like.
It should be noted that the foregoing detailed description is exemplary and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present application. As used herein, the singular is intended to include the plural unless the context clearly indicates otherwise. Furthermore, it will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, steps, operations, devices, components, and/or groups thereof.
It should be noted that the terms "first," "second," and the like in the description and claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or otherwise described herein.
Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those elements but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Spatially relative terms, such as "above … …," "above … …," "upper surface at … …," "above," and the like, may be used herein for ease of description to describe one device or feature's spatial location relative to another device or feature as illustrated in the figures. It will be understood that the spatially relative terms are intended to encompass different orientations in use or operation in addition to the orientation depicted in the figures. For example, if the device in the figures is turned over, elements described as "above" or "over" other devices or structures would then be oriented "below" or "beneath" the other devices or structures. Thus, the exemplary term "above … …" may include both orientations of "above … …" and "below … …". The device may also be positioned in other different ways, such as rotated 90 degrees or at other orientations, and the spatially relative descriptors used herein interpreted accordingly.
In the above detailed description, reference is made to the accompanying drawings, which form a part hereof. In the drawings, like numerals typically identify like components unless context indicates otherwise. The illustrated embodiments described in the detailed description, drawings, and claims are not meant to be limiting. Other embodiments may be utilized, and other changes may be made, without departing from the spirit or scope of the subject matter presented herein.
The above description covers only the preferred embodiments of the present invention and is not intended to limit the present invention; those skilled in the art may make various modifications and variations to it. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.
Claims (10)
1. An automatic identification method for genus and species classification of Bunyavirales-related viruses, comprising:
obtaining FASTA format files of all protein genome sequences of Bunyavirales-related viruses from a GenBank database linked to a Taxonomy database;
performing batch annotation on the FASTA format files by using a CDD tool based on NCBI accession numbers, and generating a Bunyavirales RdRp core conserved domain sequence data packet;
installing a BLAST tool and the CDD tool, constructing an automatic identification web page, and taking the Bunyavirales RdRp core conserved domain sequence data packet as a BLAST local library of the automatic identification web page;
acquiring a virus sequence to be identified submitted by a user on the automatic identification web page;
screening the virus sequence to be identified by using the CDD tool, and, if the screening result is a bunyavirus, extracting a Bunyavirales RdRp core conserved domain sequence to be identified from the virus sequence to be identified;
and obtaining classification information of a plurality of similar sequences from the BLAST local library by using the BLAST tool according to the Bunyavirales RdRp core conserved domain sequence to be identified.
2. The automatic identification method for genus and species classification of Bunyavirales-related viruses as claimed in claim 1, wherein the step of obtaining FASTA format files of all protein genome sequences of Bunyavirales-related viruses from a GenBank database linked to a Taxonomy database comprises:
searching "bunyavirales" in the Taxonomy database to obtain a list containing N protein genome sequences;
and downloading the FASTA format file of each protein genome sequence in the list by linking from the Taxonomy database to the GenBank database.
3. The automatic identification method for genus and species classification of Bunyavirales-related viruses as claimed in claim 2, wherein the step of performing batch annotation on the FASTA format files by using a CDD tool based on NCBI accession numbers and generating a Bunyavirales RdRp core conserved domain sequence data packet comprises:
screening out standard RdRp sequences and potential RdRp sequences from the FASTA format files of the N protein genome sequences based on the sequence names in the FASTA format files;
performing batch annotation on the standard RdRp sequences and the potential RdRp sequences by using the CDD tool based on the NCBI accession numbers to obtain an annotation result, and finding the corresponding Bunyavirales RdRp core conserved domain intervals according to the annotation result;
and processing the corresponding Bunyavirales RdRp core conserved domain intervals by using a Python script to obtain the Bunyavirales RdRp core conserved domain sequence data packet.
4. The automatic identification method for genus and species classification of Bunyavirales-related viruses as claimed in claim 3, wherein the step of performing batch annotation on the standard RdRp sequences and the potential RdRp sequences by using the CDD tool based on the NCBI accession numbers to obtain an annotation result, and finding the corresponding Bunyavirales RdRp core conserved domain intervals according to the annotation result, further comprises:
if no corresponding Bunyavirales RdRp core conserved domain interval is found for a standard RdRp sequence, obtaining the NCBI-annotated RdRp coding region interval of the standard RdRp sequence;
and obtaining the corresponding Bunyavirales RdRp core conserved domain interval from the NCBI-annotated RdRp coding region interval of the standard RdRp sequence by using the BLAST tool or by alignment with sequences of viruses of the same family.
5. The automatic identification method for genus and species classification of Bunyavirales-related viruses as claimed in claim 3, wherein the step of processing the corresponding Bunyavirales RdRp core conserved domain intervals by using a Python script to obtain the Bunyavirales RdRp core conserved domain sequence data packet further comprises:
extracting the sequence IDs in the Bunyavirales RdRp core conserved domain sequence data packet, wherein each sequence ID is an NCBI accession number;
batch downloading the corresponding GenPept format files according to the sequence IDs by using NCBI Batch Entrez;
extracting the classification information to which each sequence belongs from the NCBI annotation information of the downloaded GenPept format files;
and integrating the extracted classification information of the sequences by using the Python script.
6. The automatic identification method for genus and species classification of Bunyavirales-related viruses as claimed in claim 1, wherein the step of installing a BLAST tool and the CDD tool and constructing an automatic identification web page comprises:
installing the BLAST tool on an Ubuntu system using the command "apt install ncbi-blast+", wherein the CDD tool is embedded in the BLAST tool;
and building the automatic identification web page using Python Flask, using PSSM matrix data downloaded from NCBI as the CDD local library of the automatic identification web page, and using the Bunyavirales RdRp core conserved domain sequence data packet as the BLAST local library of the automatic identification web page.
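7. The automatic identification method for genus and species classification of Bunyavirales-related viruses as claimed in claim 1, wherein the classification information of the similar sequences includes the NCBI accession number, the similarity, the genome coverage, and the classification to which each sequence belongs.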
8. An automatic identification system for genus and species classification of Bunyavirales-related viruses, comprising:
a data acquisition module configured to acquire FASTA format files of all protein genome sequences of Bunyavirales-related viruses from a GenBank database linked to a Taxonomy database;
a data packet generation module configured to perform batch annotation on the FASTA format files by using a CDD tool based on NCBI accession numbers and to generate a Bunyavirales RdRp core conserved domain sequence data packet;
an identification web page construction module configured to install a BLAST tool and the CDD tool and to build an automatic identification web page, taking the Bunyavirales RdRp core conserved domain sequence data packet as a BLAST local library of the automatic identification web page;
a user input module configured to acquire a virus sequence to be identified submitted by a user on the automatic identification web page;
a preliminary screening module configured to screen the virus sequence to be identified by using the CDD tool and, if the screening result is a bunyavirus, to extract a Bunyavirales RdRp core conserved domain sequence to be identified from the virus sequence to be identified;
and an identification module configured to obtain classification information of a plurality of similar sequences from the BLAST local library by using the BLAST tool according to the Bunyavirales RdRp core conserved domain sequence to be identified.
9. An electronic device, comprising: a processor and a memory having stored thereon computer-readable instructions which, when executed by the processor, implement the automatic identification method for genus and species classification of Bunyavirales-related viruses as claimed in any one of claims 1 to 7.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the automatic identification method for genus and species classification of Bunyavirales-related viruses as claimed in any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111610163.8A CN114334010B (en) | 2021-12-27 | 2021-12-27 | Automatic identification method and system for classification of bunyas related virus species |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114334010A (en) | 2022-04-12 |
CN114334010B (en) | 2024-03-22 |
Family
ID=81013606
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111610163.8A Active CN114334010B (en) | 2021-12-27 | 2021-12-27 | Automatic identification method and system for classification of bunyas related virus species |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114334010B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103209706A (en) * | 2010-09-20 | 2013-07-17 | 农业研究基金会 | Methods to produce bunyavirus replicon particles |
CN112725534A (en) * | 2021-01-25 | 2021-04-30 | 中国疾病预防控制中心病毒病预防控制所 | Primer probe, target combination, kit and method for detecting karya virus, hazara virus and epstein-barr virus |
CN112863599A (en) * | 2021-03-12 | 2021-05-28 | 南开大学 | Automatic analysis method and system for virus sequencing sequence |
CN113539378A (en) * | 2021-07-16 | 2021-10-22 | 明科生物技术(杭州)有限公司 | Data analysis method, system, equipment and storage medium of virus database |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
RS57264B1 (en) * | 2013-05-21 | 2018-08-31 | Stichting Wageningen Res | Bunyaviruses with segmented glycoprotein precursor genes and methods for generating these viruses |
Non-Patent Citations (2)
Title |
---|
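Overview of the new classification of the order Bunyavirales; 唐霜; 沈姝; 史君明; 方耀辉; 王华林; 胡志红; 邓菲; Biodiversity Science (生物多样性); 2018-08-28 (09) *
Alignment analysis of whole-genome sequences of the family Bunyaviridae; 刘雅婷; 张文超; 李正跃; 李成云; 朱有勇; 李永忠; Journal of Yunnan Agricultural University (云南农业大学学报); 2008-05-15 (03) *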
Also Published As
Publication number | Publication date |
---|---|
CN114334010A (en) | 2022-04-12 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||