CN116013420A

CN116013420A - Virulence factor database construction method, device, equipment and medium

Info

Publication number: CN116013420A
Application number: CN202310105444.0A
Authority: CN
Inventors: 张智; 周晴; 霍彩琴
Original assignee: CapitalBio Corp
Current assignee: CapitalBio Corp
Priority date: 2023-02-13
Filing date: 2023-02-13
Publication date: 2023-04-25

Abstract

A virulence factor database construction method, device, equipment and medium, relating to the technical field of bioinformatics analysis. The method comprises the following steps: acquiring an initial sequence; adding species annotation information to the initial sequence; determining the gene name of the initial sequence; adding virulence gene type annotation information to the initial sequence; and constructing a virulence factor database based on the initial sequence, the species annotation information, the gene name, and the virulence gene type annotation information. In this way, a microbial virulence gene database with comprehensive sequences, accurate and complete information, a standardized annotation procedure and standardized terminology can be constructed. Furthermore, the virulence factor database can provide basic data for clinical pathogen virulence detection and virulence gene research.

Description

Virulence factor database construction method, device, equipment and medium

Technical Field

The application relates to the technical field of bioinformatics analysis, and in particular to a virulence factor database construction method, device, equipment and medium.

Background

Virulence factors are a class of important gene products that help pathogens escape host defense mechanisms and thereby cause disease in the host. Pathogenic bacteria rely on the interaction of a series of virulence factors to infect the host and to propagate and spread in the host environment. Therefore, studying the virulence factors of pathogenic bacteria and constructing an accurate, comprehensive and continuously updated virulence factor database is of great value for drug development, clinical medication and the like.

Existing databases currently include the VFDB database, the Victors database, the DFVF database, and the like. However, these databases suffer from incomplete sequence coverage and from annotation information that is neither comprehensive nor accurate enough.

Disclosure of Invention

The application provides a virulence factor database construction method, device, equipment and medium, which can construct a virulence factor database with comprehensive sequences and comprehensive and accurate annotation information.

The application discloses the following technical scheme:

in a first aspect, the present application discloses a method for constructing a virulence factor database, the method comprising:

acquiring an initial sequence;

adding species annotation information to the initial sequence;

determining the gene name of the initial sequence;

adding virulence gene type annotation information to the initial sequence;

constructing a virulence factor database based on the initial sequence, the species annotation information, the gene name, and the virulence gene type annotation information.

Preferably, the initial sequence comprises an initial nucleic acid sequence and an initial protein sequence; the acquiring the initial sequence includes:

acquiring a virulence gene sequence, wherein the virulence gene sequence comprises a virulence gene nucleic acid sequence and a virulence gene protein sequence;

retaining one of a plurality of virulence gene nucleic acid sequences whose comparison consistency is higher than a standard value, so as to obtain an initial nucleic acid sequence;

and obtaining an initial protein sequence corresponding to the initial nucleic acid sequence within the range of the virulence gene sequence.

Preferably, the method further comprises:

deleting, within the range of the virulence gene protein sequence, the virulence gene protein sequences whose comparison consistency with the initial protein sequence is higher than a standard value;

and adding the virulence gene protein sequences that remain after the deletion to the initial protein sequence.

Preferably, the adding species annotation information to the initial sequence includes:

acquiring a first comparison result of the initial sequence and a genome database;

retaining, among the first comparison results, those whose comparison consistency rate is greater than a first preset threshold;

setting the initial sequence corresponding to the reserved first comparison result as a first filtering sequence;

species annotation information is added to the first filter sequence.

Preferably, the acquiring a first comparison result of the initial sequence and a genome database includes:

comparing the initial sequence with the genome database to output comparison intermediate values;

and acquiring the comparison intermediate values that are smaller than the comparison expected value as the first comparison result.

Preferably, the determining the gene name of the initial sequence includes:

obtaining a second comparison result of the initial sequence and a gene set database;

retaining, among the second comparison results, those whose comparison consistency rate is greater than a second preset threshold;

setting the initial sequence corresponding to the reserved second comparison result as a second filtering sequence;

determining the gene name of the second filter sequence.

Preferably, the initial sequence is from one or more of a VFDB database, a Victors database, a DFVF database, a PATRIC database.

In a second aspect, the present application discloses a virulence factor database construction device, the device comprising: the system comprises an acquisition module, a first information module, a second information module, a third information module and a construction module;

the acquisition module is used for acquiring an initial sequence;

the first information module is used for adding species annotation information to the initial sequence;

the second information module is used for determining the gene name of the initial sequence;

the third information module is used for adding virulence gene type annotation information to the initial sequence;

the construction module is used for constructing a virulence factor database based on the initial sequence, the species annotation information, the gene name and the virulence gene type annotation information.

In a third aspect, the present application discloses a virulence factor database construction apparatus comprising: a memory and a processor;

the memory is used for storing programs;

the processor is configured to execute the program to implement the steps of the method according to the first aspect.

In a fourth aspect, the present application discloses a computer storage medium having stored thereon a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method according to the first aspect.

Compared with the prior art, the application has the following beneficial effects:

The application provides a virulence factor database construction method, device, equipment and medium, wherein after an initial sequence is acquired, species annotation information and virulence gene type annotation information are added to the initial sequence, and the gene name of the initial sequence is determined, so that the virulence factor database is constructed based on the initial sequence, the species annotation information, the gene name and the virulence gene type annotation information. In this way, a microbial virulence gene database with comprehensive sequences, accurate and complete information, a standardized annotation procedure and standardized terminology can be constructed. Furthermore, basic data can be provided for clinical pathogen virulence detection and virulence gene research.

Drawings

In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and a person skilled in the art may obtain other drawings from them without inventive effort.

FIG. 1 is a flowchart of a method for constructing a virulence factor database according to an embodiment of the present application;

FIG. 2 is a flowchart of a method for adding species annotation information according to an embodiment of the present application;

FIG. 3 is a flowchart of a method for determining a name of a gene according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a virulence factor database construction device according to an embodiment of the present application.

Detailed Description

Technical terms related to the present application are described first.

Existing databases currently include the VFDB database, the Victors database, the DFVF database, and the like. Among them, the VFDB database is used to manage information about virulence factors of bacterial pathogens and is updated weekly; as of December 27, 2022, it includes 32439 virulence gene sequences from 32 species, divided into 14 major classes. The Victors database has recorded 5296 virulence gene sequences, including 4648 bacterial gene sequences, 179 viral gene sequences, 105 parasite gene sequences and 364 fungal gene sequences, corresponding to 51 bacterial species, 54 viruses, 13 parasites and 8 fungal species.

However, the databases described above all suffer from the following drawbacks. First, the sequence coverage is incomplete. The sequences covered by each existing database are only a subset of all known virulence sequences; for example, the VFDB database only contains virulence factor genes of bacterial species, the DFVF database only contains virulence factor genes of fungal species, and so on, and redundancy may arise if the sequences in the individual databases are directly integrated. Second, the annotation information is not sufficiently comprehensive and accurate. The species information of a virulence gene in the existing databases is only the species in which the virulence gene was discovered, not all species that carry the virulence gene. Third, the annotated virulence gene type entries are not uniform, which results in clutter.

In view of the above drawbacks, the present application provides a method, an apparatus, a device, and a medium for constructing a virulence factor database, wherein after an initial sequence is obtained, species annotation information and virulence gene type annotation information are added to the initial sequence, and a gene name of the initial sequence is determined, so that the virulence factor database is constructed based on the initial sequence, the species annotation information, the gene name and the virulence gene type annotation information. Therefore, a microbial virulence gene database with complete sequence, accurate and complete information, standardized annotation programs and terms can be constructed. Furthermore, basic data can be provided for clinical pathogen virulence detection and virulence gene research.

In order to make the present application solution better understood by those skilled in the art, the following description will clearly and completely describe the technical solution in the embodiments of the present application with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.

Referring to fig. 1, a flowchart of a virulence factor database construction method is provided in an embodiment of the present application. The method comprises the following steps:

S101: A virulence gene sequence set is acquired.

The virulence gene sequence set includes a virulence gene nucleic acid sequence set and a virulence gene protein sequence set. A plurality of virulence gene nucleic acid sequences together form the virulence gene nucleic acid sequence set. A virulence gene nucleic acid sequence is the base sequence of the virulence gene, namely the corresponding DNA sequence; for example, each letter in the virulence gene nucleic acid sequence ATCGGGATTC…… represents one base. A plurality of virulence gene protein sequences together form the virulence gene protein sequence set. A virulence gene protein sequence is the amino acid sequence of the virulence gene, namely the corresponding protein sequence; for example, each letter in the virulence gene protein sequence MNRREFLA…… represents one amino acid.

In some specific implementations, the virulence gene sequence set may be obtained automatically from one or more of the VFDB database, the Victors database, the DFVF database and the PATRIC database (collectively referred to as the source databases), or may be entered manually by a person skilled in the art. When a new database appears, new virulence gene nucleic acid sequences and new virulence gene protein sequences can be obtained from it, so that a new virulence gene sequence set is constructed.

S102: Redundancy in the virulence gene nucleic acid sequence set is removed to obtain an initial nucleic acid sequence set.

Within the range of the virulence gene nucleic acid sequence set obtained in step S101, the virulence gene nucleic acid sequences are compared with one another to remove redundancy, that is, repeated virulence gene nucleic acid sequences are deleted. The virulence gene nucleic acid sequence set after redundancy removal is the initial nucleic acid sequence set.

In the virulence factor database construction method disclosed in the application, the software used for removing redundancy is CD-HIT, and the main purpose is to cluster completely identical virulence gene nucleic acid sequences together. The parameters used to remove redundancy in CD-HIT are: cd-hit-est -G 1 -aL 1 -aS 1 -c 1 -r 1, wherein -G 1 means global alignment, -aL 1 means that the alignment coverage of the longer sequence is 100%, -aS 1 means that the alignment coverage of the shorter sequence is 100%, -c 1 means that the alignment consistency is 100%, i.e. identical, and -r 1 means that both the forward and the reverse-complement strands are aligned during clustering.
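As an illustration only, the parameter set described above could be assembled into a command line as follows. This is a minimal sketch: the function name and the file paths are hypothetical, and the `-i`/`-o` input and output options are the usual cd-hit-est options rather than something stated in this application. The command is only built here, not executed.

```python
from typing import List


def build_cd_hit_est_command(input_fasta: str, output_fasta: str,
                             identity: float = 1.0) -> List[str]:
    """Return an argument list for cd-hit-est with the parameters described above.

    input_fasta / output_fasta are hypothetical paths supplied by the caller.
    """
    if not 0.0 < identity <= 1.0:
        raise ValueError("identity must be in the interval (0, 1]")
    return [
        "cd-hit-est",
        "-i", input_fasta,        # input FASTA (assumed option, not from the text)
        "-o", output_fasta,       # output FASTA (assumed option, not from the text)
        "-G", "1",                # global alignment
        "-aL", "1",               # 100% alignment coverage of the longer sequence
        "-aS", "1",               # 100% alignment coverage of the shorter sequence
        "-c", f"{identity:g}",    # consistency threshold; 1 means identical
        "-r", "1",                # align both strands
    ]
```

The resulting list can be passed to `subprocess.run` on a machine where CD-HIT is installed; a different standard value such as 0.95 can be supplied through the `identity` argument.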

It should be noted that the above redundancy removal parameter is to remove redundancy for completely identical sequences, i.e., the redundancy removal operation is performed only when the comparison of the nucleic acid sequences of the plurality of virulence genes is 100% identical to obtain the initial nucleic acid sequence set. In practical applications, when the comparison consistency of the two sequences reaches other standard values, the redundancy removing operation may be performed to obtain the initial nucleic acid sequence set, that is, the skilled person may empirically set other standard values, for example, 95%, 90%, etc., and the application is not limited to specific standard values. Also, redundancy may be removed by clustering or the like, and is not limited to the application of CD-hit software. The present application is not limited to a specific redundancy removal method.
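Since the application is not limited to CD-HIT, the 100%-identity case can also be sketched in plain Python. The following is an assumption-laden illustration rather than the disclosed implementation: it treats two nucleic acid sequences as redundant only when they are exactly identical or exact reverse complements of each other, keeps the first one seen, and uses hypothetical function names.

```python
from typing import Dict

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_form(seq: str) -> str:
    """Strand-independent key: the smaller of a sequence and its reverse complement."""
    upper = seq.upper()
    return min(upper, reverse_complement(upper))


def remove_identical_sequences(records: Dict[str, str]) -> Dict[str, str]:
    """Keep the first record of every group of identical sequences (either strand)."""
    kept: Dict[str, str] = {}
    seen = set()
    for seq_id, seq in records.items():
        key = canonical_form(seq)
        if key in seen:
            continue  # redundant with a sequence that was already kept
        seen.add(key)
        kept[seq_id] = seq
    return kept
```

Lower standard values such as 95% or 90% require a real alignment or clustering tool; exact matching as above covers only the identical case.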

S103: an initial set of protein sequences corresponding to the initial set of nucleic acid sequences is obtained.

After the initial nucleic acid sequence set is obtained in step S102, the initial protein sequence corresponding to each initial nucleic acid sequence in the initial nucleic acid sequence set is obtained, and these initial protein sequences form the initial protein sequence set, which is labeled protein_set_a.

In a specific implementation, if a source database such as the VFDB database, the Victors database, the DFVF database or the PATRIC database only provides the virulence gene protein sequence set but not the virulence gene nucleic acid sequence set, that is, the virulence gene nucleic acid sequence set is not obtained in step S101, the complete initial protein sequence set cannot be obtained in the above manner. In this case, if the initial protein sequence set protein_set_a already exists, the virulence gene protein sequence set from the source database may be compared with protein_set_a to remove redundancy, and the virulence gene protein sequences remaining after redundancy removal may be added to protein_set_a. If protein_set_a does not exist, the virulence gene protein sequences within the virulence gene protein sequence set obtained in S101 are compared with one another to remove redundancy, and the virulence gene protein sequence set after redundancy removal is the initial protein sequence set protein_set_a.

In some specific implementations, after the initial nucleic acid sequence set and the initial protein sequence set (hereinafter collectively referred to as the initial sequence set) are obtained, each sequence (or each pair of sequences) therein may be assigned an accession ID. For example, the accession ID may be CVFDB_S000001, from which it can be known that the initial nucleic acid sequence and the corresponding initial protein sequence under this accession ID are both from the VFDB database; in the database disclosed in the present invention, the initial nucleic acid sequence ID is CVFDB_SN00001 and the initial protein sequence ID is CVFDB_SP00001.
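The two cases described above (an existing protein_set_a, or none yet) can be sketched with a single merge function. This is a hypothetical illustration that uses exact, case-insensitive sequence identity as the redundancy criterion; the function name and the choice to reject duplicate identifiers are assumptions.

```python
from typing import Dict


def merge_into_protein_set(protein_set_a: Dict[str, str],
                           source_proteins: Dict[str, str]) -> Dict[str, str]:
    """Add source protein sequences that are not already present in protein_set_a.

    If protein_set_a is empty, the result is simply the de-duplicated source set.
    """
    merged = dict(protein_set_a)
    existing = {seq.upper() for seq in merged.values()}
    for protein_id, seq in source_proteins.items():
        if protein_id in merged:
            raise ValueError(f"duplicate identifier: {protein_id}")
        key = seq.upper()
        if key in existing:
            continue  # redundant: an identical sequence is already in the set
        existing.add(key)
        merged[protein_id] = seq
    return merged
```

Passing an empty dictionary as `protein_set_a` corresponds to the case in which no initial protein sequence set exists yet.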
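The ID scheme in the example above can be generated mechanically. The sketch below infers the digit widths from the three example IDs only (six digits after "S", five after "SN" and "SP") and writes the prefix in the upper-case form used in the later examples; the function name and the upper bound are assumptions.

```python
from typing import Dict


def make_accessions(index: int) -> Dict[str, str]:
    """Build the entry, nucleic acid and protein IDs for the index-th entry."""
    if not 1 <= index <= 99999:
        # Five digits are available in the SN/SP IDs, which bounds this sketch.
        raise ValueError("index must be between 1 and 99999")
    return {
        "entry": f"CVFDB_S{index:06d}",
        "nucleic_acid": f"CVFDB_SN{index:05d}",
        "protein": f"CVFDB_SP{index:05d}",
    }
```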

S104: species annotation information is added to the initial sequence set.

In the virulence factor database construction method disclosed in the present application, species annotation information may be added directly to the initial sequence set, or the initial sequence set may be filtered first and the species annotation information then added to the filtered sequence set. Referring to fig. 2, a flowchart of a method for adding species annotation information is provided in an embodiment of the present application. The method comprises the following steps:

S1041: The initial sequence set is aligned to a genome database to obtain a first comparison result.

The initial sequence set is aligned with the data in the NCBI microbial genome database (including bacteria, fungi, parasites and viruses) to obtain the first comparison result. It should be noted that the initial sequence set may be aligned not only with the data in the NCBI microbial genome database but also with any database that can provide reference genome and species information, such as Ensembl; the specific comparison object is not limited in this application.

In some specific implementations, the software used for the alignment may be blastn, i.e., the initial sequence set is aligned with the data in the NCBI microbial genome database using blastn and the first comparison result is output.

Illustratively, the parameters used for the alignment in blastn may be: blastn -task megablast -e 1e-10. Here, -task selects the "megablast" program in blastn, and -e 1e-10 sets the comparison expected value. The expected value measures the reliability of a comparison result: the smaller the expected value, the more reliable the result. Comparison results with an expected value less than or equal to 1e-10 are therefore output, and comparison results with a larger expected value are filtered out. In this way, the initial nucleic acid sequences and initial protein sequences that cannot be aligned to the NCBI microbial genome database are filtered out.

S1042: The first comparison result is filtered again, and the first comparison results whose comparison consistency rate is greater than a first preset threshold are retained.

Among the output first comparison results, the full-length comparison results with identity > 95% can be retained. Here, identity refers to the comparison consistency rate; for example, identity = 95% means that 95% of the aligned region is completely identical. It should be noted that the first preset threshold may be 95% as described above, or may be set by a person skilled in the art according to experience, for example 90%, 85%, etc.; the present application does not limit the specific first preset threshold.

It should be noted that, if there are multiple comparison results for one initial nucleic acid sequence or one initial protein sequence, the comparison result with the highest identity among them may be retained.

Thus, the initial nucleic acid sequences and initial protein sequences that align to the NCBI database but not over their full length, or whose identity does not exceed the first preset threshold, are also filtered out.
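The filtering in steps S1041 and S1042 can be sketched as follows. Several points are assumptions not stated in this application: the comparison results are taken to be in BLAST's 12-column tabular format, "full-length" is interpreted as the alignment spanning the query from position 1 to its last position, and all names are hypothetical.

```python
from typing import Dict, Iterable, List, NamedTuple


class Hit(NamedTuple):
    query_id: str
    subject_id: str
    identity: float   # percent identity
    query_start: int
    query_end: int
    evalue: float


def parse_tabular_line(line: str) -> Hit:
    """Parse one line of 12-column tabular output (assumed format)."""
    f = line.rstrip("\n").split("\t")
    return Hit(f[0], f[1], float(f[2]), int(f[6]), int(f[7]), float(f[10]))


def filter_hits(hits: Iterable[Hit], query_lengths: Dict[str, int],
                max_evalue: float = 1e-10,
                min_identity: float = 95.0) -> Dict[str, List[Hit]]:
    """Keep full-length hits with evalue <= max_evalue and identity > min_identity."""
    kept: Dict[str, List[Hit]] = {}
    for hit in hits:
        if hit.evalue > max_evalue or hit.identity <= min_identity:
            continue
        length = query_lengths.get(hit.query_id)
        # Assumed definition of a full-length alignment.
        if length is None or hit.query_start != 1 or hit.query_end != length:
            continue
        kept.setdefault(hit.query_id, []).append(hit)
    return kept


def best_hit_per_query(kept: Dict[str, List[Hit]]) -> Dict[str, Hit]:
    """Optionally reduce each query to its highest-identity hit."""
    return {q: max(hs, key=lambda h: h.identity) for q, hs in kept.items()}
```

Queries that have no surviving hit are absent from the result, which corresponds to removing those initial sequences from the first filtering sequence set.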

S1043: The initial sequences corresponding to the filtered first comparison results are set as a first filtering sequence set, and species annotation information is added to the sequences in the first filtering sequence set.

The initial sequences corresponding to the filtered first comparison results are set as the first filtering sequence set. Based on the data of the NCBI microbial genome database, the number of genomes and the number of species to which each nucleic acid sequence and each protein sequence in the first filtering sequence set aligns are obtained, and the specific genome names and species names are extracted for species information annotation.

Illustratively, a megablast alignment can be performed between the CVFDB_SN00001 sequence and the NCBI microbial genome database. After the comparison results are filtered and the full-length alignments with 100% identity are retained, the sequence aligns to a total of 90 genomes, such as GCA_000341385.1, GCA_000302135.1 and GCA_021826825.1, which are annotated to 2 species: Acinetobacter_baumannii and Proteobacteria_bacterium.
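The species annotation of step S1043 amounts to collecting, for each sequence, the distinct genomes it aligns to and the distinct species those genomes belong to. A minimal sketch follows; the genome-to-species lookup table is assumed to have been prepared from the genome database metadata, and all names are hypothetical.

```python
from typing import Dict, Iterable, List, Union

Annotation = Dict[str, Union[int, List[str]]]


def annotate_species(genome_hits: Dict[str, Iterable[str]],
                     genome_species: Dict[str, str]) -> Dict[str, Annotation]:
    """Summarise genomes and species for every sequence in the filtered set.

    genome_hits maps a sequence ID to the genome accessions it aligned to;
    genome_species maps a genome accession to its species name (assumed input).
    """
    result: Dict[str, Annotation] = {}
    for seq_id, genomes in genome_hits.items():
        unique_genomes = sorted(set(genomes))
        species = sorted({genome_species[g] for g in unique_genomes
                          if g in genome_species})
        result[seq_id] = {
            "genome_count": len(unique_genomes),
            "species_count": len(species),
            "genomes": unique_genomes,
            "species": species,
        }
    return result
```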

S105: the gene name of the initial sequence is determined.

In the virulence factor database construction method disclosed in the present application, the gene name of the initial sequence set may be directly determined, the initial sequence set may be filtered first, then the gene name of the filtered sequence set may be determined, the gene name of the first filtered sequence set in step S1043 may be directly determined, or the first filtered sequence set may be filtered first, and then the gene name of the filtered sequence set may be determined.

Referring to fig. 3, a flowchart of a method for determining a name of a gene according to an embodiment of the present application is shown. The method comprises the following steps:

S1051: The initial sequence is aligned to a gene set database to obtain a second comparison result.

In some specific implementations, the initial sequence may be aligned to the sequences in the gene set database using the blat software to obtain the second comparison result. Illustratively, the parameter used for the alignment in blat may be: -out=blast8.

S1052: The second comparison result is filtered again, and the second comparison results whose comparison consistency rate is greater than a second preset threshold are retained.

Next, the second comparison result is filtered again, retaining the full-length alignments with e_value < 1e-20 and identity = 100%. It should be noted that the comparison results whose comparison consistency rate is greater than the second preset threshold are retained, and the second preset threshold may be set by a person skilled in the art according to experience, for example 100%, 99%, etc.; the application does not limit the specific second preset threshold.

S1053: The initial sequences corresponding to the filtered second comparison results are set as a second filtering sequence set, and the gene names of the sequences in the second filtering sequence set are determined.

Then, for the retained second comparison results, the annotation information of the aligned target sequences is looked up in the GenBank database and the RefSeq database, and the virulence gene name in the annotation information is extracted as the gene name of the virulence sequence.

If one sequence is aligned to a plurality of target sequences, the target sequence annotation information from the RefSeq database is preferentially retrieved. The reason for this is that the RefSeq database has been manually checked, and the annotation information has the highest reliability.
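The RefSeq-first rule described above can be sketched as a small selection function. The record layout (a dictionary with "source", "accession" and "definition" keys) is a hypothetical representation chosen for the example.

```python
from typing import Dict, List, Optional


def pick_target_annotation(targets: List[Dict[str, str]]) -> Optional[Dict[str, str]]:
    """Choose which target's annotation to use, preferring RefSeq records."""
    for target in targets:
        if target.get("source") == "RefSeq":
            return target  # manually checked records take priority
    # No RefSeq record: fall back to the first available target, if any.
    return targets[0] if targets else None
```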

After the gene name annotation is completed according to the above steps, the result can be compared for consistency with the annotation results of similar databases. Inconsistent items may be checked manually a second time to further improve the annotation accuracy of the present invention.

Illustratively, a blat alignment can be performed between the CVFDB_SP00001 sequence and the NCBI RefSeq protein sequence library, in which the CVFDB_SP00001 sequence aligns with 100% identity to the RefSeq accession WP_001081735.1. The full gene name is extracted from the definition of WP_001081735.1 in the NCBI database, "phospholipase C, phosphocholine-specific [Acinetobacter baumannii]", and the gene name of the CVFDB_S000001 entry is recorded as phospholipase C.
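One way to reproduce the extraction in this example is to strip the trailing bracketed organism from the definition line and keep the text before the first comma. This rule is an assumption inferred from the single example above, not a rule stated in the application, and it will not suit every definition line.

```python
import re


def extract_gene_name(definition: str) -> str:
    """Derive a gene name from a definition line (assumed heuristic)."""
    # Drop a trailing "[organism]" suffix if present.
    without_organism = re.sub(r"\s*\[[^\]]*\]\s*$", "", definition)
    # Keep the part before the first comma and trim whitespace.
    return without_organism.split(",")[0].strip()
```

Names produced this way are candidates for the manual check mentioned above rather than final annotations.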

S106: Virulence gene type annotation is performed on the initial sequence.

In the virulence factor database construction method disclosed in the application, virulence gene type annotation may be performed directly on the initial sequence set based on the gene descriptions in the NCBI database, or on the second filtering sequence set from step S1053 based on the gene descriptions in the NCBI microbial genome database.

It will be appreciated that, in addition to annotating the virulence gene type based on the gene descriptions in the NCBI microbial genome database, the annotation can also be made using databases that provide gene names, such as the NT library (Nucleotide Sequence Database), the RefSeq protein library, GenBank, SwissProt, etc. The application is not limited to a particular database.

Referring to table 1, a virulence gene type glossary is provided in the examples of the present application. Some commonly used virulence gene type designations are presented in this table.

TABLE 1 virulence Gene type glossary

It should be noted that, after the virulence gene type annotation is performed on the initial sequence according to the above steps, the result can be compared for consistency with the annotation results of similar databases, and inconsistent items can be checked manually a second time to further improve the annotation accuracy of the invention.

Illustratively, the protein sequence of the CVFDB_S000001 entry corresponds to WP_001081735.1 in the RefSeq database, and the gene product is phospholipase C. By combining the annotation information of NCBI with a literature search, it is known that phospholipase C is a lipase that catalyzes the decomposition of phospholipids in the host cell membrane and promotes bacterial invasion of the host cell to exert virulence, so the virulence gene type of the entry is exotoxin (Exotoxin).
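The consistency comparison described above can be sketched as a function that lists the entries whose annotation differs from that of a reference database, so that they can be routed to manual verification. Comparing after case and whitespace normalisation is an assumption, as are the names used.

```python
from typing import Dict, List


def _normalize(value: str) -> str:
    """Collapse whitespace and ignore case before comparing annotations."""
    return " ".join(value.split()).casefold()


def find_inconsistent_entries(ours: Dict[str, str],
                              reference: Dict[str, str]) -> List[str]:
    """Return IDs present in both annotation sets whose values disagree."""
    shared = ours.keys() & reference.keys()
    return sorted(entry_id for entry_id in shared
                  if _normalize(ours[entry_id]) != _normalize(reference[entry_id]))
```

Entries that appear in only one of the two sets are not reported, since there is nothing to compare them against.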

S107: a virulence factor database is constructed based on the species annotation information, the gene name, and the virulence gene type annotation information.

A MySQL relational database is designed and implemented with the virulence sequence as the entity and with the gene name, the species annotation information and the virulence gene type annotation information as its attributes, thereby constructing the virulence factor database.
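The entity-attribute design described above can be sketched as a small relational schema. The example below uses Python's built-in sqlite3 module purely so that it runs without a database server; the application specifies MySQL, and the table names, column names and the separate species table are hypothetical design choices.

```python
import sqlite3
from typing import Iterable

SCHEMA = """
CREATE TABLE virulence_sequence (
    entry_id            TEXT PRIMARY KEY,  -- e.g. CVFDB_S000001
    gene_name           TEXT,
    virulence_gene_type TEXT
);
CREATE TABLE sequence_species (
    entry_id     TEXT NOT NULL REFERENCES virulence_sequence(entry_id),
    species_name TEXT NOT NULL,
    PRIMARY KEY (entry_id, species_name)
);
"""


def create_database(path: str = ":memory:") -> sqlite3.Connection:
    """Create the two tables in a new SQLite database (stand-in for MySQL)."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_entry(conn: sqlite3.Connection, entry_id: str, gene_name: str,
              gene_type: str, species: Iterable[str]) -> None:
    """Insert one virulence sequence entry together with its species list."""
    with conn:  # commit on success, roll back on error
        conn.execute("INSERT INTO virulence_sequence VALUES (?, ?, ?)",
                     (entry_id, gene_name, gene_type))
        conn.executemany("INSERT INTO sequence_species VALUES (?, ?)",
                         [(entry_id, name) for name in species])
```

A species is stored in its own table because one sequence can be annotated to several species, as in the Acinetobacter_baumannii and Proteobacteria_bacterium example given earlier.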

Illustratively, various information of the CVFDB_S000001 entry may be as shown in Table 2. Table 2 is an example table of annotation information provided in an embodiment of the present application.

Table 2 annotation information example


The application provides a virulence factor database construction method, which comprises the steps of adding species annotation information and virulence gene type annotation information to an initial sequence after the initial sequence is acquired, and determining the gene name of the initial sequence, so that the virulence factor database is constructed based on the initial sequence, the species annotation information, the gene name and the virulence gene type annotation information. Therefore, a microbial virulence gene database with complete sequence, accurate and complete information, standardized annotation programs and terms can be constructed. Furthermore, basic data can be provided for clinical pathogen virulence detection and virulence gene research.

It should be noted that although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous.

It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.

Referring to fig. 4, a schematic diagram of a virulence factor database construction device according to an embodiment of the present application is shown. The virulence factor database construction apparatus 400 includes: an acquisition module 401, a first information module 402, a second information module 403, a third information module 404, and a construction module 405;

an acquisition module 401, configured to acquire an initial sequence; a first information module 402 for adding species annotation information to the initial sequence; a second information module 403 for determining a gene name of the initial sequence; a third information module 404 for adding virulence gene type annotation information to the initial sequence; a construction module 405 for constructing a virulence factor database based on the initial sequence, the species annotation information, the gene name, and the virulence gene type annotation information.

The application provides a virulence factor database construction device, which is used for adding species annotation information and virulence gene type annotation information to an initial sequence after the initial sequence is acquired, and determining the gene name of the initial sequence, so that the virulence factor database is constructed based on the species annotation information, the gene name and the virulence gene type annotation information. Therefore, a microbial virulence gene database with complete sequence, accurate and complete information, standardized annotation programs and terms can be constructed. Furthermore, basic data can be provided for clinical pathogen virulence detection and virulence gene research.

The embodiment of the application also provides corresponding equipment and a computer storage medium, which are used for realizing the scheme provided by the embodiment of the application.

The equipment comprises a memory and a processor, wherein the memory is used for storing instructions or code, and the processor is used for executing the instructions or code so that the equipment performs the virulence factor database construction method of any embodiment of the application.

The computer storage medium has code stored therein, and when the code is executed, the apparatus for executing the code performs the method described in any of the embodiments of the present application.

The terms "first", "second", and the like (where present) in the embodiments of the present application are used only to distinguish names and do not indicate any order.

From the above description of the embodiments, it will be apparent to those skilled in the art that all or part of the steps of the example methods described above may be implemented in software plus a general-purpose hardware platform. Based on this understanding, the technical solutions of the present application may be embodied in the form of a software product, which may be stored in a storage medium, such as a read-only memory (ROM)/RAM, a magnetic disk, or an optical disk, and which includes several instructions for causing a computer device (which may be a personal computer, a server, or a network communication device such as a router) to perform the methods described in the embodiments, or in some parts of the embodiments, of the present application.

It should be noted that the embodiments in this specification are described in a progressive manner: identical and similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, since the apparatus and media embodiments are substantially similar to the method embodiments, they are described relatively briefly, and reference may be made to the description of the method embodiments for the relevant parts. The apparatus and media embodiments described above are merely illustrative: modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.

The foregoing is merely one specific embodiment of the present application, but the protection scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered in the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A method of virulence factor database construction, the method comprising:

acquiring an initial sequence;

adding species annotation information to the initial sequence;

determining the gene name of the initial sequence;

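adding virulence gene type annotation information to the initial sequence;

constructing a virulence factor database based on the initial sequence, the species annotation information, the gene name, and the virulence gene type annotation information.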

2. The method of claim 1, wherein the initial sequence comprises an initial nucleic acid sequence and an initial protein sequence; the acquiring the initial sequence includes:

3. The method according to claim 2, wherein the method further comprises:

4. The method of claim 1, wherein said adding species annotation information to said initial sequence comprises:

species annotation information is added to the first filter sequence.

5. The method of claim 4, wherein said obtaining a first alignment of said initial sequence with a genomic database comprises:

6. The method of claim 1, wherein said determining the gene name of the initial sequence comprises:

determining the gene name of the second filter sequence.

7. The method of claim 1, wherein the initial sequence is from one or more of a VFDB database, a Victors database, a DFVF database, a PATRIC database.

8. A virulence factor database construction device, the device comprising: an acquisition module, a first information module, a second information module, a third information module, and a construction module;

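the acquisition module is used for acquiring an initial sequence;

the first information module is used for adding species annotation information to the initial sequence;

the second information module is used for determining the gene name of the initial sequence;

the third information module is used for adding virulence gene type annotation information to the initial sequence;

the construction module is used for constructing a virulence factor database based on the initial sequence, the species annotation information, the gene name, and the virulence gene type annotation information.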

9. Virulence factor database construction equipment, comprising: a memory and a processor;

the memory is used for storing programs;

the processor being configured to execute the program to implement the steps of the method according to any one of claims 1 to 6.

10. A computer storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method according to any of claims 1 to 6.