CN115910216A - Method and system for identifying genome sequence classification errors based on machine learning

Info

Publication number: CN115910216A
Application number: CN202211537778.7A
Authority: CN
Inventors: 陈燕君; 王涛; 肖姗姗
Original assignee: Hangzhou Ruipu Medical Laboratory Co ltd; Hangzhou Repugene Technology Co ltd
Current assignee: Hangzhou Ruipu Medical Laboratory Co ltd; Hangzhou Repugene Technology Co ltd
Priority date: 2022-12-01
Filing date: 2022-12-01
Publication date: 2023-04-04
Anticipated expiration: 2042-12-01
Also published as: CN115910216B

Abstract

The invention discloses a method and a system for identifying genome sequence classification errors based on machine learning, and belongs to the technical field of bioinformatics. The invention also discloses a method for constructing a machine learning model for identifying classification errors of assembled genomes, which comprises the following steps: obtaining assembled genome sequences of a plurality of species having a reference genome; randomly generating reads from each assembled genome sequence, and breaking each assembled genome sequence into fragments to obtain contigs sequence sets; aligning the reads to each contigs sequence set to obtain alignment parameters for each position of each contig, and constructing a feature data set; and constructing a machine learning model from the feature data sets of all the assembled genome sequences and the information on whether each is misclassified. The method and the system can accurately judge whether the classification of an assembled genome sequence is correct; after the misclassified assembled genomes are deleted, using the remaining high-quality assembled genome sequences as a reference database effectively reduces false-positive detections in actual sample testing.

Description

Method and system for identifying genome sequence classification errors based on machine learning

Technical Field

The invention belongs to the technical field of bioinformatics, and particularly relates to a method and a system for identifying genome sequence classification errors based on machine learning.

Background

In the field of bioinformatics, after genome sequences are classified, the classification often needs to be checked to ensure that it is correct. The genome sequence classification check used by the National Center for Biotechnology Information (NCBI) is: aligning the genome sequence to be checked with the reference genome sequence of the species, calculating the base similarity of the homologous fragments of the two genome sequences, and using an identity of more than 96% and a coverage of more than 80% as thresholds to judge whether the genome sequence to be checked is assigned to the wrong species.

However, this method has the following drawbacks or disadvantages:

(1) For each species, a reference genomic sequence is required; if the genome sequence to be detected belongs to the species without the reference, the judgment cannot be carried out;

(2) The judgment threshold is a definite value, and judgment errors may exist for different types of species; for example, viruses with higher evolution speed have more variation in genome sequence, and a lower threshold value should be used for the corresponding base similarity threshold value;

(3) Strains with more variation from the reference genome may be judged as misclassified genomes, i.e., misjudgment may occur.

QUAST is a very popular software tool for assessing genome assembly results (Alexey Gurevich, Vladislav Saveliev, Nikolay Vyahhi & Glenn Tesler. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013, 29). MetaQUAST is a modified version of QUAST (Mikheenko A, Saveliev V, Gurevich A. MetaQUAST: evaluation of metagenome assemblies. Bioinformatics. 2016 Apr 1 (7)) and is a relatively advanced tool for evaluating genome assemblies based on the alignment of contigs to a reference; however, the reference genome of the species must be supplied to this tool as input at the time of evaluation.

Therefore, there is a need in the art for a precise method for identifying whether there is an error in the classification of genomic sequences.

Disclosure of Invention

In order to solve at least one of the above technical problems, the technical solution adopted by the present invention is as follows:

In a first aspect, the invention provides a method of constructing a machine learning model for identifying classification errors of assembled genomes, comprising the following steps:

S1, obtaining assembled genome sequences of a plurality of species with reference genomes, wherein the assembled genome sequences comprise correctly classified assembled genome sequences and incorrectly classified assembled genome sequences;

S2, randomly generating reads of length K from each assembled genome sequence; breaking each assembled genome sequence into fragments of length L with a step length of N to obtain a contigs sequence set for each, wherein K = 75-500, L = 5000-10000, and N = 1-L;

S3, aligning the simulated reads to each contigs sequence set to obtain the following parameters for each position of each contig: whether the base at the position is A, T, C or G; the numbers of reads whose detected genotype at the position is A, T, C and G, respectively; the read coverage depth at the position; and the number of discordant reads at the position; thereby obtaining an L × 11 matrix as the feature value, wherein the feature values of all contigs in the contigs sequence set of an assembled genome sequence form the feature data set of that assembled genome sequence;

and S4, constructing a machine learning model by using the characteristic data sets of all the assembled genome sequences and the information of whether the classification is wrong.
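The fragmentation in step S2 can be sketched as a sliding window over the assembled sequence. This is a minimal illustration, not the inventors' implementation: the function name `fragment_genome` is hypothetical, and dropping a trailing window shorter than L is an assumption, since the text does not say how the sequence end is handled.

```python
from typing import List


def fragment_genome(sequence: str, length: int = 8000, step: int = 7500) -> List[str]:
    """Cut a sequence into windows of `length` bases, advancing `step` bases each time.

    Assumption: a trailing window shorter than `length` is discarded.
    """
    if length <= 0 or not 0 < step <= length:
        raise ValueError("expected length > 0 and 1 <= step <= length")
    if len(sequence) <= length:
        # A sequence no longer than one window is returned as a single contig.
        return [sequence]
    contigs = []
    for start in range(0, len(sequence) - length + 1, step):
        contigs.append(sequence[start:start + length])
    return contigs
```

With a step smaller than the window length, adjacent contigs overlap by L - N bases (500 bases for L = 8000 and N = 7500).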

In some embodiments of the invention, K = 200, L = 8000, and N = 7500, which more closely approximates real high-throughput sequencing results and contig assembly results.

In some embodiments of the invention, in step S1, the assembled genome sequence is assembled from high-throughput sequencing data; correspondingly, in step S2, paired-end sequencing reads are randomly generated for each assembled genome sequence, preferably paired-end sequencing reads of the HiSeq 2500 platform simulated with art_illumina, which are reads1 and reads2 respectively; reads1 of all assembled genome sequences are merged into simulated reads1, and reads2 of all assembled genome sequences are merged into simulated reads2; accordingly, in step S3, the reads are aligned to the contigs sequence sets in paired-end alignment mode.

In some embodiments of the invention, in step S1, the assembled genomic sequences of all reference species are obtained. Preferably, if there are too many assembled genomic sequences of a species, for example more than 5, only 5 of them are retained to improve the efficiency of modeling.

In some embodiments of the invention, in step S1, whether a classification is correct or erroneous is evaluated using MetaQUAST.

In some embodiments of the invention, the machine learning model is constructed based on a neural network algorithm.

In some embodiments of the invention, the machine learning model is constructed based on a convolutional neural network algorithm.

Further, the machine learning model obtains an assembled genome score of the assembled genome sequence to be identified according to the input result, if the assembled genome score is lower than a preset threshold, the assembled genome sequence to be identified is wrongly classified, and if the assembled genome score is not lower than the preset threshold, the assembled genome sequence to be identified is correctly classified. In some embodiments of the invention, the preset threshold is a knee value of a curve based on scores of assembled genomic sequences of a plurality of the same species. In some embodiments of the invention, the inflection point is the point at which the slope of the curve begins to decrease, the assembled genome score for the inflection point is near the peak, and the curve growth slows down from the inflection point. In some embodiments of the present invention, the preset threshold is a representative value of inflection points of a plurality of congeneric species, and the representative value is a statistically significant value such as a mean value, a median value, a mode value, and the like.

The second aspect of the present invention provides a method for identifying species assembly genome sequence classification errors, which comprises the steps of S2-S3 of the first aspect of the present invention, constructing a feature data set of each assembly genome sequence to be identified, and inputting the feature data set into a machine learning model constructed by the first aspect of the present invention, thereby determining whether the classification of each assembly genome sequence to be identified is erroneous.

For a species with a plurality of assembled genome sequences, the method for identifying misclassification of species assembled genome sequences can pick out the assembled genome sequences that are misclassified. The correctly classified assembled genome sequences are then used as the genome library of the species, which makes subsequent alignment analysis more accurate.

In some embodiments of the invention, all assembled genomic sequences from a species are obtained and the corresponding assembled genomic scores are obtained using the machine learning model, respectively. The assembled genomic sequences with assembled genomic scores below a predetermined threshold are removed, and the remaining assembled genomic sequences can be used to construct a reference gene database for the species.

In some embodiments of the invention, all assembled genomic sequences from a species are obtained and the corresponding assembled genomic scores are obtained using the machine learning model, respectively. And drawing a curve by using the scores of the assembled genomes, wherein the score of the assembled genome corresponding to the inflection point of the curve is the inflection point value, and the assembled genome with the score lower than the inflection point value is judged to be classified wrongly and cannot be used for constructing a database.

The third aspect of the invention provides a system for identifying species assembly genome sequence classification errors, which comprises the following modules:

the data input module is used for obtaining a characteristic data set of each assembled genome sequence to be identified;

the data storage module is used for storing characteristic data sets of the assembled genome sequences of a plurality of reference species and information whether the classification is wrong;

the error identification module is respectively connected with the data input module and the data storage module and is used for constructing a machine learning model according to the characteristic data sets of the assembled genome sequences and the information whether the classification is wrong or not, inputting the characteristic data sets of the assembled genome sequences to be identified into the model and judging whether the classification of the assembled genome sequences to be identified is wrong or not;

the result output module is connected with the error identification module and used for outputting an identification result;

wherein the feature data set is constructed according to steps S2-S3 of the first aspect of the invention.

In some embodiments of the invention, the data storage module stores, for each reference species, a plurality of feature data sets of assembled genome sequences together with the information on whether each classification is erroneous.

In some embodiments of the invention, the result output module is further connected to the data storage module for inputting the feature data set of the assembled genomic sequence to be identified and the identification result to the data storage module. Therefore, the data size for constructing the model can be continuously increased, and the accuracy of model identification is further improved.

A fourth aspect of the present invention provides a computer device comprising: a memory for storing a computer program; a processor for implementing the steps of the method according to any one of the first aspect of the invention when executing the computer program.

A fifth aspect of the invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method according to any one of the first aspects of the invention.

In a sixth aspect, the invention provides a method for constructing a species genome database, comprising obtaining a reference genome sequence and all assembled genome sequences of the species, and determining and deleting the misclassified assembled genome sequences by using the method of the second aspect or the system of the third aspect. The retained assembled genome sequences are used to construct the species genome database. The core basis of pathogen metagenomics is a genome database of pathogenic microorganisms, and whether a genome assembly sequence obtained from public data is misclassified directly influences the result of downstream pathogen identification. The database obtained by the method for constructing the species genome database can improve detection accuracy.

Advantageous Effects

Compared with the prior art, the invention has the following beneficial effects:

By utilizing the method and the system, whether the classification of an assembled genome sequence is correct can be accurately judged. Taking the assembled genomes of the Mycobacterium tuberculosis complex as an example, identification with the method and system of the present invention found that about 7% of the assembled genome sequences do not belong to Mycobacterium tuberculosis, possibly because of sequence contamination or assembly errors.

After the method and the system are used to delete the misclassified assembled genomes, using the remaining correctly classified assembled genome sequences as a reference database effectively reduces false-positive detections in actual sample testing. Taking the 20 low-abundance tuberculosis samples used in the invention as an example, 80% of the samples in which Mycobacterium tuberculosis was detected were found to be false positives.

Drawings

FIG. 1 shows the assembled genome scores of the 608 M. tuberculosis complex assembled genomes, obtained by using the machine learning model of the present invention.

Detailed Description

Unless otherwise indicated, implied from the context, or customary in the art, all parts and percentages herein are by weight and the testing and characterization methods used are synchronized with the filing date of the present application. Where applicable, the contents of any patent, patent application, or publication referred to in this application are hereby incorporated by reference in their entirety, and the equivalent family of patents is also incorporated by reference, especially with respect to the definitions of relevant terms in the art, as disclosed in these documents. To the extent that a definition of a particular term disclosed in the prior art is inconsistent with any definitions provided herein, the definition of the term provided herein controls.

The numerical ranges in this application are approximations, and thus may include values outside of the ranges unless otherwise specified. A numerical range includes all numbers from the lower value to the upper value, in increments of 1 unit, provided that there is a separation of at least 2 units between any lower value and any higher value. For ranges containing a numerical value less than 1 or containing a fraction greater than 1 (e.g., 1.1,1.5, etc.), then 1 unit is considered to be 0.0001,0.001,0.01, or 0.1, as appropriate. For ranges containing single digit numbers less than 10 (e.g., 1 to 5), 1 unit is typically considered 0.1. These are merely specific examples of what is intended to be expressed and all possible combinations of numerical values between the lowest value and the highest value enumerated are to be considered to be expressly stated in this application.

In order to make the technical problems, technical solutions and advantageous effects solved by the present invention more apparent, the present invention is further described in detail below with reference to the following embodiments.

Examples

The following examples are used herein to demonstrate preferred embodiments of the invention. It will be appreciated by those of skill in the art that the techniques disclosed in the examples which follow represent techniques discovered by the inventor to function in the invention, and thus can be considered to constitute preferred modes for its practice. Those of skill in the art should, in light of the present disclosure, appreciate that many changes can be made in the specific embodiments which are disclosed and still obtain a like or similar result without departing from the spirit or scope of the invention.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs and the disclosures and references cited herein and the materials to which they refer are incorporated by reference.

Those skilled in the art will recognize, or be able to ascertain using no more than routine experimentation, many equivalents to the specific embodiments of the invention described herein. Such equivalents are intended to be encompassed by the following claims.

The experimental procedures in the following examples are conventional unless otherwise specified. The instruments used in the following examples, unless otherwise specified, were all conventional laboratory instruments; the test materials used in the following examples were purchased from a conventional biochemical reagent store unless otherwise specified.

Example 1 construction of a machine learning model to identify misclassified genomic sequences

1. Training set construction

(1) The genome sequences of the reference species were downloaded from the NCBI database, with up to 5 assembly sequences (assembly level "Complete Genome" or "Chromosome") downloaded per species, giving 4258 genome sequences for 1000 species. The classification of each assembled genome sequence was evaluated for accuracy using MetaQUAST and assigned a value of 1 if accurate and 0 if not. Of these, 4087 genomes were classified accurately and 171 were classified incorrectly.

(2) For each genome sequence, paired-end sequencing reads of the HiSeq 2500 platform, called reads1 and reads2, were generated by simulation with art_illumina, with the read length set to 100 bp; all reads1 were merged into simulated reads1 and all reads2 into simulated reads2.

(3) Each genome sequence was broken into fragments of 8000 bp with a step length of 7500 bp to obtain a contigs sequence set for each;

(4) Using bowtie2, the simulated reads1 and simulated reads2 generated in step (2) were aligned to each contigs sequence set in paired-end alignment mode to obtain the alignment result at each position on each contig. Sorted BAM files were generated and indexed;

(5) For each contig sequence, an 8000 × 11 matrix was constructed as the feature value, using one-hot encoding, to represent the alignment result at each position on the contig. The length-11 vector for a position consists of: whether the position on the reference genome is A, T, C or G (1 if so, 0 if not); the numbers of reads whose detected genotype at the position is A, T, C and G, respectively; the read coverage depth at the position; the number of discordant reads at the position; and the number of SNPs detected at the position, as shown in Table 1:

TABLE 1 eigenvalue matrix
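The length-11 vector of step (5) can be illustrated as follows. This is a sketch under stated assumptions: the column order (reference base one-hot, then per-base read counts, then depth, discordant reads and SNP count), the base order "ATCG", and the names `position_features` and `contig_matrix` are all hypothetical, and the per-position counts are taken as already extracted from the alignment.

```python
BASES = "ATCG"  # assumed column order for both the one-hot and the count columns


def position_features(ref_base, base_counts, depth, discordant, snps):
    """Build one 11-element row for a single contig position."""
    one_hot = [1 if ref_base == b else 0 for b in BASES]   # 4 columns
    counts = [base_counts.get(b, 0) for b in BASES]        # 4 columns
    return one_hot + counts + [depth, discordant, snps]    # 3 columns


def contig_matrix(ref_seq, per_position):
    """Stack the rows of all positions into an L x 11 matrix (list of lists).

    `per_position[i]` is a tuple (base_counts, depth, discordant, snps)
    for position i of the contig.
    """
    return [position_features(base, *per_position[i])
            for i, base in enumerate(ref_seq)]
```

For an 8000 bp contig, `contig_matrix` returns 8000 rows of 11 values each.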

2. Model training

Using the constructed training data set, the Python machine learning toolkit TensorFlow was called to construct a convolutional neural network model.

(1) Keras was called to define a Sequential model, with the training data set fed to the input layer;

(2) The convolutional layer and BN layer (axis = -1) were added in this order, using the parameters as follows:

Convolutional layer	Output size
Conv 2×11×8, stride 1 + ReLU + BN	9999×1×8
Conv 2×1×16, stride 2 + ReLU + BN	4999×1×16
Conv 2×1×32, stride 2 + ReLU + BN	2499×1×32
Conv 2×1×64, stride 2 + ReLU + BN	1249×1×64
Conv 2×1×64, stride 2 + ReLU + BN	624×1×128

(3) Adding a pooling layer, wherein the size of the pooling layer is (50,1);

(4) The Flatten layer is added, using default parameters.
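The output lengths listed in the layer table of step (2) can be checked with the usual formula for an unpadded convolution, floor((n - kernel) / stride) + 1. Treating the layers as unpadded is an assumption, since the padding mode is not stated; the helper names below are hypothetical.

```python
def conv_output_length(n, kernel, stride):
    """Output length of an unpadded 1-D convolution along the position axis."""
    return (n - kernel) // stride + 1


def stack_output_lengths(n, layers):
    """Apply a sequence of (kernel, stride) layers and record each output length."""
    lengths = []
    for kernel, stride in layers:
        n = conv_output_length(n, kernel, stride)
        lengths.append(n)
    return lengths


# Kernel height 2 throughout; stride 1 for the first layer, 2 for the rest.
LAYERS = [(2, 1), (2, 2), (2, 2), (2, 2), (2, 2)]
```

Under this assumption, `stack_output_lengths(10000, LAYERS)` reproduces the listed lengths 9999, 4999, 2499, 1249 and 624, whereas an input of 8000 positions gives 7999, 3999, 1999, 999 and 499.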

3. Model testing

A reference species not included in the training data set was selected, 1000 genome sequence files of the species were randomly downloaded from NCBI, and the feature value matrices were extracted according to the training-set construction method. The data were input into the constructed model to calculate scores. Evaluation with MetaQUAST at the same time found that 3.8% of the genomes were misclassified. Therefore, the top 96.2% of genomes by score were taken as the correctly classified set and the rest as the misclassified set. The final recall of the model reached 89.47%.
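The ranking-based split and the recall calculation can be sketched as below. The text does not say which class the recall refers to; this sketch assumes it is the misclassified class, with the MetaQUAST labels as ground truth. With 1000 genomes and 3.8% misclassified, 34 recovered out of 38 would give 89.47%, although these counts are not reported. The function names are hypothetical.

```python
def flag_lowest_scores(scores, error_fraction):
    """Return the indices of the lowest-scoring fraction of genomes."""
    n_errors = round(len(scores) * error_fraction)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return set(order[:n_errors])


def recall(predicted_errors, true_errors):
    """Fraction of truly misclassified genomes that were flagged."""
    if not true_errors:
        raise ValueError("recall is undefined without any true errors")
    return len(predicted_errors & true_errors) / len(true_errors)
```

For example, `flag_lowest_scores(scores, 0.038)` on 1000 scores flags the 38 lowest-scoring genomes, which are then compared against the set of genomes MetaQUAST marked as misclassified.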

Example 2 construction of highly accurate genomic database of Mycobacterium tuberculosis Complex

The model constructed in Example 1 was used to evaluate whether the classification of Mycobacterium tuberculosis genomes is erroneous: the feature value data set of each Mycobacterium tuberculosis assembled genome was extracted according to the training-set construction method of Example 1 and input into the model constructed in Example 1 to obtain a score, and whether the classification of the assembled genome under evaluation is correct was judged from the score.

1. Genome assembly sequence download

The taxid 77643, corresponding to the Mycobacterium tuberculosis complex, was entered into the NCBI Taxonomy database as the search term, all genomes with an assembly level of "Complete Genome" or "Chromosome" were downloaded, and after screening the assembled genome sequences of 608 strains of Mycobacterium tuberculosis were obtained.

2. Simulated reads generation

For each of the 608 assembled genome sequences, paired-end sequencing reads of the HiSeq 2500 platform, reads1 and reads2, were generated by simulation with art_illumina. The read length was set to 100 bp, the insert length to 270 bp, and the length deviation to 50 bp; all reads1 were then merged into simulated reads1, and all reads2 into simulated reads2.
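The simulation step could be driven from Python roughly as follows. This is a sketch, not the command actually used: `-ss`, `-p`, `-i`, `-l`, `-m`, `-s`, `-f` and `-o` are art_illumina options, but the fold coverage is not given in this example and appears here as a hypothetical parameter, and reading the 50 bp "length deviation" as the fragment-size standard deviation is an assumption.

```python
def art_illumina_command(genome_fasta, out_prefix, fold_coverage):
    """Build the argument list for one paired-end art_illumina simulation."""
    return [
        "art_illumina",
        "-ss", "HS25",             # built-in HiSeq 2500 error profile
        "-p",                      # paired-end simulation
        "-i", genome_fasta,        # input assembled genome (FASTA)
        "-l", "100",               # read length in bp
        "-m", "270",               # mean fragment (insert) length in bp
        "-s", "50",                # fragment length standard deviation (assumed meaning)
        "-f", str(fold_coverage),  # fold coverage: hypothetical, not stated in the source
        "-o", out_prefix,          # prefix of the output read files
    ]
```

The list can be passed to `subprocess.run(cmd, check=True)` once per genome; the per-genome read files are then concatenated into the simulated reads1 and simulated reads2 files.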

3. Genome disruption

Respectively breaking 608 genome sequences according to the length of 8000bp and the step length of 7500bp to obtain 608 contigs sequence sets (each genome sequence corresponds to one contigs sequence set); on average, each contigs sequence set contains 588 contigs.

4. Reads alignment

Using bowtie2, the simulated reads1 and simulated reads2 generated in step 2 were aligned to each of the 608 contigs sequence sets in paired-end alignment mode to obtain the alignment result at each position; sorted BAM files were generated and indexed.
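One way to express this alignment step is as a short list of commands per contigs sequence set. Only bowtie2 is named in the text; using samtools for sorting and indexing is an assumption, as are the file naming scheme and the function name `alignment_commands`.

```python
def alignment_commands(contigs_fasta, reads1, reads2, prefix):
    """Return the commands to index, align, sort and index one contigs set."""
    sam = prefix + ".sam"
    bam = prefix + ".sorted.bam"
    return [
        # Build the bowtie2 index for this contigs sequence set.
        ["bowtie2-build", contigs_fasta, prefix],
        # Paired-end alignment of the simulated reads against the index.
        ["bowtie2", "-x", prefix, "-1", reads1, "-2", reads2, "-S", sam],
        # Sort the alignments by coordinate and write a BAM file (assumed tool: samtools).
        ["samtools", "sort", "-o", bam, sam],
        # Index the sorted BAM file.
        ["samtools", "index", bam],
    ]
```

Each inner list can be run in order with `subprocess.run(cmd, check=True)`, repeating the whole sequence for every contigs sequence set.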

5. Feature dataset extraction

For each genome sequence, the feature information corresponding to each position in each contig sequence was calculated from the BAM file. For a given position, this comprises: whether the position on the reference genome is A, T, C or G; the numbers of reads whose detected genotype at the position is A, T, C and G, respectively; the read coverage depth at the position; the number of discordant reads at the position; and the number of SNPs detected at the position. Together these form the feature data set.

6. Model prediction and misclassification recognition

The 608 feature data sets were input into the model trained in Example 1, and a score, namely the contig score, was calculated for each contig sequence; for each assembled genome, the upper-quartile contig score over all of its contigs was taken as the assembled genome score.
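Aggregating contig scores into one assembled genome score can be sketched with the standard library. Reading "upper quartile" as the 75th percentile is an assumption, as is the interpolation method; the function name is hypothetical.

```python
import statistics


def assembled_genome_score(contig_scores):
    """Return the upper quartile (75th percentile) of the contig scores."""
    if len(contig_scores) < 2:
        raise ValueError("need at least two contig scores")
    # quantiles(..., n=4) returns the three quartile cut points; index 2 is Q3.
    return statistics.quantiles(contig_scores, n=4, method="inclusive")[2]
```

Using an upper quantile rather than the mean makes the genome score depend on the better-scoring contigs, so a few poorly scoring contigs lower it only slightly.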

The assembled genome scores were sorted in ascending order and plotted as a curve; the assembled genome score at the inflection point (the point from which the slope of the curve starts to decrease and at which the score approaches the peak) was taken as the threshold, and all assembled genomes with an assembled genome score below the threshold were determined to be misclassified genomes. Finally, 561 high-accuracy genome sequences of the Mycobacterium tuberculosis complex were obtained, as shown in FIG. 1.
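The text does not specify how the inflection point is located. One possible heuristic, shown here purely as an assumption, is to normalise the ascending score curve to the unit square and pick the point that lies farthest above the straight line joining its first and last points. The function names are hypothetical.

```python
def knee_threshold(scores):
    """Pick a threshold at the knee of the ascending score curve (heuristic)."""
    ordered = sorted(scores)
    n = len(ordered)
    if n < 3:
        raise ValueError("need at least three scores to locate a knee")
    lo, hi = ordered[0], ordered[-1]
    if hi == lo:
        return lo  # flat curve: no knee, nothing falls below the threshold
    best_index, best_gap = 0, float("-inf")
    for i, score in enumerate(ordered):
        x = i / (n - 1)                 # normalised rank
        y = (score - lo) / (hi - lo)    # normalised score
        gap = y - x                     # height above the chord from first to last point
        if gap > best_gap:
            best_index, best_gap = i, gap
    return ordered[best_index]


def misclassified(scores_by_id, threshold):
    """Return the identifiers of genomes whose score is below the threshold."""
    return sorted(g for g, s in scores_by_id.items() if s < threshold)
```

Genomes returned by `misclassified` would be removed, and the remainder kept for the reference database.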

Example 3 Mycobacterium tuberculosis Complex identification

Twenty samples in which reads specifically aligned to the Mycobacterium tuberculosis complex had been detected at low abundance were aligned to a reference genome database constructed from all 608 genome sequences in Example 2 (database 1) and to a reference genome database constructed from the 561 genome sequences remaining after the misclassified genome sequences were removed (database 2), and the number of specifically detected reads was counted for each. The results of the PCR method (F: AAACACAAGGAGCGACAAC; R: CATACCAGGACGCCTTGC) were used for comparison, as shown in Table 2:

TABLE 2 detection of specific reads number of Mycobacterium tuberculosis Complex Using different genomic databases

According to the statistical results in Table 2, after the Mycobacterium tuberculosis complex genome sequences judged to be misclassified by the machine learning model of the invention are removed, the false positive rate of detection is greatly reduced and the detection accuracy is improved, while the true positive rate is unchanged.

All documents referred to herein are incorporated by reference into this application as if each were individually incorporated by reference. Furthermore, it should be understood that various changes or modifications of the present invention can be made by those skilled in the art after reading the above teachings of the present invention, and these equivalents also fall within the scope of the appended claims of the present application.

Claims

1. A method of constructing a machine learning model for identifying classification errors of assembled genomes, comprising the steps of:

S2, randomly generating reads of length K from each assembled genome sequence; breaking each assembled genome sequence into fragments of length L with a step length of N to obtain a contigs sequence set for each, wherein K = 75-500, L = 1000-10000, and N = 1-L;

S3, aligning the simulated reads to each contigs sequence set to obtain the following parameters for each position of each contig: whether the base at the position is A, T, C or G; the numbers of reads whose detected genotype at the position is A, T, C and G, respectively; the read coverage depth at the position; and the number of discordant reads at the position; thereby obtaining an L × 11 matrix as the feature value, wherein the feature values of all contigs in the contigs sequence set of an assembled genome sequence form the feature data set of that assembled genome sequence;

2. The method of claim 1, wherein in step S1, the assembled genomic sequence is assembled based on high throughput sequencing data; in the step S2, randomly generating double-end sequencing reads, namely reads1 and reads2, for each assembled genome sequence, combining the reads1 of all the assembled genome sequences into simulated reads1, and combining the reads2 of all the assembled genome sequences into simulated reads2; in step S3, the two-end reads are compared with each contigs sequence set according to the two-end reads comparison mode.

3. The method of claim 1, wherein in step S1, whether a classification is correct or erroneous is evaluated using MetaQUAST.

4. The method of claim 1, wherein the machine learning model is constructed based on a neural network algorithm.

5. The method of claim 4, wherein the machine learning model is constructed based on a convolutional neural network algorithm.

6. A method for identifying species assembly genome sequence classification errors is characterized in that a characteristic data set of each assembly genome sequence to be identified is constructed by steps S2-S3 of claim 1 and is input into a machine learning model constructed by claim 1, so that whether the classification of each assembly genome sequence to be identified is wrong or not is judged.

7. A system for identifying species assembly genomic sequence classification errors, comprising the following modules:

the error identification module is respectively connected with the data input module and the data storage module and is used for constructing a machine learning model according to the characteristic data sets of the assembled genome sequences and information about whether the classification is wrong, inputting the characteristic data sets of the assembled genome sequences to be identified into the model and judging whether the classification of the assembled genome sequences to be identified is wrong;

wherein the feature data set is constructed according to steps S2 to S3 of claim 1.

8. The system of claim 7, wherein the result output module is further connected to the data storage module for inputting the feature data set and the recognition result of each assembled genomic sequence to be recognized into the data storage module.

9. A computer device, comprising:

a memory for storing a computer program;

a processor for implementing the steps of the method according to any one of claims 1 to 5 when executing said computer program.

10. A computer-readable storage medium, characterized in that,

the computer-readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 5.