CN115910216B - Method and system for identifying genome sequence classification errors based on machine learning

Info

Publication number: CN115910216B
Application number: CN202211537778.7A
Authority: CN
Inventors: 陈燕君; 王涛; 肖姗姗
Original assignee: Hangzhou Ruipu Medical Laboratory Co ltd; Hangzhou Repugene Technology Co ltd
Current assignee: Hangzhou Ruipu Medical Laboratory Co ltd; Hangzhou Repugene Technology Co ltd
Priority date: 2022-12-01
Filing date: 2022-12-01
Publication date: 2023-07-25
Anticipated expiration: 2042-12-01
Also published as: CN115910216A

Abstract

The invention discloses a method and a system for identifying genome sequence classification errors based on machine learning, and belongs to the technical field of bioinformatics. The invention also discloses a method for constructing a machine learning model for identifying classification errors of assembled genomes, which comprises the following steps: obtaining assembled genomic sequences of a plurality of species having a reference genome; randomly generating reads from each assembled genome sequence, and breaking each sequence into fragments to obtain a contigs sequence set; aligning the reads to each contigs sequence set to obtain alignment parameters of each contig at each position, and constructing a feature data set; and constructing a machine learning model from the feature data sets of all the assembled genome sequences and the information on whether each is misclassified. With the method and the system provided by the invention, whether the classification of an assembled genome sequence is correct can be judged accurately; when the misclassified assembled genomes are deleted and the remaining high-quality assembled genome sequences are used as a reference database, false positives in the detection of actual samples can be effectively reduced.

Description

Method and system for identifying genome sequence classification errors based on machine learning

Technical Field

The invention belongs to the technical field of bioinformatics, and particularly relates to a method and a system for identifying a genome sequence classification error based on machine learning.

Background

In the field of bioinformatics, after genomic sequences have been classified, the classification often needs to be checked for correctness. The check used by the National Center for Biotechnology Information (NCBI) is as follows: the genome sequence under test is aligned to the reference genome sequence of the species, the base similarity of the homologous fragments of the two genome sequences is calculated, and the genome sequence under test is judged for species classification errors using an identity greater than 96% and a coverage greater than 80% as thresholds.
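The fixed-threshold check described above can be illustrated with a minimal sketch. The function name and the representation of identity and coverage as fractions are assumptions made for illustration; only the 96% and 80% thresholds come from the text.

```python
def passes_fixed_threshold_check(identity: float, coverage: float,
                                 min_identity: float = 0.96,
                                 min_coverage: float = 0.80) -> bool:
    """Return True when identity and coverage both exceed the fixed thresholds.

    identity and coverage are fractions in [0, 1]; how they are computed from
    the alignment is outside this sketch.
    """
    return identity > min_identity and coverage > min_coverage


# A divergent strain of the correct species can fail the same fixed cut-off.
print(passes_fixed_threshold_check(0.985, 0.91))
print(passes_fixed_threshold_check(0.940, 0.91))
```

Because the cut-offs are the same for every species, a fast-evolving organism and a slowly evolving one are held to the same identity requirement, which is the drawback listed in item (2) below.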

However, this method has the following drawbacks or deficiencies:

(1) For each species, a reference genomic sequence is required; if the genome sequence to be detected belongs to the species without the reference, the judgment cannot be carried out;

(2) The decision threshold is a fixed value, so judgment errors may occur for different types of species; for example, viruses that evolve faster show more variation in their genome sequences, and a lower base similarity threshold should be used for them;

(3) Strains that have more variation from the reference genome may be judged as misclassified genomes, i.e., misjudgment may occur.

QUAST is a widely used tool for assessing genome assembly results (Alexey Gurevich, Vladislav Saveliev, Nikolay Vyahhi & Glenn Tesler. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013, 29:1072-1075). MetaQUAST is a modified version of QUAST (Mikheenko A, Saveliev V, Gurevich A. MetaQUAST: evaluation of metagenome assemblies. Bioinformatics. 2016 Apr 1;32(7)) and a more advanced tool that assesses genome assemblies by aligning contigs to a reference; however, the tool requires the species' reference genome as input at the time of assessment.

Thus, there is a need in the art for an accurate method of identifying whether genomic sequence classification is erroneous.

Disclosure of Invention

In order to solve at least one of the technical problems, the invention adopts the following technical scheme:

the first aspect of the present invention provides a method of constructing a machine learning model for identifying assembled genomic classification errors, comprising the steps of:

S1, obtaining assembled genome sequences of a plurality of species with reference genomes, wherein the assembled genome sequences comprise correctly classified assembled genome sequences and incorrectly classified assembled genome sequences;

S2, randomly generating reads of length K from each assembled genome sequence; breaking each assembled genome sequence into fragments of length L with step length N to obtain a contigs sequence set for each, wherein K=75-500, L=5000-10000 and N=1-L;

S3, aligning the simulated reads to each contigs sequence set to obtain the following parameters of each contig at each position: whether the position is A, T, C, or G; the number of reads whose detected genotype at the position is A, T, C, or G; the read coverage depth at the position; and the number of discordant reads at the position, so that an L×11 matrix is obtained and used as the feature value; for an assembled genome sequence, the feature values of all contigs in its contigs sequence set form the feature data set of that assembled genome sequence;

S4, constructing a machine learning model from the feature data sets of all the assembled genome sequences and the information on whether each is misclassified.
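The fragmentation in step S2 can be sketched as a sliding window. This is a minimal illustration, not the patented implementation: the function name is hypothetical, and dropping a trailing fragment shorter than L is an assumption, since the text does not say how the tail of a sequence is handled.

```python
def fragment_genome(sequence: str, length: int = 8000, step: int = 7500) -> list[str]:
    """Cut a sequence into windows of `length` bases taken every `step` bases.

    Assumption: only full-length windows are kept, so every contig has exactly
    `length` positions and maps onto a fixed-size feature matrix.
    """
    if not 1 <= step <= length:
        raise ValueError("step must satisfy 1 <= step <= length")
    last_start = len(sequence) - length
    return [sequence[start:start + length] for start in range(0, last_start + 1, step)]


# Toy example: 20 bases, windows of 8 taken every 6 bases (2-base overlap).
toy = "ACGT" * 5
contigs = fragment_genome(toy, length=8, step=6)
print(len(contigs), [len(c) for c in contigs])
```

With a step shorter than the window, consecutive contigs overlap by `length - step` bases (500 bp for L=8000 and N=7500).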

In some embodiments of the invention, K=200, L=8000 and N=7500, which more closely approximates real high-throughput sequencing results and contig assembly results.

In some embodiments of the invention, in step S1, the assembled genomic sequence is assembled from high-throughput sequencing data; accordingly, in step S2, paired-end sequencing reads are randomly generated from each assembled genomic sequence. Preferably, paired-end sequencing reads of the HiSeq 2500 platform, namely reads1 and reads2, are simulated with art_illumina; the reads1 of all assembled genomic sequences are merged into simulated reads1, and the reads2 of all assembled genomic sequences are merged into simulated reads2. Correspondingly, in step S3, the paired-end reads are aligned to each contigs sequence set in paired-end alignment mode.

In some embodiments of the invention, in step S1, the assembled genomic sequences of all the reference species are obtained. Preferably, if there are too many assembled genomic sequences of a species, e.g., more than 5, only 5 of the assembled genomic sequences are retained to increase the efficiency of model building.
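The per-species cap can be sketched as a simple filter. The function name, the `(species, accession)` tuple layout, and the choice to keep the first five assemblies encountered are assumptions for illustration; the text only says that 5 are retained.

```python
from collections import defaultdict


def cap_assemblies_per_species(assemblies, max_per_species=5):
    """Keep at most `max_per_species` assemblies for each species.

    `assemblies` is an iterable of (species, accession) pairs. Which assemblies
    to keep is not specified in the text; this sketch keeps the first ones seen.
    """
    kept = []
    seen = defaultdict(int)
    for species, accession in assemblies:
        if seen[species] < max_per_species:
            seen[species] += 1
            kept.append((species, accession))
    return kept


# Hypothetical accessions: seven assemblies of one species, two of another.
records = [("species_A", f"asm_A{i}") for i in range(7)]
records += [("species_B", f"asm_B{i}") for i in range(2)]
print(len(cap_assemblies_per_species(records)))
```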

In some embodiments of the invention, in step S1, whether a classification is correct or wrong is evaluated using MetaQUAST.

In some embodiments of the invention, the machine learning model is constructed based on a neural network algorithm.

In some embodiments of the invention, the machine learning model is constructed based on a convolutional neural network algorithm.

Further, the machine learning model outputs an assembled genome value for the assembled genome sequence to be identified from the input; if the assembled genome value is lower than a preset threshold, the assembled genome sequence to be identified is misclassified, and if it is not lower than the preset threshold, the sequence is correctly classified. In some embodiments of the invention, the preset threshold is the inflection point value of a curve drawn from the scores of a plurality of assembled genomic sequences of the same species. In some embodiments of the invention, the inflection point is the point at which the slope of the curve begins to decrease; the assembled genome value at the inflection point is close to the peak, and the curve rises slowly beyond it. In some embodiments of the invention, the preset threshold is a representative value of the inflection points of a plurality of sibling species, the representative value being a statistic such as the mean, median, or mode.

The second aspect of the present invention provides a method for identifying misclassification of assembled genomic sequences of species, comprising constructing a feature data set of each assembled genomic sequence to be identified using steps S2 to S3 of the first aspect of the present invention, and inputting the feature data set into a machine learning model constructed according to the first aspect of the present invention, thereby judging whether each assembled genomic sequence to be identified is misclassified.

For a species with a plurality of assembled genome sequences, the method for identifying classification errors of species assembled genome sequences can determine which assembled genome sequences are misclassified. Using the correctly classified assembled genomic sequences as the genome library of the species gives more accurate results in analyses such as alignment.

In some embodiments of the invention, all of the assembled genomic sequences of a species are obtained and the corresponding assembled genomic values are obtained using the machine learning model, respectively. The assembled genomic sequences whose assembled genomic values are below the preset threshold are removed and the remaining assembled genomic sequences can be used to construct a reference gene database for that species.

In some embodiments of the invention, all of the assembled genomic sequences of a species are obtained and the corresponding assembled genome values are obtained using the machine learning model. A curve is drawn from the assembled genome values; the assembled genome value at the inflection point of the curve is the inflection point value, and any assembled genome whose value is lower than the inflection point value is judged to be misclassified and cannot be used to construct the database.

In a third aspect the invention provides a system for identifying species assembly genomic sequence classification errors comprising the following modules:

the data input module is used for obtaining characteristic data sets of each assembled genome sequence to be identified;

the data storage module is used for storing characteristic data sets of the assembled genome sequences of a plurality of species with the parameters and information of whether the classification is wrong or not;

the error recognition module is respectively connected with the data input module and the data storage module and is used for constructing a machine learning model according to the characteristic data sets of the plurality of assembled genome sequences and the information of whether the classification is wrong, inputting the characteristic data sets of the assembled genome sequences to be recognized into the model and judging whether the classification of the assembled genome sequences to be recognized is wrong;

the result output module is connected with the error identification module and is used for outputting an identification result;

wherein the feature data set is constructed according to steps S2-S3 of the first aspect of the invention.

In some embodiments of the invention, the data storage module stores, for each species with a reference, the feature data sets of a plurality of assembled genomic sequences and the information on whether each is misclassified.

In some embodiments of the invention, the result output module is further connected to the data storage module for inputting the feature data set of the assembled genomic sequence to be identified and the identification result to the data storage module. Therefore, the data volume for constructing the model can be increased continuously, and the accuracy of model identification is further improved.

A fourth aspect of the invention provides a computer device comprising: a memory for storing a computer program; and a processor for implementing the steps of the method according to the first aspect of the invention when executing said computer program.

A fifth aspect of the invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method according to the first aspect of the invention.

A sixth aspect of the invention provides a method of constructing a genome database of a species, comprising obtaining a reference genome sequence and all assembled genome sequences of said species, and identifying and deleting the misclassified assembled genome sequences using the method of the second aspect of the invention or the system of the third aspect of the invention. The remaining assembled genomic sequences are used to construct the species genome database. The core of pathogen metagenomics is a genome database of pathogenic microorganisms, and whether the genome assembly sequences obtained from public data contain classification errors directly affects the results of downstream pathogen identification. A database obtained by this construction method can improve detection accuracy.

Beneficial Effects of the Invention

Compared with the prior art, the invention has the following beneficial effects:

With the method and the system provided by the invention, whether the classification of an assembled genome sequence is correct can be judged accurately. Taking the assembled genomes of the Mycobacterium tuberculosis complex as an example, identification with the method and the system of the invention found that about 7% of the genome sequences do not belong to Mycobacterium tuberculosis and may contain sequence contamination or assembly errors.

After the misclassified assembled genomes are deleted using the method and the system of the invention, using the remaining correctly classified assembled genome sequences as a reference database effectively reduces false positives in the detection of actual samples. Taking the 20 low-abundance tuberculosis samples used in the invention as an example, 80% of the samples in which Mycobacterium tuberculosis was detected were found to be false positives.

Drawings

FIG. 1 shows the assembled genome values obtained with the machine learning model of the present invention for the 608 assembled genomes of the Mycobacterium tuberculosis complex.

Detailed Description

Unless otherwise indicated, implied from the context, or customary in the art, all parts and percentages in the present application are based on weight, and the test and characterization methods used are those current as of the filing date of the present application. Where applicable, the disclosure of any patent, patent application, or publication referred to in this application is incorporated by reference in its entirety, and the equivalent patents to those cited are incorporated by reference, particularly as they relate to the definitions of terms in the art. If the definition of a particular term disclosed in the prior art does not conform to any definition provided in this application, the definition of that term provided in this application controls.

Numerical ranges in this application are approximations and may therefore include values outside the stated range unless otherwise indicated. A numerical range includes all values from the lower value to the upper value in increments of 1 unit, provided that there is a spacing of at least 2 units between any lower value and any higher value. For ranges containing values less than 1 or containing fractions greater than 1 (e.g., 1.1, 1.5, etc.), 1 unit is suitably considered to be 0.0001, 0.001, 0.01, or 0.1. For a range containing units of less than 10 (e.g., 1 to 5), 1 unit is generally considered to be 0.1. These are merely specific examples of what is intended, and all possible combinations of numerical values between the lowest value and the highest value enumerated are to be considered expressly stated in this application.

In order to make the technical problems, technical schemes and beneficial effects solved by the invention more clear, the invention is further described in detail below with reference to the embodiments.

Examples

The following examples are presented herein to demonstrate preferred embodiments of the present invention. It will be appreciated by those skilled in the art that the techniques disclosed in the examples which follow represent techniques discovered by the inventor to function in the practice of the invention, and thus can be considered to constitute preferred modes for its practice. Those of skill in the art should, in light of the present disclosure, appreciate that many changes can be made in the specific embodiments which are disclosed and still obtain a like or similar result without departing from the spirit or scope of the invention.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

Those skilled in the art will recognize, or be able to ascertain using no more than routine experimentation, many equivalents to the specific embodiments of the invention described herein. Such equivalents are intended to be encompassed by the claims.

The experimental methods in the following examples are conventional methods unless otherwise specified. The instruments used in the following examples are laboratory conventional instruments unless otherwise specified; the test materials used in the examples described below, unless otherwise specified, were purchased from conventional biochemical reagent stores.

Example 1 construction of a machine learning model to identify misclassified genomic sequences

1. Training set construction

(1) Genomic sequences of the reference species were downloaded from the NCBI database, with at most 5 assembled sequences (assembly level "Complete Genome" or "Chromosome") downloaded per species, giving a total of 4258 genomic sequences from 1000 species. MetaQUAST was used to evaluate whether the classification of each assembled genomic sequence was accurate; a value of 1 was assigned if so and 0 if not. Of these, 4087 genomes were classified accurately and 171 genomes were misclassified.

(2) For each genome sequence, paired-end sequencing reads of the HiSeq 2500 platform (reads1 and reads2) were simulated with art_illumina, with the read length set to 100 bp; all reads1 were merged into simulated reads1, and all reads2 were merged into simulated reads2.

(3) Each genome sequence was broken into fragments of 8000 bp length with a 7500 bp step to obtain a contigs sequence set;

(4) Using bowtie2, the simulated reads1 and simulated reads2 generated in step (2) were aligned to each contigs sequence set in paired-end mode to obtain the alignment result at each position on each contig. Sorted BAM files were generated and indexed;
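Step (4) can be sketched as a sequence of command lines. The helper below only assembles the commands and does not run them; the file names, the thread count, and the use of samtools for sorting and indexing are assumptions, since the text names bowtie2 but not the tool that produced the sorted BAM files.

```python
import shlex


def alignment_commands(contigs_fasta, reads1, reads2, prefix, threads=4):
    """Build the commands for paired-end alignment, BAM sorting, and indexing.

    Returns a list of argument lists suitable for subprocess.run. All paths
    are hypothetical placeholders chosen by the caller.
    """
    index = f"{prefix}.idx"
    sam = f"{prefix}.sam"
    bam = f"{prefix}.sorted.bam"
    return [
        ["bowtie2-build", contigs_fasta, index],          # index the contigs
        ["bowtie2", "-p", str(threads), "-x", index,      # paired-end alignment
         "-1", reads1, "-2", reads2, "-S", sam],
        ["samtools", "sort", "-o", bam, sam],             # coordinate-sorted BAM
        ["samtools", "index", bam],                       # BAM index
    ]


for cmd in alignment_commands("contigs.fa", "sim_reads1.fq", "sim_reads2.fq", "genome1"):
    print(shlex.join(cmd))
```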

(5) For each contig sequence, an 8000×11 matrix representing the alignment result at each position on the contig was constructed as the feature value, with OneHotEncoder used for encoding. The vector of length 11 at each position consists of: whether the position on the reference genome is A, T, C, or G (1 if so, 0 otherwise); the number of reads whose detected genotype at the position is A, T, C, or G; the read coverage depth at the position; the number of discordant reads at the position; and the number of SNPs detected at the position. As shown in Table 1:

Table 1. Feature value matrix
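The length-11 vector described in step (5) can be assembled per position as in the sketch below. The column order (one-hot reference base, per-base read counts, depth, discordant reads, SNP count) follows the order in which the text lists the features and is otherwise an assumption, as are the function names and the input layout.

```python
BASES = "ATCG"


def feature_row(ref_base, read_counts, depth, discordant, snps):
    """Return the 11 features for one position.

    ref_base: reference base at the position.
    read_counts: dict mapping a base to the number of reads supporting it.
    depth, discordant, snps: per-position integers from the alignment.
    """
    one_hot = [1 if ref_base.upper() == b else 0 for b in BASES]
    counts = [read_counts.get(b, 0) for b in BASES]
    return one_hot + counts + [depth, discordant, snps]


def feature_matrix(contig, per_position):
    """Stack one feature row per contig position, giving an L x 11 matrix."""
    return [feature_row(base, *stats) for base, stats in zip(contig, per_position)]


# Two-position toy contig with made-up pileup statistics.
matrix = feature_matrix("AC", [({"A": 10}, 10, 0, 0), ({"C": 8, "T": 2}, 10, 1, 1)])
for row in matrix:
    print(row)
```

For a real contig of 8000 positions, the same function yields the 8000×11 matrix used as model input.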

2. Model training

Using the constructed training data set, the TensorFlow machine learning toolkit for Python was called to construct a convolutional neural network model.

(1) Keras was called to define a Sequential model, and the training data set was fed into the input layer;

(2) Convolutional layers and batch normalization (BN) layers (axis=-1) were added in sequence with the following parameters:

Convolutional layer	Output size
Conv 2×11×8, stride 1 + ReLU + BN	9999×1×8
Conv 2×1×16, stride 2 + ReLU + BN	4999×1×16
Conv 2×1×32, stride 2 + ReLU + BN	2499×1×32
Conv 2×1×64, stride 2 + ReLU + BN	1249×1×64
Conv 2×1×64, stride 2 + ReLU + BN	624×1×128

(3) A pooling layer of size (50, 1) was added;

(4) A Flatten layer was added with default parameters.
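The output lengths listed for the convolutional layers can be checked with a short calculation. The sketch assumes unpadded ("valid") convolutions, for which the output length is floor((n - kernel) / stride) + 1; the padding mode is not stated in the text, and the helper names are hypothetical.

```python
def conv_output_length(n: int, kernel: int, stride: int) -> int:
    """Output length of an unpadded 1-D convolution over n positions."""
    return (n - kernel) // stride + 1


def stack_output_lengths(input_length: int, layers) -> list[int]:
    """Apply conv_output_length layer by layer; layers is a list of (kernel, stride)."""
    lengths = []
    n = input_length
    for kernel, stride in layers:
        n = conv_output_length(n, kernel, stride)
        lengths.append(n)
    return lengths


# Kernel height 2 throughout; stride 1 for the first layer, 2 afterwards.
CONV_LAYERS = [(2, 1), (2, 2), (2, 2), (2, 2), (2, 2)]

print(stack_output_lengths(10000, CONV_LAYERS))  # [9999, 4999, 2499, 1249, 624]
print(stack_output_lengths(8000, CONV_LAYERS))   # [7999, 3999, 1999, 999, 499]
```

Under this assumption the listed output sizes (9999 down to 624) correspond to an input of 10000 positions, the upper end of the range given for L, whereas an 8000-position contig would give the second sequence of lengths.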

3. Model testing

One reference species not included in the training data set was selected, 1000 genome sequence files of that species were randomly downloaded from NCBI, and feature value matrices were extracted by the same method used to construct the training set. The data were input into the constructed model to calculate scores. In parallel, evaluation with MetaQUAST found that 3.8% of the genomes were misclassified. The genomes in the top 96.2% by score were therefore taken as the correctly classified set and the remainder as the misclassified set. The final model reached a recall of 89.47%.
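The evaluation above amounts to flagging the lowest-scoring fraction of genomes and measuring how many truly misclassified genomes fall in that set. A minimal sketch on toy data follows; the function names and the numbers are illustrative and are not the test data of this example.

```python
def flag_lowest_fraction(scores, fraction):
    """Return the indices of the lowest-scoring `fraction` of genomes."""
    n_flag = round(len(scores) * fraction)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return set(order[:n_flag])


def recall(predicted, actual):
    """Share of truly misclassified genomes that were flagged."""
    return len(predicted & actual) / len(actual) if actual else 0.0


# Toy data: 10 genomes, 20% flagged, two genomes truly misclassified.
toy_scores = [0.90, 0.10, 0.80, 0.30, 0.95, 0.20, 0.85, 0.70, 0.99, 0.60]
flagged = flag_lowest_fraction(toy_scores, 0.2)
print(sorted(flagged), recall(flagged, {1, 3}))
```

In the example above the flagged fraction would be 0.038, matching the share of misclassified genomes reported by MetaQUAST.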

EXAMPLE 2 construction of high accuracy genome database of Mycobacterium tuberculosis Complex

The model constructed in Example 1 was used to evaluate whether Mycobacterium tuberculosis genomes are misclassified. The feature data set of each assembled Mycobacterium tuberculosis genome was extracted by the same method used to construct the training set in Example 1 and input into the model constructed in Example 1 to obtain a score, and whether the classification of the assembled genome under evaluation is correct was judged from that score.

1. Genome assembly sequence download

The taxid 77643 corresponding to the Mycobacterium tuberculosis complex was entered into the NCBI Taxonomy database, all genomes with assembly level "Complete Genome" or "Chromosome" were downloaded, and screening yielded 608 assembled genome sequences of the Mycobacterium tuberculosis complex.

2. Simulated reads generation

For the 608 assembled genomic sequences, paired-end sequencing reads of the HiSeq 2500 platform (reads1 and reads2) were simulated with art_illumina. The read length was set to 100 bp, the insert length to 270 bp, and the length deviation to 50 bp; all reads1 were then merged into simulated reads1, and all reads2 were merged into simulated reads2.
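The simulation settings above map onto an art_illumina command line roughly as sketched below. The helper only builds the argument list; the file names are placeholders, and the fold coverage is a required argument here because the text does not state which coverage was simulated.

```python
def art_illumina_command(genome_fasta, out_prefix, fold_coverage,
                         read_length=100, mean_fragment=270, fragment_sd=50):
    """Build an art_illumina call for paired-end HiSeq 2500 reads.

    fold_coverage is an assumption left to the caller; the read length,
    insert length, and length deviation default to the values in the text.
    """
    return [
        "art_illumina",
        "-ss", "HS25",               # HiSeq 2500 sequencing system profile
        "-p",                        # paired-end simulation
        "-i", genome_fasta,          # input genome sequence
        "-l", str(read_length),      # read length in bp
        "-f", str(fold_coverage),    # fold coverage (not given in the text)
        "-m", str(mean_fragment),    # mean fragment (insert) length
        "-s", str(fragment_sd),      # fragment length standard deviation
        "-o", out_prefix,            # output file prefix
    ]


print(" ".join(art_illumina_command("genome_001.fa", "genome_001_sim", fold_coverage=20)))
```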

3. Genome disruption

Breaking 608 genome sequences according to 8000bp length and 7500bp step length to obtain 608 contigs sequence sets (each genome sequence corresponds to one contigs sequence set); each contigs sequence set includes 588 contigs on average.

4. Reads alignment

Using bowtie2, the simulated reads1 and simulated reads2 generated in step 2 were aligned to each of the 608 contigs sequence sets in paired-end mode to obtain the alignment result at each position; sorted BAM files were generated and indexed.

5. Feature dataset extraction

For each genome sequence, the feature information at each position of each contig sequence was calculated from the BAM file. For a given position, this comprises: whether the position on the reference genome is A, T, C, or G; the number of reads whose detected genotype at the position is A, T, C, or G; the read coverage depth at the position; the number of discordant reads at the position; and the number of SNPs detected at the position. Together these form the feature data set.

6. Model prediction and misclassification identification

The 608 feature data sets were input into the model trained in Example 1, and a score, namely the contig score, was calculated for each contig sequence; for each assembled genome, the upper quartile of the scores of all its contigs was taken as the assembled genome value.
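Aggregating contig scores into one value per genome can be sketched with the standard library. The quartile convention (linear interpolation, `method="inclusive"`) is an assumption, because the text does not say how the upper quartile was computed; the function name is hypothetical.

```python
import statistics


def assembled_genome_value(contig_scores):
    """Return the upper quartile (75th percentile) of a genome's contig scores."""
    if len(contig_scores) < 2:
        raise ValueError("at least two contig scores are required")
    # quantiles(n=4) returns the three quartile cut points; index 2 is Q3.
    return statistics.quantiles(contig_scores, n=4, method="inclusive")[2]


# Toy contig scores for one genome.
print(assembled_genome_value([0.2, 0.4, 0.6, 0.8, 1.0]))
```

Taking an upper quantile rather than the mean makes the genome value less sensitive to a minority of low-scoring contigs.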

The assembled genome values were sorted in ascending order and plotted as a curve. The assembled genome value at the inflection point (the point from which the slope of the curve starts to decrease and at which the score approaches its peak) was set as the threshold, and all assembled genomes with values lower than the threshold were judged to be misclassified. In the end, 561 high-accuracy Mycobacterium tuberculosis complex genome sequences were obtained, as shown in FIG. 1.
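One way to locate such an inflection point automatically is to take the point of the sorted curve that lies farthest above the straight line joining its first and last points. This knee-finding heuristic is an assumption introduced here for illustration; the text does not state how the inflection point was determined, and the function names and toy values are hypothetical.

```python
def knee_index(sorted_values):
    """Index of the point farthest above the chord from first to last value."""
    n = len(sorted_values)
    if n < 3:
        raise ValueError("need at least three values to locate a knee")
    first, last = sorted_values[0], sorted_values[-1]
    best_i, best_gap = 0, float("-inf")
    for i, value in enumerate(sorted_values):
        chord = first + (last - first) * i / (n - 1)
        gap = value - chord
        if gap > best_gap:
            best_i, best_gap = i, gap
    return best_i


def split_by_knee(genome_values):
    """Return (threshold, kept, removed) for a dict of genome name -> value."""
    ordered = sorted(genome_values.items(), key=lambda item: item[1])
    values = [value for _, value in ordered]
    threshold = values[knee_index(values)]
    kept = [name for name, value in ordered if value >= threshold]
    removed = [name for name, value in ordered if value < threshold]
    return threshold, kept, removed


toy = {"g1": 0.10, "g2": 0.30, "g3": 0.90, "g4": 0.92, "g5": 0.94, "g6": 0.95}
print(split_by_knee(toy))
```

Genomes below the threshold are the ones treated as misclassified and excluded from the database.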

Example 3 Mycobacterium tuberculosis Complex identification

Twenty samples in which reads specifically aligned to the Mycobacterium tuberculosis complex had been detected at low abundance were each compared against a reference genome database (database 1) built from all 608 genome sequences in Example 2 and a reference genome database (database 2) built from the 561 genome sequences remaining after removal of the misclassified genome sequences, and the number of specifically detected reads was counted in each case. PCR results (F: AAACACAAGGAGCCGACAAC; R: CATACCAGGACGCCTTGC) served as the comparison, as shown in Table 2:

Table 2. Number of Mycobacterium tuberculosis complex-specific reads detected using different genome databases

According to the statistical results of Table 2, after the genome sequence of the Mycobacterium tuberculosis complex determined to be wrongly classified by the machine learning model is removed, the false positive rate is greatly reduced and the detection accuracy is improved under the condition that the true positive rate is unchanged.

All documents mentioned in this application are incorporated by reference as if each were individually incorporated by reference. Further, it will be appreciated that various changes and modifications may be made by those skilled in the art after reading the above teachings, and such equivalents are intended to fall within the scope of the claims appended hereto.

Claims

1. A method of constructing a machine learning model for identifying assembled genomic classification errors, comprising the steps of:

S2, randomly generating reads of length K from each assembled genome sequence; breaking each assembled genome sequence into fragments of length L with step length N to obtain a contigs sequence set for each, wherein K=75-500, L=1000-10000 and N=1-L;

2. The method of claim 1, wherein in step S1, the assembled genomic sequence is assembled from high-throughput sequencing data; in step S2, paired-end sequencing reads, namely reads1 and reads2, are randomly generated from each assembled genome sequence, the reads1 of all assembled genome sequences are merged into simulated reads1, and the reads2 of all assembled genome sequences are merged into simulated reads2; in step S3, the paired-end reads are aligned to each contigs sequence set in paired-end alignment mode.

3. The method of claim 1, wherein in step S1, whether a classification is correct or wrong is evaluated using MetaQUAST.

4. The method of claim 1, wherein the machine learning model is constructed based on a neural network algorithm.

5. The method of claim 4, wherein the machine learning model is constructed based on a convolutional neural network algorithm.

6. A method for identifying species assembled genome sequence classification errors, characterized in that a characteristic data set of each assembled genome sequence to be identified is constructed by using the steps S2-S3 of claim 1 and is input into a machine learning model constructed by claim 1, so as to judge whether the classification of each assembled genome sequence to be identified is wrong.

7. A system for identifying species-assembled genomic sequence classification errors, comprising the following modules:

the error recognition module is respectively connected with the data input module and the data storage module and is used for constructing a machine learning model according to the characteristic data set of the assembled genome sequences and the information of whether the classification is wrong, inputting the characteristic data set of each assembled genome sequence to be recognized into the model and judging whether the classification of the assembled genome sequence to be recognized is wrong;

wherein the feature data set is constructed according to steps S2-S3 of claim 1.

8. The system of claim 7, wherein the result output module is further coupled to the data storage module for inputting the feature data set for each assembled genomic sequence to be identified and the identification result to the data storage module.

9. A computer device, comprising:

a memory for storing a computer program;

a processor for implementing the steps of the method according to any one of claims 1 to 5 when executing said computer program.

10. A computer-readable storage medium comprising,

the computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the method according to any of claims 1-5.