CN115910216A - Method and system for identifying genome sequence classification errors based on machine learning - Google Patents

Info

Publication number
CN115910216A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211537778.7A
Other languages
Chinese (zh)
Other versions
CN115910216B (en)
Inventor
陈燕君
王涛
肖姗姗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Ruipu Medical Laboratory Co ltd
Hangzhou Repugene Technology Co ltd
Original Assignee
Hangzhou Ruipu Medical Laboratory Co ltd
Hangzhou Repugene Technology Co ltd
Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30: Computing systems specially adapted for manufacturing

Abstract

The invention discloses a method and a system for identifying genome sequence classification errors based on machine learning, and belongs to the technical field of bioinformatics. The invention also discloses a method for constructing a machine learning model for identifying classification errors in assembled genomes, which comprises the following steps: obtaining assembled genome sequences of a plurality of species that have a reference genome; randomly generating reads from each assembled genome sequence, and fragmenting each assembled genome sequence to obtain a contigs sequence set; aligning the reads to each contigs sequence set to obtain alignment parameters for each position of each contig, and constructing a feature data set; and constructing a machine learning model from the feature data sets of all the assembled genome sequences together with the information on whether each is misclassified. The method and the system can accurately determine whether an assembled genome sequence is correctly classified. After the misclassified assembled genomes are deleted, using the remaining high-quality assembled genome sequences as a reference database effectively reduces false positives in the detection of actual samples.

Description

Method and system for identifying genome sequence classification errors based on machine learning
Technical Field
The invention belongs to the technical field of bioinformatics, and particularly relates to a method and a system for identifying genome sequence classification errors based on machine learning.
Background
In the field of bioinformatics, after genome sequences are classified, the classification often needs to be checked for correctness. The method used by the National Center for Biotechnology Information (NCBI) to check genome sequence classification is as follows: the genome sequence under test is aligned to the reference genome sequence of the species, the base similarity of the homologous fragments of the two genome sequences is calculated, and an identity above 96% together with a coverage above 80% is used as the threshold for judging whether the genome sequence under test has been assigned to the wrong species.
However, this method has the following drawbacks or disadvantages:
(1) For each species, a reference genomic sequence is required; if the genome sequence to be detected belongs to the species without the reference, the judgment cannot be carried out;
(2) The judgment threshold is a fixed value, so judgment errors may occur for different types of species; for example, viruses that evolve faster have more variation in their genome sequences, and a lower base similarity threshold should be used for them;
(3) Strains with more variation from the reference genome may be judged as misclassified genomes, i.e., misjudgment may occur.
QUAST is a widely used software tool for evaluating genome assembly results (Alexey Gurevich, Vladislav Saveliev, Nikolay Vyahhi & Glenn Tesler. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013, 29). MetaQUAST is a modified version of QUAST (Mikheenko A, Saveliev V, Gurevich A. MetaQUAST: evaluation of metagenome assemblies. Bioinformatics. 2016 Apr 1 (7)) and is a relatively advanced tool that evaluates genome assemblies by aligning contigs to a reference; however, the tool requires the reference genome of the species as input when performing the evaluation.
Therefore, there is a need in the art for a precise method for identifying whether there is an error in the classification of genomic sequences.
Disclosure of Invention
In order to solve at least one of the above technical problems, the technical solution adopted by the present invention is as follows:
In a first aspect, the invention provides a method of constructing a machine learning model for identifying classification errors in assembled genomes, comprising the following steps:
S1, obtaining assembled genome sequences of a plurality of species that have reference genomes, wherein the assembled genome sequences comprise correctly classified assembled genome sequences and incorrectly classified assembled genome sequences;
S2, randomly generating reads of length K from each assembled genome sequence; fragmenting each assembled genome sequence with a fragment length of L and a step size of N to obtain a contigs sequence set for each, wherein K = 75-500, L = 5000-10000, and N = 1-L;
S3, aligning the simulated reads to each contigs sequence set to obtain the following parameters for each position of each contig: whether the base at the position is A, T, C, or G; the numbers of reads whose detected genotype at the position is A, T, C, and G; the read coverage depth at the position; and the number of discordant reads at the position, thereby obtaining an L × 11 matrix as the feature value, wherein the feature values of all contigs in the contigs sequence set of an assembled genome sequence form the feature data set of that assembled genome sequence;
S4, constructing a machine learning model using the feature data sets of all the assembled genome sequences and the information on whether each is misclassified.
In some embodiments of the invention, K = 200, L = 8000, and N = 7500, which more closely approximates real high-throughput sequencing results and contig assembly results.
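The fragmentation in step S2 can be illustrated with a short sliding-window sketch. It assumes that only full-length windows are kept, so a trailing remainder shorter than L is dropped; the text does not say how the remainder is handled, and the function name `fragment_genome` is illustrative only.

```python
def fragment_genome(sequence, length=8000, step=7500):
    """Cut `sequence` into windows of `length` bases, advancing `step` bases each time.

    Assumption: windows shorter than `length` at the end of the sequence are
    discarded, so every returned contig has exactly `length` bases.
    """
    if length <= 0 or step <= 0:
        raise ValueError("length and step must be positive")
    contigs = []
    # The last valid start is len(sequence) - length; an empty range yields no contigs.
    for start in range(0, len(sequence) - length + 1, step):
        contigs.append(sequence[start:start + length])
    return contigs


if __name__ == "__main__":
    toy_genome = "ACGT" * 5  # 20 bases, a stand-in for a real assembly
    for contig in fragment_genome(toy_genome, length=8, step=6):
        print(contig)
```

With L = 8000 and N = 7500, consecutive contigs overlap by L - N = 500 bases.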
In some embodiments of the invention, in step S1, the assembled genome sequences are assembled from high-throughput sequencing data; correspondingly, in step S2, paired-end sequencing reads are randomly generated for each assembled genome sequence, preferably by simulating paired-end sequencing reads of the HiSeq 2500 platform with art_illumina, denoted reads1 and reads2 respectively; the reads1 of all assembled genome sequences are merged into simulated reads1, and the reads2 of all assembled genome sequences are merged into simulated reads2; accordingly, in step S3, the reads are aligned to the contigs sequence sets in paired-end alignment mode.
In some embodiments of the invention, in step S1, the assembled genomic sequences of all reference species are obtained. Preferably, if there are too many assembled genomic sequences of a species, for example more than 5, only 5 of them are retained to improve the efficiency of modeling.
In some embodiments of the invention, in step S1, whether a classification is correct or erroneous is evaluated using MetaQUAST.
In some embodiments of the invention, the machine learning model is constructed based on a neural network algorithm.
In some embodiments of the invention, the machine learning model is constructed based on a convolutional neural network algorithm.
Further, the machine learning model outputs an assembled genome score for the assembled genome sequence to be identified based on the input. If the assembled genome score is lower than a preset threshold, the assembled genome sequence to be identified is misclassified; if the score is not lower than the preset threshold, it is correctly classified. In some embodiments of the invention, the preset threshold is the inflection point value of a curve plotted from the scores of a plurality of assembled genome sequences of the same species. In some embodiments of the invention, the inflection point is the point at which the slope of the curve begins to decrease; the assembled genome score at the inflection point is close to the peak, and the growth of the curve slows from the inflection point onward. In some embodiments of the invention, the preset threshold is a representative value of the inflection points of a plurality of species of the same genus, the representative value being a statistic such as the mean, median, or mode.
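The thresholding rule described above can be sketched as follows. The sketch uses the median as the representative value (the text equally allows the mean or the mode), and the function names, genome names, and numbers are hypothetical, chosen only to show the comparison "score lower than threshold means misclassified".

```python
from statistics import median


def representative_threshold(inflection_values):
    """Combine per-species inflection point values into one preset threshold.

    Assumption: the median is used here; the mean or mode are equally allowed.
    """
    return median(inflection_values)


def classify(genome_scores, threshold):
    """Label each assembled genome: below the threshold is misclassified."""
    return {
        name: ("correct" if score >= threshold else "misclassified")
        for name, score in genome_scores.items()
    }


if __name__ == "__main__":
    # Hypothetical inflection point values from three species of one genus.
    threshold = representative_threshold([0.80, 0.90, 0.85])
    # Hypothetical assembled genome scores.
    print(classify({"genome_a": 0.85, "genome_b": 0.50}, threshold))
```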
The second aspect of the present invention provides a method for identifying species assembly genome sequence classification errors, which comprises the steps of S2-S3 of the first aspect of the present invention, constructing a feature data set of each assembly genome sequence to be identified, and inputting the feature data set into a machine learning model constructed by the first aspect of the present invention, thereby determining whether the classification of each assembly genome sequence to be identified is erroneous.
The method for identifying classification errors in species assembled genome sequences can, for a species that has a plurality of assembled genome sequences, identify which of them are misclassified. Using the correctly classified assembled genome sequences as the genome library of the species gives more accurate results in alignment analysis.
In some embodiments of the invention, all assembled genomic sequences from a species are obtained and the corresponding assembled genomic scores are obtained using the machine learning model, respectively. The assembled genomic sequences with assembled genomic scores below a predetermined threshold are removed, and the remaining assembled genomic sequences can be used to construct a reference gene database for the species.
In some embodiments of the invention, all assembled genome sequences of a species are obtained and the corresponding assembled genome scores are obtained using the machine learning model. A curve is plotted from the assembled genome scores; the assembled genome score at the inflection point of the curve is the inflection point value, and any assembled genome whose score is lower than the inflection point value is judged to be misclassified and is not used to construct the database.
The third aspect of the invention provides a system for identifying species assembly genome sequence classification errors, which comprises the following modules:
the data input module is used for obtaining a characteristic data set of each assembled genome sequence to be identified;
the data storage module is used for storing characteristic data sets of the assembled genome sequences of a plurality of reference species and information whether the classification is wrong;
the error identification module is respectively connected with the data input module and the data storage module and is used for constructing a machine learning model according to the characteristic data sets of the assembled genome sequences and the information whether the classification is wrong or not, inputting the characteristic data sets of the assembled genome sequences to be identified into the model and judging whether the classification of the assembled genome sequences to be identified is wrong or not;
the result output module is connected with the error identification module and used for outputting an identification result;
wherein the feature data set is constructed according to steps S2-S3 of the first aspect of the invention.
In some embodiments of the invention, the data storage module stores, for each reference species, the feature data sets of a plurality of assembled genome sequences and the information on whether each is misclassified.
In some embodiments of the invention, the result output module is further connected to the data storage module for inputting the feature data set of the assembled genomic sequence to be identified and the identification result to the data storage module. Therefore, the data size for constructing the model can be continuously increased, and the accuracy of model identification is further improved.
A fourth aspect of the present invention provides a computer device comprising: a memory for storing a computer program; a processor for implementing the steps of the method according to any one of the first aspect of the invention when executing the computer program.
A fifth aspect of the invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method according to any one of the first aspect of the invention.
In a sixth aspect, the invention provides a method for constructing a species genome database, comprising obtaining the reference genome sequence and all assembled genome sequences of the species, and identifying and deleting the misclassified assembled genome sequences using the method of the second aspect or the system of the third aspect. The retained assembled genome sequences are used to construct the species genome database. The core foundation of pathogen metagenomics is a genome database of pathogenic microorganisms, and whether the genome assembly sequences obtained from public data are misclassified directly affects the results of downstream pathogen identification. A database obtained by this method for constructing a species genome database can improve detection accuracy.
Advantageous Effects of the Invention
Compared with the prior art, the invention has the following beneficial effects:
With the method and system of the invention, whether an assembled genome sequence is correctly classified can be accurately determined. Taking the assembled genomes of the Mycobacterium tuberculosis complex as an example, identification with the method and system of the invention found that about 7% of the genome sequences do not belong to Mycobacterium tuberculosis and may contain sequence contamination or assembly errors.
After the misclassified assembled genomes are deleted using the method and system, using the remaining correctly classified assembled genome sequences as a reference database effectively reduces false positive detections in actual sample testing. Taking the 20 low-abundance tuberculosis samples used in the invention as an example, 80% of the samples in which Mycobacterium tuberculosis was detected were found to be false positives.
Drawings
FIG. 1 shows the assembled genome scores, obtained using the machine learning model of the invention, of 608 assembled genomes of the Mycobacterium tuberculosis complex.
Detailed Description
Unless otherwise indicated, implied from the context, or customary in the art, all parts and percentages herein are by weight and the testing and characterization methods used are synchronized with the filing date of the present application. Where applicable, the contents of any patent, patent application, or publication referred to in this application are hereby incorporated by reference in their entirety, and the equivalent family of patents is also incorporated by reference, especially with respect to the definitions of relevant terms in the art, as disclosed in these documents. To the extent that a definition of a particular term disclosed in the prior art is inconsistent with any definitions provided herein, the definition of the term provided herein controls.
The numerical ranges in this application are approximations, and thus may include values outside of the ranges unless otherwise specified. A numerical range includes all numbers from the lower value to the upper value, in increments of 1 unit, provided that there is a separation of at least 2 units between any lower value and any higher value. For ranges containing a numerical value less than 1 or containing a fraction greater than 1 (e.g., 1.1,1.5, etc.), then 1 unit is considered to be 0.0001,0.001,0.01, or 0.1, as appropriate. For ranges containing single digit numbers less than 10 (e.g., 1 to 5), 1 unit is typically considered 0.1. These are merely specific examples of what is intended to be expressed and all possible combinations of numerical values between the lowest value and the highest value enumerated are to be considered to be expressly stated in this application.
In order to make the technical problems, technical solutions and advantageous effects solved by the present invention more apparent, the present invention is further described in detail below with reference to the following embodiments.
Examples
The following examples are used herein to demonstrate preferred embodiments of the invention. It will be appreciated by those of skill in the art that the techniques disclosed in the examples which follow represent techniques discovered by the inventor to function in the invention, and thus can be considered to constitute preferred modes for its practice. Those of skill in the art should, in light of the present disclosure, appreciate that many changes can be made in the specific embodiments which are disclosed and still obtain a like or similar result without departing from the spirit or scope of the invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs and the disclosures and references cited herein and the materials to which they refer are incorporated by reference.
Those skilled in the art will recognize, or be able to ascertain using no more than routine experimentation, many equivalents to the specific embodiments of the invention described herein. Such equivalents are intended to be encompassed by the following claims.
The experimental procedures in the following examples are conventional unless otherwise specified. The instruments used in the following examples, unless otherwise specified, were all conventional laboratory instruments; the test materials used in the following examples were purchased from a conventional biochemical reagent store unless otherwise specified.
Example 1 construction of a machine learning model to identify misclassified genomic sequences
1. Training set construction
(1) The genome sequences of the reference species were downloaded from the NCBI database, with up to 5 assembled sequences (assembly level "Complete Genome" or "Chromosome") downloaded per species, giving 4258 genome sequences for 1000 species. The classification of each assembled genome sequence was evaluated for accuracy using MetaQUAST and assigned a value of 1 if accurate and 0 if not. Of these, 4087 genomes were classified accurately and 171 genomes were classified incorrectly.
(2) For each genome sequence, paired-end sequencing reads of the HiSeq 2500 platform, denoted reads1 and reads2, were simulated with art_illumina, with the read length set to 100 bp; all reads1 were merged into simulated reads1 and all reads2 were merged into simulated reads2.
(3) Each genome sequence was fragmented with a fragment length of 8000 bp and a step size of 7500 bp to obtain a contigs sequence set for each;
(4) Using bowtie2, the simulated reads1 and simulated reads2 generated in step (2) were aligned to each contigs sequence set in paired-end alignment mode to obtain the alignment result at each position of each contig. Sorted BAM files were generated and indexed;
(5) For each contig sequence, an 8000 × 11 matrix was constructed as the feature value and one-hot encoded to represent the alignment result at each position of the contig. The vector of length 11 consists of: whether the position on the reference genome is A, T, C, or G (1 if so, 0 if not); the numbers of reads whose detected genotype at the position is A, T, C, and G; the read coverage depth at the position; the number of discordant reads at the position; and the number of SNPs detected at the position, as shown in Table 1:
TABLE 1 Feature value matrix
[Table 1 appears as an image in the original publication: Figure BDA0003975951860000071]
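The 11 values per position described in step (5) can be assembled as in the following sketch. The column order (four one-hot base indicators, four per-base read counts, depth, discordant reads, SNPs) and the choice to compute depth as the sum of the four per-base counts are assumptions made for illustration, since Table 1 is only available as an image; the function and argument names are hypothetical.

```python
BASES = "ATCG"  # assumed column order for both the one-hot and the count columns


def build_feature_matrix(contig, base_counts, discordant, snps):
    """Return a list of 11-element rows, one row per contig position.

    contig      -- reference bases of the contig, e.g. "ACGT..."
    base_counts -- one dict per position mapping base -> number of supporting reads
    discordant  -- number of discordant reads per position
    snps        -- number of SNPs detected per position
    """
    if not (len(contig) == len(base_counts) == len(discordant) == len(snps)):
        raise ValueError("all inputs must describe the same number of positions")
    matrix = []
    for base, counts, n_disc, n_snp in zip(contig.upper(), base_counts, discordant, snps):
        one_hot = [1 if base == b else 0 for b in BASES]  # all zeros for N or other symbols
        per_base = [counts.get(b, 0) for b in BASES]
        depth = sum(per_base)  # assumption: depth counts only A/T/C/G calls
        matrix.append(one_hot + per_base + [depth, n_disc, n_snp])
    return matrix


if __name__ == "__main__":
    rows = build_feature_matrix(
        "ACGN",
        [{"A": 5}, {"C": 4, "G": 1}, {"G": 3}, {"A": 1, "T": 1, "C": 1, "G": 1}],
        [0, 1, 0, 2],
        [0, 1, 0, 0],
    )
    for row in rows:
        print(row)
```

For an 8000 bp contig the result has 8000 rows of 11 values, matching the 8000 × 11 matrix in the text.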
2. Model training
Using the constructed training data set, the Python machine learning toolkit TensorFlow was called to construct a convolutional neural network model.
(1) Keras was called to define a Sequential model, and the training data set was fed to the input layer;
(2) Convolutional layers and BN (batch normalization) layers (axis = -1) were added in sequence, with the following parameters:
Convolutional layer                     Output size
Conv 2×11×8, stride 1 + ReLU + BN       9999×1×8
Conv 2×1×16, stride 2 + ReLU + BN       4999×1×16
Conv 2×1×32, stride 2 + ReLU + BN       2499×1×32
Conv 2×1×64, stride 2 + ReLU + BN       1249×1×64
Conv 2×1×64, stride 2 + ReLU + BN       624×1×128
(3) A pooling layer of size (50, 1) was added;
(4) A Flatten layer was added with default parameters.
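The first dimension of each output size in the layer table follows the usual length formula for an unpadded convolution, floor((n - k) / s) + 1. The sketch below works through that arithmetic. It assumes "valid" (no) padding and a kernel height of 2 in every layer, neither of which the text states explicitly, and the function names are illustrative.

```python
def conv_output_length(input_length, kernel=2, stride=1):
    """Output length of an unpadded ("valid") convolution along one axis."""
    return (input_length - kernel) // stride + 1


def stack_output_lengths(input_length, strides=(1, 2, 2, 2, 2), kernel=2):
    """Apply the five convolutional layers of the table in order."""
    lengths = []
    current = input_length
    for stride in strides:
        current = conv_output_length(current, kernel, stride)
        lengths.append(current)
    return lengths


if __name__ == "__main__":
    print(stack_output_lengths(10000))  # [9999, 4999, 2499, 1249, 624]
    print(stack_output_lengths(8000))   # [7999, 3999, 1999, 999, 499]
```

Under these assumptions, the lengths listed in the table (9999, 4999, 2499, 1249, 624) are what the formula gives for an input of 10000 positions; for an 8000-position contig the same stack would give 7999, 3999, 1999, 999, and 499.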
3. Model testing
A reference species not included in the training data set was selected, 1000 genome sequence files of the species were randomly downloaded from NCBI, and the feature value matrices were extracted following the method used to construct the training set. The data were input into the constructed model to calculate scores. In parallel, evaluation with MetaQUAST found that 3.8% of the genomes were misclassified. The top 96.2% of genomes by score were therefore taken as the correctly classified set and the rest as the misclassified set. The final recall of the model reached 89.47%.
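The evaluation above can be sketched as follows: the lowest-scoring fraction of genomes is flagged as misclassified, and recall is the share of truly misclassified genomes (per MetaQUAST) that end up flagged. Treating the misclassified genomes as the positive class is an assumption, as the text does not define recall further; the function name and the toy data are hypothetical.

```python
def recall_for_misclassified(scores, is_misclassified, fraction_flagged):
    """Flag the lowest-scoring fraction of genomes and compute recall.

    scores            -- model score per genome
    is_misclassified  -- reference label per genome (True = misclassified)
    fraction_flagged  -- share of genomes to flag, e.g. 0.038
    """
    n_flag = round(len(scores) * fraction_flagged)
    order = sorted(range(len(scores)), key=lambda i: scores[i])  # ascending by score
    flagged = set(order[:n_flag])
    true_positives = sum(1 for i in flagged if is_misclassified[i])
    total_positives = sum(is_misclassified)
    return true_positives / total_positives if total_positives else float("nan")


if __name__ == "__main__":
    # Toy data only: ten genomes, two of them truly misclassified.
    toy_scores = [0.1, 0.2, 0.9, 0.8, 0.3, 0.95, 0.7, 0.85, 0.6, 0.99]
    toy_labels = [True, False, False, False, True, False, False, False, False, False]
    print(recall_for_misclassified(toy_scores, toy_labels, 0.2))  # 0.5
```

As a plausibility check on the reported figure: 3.8% of 1000 genomes is 38, and 34 / 38 rounds to 89.47%, although the text does not state the underlying counts.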
Example 2 construction of highly accurate genomic database of Mycobacterium tuberculosis Complex
The model constructed in Example 1 was used to evaluate whether Mycobacterium tuberculosis genomes are misclassified: the feature data sets of the assembled Mycobacterium tuberculosis genomes were extracted following the training set construction method of Example 1 and input into the model constructed in Example 1 to obtain score values, and whether each assembled genome under evaluation is correctly classified was judged from its score.
1. Genome assembly sequence download
The taxid 77643 corresponding to the Mycobacterium tuberculosis complex was entered into the NCBI Taxonomy database for searching, all genomes with an assembly level of "Complete Genome" or "Chromosome" were downloaded, and after screening the assembled genome sequences of 608 Mycobacterium tuberculosis strains were obtained.
2. Simulated reads generation
For each of the 608 assembled genome sequences, paired-end sequencing reads of the HiSeq 2500 platform, denoted reads1 and reads2, were simulated with art_illumina. The read length was set to 100 bp, the insert length to 270 bp, and the length deviation to 50 bp; all reads1 were then merged into simulated reads1 and all reads2 into simulated reads2.
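One way to express these settings is as an art_illumina command line, built here as an argument list without running it. The mapping of the stated values to options (`-ss HS25` for HiSeq 2500, `-p` for paired-end, `-l` read length, `-m` mean fragment size, `-s` standard deviation) reflects ART's documented options; the fold coverage (`-f`) is not given in the text and is therefore a required, hypothetical argument, as are the file names.

```python
def art_illumina_command(genome_fasta, out_prefix, fold_coverage,
                         read_length=100, mean_fragment=270, fragment_sd=50):
    """Build the argument list for one paired-end art_illumina simulation.

    Assumption: `fold_coverage` must be chosen by the user; the source text
    does not state the simulated coverage.
    """
    return [
        "art_illumina",
        "-ss", "HS25",             # HiSeq 2500 sequencing system profile
        "-i", genome_fasta,        # input assembled genome (FASTA)
        "-p",                      # paired-end simulation
        "-l", str(read_length),    # read length in bp
        "-f", str(fold_coverage),  # fold coverage (hypothetical value)
        "-m", str(mean_fragment),  # mean fragment (insert) length in bp
        "-s", str(fragment_sd),    # standard deviation of fragment length
        "-o", out_prefix,          # output file prefix
    ]


if __name__ == "__main__":
    # File names and the coverage of 10 are placeholders for illustration.
    print(" ".join(art_illumina_command("genome_001.fna", "genome_001_", 10)))
```

The resulting list can be passed to `subprocess.run` once art_illumina is installed; merging the per-genome reads1 and reads2 files is then a plain file concatenation.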
3. Genome disruption
The 608 genome sequences were each fragmented with a fragment length of 8000 bp and a step size of 7500 bp to obtain 608 contigs sequence sets (one per genome sequence); on average, each contigs sequence set contains 588 contigs.
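The reported average of 588 contigs per genome is consistent with simple window arithmetic. The sketch below counts full-length windows only, which is an assumption about how the trailing remainder is handled, and uses the length of the Mycobacterium tuberculosis H37Rv reference genome (4,411,532 bp), a figure taken from outside this document, as a representative genome size.

```python
def count_windows(genome_length, length=8000, step=7500):
    """Number of full-length windows of `length` bases taken every `step` bases."""
    if genome_length < length:
        return 0
    return (genome_length - length) // step + 1


if __name__ == "__main__":
    h37rv_length = 4_411_532  # external reference value, not from this document
    print(count_windows(h37rv_length))  # 588
```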
4. Reads alignment
Using bowtie2, the simulated reads1 and simulated reads2 generated in step 2 were aligned to each of the 608 contigs sequence sets in paired-end alignment mode to obtain the alignment result at each position; sorted BAM files were generated and indexed.
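The alignment step can be written out as a sequence of command lines for one contigs sequence set. The bowtie2 invocations use its standard `-x`, `-1`, `-2`, and `-S` options; using samtools for sorting and indexing is an assumption, since the text names only bowtie2, and all file names are placeholders.

```python
def alignment_commands(contigs_fasta, reads1, reads2, prefix):
    """Return the commands to index, align, sort, and index one contigs set.

    Assumption: samtools performs the sorting and BAM indexing.
    """
    index = f"{prefix}.idx"
    sam = f"{prefix}.sam"
    bam = f"{prefix}.sorted.bam"
    return [
        ["bowtie2-build", contigs_fasta, index],                           # build the contig index
        ["bowtie2", "-x", index, "-1", reads1, "-2", reads2, "-S", sam],  # paired-end alignment
        ["samtools", "sort", "-o", bam, sam],                             # coordinate-sorted BAM
        ["samtools", "index", bam],                                       # BAM index
    ]


if __name__ == "__main__":
    for command in alignment_commands("g1.contigs.fa", "sim_reads1.fq", "sim_reads2.fq", "g1"):
        print(" ".join(command))
```

Each list can be executed with `subprocess.run(command, check=True)` once both tools are installed.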
5. Feature dataset extraction
For each genome sequence, the feature information for each position of each contig sequence was calculated from the BAM file. For a given position, this includes: whether the position on the reference genome is A, T, C, or G; the numbers of reads whose detected genotype at the position is A, T, C, and G; the read coverage depth at the position; the number of discordant reads at the position; and the number of SNPs detected at the position. Together these form the feature data set.
6. Model prediction and misclassification recognition
The 608 feature data sets were input into the model trained in Example 1, and a score value, the contig score, was calculated for each contig sequence; for each assembled genome, the upper-quartile contig score over all of its contigs was taken as the assembled genome score.
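Aggregating contig scores into an assembled genome score can be sketched with the standard library. The sketch reads "upper quartile" as the 75th percentile and uses the "inclusive" interpolation method; both are assumptions, since the text does not specify how the quartile is computed, and the function name is illustrative.

```python
from statistics import quantiles


def assembled_genome_score(contig_scores):
    """Return the upper quartile (75th percentile) of the contig scores."""
    if not contig_scores:
        raise ValueError("at least one contig score is required")
    if len(contig_scores) == 1:
        return contig_scores[0]
    # quantiles(..., n=4) returns the three quartile cut points; index 2 is Q3.
    return quantiles(contig_scores, n=4, method="inclusive")[2]


if __name__ == "__main__":
    # Toy contig scores for one genome.
    print(assembled_genome_score([1, 2, 3, 4, 5]))  # 4.0
```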
The assembled genome scores were sorted in ascending order, and the score at the inflection point (the point from which the slope of the curve begins to decrease and at which the score approaches the peak) was taken as the threshold; all assembled genomes with an assembled genome score below the threshold were judged to be misclassified genomes. Finally, 561 high-accuracy genome sequences of the Mycobacterium tuberculosis complex were obtained, as shown in FIG. 1.
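The text does not say how the inflection point is located, so the following is only one possible heuristic: after sorting the scores in ascending order and rescaling both axes to [0, 1], take the point that lies furthest above the straight line joining the first and last points. The function name and the toy scores are hypothetical.

```python
def knee_threshold(scores):
    """Locate the knee of the ascending score curve (chord-distance heuristic).

    Assumption: this heuristic stands in for the unspecified inflection point
    detection; it is not taken from the source text.
    """
    ordered = sorted(scores)
    n = len(ordered)
    if n < 3:
        raise ValueError("at least three scores are needed to locate a knee")
    lowest, highest = ordered[0], ordered[-1]
    if highest == lowest:
        return lowest  # flat curve: no knee, nothing falls below the threshold
    best_index, best_gap = 0, float("-inf")
    for i, score in enumerate(ordered):
        x = i / (n - 1)                               # rank rescaled to [0, 1]
        y = (score - lowest) / (highest - lowest)     # score rescaled to [0, 1]
        gap = y - x                                   # height above the chord
        if gap > best_gap:
            best_index, best_gap = i, gap
    return ordered[best_index]


if __name__ == "__main__":
    toy_scores = [0.97, 0.1, 0.95, 0.5, 0.9, 0.98, 1.0, 0.99]
    threshold = knee_threshold(toy_scores)
    print(threshold)                                         # 0.9
    print(sorted(s for s in toy_scores if s < threshold))    # [0.1, 0.5]
```

Genomes scoring below the returned value would be the ones removed before building the database.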
Example 3 Mycobacterium tuberculosis Complex identification
For 20 samples in which reads specifically aligned to the Mycobacterium tuberculosis complex had been detected at low abundance, the reads were aligned separately to a reference genome database built from all 608 genome sequences in Example 2 (database 1) and to a reference genome database built from the 561 genome sequences remaining after removal of the misclassified genome sequences (database 2), and the number of specifically detected reads was counted. The results of a PCR method (F: AAACACAAGGAGCGACAAC; R: CATACCAGGACGCCTTGC) were used for comparison, as shown in Table 2:
TABLE 2 detection of specific reads number of Mycobacterium tuberculosis Complex Using different genomic databases
[Table 2 appears as images in the original publication: Figures BDA0003975951860000091 and BDA0003975951860000101]
According to the statistics in Table 2, after the Mycobacterium tuberculosis complex genome sequences judged to be misclassified by the machine learning model of the invention are removed, the false positive rate of detection is greatly reduced and detection accuracy is improved, while the true positive rate is unchanged.
All documents referred to herein are incorporated by reference into this application as if each were individually incorporated by reference. Furthermore, it should be understood that various changes or modifications of the present invention can be made by those skilled in the art after reading the above teachings of the present invention, and these equivalents also fall within the scope of the appended claims of the present application.

Claims (10)

1. A method of constructing a machine learning model for identifying classification errors in assembled genomes, comprising the following steps:
S1, obtaining assembled genome sequences of a plurality of species that have reference genomes, wherein the assembled genome sequences comprise correctly classified assembled genome sequences and incorrectly classified assembled genome sequences;
S2, randomly generating reads of length K from each assembled genome sequence; fragmenting each assembled genome sequence with a fragment length of L and a step size of N to obtain a contigs sequence set for each, wherein K = 75-500, L = 1000-10000, and N = 1-L;
S3, aligning the simulated reads to each contigs sequence set to obtain the following parameters for each position of each contig: whether the base at the position is A, T, C, or G; the numbers of reads whose detected genotype at the position is A, T, C, and G; the read coverage depth at the position; and the number of discordant reads at the position, thereby obtaining an L × 11 matrix as the feature value, wherein the feature values of all contigs in the contigs sequence set of an assembled genome sequence form the feature data set of that assembled genome sequence;
S4, constructing a machine learning model using the feature data sets of all the assembled genome sequences and the information on whether each is misclassified.
2. The method of claim 1, wherein in step S1, the assembled genome sequences are assembled from high-throughput sequencing data; in step S2, paired-end sequencing reads, namely reads1 and reads2, are randomly generated for each assembled genome sequence, the reads1 of all the assembled genome sequences are merged into simulated reads1, and the reads2 of all the assembled genome sequences are merged into simulated reads2; in step S3, the paired-end reads are aligned to each contigs sequence set in paired-end alignment mode.
3. The method according to claim 1, characterized in that in step S1, whether a classification is correct or erroneous is evaluated using MetaQUAST.
4. The method of claim 1, wherein the machine learning model is constructed based on a neural network algorithm.
5. The method of claim 4, wherein the machine learning model is constructed based on a convolutional neural network algorithm.
6. A method for identifying classification errors in assembled genome sequences of species, characterized in that a feature data set of each assembled genome sequence to be identified is constructed by steps S2 to S3 of claim 1 and is input into the machine learning model constructed according to claim 1, so as to determine whether the classification of each assembled genome sequence to be identified is wrong.
7. A system for identifying classification errors in assembled genome sequences of species, comprising the following modules:
a data input module for obtaining a feature data set of each assembled genome sequence to be identified;
a data storage module for storing the feature data sets of the assembled genome sequences of a plurality of reference species and the information on whether their classification is wrong;
an error identification module, connected to the data input module and to the data storage module, for constructing a machine learning model from the stored feature data sets of the assembled genome sequences and the information on whether their classification is wrong, inputting the feature data set of each assembled genome sequence to be identified into the model, and determining whether the classification of each assembled genome sequence to be identified is wrong;
a result output module, connected to the error identification module, for outputting the identification result;
wherein the feature data set is constructed according to steps S2 to S3 of claim 1.
8. The system of claim 7, wherein the result output module is further connected to the data storage module, for storing the feature data set and the identification result of each assembled genome sequence to be identified in the data storage module.
9. A computer device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the method according to any one of claims 1 to 5 when executing said computer program.
10. A computer-readable storage medium, characterized in that
the computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the method according to any one of claims 1 to 5.
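As an illustration of the per-position feature construction recited in step S3 of claim 1, the following is a minimal sketch and not the patented implementation. It assumes the alignment has already been summarized into per-position base counts and per-position counts of inconsistent reads; the function names, the input layout, and the column order are hypothetical. Coverage depth is taken here as the sum of the A, T, C, and G read counts, which is also an assumption. The parameters listed in the claim fill ten columns under this reading; the text does not say what the eleventh column holds, so the sketch leaves it as a zero placeholder rather than guessing.

```python
BASES = "ATCG"
N_FEATURES = 11  # column count stated in claim 1, step S3


def build_feature_matrix(contig, base_counts, inconsistent_reads):
    """Return an L x 11 feature matrix (a list of rows) for one contig.

    contig             -- contig sequence of length L
    base_counts        -- per position, a dict mapping a base to the number
                          of aligned reads calling it (hypothetical input)
    inconsistent_reads -- per position, the number of reads flagged as
                          inconsistent (hypothetical input)
    """
    if not (len(contig) == len(base_counts) == len(inconsistent_reads)):
        raise ValueError("all inputs must describe the same L positions")

    matrix = []
    for ref_base, counts, n_inconsistent in zip(
        contig.upper(), base_counts, inconsistent_reads
    ):
        # Columns 0-3: whether the contig base is A, T, C, or G.
        one_hot = [1 if ref_base == b else 0 for b in BASES]
        # Columns 4-7: number of reads detected as A, T, C, and G.
        read_counts = [counts.get(b, 0) for b in BASES]
        # Column 8: coverage depth (assumed: sum of the four counts).
        depth = sum(read_counts)
        # Column 9: inconsistent reads. Column 10: unspecified placeholder.
        row = one_hot + read_counts + [depth, n_inconsistent, 0]
        assert len(row) == N_FEATURES
        matrix.append(row)
    return matrix


def build_feature_data_set(contigs):
    """Feature data set of one assembled genome: one matrix per contig.

    contigs -- iterable of (contig, base_counts, inconsistent_reads) tuples
    """
    return [build_feature_matrix(*c) for c in contigs]


# Toy two-base contig with made-up counts, for illustration only.
example = build_feature_matrix(
    "AC",
    [{"A": 5, "G": 1}, {"C": 4}],
    [0, 2],
)
```

Each contig yields one matrix with as many rows as the contig has positions, and the list returned by `build_feature_data_set` corresponds to the feature data set of one assembled genome sequence as described in the claim.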
CN202211537778.7A 2022-12-01 2022-12-01 Method and system for identifying genome sequence classification errors based on machine learning Active CN115910216B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211537778.7A CN115910216B (en) 2022-12-01 2022-12-01 Method and system for identifying genome sequence classification errors based on machine learning

Publications (2)

Publication Number Publication Date
CN115910216A (en) 2023-04-04
CN115910216B (en) 2023-07-25

Family

ID=86481137

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211537778.7A Active CN115910216B (en) 2022-12-01 2022-12-01 Method and system for identifying genome sequence classification errors based on machine learning

Country Status (1)

Country Link
CN (1) CN115910216B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106682454A (en) * 2016-12-29 2017-05-17 中国科学院深圳先进技术研究院 Method and device for data classification of metagenome
CN108460248A (en) * 2018-03-08 2018-08-28 北京希望组生物科技有限公司 A method for detecting long tandem repeat sequences based on the Bionano platform
CN114155914A (en) * 2021-12-01 2022-03-08 复旦大学 Detection and correction system based on metagenome splicing error
CN114333987A (en) * 2021-12-30 2022-04-12 天津金匙医学科技有限公司 Metagenome sequencing-based data analysis method for predicting drug resistance phenotype
US11514289B1 (en) * 2016-03-09 2022-11-29 Freenome Holdings, Inc. Generating machine learning models using genetic data

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JAKOB NYBO NISSEN et al.: "Binning microbial genomes using deep learning", 《BIORXIV》, pages 1-21 *
郭茂祖 et al.: "Learning problems in bioinformatics", 《山东大学学报(工学版)》, vol. 39, no. 3, pages 1-6 *
陈波 et al.: "Research on feature extraction and dimensionality reduction in the metagenome classification problem", 《计算机系统应用》, vol. 24, no. 11, pages 31-37 *

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant