CN117976042A - Method for determining read quality score, sequencing method and sequencing device - Google Patents

Method for determining read quality score, sequencing method and sequencing device

Info

Publication number
CN117976042A
CN117976042A (application CN202311865546.9A)
Authority
CN
China
Prior art keywords
read
machine learning
training
learning model
reading
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311865546.9A
Other languages
Chinese (zh)
Inventor
张艳华
陈巍月
万新春
金欢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Genemind Biosciences Co Ltd
Original Assignee
Genemind Biosciences Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Genemind Biosciences Co Ltd filed Critical Genemind Biosciences Co Ltd
Priority to CN202311865546.9A priority Critical patent/CN117976042A/en
Publication of CN117976042A publication Critical patent/CN117976042A/en
Pending legal-status Critical Current


Abstract

The application provides a method for determining a read quality score, a sequencing method and a sequencing device. The method for determining a read quality score comprises: acquiring the values of features of a read, wherein the read is obtained by sequencing-by-synthesis; and inputting the values of the features into a trained machine learning model to obtain the quality score of the read, wherein the trained machine learning model is a quantization scheme that associates the features of a read with the probability that the read is classified as a read of a specified category, the machine learning model is a decision tree, and the quality score of the read is positively correlated with the probability that the read is classified as a read of the specified category. With this method, read quality scores can be determined effectively and sequence quality can be estimated. Screening sequencing data by quality score reduces the influence of high-error-rate sequences and contaminating sequences on subsequent target detection based on analysis of the sequencing data.

Description

Method for determining read quality score, sequencing method and sequencing device
Technical Field
The application belongs to the field of data processing, in particular to the field of gene sequencing data processing, and more particularly relates to a method for determining a read quality score, a sequencing method, a computing device and a computer-readable storage medium.
Background
In the related art, a machine learning model is a computational model or algorithm that can automatically perform tasks such as prediction, classification, recognition or decision-making by learning from and analyzing input data. Generally, the learning process of such a model is based on statistical principles and pattern recognition in data, using training data sets for parameter adjustment and model optimization to improve its predictive or inferential capability. A machine learning model may employ various algorithms and techniques, such as neural networks, support vector machines, decision trees, random forests, deep learning and the like. These models may be trained and optimized by means of supervised learning, unsupervised learning or reinforcement learning. In practical applications, machine learning models are used in fields such as natural language processing, image recognition, pattern recognition, data mining, recommendation systems and predictive analysis, and have important application potential in processing large-scale data, automated decision-making and intelligent systems.
In the sequencing-related art, second-generation gene sequencing, also called high-throughput sequencing or massively parallel sequencing, widely uses Q20, Q30 and even Q40 to evaluate the quality of sequencing sequences. Such quality scores reflect the error probability and reliability of a sequencing sequence (a Phred score of Q corresponds to an error probability of 10^(-Q/10), so Q20 and Q30 correspond to error probabilities of 1% and 0.1%, respectively) and can be used to distinguish data fluctuations caused by sequencing errors from data fluctuations caused by the biological sample (Mbandi, S.K., et al. (2014). A glance at quality score: implication for de novo transcriptome reconstruction of Illumina reads. Frontiers in genetics, 5, 17; Garber, M., et al. (2011). Computational methods for transcriptome annotation and quantification using RNA-seq. Nature methods, 8(6), 469-477; Kelley, D.R., et al. (2010). Quake: quality-aware detection and correction of sequencing errors. Genome biology, 11(11), R116; US8965076B2; US2018051329A1).
Different sequencing principles and sequencing platforms often have their own adapted or applicable schemes for quantitatively evaluating sequencing-sequence quality, and a more reliable and/or more universal quality evaluation method remains of interest. Of particular interest is a quality evaluation method suited to sequencing data from platforms outside the current mainstream, for example quantitative quality evaluation applicable to nanopore sequencing or single-molecule sequencing-by-synthesis, or a quantitative evaluation scheme applicable simultaneously to multiple platforms such as second-generation, third-generation and fourth-generation sequencing and single-molecule sequencing-by-synthesis.
Disclosure of Invention
The present invention aims to solve, at least to some extent, one of the above technical problems, or at least to provide a useful commercial alternative. To this end, it is an object of the present invention to provide a means to effectively determine the quality scores of sequencing reads produced by gene sequencing.
The present application was made based on the following findings and experimental tests by the inventors:
In second-generation gene sequencing technology, amplification is generally involved to amplify the target signal, and parameters such as Q30 and Q20 are generally used to evaluate the quality of a sequencing sequence, reflecting its error probability and reliability and helping to distinguish data fluctuations caused by detection errors from data fluctuations caused by biological samples (Mbandi, S.K., et al. (2014). A glance at quality score: implication for de novo transcriptome reconstruction of Illumina reads. Frontiers in genetics, 5, 17; Garber, M., et al. (2011). Computational methods for transcriptome annotation and quantification using RNA-seq. Nature methods, 8(6), 469-477; Kelley, D.R., et al. (2010). Quake: quality-aware detection and correction of sequencing errors. Genome biology, 11(11), R116; US8,965,076; US2018051329A1). However, when this quality evaluation scheme and quality score calculation method are applied to quantitatively evaluate the quality of sequencing data generated by the single-molecule sequencing platforms of companies such as Pacific Biosciences, Oxford Nanopore and GeneMind Biosciences, they are not suitable for, or cannot truly reflect, the quality of the sequencing sequences of these platforms; and, to the inventors' knowledge, no published evaluation method currently exists for sequencing data from single-molecule sequencing.
Single-molecule detection, such as single-molecule sequencing, generally does not involve processes such as cloning or amplification to amplify a signal of interest. A target signal detected at the single-molecule level is easily disturbed by noise and spurious signals and is easily lost or difficult to detect or identify; the low signal-to-noise ratio is the central difficulty of single-molecule detection. As a result, sequencing data generated by single-molecule sequencing often contain impurity, interfering or noisy sequences that have a high error rate or are not even of sample origin. These high-error-rate and/or non-sample-origin sequences can take up significant data processing or analysis resources, for example increasing the time required for sequence alignment, and even with a better or better-adapted reference sequence, it is often difficult, relying on alignment alone, to distinguish data fluctuations due to sequencing errors from data fluctuations due to the characteristics of the biological/test sample. For example, some of these sequences may align randomly to a reference sequence such as the human reference genome; for a complex sample involving sequencing of different species sources, such as detecting pathogens in a human sample, such random alignments to the human reference genome or to a pathogen reference genome may adversely affect or interfere with the resolution and subsequent analysis of the sequencing sequences of the mixed sample. Therefore, it is necessary to design an adapted, or more general and practical, quality score calculation model based on the characteristics of the sequencing data.
In view of this, embodiments of the present application provide a method of predicting the probability that a read will be classified as a read of a specified category, comprising: acquiring the values of features of a read, wherein the read is obtained by sequencing-by-synthesis; and inputting the values of the features into a trained machine learning model so as to predict the probability that the read is classified as a read of the specified category, wherein the trained machine learning model is a quantization scheme that associates the features of a read with the probability that the read is classified as a read of the specified category, and the machine learning model is a decision tree.
Embodiments of the present application further provide an apparatus for predicting the probability that a read will be classified as a read of a specified category, configured to implement part or all of the steps of the method of predicting that probability according to the embodiments of the present application, the apparatus comprising: a feature value acquisition unit for acquiring the values of features of a read, wherein the read is obtained by sequencing-by-synthesis; and a generation unit for inputting the values of the features from the feature value acquisition unit into a trained machine learning model so as to predict the probability that the read is classified as a read of the specified category, wherein the trained machine learning model is a quantization scheme that associates the features with the probability that the corresponding read is classified as a read of the specified category, and the machine learning model is a decision tree.
Embodiments of the present application provide a method for determining a read quality score, comprising: acquiring the values of features of a read, wherein the read is obtained by sequencing-by-synthesis; and inputting the values of the features into a trained machine learning model to obtain the quality score of the read, wherein the trained machine learning model is a quantization scheme that associates the features of a read with the probability that the read is classified as a read of a specified category, the machine learning model is a decision tree, and the quality score of the read is positively correlated with the probability that the read is classified as a read of the specified category.
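As an aid to understanding, the following is a minimal illustrative sketch, in Python, of what such a scoring step might look like. It assumes an sklearn-style classifier whose predict_proba output gives the probability of the specified category; the function name, feature handling and the Phred-like mapping from probability to score are assumptions for demonstration, since the patent does not prescribe a specific score transform.

import numpy as np

def read_quality_score(model, feature_values, max_q=40.0):
    # Predicted probability that the read belongs to the specified category
    # (e.g. "alignable to the chosen reference"); column 1 is assumed to be
    # the positive class of an sklearn-style binary classifier.
    p = model.predict_proba(np.asarray(feature_values).reshape(1, -1))[0, 1]
    # One possible monotone mapping: a Phred-like transform of the complement
    # probability, capped so that p == 1 does not produce infinity. Any mapping
    # that increases with p satisfies the positive-correlation requirement.
    return -10.0 * np.log10(max(1.0 - p, 10.0 ** (-max_q / 10.0)))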
Embodiments of the present application also provide an apparatus for determining a read quality score, configured to implement part or all of the steps of the method for determining a read quality score according to the embodiments of the present application, the apparatus comprising: a feature value acquisition unit for acquiring the values of features of a read, wherein the read is obtained by sequencing-by-synthesis; and a generation unit, connected to the feature value acquisition unit, for inputting the values of the features into a trained machine learning model so as to obtain the quality score of the read, wherein the trained machine learning model is a quantization scheme that associates the features with the probability that the read is classified as a read of the specified category, the machine learning model is a decision tree, and the quality score of the read is positively correlated with the probability that the read is classified as a read of the specified category.
Embodiments of the present application also provide a computing device comprising a memory for storing data, including a computer-executable program; a processor for executing the computer-executable program, the execution of the computer-executable program comprising performing the method of predicting the probability that a read will be classified as a read of a specified class or the method of determining the read quality score in any of the implementations or embodiments described above.
An embodiment of the present application also provides a computer-readable storage medium storing a program for execution by a computer, execution of the program comprising completing the method of predicting the probability that a read is classified as a read of a specified category, or the method of determining a read quality score, of any of the above embodiments or examples. The computer-readable storage medium may include read-only memory, random access memory, magnetic or optical disks, and the like.
Embodiments of the present application also provide a method of training a machine learning model, comprising: acquiring the values of M initial features of training reads, wherein the training reads are obtained by sequencing-by-synthesis and have known classification results, the training reads comprise a plurality of reads belonging to a specified category and a plurality of reads not belonging to the specified category, and M is an integer greater than or equal to 5; and inputting the values of the initial features into a machine learning model and training the machine learning model with the known classification results as labels to obtain a trained machine learning model for determining read quality scores, wherein the machine learning model is a decision tree, and the read quality score determined by the trained machine learning model is positively correlated with the probability that the read is classified as a read of the specified category.
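A minimal training sketch, in Python, under stated assumptions: the M feature values are precomputed per training read, labels are 1 for reads of the specified category (e.g. reads that align to the chosen reference) and 0 otherwise, and a CART classifier from scikit-learn stands in for the decision tree; the patent leaves the concrete feature set, tree variant (single tree or ensemble) and hyperparameters open.

from sklearn.tree import DecisionTreeClassifier

def train_read_classifier(X, y, max_depth=8, min_samples_leaf=50):
    # X: (n_reads, M) array of M >= 5 initial feature values per training read.
    # y: known classification labels (1 = specified category, 0 = not).
    model = DecisionTreeClassifier(max_depth=max_depth,
                                   min_samples_leaf=min_samples_leaf,
                                   random_state=0)
    model.fit(X, y)
    return model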
Embodiments of the present application also provide an apparatus for training a machine learning model, configured to implement part or all of the steps of the method of training a machine learning model according to the embodiments of the present application, the apparatus comprising: an initial feature value acquisition unit for acquiring the values of M initial features of training reads, wherein the training reads have known classification results and comprise a plurality of reads belonging to a specified category and a plurality of reads not belonging to the specified category, and M is an integer greater than or equal to 5; and a training unit for inputting the values of the initial features from the initial feature value acquisition unit into a machine learning model and training the machine learning model with the known classification results as labels to obtain a trained machine learning model for determining read quality scores, wherein the machine learning model is a decision tree, and the read quality score determined by the trained machine learning model is positively correlated with the probability that the read is classified as a read of the specified category.
An embodiment of the present application also provides a computing device, including a memory for storing data, including storing a trained machine learning model obtained by the method of training a machine learning model of the above embodiment; and a processor for executing a computer executable program to determine a probability of a read being classified as a specified category of read or a quality score of the read using the trained machine learning model.
The embodiment of the present application also provides a computer-readable storage medium for storing a trained machine learning model obtained by the method of training a machine learning model of the above embodiment.
Embodiments of the present application also provide a sequencing method, comprising performing sequencing-by-synthesis on a nucleic acid molecule to be tested; determining the quality scores of reads generated by the sequencing using the method for determining a read quality score of the above embodiments, or using a trained machine learning model obtained by the method of training a machine learning model of the above embodiments; and performing sequence analysis on reads whose quality score is above a predetermined threshold.
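The filtering step described above might look like the following hedged Python sketch; the function name, threshold value and score scale are illustrative rather than taken from the patent.

def filter_reads_by_quality(reads, scores, threshold=20.0):
    # reads: iterable of read records; scores: matching iterable of quality
    # scores. Only reads at or above the threshold proceed to sequence analysis.
    return [r for r, q in zip(reads, scores) if q >= threshold]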
Finally, the embodiment of the application also provides a sequencing system, which is used for implementing the sequencing method of the embodiment. In some embodiments, the sequencing system comprises a computing device of any of the embodiments described above or a computer-readable storage medium of any of the embodiments.
Based on sequencing data sets with the same known classification results, the inventors tested multiple models and algorithms and found that a trained decision tree model best distinguishes reads of the specified category from reads not of the specified category in those data sets. The methods, apparatuses and systems of any of the foregoing embodiments or examples input the values of a plurality of specific features of a read into the trained decision tree, quickly obtaining the quality score of the read as output and thereby effective and reliable quantized quality information for the read. The quality score reflects the probability that the read is classified as a read of the specified category, can effectively and reliably reflect the quality of the read, and helps distinguish read-quality degradation caused by systematic and sequencing errors from relatively low read quality caused by specific characteristics of the sample under test. Based on the quantized quality score carried by a read, it can be decided directly whether to retain the read or let it enter the subsequent processing and analysis flow, without performing additional processing and analysis, such as alignment, to obtain a related result; for example, reads with quality scores below a predetermined threshold are filtered out, which improves data processing and detection efficiency while also improving the accuracy of detection results.
Additional aspects and advantages of embodiments of the application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of embodiments of the application.
Drawings
The foregoing and/or additional aspects and advantages of embodiments of the application will become apparent and may be readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1A is a flow chart of a method of predicting the probability that a read will be classified as a specified class of reads in accordance with an embodiment of the present application;
FIG. 1B is a flow chart of a method of determining a read quality score according to an embodiment of the present application;
FIG. 2 is a schematic representation of a model ROC curve of an embodiment of the application;
FIG. 3 is a schematic diagram of feature importance ranking obtained by an embodiment of the present application;
FIG. 4 is a schematic diagram of the alignment success rate and the proportions of sequences filtered out and lost for the test set data under different quality-score filtering thresholds according to an embodiment of the present application;
FIG. 5 is a schematic diagram of the alignment success rate and the proportions of sequences filtered out and lost corresponding to quality scores below different filtering thresholds after mapping, according to an embodiment of the present application;
FIG. 6 is a schematic diagram of the distribution of quality score values in a test set for sequences that actually aligned successfully and sequences that actually failed to align, according to an embodiment of the present application;
FIG. 7 is a schematic diagram of the correlation between the quality scores of the test set and the average error rate of the uniquely aligned sequences therein, according to an embodiment of the present application;
FIG. 8A is a schematic diagram of an apparatus for predicting the probability that a read will be classified as a specified class of reads in accordance with an embodiment of the application;
FIG. 8B is a schematic diagram of an apparatus for determining a read quality score according to an embodiment of the application;
FIG. 9 is a flow chart diagram of a method of training a machine learning model according to an embodiment of the present application;
FIG. 10 is a schematic flow chart of a sequencing method according to an embodiment of the present application;
FIG. 11 is a diagram showing the filtering effect on data from 32 NIPT samples according to an embodiment of the present application;
FIG. 12 is a diagram showing the distribution of quality score values for sequences that actually aligned successfully and sequences that did not, across the 32 NIPT samples, according to an embodiment of the present application;
FIG. 13 is a graph showing the correlation between the quality scores of the 32 NIPT samples and the average error rates of the corresponding uniquely aligned sequences according to an embodiment of the present application;
FIG. 14 is a graph showing the data filtering effect for 16 E. coli samples according to an embodiment of the present application;
FIG. 15 is a diagram showing the distribution of quality score values for sequences that actually aligned successfully and sequences that did not, across the 16 E. coli samples, according to an embodiment of the present application;
FIG. 16 is a graph showing the correlation between the quality scores of the 16 E. coli samples and the average error rates of the corresponding uniquely aligned sequences according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings. The embodiments described below by referring to the drawings are illustrative and intended to explain the present application and should not be construed as limiting the application.
As used herein, the singular forms "a", "an", and "the" include plural referents unless the context clearly dictates otherwise. "set" or "plurality" refers to two or more.
In this document, "comprising" or "including" is an open-ended expression, including what is indicated or exemplified hereafter, and also includes what is applicable or consistent with the stated situation, but not specifically recited.
As used herein, "sequencing" refers to nucleic acid sequencing, and as "nucleic acid sequencing" or "gene sequencing" refers to determining the order of the primary structural bases of a nucleic acid molecule, and can be accomplished using sequencing by synthesis (sequencing by synthesis Sequencing by Synthesis, SBS), sequencing by ligation (sequencing by ligation, SBL), sequencing By Hybridization (SBH), or the like. As used herein, sequencing-by-synthesis, unless otherwise indicated, includes, in addition to the commonly understood SBS (typically ILLUMINA/Solexa technology) which uses a polymerase to catalyze the incorporation of nucleotides into a nucleic acid molecule to be tested (synthesis reaction) and the detection of corresponding reaction signals to identify the type of nucleotide incorporated, sequencing which is similar to SBS, uses a polymerase or non-polymerase to controllably introduce or ligate nucleotides to a nucleic acid molecule to be tested, and directly or indirectly detects corresponding signals to determine the type of nucleotide ligated.
Sequencing may include DNA sequencing and/or RNA sequencing, and long-fragment and/or short-fragment sequencing, where "long" and "short" are relative: for example, nucleic acid molecules longer than 1 Kb, 2 Kb, 5 Kb or 10 Kb may be called long fragments, and those shorter than 1 Kb or 800 bp may be called short fragments. Sequencing may include paired-end sequencing, single-end sequencing and the like, where paired-end (double-ended) sequencing may refer to the readout of any two segments or portions of the same nucleic acid molecule that do not completely overlap.
Sequencing may be performed on sequencing platforms. According to embodiments of the present application, alternative sequencing platforms include, but are not limited to, the HiSeq, MiSeq, NextSeq and NovaSeq platforms from Illumina, the Ion Torrent platform from Thermo Fisher/Life Technologies, the BGISEQ and MGISEQ/DNBSEQ platforms from BGI, and single-molecule sequencing platforms; the sequencing mode may be single-end sequencing, paired-end sequencing, or whatever sequencing mode the chosen platform supports.
The sequence read out by sequencing is called a sequencing sequence, also called a read; the length of a sequencing sequence or read is called the read length.
In certain examples, sequencing-by-synthesis with multiple rounds of sequencing is used to obtain a sequencing sequence or read. For example, a nucleic acid molecule to be tested is contacted with a polymerase and modified nucleotides under conditions suitable for polymerization, a modified nucleotide is controllably incorporated into the molecule (controllable single-base extension), and the corresponding reaction signal is detected; based on that signal, the type of nucleotide incorporated into the molecule by the reaction is determined. Multiple controllable single-base extensions and corresponding signal detections are performed, so that the types of nucleotides or bases incorporated into the molecule over multiple reactions or rounds are identified from the reaction signal information, thereby determining a portion of the sequence of the nucleic acid molecule to be tested.
The nucleic acid molecule to be tested, also called the template or template under test, can be an unamplified single molecule, or an amplified molecular cluster or long chain containing multiple identical polynucleotide molecules, such as the clone clusters formed by bridge amplification or the DNA nanoballs (DNBs) formed by rolling circle amplification adopted by mainstream sequencing platforms. The nucleic acid molecule to be tested may be in the form of a single strand, a double strand and/or a complex hybridized with a probe or primer.
The corresponding reaction signal may be, for example, a fluorescent signal or may be converted into image data obtained by collecting the fluorescent signals, whereby the image data is processed and analyzed to detect the nucleotides incorporated into the nucleic acid molecule to be detected in each reaction or round, so as to determine a part of the base sequence of the nucleic acid molecule to be detected.
Specifically, in some examples, sequencing is accomplished based on surface fluorescence imaging detection, with the nucleic acid molecule to be tested attached to a solid-phase surface. For example, nucleotides can be modified to carry or bind fluorescent labels and to carry cleavable inhibitor groups that prevent further nucleotides from being attached at the next position of the molecule during polymerization (such modified nucleotides are also referred to as reversible terminators). After each polymerization or single-base extension reaction is completed, the fluorescent labels are excited to emit light, and the emitted signals are collected to obtain an image of the nucleic acid molecules at designated surface positions where single-base extension occurred; the inhibitor group, fluorescent label, etc. are then removed for the next polymerization reaction or the next round of signal collection (imaging). Polymerization-imaging-cleavage is thus repeated multiple times to obtain a set of images recording, for each single-base extension reaction, the nucleotides attached to the nucleic acid molecules to be tested.
It will be appreciated that a nucleic acid molecule at a given surface position that undergoes a polymerization reaction fluoresces, typically appearing as a bright spot (spot) above the background signal intensity at the corresponding location in that round's captured image. Therefore, from the bright-spot information corresponding to the specific chemical events (nucleic acid molecules undergoing polymerization) in the image set, it can be judged whether the molecule at a designated position underwent polymerization, and, by combining the preset correspondence between distinguishable fluorescent signals and nucleotide types, the type of nucleotide attached to the molecule through polymerization can be detected, so that at least a part of the sequence of the molecule is determined and the read is obtained.
The term "nucleotide" as used herein includes ribonucleic acid or deoxyribonucleic acid, and includes natural nucleotides or derivatives thereof or modifications thereof (also referred to as modified nucleotides or modified nucleotides). Such nucleotides are sometimes referred to herein as bases comprised by the nucleotides, as will be apparent to those of ordinary skill in the art from conventional wisdom and/or context.
The so-called round of sequencing may include a single base extension reaction (a repeat). For example, four different nucleotides (dATP, dTTP, dGTP and dCTP) may be placed in the same polymerization reaction system as a plurality of test nucleic acid molecules, such that each nucleotide may be excited to a signal distinguishable from the other types of nucleotides, whereby the information obtained by one repetition of the reaction determines the type of nucleotide incorporated or introduced at one position of the plurality of test molecules. For example, four nucleotides are respectively provided with fluorescent markers of four different light emitting wavebands, and four-color or four-channel single-molecule sequencing or high-throughput sequencing is performed.
The so-called one-round sequencing may also include multiple repeats. For example, four nucleotides can be contacted with a plurality of nucleic acid molecules to be detected on the surface in sequence, and respectively base extension and corresponding collection of reaction signals such as photographing are carried out, and one round of sequencing comprises four base extensions; for another example, any combination of four nucleotides is contacted with a plurality of nucleic acid molecules to be tested on the surface, e.g., two-by-two or one-by-three, the two combinations being respectively base extended and the corresponding reaction signals being collected, e.g., photographed, and a round of sequencing comprising two base extensions. In some embodiments, a repeat is also sometimes referred to as a round, as will be appreciated by those skilled in the art based on the context.
Referring to FIG. 1A, according to an embodiment of the present application there is provided a method for predicting the probability that a read will be classified as a read of a specified category, comprising: S10, acquiring the values of features of a read, wherein the read is obtained by sequencing-by-synthesis; and S20, inputting the values of the features into a trained machine learning model so as to predict the probability that the read is classified as a read of the specified category, wherein the trained machine learning model is a quantization scheme that associates the features of a read with the probability that the read is classified as a read of the specified category, and the machine learning model is a decision tree.
Referring to FIG. 1B, according to an embodiment of the present application there is provided a method for determining a read quality score, comprising: S10, acquiring the values of features of a read, wherein the read is obtained by sequencing-by-synthesis; and S30, inputting the values of the features into a trained machine learning model to obtain the quality score of the read, wherein the trained machine learning model is a quantization scheme that associates the features of a read with the probability that the read is classified as a read of the specified category, the machine learning model is a decision tree, and the quality score of the read is positively correlated with the probability that the read is classified as a read of the specified category.
By using the method of any of the above embodiments or examples, the values of the specific features of a read are input into the trained decision tree, so that the probability that the read is classified as a read of the specified category can be determined quickly and/or the quality score of the read can be obtained, reliably predicting the classification result of the read or yielding reliable quantized quality information. The read quality score reflects the probability that the read is classified as a read of the specified category, can effectively and reliably reflect the sequencing quality of the read, and helps distinguish read-quality degradation caused by systematic and sequencing errors from a relatively low quality level caused by specific characteristics of the sample under test. Based on the quantized quality score of a read, it can be decided directly whether to retain the read or let it enter the subsequent processing and analysis flow, without performing additional processing and analysis, such as alignment, to obtain a related result; for example, reads with quality scores below a predetermined threshold are filtered out, which improves data processing and detection efficiency while also improving the accuracy of detection results.
For example, when a mixed sequencing sequence of multiple samples is obtained in one sequencing run, filtering out reads with quality scores below a predetermined threshold facilitates accurately splitting the sequencing data into the corresponding samples and accurately detecting each sample based on the split data. For example, for a complex sample involving sequencing of different species sources, such as detecting pathogens in a human sample based on sequencing data, removing reads with a high probability of aligning to the human reference sequence and/or reads with a low probability of aligning to the reference sequence of a target pathogen, based on the quantized quality scores carried by the reads, can greatly improve the efficiency of subsequent sequence analysis and the accuracy of pathogen detection results, and effectively reduce the adverse effect of human-origin or low-quality pathogen-origin reads on the detection results.
In some examples, the features of a read are a plurality of quantifiable characteristics, generated during a sequencing run, that relate to whether the read belongs to the specified category. The features are quantifiable characteristics which, through machine learning model training, can be fitted or related to the probability of the read belonging to the specified category.
In some examples, whether a given read belongs to the specified category has only two possible conclusions, yes or no. Thus, it will be appreciated that the trained machine learning model herein is essentially a classifier that determines the probability that a read will be classified as a read of the specified category based on the values of the read's specific features. It will also be appreciated that a trained machine learning model that associates the feature values of a read with the probability of the read being classified as a read of the specified category, and one that associates them with the probability of the read being classified as a read of a non-specified category, are technically equivalent.
In some examples, a specified-category read refers to a read that can, or cannot, be aligned to a specified reference sequence. In other examples, a specified-category read is a read that can be aligned to a specified reference sequence, and a non-specified-category read is one that cannot. In the same application scenario, for sequencing data obtained from the same or different samples and from the same or different sequencing platforms, "alignable" and "non-alignable" are relative alignment results, typically determined against the same reference sequence and following the same judgment rules.
The decision tree is a choice made by the inventors after training multiple models and algorithms on multiple public sequencing data sets with known classification results and comparing how well the trained models distinguish the alignable from the non-alignable reads in those data sets; the inventors found that the trained decision tree model predicted alignable and non-alignable reads in the multiple public sequencing data sets with better accuracy.
A decision tree is a tree built by successive decisions, with entropy decreasing (uncertainty decreasing) along the direction of the tree; training a decision tree typically partitions all samples into their corresponding leaves through simple if/else rules. Decision tree models can be divided according to their splitting criterion: maximum information gain, maximum information gain ratio, or the Gini coefficient, with the ID3 decision tree, the C4.5 decision tree and the CART (Classification And Regression Tree) decision tree being the typical representatives of these three criteria, respectively. Ensemble tree models fall into two main classes: one assumes no strong dependency between learners, which can therefore be generated simultaneously or in parallel, mainly represented by Bagging and Random Forest; the other assumes strong dependency between learners, which must therefore be generated serially, represented by Boosting. To be clear, the decision tree referred to in embodiments of the present application is a broad concept, including simple and complex decision trees, ensemble and non-ensemble tree models, and machine learners that include or are based on decision trees.
The term alignment refers to sequence alignment, including the process of locating one or more sequences onto one or more other sequences and the positioning results obtained, for example the process of locating reads onto a reference sequence and the read positioning/matching results obtained. The alignment may be local or global, and may be exact (non-fault-tolerant) or fault-tolerant, for example alignment in which the mismatch ratio ε is 0 (exact matching) or no more than 0.05 (no more than 5 mismatched bases per 100 bp).
The reference sequence (ref) or reference chromosome sequence is a predetermined sequence: it may be a DNA and/or RNA sequence assembled from one's own prior sequencing, or a published DNA and/or RNA sequence determined by others, and may be any pre-obtained reference template for the species of the individual from which the sample originates (the target individual), for example all or at least part of a published genome assembly of the same species. If the sample source or target individual is human, the genomic reference sequence (also called the reference genome or reference chromosome) may be selected from the human reference genomes provided by the UCSC, NCBI or ENSEMBL databases, such as HG19, HG38, GRCh36, GRCh37, GRCh38, etc.; the correspondence between reference genome versions can be found in the databases' documentation, and the version used can be chosen accordingly. Furthermore, a resource library containing more reference sequences can be pre-configured; for example, before alignment, sequences closer to the target individual or having certain characteristics can be selected or assembled as the reference sequence according to factors such as the target individual's sex, ethnicity and region, so that more accurate sequence analysis results can be obtained later.
The reference sequence can be constructed when the target sample is detected, or can be pre-constructed and stored and called when the prepared sample is detected. In certain embodiments, the test sample is from a human, and the reference sequence is a human reference genome or a human autosomal group.
It will be appreciated that binary classification of a read involves reads belonging to the specified category and reads not belonging to it; under established rules or criteria these two classes generally do not overlap, and a given read either belongs to (or is classified into) the specified category or does not. For example, once the reference sequence is determined, a read either can or cannot be aligned to it; "aligned to" or "alignable" may be defined, for instance, as a fault-tolerant match in which the proportion of mismatched bases is no more than 0.02, in which case the alignment is considered successful.
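A trivial Python sketch of such a decision rule; the function name is hypothetical, and the 0.02 cutoff is the example value given above rather than a fixed parameter of the patent.

def alignment_successful(mismatches, aligned_length, epsilon=0.02):
    # Fault-tolerant matching: the alignment counts as successful when the
    # proportion of mismatched bases does not exceed epsilon.
    return (mismatches / aligned_length) <= epsilon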
In other examples, reads belonging to or classified as reads of a specified class are sequences having specific information, e.g., reads containing a specific sequence, reads containing a specific mutation site, etc.
In some examples, reads are obtained using a means or platform performing SBS sequencing based on surface imaging detection, and one read includes read-round data, reflecting the presence of a base extension signal on the template (the nucleic acid molecule under test) corresponding to a designated location in the image set, and blank-round data, reflecting the absence of such a signal at that location.
"Presence or absence" may mean presence or absence in the physical sense, the occurrence or non-occurrence of base extension, or the ability or inability (or difficulty) of a base extension signal to be recognized or detected by means conventional in the art. Unless otherwise specified, read-round data and blank-round data herein refer to the latter: the relative detection of a specific base versus the non-detection of a specific base.
In some examples, a round of sequencing includes a process of base extension by placing a template in a solution containing one or more nucleotides and collecting corresponding images to identify base extension signals, the round of sequencing manifesting on a read as one or more data bits that increase in number corresponding to the number of nucleotide species in the solution in which the template is placed.
Specifically, one round of sequencing can be achieved by any one of the following (1) to (4): (1) first contacting the template with a solution containing two of the four nucleotides to perform single-base extension and collect corresponding images, and then contacting the template with a solution containing the other two of the four nucleotides to perform single-base extension and collect corresponding images, as in so-called single-molecule two-color sequencing (two bases sharing the same fluorescent label); (2) contacting the template with a solution containing all four nucleotides for single-base extension and collecting corresponding images, as in four-color sequencing (each base extension corresponds to a unique fluorescence emission band), two-color sequencing in which the extension signals of the four bases remain distinguishable (two bases with two respective fluorescent labels, a third base with both labels, and a fourth base without a label), or three-color sequencing (three bases with three respective fluorescent labels and a fourth base without a label); (3) first contacting the template with a solution containing three of the four nucleotides to perform single-base extension and collect corresponding images, and then contacting the template with a solution containing the remaining nucleotide to perform single-base extension and collect corresponding images; for example, the first three of the four nucleotides/bases may bear mutually distinguishable luminescent labels, and the remaining base may bear the same luminescent label as any of the preceding three; (4) sequentially contacting the template with solutions each containing one of the four nucleotides to perform single-base extension and collect corresponding images, as in single-molecule or high-throughput single-color sequencing (all four bases carry the same fluorescent label).
Typically, a given read consists of the four bases A, T/U, C and G, sometimes also containing an undetermined base N or a gap (N or a gap being, in fact, one of A, T/U, C and G). The total number of bases contained in the read is the length of the read (its read length).
From the common representation of a read as a base sequence, it will be appreciated that an ordinary read embodies only the read-round data and generally does not contain, or at least does not explicitly embody, the blank-round data.
In some examples, to reflect the process of determining the base at a position of a nucleic acid molecule at a designated surface location using sequencing-by-synthesis, including contacting it sequentially or simultaneously with four bases or combinations of bases to determine the base type at that position, reads are presented in a form consisting of read-round data and blank-round data.
For example, a read may be represented as a sequence comprising a plurality of consecutive repeat units, each repeat unit corresponding to one position (base) to be determined on the nucleic acid molecule and comprising four data bits corresponding to the four bases arranged in a predetermined order, such as ACTG. If a specific base is detected at a data bit, that bit displays the detected base A, T, C or G and is called read-round data; if no base is detected at a data bit, that bit is called blank-round data.
It will be appreciated that any read in conventional form, containing only read-round data and read out by multi-round sequencing-by-synthesis, can be converted into a read containing both read-round data and blank-round data. For example, consider the read AATGCGCCGT obtained by four-color or four-channel fluorescence sequencing (each base extension corresponds to a unique fluorescence emission band; the nucleic acid molecule under test can be contacted with all four nucleotides simultaneously in the same solution system, so that one extension reaction identifies the base type at one position of the molecule, i.e., each round of single-base extension contributes one repeat unit of four data bits). With the four bases arranged in the order ACTG within each repeat unit, each read-out base occupies its own slot in its unit and the remaining slots are blank, so the read converts to A___A_____T____G_C_____G_C___C_____G__T_ (each underscore represents one blank data bit; for instance, the three underscores after the first A represent undetected C, T and G, and the five underscores after the second A represent undetected C, T, G, A and C). The representation of the read is thus converted to include read-round data and blank-round data.
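The conversion just described can be sketched in a few lines of Python, assuming (as in the four-color example above) one single-base extension per round, a fixed ACTG slot order per repeat unit, and reads containing only the four standard bases:

def to_round_representation(read, unit_order="ACTG"):
    units = []
    for base in read:
        unit = ["_"] * len(unit_order)          # all four slots blank by default
        unit[unit_order.index(base)] = base     # detected base occupies its slot
        units.append("".join(unit))
    return "".join(units)

assert to_round_representation("AATGCGCCGT") == "A___A_____T____G_C_____G_C___C_____G__T_"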
For another example, consider a read with the same base sequence AATGCGCCGT obtained by two-color fluorescent single-molecule sequencing: two base extensions correspond to one and the same pair of fluorescence emission bands, i.e., the nucleic acid molecule is contacted with two nucleotides carrying labels of different emission bands in the same solution system at a time, so identifying the base type at one position requires contacting the four nucleotides with the molecule over two successive extension contacts. For example, A and C carry fluorescent labels F1 and F2 with distinguishable emission bands, and T and G likewise carry F1 and F2; one cycle of sequencing contacts the template first with a solution containing A and C and then with a solution containing T and G, each contact allowing a single-base extension, so each cycle comprises two extension contacts and contributes one repeat unit of four data bits in the order ACTG. Under this scheme the read converts to A___A_T____G_C_G_C___C_G__T_ (each underscore represents one blank data bit; for instance, the three underscores after the first A represent undetected C, T and G, and the single underscore after the second A represents undetected C before T is detected). The representation of the read is thus converted to include read-round and blank-round data without changing the physical or chemical meaning or characteristics of the read.
Unless otherwise indicated, the read length of a read presented herein in read-round/blank-round form is calculated from the read-round data only, i.e., the number of read-round data bits equals the read length. In addition, for the sequences exemplified herein, unless otherwise specified, a letter such as A, T, C or G and a blank represented by an English (half-width) underscore each occupy one data bit, the letter indicating a read round and the underscore "_" a blank round. One data bit indicated by an underscore is sometimes referred to as an underscore of length 1.
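This counting rule can be expressed directly, as a trivial Python sketch that assumes the underscore convention above:

def read_length(round_form):
    # Count only read-round data bits; blank rounds ("_") do not contribute.
    return sum(1 for bit in round_form if bit != "_")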
In particular, WO2017028760A1, incorporated herein by reference in its entirety, describes a single-molecule two-color sequencing method. In single-molecule two-color sequencing as disclosed in that publication, the nucleic acid molecule to be tested (target nucleic acid or template), such as a DNA strand, is attached to the surface of a solid substrate, and each round of sequencing adds two nucleotides (reversible or virtual terminators) mixed together for an extension reaction. Specifically, for example, the G base carries a red fluorescent molecular label and the C base a green one, and these two nucleotides contact the template in one round of sequencing; the A base carries a red fluorescent label and the T base a green one, and they contact the template in another round. The bases read out in each round of reaction can thus be recorded in the order of CGTA repeat units. It will be appreciated that in single-molecule two-color sequencing, each round or extension reaction exposes the template to two bases distinguishable by their luminescent signals, so two rounds of extension reactions can detect a base at one position of any template on the surface. When recording the reaction information of each round, if a luminescent base is detected, its type (e.g., A, G, C, T) is written into the recorded sequence; if no specific fluorescent signal/specific base is detected, a blank round is recorded, indicated by "_". For example, recording the signals detected in each round of reaction in CGTA order, the sequencing sequence (read) obtained for a certain template might be "_GT__G_A", meaning that in the first round, when the mixed bases C and G were added, the template reacted with the G base (G was attached to or incorporated into the strand growing on the template); in the second round, when the mixed bases A and T were added, the template reacted with the T base; in the third round, when the mixed bases C and G were added, the template reacted with the G base; and in the fourth round, when the mixed bases A and T were added, the template reacted with the A base.
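Conversely, recovering the base calls from such a recorded sequence is straightforward, since every letter is a read round and every underscore a blank round; a hedged Python sketch (the function name is hypothetical):

def decode_round_record(record):
    # Letters are the bases detected in their rounds; underscores are blank rounds.
    return "".join(bit for bit in record if bit != "_")

assert decode_round_record("_GT__G_A") == "GTGA"

Applied to the example above, the record "_GT__G_A" yields the base calls G, T, G, A in reading order.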
The image set referred to includes a plurality of images, generated by one or more rounds of sequencing during a sequencing run, that contain the designated location and its surrounding background signals. The value of a characteristic of the read is determined from at least one of three kinds of information: the image set, the read-wheel data and the blank-wheel data.
In some examples, the value of the characteristic of the read is determined from at least two of the image set, the read-wheel data and the blank-wheel data. In a specific example, more than 50 initial features are generated from a data set of known classification results (a set of reads whose classification is known) and/or from the generation process of that data set, and the values of the initial features of each read in the data set are determined from any one, or a combination, of the three kinds of information; the trained decision tree uses more than 10 of these initial features as the features, selected on the basis of the classification of the reads and of the importance of each initial feature to the training of the decision tree model.
The training or learning of classifier models is typically performed using a data set of known classification results, and the data set is typically further divided into a training set (Training Set) and a test set (Test Set); the training set is used to train the model and the test set is used to evaluate the performance of the model, including its classification prediction accuracy. The training of the decision tree model in the embodiments of the present application is no exception. Specifically, training the decision tree model with the sequences in the training set includes performing feature reduction and model parameter tuning so as to minimize the error between the predicted results and the true labels, giving the model more accurate predictive capability. The sequences in the test set are not used in the model training stage; in other words, the classification results of the test-set sequences are unknown to the model. After (each round of) training is completed, the test set is used for performance assessment: the trained model makes classification predictions for the test-set sequences, and these predictions are compared with the true classification labels of the sequences, so that the generalization capability and prediction accuracy of the model are judged from its performance on data whose classification results it has not seen. In some embodiments, the training set of known classification results is sometimes referred to as the training reads or training samples, and the test set is sometimes referred to as the test reads or test samples.
Specifically, in some examples, the trained decision tree model is determined as follows: generating M initial features based on the data set and/or its generation process, wherein the data set has known classification results and includes a plurality of reads belonging to the specified category and a plurality of reads not belonging to the specified category, every read in the data set has the initial features, the initial features are all quantifiable features, and M is an integer greater than or equal to 5; and training a machine learning model based on the data set and the initial features, comprising:
the dataset is divided into training and test reads,
Initializing parameters of the machine learning model to obtain an initialized machine learning model, and
inputting the training reads, including their initial features, into the initialized machine learning model to iteratively train the model. The iterative training includes inputting N initial features of the training reads into the machine learning model obtained after the x-th training so as to train the model for the (x+1)-th time, and determining whether to terminate the iterative training based on whether the accuracy and/or prediction speed of the classification predictions made on the test reads by the model after the (x+1)-th training meets a preset requirement: if the accuracy and/or prediction speed meets the preset requirement, the iterative training is terminated; otherwise the (x+2)-th training is conducted. Here N is an integer less than or equal to M and greater than 0, and x is an integer greater than or equal to 1.
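The following Python sketch outlines this iterative loop with LightGBM. The stop thresholds, the binary objective and the rule for choosing N (here simply the top half of features by gain importance) are assumptions for illustration; the examples below select N at an inflection point of the importance ranking and may also shrink tree-depth and tree-count parameters between rounds.

import time
import lightgbm as lgb
import numpy as np
from sklearn.metrics import accuracy_score

def iterative_training(X_train, y_train, X_test, y_test, params,
                       min_accuracy=0.8, min_speed=4e5, max_rounds=10):
    features = list(range(X_train.shape[1]))   # start from all M initial features
    model = None
    for _ in range(max_rounds):
        data = lgb.Dataset(X_train[:, features], label=y_train)
        model = lgb.train(params, data)
        t0 = time.time()
        prob = model.predict(X_test[:, features])
        speed = len(X_test) / (time.time() - t0)        # predicted reads per second
        acc = accuracy_score(y_test, prob >= 0.5)
        if acc >= min_accuracy and speed >= min_speed:  # preset requirement met
            break
        # otherwise keep only the N most important features and retrain
        order = np.argsort(model.feature_importance(importance_type="gain"))[::-1]
        features = [features[i] for i in order[: max(1, len(order) // 2)]]
    return model, features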
In some examples, preferably, M is greater than or equal to 30. In some examples, M is also less than 300 or less than 200.
The embodiment of the present application does not limit the way in which the initial features are mined or generated; in principle, any quantifiable parameter possessed by every read in the data set may be used as an initial feature. An initial feature may be a feature that has or reflects a specific physical and/or chemical meaning of the read, or a feature that has or reflects no physical or chemical meaning of the read at all. For example, any computationally usable feature generated purely from the data set, including from the one or more forms of numbers, symbols, colors and/or character strings represented or presented during its generation, may serve as an initial feature; in other words, any quantifiable feature derived from the mathematical characteristics and/or regularities formed or possessed by the data set and its generation process can be used as an initial feature. A feature may be an initial feature whether or not it characterizes all or part of the reads in the data set, and whether or not it characterizes all or part of the intermediate data of the data-set generation process. As another example, a fit, a transformation such as a normalization or standardization, or a feature determined after any mathematical operation applied to features generated from the data set and/or the intermediate data of its generation may also be used as an initial feature. As further examples, the read length, the read-wheel data length or ratio, the blank-wheel data length or ratio, the length or ratio of consecutive read-wheel data, the length or ratio of consecutive blank-wheel data, the length or ratio of read-wheel data at even data bits of the read, the number or ratio of bases of a specific type at odd positions of the read, the number or ratio of bases of a specific type at specific positions of the read, the absolute or relative signal intensity at the image position corresponding to a chemical feature, the ratio of the signal intensity at a specific image position to the signal intensity of a specific surrounding region, and the like can each be used as an initial feature.
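A few of the wheel-based quantities just listed can be computed directly from a read in wheel form, as in the following sketch; the exact feature definitions here are our assumptions for illustration, not the patent's.

from itertools import groupby

def wheel_features(read: str) -> dict:
    # read is in read-wheel/blank-wheel form, e.g. "A___A_TG_C_G_C___C_G__T_"
    n = len(read)                                    # total data bits
    runs = [(c == "_", len(list(g))) for c, g in groupby(read)]
    read_bits = sum(length for blank, length in runs if not blank)
    return {
        "read_length": read_bits,                    # counts read wheels only
        "read_wheel_ratio": read_bits / n,
        "blank_wheel_ratio": 1 - read_bits / n,
        "max_consecutive_blank": max((length for blank, length in runs if blank), default=0),
        "even_bit_read_ratio": sum(read[i] != "_" for i in range(0, n, 2)) / ((n + 1) // 2),
    }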
In some examples, more than 50 initial features, including features of intermediate data, are mined based on the two-color single-molecule sequencing data. Tables 1 and 2 each show a portion of the initial features mined from the sequencing data including intermediate data such as the image (set) and the FASTA file (.fa) or FASTQ file (.fastq or .fq) (containing read-wheel and/or blank-wheel data information with base identification or detection information, i.e., the base sequences/reads); one round (1 cycle) in Table 1 or Table 2 corresponds to one repetition, each round determining a repeating unit of four data bits presented in CGTA order.
TABLE 1
TABLE 2
In some examples, the training reads and the test reads in the data set each contain specified-category reads and non-specified-category reads, and the proportion of reads of either category in the training reads is the same as the proportion of that category in the test reads. For example, the ratio of specified-category reads to non-specified-category reads is 7:3 in the training reads, and the same holds for the proportions of the two types of reads in the test reads. This is beneficial for testing the classification effect of the model during or after training.
In other examples, the proportions of the two types of reads in the training set and the test set may differ; for example, the proportion of aligned reads in the training set may be any value, preferably greater than 5%, and the proportion of aligned reads in the test set may likewise be any value, preferably greater than 5%, and different from the proportion in the training set. This is beneficial for testing the generalization capability of the model.
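Such a split can be produced, for instance, with scikit-learn's stratified splitting, as in the minimal sketch below; the 7:3 train/test proportion follows the worked example later in this document, while the variable names are ours.

from sklearn.model_selection import train_test_split

# X: feature matrix (one row per read), y: 0/1 category labels.
# stratify=y keeps the category proportions of y in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)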
In some examples, the decision tree model selected is a gradient boosting decision tree (Gradient Boosting Decision Tree, GBDT).
The gradient boosting decision tree (GBDT) is a widely used machine learning algorithm, commonly applied to classification and regression problems, with the advantages of efficiency, accuracy and interpretability. GBDT belongs to the Boosting family of ensemble learning: it is formed by decision trees connected in series, each tree modelling anew on the basis of the output of the previous tree, so that the whole serial modelling process amounts to continually correcting the prediction result towards the target value (Zhou Zhihua, Machine Learning [M], Tsinghua University Press, 2016.01). In each iteration, GBDT learns a decision tree by fitting the negative gradient (also called the residual); the main cost of GBDT lies in learning the decision trees, and the most time-consuming part of learning a decision tree is finding the best split points. Unless otherwise stated, "learning" as a verb is used interchangeably with "training" or "iterative training".
GBDT has many effective implementations, such as XGBoost (extreme gradient boosting trees), pGBRT, CatBoost, scikit-learn, R's gbm, and LightGBM (a lightweight, highly efficient gradient boosting tree) (Guolin Ke, et al., LightGBM: A Highly Efficient Gradient Boosting Decision Tree; 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.). Preferably, at least one of the LightGBM and XGBoost algorithms is employed. More specifically, the inventors trained and tested a plurality of GBDT models, including traditional GBDT, LightGBM and XGBoost, using a plurality of public data sets with known classification results, and found that the trained traditional GBDT, LightGBM and XGBoost all performed well in distinguishing aligned reads from non-aligned reads in the data sets; furthermore, LightGBM performed better in training efficiency, memory usage and prediction accuracy. Fig. 2 shows the ROC curve of the trained LightGBM for classification prediction on the test set, with an AUC value of 0.91 and an accuracy of 0.83, indicating a good prediction effect of the trained model.
Specifically, the results of classification prediction on multiple public data sets show that LightGBM trains more than 20 times faster than traditional GBDT while achieving almost the same accuracy. Therefore, the LightGBM algorithm is adopted for training and learning, the probability that a sequence aligns successfully is predicted, and this probability value is converted into the quality score of the sequence, so that the quality of the sequence can be accurately estimated.
The machine learning model has adjustable model parameters (sometimes simply referred to as parameters). Adjusting these parameters typically changes the predictive and generalization capabilities of the model; the process of adjusting them is commonly referred to as model tuning or hyper-parameter tuning. The parameters of the various decision tree models generally include the maximum depth (max_depth), the minimum number of samples to split (min_samples_split), the feature selection criterion (criterion), and the like.
Specifically, in some examples, iteratively training the machine learning model includes: (a) inputting the M initial features of the training set into the initialized machine learning model to train it for the first time; (b) ranking the initial features according to their relative importance in the machine learning model obtained after (a) and determining the N initial features of relatively high importance, wherein the importance of an initial feature is positively correlated with the number of times the initial feature is used for splitting in (a) or with the Gini coefficient gain of the initial feature, and N is an integer smaller than M and larger than 0; and (c) replacing the M initial features in (a) with the N initial features, replacing the initialized machine learning model in (a) with the machine learning model obtained after (a), and repeating (a) and (b) one or more times until the prediction results of the resulting machine learning model on the test set meet the preset requirement.
Specifically, in a certain example, (b) further includes adjusting the model parameters of the machine learning model obtained after (a), including making the value of at least one of the model parameters smaller than its value in the most recent execution of (a).
In some examples, initializing the parameters of the machine learning model LightGBM includes initializing at least one of the following model parameters: the number of leaves on a tree (num_leaves), the minimum number of data points needed to form a new leaf (min_data_in_leaf), the proportion of random sampling per iteration (subsample), the proportion of randomly extracted features per tree (colsample_bytree), the learning rate (learning_rate), the model learner type (tree_learner), the evaluation metric (metric), and the number of trees allowed to be created (num_boost_round).
Preferably, initializing the parameters of LightGBM includes initializing at least one of the number of leaves on a tree (num_leaves) and the minimum number of data points required to form a new leaf (min_data_in_leaf), together with the number of trees allowed to be created (num_boost_round).
In a specific example, the following parameters of the LightGBM algorithm are initially assigned: the number of iterations (num_boost_round), the learning rate (learning_rate), the number of leaf nodes (num_leaves), the proportion of randomly selected features (colsample_bytree), the minimum number of samples per leaf node (min_data_in_leaf), the sampling proportion (subsample), the evaluation metric (metric), and the like.
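A minimal LightGBM initialization and training sketch follows; the parameter values are taken from the worked example later in this document, while the binary objective and the variable names are our assumptions.

import lightgbm as lgb

params = {
    "objective": "binary",        # assumed: aligned vs. non-aligned reads
    "num_leaves": 30,
    "min_data_in_leaf": 20,
    "subsample": 0.9,
    "colsample_bytree": 0.8,
    "learning_rate": 0.05,
    "tree_learner": "voting",
    "metric": "auc",
}

train_set = lgb.Dataset(X_train, label=y_train)   # feature matrix of the training reads
model = lgb.train(params, train_set, num_boost_round=3000)
prob = model.predict(X_test)                      # probability of the specified category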
In some examples, before or while performing the (x+1)-th training, the method further comprises ranking the initial features according to their relative importance in the x-th training so as to determine the N initial features of relatively high importance, the importance of an initial feature being positively correlated with the number of times the initial feature is used for splitting in the x-th training or with the Gini coefficient gain of the initial feature.
The trees in GBDT are typically CART regression trees. CART uses the Gini index (Gini index/coefficient) to select features; the Gini index represents the impurity of the model, and the smaller the Gini index, the lower the impurity and the better the feature. The purity of a data set S can be measured by the Gini value $Gini(S) = 1 - \sum_{i=1}^{c} p_i^2$, where $p_i$ is the probability of class i occurring and c is the number of classes. Gini(S) reflects the probability that two samples randomly drawn from the data set S carry inconsistent class labels; thus, the smaller Gini(S), the higher the purity of the data set S.
For a sample set S of size |S|, according to whether feature A takes a certain possible value a, S is divided into two parts S1 and S2: $S_1 = \{(x, y) \in S \mid A(x) = a\}$, $S_2 = S - S_1$; the CART classification tree algorithm builds a binary tree. Conditioned on feature A, the Gini index of the sample set S is defined as $Gini(S, A) = \frac{|S_1|}{|S|} Gini(S_1) + \frac{|S_2|}{|S|} Gini(S_2)$. The value of the Gini index lies in the range [0, 1]; the smaller it is, the higher the purity of the node (sometimes also referred to as a split node), i.e., the greater the likelihood that the samples belong to the same class. When the Gini index is 0, node purity is highest and all samples belong to the same class; when the Gini index reaches its maximum, node purity is lowest and the samples are evenly distributed across the categories. During the construction of a GBDT tree, the data set is divided into different subsets by trying different features and feature values, and the Gini index of each subset is then calculated. Among all possible splits, the decision tree selects the one with the smallest Gini index as the optimal split of the current node; that is, by minimizing the Gini index it selects the feature and feature value that best distinguish the categories.
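A minimal sketch of this impurity criterion (our helper names; two-class 0/1 labels assumed):

from collections import Counter

def gini(labels) -> float:
    # Gini impurity of a set of class labels: 1 - sum(p_i^2)
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(labels_s1, labels_s2) -> float:
    # Weighted Gini index of a binary split S -> (S1, S2)
    n = len(labels_s1) + len(labels_s2)
    return (len(labels_s1) / n) * gini(labels_s1) + (len(labels_s2) / n) * gini(labels_s2)

print(gini([1, 1, 1, 1]))   # 0.0 -> a pure node
print(gini([0, 0, 1, 1]))   # 0.5 -> an even two-class mixture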
In a particular example, the Gini coefficient gain ΔGini(A) of a feature may be calculated as follows. Specifically, let feature A be the feature used at a certain split node of a tree; the gain ΔGini(A) of feature A at that node is calculated as follows:
① The Gini index of a given data set S is defined as $Gini(S) = 1 - \sum_{i=1}^{c} p_i^2$.
② When the data set S is partitioned into k parts according to feature A, the quality of the partition can be calculated by $Gini(S, A) = \sum_{j=1}^{k} \frac{|S_j|}{|S|} Gini(S_j)$.
③ The Gini coefficient gain of feature A: $\Delta Gini(A) = Gini(S) - Gini(S, A)$.
If a feature A is used at multiple nodes of a tree, the Gini coefficient gain ΔGini(A) of feature A on that tree may be taken as the sum of its gains at those nodes.
Further, in some specific examples, the importance of each feature is determined by the Gini coefficient gain of the feature:
① A feature A is selected, and the sum of the Gini coefficient gains ΔGini(A) over all trees in the model that involve the feature is used as the importance quantization value (Gini importance) of feature A; the relative importance of the features can then be determined by sorting according to this value.
② Optionally, the Gini importance values of all features are normalized (using the sum of the Gini importances of all features as the normalization baseline), and the features are ranked according to their normalized Gini importance values to determine their relative importance.
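With LightGBM, the gain-based importance of a trained booster approximates the summed split-gain importance described above and can be normalized and ranked in a few lines; model below refers to the booster from the earlier training sketch.

import numpy as np

gain = model.feature_importance(importance_type="gain")   # one value per feature
normalized = gain / gain.sum()                            # optional step ② above
ranking = np.argsort(normalized)[::-1]                    # most important first
for idx in ranking[:10]:
    print(f"feature {idx}: importance {normalized[idx]:.4f}")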
In some examples, before or while performing the (x+1)-th training, the method further comprises adjusting the values of the model parameters, including making the value of at least one model parameter smaller than the value of that model parameter in the most recent training.
Typically, model training begins with a large number of initial features, and the initial assignment of the model parameters, in particular the number of trees and/or the parameters related to tree nodes (the depth of the trees), is also typically set high, so that the prediction speed of the initialized model is much lower than the required or desired speed; pruning, i.e., controlling the number of trees and the depth of the trees, is then typically required to simplify the model.
In a particular example, in the iterative training, one or more parameters affecting the depth of the trees and/or one or more parameters affecting the number of trees are scaled down (made smaller than in the previous training). In the LightGBM model, num_leaves and min_data_in_leaf affect the depth of the trees, and num_boost_round determines the number of trees. In this way, the features participating in model training are screened and the model parameters adjusted until the prediction effect of the trained model on the test set reaches the expected level.
In a specific example, 66 initial features are mined, including those shown in Tables 1 and 2. After the first training, the importance of these initial features is determined using the feature-importance determination of any of the above examples; as shown in Fig. 3, the importance drops sharply from the 44th initial feature to the 45th, and the features after the 44th are of lower importance, so the model after the first training can be trained a second time using the first 44 initial features.
In one example, a round of sequencing includes subjecting a template to base extension in a solution containing a plurality of nucleotides and collecting corresponding images to identify base extension signals; one round of sequencing appears on the read as an increase of a plurality of data bits whose number corresponds to the number of nucleotide species, and the plurality of nucleotides subjected to base extension exhibit a plurality of fluorescent signals including a first fluorescent signal and a second fluorescent signal. The trained model obtained by the method of any of the embodiments or examples above includes screening out features including at least one selected from: the read-wheel data density (base_density), the proportion of two consecutive read wheels in the read-wheel data (doubleratio), the proportion of runs of five consecutive blank wheels among the blank wheels (emp_len_ratio5), the proportion of the first fluorescent signal on the first data bit (rep_numG_ratio), the ratio of the first fluorescent signal to the second fluorescent signal (rep_RGratio), the ratio of the designated-position signal to its surrounding background signal (MeanThresold), the proportion of the second fluorescent signal on the second data bit (rep_numR_ratio), and the proportion of runs of three consecutive blank wheels among the blank wheels (emp_len_ratio3).
In a particular example, the trained model employs the 8 features shown above.
In some examples, the method of determining the probability that a read is classified as a specified-category read in any of the implementations or examples described above further comprises determining a quality score for the read based on that probability.
In some examples, the prediction effect being as expected, or meeting the preset requirement, means that the accuracy and the speed of the prediction results obtained on the test set with the trained model reach predetermined levels.
The manner of evaluating the classification prediction effect of the model is not limited in the embodiments of the present application. In some examples, the ROC curve is used to evaluate the classification prediction effect of the model so as to determine whether training of the model is to be terminated. The ROC curve, in full the Receiver Operating Characteristic curve, is a common statistical analysis method (Bradley, A.P. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7), 1145-1159.). Table 3 below shows the basic concepts and indicators of some model evaluations.
TABLE 3 Table 3
In the ROC curve, the FPR values are on the abscissa and the TPR values are on the ordinate.
The ROC curve can be plotted by the following steps:
a. Assume that a series of probability values of the samples being classified as positive has been obtained, sorted by size;
b. Take these probability values in turn, from high to low, as the threshold: when the probability that a test sample belongs to the positive class is greater than or equal to the threshold, the sample is considered a positive sample, and otherwise a negative sample. For example, with a threshold of 0.6, samples with probability values greater than or equal to 0.6 are considered positive samples (positive examples) and the others negative samples (negative examples);
c. Each chosen threshold yields one pair of FPR and TPR values; with the FPR value as the abscissa and the TPR value as the ordinate, this gives one point on the ROC curve;
d. Plotting the coordinate points obtained in step c gives the ROC curve.
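The sweep in steps a-d can be written compactly as below (our helper; labels are assumed to be 0/1, and ties between equal probabilities are ignored for simplicity):

import numpy as np

def roc_points(prob, labels):
    # Sort by predicted probability, descending; each prefix of the sorted
    # list is the set predicted positive at that threshold (steps a and b).
    order = np.argsort(prob)[::-1]
    labels = np.asarray(labels)[order]
    pos = labels.sum()
    neg = len(labels) - pos
    tpr = np.cumsum(labels) / pos          # true positive rate per threshold
    fpr = np.cumsum(1 - labels) / neg      # false positive rate per threshold
    return fpr, tpr                        # the points of step c

# The area under this curve (AUC) can equivalently be computed with
# sklearn.metrics.roc_auc_score(labels, prob).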
Evaluating model performance using the ROC curve: the AUC value.
AUC denotes the area under the ROC curve and is mainly used to measure the generalization performance of the model, i.e., how good the classification effect is. AUC is an evaluation index for measuring the quality of a two-class model and represents the probability that a positive example is ranked ahead of a negative example.
Criteria for judging model performance from the AUC value:
① AUC = 1: a perfect classifier; with this prediction model, there is at least one threshold that yields perfect predictions. In most prediction tasks, a perfect classifier does not exist.
② 0.5 < AUC < 1: better than random guessing. This classifier (model) can have predictive value if the threshold is set properly.
③ AUC = 0.5: the same as random guessing (e.g., tossing a coin); the model has no predictive value.
④ AUC < 0.5: worse than random guessing; however, if the predictions are always inverted, it is better than random guessing.
In a specific example, the prediction effect of the model after x trainings is as follows: the test set contains 2,253,100 sequences in total, prediction takes 78.37 s, and the prediction speed is 28,749 reads/s. As shown in Fig. 2, the ROC curve of the model has an AUC value of 0.91 and an accuracy of 0.83. The prediction accuracy and AUC value of the model are high and its prediction effect good, so the iterative training of the model can be terminated.
In some examples, as shown in Fig. 4, the inventors found that with a quality-score filtering threshold above 80, 99% of the unsuccessfully aligned sequences (those that cannot be aligned to the specified reference sequence) are already filtered out, while the loss rate of the successfully aligned sequences (those that can be aligned to the specified reference sequence) increases sharply. Therefore, to prevent a user from setting too high a quality-score filtering threshold and thereby losing too many successfully aligned sequences, the quality scores 0-80 are uniformly mapped to the range 0-100, with the mapping formula $Q = \min\left(100, \frac{100}{80} \times 100P\right)$, where Q represents the quality score of the read and P represents the probability, predicted by the trained machine learning model, that the read is classified as a specified-category read. The value thus reflects the quality of the read more accurately, increasing the practicality of the calculated quality score. Fig. 5 illustrates the filtering effect of the mapped quality score under different thresholds.
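In code, this mapping (as reconstructed from the description above: the raw score 100P is stretched by 100/80 and capped at 100) is one line:

def quality_score(p: float) -> float:
    # p: model-predicted probability that the read is a specified-category read
    return min(100.0, 100.0 * p * 100.0 / 80.0)

print(quality_score(0.8))   # 100.0 (P >= 0.8 saturates at the maximum score)
print(quality_score(0.5))   # 62.5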
After determining the quality score of each read in the test set according to the above mapping formula, Fig. 6 shows the distribution of quality-score values of the actually aligned reads (successfully aligned sequences) and the non-aligned reads (unsuccessfully aligned sequences) in the test set. It can be seen that the aligned reads exhibit high quality scores (more than 97% of the aligned reads have a quality score of 100) and the non-aligned reads exhibit low quality scores (more than 97% of the non-aligned reads have a quality score below 20). Hence the quality score of a read determined from the trained model can reliably predict the probability of the read being classified as an aligned read.
Moreover, the graph of Fig. 7 shows the correlation between the quality score and the average error rate of the uniquely aligned sequences in the test set. It can be seen that the quality score of the uniquely aligned reads is inversely related to their average error rate; in other words, the higher the quality score, the lower the error rate, and the quality score of a read determined by the trained model thus represents the quality of the sequence well. The graph of Fig. 7 is made by dividing the quality scores 0-100 of the reads of the test set into 101 groups; each quality-score group contains a plurality of uniquely aligned reads, and the average error rate of the uniquely aligned reads in one quality-score group appears as one dot in the graph. The term uniquely aligned reads refers to the reads in the test set aligned to a unique position of the reference sequence (unique reads); the error rate is the proportion of mismatched bases in a uniquely aligned read, and the average error rate is the mean of the error rates of the uniquely aligned reads in a group. In a specific example, the mismatched bases include, in addition to mismatch bases, at least one of inserted (insertion) and deleted (deletion) bases.
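The per-group averaging behind Fig. 7 can be sketched with pandas; here df is assumed to hold one row per uniquely aligned read, with its integer quality score (0-100) and its per-read error rate.

import pandas as pd

per_group = (
    df.groupby("quality_score")["error_rate"]
      .mean()                     # average error rate of each score group
      .reindex(range(101))        # one dot per quality score 0..100
)
print(per_group)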
Referring to fig. 8A, an embodiment of the present application further provides an apparatus 200 for predicting a probability that a read will be classified as a specified class read, where the apparatus is configured to implement the method for predicting a probability that a read will be classified as a specified class read in any of the above embodiments or examples, including: a feature value obtaining unit 210 for obtaining a value of a feature of a read obtained by sequencing while synthesizing; the generating unit 220 is configured to input the feature values from the feature value obtaining unit into a trained machine learning model so as to predict the probability that the read is classified into the specified category read, where the trained machine learning model is a quantization scheme associated with the feature and the probability that the corresponding read is classified into the specified category read, and the machine learning model is a decision tree.
Referring to fig. 8B, an embodiment of the present application further provides an apparatus 300 for determining a read quality score, where the apparatus is configured to implement the method for determining a read quality score in any of the foregoing embodiments or examples, including: a feature value obtaining unit 310 for obtaining a value of a feature of a read obtained by sequencing while synthesizing; the generating unit 320 is connected to the feature value obtaining unit, and is configured to input the feature value into a trained machine learning model to obtain a quality score of the read segment, where the trained machine learning model is a quantization scheme associated with the feature and a probability of the read segment being classified as a specified class read segment, the machine learning model is a decision tree, and the quality score of the read segment is positively correlated with the probability of the read segment being classified as the specified class read segment.
The apparatus 200 or 300 in the above embodiment can quickly determine the probability that a read is classified into a specified category of read or acquire the quality score of the read by inputting the values of the plurality of features of the read acquired from the feature value acquisition unit 210 into the trained decision tree in the generation unit 220 or 320, and can reliably predict the classification result of the read or obtain reliable quantized quality information. The read quality score reflects the probability that the read is classified into the read of the specified category, can effectively and reliably reflect the sequencing quality of the read, and is beneficial to distinguishing the quality degradation of the read caused by systematic errors and sequencing errors and the relatively low quality level of the read caused by specific characteristics of a sample to be tested.
The explanations in the above embodiments or examples of the method for predicting the probability that a read will be classified as a specified-category read, and of the method for determining the quality score of a read, including the steps of the methods and the settings of the condition parameters, also apply to the apparatus 200 for predicting the probability that a read will be classified as a specified-category read and to the apparatus 300 for determining the quality score of a read in this embodiment, and are not elaborated further here to avoid redundancy.
For example, in some examples, a read includes read-wheel data reflecting the presence of a base extension signal on the template corresponding to a designated location of an image set, and blank-wheel data reflecting the absence of a base extension signal on the template corresponding to the designated location of the image set; the image set includes a plurality of images, generated by one or more rounds of sequencing during a sequencing run, containing the designated location and its surrounding background signals; and the value of a characteristic of the read is determined from at least one of the image set, the read-wheel data and the blank-wheel data.
In some examples, a round of sequencing includes a process of placing a template in a solution containing one or more nucleotides for base extension and collecting corresponding images to identify base extension signals, the round of sequencing manifesting on a read as one or more data bits that increase in number corresponding to the number of nucleotide species in the solution in which the template is placed.
In some examples, a model training module may also be included in the generation unit 220 or 320 to train the model or a model saving module to save a pre-trained model. The following steps are performed in the model training module to obtain a trained model:
Generating M initial features based on the data set and/or its generation process, wherein the data set has known classification results and includes a plurality of reads belonging to the specified category of reads and a plurality of reads not belonging to the specified category of reads, every read in the data set has the initial features, the initial features are all quantifiable features, and M is an integer greater than or equal to 5;
Training a machine learning model based on the dataset and the initial features, comprising:
the dataset is divided into training and test reads,
Initializing parameters of the machine learning model to obtain an initialized machine learning model, and
Inputting the training reads, including their initial features, into the initialized machine learning model to iteratively train the machine learning model, including
Inputting N initial features of the training read into the machine learning model after the x-th training to perform the x+1-th training on the machine learning model, and
Determining whether to terminate the iterative training based on whether the accuracy and/or prediction speed of the classification predictions made on the test reads by the machine learning model after the (x+1)-th training meets a preset requirement, including terminating the iterative training if the accuracy and/or prediction speed of the classification predictions on the test reads meets the preset requirement, and otherwise performing the (x+2)-th training, wherein N is an integer less than or equal to M and greater than 0, and x is an integer greater than or equal to 1.
In some examples, each step performed in the model training module may be implemented by setting corresponding function sub-modules associated with each other.
In certain examples, M is greater than or equal to 30; preferably, M is also less than 300.
In some examples, the training reads and the test reads each include specified-category reads and non-specified-category reads, and the proportion of reads of either of the two categories in the training reads is the same as the proportion of that category in the test reads.
In some examples, the machine learning model is a gradient boosting decision tree (GBDT), and the gradient boosting decision tree employs at least one of the LightGBM and XGBoost algorithms. Preferably, the machine learning model is LightGBM.
In some examples, initializing the parameters of the machine learning model includes initializing the assignment of at least one of the following model parameters: the number of leaves on a tree (num_leaves), the minimum number of data points needed to form a new leaf (min_data_in_leaf), the proportion of random sampling per iteration (subsample), the proportion of randomly extracted features per tree (colsample_bytree), the learning rate (learning_rate), the model learner type (tree_learner), the evaluation metric (metric), and the number of trees allowed to be created (num_boost_round).
Preferably, initializing the parameters of the machine learning model includes initializing at least one of the number of leaves on a tree (num_leaves) and the minimum number of data points required to form a new leaf (min_data_in_leaf), together with the number of trees allowed to be created (num_boost_round).
Specifically, in some examples, before or while performing the (x+1)-th training, the method further includes ranking the initial features according to their relative importance in the x-th training to determine the N initial features of relatively high importance, the importance of an initial feature being positively correlated with the number of times the initial feature is used for splitting in the x-th training or with the Gini coefficient gain of the initial feature.
In some examples, before or while performing the (x+1)-th training, the method further comprises adjusting the values of the model parameters, including making the value of at least one model parameter smaller than the value of that model parameter in the most recent training.
In some examples, meeting the preset requirement is that the accuracy of the predicted outcome and the predicted speed reach a predetermined level.
In one example, a round of sequencing includes subjecting a template to base extension in a solution containing a plurality of nucleotides and collecting corresponding images to identify base extension signals; one round of sequencing appears on the read as a plurality of data bits whose number corresponds to the number of nucleotide species, the data bits including a first data bit and a second data bit, and the plurality of nucleotides subjected to base extension exhibit a plurality of fluorescent signals including a first fluorescent signal and a second fluorescent signal. The features of the read include at least one selected from: the read-wheel data density (base_density), the proportion of two consecutive read wheels in the read-wheel data (doubleratio), the proportion of runs of five consecutive blank wheels among the blank wheels (emp_len_ratio5), the proportion of the first fluorescent signal on the first data bit (rep_numG_ratio), the ratio of the first fluorescent signal to the second fluorescent signal (rep_RGratio), the ratio of the designated-position signal to its surrounding background signal (MeanThresold), the proportion of the second fluorescent signal on the second data bit (rep_numR_ratio), and the proportion of runs of three consecutive blank wheels among the blank wheels (emp_len_ratio3).
In a particular example, the trained model takes all 8 of the above features.
In some examples, the model trained by any of the examples described above is saved in the model saving module to facilitate invocation.
In some examples, the apparatus 200 for predicting reads to be classified as specified class reads further includes a scoring unit that obtains a quality score for the read based on the probabilities determined from the generating unit.
In some examples, the quality scores 0-80 are mapped to the range 0-100 according to the formula $Q = \min\left(100, \frac{100}{80} \times 100P\right)$, where Q represents the quality score of the read and P represents the probability, predicted by the trained machine learning model, that the read is classified as a specified-category read.
Another embodiment of the present application also provides a terminal, a computing device, including: a memory for storing data, including computer executable programs; a processor for executing a computer-executable program to perform the method of predicting read classification results or the method of determining read quality scores in any of the embodiments or examples described above.
Yet another embodiment of the present application also provides a computer-readable storage medium storing a program which, when executed by a computer, performs the method of predicting a read classification result or the method of determining a read quality score in any of the embodiments or examples described above.
Referring to Fig. 9, still another embodiment of the present application further provides a method for training a machine learning model, including: S62, acquiring the values of M initial features of training reads, wherein the training reads are obtained by sequencing-while-synthesizing and have known classification results, the training reads include a plurality of reads belonging to a specified category of reads and a plurality of reads not belonging to the specified category of reads, and M is an integer greater than or equal to 5; and S64, inputting the values of the initial features into a machine learning model and training the machine learning model with the known classification results as labels, to obtain a trained machine learning model for determining read quality scores, wherein the machine learning model is a decision tree and the read quality score determined by the trained machine learning model is positively correlated with the probability that the read is classified as a specified-category read. With the decision tree model selected by the inventors, a trained model can be obtained quickly, and the trained model has a good classification prediction effect.
In some examples, training the machine learning model includes tuning the machine learning model and feature dimension reduction. It will be appreciated that the method, manner, setting, and associated technical effects related to model training, among the methods for predicting the read classification result or determining the read quality score in any of the above embodiments or examples, are equally applicable to the model learning method of the present embodiment, and are not further detailed herein.
Specifically, for example, training the decision tree model includes: initializing the parameters of the machine learning model; and inputting the initial features of the training reads into the initialized machine learning model to iteratively train the machine learning model, including inputting N initial features of the training reads into the machine learning model obtained after the x-th training so as to train it for the (x+1)-th time, and determining whether to terminate the iterative training based on whether the accuracy and/or prediction speed of the classification predictions made on the test reads by the machine learning model after the (x+1)-th training meets a preset requirement, including terminating the iterative training if the accuracy and/or prediction speed of the classification predictions on the test reads meets the preset requirement, and otherwise performing the (x+2)-th training, wherein N is an integer less than or equal to M and greater than 0, x is an integer greater than or equal to 1, and the test reads have known classification results.
In certain examples, M is greater than or equal to 30; preferably, M is also less than 300 at the same time.
In some examples, the proportion of specified-category or non-specified-category reads in the training reads is the same as the proportion in the test reads.
In some examples, the machine learning model is a gradient boosting decision tree (GBDT) that employs at least one of the LightGBM and XGBoost algorithms; preferably, LightGBM is used.
In some examples, initializing the machine learning model includes initializing the assignment of at least one of the following model parameters: the number of leaves on a tree (num_leaves), the minimum number of data points needed to form a new leaf (min_data_in_leaf), the proportion of random sampling per iteration (subsample), the proportion of randomly extracted features per tree (colsample_bytree), the learning rate (learning_rate), the model learner type (tree_learner), the evaluation metric (metric), and the number of trees allowed to be created (num_boost_round). In a specific example, all of these parameters are initialized.
Preferably, in some examples, initializing the machine learning model includes initializing at least one of the number of leaves on a tree (num_leaves) and the minimum number of data points required to form a new leaf (min_data_in_leaf), together with the number of trees allowed to be created (num_boost_round).
In some examples, before or while performing the (x+1)-th training, the method further comprises ranking the initial features according to their relative importance in the x-th training to determine the N initial features of relatively high importance, the importance of an initial feature being positively correlated with the number of times it is used for splitting in the x-th training or with its Gini coefficient gain.
In some examples, before or while performing the (x+1)-th training, the method further comprises adjusting the values of the model parameters, including making the value of at least one model parameter smaller than the value of that model parameter in the most recent training.
In some examples, meeting the preset requirement means that the accuracy of the prediction results and the prediction speed reach predetermined levels.
The embodiment of the present application also provides an apparatus for training a machine learning model, which is used to implement the model learning method of any of the above embodiments or examples and includes: an initial-feature-value acquisition unit for acquiring the values of M initial features of training reads, wherein the training reads are obtained by sequencing-while-synthesizing and have known classification results, the training reads include a plurality of reads belonging to a specified category of reads and a plurality of reads not belonging to the specified category of reads, and M is an integer greater than or equal to 5; and a training unit that inputs the values of the initial features from the initial-feature-value acquisition unit into a machine learning model and trains the machine learning model with the known classification results as labels, to obtain a trained machine learning model for determining read quality scores, wherein the machine learning model is a decision tree and the read quality score determined by the trained machine learning model is positively correlated with the probability that the read is classified as a specified-category read.
It will be appreciated that all the additional technical features and associated technical effects of the model learning method in any of the above embodiments are equally applicable to the model learning device of this embodiment, and therefore, for avoiding redundancy, a detailed description is omitted herein.
In some embodiments, there is also provided a terminal, a computing device comprising a memory for storing data, including storing a trained machine learning model obtained by any of the above embodiments or example model learning methods or apparatus; and a processor for executing a program to determine a quality score of the read using the trained machine learning model.
A computer readable storage medium is also provided for storing a trained machine learning model obtained by any of the embodiments or example model learning methods.
Referring to Fig. 10, the embodiment of the present application further provides a sequencing method, including: S82, performing sequencing-while-synthesizing on the nucleic acid molecule to be detected, and determining the quality scores of the reads generated by the sequencing using the method for determining a read quality score of any of the above embodiments or a trained machine learning model obtained by the model learning method of any of the above embodiments; and S84, performing sequence analysis on the reads whose quality scores are higher than a predetermined threshold. By evaluating the quality of the obtained reads, the method facilitates subsequent efficient processing and analysis of the sequencing data.
In some examples, the predetermined threshold may be set to 80. In some tests, 99% of the unsuccessfully aligned sequences (non-aligned reads) are filtered out when the predetermined threshold is set to 80.
In one example, the quality score of a read may be calculated according to the formula $Q = \min\left(100, \frac{100}{80} \times 100P\right)$, where Q represents the quality score of the read and P represents the probability, predicted by the trained machine learning model, that the read is classified as a specified-category read; this maps the quality scores 0-80 onto 0-100 for the convenience of user setting.
Finally, embodiments of the present application also provide a sequencing system comprising a computing device as in any of the embodiments described above, or a computer-readable storage medium as in any of the embodiments.
Logic and/or steps represented in the flowcharts or otherwise described herein, for example an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch the instructions from the instruction execution system, apparatus, or device and execute them. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). Additionally, the computer-readable medium may even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner if necessary, and then stored in a computer memory. The various computer-readable storage media described herein can represent one or more devices and/or other machine-readable storage media for storing information. The term "machine-readable storage medium" can include, without being limited to, wireless channels and various other media capable of storing, containing, and/or carrying instructions and/or data.
It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, may be implemented using any one or combination of the following techniques, as is well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits having suitable combinational logic gates, programmable Gate Arrays (PGAs), field Programmable Gate Arrays (FPGAs), and the like.
Those of ordinary skill in the art will appreciate that all or part of the steps carried out in the method of the above-described embodiments may be implemented by a program to instruct related hardware, and the program may be stored in a computer readable storage medium, where the program when executed includes one or a combination of the steps of the method embodiments.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented as software functional modules and sold or used as a stand-alone product.
Examples
1. Data source
Sequencing data from NIPT samples were used, with 16 lanes per chip and two samples per lane. The sequencing conditions, chip, reagent formulation and so on were identical across the data set. The total sequencing length was 71 cycles, of which the first 3 cycles were the barcode; the sample sequence was sequenced starting from cycle 4, so the parameter statistics were calculated starting from cycle 4. To ensure the accuracy of model prediction, sequences shorter than 25 bp were filtered out, i.e., sequences of length less than 25 bp did not participate in the training of the model.
2. Mining features
The (initial) features are mined from the .fa file and the image files obtained at basecalling. Some of the mined initial features are shown in Tables 1 and 2.
3. Training model and adjusting parameters
(1) Labeling the data set: sequences that align successfully are labeled 1 (positive samples for short) and sequences that fail to align are labeled 0 (negative samples for short). The total data set contains 7,510,331 reads, of which positive samples account for 30.95% and negative samples for 69.05%; the number of (initial) features is 66;
(2) Splitting the data set: the data are divided into a training set and a test set at a ratio of 7:3, keeping the positive-to-negative sample ratios of the two sets consistent with the original data set;
(3) Initializing model parameters, training a model, and calculating ROC curve, AUC value and accuracy of the model after each training.
The initialization parameters are: [num_leaves: 30, min_data_in_leaf: 20, subsample: 0.9, colsample_bytree: 0.8, learning_rate: 0.05, tree_learner: voting, metric: auc, num_boost_round = 3000]; other parameters that are not explicitly assigned take their default values. The meanings of the parameters can be found in the LightGBM documentation, e.g., at https://lightgbm.readthedocs.io.
(4) Examining the prediction effect of each trained model on the test set. For the initialized model, the prediction speed is as follows: the test set contains 2,253,100 reads in total, prediction takes 78.37 s, and the prediction speed is 28,749 reads/s. The ROC curve of the model is shown in Fig. 2, with an AUC value of 0.91 and an accuracy of 0.83. In summary, the accuracy and AUC value of the model are high and the prediction effect of the model is good.
To increase the prediction speed of the model (the prediction speed of the initialized model is far below the expected 400,000 reads/s), dimension reduction and parameter tuning are carried out on the premise of preserving the prediction effect, simplifying the model and thereby further increasing its prediction speed.
(5) Model dimension reduction and parameter tuning: on the one hand, the variable importances of the model are calculated, a variable-importance ranking curve is drawn, the inflection point where the importance drops markedly is selected, and the first N important variables are retained, thereby reducing the dimensionality of the model. On the other hand, the parameters of the model are adjusted and the model is retrained, repeating steps (2), (3) and (4) until the classification effect and speed requirements on the model are met.
For example, the initial importance ranking of the 66 features is shown in Fig. 3. It can be seen from the figure that the importance drops sharply from the 44th feature to the 45th, and the features after the 44th are of lower importance, so the first 44 features are preferentially selected for retraining the model.
In addition, the model parameters mainly tuned include [num_leaves, min_data_in_leaf, num_boost_round], where num_leaves and min_data_in_leaf affect the depth of the trees and num_boost_round determines the number of trees.
Since the prediction speed of the initial model is far below the required speed, pruning is needed: the number of trees and the tree depth are restricted to simplify the model. The features and parameters participating in model training are adjusted iteratively until the model's prediction performance and speed meet the requirements.
Finally, the sequence-filtering effect of the trained model is evaluated. If the requirements are met, training stops and the probability value predicted by the model is converted into a quality score in the range 0-100; if not, the parameters and features are tuned and the model is retrained.
The final model uses 8 features, namely:
[base_density, doubleratio, emp_len_ratio5, rep_numG_ratio, rep_RGratio, MeanThresold, rep_numR_ratio, emp_len_ratio3].
The final model parameter values are: [num_leaves: 7, min_data_in_leaf: 200, subsample: 0.9, colsample_bytree: 0.8, learning_rate: 0.05, tree_learner: voting, metric: auc, num_boost_round: 200].
The prediction speed of the trained model is 1,050,000 reads/s, higher than the expected 400,000 reads/s. The final configuration is sketched in code below.
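A minimal sketch of the final, pruned configuration (8 features, 200 shallow trees), reusing the names from the earlier sketches; the binary objective is again an assumption:

```python
final_features = ["base_density", "doubleratio", "emp_len_ratio5",
                  "rep_numG_ratio", "rep_RGratio", "MeanThresold",
                  "rep_numR_ratio", "emp_len_ratio3"]
final_params = {
    "objective": "binary",       # assumed, as above
    "num_leaves": 7,             # few leaves per tree ...
    "min_data_in_leaf": 200,     # ... and large leaves keep prediction fast
    "subsample": 0.9,
    "colsample_bytree": 0.8,
    "learning_rate": 0.05,
    "tree_learner": "voting",
    "metric": "auc",
}
final_model = lgb.train(final_params,
                        lgb.Dataset(X_train[final_features], label=y_train),
                        num_boost_round=200)
```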
Further, the performance of the trained model is verified on the test set: the model is used to determine the quality score of each read in the test set, and the reads are filtered by quality score. As shown in Fig. 4, when the quality-score filtering threshold exceeds 80, 99% of the unsuccessfully aligned sequences (unaligned reads) are filtered out, but the loss rate of successfully aligned sequences (aligned reads) rises sharply. To prevent users from setting an excessively high quality-score filtering threshold and thereby losing too many successfully aligned reads, the quality score is uniformly mapped onto the range 0-100. The specific mapping formula is as follows:
where P is the model-predicted probability that the sequence aligns successfully.
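The mapping formula itself appears as an image in the source and is not reproduced in the text. Purely as an assumption, a plausible form consistent with the surrounding description (a probability P in [0, 1] mapped uniformly onto 0-100) would be the linear rescaling below; the actual patented mapping may differ, e.g. it may be non-linear:

```latex
Q = \mathrm{round}(100 \cdot P), \qquad P \in [0,1], \quad Q \in \{0, 1, \dots, 100\}
```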
The filtering effect of the mapped quality score at different thresholds is shown in Fig. 5.
As the quality-score distribution in Fig. 6 shows, reads in the test set that actually aligned successfully have high quality scores, while reads that actually failed to align have low quality scores; the read quality score therefore predicts the alignment outcome of a read well.
Furthermore, Fig. 7 plots the correlation between the quality score and the average error rate of the uniquely aligned reads in the test set. The quality score of uniquely aligned reads is inversely related to the sequence error rate; in other words, the higher the quality score, the lower the error rate, so the read quality score determined by the trained model represents sequence quality well. Fig. 7 is constructed by dividing the quality scores 0-100 of the test-set reads into 101 groups; each group contains multiple uniquely aligned reads, and the mean error rate of the uniquely aligned reads in one group appears as one dot in Fig. 7 (this binning is sketched in code below). Here, "uniquely aligned reads" are the reads in the test set aligned to a unique position of the reference sequence (unique reads); "error rate" is the proportion of mismatched bases in a uniquely aligned read; and "average error rate" is the mean of the error rates of the uniquely aligned reads in a group. In a particular example, the mismatched bases may include, in addition to mismatch bases, inserted (insertion) and/or deleted (deletion) bases.
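A minimal sketch of how the dots in Fig. 7 could be computed, assuming a pandas DataFrame `df` of uniquely aligned test-set reads with an integer quality score (0-100) and a per-read error rate; the column names are illustrative:

```python
import pandas as pd

# one group per integer quality score -> 101 groups; each dot in Fig. 7 is the
# mean error rate of the uniquely aligned reads that share one quality score
per_bin = df.groupby("quality_score")["error_rate"].mean()
per_bin.plot(style=".", xlabel="quality score", ylabel="mean error rate")
```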
The prediction performance of the trained model is next tested on other datasets with known classification results:
1. Data source
① This dataset was not involved in model training or parameter tuning; its sequencing conditions are consistent with those of the data used for model training: the total sequencing length is 71 cycles, with the first 3 cycles as barcode.
② Sequencing data of E. coli, a commonly used standard pathogen sample; this dataset is used to test the model's performance on non-human samples, with a sequencing length of 72 cycles.
2. Model effect
1) Prediction performance of the model on sequencing data of 32 NIPT samples: 14,701,793 reads in total, with an original alignment success rate of 31.75%. Prediction speed: 929,840 reads/s. The filtering effect on the 32 NIPT samples is shown in Fig. 11, the quality-score distributions of the actually aligned and actually unaligned sequences are shown in Fig. 12, and the correlation between the quality score and the average error rate of the corresponding uniquely aligned sequences is shown in Fig. 13.
2) Prediction performance of the model on sequencing data of 16 E. coli samples: 1,163,639 reads in total, with an original alignment success rate of 35.03%. Prediction speed: 930,583 reads/s. The filtering effect on the 16 E. coli samples is shown in Fig. 14, the quality-score distributions of the actually aligned and actually unaligned sequences are shown in Fig. 15, and the correlation between the quality score and the average error rate of the corresponding uniquely aligned sequences is shown in Fig. 16. A sketch of this hold-out prediction follows below.
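A minimal sketch of this hold-out evaluation, reusing `final_model` and `final_features` from the earlier sketches and assuming a new feature matrix `X_new` computed for the held-out reads:

```python
import time

t0 = time.perf_counter()
p_new = final_model.predict(X_new[final_features])   # alignment probabilities
elapsed = time.perf_counter() - t0
print(f"prediction speed: {len(p_new) / elapsed:,.0f} reads/s")
```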
In the description of this specification, reference to the terms "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine different embodiments or examples described in this specification, and features thereof, provided they do not contradict one another.
While embodiments of the present application have been shown and described above, it will be understood that the above embodiments are illustrative and are not to be construed as limiting the application; those of ordinary skill in the art may make variations, modifications, substitutions, and alterations to the above embodiments within the scope of the application.

Claims (9)

1. A method of predicting the probability that a read is classified as a read of a specified category, comprising:
acquiring the value of a feature of a read, wherein the read is obtained by sequencing-by-synthesis; and
inputting the value of the feature into a trained machine learning model so as to predict the probability that the read is classified as a read of the specified category, wherein the trained machine learning model is a quantization scheme associating the feature of the read with the probability that the read is classified as a read of the specified category, and the machine learning model is a decision tree.
2. A method of determining a read quality score, comprising:
acquiring the value of a feature of a read, wherein the read is obtained by sequencing-by-synthesis; and
inputting the value of the feature into a trained machine learning model to obtain the quality score of the read, wherein the trained machine learning model is a quantization scheme associating the feature of the read with the probability that the read is classified as a read of a specified category, the machine learning model is a decision tree, and the quality score of the read is positively correlated with the probability that the read is classified as a read of the specified category.
3. The method of claim 1 or 2, wherein a read comprises read-round data, reflecting rounds in which a base extension signal is present on the template corresponding to a designated position of the image set, and empty-round data, reflecting rounds in which no base extension signal is present on the template corresponding to the designated position of the image set,
the image set comprising a plurality of images generated by one or more rounds of sequencing during a sequencing run and including the designated position and its surrounding background signal,
the value of the feature of the read being determined from at least one of the image set, the read-round data, and the empty-round data;
optionally, a round of sequencing comprises subjecting the template to base extension in a solution containing one or more nucleotides and acquiring corresponding images to identify base extension signals, each completed round of sequencing adding to the read one or more data bits corresponding to the number of nucleotide species in the solution in which the template is placed;
optionally, the trained machine learning model is determined as follows:
generating M initial features based on a dataset, including the process by which the dataset was generated, the dataset having known classification results and comprising a plurality of reads belonging to a specified category and a plurality of reads not belonging to the specified category, each read in the dataset having the initial features, each initial feature being quantifiable, and M being an integer greater than or equal to 5; and
training the machine learning model based on the dataset and the initial features, comprising:
dividing the dataset into training reads and test reads,
initializing the parameters of the machine learning model to obtain an initialized machine learning model, and
inputting the initial features of the training reads into the initialized machine learning model to iteratively train the machine learning model, including
inputting N initial features of the training reads into the machine learning model after the x-th training so as to perform the (x+1)-th training, and
determining whether to terminate the iterative training based on whether the accuracy and/or prediction speed of the machine learning model after the (x+1)-th training meets a preset requirement, including terminating the iterative training if the accuracy and/or prediction speed of its predictions of the classification of the test reads meets the preset requirement, and otherwise performing the (x+2)-th training, wherein N is an integer less than or equal to M and greater than 0, and x is an integer greater than or equal to 1;
Optionally, M is greater than or equal to 30;
optionally, the training reads and the test reads each comprise specified-category reads and non-specified-category reads, and the proportion of reads of either category in the training reads is the same as the proportion of that category in the test reads;
optionally, the machine learning model is a gradient boosting decision tree (GBDT) employing at least one of the LightGBM and XGBoost algorithms;
optionally, the machine learning model is LightGBM;
optionally, initializing the parameters of the machine learning model comprises initializing an assignment to at least one of the following model parameters: the number of leaves on a tree (num_leaves), the minimum number of data points required to form a new leaf (min_data_in_leaf), the fraction of samples randomly drawn per iteration (subsample), the fraction of features randomly drawn per tree (colsample_bytree), the learning rate (learning_rate), the learner type (tree_learner), the evaluation metric (metric), and the number of trees allowed to be created (num_boost_round);
optionally, initializing the parameters of the machine learning model comprises initializing at least one of the number of leaves on a tree (num_leaves), the minimum number of data points required to form a new leaf (min_data_in_leaf), and the number of trees allowed to be created (num_boost_round);
optionally, before or as part of performing the (x+1)-th training, the method further comprises ranking the initial features by their relative importance in the x-th training to determine the N initial features of relatively high importance, the importance of an initial feature being positively correlated with the number of times the feature is used for splitting in the x-th training or with the gain contributed by the feature;
optionally, before or as part of performing the (x+1)-th training, the method further comprises adjusting the values of the model parameters, including making the value of at least one model parameter smaller than its value in the previous training;
optionally, meeting the preset requirement means that the accuracy of the prediction results and the prediction speed reach predetermined levels;
optionally, a round of sequencing comprises subjecting the template to base extension in a solution containing a plurality of nucleotides and acquiring corresponding images to identify base extension signals, each completed round of sequencing adding to the read a plurality of data bits corresponding to the number of nucleotide species, the data bits comprising a first data bit and a second data bit, the plurality of nucleotides undergoing base extension exhibiting a plurality of fluorescent signals comprising a first fluorescent signal and a second fluorescent signal, and the features of the read comprising at least one selected from the group consisting of:
the read-round data density (base_density), the proportion of runs of two consecutive read rounds in the round data (doubleratio), the proportion of runs of five consecutive empty rounds in the round data (emp_len_ratio5), the proportion of the first fluorescent signal on the first data bit (rep_numG_ratio), the ratio of the first fluorescent signal to the second fluorescent signal (rep_RGratio), the ratio of the signal at the designated position to its surrounding background signal (MeanThresold), the proportion of the second fluorescent signal on the second data bit (rep_numR_ratio), and the proportion of runs of three consecutive empty rounds in the round data (emp_len_ratio3);
optionally, a quality score of the read is further determined based on the probability;
optionally, the quality score of the read is calculated according to the following formula,
where Q represents the quality score of the read and P represents the probability, predicted by the trained machine learning model, that the read is classified as a read of the specified category.
4. An apparatus for predicting the probability that a read is classified as a read of a specified category, comprising:
a feature value acquisition unit, configured to acquire the value of a feature of a read, wherein the read is obtained by sequencing-by-synthesis; and
a generation unit, configured to input the feature values from the feature value acquisition unit into a trained machine learning model so as to predict the probability that the read is classified as a read of the specified category, wherein the trained machine learning model is a quantization scheme associating the features with the probability that the corresponding read is classified as a read of the specified category, and the machine learning model is a decision tree.
5. An apparatus for determining a read quality score, comprising:
a feature value acquisition unit, configured to acquire the value of a feature of a read, wherein the read is obtained by sequencing-by-synthesis; and
a generation unit, connected to the feature value acquisition unit and configured to input the value of the feature into a trained machine learning model so as to obtain the quality score of the read, wherein the trained machine learning model is a quantization scheme associating the feature with the probability that the read is classified as a read of a specified category, the machine learning model is a decision tree, and the quality score of the read is positively correlated with the probability that the read is classified as a read of the specified category.
6. A method of training a machine learning model, comprising:
acquiring the values of M initial features of training reads, wherein the training reads are obtained by sequencing-by-synthesis and have known classification results, the training reads comprising a plurality of reads belonging to a specified category and a plurality of reads not belonging to the specified category, and M is an integer greater than or equal to 5; and
inputting the values of the initial features into a machine learning model and training the machine learning model with the known classification results as labels to obtain a trained machine learning model for determining read quality scores, wherein the machine learning model is a decision tree, and the quality score of a read determined by the trained machine learning model is positively correlated with the probability that the read is classified as a read of the specified category.
7. The method of claim 6, wherein training the machine learning model comprises parameter tuning and dimension reduction of the machine learning model;
optionally, training the machine learning model comprises:
initializing parameters of the machine learning model; and
inputting the initial features of the training reads into the initialized machine learning model to iteratively train the machine learning model, including inputting N initial features of the training reads into the machine learning model after the x-th training so as to perform the (x+1)-th training, and determining whether to terminate the iterative training based on whether the accuracy and/or prediction speed of the machine learning model's predictions of the classification of test reads after the (x+1)-th training meets a preset requirement, including terminating the iterative training if the accuracy and/or prediction speed meets the preset requirement and otherwise performing the (x+2)-th training, wherein N is an integer less than or equal to M and greater than 0, x is an integer greater than or equal to 1, and the test reads have known classification results;
Optionally, M is greater than or equal to 30;
optionally, the proportion of specified-category or non-specified-category reads in the training reads is the same as their proportion in the test reads;
optionally, the machine learning model is a gradient boosting decision tree (GBDT) employing at least one of the LightGBM and XGBoost algorithms;
optionally, the machine learning model is LightGBM;
optionally, initializing the machine learning model comprises initializing an assignment to at least one of the following model parameters: the number of leaves on a tree (num_leaves), the minimum number of data points required to form a new leaf (min_data_in_leaf), the fraction of samples randomly drawn per iteration (subsample), the fraction of features randomly drawn per tree (colsample_bytree), the learning rate (learning_rate), the learner type (tree_learner), the evaluation metric (metric), and the number of trees allowed to be created (num_boost_round);
optionally, initializing the machine learning model comprises initializing at least one of the number of leaves on a tree (num_leaves), the minimum number of data points required to form a new leaf (min_data_in_leaf), and the number of trees allowed to be created (num_boost_round);
optionally, before or as part of performing the (x+1)-th training, the method further comprises ranking the initial features by their relative importance in the x-th training to determine the N initial features of relatively high importance, the importance of an initial feature being positively correlated with the number of times the feature is used for splitting in the x-th training or with the gain contributed by the feature;
optionally, before or as part of performing the (x+1)-th training, the method further comprises adjusting the values of the model parameters, including making the value of at least one model parameter smaller than its value in the previous training;
optionally, meeting the preset requirement means that the accuracy of the prediction results and the prediction speed reach predetermined levels.
8. An apparatus for training a machine learning model, comprising:
an initial feature value acquisition unit, configured to acquire the values of M initial features of training reads, wherein the training reads are obtained by sequencing-by-synthesis and have known classification results, the training reads comprising a plurality of reads belonging to a specified category and a plurality of reads not belonging to the specified category, and M is an integer greater than or equal to 5; and
a training unit, configured to input the values of the initial features from the initial feature value acquisition unit into a machine learning model and to train the machine learning model with the known classification results as labels to obtain a trained machine learning model for determining read quality scores, wherein the machine learning model is a decision tree, and the quality score of a read determined by the trained machine learning model is positively correlated with the probability that the read is classified as a read of the specified category.
9. A method of sequencing, comprising:
performing sequencing-by-synthesis on a nucleic acid molecule to be tested, and determining the quality scores of the reads generated by sequencing using the method of claim 2 or 3 or a trained machine learning model obtained by the method of claim 6 or 7; and
performing sequence analysis on the reads whose quality score is above a predetermined threshold;
optionally, the predetermined threshold is 80.
CN202311865546.9A 2023-12-29 2023-12-29 Method for determining read mass fraction, sequencing method and sequencing device Pending CN117976042A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311865546.9A CN117976042A (en) 2023-12-29 2023-12-29 Method for determining read mass fraction, sequencing method and sequencing device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311865546.9A CN117976042A (en) 2023-12-29 2023-12-29 Method for determining read mass fraction, sequencing method and sequencing device

Publications (1)

Publication Number Publication Date
CN117976042A true CN117976042A (en) 2024-05-03

Family

ID=90852431

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311865546.9A Pending CN117976042A (en) 2023-12-29 2023-12-29 Method for determining read mass fraction, sequencing method and sequencing device

Country Status (1)

Country Link
CN (1) CN117976042A (en)


Legal Events

Date Code Title Description
PB01 Publication