Disclosure of Invention
The invention addresses the following problem in the prior art: after a tissue sample is extracted, its quality cannot be predicted quickly and effectively, so sample quality remains uncontrolled during the subsequent library construction and sequencing, which lengthens the overall sequencing process and raises its cost. The invention provides a model and software for predicting NGS quality control. By analyzing data from the early experimental stages of NGS and building a model on them, it accurately predicts whether a tissue sample will pass quality control. The method allows sample quality to be evaluated and the quality control outcome to be predicted on the first day of the experiment, so that clinicians and patients receive timely feedback.
The technical scheme is as follows:
A method for predicting the quality of a sample in NGS sequencing comprises the following steps:
step 1, sequentially carrying out DNA extraction, DNA fragmentation, end repair and adapter ligation on a tissue sample;
step 2, carrying out a PCR reaction on the DNA solution obtained in step 1;
step 3, acquiring, during the operations of step 1 and step 2, data on the DNA entry amount, the concentration after adapter ligation and cleanup, the total amount after adapter ligation and cleanup, the amplification cycle number, the concentration after PCR, the total amount after PCR and the amplification ratio, using these data as input variables of a prediction model, and using whether sequencing quality control is passed as the output variable of the model, to construct the prediction model;
step 4, predicting the quality control result of a sample to be sequenced with the constructed prediction model.
Preferably, in step 1, DNA is extracted from the tissue sample by phenol-chloroform extraction, a spin column method or a magnetic bead method.
Preferably, in step 1, the tissue sample is one that has undergone rapid freezing, paraffin fixation or formalin fixation.
Preferably, in step 3, the prediction model is a classifier.
Preferably, in step 3, the classifier includes: support vector machine, decision tree, random forest, logistic regression, Bayesian, K-nearest neighbor, K-means, Markov, and ridge regression algorithms.
A system for predicting sample quality in NGS sequencing, comprising:
a data acquisition module for acquiring, during sample extraction and the PCR reaction, data on the DNA (deoxyribonucleic acid) entry amount, the concentration after adapter ligation and cleanup, the total amount after adapter ligation and cleanup, the amplification cycle number, the concentration after PCR, the total amount after PCR, the amplification ratio and the quality control result;
and a prediction module for predicting the outcome of sample processing, taking the DNA entry amount, the concentration after adapter ligation and cleanup, the total amount after adapter ligation and cleanup, the amplification cycle number, the concentration after PCR, the total amount after PCR and the amplification ratio as input variables of the prediction model and whether sequencing quality control is passed as the output variable of the model.
Preferably, the prediction module is constructed based on a classifier.
Preferably, the classifier comprises: support vector machine, decision tree, random forest, logistic regression, Bayesian, K-nearest neighbor, K-means, Markov, and ridge regression algorithms.
A computer readable medium carrying a program for executing the method for predicting sample quality in NGS sequencing.
Advantageous effects
For the first time, a quality control prediction model is constructed from the experimental parameters of tissue sample extraction and pre-library construction. It predicts from early-stage experimental indexes whether a sample is likely to pass quality control, and offers high throughput, high specificity and high sensitivity.
Detailed Description
For the first time, the invention establishes, from the experimental indexes of tissue sample extraction and pre-library construction, a prediction model for whether a sample will pass quality control, thereby improving the specificity and sensitivity of quality control prediction for tissue samples.
The prediction method provided by the invention builds a model from parameters of the DNA fragments obtained after DNA extraction from a tissue sample and parameters of the PCR process, and predicts the quality control result of the subsequent library construction and sequencing.
The specific process is as follows:
Sample extraction: DNA is extracted from the tissue sample, its concentration is measured, and the DNA entry amount is calculated from the volume of sample added. DNA extraction methods applicable to the invention include phenol-chloroform extraction, the spin column method and the magnetic bead method; the tissue sample may have undergone rapid freezing, paraffin fixation, formalin fixation or similar treatment. In the following examples, DNA was extracted by the spin column method and all samples were paraffin-embedded.
DNA fragmentation, end repair and adapter ligation: the entered DNA is fragmented, the ends of the resulting fragments are repaired, adapters are added, the product is purified, its concentration is measured, and the total amount after adapter ligation and cleanup is calculated from the sample volume.
PCR amplification: the number of PCR cycles is set according to the DNA concentration and the entry amount; after the PCR is finished, the concentration after PCR is measured and the total amount after PCR is calculated from the volume.
The above steps thus yield the input data required by the prediction method. A conventional library construction and sequencing process can then follow, briefly: the PCR amplification product obtained above is hybridized with and captured by designed probes and separated by adsorption onto streptavidin-coated magnetic beads; the captured nucleic acid is eluted from the beads; because material is lost during capture, the eluted product is amplified again by PCR; the amplified product is then sequenced on the instrument and the resulting data are analyzed to obtain the sequencing result.
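The quantities measured in the steps above reduce to simple arithmetic: each total amount is a concentration multiplied by a volume. The following is a minimal Python sketch of that calculation. The function name, argument names and example numbers are illustrative assumptions and are not taken from the invention.

```python
def total_amount_ng(concentration_ng_per_ul: float, volume_ul: float) -> float:
    """Total DNA (ng) = concentration (ng/uL) x volume (uL)."""
    if concentration_ng_per_ul < 0 or volume_ul < 0:
        raise ValueError("concentration and volume must be non-negative")
    return concentration_ng_per_ul * volume_ul


# Hypothetical example values, for illustration only.
dna_entry_ng = total_amount_ng(4.0, 25.0)      # extracted DNA carried into the workflow
post_adapter_ng = total_amount_ng(6.0, 20.0)   # after adapter ligation and cleanup
post_pcr_ng = total_amount_ng(40.0, 30.0)      # after PCR
```

The same helper serves all three measurement points, so the three totals are always computed in the same units.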
The main term definitions in the present invention are:
DNA entry amount: the total amount of DNA, extracted from the tissue sample, that enters the subsequent treatment. The data range is typically 50 to 250 ng.
Concentration after adapter ligation and cleanup: the DNA concentration in the solution obtained after the extracted DNA has undergone fragmentation, end repair, adapter ligation and cleanup. The value depends on the entry amount and the sample quality and is generally between 1 and 50 ng/μL.
Total amount after adapter ligation and cleanup: the total DNA content of the solution obtained after the extracted DNA has undergone fragmentation, end repair, adapter ligation and cleanup. The value depends on the entry amount and the sample quality and is generally between 20 and 700 ng.
Amplification cycle number: the number of cycles of the PCR reaction. The data range is generally 5 to 30 cycles and can be controlled to about 7 to 15 cycles.
Concentration after PCR: the DNA concentration in the solution after the PCR reaction. The value depends on the entry amount, the sample quality and the fold amplification and typically ranges from 5 to 150 ng/μL.
Total amount after PCR: the total amount of DNA in the solution after the PCR reaction. The value depends on the entry amount and the sample quality and is generally 100 to 3000 ng.
Amplification ratio: the ratio of the total amount after PCR to the total amount after adapter ligation and cleanup. The value is generally greater than 1.5; a value below 1.5 is treated as a warning.
Quality control pass/fail: in this patent, whether the NGS sequencing process passes quality control is evaluated by the following indexes: average sequencing depth greater than 500×, effective sequencing depth greater than 200×, Q30 (%) > 75%, and alignment rate to the human genome greater than 90%.
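Two of the definitions above are rules that can be written down directly: the amplification ratio with its warning threshold of 1.5, and the pass/fail label derived from the four sequencing indexes. The sketch below is illustrative only. The function and argument names are assumptions, and it further assumes that all four indexes must be satisfied together for a sample to count as qualified.

```python
def amplification_ratio(total_after_pcr_ng: float,
                        total_after_adapter_ng: float) -> float:
    """Total amount after PCR divided by total amount after adapter cleanup."""
    if total_after_adapter_ng <= 0:
        raise ValueError("total amount after adapter cleanup must be positive")
    return total_after_pcr_ng / total_after_adapter_ng


def low_amplification_warning(ratio: float, threshold: float = 1.5) -> bool:
    """True when the ratio is below the warning threshold stated in the text."""
    return ratio < threshold


def qc_qualified(mean_depth: float, effective_depth: float,
                 q30_percent: float, alignment_rate_percent: float) -> bool:
    """Assumed rule: every one of the four indexes must exceed its cutoff."""
    return (mean_depth > 500 and effective_depth > 200
            and q30_percent > 75 and alignment_rate_percent > 90)
```

For example, `amplification_ratio(1200.0, 120.0)` gives a tenfold amplification, which does not trigger the warning.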
The experimental procedure of the present invention is shown in FIG. 1.
Samples involved in the present invention
714 tissue samples collected from July 2019 to February 2021 were analyzed retrospectively and divided into a training set and a validation set. The training set is used to construct the model with optimal sensitivity and specificity, and the validation set is used to verify the accuracy of the model's predictions. A further 272 samples were then collected prospectively from March to May 2021. The grouping of the enrolled samples into training and validation groups is as follows:
TABLE 1 sample information
Model construction
Parameter screening and modeling were carried out on 7 variables of the 481 training-set samples: the DNA entry amount, the concentration after adapter ligation and cleanup, the total amount after adapter ligation and cleanup, the amplification cycle number, the concentration after PCR, the total amount after PCR and the amplification ratio. Using a ridge regression algorithm, the model that includes all 7 variables simultaneously was found to perform best.
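To make the modeling step concrete, the following is a minimal pure-Python sketch of a ridge regression classifier over the seven variables. It is not the implementation used in the invention: the ±1 label coding, the zero decision threshold, the penalty value, the identifier names and the toy rows are all assumptions chosen for illustration.

```python
# The seven input variables named in the text (identifier spellings are assumed).
FEATURES = [
    "dna_entry_ng", "post_adapter_conc", "post_adapter_total_ng",
    "pcr_cycles", "post_pcr_conc", "post_pcr_total_ng", "amplification_ratio",
]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ridge(rows, labels, lam=1.0):
    """Ridge fit to labels coded +1 (qualified) and -1 (unqualified)."""
    X = [[1.0] + list(r) for r in rows]  # leading 1.0 is the intercept column
    p = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(p)] for i in range(p)]
    for i in range(1, p):  # the intercept is left unpenalised
        xtx[i][i] += lam
    xty = [sum(x[i] * y for x, y in zip(X, labels)) for i in range(p)]
    return solve(xtx, xty)


def predict(weights, row):
    """Return (score, qualified); a score at or above 0 is read as qualified."""
    score = weights[0] + sum(w * v for w, v in zip(weights[1:], row))
    return score, score >= 0.0


# Synthetic toy rows in FEATURES order, invented purely to exercise the code.
train_rows = [
    [200.0, 30.0, 600.0, 8.0, 100.0, 2520.0, 4.2],
    [150.0, 20.0, 400.0, 9.0, 80.0, 2000.0, 5.0],
    [60.0, 2.0, 40.0, 14.0, 5.0, 50.0, 1.25],
    [50.0, 1.5, 30.0, 15.0, 5.0, 37.5, 1.25],
]
train_labels = [1, 1, -1, -1]
weights = fit_ridge(train_rows, train_labels, lam=1.0)
score, qualified = predict(weights, [180.0, 25.0, 500.0, 8.0, 90.0, 2250.0, 4.5])
```

In practice the variables would usually be standardized before fitting, since the penalty treats all coefficients alike while the variables differ in scale by orders of magnitude.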
The performance of the selected optimal model in the training set is shown in FIG. 2, in which the model classifies samples as passing or failing quality control. The training-set AUC of the ridge regression model was 0.955. Sensitivity and specificity were 88.9% and 95.9%, respectively, as shown in Table 2.
TABLE 2 model sensitivity and specificity in training set
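Sensitivity, specificity and AUC, as reported for the model, can be computed from true labels and model outputs as sketched below. This is a generic illustration: the function names are assumptions, labels are coded 1 for the positive class and 0 for the negative class, and which quality control outcome is treated as positive is not stated in the text.

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Tied scores count as half a correct ranking.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Both functions assume that each class occurs at least once in `y_true`; otherwise the denominators are zero.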
For comparison, two control models were constructed, each omitting one of two variables: the amplification cycle number or the amplification ratio.
Input variables of control model 1: the DNA entry amount, the concentration after adapter ligation and cleanup, the total amount after adapter ligation and cleanup, the concentration after PCR, the total amount after PCR and the amplification ratio; the amplification cycle number is removed.
Input variables of control model 2: the DNA entry amount, the concentration after adapter ligation and cleanup, the total amount after adapter ligation and cleanup, the amplification cycle number, the concentration after PCR and the total amount after PCR; the amplification ratio is removed.
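The two control variable sets amount to dropping one name from the seven-variable list. A small Python sketch of that bookkeeping follows; the identifier spellings and helper names are assumptions introduced for illustration.

```python
ALL_VARIABLES = [
    "dna_entry_amount",
    "concentration_after_adapter_cleanup",
    "total_after_adapter_cleanup",
    "amplification_cycle_number",
    "concentration_after_pcr",
    "total_after_pcr",
    "amplification_ratio",
]


def drop_variable(variables, name):
    """Return a copy of the variable list without the named variable."""
    if name not in variables:
        raise ValueError(f"unknown variable: {name}")
    return [v for v in variables if v != name]


def select_columns(sample, variables):
    """Pick the model inputs for one sample (a dict) in the given order."""
    return [sample[v] for v in variables]


control_model_1 = drop_variable(ALL_VARIABLES, "amplification_cycle_number")
control_model_2 = drop_variable(ALL_VARIABLES, "amplification_ratio")
```

Keeping the order of the remaining variables fixed means the same training code can be reused for the full model and for both control models.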
The control models were built on the same data set by the same method, with the following prediction results:
Table 3: modeling performance of control model 1 (amplification cycle number removed)
Table 4: modeling performance of control model 2 (amplification ratio removed)
It can be seen that when either the amplification cycle number or the amplification ratio is removed, the model's sensitivity and specificity in the training set and the validation sets are significantly lower than those of the model built with all 7 variables. The 7-variable model performs significantly better than the other models, both in the training set and in the two subsequent validation sets. When the degree of influence of each variable on the model's predictions is calculated, the amplification cycle number and the amplification ratio show the larger influence, 0.08675629 and 0.1342217 respectively. The degree of influence of each variable on the model result is as follows:
Table 5: degree of influence of each variable on the prediction result
In the method of the invention, the amplification cycle number and the amplification ratio are two important variables. When sample quality is poor, the template DNA is severely damaged and amplifies weakly. The amplification ratio also depends on the amplification cycle number: when sample quality is poor, the amount of DNA remaining after adapter ligation is low, so more PCR amplification is required. The amplification cycle number and the amplification ratio therefore reflect sample quality to a certain extent and can serve as main indexes for predicting the quality control outcome of a sample. They allow the subsequent sequencing result to be predicted early in sample processing and improve the accuracy of the model.
Model validation
The 7 experimental parameters of the 233 samples in independent validation set 1 were input into the constructed model for validation; the resulting AUC reached 0.991, as shown in FIG. 3. The sensitivity and specificity of the model were 100% and 97.8%, respectively, as shown in Table 6.
Table 6 model sensitivity and specificity in validation set 1
To further verify the accuracy of the model, 272 samples were collected prospectively as independent validation set 2, and their experimental parameters were input into the constructed model to further verify its performance; the resulting AUC reached 1, as shown in FIG. 4. The sensitivity and specificity of the model were 100% and 100%, as shown in Table 7.
Table 7 model sensitivity and specificity in validation set 2
It can be seen that the model of this scheme predicts well whether quality control is passed.
Model package
For convenience in practical use, the model's algorithm is compiled and packaged with Python into a small software program that can be installed directly on a third-party computer. The 7 experimental parameters (DNA entry amount, concentration after adapter ligation and cleanup, total amount after adapter ligation and cleanup, amplification cycle number, concentration after PCR, total amount after PCR and amplification ratio) are entered directly in the software interface, and clicking 'prediction' displays the prediction result on the interface, as shown in FIG. 5.
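Before a prediction is made, software of this kind has to turn the seven typed-in fields into numbers. The sketch below shows one way that step could look. It is a hypothetical illustration, not the packaged software itself: the field names, the error messages and the rule rejecting negative values are assumptions.

```python
# Hypothetical field names, one per experimental parameter, in a fixed order.
FIELDS = (
    "dna_entry_ng", "post_adapter_conc", "post_adapter_total_ng",
    "pcr_cycles", "post_pcr_conc", "post_pcr_total_ng", "amplification_ratio",
)


def parse_inputs(form):
    """Return the seven values as floats, in FIELDS order, or raise ValueError."""
    values = []
    for name in FIELDS:
        text = str(form.get(name, "")).strip()
        if not text:
            raise ValueError(f"missing value for {name}")
        try:
            value = float(text)
        except ValueError:
            raise ValueError(f"{name} is not a number: {text!r}") from None
        if value < 0:
            raise ValueError(f"{name} must not be negative")
        values.append(value)
    return values


# Example form contents as they might arrive from text boxes (made-up values).
example_form = {
    "dna_entry_ng": "100", "post_adapter_conc": "6", "post_adapter_total_ng": "120",
    "pcr_cycles": "10", "post_pcr_conc": "40", "post_pcr_total_ng": "1200",
    "amplification_ratio": "10",
}
example_values = parse_inputs(example_form)
```

The resulting list can then be passed to whatever trained model the software wraps, in the same variable order used during training.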