CN113528631B

CN113528631B - Method and system for predicting sample quality in NGS sequencing

Info

Publication number: CN113528631B
Application number: CN202110757007.8A
Authority: CN
Inventors: 何俊俊; 邵阳; 刘凯华; 朱伟; 高宇; 杨岚; 汪笑男; 王晓丹; 焦乐晨; 赵瑾
Original assignee: Nanjing Shihe Gene Biotechnology Co ltd; First Affiliated Hospital of Zhejiang University School of Medicine
Current assignee: Nanjing Shihe Gene Biotechnology Co ltd; First Affiliated Hospital of Zhejiang University School of Medicine
Priority date: 2021-07-05
Filing date: 2021-07-05
Publication date: 2022-05-13
Anticipated expiration: 2041-07-05
Also published as: CN113528631A

Abstract

The invention relates to a model and software for predicting whether a tumor tissue sample will pass quality control (QC), based on experimental indexes from the early stage of the NGS workflow, and belongs to the field of clinical laboratory science and biotechnology. Experimental indexes from the first day of the workflow (DNA extraction and pre-library construction) are screened for factors related to QC, and an NGS QC prediction model is built from these early-stage indexes by ridge regression. Patients whose samples are predicted to fail QC can be informed promptly, which avoids the situation in which a patient learns that the sample failed, and must prepare a new one, only after sequencing and bioinformatics analysis are complete. The model is compiled and packaged into a small software tool in Python; an experimenter enters the relevant experimental indexes through an interface to predict whether the sample will pass QC, so the method is suited to assessing the QC risk of a sample after sequencing.

Description

Method and system for predicting sample quality in NGS sequencing

Technical Field

The invention discloses a model and software for predicting NGS detection quality control through experimental early-stage data, belonging to the field of clinical laboratory and biotechnology.

Background

Next Generation Sequencing (NGS), also known as Massively Parallel Sequencing (MPS), has the advantages of high throughput, lower cost of single base detection, higher speed, and capability of detecting a large number of target genes at a time, compared to the conventional sequencing technology, and thus is widely applied to the fields of tumor-targeted therapeutic gene mutation detection, genetic tumor detection, genetic disease and rare disease detection, chromosome aneuploidy noninvasive prenatal screening, pathogenic microorganism and metagenome detection, and the like.

High-throughput sequencing involves many operating steps and complicated procedures, and is divided into two stages: the wet experiment (wet bench) and the dry experiment (dry bench). The wet experiment comprises sample pretreatment, nucleic acid extraction, genome fragmentation, pre-library construction, enrichment, final library construction, pre-sequencing preparation, sequencing and the like; the dry experiment comprises post-sequencing data quality analysis, alignment, variant calling, annotation, and result reporting and interpretation. A patient's sample takes about 7 days to pass through the wet and dry experiments, so the final quality control result is only available after 7 days. For a patient whose sample fails quality control, this not only easily causes misunderstanding of the technology but can also delay treatment. Quality control failures are mainly caused by the sample itself: a tissue sample, for example, is exposed to many uncontrollable factors such as paraffin embedding and carries a risk of DNA degradation and fragmentation, which reduces the final yield of effective sequencing data.

Disclosure of Invention

The invention addresses the following problem in the prior art: after DNA is extracted from a tissue sample, the quality of the sample cannot be predicted quickly and effectively, so sample quality is uncontrolled during the subsequent library construction and sequencing, which lengthens the whole sequencing process and increases cost. The invention provides a model and software for predicting the quality control of NGS detection, which accurately predict whether a tissue sample will pass quality control by analyzing and modeling data from the early experimental stage of NGS. The method can evaluate sample quality on the first day of the experiment, predict the quality control outcome, and give timely feedback to clinicians and patients.

The technical scheme is as follows:

a method for predicting the quality of a sample in NGS sequencing comprises the following steps:

step 1, sequentially carrying out DNA extraction, DNA fragmentation, end repair and adaptor ligation on a tissue sample;

step 2, carrying out PCR reaction on the DNA solution obtained in the step 1;

step 3, acquiring, from the operations in step 1 and step 2, the DNA input amount, the concentration after adaptor ligation and cleanup, the total amount after adaptor ligation and cleanup, the amplification cycle number, the concentration after PCR, the total amount after PCR and the amplification ratio, using these data as the input variables of a prediction model and the pass/fail outcome of sequencing quality control as the output variable, and constructing the prediction model;

step 4, predicting the quality control result of the sample to be sequenced by using the constructed prediction model.

Preferably, in step 1, DNA is extracted from the tissue sample by phenol-chloroform extraction, the centrifugal column method or the magnetic bead method.

Preferably, in step 1, the tissue sample has undergone rapid freezing treatment, paraffin fixation treatment or formalin fixation treatment.

Preferably, in step 3, the prediction model is a classifier.

Preferably, in step 3, the classifier includes: support vector machine, decision tree, random forest, logistic regression, Bayes, K-nearest neighbor, K-means, Markov, and ridge regression algorithms.

A system for predicting sample quality in NGS sequencing, comprising:

the data acquisition module, used for acquiring the DNA input amount, the concentration after adaptor ligation and cleanup, the total amount after adaptor ligation and cleanup, the amplification cycle number, the concentration after PCR, the total amount after PCR, the amplification ratio and the quality control result from the sample extraction and PCR processes;

and the prediction module, used for predicting the quality control result of the sample by taking the DNA input amount, the concentration after adaptor ligation and cleanup, the total amount after adaptor ligation and cleanup, the amplification cycle number, the concentration after PCR, the total amount after PCR and the amplification ratio as input variables of the prediction model and the pass/fail outcome of sequencing quality control as the output variable of the model.

Preferably, the prediction module is constructed based on a classifier.

Preferably, the classifier comprises: support vector machine, decision tree, random forest, logistic regression, Bayes, K-nearest neighbor, K-means, Markov, and ridge regression algorithms.

A computer readable medium carrying a program for executing the method for predicting sample quality in NGS sequencing.

Advantageous effects

For the first time, a quality control prediction model is constructed from the experimental parameters of tissue sample extraction and pre-library construction. The model predicts from early-stage experimental indexes whether a sample is likely to pass quality control, and offers high throughput, high specificity and high sensitivity.

Drawings

FIG. 1: research design flow chart of quality control prediction model

FIG. 2: ridge regression modeling, training set ROC curves

FIG. 3: ridge regression modeling, ROC curve of independent validation set 1

FIG. 4: ridge regression modeling, ROC curve of prospective independent validation set 2

FIG. 5: interface diagram based on PYTHON packaging software

Detailed Description

According to the invention, a prediction model for predicting whether the quality control of the sample is qualified is established based on the experimental indexes of tissue sample extraction and pre-library construction for the first time, so that the specificity and sensitivity of the quality control prediction of the tissue sample are improved.

The prediction method provided by the invention models parameters of the DNA obtained after extraction from a tissue sample and of the PCR process, and predicts the quality control result of the subsequent library construction and sequencing.

The specific process is as follows:

Sample extraction: DNA is extracted from the tissue sample and its concentration is measured; the DNA input amount is calculated from the volume of sample added. DNA extraction methods applicable to the invention include phenol-chloroform extraction, the centrifugal column method, the magnetic bead method and the like, and the tissue sample may have undergone rapid freezing, paraffin fixation, formalin fixation and the like. In the following examples, DNA was extracted by the centrifugal column method, and all samples were paraffin-embedded.

DNA fragmentation, end repair and adaptor ligation: the input DNA is fragmented, the ends of the resulting fragments are repaired, adaptors are ligated, and the product is purified; the concentration is then measured, and the total amount after adaptor ligation and cleanup is calculated from the sample volume.

PCR amplification: the number of PCR cycles is set according to the DNA concentration and the input amount; after PCR is finished, the post-PCR concentration is measured, and the total amount after PCR is calculated from the volume.

The above steps yield the input data required by the prediction method. Conventional library construction and sequencing can then follow, briefly: the PCR amplification product obtained above is captured by hybridization with designed probes and separated by adsorption onto streptavidin-coated magnetic beads; the captured nucleic acid is eluted from the beads and, to compensate for losses, amplified again by PCR; the amplified product is sequenced on the instrument and the resulting data are analyzed to obtain the sequencing result.
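As a rough illustration of how the seven inputs follow from the bench measurements described above, the sketch below computes each total as concentration times volume and then derives the amplification ratio. The function and field names, the units (ng, ng/µL, µL) and the example numbers are assumptions made for illustration; they are not taken from the patent.

```python
from dataclasses import dataclass, astuple


@dataclass
class PreLibraryInputs:
    """The seven model inputs named in the text (names and units are assumed)."""
    dna_input_ng: float
    post_ligation_conc: float       # ng/uL after adaptor ligation and cleanup
    post_ligation_total_ng: float
    pcr_cycles: int
    post_pcr_conc: float            # ng/uL after PCR
    post_pcr_total_ng: float
    amplification_ratio: float


def build_inputs(extract_conc, input_volume_ul,
                 post_ligation_conc, post_ligation_volume_ul,
                 pcr_cycles, post_pcr_conc, post_pcr_volume_ul):
    """Derive totals as concentration x volume, then the amplification ratio."""
    dna_input_ng = extract_conc * input_volume_ul
    post_ligation_total = post_ligation_conc * post_ligation_volume_ul
    post_pcr_total = post_pcr_conc * post_pcr_volume_ul
    if post_ligation_total <= 0:
        raise ValueError("post-ligation total must be positive")
    return PreLibraryInputs(
        dna_input_ng, post_ligation_conc, post_ligation_total,
        pcr_cycles, post_pcr_conc, post_pcr_total,
        post_pcr_total / post_ligation_total,
    )


# Hypothetical measurements, for illustration only.
sample = build_inputs(4.0, 50.0, 6.0, 20.0, 10, 40.0, 25.0)
print(astuple(sample))
```

Keeping the derived quantities in one place means the amplification ratio is always computed from the same two totals that are fed to the model.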

The main term definitions in the present invention are:

DNA input amount: the total amount of DNA, extracted from the tissue sample, that is carried into the subsequent processing steps. The data range is typically 50 to 250 ng.

Concentration after adaptor ligation and cleanup: the DNA concentration in the solution obtained after the extracted DNA has been fragmented, end-repaired, adaptor-ligated and cleaned up. The value depends on the input amount and the sample quality, and is generally between 1 and 50 ng/µL.

Total amount after adaptor ligation and cleanup: the total DNA content of the solution obtained after the extracted DNA has been fragmented, end-repaired, adaptor-ligated and cleaned up. The value depends on the input amount and the sample quality, and is generally between 20 and 700 ng.

Number of amplification cycles: the number of cycles of the PCR reaction. The data range is generally 5 to 30 cycles, and can be controlled to about 7 to 15 cycles.

Concentration after PCR: the DNA concentration in the solution after the PCR reaction. The value depends on the input amount, the sample quality and the fold amplification, and typically ranges from 5 to 150 ng/µL.

Total amount after PCR: the total amount of DNA in the solution after the PCR reaction. The value depends on the input amount and the sample quality, and is generally 100 to 3000 ng.

Amplification ratio: the ratio of the total amount after PCR to the total amount after adaptor ligation and cleanup. The value is generally greater than 1.5; a value below 1.5 is a warning sign.

Quality control pass/fail: in this patent, the following indexes are used to evaluate whether the NGS sequencing process passes quality control: average sequencing depth greater than 500x, effective sequencing depth greater than 200x, Q30 greater than 75%, and alignment rate to the human genome greater than 90%.
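The pass/fail criteria and the amplification-ratio warning defined above can be written as two small checks. This is a minimal sketch that assumes all four sequencing indexes must be met together; the function and parameter names are hypothetical.

```python
def qc_passes(mean_depth, effective_depth, q30_percent, alignment_percent):
    """Return True when all four thresholds from the text are exceeded.

    Assumption: the criteria are combined with a logical AND.
    """
    return (mean_depth > 500
            and effective_depth > 200
            and q30_percent > 75
            and alignment_percent > 90)


def amplification_warning(amplification_ratio):
    """Flag an amplification ratio below the 1.5 warning level."""
    return amplification_ratio < 1.5


# Hypothetical values, for illustration only.
print(qc_passes(800, 350, 90.0, 99.0))   # all thresholds exceeded
print(qc_passes(800, 150, 90.0, 99.0))   # effective depth too low
print(amplification_warning(1.2))
```

A function like `qc_passes` would supply the output label for training, while `amplification_warning` only checks one of the input variables.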

The experimental procedure of the present invention is shown in FIG. 1.

The case of the samples involved in the present invention

714 tissue samples from July 2019 to February 2021 were analyzed retrospectively and divided into a training set and a validation set. The training set was used to construct the model with the best sensitivity and specificity, and the validation set was used to verify the accuracy of the model's predictions. A further 272 samples were then collected prospectively from March to May 2021. The enrolled samples were grouped into a training group and validation groups, with the following information:

TABLE 1 sample information

Model construction

Parameter screening and modeling were carried out on 7 variables of the 481 training-set samples: the DNA input amount, the concentration after adaptor ligation and cleanup, the total amount after adaptor ligation and cleanup, the amplification cycle number, the concentration after PCR, the total amount after PCR and the amplification ratio. Using the ridge regression algorithm, the model that includes all 7 variables at once was found to perform best.
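To make the modeling step concrete, the following is a minimal pure-Python sketch of ridge regression used as a classifier on the seven variables. It is not the patented model: the penalty `lam`, the 0/1 label coding (1 = QC pass), the standardization step, the 0.5 cut-off and all sample values are assumptions made for illustration.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ridge(rows, labels, lam=1.0):
    """Fit ridge regression on standardized features; return a scoring function."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    stds = [(sum((r[j] - means[j]) ** 2 for r in rows) / n) ** 0.5 or 1.0
            for j in range(p)]
    z = [[(r[j] - means[j]) / stds[j] for j in range(p)] for r in rows]
    y_mean = sum(labels) / n
    # Penalized normal equations: (Z^T Z + lam * I) w = Z^T (y - y_mean)
    gram = [[sum(zr[i] * zr[j] for zr in z) + (lam if i == j else 0.0)
             for j in range(p)] for i in range(p)]
    rhs = [sum(zr[i] * (y - y_mean) for zr, y in zip(z, labels))
           for i in range(p)]
    w = solve(gram, rhs)

    def score(row):
        return y_mean + sum(w[j] * (row[j] - means[j]) / stds[j]
                            for j in range(p))
    return score


# Hypothetical training rows: input ng, post-ligation conc, post-ligation total,
# PCR cycles, post-PCR conc, post-PCR total, amplification ratio.
rows = [
    [200, 12.0, 240, 8, 60.0, 1500, 6.25],
    [180, 11.0, 220, 8, 56.0, 1400, 6.36],
    [220, 13.0, 260, 7, 62.0, 1550, 5.96],
    [60, 1.5, 30, 15, 1.6, 40, 1.33],
    [80, 2.0, 40, 14, 2.2, 55, 1.38],
    [50, 1.0, 20, 16, 1.0, 25, 1.25],
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = QC pass, 0 = QC fail (assumed coding)
score = fit_ridge(rows, labels)
print([round(score(r), 2) for r in rows])
```

The L2 penalty keeps the linear system well conditioned even though the seven variables are strongly correlated with one another, which is a common reason for preferring ridge regression over ordinary least squares in this kind of setting.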

The performance of the selected optimal model on the training set, in distinguishing samples that pass quality control from samples that fail it, is shown in FIG. 2. The training-set AUC of the ridge regression model was 0.955. Sensitivity and specificity were 88.9% and 95.9%, respectively, as shown in Table 2.

TABLE 2 model sensitivity and specificity in training set
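For readers who want to reproduce metrics of this kind on their own data, the sketch below computes sensitivity, specificity and ROC AUC from labels and model scores. Which class the patent treats as "positive" is not stated, so it is left as a parameter; the labels, scores and the 0.5 cut-off are made-up illustration values.

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for binary predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(y_true, scores, positive=1):
    """AUC as the chance that a positive outranks a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and scores, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

sens, spec = sensitivity_specificity(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(sens, spec, auc)
```

Sensitivity and specificity depend on the chosen cut-off, whereas AUC summarizes the ranking of scores over all possible cut-offs.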

For comparison, two control models were constructed, each omitting one of two variables: the amplification cycle number or the amplification ratio.

Input variables for control model 1: the DNA input amount, the concentration after adaptor ligation and cleanup, the total amount after adaptor ligation and cleanup, the concentration after PCR, the total amount after PCR and the amplification ratio; the amplification cycle number is removed.

Input variables for control model 2: the DNA input amount, the concentration after adaptor ligation and cleanup, the total amount after adaptor ligation and cleanup, the amplification cycle number, the concentration after PCR and the total amount after PCR; the amplification ratio is removed.
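The two control feature sets can be derived from the full seven-variable table by dropping one column each, as in the sketch below. The column names and the sample row are hypothetical.

```python
FEATURES = [
    "dna_input", "post_ligation_conc", "post_ligation_total",
    "pcr_cycles", "post_pcr_conc", "post_pcr_total", "amplification_ratio",
]


def drop_feature(rows, name):
    """Return (kept feature names, rows without the named column)."""
    idx = FEATURES.index(name)
    kept = [f for f in FEATURES if f != name]
    reduced = [[v for j, v in enumerate(r) if j != idx] for r in rows]
    return kept, reduced


# One hypothetical sample in the column order given by FEATURES.
rows = [[200, 12.0, 240, 8, 60.0, 1500, 6.25]]

control_1 = drop_feature(rows, "pcr_cycles")            # control model 1
control_2 = drop_feature(rows, "amplification_ratio")   # control model 2
print(control_1[0])
print(control_2[0])
```

Each reduced table can then be passed to the same fitting routine as the full model, so that any difference in performance is attributable to the omitted variable.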

The control models were built on the same data set by the same method, giving the following prediction results:

Table 3: Modeling performance of control model 1 (amplification cycle number removed)

Table 4: Modeling performance of control model 2 (amplification ratio removed)

It can be seen that when either the amplification cycle number or the amplification ratio is removed, the model's sensitivity and specificity in the training set and the validation sets are markedly lower than those of the model built with all 7 variables; the 7-variable model performs markedly better than the other models in the training set and in both subsequent validation sets. When the degree of influence of each variable on the model's prediction is calculated, the amplification cycle number and the amplification ratio have the larger influence on the predicted result, 0.08675629 and 0.1342217 respectively. The degree of influence of each variable on the model result is as follows:

Table 5: Degree of influence of each factor on the prediction result

In the method of the invention, the amplification cycle number and the amplification ratio are two important variables. When sample quality is poor, the template DNA is severely damaged and amplifies weakly. The amplification ratio also depends on the amplification cycle number: when sample quality is poor, the amount of material recovered after adaptor ligation is low, so more PCR amplification is needed. The amplification cycle number and the amplification ratio therefore reflect sample quality to some extent and can serve as main indexes for predicting the quality control of the sample, which allows the subsequent sequencing result to be predicted at an early stage of sample processing and improves model accuracy.

Model validation

The 7 experimental parameters of the 233 samples in independent validation set 1 were input into the constructed model for validation, and the resulting AUC reached 0.991, as shown in FIG. 3. The sensitivity and specificity of the model were 100% and 97.8%, respectively, as shown in Table 5.

Table 5 model sensitivity and specificity in validation set 1

To further verify the accuracy of the model, 272 samples were collected prospectively as independent validation set 2, and their experimental parameters were input into the constructed model to test its performance further. The resulting AUC reached 1, as shown in FIG. 4. The sensitivity and specificity of the model were 100% and 100%, respectively, as shown in Table 6.

Table 6 model sensitivity and specificity in validation set 2

It can be seen that the model in the scheme can better predict whether the quality control is qualified.

Model package

For convenience in practical use, the model's algorithm is compiled and packaged into a small software tool in Python that can be installed directly on a third-party computer. The 7 experimental parameters (the DNA input amount, the concentration after adaptor ligation and cleanup, the total amount after adaptor ligation and cleanup, the amplification cycle number, the concentration after PCR, the total amount after PCR and the amplification ratio) are entered directly in the software interface, and clicking "prediction" displays the predicted result on the interface, as shown in FIG. 5.
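The patent does not disclose the source of the packaged tool. As a loose sketch of the logic behind a "prediction" button, the code below parses seven text fields, rejects missing or non-numeric entries, and passes the values to a scoring function. The field names, the 0.5 cut-off and the stand-in scorer are all hypothetical; the stand-in is a placeholder rule, not the trained ridge regression model.

```python
FIELDS = [
    "dna_input", "post_ligation_conc", "post_ligation_total",
    "pcr_cycles", "post_pcr_conc", "post_pcr_total", "amplification_ratio",
]


def parse_inputs(form):
    """Convert the seven text fields to floats, reporting any bad fields."""
    values, errors = [], []
    for name in FIELDS:
        raw = str(form.get(name, "")).strip()
        try:
            values.append(float(raw))
        except ValueError:
            errors.append(name)
    if errors:
        raise ValueError("missing or non-numeric fields: " + ", ".join(errors))
    return values


def predict_label(form, score_fn, threshold=0.5):
    """Score the parsed inputs and map the score to a display string."""
    score = score_fn(parse_inputs(form))
    return "QC pass" if score >= threshold else "QC fail"


def stand_in_scorer(values):
    """Placeholder only: uses the amplification-ratio warning level of 1.5."""
    return 1.0 if values[6] >= 1.5 else 0.0


# Hypothetical text entered by the experimenter.
form = {
    "dna_input": "200", "post_ligation_conc": "12.0",
    "post_ligation_total": "240", "pcr_cycles": "8",
    "post_pcr_conc": "60.0", "post_pcr_total": "1500",
    "amplification_ratio": "6.25",
}
print(predict_label(form, stand_in_scorer))
```

Separating parsing from scoring lets the same validation code serve any fitted model that accepts the seven values in a fixed order.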

Claims

1. A method for predicting the quality of a sample in NGS sequencing is characterized by comprising the following steps:

step 1, sequentially carrying out DNA extraction, DNA fragmentation, tail end repair and connector adding treatment on a tissue sample;

step 2, carrying out PCR reaction on the DNA solution obtained in the step 1;

step 3, acquiring data of DNA input amount, concentration after adding connector cleaning, total amount after adding connector cleaning, amplification cycle number, concentration after PCR, total amount after PCR and amplification proportion in the operation process in the step 1 and the step 2, using the data as input variables of a prediction model, and using the qualified condition of sequencing quality control as output variables of the model to construct the prediction model;

step 4, predicting a quality control result of a sample to be sequenced by adopting the constructed prediction model;

the DNA input amount refers to the total DNA amount of a tissue sample subjected to DNA extraction and then subjected to a subsequent treatment process;

the concentration after the adaptor is added and cleaned refers to the concentration of DNA in a solution obtained after fragmentation, end repair, adaptor addition and cleaning of the extracted DNA;

the total amount of the DNA subjected to adaptor adding and cleaning refers to the total amount of the DNA in a solution obtained by fragmenting, tail end repairing, adaptor adding and cleaning the extracted DNA;

the amplification cycle number refers to the PCR reaction cycle number;

the concentration after PCR refers to the concentration of DNA in the solution after PCR reaction;

the total amount after PCR refers to the total amount of DNA in the solution after PCR reaction;

the amplification ratio refers to the ratio of the total amount after PCR to the total amount after adaptor cleaning.

2. The method of claim 1, wherein in step 1, DNA is extracted from the tissue sample by phenol-chloroform extraction, the centrifugal column method or the magnetic bead method.

3. The method of claim 1, wherein in the step 1, the tissue sample is subjected to a rapid freezing treatment, a paraffin fixation treatment or a formalin fixation treatment.

4. The method of claim 1, wherein in the step 3, the prediction model is a classifier.

5. The method of claim 4, wherein the classifier comprises: support vector machine, decision tree, random forest, logistic regression, Bayes, K-nearest neighbor, K-means, Markov, and ridge regression algorithms.

6. A system for predicting quality of a sample in NGS sequencing is characterized by comprising:

the prediction module is used for predicting the result of the sample processing process by taking the DNA entry amount, the concentration after the cleaning of the added connector, the total amount after the cleaning of the added connector, the amplification cycle number, the concentration after the PCR, the total amount after the PCR and the amplification proportion as input variables of the prediction model and the qualified condition of sequencing quality control as output variables of the model;

the concentration after adding the connector and cleaning is the concentration of DNA in a solution obtained by fragmenting, repairing the tail end, adding the connector and cleaning the extracted DNA;

the total amount of the DNA subjected to adaptor adding and cleaning refers to the total amount of the DNA in a solution obtained by fragmenting, tail end repairing, adaptor adding and cleaning the extracted DNA; the amplification cycle number refers to the PCR reaction cycle number;

7. The system of claim 6, wherein the prediction module is constructed based on a classifier.

8. The system according to claim 7, wherein the classifier comprises: support vector machine, decision tree, random forest, logistic regression, Bayes, K-nearest neighbor, K-means, Markov, and ridge regression algorithms.

9. A computer-readable medium storing a program for executing the method for predicting the sample quality in NGS sequencing according to claim 1.