CN115910349A

CN115910349A - Cancer early stage prediction method based on low-depth WGS sequencing end characteristics

Info

Publication number: CN115910349A
Application number: CN202310029968.6A
Authority: CN
Inventors: 万千惠; 张轶群; 李振聪; 张怡然; 裴志华; 王东亮; 牛孝亮
Original assignee: Beijing Qiuzhen Medical Laboratory Co ltd
Current assignee: Beijing Qiuzhen Medical Laboratory Co ltd
Priority date: 2023-01-09
Filing date: 2023-01-09
Publication date: 2023-04-04
Anticipated expiration: 2043-01-09
Also published as: CN115910349B

Abstract

The invention relates to the technical field of medical molecular biology, in particular to a low-depth WGS sequencing end characteristic-based early cancer prediction method.

Description

Cancer early stage prediction method based on low-depth WGS sequencing end characteristics

Technical Field

The invention relates to the technical field of medical molecular biology, in particular to a cancer early prediction method based on low-depth WGS sequencing end characteristics.

Background

At present, early cancer screening falls into two main categories. The first comprises traditional detection methods that rely on computed tomography, endoscopy, cytological smears and the like, including detection of lung cancer by low-dose computed tomography (LDCT), detection of intestinal cancer by enteroscopy, and detection of cervical cancer by cervical cell smear. Methods of this class generally have low specificity and sensitivity, and are also highly invasive.

The second category is liquid biopsy. Liquid biopsy is less invasive than the above methods and better suited to early cancer detection. At present, liquid biopsy mainly uses blood, urine, saliva or the like as the sample, and detects tumor-derived cells, DNA, mRNA, microRNA, proteins and the like in the sample to determine the state of a cancer patient. Among these, cell-free DNA (cfDNA) in peripheral blood plasma is the most widely used and the most promising for early cancer screening. The cfDNA of cancer patients differs greatly from that of healthy people, including significant differences in mutations, copy number, chromosomal recombination, fragment characteristics, and methylation. A growing number of recent studies explore early cancer screening based on cfDNA fragment characteristics, of which cfDNA end characteristics (motifs) form one branch. The end characteristic of cfDNA refers to the several bases at the 5' end of a cfDNA fragment. Studies have shown that the ends of cfDNA from healthy people show a stronger sequence preference than those from cancer patients; for example, the proportions of the end motifs CCCA, CCAG and CCTG are higher in the plasma of healthy people than in the plasma of cancer patients. How to use the end characteristics of cfDNA fragments to assist early cancer prediction is therefore of great significance.

Disclosure of Invention

In view of the above-described deficiencies of the background art, the present invention provides a method for early cancer prediction based on low depth WGS sequencing end features.

The technical scheme adopted by the invention is as follows: the method for early cancer prediction based on the low-depth WGS sequencing end characteristics is characterized in that: the method comprises the following steps:

S1, performing gene targeted sequencing on a sample to obtain an original fastq file;

S2, performing quality control on the original fastq file, and screening out low-quality data;

S3, aligning the quality-controlled fastq file to a reference genome to obtain a bam file, and filtering the bam file to remove duplicate sequences;

S4, counting the number and proportion of the cfDNA fragment end features and of the break-point end features, respectively;

and S5, calculating the Mscore value for distinguishing cancer patients.

Preferably, the quality control conditions in S2 are: the sequencing depth of the sample is not less than 5x; bases with an error rate below 0.1% (i.e., Q30 or better) account for more than 90% of all bases; reads aligned to the genome account for more than 95% of the reads used; and the sequencing result covers more than 90% of the genome sequence.
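The four quality control conditions above can be expressed as a simple predicate. The following is a minimal sketch only: the `SampleQC` container and its field names are hypothetical, and the metrics are assumed to have been computed beforehand by whatever QC tooling is in use.

```python
from dataclasses import dataclass


@dataclass
class SampleQC:
    """Hypothetical container for per-sample QC metrics (names are assumptions)."""
    depth: float              # mean sequencing depth, in x
    q30_fraction: float       # fraction of bases with error rate below 0.1%
    mapped_fraction: float    # reads aligned to the genome / reads used
    coverage_fraction: float  # fraction of the genome covered by the sequencing result


def passes_qc(qc: SampleQC) -> bool:
    """Apply the thresholds stated for step S2."""
    return (
        qc.depth >= 5.0                 # depth not less than 5x
        and qc.q30_fraction > 0.90      # more than 90% of bases
        and qc.mapped_fraction > 0.95   # more than 95% of reads
        and qc.coverage_fraction > 0.90  # more than 90% of the genome
    )
```

For example, a sample with 5x depth, 93% Q30 bases, 97% aligned reads and 92% coverage would pass, while the same sample at 4.9x would not.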

Preferably, the reference genomic sequence in S3 is hg19.

Preferably, the filtering conditions in S3 are: paired reads are selected according to their CIGAR values (the CIGAR value indicates which parts of a read match the reference genome, which are deletions relative to the reference genome, and which are insertions relative to the reference genome), with at most 3 bp of mismatches, at most 2 bp of indels, and a longest indel gap of at most 3 bp.

Preferably, the statistical method for the number and proportion of the cfDNA fragment end features in S4 is: taking terminal sequences of 4-6 bp from the 5' end of each read on the positive strand and the negative strand, and counting the number and proportion of each terminal sequence.
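The end-feature counting described above can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: reads are assumed to be given as `(sequence, is_reverse)` pairs with the sequence in reference forward orientation (as stored in a BAM record), so the 5' end of a negative-strand read is the reverse complement of the last k stored bases. Skipping motifs that contain `N` is also an assumption, and all function names are hypothetical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def five_prime_motif(seq, is_reverse, k):
    """Return the k-base motif at the 5' end of a read, or None if the read is too short."""
    if len(seq) < k:
        return None
    # Negative-strand reads are stored reverse-complemented, so their
    # 5' end sits at the right-hand end of the stored sequence.
    return reverse_complement(seq[-k:]) if is_reverse else seq[:k]


def end_motif_proportions(reads, k):
    """Count k-base 5' end motifs and convert the counts to proportions."""
    counts = Counter()
    for seq, is_reverse in reads:
        motif = five_prime_motif(seq.upper(), is_reverse, k)
        if motif is not None and "N" not in motif:
            counts[motif] += 1
    total = sum(counts.values())
    proportions = {m: c / total for m, c in counts.items()} if total else {}
    return counts, proportions
```

Calling `end_motif_proportions(reads, k)` once for each k in 4, 5 and 6 would give the three feature sets described in the text.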

Preferably, the statistical method for the number and proportion of the break-point end features in S4 is: taking terminal sequences of 2 bp and 3 bp from the 5' end of the negative strand of each read, taking the 2 bp and 3 bp reference-genome sequences that adjoin that 5' end, splicing each pair to obtain break-point feature sequences of 4 bp and 6 bp, and counting the number and proportion of each break-point feature sequence.
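A break-point feature joins k reference bases lying just outside the fragment end with the first k bases of the read. The sketch below is one possible reading, with several assumptions: the alignment is ungapped, `start` is the 0-based leftmost aligned position, the read sequence is stored in reference forward orientation, both strands are handled, and the reference bases are placed before the read bases. The source does not specify the splicing order, and all names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def breakpoint_motif(ref, start, seq, is_reverse, k):
    """Return the 2k-base break-point motif, or None if it runs off the reference.

    ref        : reference sequence of the chromosome (forward strand)
    start      : 0-based leftmost aligned position of the read
    seq        : read bases in reference forward orientation
    is_reverse : True if the read maps to the negative strand
    k          : bases taken on each side of the break point (2 or 3 in the text)
    """
    if len(seq) < k:
        return None
    end = start + len(seq)  # assumes an ungapped alignment
    if is_reverse:
        if end + k > len(ref):
            return None
        # In read orientation, "upstream" is to the right on the forward strand.
        outside = reverse_complement(ref[end:end + k])
        inside = reverse_complement(seq[-k:])
    else:
        if start - k < 0:
            return None
        outside = ref[start - k:start]
        inside = seq[:k]
    return outside + inside
```

With k = 2 and k = 3 this yields the 4 bp and 6 bp break-point feature sequences, which can then be counted in the same way as the end motifs.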

Preferably, S5 is specifically: screening motifs by the LASSO method based on the number and proportion of the cfDNA fragment end features and break-point end features of the samples obtained in step S4, and calculating the Mscore value of the i-th sample using Formula 1:

$$\mathrm{Mscore}_i = \sum_{j=1}^{m} t_{ij} W_j$$

Formula 1

wherein $t_{ij}$ is the proportion of the j-th motif of sample i after normalization by the range method; m is the number of screened motifs; and $W_j$ is the weight of motif j.

Preferably, $t_{ij}$ is calculated using Formula 2:

$$t_{ij} = \frac{P_{ij} - \min_{i'} P_{i'j}}{\max_{i'} P_{i'j} - \min_{i'} P_{i'j}}$$

Formula 2

wherein $P_{ij}$ is the proportion of the j-th motif of sample i; i denotes the i-th of all samples; j denotes the j-th of all motifs; and the subscript ij refers to the j-th motif of the i-th sample.

Preferably, the weight $w_j$ of each motif j is obtained from a random forest; when $\Delta P_j \ge 0$, $W_j = w_j$; when $\Delta P_j < 0$, $W_j = -w_j$;

$$\Delta P_j = \frac{1}{n_h}\sum_{i_h=1}^{n_h} P_{i_h j} - \frac{1}{n_t}\sum_{i_t=1}^{n_t} P_{i_t j}$$

Formula 3

wherein $\Delta P_j$ is the difference between the healthy group and the tumor group in the mean proportion of the j-th motif; $n_h$ is the number of healthy samples; $n_t$ is the number of tumor samples; $i_h$ denotes the i-th healthy sample; $i_t$ denotes the i-th tumor sample; $P_{i_h j}$ is the proportion of the j-th motif in healthy sample $i_h$; and $P_{i_t j}$ is the proportion of the j-th motif in tumor sample $i_t$.
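The Mscore calculation can be sketched in plain Python. This is an illustration under explicit assumptions, not a definitive implementation: the range method is taken to be min-max normalization of each motif across all samples, the difference in means is taken as healthy minus tumor, and the Mscore is taken as the plain weighted sum of normalized proportions. The random-forest weights are supplied as given numbers here, since how they are extracted from the forest is not specified; the LASSO screening step is likewise assumed to have been done already. All function names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def signed_weights(healthy, tumor, rf_weights):
    """Attach a sign to each random-forest weight w_j.

    healthy, tumor : lists of per-sample lists of motif proportions P_ij
    rf_weights     : w_j for each screened motif (assumed to be given)
    """
    signed = []
    for j, w in enumerate(rf_weights):
        # Assumed direction: healthy-group mean minus tumor-group mean.
        delta = mean([s[j] for s in healthy]) - mean([s[j] for s in tumor])
        signed.append(w if delta >= 0 else -w)
    return signed


def range_normalize(samples):
    """Min-max normalize each motif (column) across all samples."""
    m = len(samples[0])
    lows = [min(s[j] for s in samples) for j in range(m)]
    highs = [max(s[j] for s in samples) for j in range(m)]
    return [
        [
            (s[j] - lows[j]) / (highs[j] - lows[j]) if highs[j] > lows[j] else 0.0
            for j in range(m)
        ]
        for s in samples
    ]


def mscores(samples, weights):
    """Weighted sum of normalized motif proportions for every sample."""
    return [
        sum(t * w for t, w in zip(row, weights))
        for row in range_normalize(samples)
    ]
```

In this sketch a constant motif column is mapped to 0.0 to avoid division by zero; that guard is a design choice of the example and is not taken from the source.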

Preferably, the samples are liquid tissue samples and bulk samples from healthy people and tumor patients; the liquid tissue samples comprise any one of tissue homogenate, nasal swab, viral fluid, blood, serum, plasma, semen, saliva and urine; the bulk samples comprise any one of tissue block, transgenic mouse tail and toenail.

Beneficial effects: compared with the prior art, the method for early cancer prediction based on low-depth WGS sequencing end features provided by the invention collects samples from healthy people and tumor patients and extracts and counts both the cfDNA fragment end features and the break-point end features of the samples, so that end features of different lengths and at different positions are considered together; using the optimized Mscore algorithm, the healthy group and the cancer patient group are distinguished more readily, and high stability is maintained across different data volumes.

Drawings

FIG. 1 is a graph of stability based on 5x depth;

FIG. 2 is a graph of AUC of classification performance based on the present invention;

FIG. 3 is a graph of AUC for classification performance based on different cancer species;

FIG. 4 is a schematic of stability at different depths.

Detailed Description

In order to make the technical solutions of the present invention better understood, the present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.

Example 1 sample data extraction

Plasma samples from two populations, a healthy group (N = 32; three random samplings, N = 96) and a cancer group (N = 112), were randomly selected for sequencing. The specific process is as follows:

cfDNA extraction: cfDNA is extracted from the plasma samples with a plasma extraction kit according to the instructions of the QIAamp Circulating Nucleic Acid Kit (QIAGEN), and the extracted DNA is quantified with a Qubit 4.0 and the dsDNA HS Assay Kit.

Library construction: end repair and A-tailing at the 3' end: 10-50 ng of cfDNA is placed in a PCR tube and made up to 50 μL with low TE, and reagents are added according to Table 1 below.

TABLE 1

Vortex to mix, centrifuge briefly, and run the following program on a PCR instrument (Table 2):

TABLE 2

Adapter ligation: after the above reaction is complete, the corresponding reagents are added to the system according to Table 3 below:

TABLE 3

Vortex to mix, centrifuge briefly, and run the following program on a PCR instrument with the heated lid off (Table 4):

TABLE 4

Purification after ligation: Beckman Agencourt AMPure XP magnetic beads stored at 2~8 ℃ are equilibrated at room temperature for at least 30 min; 80 μL (1x volume) of AMPure XP magnetic beads is added to each sample and mixed well by pipetting or shaking; the sample is left to stand at room temperature for 5 minutes; the tube is placed on a magnetic rack for 2 minutes, and once the beads are fully adsorbed to the side wall the supernatant is removed with a pipette, taking care not to disturb the beads; with the tube still on the magnetic rack, 200 μL of 80% ethanol is slowly added along the tube wall opposite the beads, left to stand for 30 s to 1 min, and the supernatant is removed with a pipette; the ethanol wash is repeated once, and residual ethanol is removed as completely as possible with a 10 μL pipette; the beads are dried at room temperature for 5 minutes; the beads are resuspended in 21 μL of low TE buffer per sample, mixed thoroughly by pipetting or shaking, and incubated at room temperature for 1 minute; the tube is placed on the magnetic rack and incubated at room temperature for 2 minutes; after the beads are fully adsorbed to the side wall, 20 μL of supernatant is transferred to a new PCR tube for amplification.

Library amplification: after the above reaction is complete, the corresponding reagents are added to the system according to Table 5 below:

TABLE 5

Vortex to mix, centrifuge briefly, and run the following program on a PCR instrument (Table 6):

TABLE 6

After the reaction was completed, the PCR product was purified using 1X volume of magnetic beads according to the procedure of magnetic bead purification, and then the pre-library concentration was determined using dsDNA HS Assay Kit, and fragment size detection was performed using QIAxcel nucleic acid electrophoresis analysis system.

cfDNA whole-genome sequencing: the library samples are sequenced on an MGI2000 second-generation sequencer in paired-end mode, with a read length of 100 bp and a sequencing depth of 10x.
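As a rough guide to the data volume implied by these settings, the number of read pairs needed for a given depth follows from depth × genome size ÷ (2 × read length). The sketch below assumes a human genome size of about 3.1 Gb, which is an approximation introduced here and not a figure from the source; it also ignores duplicates and unaligned reads.

```python
GENOME_SIZE_BP = 3_100_000_000  # approximate human genome size (assumption)


def read_pairs_for_depth(depth, read_length_bp, genome_size_bp=GENOME_SIZE_BP):
    """Estimate the number of paired-end read pairs needed for a target depth."""
    bases_needed = depth * genome_size_bp
    bases_per_pair = 2 * read_length_bp  # two reads per pair
    return round(bases_needed / bases_per_pair)


pairs_10x = read_pairs_for_depth(10, 100)  # the 10x, 100 bp setting above
pairs_5x = read_pairs_for_depth(5, 100)    # the 5x quality control minimum
```

Under these assumptions, 10x paired-end 100 bp sequencing corresponds to roughly 155 million read pairs, and 5x to about half of that.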

Example 2 Mscore calculation to differentiate cancer patient groups from healthy groups

The BCL files acquired by the sequencing platform are split according to the sample indexes to obtain fastq-format data for each sample; the fastq data are aligned to the genome sequence (hg19) to obtain a bam file for each sample; and quality control is performed on the data of each sample under the following conditions: the sequencing depth of the sample is 5x; bases with an error rate below 0.1% account for more than 90% of all bases; reads aligned to the genome account for more than 95% of the reads used; and the sequencing result covers more than 90% of the genome sequence;

the sample data that pass quality control are filtered at the read level under the following conditions: paired reads are taken (FLAG values 83/163 and 99/147), with at most 3 bp of mismatches, a maximum indel number of 2 bp, and a longest indel gap of 3 bp;
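The read-level filter can be sketched over the standard SAM fields. The values 83/163 and 99/147 are the SAM FLAG combinations of properly paired reads, and the NM tag is the edit distance (mismatches plus inserted and deleted bases). The interpretation of the indel limits below is an assumption, since the wording is ambiguous: at most two indel events per read, each at most 3 bp long. The function and parameter names are hypothetical.

```python
import re

PROPER_PAIR_FLAGS = {83, 163, 99, 147}
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def passes_read_filter(flag, cigar, nm,
                       max_mismatch=3, max_indels=2, max_indel_len=3):
    """Return True if an aligned read satisfies the filtering conditions.

    flag  : SAM FLAG value of the read
    cigar : SAM CIGAR string, e.g. "50M2D50M"
    nm    : value of the NM tag (edit distance to the reference)
    """
    if flag not in PROPER_PAIR_FLAGS:
        return False
    # Lengths of all insertion (I) and deletion (D) operations.
    indel_lengths = [int(n) for n, op in _CIGAR_OP.findall(cigar) if op in "ID"]
    if len(indel_lengths) > max_indels:
        return False
    if any(length > max_indel_len for length in indel_lengths):
        return False
    # NM counts mismatches plus indel bases, so subtract the indel bases.
    mismatches = nm - sum(indel_lengths)
    return mismatches <= max_mismatch
```

In practice the three arguments would come from each alignment record of the bam file; reading the file itself is left out so the example stays self-contained.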

terminal sequences of 4-6 bp are taken from the 5' end of each read, and the number and proportion of each terminal sequence are counted; terminal sequences of 2 bp and 3 bp are taken from the 5' end of each read, the corresponding 2 bp and 3 bp sequences immediately upstream of that 5' end are taken from the reference genome, and the two are spliced to obtain break-point feature sequences of 4 bp and 6 bp, respectively, whose number and proportion are counted;

LASSO screens out m motifs, and the weight $w_j$ of each motif j is obtained from a random forest; when $\Delta P_j \ge 0$, $W_j = w_j$; when $\Delta P_j < 0$, $W_j = -w_j$. The difference $\Delta P_j$ between the healthy group and the tumor group in the mean proportion of the j-th motif is calculated according to Formula 3:

$$\Delta P_j = \frac{1}{n_h}\sum_{i_h=1}^{n_h} P_{i_h j} - \frac{1}{n_t}\sum_{i_t=1}^{n_t} P_{i_t j}$$

Formula 3

wherein $\Delta P_j$ is the difference between the healthy group and the tumor group in the mean proportion of the j-th motif; $n_h$ is the number of healthy samples; $n_t$ is the number of tumor samples; $i_h$ denotes the i-th healthy sample; $i_t$ denotes the i-th tumor sample; $P_{i_h j}$ is the proportion of the j-th motif in healthy sample $i_h$; and $P_{i_t j}$ is the proportion of the j-th motif in tumor sample $i_t$;

the motif proportions are normalized by the range method, and $t_{ij}$ is calculated using Formula 2:

$$t_{ij} = \frac{P_{ij} - \min_{i'} P_{i'j}}{\max_{i'} P_{i'j} - \min_{i'} P_{i'j}}$$

Formula 2

wherein $P_{ij}$ is the proportion of the j-th motif of sample i; i denotes the i-th of all samples; j denotes the j-th of all motifs; and the subscript ij refers to the j-th motif of the i-th sample;

finally, the Mscore value of the sample is calculated using Formula 1:

$$\mathrm{Mscore}_i = \sum_{j=1}^{m} t_{ij} W_j$$

Formula 1

In FIG. 1, the abscissa shows the healthy group and the tumor group (including lung cancer, intestinal cancer, stomach cancer, liver cancer and pancreatic cancer) and the ordinate shows the Mscore; FIG. 1 shows that the Mscore based on 5x depth data can distinguish samples of the healthy group from those of the combined cancer group. FIG. 2 shows the results of the Mscore-based ROC analysis: without distinguishing cancer species, the AUC is 0.9934, and at an Mscore threshold of 0.3646 the specificity is 1 and the sensitivity is 0.9643. FIG. 3 shows the results of the ROC analysis of the Mscore for different cancer species: the AUC is 0.9659 for lung cancer, 0.9926 for intestinal cancer, and 1 for stomach cancer, liver cancer and pancreatic cancer. At a threshold of 0.3646, the specificity for lung cancer is 1 and the sensitivity is 0.8182; the specificity for intestinal cancer is 1 and the sensitivity is 0.8571; the specificity for stomach cancer is 1 and the sensitivity is 0.9688; the specificity for liver cancer is 1 and the sensitivity is 1; and the specificity for pancreatic cancer is 1 and the sensitivity is 1.
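The reported figures (AUC, and sensitivity and specificity at a fixed threshold) can be computed from per-sample scores as sketched below. The scores in the usage note are made-up illustrative values, not data from the study. Because the text does not state on which side of the threshold the cancer samples fall, the direction is exposed as the hypothetical parameter `cancer_is_higher`. The AUC is computed by the rank-based (Mann-Whitney) definition.

```python
def sensitivity_specificity(healthy_scores, cancer_scores, threshold,
                            cancer_is_higher=True):
    """Return (sensitivity, specificity) at a fixed score threshold."""
    def is_positive(score):
        return score > threshold if cancer_is_higher else score < threshold

    true_pos = sum(1 for s in cancer_scores if is_positive(s))
    true_neg = sum(1 for s in healthy_scores if not is_positive(s))
    return true_pos / len(cancer_scores), true_neg / len(healthy_scores)


def auc(healthy_scores, cancer_scores, cancer_is_higher=True):
    """AUC as the fraction of (cancer, healthy) pairs ranked correctly; ties count half."""
    wins = 0.0
    for c in cancer_scores:
        for h in healthy_scores:
            if c == h:
                wins += 0.5
            elif (c > h) == cancer_is_higher:
                wins += 1.0
    return wins / (len(cancer_scores) * len(healthy_scores))
```

For instance, with illustrative healthy scores `[0.1, 0.2, 0.3]` and cancer scores `[0.5, 0.6, 0.25, 0.9]`, a threshold of 0.3646 gives a sensitivity of 0.75 and a specificity of 1.0. As a consistency check on the reported numbers, a sensitivity of 0.9643 on 112 cancer samples would correspond to 108 samples called correctly.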

Example 3 Performance verification

Data of the samples of the two groups at different depths (0.1x, 0.5x, 1x, 3x, 5x, and raw data) were selected as training data, the Mscore values were calculated, and stability was evaluated. The results are shown in FIG. 4: the Mscore can distinguish samples of the healthy group from those of the tumor group at 0.1x, 0.5x, 1x, 3x, 5x and with the raw data (RAW), where the dotted line marks an Mscore of 0.3646. This indicates that the algorithm is stable and that the scheme has a good classification effect with high sensitivity and specificity.
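Lower-depth data sets of this kind are commonly produced by randomly subsampling the raw reads. The text does not say how the reduced-depth data were generated, so the following is only a sketch of one way to do it, assuming the raw data correspond to the 10x depth stated in Example 1 and that each item in `reads` is one sampling unit (for example a read pair, so that mates stay together). The function names are hypothetical.

```python
import random

RAW_DEPTH = 10.0  # sequencing depth stated in Example 1


def downsample_fraction(target_depth, raw_depth=RAW_DEPTH):
    """Fraction of reads to keep in order to reach the target depth."""
    if not 0 < target_depth <= raw_depth:
        raise ValueError("target depth must be in (0, raw_depth]")
    return target_depth / raw_depth


def subsample(reads, fraction, seed=0):
    """Keep each sampling unit independently with the given probability."""
    rng = random.Random(seed)  # fixed seed makes the subsample reproducible
    return [r for r in reads if rng.random() < fraction]


# Fractions for the depths evaluated in this example.
fractions = {d: downsample_fraction(d) for d in (0.1, 0.5, 1, 3, 5)}
```

Recomputing the motif proportions and the Mscore on each subsample, while keeping the threshold fixed, would then reproduce the kind of stability comparison described for FIG. 4.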

Finally, it should be noted that the above-mentioned description is only a preferred embodiment of the present invention, and those skilled in the art can make various similar representations without departing from the spirit and scope of the present invention.

Claims

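1. A method for early cancer prediction based on low-depth WGS sequencing end features, characterized by comprising the following steps:

S1, performing gene targeted sequencing on a sample to obtain an original fastq file;

S2, performing quality control on the original fastq file, and screening out low-quality data;

S3, aligning the quality-controlled fastq file to a reference genome to obtain a bam file, and filtering the bam file to remove duplicate sequences;

S4, counting the number and proportion of the cfDNA fragment end features and of the break-point end features, respectively;

and S5, calculating the Mscore value for distinguishing cancer patients.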

2. The method for early prediction of cancer based on low depth WGS sequencing end features of claim 1, wherein the quality control conditions in S2 are: the sequencing depth of the sample is not less than 5x, the proportion of bases with base errors less than 0.1 percent to the total base number is more than 90 percent, the proportion of reads aligned to the genome to the used reads is more than 95 percent, and the coverage of the sequencing result to the genome sequence is more than 90 percent.

3. The method of low depth WGS sequencing end signature-based early prediction of cancer according to claim 1, wherein the reference genomic sequence in S3 is hg19.

4. The method of low depth WGS sequencing end feature-based early prediction of cancer according to claim 1, wherein the filtering conditions in S3 are: taking paired reads, the maximum mismatching is 3bp, the maximum indel number is 2bp, and the gap of the longest indel is 3bp.

5. The method for early prediction of cancer based on low depth WGS sequencing end features of claim 1, characterized in that the statistical method for the number and proportion of the cfDNA fragment end features in S4 is: taking terminal sequences of 4-6 bp from the 5' end of each read on the positive strand and the negative strand, and counting the number and proportion of each terminal sequence.

6. The method for early prediction of cancer based on low depth WGS sequencing end features of claim 1, characterized in that the statistical method for the number and proportion of the break-point end features in S4 is: taking terminal sequences of 2 bp and 3 bp from the 5' end of the negative strand of each read, taking the 2 bp and 3 bp reference-genome sequences that adjoin that 5' end, splicing each pair to obtain break-point feature sequences of 4 bp and 6 bp, and counting the number and proportion of each break-point feature sequence.

7. The method for early prediction of cancer based on low depth WGS sequencing end features of claim 1, wherein S5 is specifically: screening motifs by the LASSO method based on the number and proportion of the cfDNA fragment end features and break-point end features of the samples obtained in step S4, and calculating the Mscore value of the i-th sample using Formula 1:

$$\mathrm{Mscore}_i = \sum_{j=1}^{m} t_{ij} W_j$$

Formula 1

wherein $t_{ij}$ is the proportion of the j-th motif of sample i after normalization by the range method; m is the number of screened motifs; and $W_j$ is the weight of motif j.

8. The method for early prediction of cancer based on low depth WGS sequencing end features of claim 7, wherein $t_{ij}$ is calculated using Formula 2:

$$t_{ij} = \frac{P_{ij} - \min_{i'} P_{i'j}}{\max_{i'} P_{i'j} - \min_{i'} P_{i'j}}$$

Formula 2

wherein $P_{ij}$ is the proportion of the j-th motif of sample i; i denotes the i-th of all samples; j denotes the j-th of all motifs; and the subscript ij refers to the j-th motif of the i-th sample.

9. The method for early prediction of cancer based on low depth WGS sequencing end features of claim 7, wherein the weight $w_j$ of each motif j is obtained from a random forest; when $\Delta P_j \ge 0$, $W_j = w_j$; when $\Delta P_j < 0$, $W_j = -w_j$;

$$\Delta P_j = \frac{1}{n_h}\sum_{i_h=1}^{n_h} P_{i_h j} - \frac{1}{n_t}\sum_{i_t=1}^{n_t} P_{i_t j}$$

Formula 3

wherein $\Delta P_j$ is the difference between the healthy group and the tumor group in the mean proportion of the j-th motif; $n_h$ is the number of healthy samples; $n_t$ is the number of tumor samples; $i_h$ denotes the i-th healthy sample; $i_t$ denotes the i-th tumor sample; $P_{i_h j}$ is the proportion of the j-th motif in healthy sample $i_h$; and $P_{i_t j}$ is the proportion of the j-th motif in tumor sample $i_t$.

10. The method for early prediction of cancer based on low depth WGS sequencing end features of claim 1, wherein: the samples are liquid tissue samples and bulk samples from healthy people and tumor patients; the liquid tissue samples comprise any one of tissue homogenate, nasal swab, viral fluid, blood, serum, plasma, semen, saliva and urine; the bulk samples comprise any one of tissue block, transgenic mouse tail and toenail.