CN115910349A - Cancer early stage prediction method based on low-depth WGS sequencing end characteristics - Google Patents
- Publication number
- CN115910349A (application number CN202310029968.6A)
- Authority
- CN
- China
- Prior art keywords
- cancer
- sample
- samples
- sequencing
- motifs
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Landscapes
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
The invention relates to the technical field of medical molecular biology, in particular to a low-depth WGS sequencing end characteristic-based early cancer prediction method.
Description
Technical Field
The invention relates to the technical field of medical molecular biology, in particular to a cancer early prediction method based on low-depth WGS sequencing end characteristics.
Background
At present, early cancer screening methods fall into two main categories. The first comprises traditional detection methods that rely on computed tomography, endoscopy, cell smears and the like, for example detecting lung cancer by low-dose computed tomography (LDCT), intestinal cancer by enteroscopy, and cervical cancer by cervical cell smear. Methods of this class generally have low specificity and sensitivity and are highly invasive.
The second category is liquid biopsy. Liquid biopsy is less invasive than the methods above and is better suited to early cancer detection. At present, blood, urine, saliva and the like are mainly used as liquid biopsy samples, and tumor-derived cells, DNA, mRNA, microRNA, proteins and the like are detected in the sample to determine the state of a cancer patient. Among these, cell-free DNA (cfDNA) in peripheral blood plasma is the most widely used and the most promising analyte for early cancer screening. The cfDNA of cancer patients differs markedly from that of healthy people in mutations, copy number, chromosomal recombination, fragment characteristics and methylation. A growing number of recent studies explore early cancer screening based on cfDNA fragment characteristics, of which cfDNA end characteristics (end motifs) are one branch. The end characteristic of a cfDNA fragment refers to the several bases at its 5' end. Studies have shown that the ends of cfDNA from healthy people show a stronger sequence preference than those from cancer patients; for example, the proportions of the end motifs CCCA, CCAG and CCTG are higher in the plasma of healthy people than in the plasma of cancer patients. How to use the end characteristics of cfDNA fragments to assist early cancer prediction is therefore of considerable significance.
Disclosure of Invention
In view of the above-described deficiencies of the background art, the present invention provides a method for early cancer prediction based on low depth WGS sequencing end features.
The technical scheme adopted by the invention is as follows: a cancer early prediction method based on low-depth WGS sequencing end characteristics, comprising the following steps:
S1, performing gene targeted sequencing on a sample to obtain a raw fastq file;
S2, performing quality control on the raw fastq file and screening out low-quality data;
S3, aligning the quality-controlled fastq file to a reference genome to obtain a bam file, and filtering the bam file to remove repeated sequences;
S4, separately counting the number and proportion of the cfDNA fragment end characteristics and of the break-point end characteristics;
S5, calculating the Mscore value used to distinguish cancer patients.
Preferably, the quality control conditions in S2 are: the sequencing depth of the sample is not less than 5x; bases with an error rate below 0.1% account for more than 90% of all bases; reads aligned to the genome account for more than 95% of the reads used; and the sequencing result covers more than 90% of the genome sequence.
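The four thresholds above can be written as a simple pass/fail check. The following is a minimal sketch only: the `SampleQC` fields and the function name are hypothetical, and it assumes the four metrics have already been computed by upstream tools that the source does not name.

```python
from dataclasses import dataclass


@dataclass
class SampleQC:
    """Per-sample QC metrics (hypothetical container, not from the source)."""
    mean_depth: float        # mean sequencing depth, in x
    low_error_fraction: float  # fraction of bases with error rate below 0.1 %
    mapped_fraction: float   # reads aligned to the genome / reads used
    genome_coverage: float   # fraction of the genome sequence covered


def passes_qc(qc: SampleQC) -> bool:
    """Apply the S2 thresholds: depth >= 5x, >90 %, >95 %, >90 %."""
    return (
        qc.mean_depth >= 5.0
        and qc.low_error_fraction > 0.90
        and qc.mapped_fraction > 0.95
        and qc.genome_coverage > 0.90
    )
```

A sample that fails any one of the four conditions is rejected; the depth condition is inclusive ("not less than 5x"), the other three are strict.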
Preferably, the reference genomic sequence in S3 is hg19.
Preferably, the filtering conditions in S3 are: paired reads are selected based on their CIGAR values (the CIGAR value indicates which reads match the reference genome perfectly, which have deletions relative to the reference genome, and which have insertions relative to it), with at most 3 bp of mismatches, at most 2 indels, and a gap of at most 3 bp for the longest indel.
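As an illustration of these limits, the sketch below filters a single alignment from its CIGAR string and edit distance. It rests on assumptions the source does not state: that the mismatch count is derived from the SAM `NM` tag (mismatches plus inserted and deleted bases) minus the indel bases found in the CIGAR string, and that "maximum indel number" means the count of indel events. The function names are hypothetical.

```python
import re

# One CIGAR operation: a length followed by an operation code.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def indel_lengths(cigar):
    """Return the lengths of all insertions (I) and deletions (D) in a CIGAR string."""
    return [int(n) for n, op in _CIGAR_OP.findall(cigar) if op in "ID"]


def passes_read_filter(cigar, nm, max_mismatch=3, max_indels=2, max_gap=3):
    """Check one alignment against the mismatch / indel limits.

    cigar: CIGAR string of the alignment, e.g. "50M1I49M"
    nm:    value of the SAM NM tag (edit distance to the reference)
    """
    indels = indel_lengths(cigar)
    mismatches = nm - sum(indels)  # NM counts mismatches plus indel bases
    return (
        mismatches <= max_mismatch
        and len(indels) <= max_indels
        and max(indels, default=0) <= max_gap
    )
```

For a read pair, both mates would be checked and the pair kept only if both pass; that pairing rule is likewise an assumption.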
Preferably, the number and proportion of cfDNA fragment end characteristics in S4 are counted as follows: terminal sequence fragments 4-6 bp in length are taken from the 5' ends of the positive strand and the negative strand of each read, and the number and proportion of each terminal sequence are counted.
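A minimal sketch of this counting step follows. It assumes each read is given as its sequence in reference orientation (as stored in a bam file) together with a strand flag, so that the 5' end of a negative-strand read is the reverse complement of the last k stored bases. Skipping motifs that contain `N` is an added assumption, and the function names are hypothetical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def five_prime_motif(seq, is_reverse, k):
    """k-mer at the 5' end of the sequenced strand.

    seq is in reference orientation; for a negative-strand read the
    5' end therefore sits at the right-hand end of seq.
    """
    return revcomp(seq[-k:]) if is_reverse else seq[:k]


def motif_proportions(reads, k):
    """Count 5' end motifs of length k and return their proportions.

    reads: iterable of (seq, is_reverse) tuples.
    """
    counts = Counter()
    for seq, is_reverse in reads:
        if len(seq) < k:
            continue
        motif = five_prime_motif(seq, is_reverse, k)
        if "N" not in motif:  # assumption: ambiguous ends are skipped
            counts[motif] += 1
    total = sum(counts.values())
    return {m: c / total for m, c in counts.items()} if total else {}
```

Calling `motif_proportions(reads, k)` for k = 4, 5 and 6 would give the three motif tables described above.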
Preferably, the number and proportion of break-point end characteristics in S4 are counted as follows: terminal sequence fragments of 2 bp and 3 bp are taken from the 5' end of the negative strand of each read; the 2 bp and 3 bp reference genome sequences adjoining that 5' end are then taken and spliced with them to obtain break-point characteristic sequences of 4 bp and 6 bp, and the number and proportion of each break-point characteristic sequence are counted.
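The splicing step can be sketched as below. Several points are assumptions rather than statements from the source: the reference bases are placed before the read bases in the spliced motif, the read is an ungapped, unclipped alignment given in reference orientation with a 0-based start, and negative-strand reads are handled by reverse-complementing across the junction. All names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def breakpoint_motif(ref, seq, start, is_reverse, k):
    """Splice k reference bases outside the fragment with k read bases at its 5' end.

    ref:   reference sequence of one contig
    seq:   read sequence in reference orientation (ungapped, unclipped)
    start: 0-based leftmost reference position of the read
    Returns a 2k-mer, or None if the junction runs off the contig.
    """
    end = start + len(seq)
    if is_reverse:
        # The 5' end is at the right-hand side; read across the junction
        # on the forward strand, then flip to the read's own strand.
        if end + k > len(ref):
            return None
        return revcomp(seq[-k:] + ref[end:end + k])
    if start - k < 0:
        return None
    return ref[start - k:start] + seq[:k]
```

With k = 2 and k = 3 this yields the 4 bp and 6 bp break-point sequences, which can then be tallied in the same way as the end motifs.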
Preferably, S5 is specifically: motifs are screened using the LASSO method based on the numbers and proportions of the sample cfDNA fragment end characteristics and break-point end characteristics obtained in S4, and the Mscore value of the i-th sample is calculated using Formula 1,
wherein t_ij is the proportion of the j-th motif in sample i after normalization by the range method; m is the number of screened motifs; and W_j is the weight of motif j.
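The image of Formula 1 is not preserved in this text. One form consistent with the symbol definitions above, offered as an assumption rather than as the granted wording, is a weighted sum of the normalized motif proportions:

```latex
\mathrm{Mscore}_i = \sum_{j=1}^{m} W_j \, t_{ij}
```

Under this reading, each screened motif contributes its normalized proportion scaled by a signed weight, so motifs enriched in one group push the score up and motifs enriched in the other push it down.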
Preferably, t_ij is calculated using Formula 2,
wherein P_ij is the proportion of the j-th motif in sample i, i denotes the i-th of all samples, and j denotes the j-th of all motifs.
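The image of Formula 2 is likewise missing here. The standard range (min-max) normalization, which matches the "range method" named in the text, would read as follows, with the minimum and maximum taken over all samples for a fixed motif j; this is a reconstruction under that assumption:

```latex
t_{ij} = \frac{P_{ij} - \min_{i'} P_{i'j}}{\max_{i'} P_{i'j} - \min_{i'} P_{i'j}}
```

This maps each motif's proportions onto the interval [0, 1], so motifs with very different baseline frequencies become comparable before weighting.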
Preferably, the weight w_j of each motif j is obtained from a random forest; when ΔP_j ≥ 0, W_j = w_j; when ΔP_j < 0, W_j = -w_j;
wherein ΔP_j is the difference between the healthy group and the tumor group in the mean proportion of the j-th motif; n_h is the number of healthy samples, n_t is the number of tumor samples, i_h denotes the i-th healthy sample, i_t denotes the i-th tumor sample, P_ihj is the proportion of the j-th motif in healthy sample i_h, and P_itj is the proportion of the j-th motif in tumor patient sample i_t.
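The formula for ΔP_j (referred to as Formula 3 in Example 2) is not preserved in this text. A form consistent with the definitions above is given below; the order of subtraction (healthy mean minus tumor mean) is an assumption based on the order in which the two groups are named:

```latex
\Delta P_j = \frac{1}{n_h}\sum_{i_h=1}^{n_h} P_{i_h j} \;-\; \frac{1}{n_t}\sum_{i_t=1}^{n_t} P_{i_t j},
\qquad
W_j =
\begin{cases}
w_j, & \Delta P_j \ge 0 \\
-w_j, & \Delta P_j < 0
\end{cases}
```

The random forest supplies only the magnitude w_j; the sign of ΔP_j records which group the motif is enriched in.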
Preferably, the samples are tissue fluid samples and bulk samples from healthy people and tumor patients; the tissue fluid sample is any one of tissue grinding fluid, nasal swab, virus fluid, blood, serum, plasma, semen, saliva and urine; the bulk sample is any one of a tissue block, a transgenic mouse tail and a toenail.
Beneficial effects: compared with the prior art, the cancer early prediction method based on low-depth WGS sequencing end characteristics provided by the invention collects samples from healthy people and tumor patients and extracts and counts both the cfDNA fragment end characteristics and the break-point end characteristics of the samples, thereby taking into account end characteristics of different lengths and at different positions. Using an optimized Mscore algorithm, it distinguishes the healthy group from the cancer patient group more readily and remains highly stable across different data volumes.
Drawings
FIG. 1 is a graph of stability based on 5x depth;
FIG. 2 is a graph of AUC of classification performance based on the present invention;
FIG. 3 is a graph of AUC for classification performance based on different cancer species;
FIG. 4 is a schematic of stability at different depths.
Detailed Description
In order to make the technical solutions of the present invention better understood, the present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
Example 1 sample data extraction
Plasma samples from two randomly selected populations, a healthy group (N = 32; three random samplings, N = 96) and a cancer group (N = 112), were sequenced. The specific process is as follows:
cfDNA extraction: cfDNA was extracted from the plasma samples with a plasma extraction kit, following the instructions of the QIAamp Circulating Nucleic Acid Kit (QIAGEN), and the extracted DNA was quantified using a Qubit 4.0 and the dsDNA HS Assay Kit.
Library construction: end repair and A-tailing at the 3' end. 10-50 ng of cfDNA was placed in a PCR tube and made up to 50 μL with low TE, and reagents were added according to Table 1 below.
TABLE 1
Mix by vortexing, centrifuge briefly, and run the following program on a PCR instrument (Table 2):
TABLE 2
Adapter ligation: after the above reaction, the corresponding reagents were added to the system according to Table 3 below:
TABLE 3
Mix by vortexing, centrifuge briefly, and run the following program on a PCR instrument with the heated lid off (Table 4):
TABLE 4
Post-ligation purification: Beckman Agencourt AMPure XP magnetic beads, stored at 2-8 ℃, were equilibrated at room temperature for at least 30 min. To each sample, 80 μL (1x volume) of AMPure XP magnetic beads was added and mixed thoroughly by pipetting or shaking, then left to stand for 5 minutes at room temperature. The tube was placed on a magnetic rack for 2 minutes; once the beads were fully adsorbed to the side wall, the supernatant was removed with a pipette, taking care not to disturb the beads. With the tube still on the magnetic rack, 200 μL of 80% ethanol was added slowly along the tube wall opposite the beads, left to stand for 30 s to 1 min, and the supernatant was removed with a pipette. The ethanol wash was repeated once, and residual ethanol was removed as completely as possible with a 10 μL pipette. The beads were dried for 5 minutes at room temperature and resuspended in 21 μL of low TE buffer per sample, mixed thoroughly by pipetting or shaking, and incubated for 1 minute at room temperature. The tube was placed on the magnetic rack and incubated for 2 minutes at room temperature; once the beads were fully adsorbed to the side wall, 20 μL of supernatant was transferred to a new PCR tube for amplification.
Library amplification: after the above reaction, the corresponding reagents were added to the system according to Table 5 below:
TABLE 5
Mix by vortexing, centrifuge briefly, and run the following program on a PCR instrument (Table 6):
TABLE 6
After the reaction, the PCR product was purified with 1x volume of magnetic beads following the bead purification procedure above; the pre-library concentration was then determined using the dsDNA HS Assay Kit, and fragment size was checked on a QIAxcel nucleic acid electrophoresis analysis system.
cfDNA whole genome sequencing: the library samples were sequenced on an MGI2000 second-generation sequencer in paired-end mode, with a read length of 100 bp and a sequencing depth of 10x.
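For orientation, sequencing depth relates to the number of read pairs as depth = 2 × pairs × read length / genome size. The sketch below applies that relation to the paired-end 100 bp setting; the genome size of 3.1 × 10^9 bp is an approximate figure assumed for the human reference, and the function names are hypothetical.

```python
import math

GENOME_SIZE_BP = 3.1e9  # assumption: approximate size of the human reference


def expected_depth(n_read_pairs, read_length=100, genome_size=GENOME_SIZE_BP):
    """Mean depth obtained from n_read_pairs paired-end reads."""
    return 2 * n_read_pairs * read_length / genome_size


def read_pairs_for_depth(depth, read_length=100, genome_size=GENOME_SIZE_BP):
    """Number of read pairs needed to reach a target mean depth."""
    return math.ceil(depth * genome_size / (2 * read_length))
```

Under these assumptions, 10x depth corresponds to roughly 155 million read pairs and 0.1x to roughly 1.55 million, before any duplicate removal or read filtering.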
Example 2 Mscore calculation to differentiate cancer patient groups from healthy groups
The BCL files produced by the sequencing platform were demultiplexed by sample index to obtain fastq data for each sample; the fastq data were aligned to the genome sequence (hg19) to obtain a bam file for each sample, and quality control was performed on each sample's data under the following conditions: the sequencing depth of the sample is 5x; bases with an error rate below 0.1% account for more than 90% of all bases; reads aligned to the genome account for more than 95% of the reads used; and the sequencing result covers more than 90% of the genome sequence;
The sample data that passed quality control were filtered at the read level under the following conditions: paired reads were taken (CIGAR values 83/163 and 99/147), with at most 3 bp of mismatches, at most 2 indels, and a gap of at most 3 bp for the longest indel;
Terminal sequence fragments 4-6 bp in length were taken from the 5' end of each read, and the number and proportion of each terminal sequence were counted; terminal sequence fragments of 2 bp and 3 bp were taken from the 5' end of each read, the corresponding 2 bp and 3 bp sequences upstream of the 5' end were taken from the reference genome, and the two were spliced to obtain break-point characteristic sequences of 4 bp and 6 bp respectively, whose number and proportion were counted;
LASSO screened out m motifs, and the weight w_j of each motif j was obtained from a random forest; when ΔP_j ≥ 0, W_j = w_j; when ΔP_j < 0, W_j = -w_j. The difference ΔP_j between the healthy group and the tumor group in the mean proportion of the j-th motif was calculated according to Formula 3,
wherein ΔP_j is the difference between the healthy group and the tumor group in the mean proportion of the j-th motif; n_h is the number of healthy samples, n_t is the number of tumor samples, i_h denotes the i-th healthy sample, i_t denotes the i-th tumor sample, P_ihj is the proportion of the j-th motif in healthy sample i_h, and P_itj is the proportion of the j-th motif in tumor patient sample i_t;
The motif proportions were normalized by the range method, t_ij being calculated using Formula 2,
wherein P_ij is the proportion of the j-th motif in sample i, i denotes the i-th of all samples, and j denotes the j-th of all motifs;
Finally, the Mscore value of each sample was calculated using Formula 1.
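The scoring steps can be sketched in plain Python as follows. This is illustrative only and rests on stated assumptions: the formula images are not preserved in this text, so the sketch takes Mscore as the weighted sum of range-normalized proportions, takes ΔP_j as the healthy mean minus the tumor mean, and treats the random-forest weights `w` as already computed. LASSO screening is not shown, and all function names are hypothetical.

```python
def range_normalize(P):
    """Min-max scale each motif (column) of P to [0, 1] across samples.

    P: list of samples, each a list of m motif proportions.
    A motif with identical values in every sample is mapped to 0.0.
    """
    cols = list(zip(*P))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [(p - l) / (h - l) if h > l else 0.0 for p, l, h in zip(row, lo, hi)]
        for row in P
    ]


def signed_weights(P_healthy, P_tumor, w):
    """W_j = w_j if mean(healthy) - mean(tumor) >= 0 for motif j, else -w_j."""
    W = []
    for j, wj in enumerate(w):
        mean_h = sum(row[j] for row in P_healthy) / len(P_healthy)
        mean_t = sum(row[j] for row in P_tumor) / len(P_tumor)
        W.append(wj if mean_h - mean_t >= 0 else -wj)
    return W


def mscore(P, W):
    """Assumed Formula 1: weighted sum of normalized proportions per sample."""
    return [sum(Wj * tij for Wj, tij in zip(W, row)) for row in range_normalize(P)]
```

With two toy motifs, healthy proportions `[[0.30, 0.10], [0.28, 0.12]]`, tumor proportions `[[0.20, 0.20], [0.22, 0.18]]` and weights `[0.6, 0.4]`, the signed weights come out as `[0.6, -0.4]` and the healthy samples score higher than the tumor samples under these assumptions.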
In FIG. 1, the abscissa shows the healthy group and the tumor group (including lung cancer, intestinal cancer, gastric cancer, liver cancer and pancreatic cancer) and the ordinate shows the Mscore; FIG. 1 shows that the Mscore based on 5x depth data distinguishes the healthy samples from the combined cancer group. FIG. 2 shows the Mscore-based ROC analysis: without distinguishing cancer types, the AUC is 0.9934, and at an Mscore threshold of 0.3646 the specificity is 1 and the sensitivity is 0.9643. FIG. 3 shows the ROC analysis of the Mscore for the different cancer types: the AUC is 0.9659 for lung cancer, 0.9926 for intestinal cancer, and 1 for gastric cancer, liver cancer and pancreatic cancer. At a threshold of 0.3646, the specificity and sensitivity are 1 and 0.8182 for lung cancer, 1 and 0.8571 for intestinal cancer, 1 and 0.9688 for gastric cancer, 1 and 1 for liver cancer, and 1 and 1 for pancreatic cancer.
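Sensitivity and specificity at a fixed Mscore threshold can be computed as in the sketch below. The source does not state on which side of the threshold a sample is called cancer, so the direction is exposed as a parameter; the default (`cancer_below=True`) is an assumption, and the function name is hypothetical.

```python
def sensitivity_specificity(healthy_scores, cancer_scores, threshold, cancer_below=True):
    """Return (sensitivity, specificity) for a single score threshold.

    cancer_below=True  calls a sample cancer when score <  threshold
    cancer_below=False calls a sample cancer when score >= threshold
    """
    if cancer_below:
        is_cancer = lambda s: s < threshold
    else:
        is_cancer = lambda s: s >= threshold
    true_pos = sum(1 for s in cancer_scores if is_cancer(s))
    true_neg = sum(1 for s in healthy_scores if not is_cancer(s))
    return true_pos / len(cancer_scores), true_neg / len(healthy_scores)
```

For example, `sensitivity_specificity(healthy, cancer, 0.3646)` applies the threshold quoted above to two lists of Mscore values.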
Example 3 Performance verification
Data from the two groups at different depths (0.1x, 0.5x, 1x, 3x, 5x and raw data) were selected as training data, Mscore values were calculated, and stability was evaluated. The results are shown in FIG. 4: the Mscore distinguishes the healthy group from the tumor group at 0.1x, 0.5x, 1x, 3x, 5x and with the raw data (RAW); the dotted line marks an Mscore of 0.3646. This shows that the algorithm is stable and that the scheme classifies well, with high sensitivity and specificity.
Finally, it should be noted that the above description is only a preferred embodiment of the present invention, and those skilled in the art can make various similar variations without departing from the spirit and scope of the present invention.
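The source does not say how the lower-depth data sets were produced. One common approach, assumed here purely for illustration, is to subsample read pairs at random from the raw data; the sketch below computes the fraction to keep and applies it with a fixed seed. The function names are hypothetical.

```python
import random


def downsample_fraction(target_depth, raw_depth):
    """Fraction of read pairs to keep to go from raw_depth to target_depth."""
    if not 0 < target_depth <= raw_depth:
        raise ValueError("target depth must be positive and not exceed raw depth")
    return target_depth / raw_depth


def downsample(read_pairs, fraction, seed=0):
    """Keep each read pair independently with the given probability."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return [rp for rp in read_pairs if rng.random() < fraction]
```

Starting from 10x raw data, the fractions for 0.1x, 0.5x, 1x, 3x and 5x would be 0.01, 0.05, 0.1, 0.3 and 0.5 respectively.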
Claims (10)
1. A cancer early prediction method based on low-depth WGS sequencing end characteristics, characterized by comprising the following steps:
S1, performing gene targeted sequencing on a sample to obtain a raw fastq file;
S2, performing quality control on the raw fastq file and screening out low-quality data;
S3, aligning the quality-controlled fastq file to a reference genome to obtain a bam file, and filtering the bam file to remove repeated sequences;
S4, separately counting the number and proportion of the cfDNA fragment end characteristics and of the break-point end characteristics;
S5, calculating the Mscore value used to distinguish cancer patients.
2. The cancer early prediction method based on low-depth WGS sequencing end characteristics according to claim 1, characterized in that the quality control conditions in S2 are: the sequencing depth of the sample is not less than 5x; bases with an error rate below 0.1% account for more than 90% of all bases; reads aligned to the genome account for more than 95% of the reads used; and the sequencing result covers more than 90% of the genome sequence.
3. The cancer early prediction method based on low-depth WGS sequencing end characteristics according to claim 1, characterized in that the reference genome sequence in S3 is hg19.
4. The cancer early prediction method based on low-depth WGS sequencing end characteristics according to claim 1, characterized in that the filtering conditions in S3 are: paired reads are taken, with at most 3 bp of mismatches, at most 2 indels, and a gap of at most 3 bp for the longest indel.
5. The cancer early prediction method based on low-depth WGS sequencing end characteristics according to claim 1, characterized in that the number and proportion of cfDNA fragment end characteristics in S4 are counted as follows: terminal sequence fragments 4-6 bp in length are taken from the 5' ends of the positive strand and the negative strand of each read, and the number and proportion of each terminal sequence are counted.
6. The cancer early prediction method based on low-depth WGS sequencing end characteristics according to claim 1, characterized in that the number and proportion of break-point end characteristics in S4 are counted as follows: terminal sequence fragments of 2 bp and 3 bp are taken from the 5' end of the negative strand of each read; the 2 bp and 3 bp reference genome sequences adjoining that 5' end are then taken and spliced with them to obtain break-point characteristic sequences of 4 bp and 6 bp, and the number and proportion of each break-point characteristic sequence are counted.
7. The cancer early prediction method based on low-depth WGS sequencing end characteristics according to claim 1, characterized in that S5 is specifically: motifs are screened using the LASSO method based on the numbers and proportions of the sample cfDNA fragment end characteristics and break-point end characteristics obtained in S4, and the Mscore value of the i-th sample is calculated using Formula 1,
wherein t_ij is the proportion of the j-th motif in sample i after normalization by the range method; m is the number of screened motifs; and W_j is the weight of motif j.
8. The cancer early prediction method based on low-depth WGS sequencing end characteristics according to claim 7, characterized in that t_ij is calculated using Formula 2,
wherein P_ij is the proportion of the j-th motif in sample i, i denotes the i-th of all samples, and j denotes the j-th of all motifs.
9. The cancer early prediction method based on low-depth WGS sequencing end characteristics according to claim 7, characterized in that the weight w_j of each motif j is obtained from a random forest; when ΔP_j ≥ 0, W_j = w_j; when ΔP_j < 0, W_j = -w_j;
wherein ΔP_j is the difference between the healthy group and the tumor group in the mean proportion of the j-th motif; n_h is the number of healthy samples, n_t is the number of tumor samples, i_h denotes the i-th healthy sample, i_t denotes the i-th tumor sample, P_ihj is the proportion of the j-th motif in healthy sample i_h, and P_itj is the proportion of the j-th motif in tumor patient sample i_t.
10. The cancer early prediction method based on low-depth WGS sequencing end characteristics according to claim 1, characterized in that the samples are tissue fluid samples and bulk samples from healthy people and tumor patients; the tissue fluid sample is any one of tissue grinding fluid, nasal swab, virus fluid, blood, serum, plasma, semen, saliva and urine; the bulk sample is any one of a tissue block, a transgenic mouse tail and a toenail.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310029968.6A CN115910349B (en) | 2023-01-09 | 2023-01-09 | Early cancer prediction method based on low-depth WGS sequencing tail end characteristics |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115910349A true CN115910349A (en) | 2023-04-04 |
CN115910349B CN115910349B (en) | 2023-05-30 |
Family
ID=85753626
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310029968.6A Active CN115910349B (en) | 2023-01-09 | 2023-01-09 | Early cancer prediction method based on low-depth WGS sequencing tail end characteristics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115910349B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2016095093A1 (en) * | 2014-12-15 | 2016-06-23 | 天津华大基因科技有限公司 | Method for screening tumor, method and device for detecting variation of target region |
US20190228131A1 (en) * | 2015-08-06 | 2019-07-25 | Eone Diagnomics Genome Center Co., Ltd. | Novel method capable of differentiating fetal sex and fetal sex chromosome abnormality on various platforms |
CN112086129A (en) * | 2020-09-23 | 2020-12-15 | 深圳吉因加医学检验实验室 | Method and system for predicting cfDNA of tumor tissue |
CN113981090A (en) * | 2021-11-18 | 2022-01-28 | 杭州求臻医学检验实验室有限公司 | Breast cancer screening marker composition, selection method thereof and breast cancer screening kit |
CN114045345A (en) * | 2022-01-07 | 2022-02-15 | 臻和(北京)生物科技有限公司 | Free DNA-based genome canceration information detection system and detection method |
Non-Patent Citations (1)
Title |
---|
XINYIN HAN et al.: "MSIsensor-ct: microsatellite instability detection using cfDNA sequencing data" * |
Also Published As
Publication number | Publication date |
---|---|
CN115910349B (en) | 2023-05-30 |
Similar Documents
Publication | Title |
---|---|
CN107475375B (en) | DNA probe library, detection method and kit for hybridization to microsatellite loci associated with microsatellite instability |
CN107771221B (en) | Mutation detection for cancer screening and fetal analysis |
CN106755501B (en) | Method for simultaneously detecting microsatellite locus stability and genome change based on next-generation sequencing |
CN106834275A (en) | Construction method and kit for a ctDNA ultra-low-frequency mutation detection library, and analysis method for library detection data |
CN108229103B (en) | Method and device for processing circulating tumor DNA repetitive sequence |
CN114045345B (en) | Free DNA-based genome canceration information detection system and detection method |
CN108595918B (en) | Method and device for processing circulating tumor DNA repetitive sequence |
CN112218957A (en) | Systems and methods for determining tumor fraction in cell-free nucleic acids |
TW201920683A (en) | Enhancement of cancer screening using cell-free viral nucleic acids |
CN112980961B (en) | Method and device for jointly detecting SNV (single nucleotide variant), CNV (copy number variation) and fusion mutations |
WO2020224159A1 (en) | Next generation sequencing-based panel for detecting glioma, detection kit, detection method, and application thereof |
CN108319817B (en) | Method and device for processing circulating tumor DNA repetitive sequence |
CN112259165B (en) | Method and system for detecting microsatellite instability state |
CN108374047A (en) | Kit for detecting bladder cancer based on high-throughput sequencing technology |
CN115831234A (en) | Early cancer screening and diagnosis method based on chromosome instability |
WO2023226939A1 (en) | Methylation biomarker for detecting colorectal cancer lymph node metastasis and use thereof |
CN115910349B (en) | Early cancer prediction method based on low-depth WGS sequencing tail end characteristics |
CN110408706A (en) | Biomarker for assessing nasopharyngeal carcinoma recurrence and application thereof |
CN115011695A (en) | Multiple cancer type identification marker based on free circular DNA gene, kit and application |
CN113948150B (en) | JMML related gene methylation level evaluation method, model and construction method |
CN115831355A (en) | WGS-based early tumor screening method for multiple cancer types |
CN110317877A (en) | Use of a group of chromosomal instability variations in preparing a reagent or kit for diagnosing bladder transitional cell carcinoma and assessing its prognosis |
CN110964821A (en) | Detection panel for predicting liver cancer metastasis mode and risk and application thereof |
WO2024099301A1 (en) | Detection and analysis of signals of positive and negative strands of cell-free DNA molecule |
WO2022262831A1 (en) | Substance and method for tumor assessment |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
GR01 | Patent grant |