CN113160889B - Cancer noninvasive early screening method based on cfDNA omics characteristics - Google Patents

Info

Publication number
CN113160889B
Authority
CN
China
Prior art keywords
cfdna
coverage
fragment
omics
extracting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110118814.5A
Other languages
Chinese (zh)
Other versions
CN113160889A (en
Inventor
蓝勋
季加孚
步召德
李�杰
陈佳辉
孙克用
孙欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Renke Beijing Biotechnology Co ltd
Original Assignee
Renke Beijing Biotechnology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Renke Beijing Biotechnology Co ltd filed Critical Renke Beijing Biotechnology Co ltd
Priority to CN202110118814.5A priority Critical patent/CN113160889B/en
Publication of CN113160889A publication Critical patent/CN113160889A/en
Application granted granted Critical
Publication of CN113160889B publication Critical patent/CN113160889B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6806 Preparing nucleic acids for analysis, e.g. for polymerase chain reaction [PCR] assay
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869 Methods for sequencing
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • C12Q1/6886 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
    • C CHEMISTRY; METALLURGY
    • C40 COMBINATORIAL TECHNOLOGY
    • C40B COMBINATORIAL CHEMISTRY; LIBRARIES, e.g. CHEMICAL LIBRARIES
    • C40B50/00 Methods of creating libraries, e.g. combinatorial synthesis
    • C40B50/06 Biochemical methods, e.g. using enzymes or whole viable microorganisms
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30 Data warehousing; Computing architectures

Abstract

The invention relates to a noninvasive early cancer screening system based on cfDNA omics characteristics. The system comprises a cfDNA omics characteristic model and a machine learning training model, and the screening method comprises: establishing the cfDNA omics characteristic model; collecting blood and extracting cfDNA; performing library construction and sequencing on the extracted cfDNA; and extracting cfDNA omics characteristics for comparison. By combining cfDNA length distribution characteristics, copy number variation density distribution characteristics and openness characteristics around promoters, all obtained by low-depth whole-genome sequencing of cfDNA, the method comprehensively characterizes cfDNA in gastric cancer patients, so that early gastric cancer patients can be accurately identified.

Description

Cancer noninvasive early screening method based on cfDNA omics characteristics
Technical Field
The invention relates to the field of cancer early screening, in particular to a cancer noninvasive early screening method based on cfDNA omics characteristics.
Background
Liquid biopsy refers to the clinical use of cancer-derived components in blood for early screening, molecular typing, prognosis, medication guidance and recurrence monitoring of cancer. As a new precision medicine technology, liquid biopsy can qualitatively and quantitatively detect tumor cells and tumor-related DNA. It is noninvasive, convenient to sample and suited to real-time monitoring, and it plays an increasingly important role in tumor diagnosis and treatment.
Currently, the conventional approach to early cancer screening by liquid biopsy is to identify tumor-released cfDNA by detecting mutations in oncogenes or tumor suppressor genes. See Razavi, P., Li, B.T., Brown, D.N. et al. High-intensity sequencing reveals the sources of plasma circulating cell-free DNA variants. Nat Med 25, 1928-. This can lead to a significant false positive rate when cancer patients are identified through mutations in specific target genes. Moreover, because cancer patients are highly variable, some patients may be missed once a specific set of target genes has been defined. In addition, high-depth targeted sequencing is extremely expensive and cannot be used widely in the clinic. Finally, the above study addressed only patients with advanced metastatic cancer.
See Chen Xiaoji, Chang Ching-Wei, Spoerke Jill M, et al. Low-pass Whole-genome Sequencing of Circulating Cell-free DNA Demonstrates Dynamic Changes in Genomic Copy Number in a Squamous Lung Cancer Clinical Cohort. 2019, 25(7): 2254-. This approach is useful only for cancer types that carry a large number of copy number variations. Low-depth whole-genome sequencing is also prone to missing copy number variation regions of some cancer genes. Moreover, this study, too, addressed only patients with advanced metastatic cancer.
See Matthew W. Snyder, Martin Kircher, Andrew J. Hill, et al. Cell-free DNA Comprises an In Vivo Nucleosome Footprint that Informs Its Tissues-Of-Origin. Cell 2016, 164(1-2): 57-68. By isolating cfDNA from circulating plasma, Snyder et al. obtained a genome-wide map of nucleosome occupancy and found that the distribution pattern of cfDNA is closely related to its tissue of origin. The nucleosome distribution inferred from cfDNA can therefore indicate the specific source of the cfDNA, which could be used for noninvasive detection in clinical settings. However, that work remains at the theoretical level and does not address specific applications: it lacks a comprehensive evaluation of the patient's genome and an evaluation of multiple cfDNA features.
Therefore, those skilled in the art need a noninvasive early cancer screening method that can identify early gastric cancer patients using low-depth whole-genome sequencing alone, greatly reducing the cost of early cancer screening while improving screening accuracy.
Disclosure of Invention
In view of the above, the present application aims to provide a noninvasive early cancer screening method based on cfDNA omics characteristics. The method comprehensively characterizes cfDNA in gastric cancer patients by combining cfDNA length distribution characteristics, copy number variation density distribution characteristics and openness characteristics around promoters, all obtained by low-depth whole-genome sequencing of cfDNA, thereby accurately identifying early gastric cancer patients.
In order to achieve the above object, the present application provides the following technical solutions.
A noninvasive early cancer screening system based on cfDNA omics characteristics comprises a cfDNA omics characteristic model and a machine learning training model, and is characterized in that the noninvasive cancer screening method comprises the following steps:
S101, establishing a cfDNA omics characteristic model;
S102, collecting blood;
S103, extracting cfDNA;
S104, performing library construction and sequencing on the extracted cfDNA;
and S105, extracting the cfDNA omics characteristics for comparison.
Preferably, establishing the cfDNA omics characteristic model in step S101 comprises the following steps:
S201, collecting blood;
S202, extracting cfDNA;
S203, performing library construction and sequencing on the extracted cfDNA;
S204, extracting cfDNA omics characteristics;
and S205, training the model by machine learning.
Preferably, the blood collection in step S102 and step S201 is performed by whole blood extraction using a blood collection tube. The blood collection tube contains a preservative which can stabilize nucleated blood cells, prevent the release of cell genome DNA, inhibit cfDNA nuclease-mediated degradation and contribute to the overall stability of cfDNA.
Preferably, extracting cfDNA in step S103 and step S202 comprises the following steps:
S301, placing the blood collection tube in a centrifuge and centrifuging until the plasma is separated;
S302, adding proteinase K and ACL buffer to a centrifuge tube containing the plasma, mixing thoroughly and incubating;
S303, suction-filtering the incubated collection tube with a vacuum pump and washing off impurities;
S304, centrifuging in a centrifuge;
S306, placing the collection tube in a metal bath to evaporate the ethanol, adding AVE, and incubating;
S307, centrifuging the collection tube in a centrifuge, measuring the DNA concentration of the filtrate, and checking the fragment distribution of the cfDNA.
Preferably, the cfDNA omics features mainly comprise one or more of fragment pattern, CNV diversity and TSS coverage.
Preferably, the cfDNA omics feature extraction method in steps S101, S105 and S204 comprises:
S401, aligning the cfDNA sequence file to a reference genome to obtain a BAM file;
S402, removing low-quality sequences and duplicate sequences from the BAM file;
S403, excluding regions of low coverage in the reference genome and the Duke blacklist regions;
S404, dividing the chromosomes into adjacent, non-overlapping bins;
S405, counting the numbers of long and short cfDNA fragments;
S406, correcting the counts for GC content;
S407, quantifying the fragment pattern as a ratio; and performing median normalization against a gold standard and computing the density distribution of copy number variation;
S408, obtaining the coordinates of the transcription start sites (TSS) of the reference genome and mapping them onto the BAM file to obtain the sequence coverage near each site;
and S409, obtaining the TSS coverage by coverage calculation.
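As an illustration of the GC-content correction in step S406, the following is a minimal sketch only. The method is not specified at this point in the description (Example 3 names LOESS regression), so the code substitutes a simpler stratified median scaling: bins are grouped by GC fraction and each group is rescaled to the overall median. The function name `gc_correct` and the parameter `n_strata` are hypothetical.

```python
from collections import defaultdict
from statistics import median


def gc_correct(counts, gc_fractions, n_strata=20):
    """Rescale per-bin counts so that every GC stratum shares the overall median.

    This is a simplified stand-in for LOESS-based GC correction, not the
    patent's exact procedure. `gc_fractions` holds the GC fraction (0..1)
    of each bin, in the same order as `counts`.
    """
    if len(counts) != len(gc_fractions):
        raise ValueError("counts and gc_fractions must have equal length")
    strata = defaultdict(list)
    for i, gc in enumerate(gc_fractions):
        key = min(int(gc * n_strata), n_strata - 1)  # stratum index for this bin
        strata[key].append(i)
    overall = median(counts)
    corrected = [0.0] * len(counts)
    for indices in strata.values():
        stratum_median = median(counts[i] for i in indices)
        # Leave a stratum untouched if its median is zero (nothing to scale by).
        scale = overall / stratum_median if stratum_median > 0 else 1.0
        for i in indices:
            corrected[i] = counts[i] * scale
    return corrected
```

After correction, bins with similar fragment counts relative to their GC peers end up with similar values regardless of GC content, which is the property the later ratio and normalization steps rely on.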
Preferably, the method for establishing the machine learning training model in step S205 comprises:
S501, dividing the samples into a training set and a test set;
S502, processing the sample data in the training set;
S503, extracting the omics characteristics of cfDNA in the training set and verifying them in the test set;
and S504, evaluating the performance of the model.
Preferably, the test set in step S501 comprises n gastric cancer samples and m healthy samples, and the training set comprises n + 1 gastric cancer samples and m healthy samples, wherein n and m are positive integers.
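The split described above (n and m samples for testing, n + 1 and m for training) can be sketched as follows. This is a minimal illustration, assuming an odd number of gastric cancer samples and an even number of healthy samples, as with the 81 and 38 samples of Example 4; the function name `split_samples` and the fixed `seed` are hypothetical.

```python
import random


def split_samples(cancer_ids, healthy_ids, seed=0):
    """Split sample IDs into a training set and a test set.

    With 2n + 1 cancer samples and 2m healthy samples, the test set gets
    n cancer + m healthy samples and the training set gets the remaining
    n + 1 cancer + m healthy samples.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    cancer = list(cancer_ids)
    healthy = list(healthy_ids)
    rng.shuffle(cancer)
    rng.shuffle(healthy)
    n = len(cancer) // 2
    m = len(healthy) // 2
    test = {"cancer": cancer[:n], "healthy": healthy[:m]}
    train = {"cancer": cancer[n:], "healthy": healthy[m:]}
    return train, test
```

For 81 gastric cancer and 38 healthy samples this yields a test set of 40 + 19 and a training set of 41 + 19.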
Preferably, the sample data processing method in step S502 uses ten-fold cross-validation.
Preferably, the evaluation in step S504 includes calculating and evaluating sensitivity, specificity, accuracy, recall, ROC and AUC.
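The evaluation quantities listed above can be computed from predicted labels and scores along the following lines. This is a hedged sketch using standard textbook definitions, with label 1 for gastric cancer and 0 for healthy. The description reports recall separately from sensitivity without defining it, so recall is left out here. All function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 = cancer, 0 = healthy."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:  # truth == 1 and pred == 0
            fn += 1
    return tp, fp, tn, fn


def screening_metrics(y_true, y_pred):
    """Sensitivity, specificity and accuracy; assumes both classes are present."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as one half. Assumes at least one sample of each class.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for cohorts of this size.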
The beneficial technical effects obtained by the invention are as follows:
1) the invention uses low-depth whole-genome sequencing, which greatly reduces the sequencing cost compared with high-depth or ultra-high-depth targeted sequencing;
2) through whole-genome sequencing, the invention reflects the overall landscape of cfDNA in gastric cancer patients more comprehensively and avoids missing gastric cancer patients with high heterogeneity;
3) through tri-omics analysis, the invention can more comprehensively mine the specific features of cfDNA in gastric cancer patients.
The foregoing description is only an overview of the technical solutions of the present application, so that the technical means of the present application can be more clearly understood and the present application can be implemented according to the content of the description, and in order to make the above and other objects, features and advantages of the present application more clearly understood, the following detailed description is made with reference to the preferred embodiments of the present application and the accompanying drawings.
The above and other objects, advantages and features of the present application will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof, taken in conjunction with the accompanying drawings.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts. Throughout the drawings, like elements or portions are generally identified by like reference numerals. In the drawings, elements or portions are not necessarily drawn to scale.
FIG. 1 is a flow chart of noninvasive screening of early-stage gastric cancer patients based on cfDNA omics features;
FIG. 2 is a schematic diagram of feature extraction and model training for cfDNA fragment patterns;
FIG. 3 is a schematic diagram of feature extraction and model training for cfDNA CNV diversity;
FIG. 4 is a schematic diagram of feature extraction and model training for cfDNA TSS coverage;
FIG. 5 is a schematic diagram showing MUC2 as a target gene of early gastric cancer.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. In the following description, specific details such as specific configurations and components are provided only to facilitate a thorough understanding of embodiments of the present application. Accordingly, it will be apparent to those skilled in the art that various changes and modifications may be made to the embodiments described herein without departing from the scope and spirit of the present application. In addition, descriptions of well-known functions and constructions are omitted in the embodiments for clarity and conciseness.
It should be appreciated that reference throughout this specification to "one embodiment" or "the present embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Thus, the appearances of the phrase "one embodiment" or "the present embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Further, the present application may repeat reference numerals and/or letters in the various examples. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed.
The term "and/or" herein merely describes an association between associated objects and means that three relationships may exist; for example, A and/or B may mean: A alone, B alone, or both A and B. The term "/and" herein describes another association and means that two relationships may exist; for example, A/and B may mean: A alone, or both A and B. In addition, the character "/" herein generally indicates an "or" relationship between the objects before and after it.
The term "at least one" herein merely describes an association between associated objects and means that three relationships may exist; for example, at least one of A and B may mean: A alone, both A and B, or B alone.
It is further noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion.
Example 1
A noninvasive early cancer screening system based on cfDNA omics characteristics comprises a cfDNA omics characteristic model and a machine learning training model, and is characterized in that the noninvasive cancer screening method comprises the following steps:
S101, establishing a cfDNA omics characteristic model;
S102, collecting blood;
S103, extracting cfDNA;
S104, performing library construction and sequencing on the extracted cfDNA;
and S105, extracting the cfDNA omics characteristics for comparison.
Preferably, establishing the cfDNA omics characteristic model in step S101 comprises the following steps:
S201, collecting blood;
S202, extracting cfDNA;
S203, performing library construction and sequencing on the extracted cfDNA;
S204, extracting cfDNA omics characteristics;
and S205, training the model by machine learning.
Preferably, the blood collection in step S102 and step S201 is a whole blood extraction using Streck blood collection tubes. The Streck blood collection tube contains a preservative which can stabilize nucleated blood cells, prevent the release of cell genome DNA, inhibit cfDNA nuclease-mediated degradation and contribute to the overall stability of cfDNA.
Preferably, extracting cfDNA in step S103 and step S202 comprises the following steps:
S301, placing the Streck blood collection tube in a centrifuge and centrifuging until the plasma is separated;
S302, adding proteinase K and ACL buffer to a centrifuge tube containing the plasma, mixing thoroughly and incubating;
S303, suction-filtering the incubated collection tube with a vacuum pump and washing off impurities;
S304, centrifuging in a centrifuge;
S306, placing the collection tube in a metal bath to evaporate the ethanol, adding AVE, and incubating;
S307, centrifuging the collection tube in a centrifuge, collecting the filtrate in an EP tube, measuring the DNA concentration, and checking the fragment distribution of the cfDNA.
Preferably, the cfDNA omics features mainly comprise one or more combinations of fragment pattern, CNV diversity and TSS coverage.
Preferably, the cfDNA omics feature extraction method in steps S101, S105 and S204 comprises:
S401, aligning the cfDNA sequence file to a reference genome to obtain a BAM file;
S402, removing low-quality sequences and duplicate sequences from the BAM file;
S403, excluding regions of low coverage in the reference genome and the Duke blacklist regions;
S404, dividing the chromosomes into adjacent, non-overlapping bins;
S405, counting the numbers of long and short cfDNA fragments;
S406, correcting the counts for GC content;
S407, quantifying the fragment pattern as a ratio; and performing median normalization against a gold standard and computing the density distribution of copy number variation;
S408, obtaining the coordinates of the transcription start sites (TSS) of the reference genome and mapping them onto the BAM file to obtain the sequence coverage near each site;
and S409, obtaining the TSS coverage by coverage calculation.
Preferably, the method for establishing the machine learning training model in step S205 comprises:
S501, dividing the samples into a training set and a test set;
S502, processing the sample data in the training set;
S503, extracting the omics characteristics of cfDNA in the training set and verifying them in the test set;
and S504, evaluating the performance of the model.
Preferably, the test set in step S501 comprises n gastric cancer samples and m healthy samples, the training set comprises n + 1 gastric cancer samples and m healthy samples, and n and m are positive integers.
Preferably, the sample data processing method in step S502 uses ten-fold cross-validation.
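Ten-fold cross-validation on the training set can be sketched as below. This is a minimal, library-free illustration of how the folds might be generated; the description does not say whether the folds are stratified by class, so this sketch uses a plain shuffled split. The function name `k_fold_indices` is hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, validation_indices) pairs for k-fold cross-validation.

    Each sample index appears in exactly one validation fold.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        validation = folds[i]
        training = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield training, validation
```

A model would be fitted on each `training` list and scored on the matching `validation` list, and the ten scores averaged.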
Preferably, the evaluation in step S504 includes calculating sensitivity, specificity, accuracy, recall, ROC and AUC.
Example 2
This example builds on Example 1; parts identical to Example 1 are not repeated.
This example introduces a method of cfDNA extraction, comprising the specific steps of:
S601, placing the Streck tube in a centrifuge at 4 °C and centrifuging at 2000 rpm for 10 min to separate the plasma;
S602, adding 500 µL of proteinase K and 4 mL of ACL buffer to a 50 mL centrifuge tube containing the plasma, vortexing for 30 s to mix thoroughly, and incubating in a 60 °C water bath for 30 min;
S603, suction-filtering the incubated collection tube with a vacuum pump for 10 min, then adding 600 µL of ACW1, 750 µL of ACW2 and 750 µL of 100% ethanol in sequence to wash away impurities;
S604, centrifuging at 12000 rpm for 3 min;
S605, placing the collection tube in a 50 °C metal bath for 10 min to evaporate the ethanol, adding 110 µL of AVE to the collection tube, and incubating at room temperature for 3 min;
S606, centrifuging the collection tube at 12000 rpm for 1 min, collecting the filtrate in a 1.5 mL EP tube, measuring the DNA concentration with a Qubit 3.0, and checking the fragment distribution of the cfDNA on a 2100 Bioanalyzer.
Example 3
This example builds on Example 1; parts identical to Example 1 are not repeated.
This example describes a method for tri-omics feature extraction from cfDNA, comprising the following steps:
Fragment pattern: first, the cfDNA sequence files are aligned to the reference genome hg19, and low-quality and duplicate sequences are removed from the resulting BAM files; regions of low coverage in the hg19 reference genome and the Duke blacklist regions are then excluded; next, the hg19 autosomes are divided into 504 adjacent, non-overlapping bins of 5 Mb each; the number of cfDNA fragments longer than 150 bp and the number shorter than 150 bp are counted in each bin; the counts are corrected for GC content by LOESS regression, and the GC-corrected counts are mean-normalized; finally, the numbers of long and short cfDNA fragments in each 5 Mb bin are obtained, and the fragment pattern is quantified as their ratio.
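The per-bin counting and ratio step can be sketched as follows. This is a minimal illustration under stated assumptions: fragments are given as hypothetical `(chromosome, start, length)` tuples already filtered for quality, duplicates and excluded regions; GC correction and mean normalization are omitted; the ratio is taken as short over long, as in Example 4; and fragments of exactly 150 bp are counted in neither class because the text does not assign them.

```python
from collections import defaultdict

BIN_SIZE = 5_000_000  # 5 Mb bins, as in the description
CUTOFF = 150          # bp threshold separating short from long fragments


def count_fragments(fragments, bin_size=BIN_SIZE, cutoff=CUTOFF):
    """Count short and long fragments per bin.

    Returns {(chromosome, bin_index): [n_short, n_long]}.
    """
    counts = defaultdict(lambda: [0, 0])
    for chrom, start, length in fragments:
        key = (chrom, start // bin_size)
        if length < cutoff:
            counts[key][0] += 1
        elif length > cutoff:
            counts[key][1] += 1
        # length == cutoff is left unassigned (not specified in the text)
    return dict(counts)


def short_long_ratio(counts):
    """Short-to-long ratio per bin; bins without long fragments are skipped."""
    return {key: n_short / n_long
            for key, (n_short, n_long) in counts.items() if n_long > 0}
```

In practice the fragment tuples would come from properly paired reads in the filtered BAM file, with the fragment length taken from the template length field.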
CNV diversity: low-quality and duplicate sequences are removed from the aligned BAM file, and the chromosomes are divided into 51120 adjacent, non-overlapping bins of 50 kb each; GC content is corrected in the same way as for the fragment pattern; the per-bin median of the GC-corrected cfDNA counts of healthy people is taken as the gold standard, and the cfDNA count in each bin of a cancer patient is median-normalized against this gold standard; amplified and deleted bins are called using 0.2 as the threshold, the density distribution of copy number variation is computed, and abnormally amplified and deleted intervals are identified; the pathways and biological processes involving the genes in the amplified intervals are then explored.
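The median normalization and thresholding can be sketched as follows. The text does not state the scale on which the 0.2 threshold is applied; this sketch assumes a log2 ratio of the sample count to the healthy per-bin median, which is a common convention but an assumption here. All function names are hypothetical, and the input counts are assumed to be GC-corrected already.

```python
import math
from statistics import median


def healthy_reference(healthy_profiles):
    """Per-bin median across healthy samples (the 'gold standard')."""
    return [median(column) for column in zip(*healthy_profiles)]


def log2_ratios(sample, reference):
    """log2(sample / reference) per bin; None where either value is not positive."""
    return [math.log2(s / r) if s > 0 and r > 0 else None
            for s, r in zip(sample, reference)]


def call_bins(ratios, threshold=0.2):
    """Label each bin as 'amp', 'del', 'neutral' or 'na' (no usable value)."""
    calls = []
    for ratio in ratios:
        if ratio is None:
            calls.append("na")
        elif ratio > threshold:
            calls.append("amp")
        elif ratio < -threshold:
            calls.append("del")
        else:
            calls.append("neutral")
    return calls


def abnormal_fraction(calls):
    """Fraction of usable bins called amplified or deleted."""
    usable = [c for c in calls if c != "na"]
    return sum(c != "neutral" for c in usable) / len(usable) if usable else 0.0
```

The list returned by `log2_ratios` is also what a density plot of copy number variation would be drawn from.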
TSS coverage: low-quality sequences are removed from the BAM file; the coordinates of the transcription start sites (TSS) of the hg19 reference genome are downloaded from the Ensembl database and mapped onto the BAM file to obtain the sequence coverage near each site; the coverage of the nucleosome-depleted region (NDR) near the start site is calculated first, followed by the coverage from 1000 bp upstream to 1000 bp downstream of the start site (the 2k region); to normalize these two coverages, the mean coverage of the segment from 3000 bp to 1000 bp upstream and the segment from 1000 bp to 3000 bp downstream of the start site is used as the gold standard; the NDR and 2k-region coverages divided by this gold standard give the final TSS coverage.
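The normalization of NDR and 2k-region coverage by the flanking coverage can be sketched as follows. This is a minimal illustration assuming a per-base coverage profile has already been extracted for positions -3000 to +3000 around one TSS and oriented by gene strand. The NDR window is not specified in the text; the default of -150 to +50 bp is an assumption. The function name `tss_coverage` is hypothetical.

```python
from statistics import mean

FLANK = 3000  # profile spans TSS-3000 .. TSS+3000, so index = position + FLANK


def tss_coverage(profile, ndr=(-150, 50)):
    """Normalized NDR and 2k-region coverage for one transcription start site.

    `profile` is a list of 6001 per-base coverage values. The NDR window is
    an assumed default, not a value taken from the description.
    """
    if len(profile) != 2 * FLANK + 1:
        raise ValueError("profile must cover -3000..+3000 around the TSS")

    def window(start, end):
        # Mean coverage over TSS-relative positions start..end, inclusive.
        return mean(profile[start + FLANK:end + FLANK + 1])

    upstream = profile[:2000]      # positions -3000 .. -1001
    downstream = profile[-2000:]   # positions +1001 .. +3000
    gold = mean(upstream + downstream)  # flanking 'gold standard' coverage
    if gold == 0:
        raise ValueError("flanking coverage is zero; cannot normalize")
    return {"ndr": window(*ndr) / gold, "two_k": window(-1000, 1000) / gold}
```

Values below 1 indicate depleted coverage around the start site relative to its flanks.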
Example 4
This example builds on Example 1; parts identical to Example 1 are not repeated.
This example presents cfDNA combined feature extraction and model training.
In this example, preoperative peripheral blood was collected from 81 gastric cancer patients and peripheral blood from 38 healthy people, and cfDNA extraction, library construction and sequencing were performed.
The first step: feature extraction and model training of cfDNA fragment patterns; the results are shown in FIG. 2.
A. The aligned BAM file is divided into 504 non-overlapping bins; in each bin, the GC-corrected number of long cfDNA fragments (longer than 150bp) and the GC-corrected number of short cfDNA fragments (shorter than 150bp) are counted, and the ratio of short fragments to long fragments is calculated. The ratios of healthy people were found to be relatively concentrated, with a larger proportion of long fragments in each bin than in gastric cancer patients; the ratios of gastric cancer patients were relatively dispersed, with a larger proportion of short fragments in each bin than in healthy people.
B. After mean normalization of the ratio distributions of gastric cancer patients and healthy people, the ratios of healthy people were found to be stable, whereas those of gastric cancer patients were highly variable.
C. The median over healthy people of each of the 504 bins was used as the gold standard, and the similarity between each sample and the gold standard was calculated. The similarity among healthy people was found to be high; data of healthy people from a Nature article were also selected for comparison and were found to be similar to the gold standard, whereas the similarity between gastric cancer patients and the gold standard was markedly reduced. The p value of the difference relative to the healthy people was 0.0003313 and relative to the healthy people from the Nature article was 3.686e-08, using the rank-sum test.
D. The model was trained on the training set with a stochastic gradient descent algorithm, features were extracted using ten-fold cross-validation, and the performance of the model was finally evaluated on the test set. On the test set, the AUC was 0.96447, the sensitivity was 0.975, the specificity was 0.842, the accuracy was 0.929, and the recall was 0.941.
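Items A and C of this step can be illustrated with a short Python sketch. It is an illustration only: the function names and the input format (one list of fragment lengths per bin) are hypothetical, GC correction is omitted, and Pearson correlation is used as the similarity measure purely as an assumption, since the description does not name the measure. Fragments of exactly 150bp are left uncounted because the description assigns them to neither class.

```python
from statistics import mean, median


def short_long_ratio(lengths_per_bin, cutoff=150):
    """Ratio of short (< cutoff) to long (> cutoff) fragments in each bin."""
    ratios = []
    for lengths in lengths_per_bin:
        short = sum(1 for n in lengths if n < cutoff)
        long_ = sum(1 for n in lengths if n > cutoff)
        # A bin without long fragments has no defined ratio
        ratios.append(short / long_ if long_ else float("nan"))
    return ratios


def healthy_reference(profiles):
    """Per-bin median over the ratio profiles of healthy samples."""
    return [median(column) for column in zip(*profiles)]


def pearson(x, y):
    """Pearson correlation between a sample profile and the reference."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5
```

Under these assumptions, each sample yields one similarity value against the healthy reference, and the two groups of values can then be compared with a rank-sum test as stated in item C.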
The second step: feature extraction and model training of cfDNA cnv diversity; the results are shown in FIG. 3.
A. After the cnv of the cfDNA was calculated, its density distribution was evaluated. The cnv density of healthy people was found to be concentrated around 0, whereas that of gastric cancer patients was more dispersed, and this was a common feature among the patients. The figure shows the cnv density of one healthy person and one gastric cancer patient.
B. Intervals with a value greater than 0.2 were set as gene-fragment amplification intervals and intervals with a value less than -0.2 as gene-fragment deletion intervals, and the proportion of amplification intervals and deletion intervals was counted for all samples. The proportion of abnormal cnv intervals in gastric cancer patients was far higher than in healthy people; the p value of the difference was 6.499e-12, using the rank-sum test.
C. The model was trained on the training set with a stochastic gradient descent algorithm, features were extracted using ten-fold cross-validation, and the performance of the model was finally evaluated on the test set. On the test set, the AUC was 0.98947, the sensitivity was 1, the specificity was 0.895, the accuracy was 0.952, and the recall was 1.
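The thresholding in item B can be sketched as follows. This is a sketch under stated assumptions, not the implementation of the invention: the function names are hypothetical, and the normalized value of each interval is assumed to be the log2 ratio of the sample count to the healthy median count, which the symmetric thresholds of 0.2 and -0.2 suggest but the description does not state.

```python
import math


def log2_ratio(sample_counts, reference_counts):
    """Per-interval log2(sample / healthy median); counts must be positive.

    The log2 scale is an assumption, not stated in the description.
    """
    return [math.log2(s / r) for s, r in zip(sample_counts, reference_counts)]


def cnv_abnormal_fraction(values, gain=0.2, loss=-0.2):
    """Fraction of intervals called amplified, deleted, or either."""
    total = len(values)
    amplified = sum(1 for v in values if v > gain)
    deleted = sum(1 for v in values if v < loss)
    return {
        "amplified": amplified / total,
        "deleted": deleted / total,
        "abnormal": (amplified + deleted) / total,
    }
```

The resulting fractions give one small feature vector per sample, which could then be compared between gastric cancer patients and healthy people.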
The third step: feature extraction and model training of cfDNA TSS coverage; the results are shown in FIG. 4.
A. The figure shows the cfDNA coverage of a gene from 1Mb upstream to 1Mb downstream of the transcription start site in one gastric cancer patient. The red dotted line marks the transcription start site, near which the cfDNA coverage drops sharply; this corresponds to the promoter region, which can be recognized by transcription factors.
B. The figure shows the cfDNA coverage from 1kb upstream to 1kb downstream of the transcription start site. The red dotted line marks the transcription start site; as in panel A, the cfDNA coverage drops sharply around the transcription start site, which corresponds to the promoter region that can be recognized by transcription factors.
C. The figure shows the cfDNA coverage of the nucleosome-depleted region (NDR), from 150bp upstream to 50bp downstream of the transcription start site. The red dotted line marks the transcription start site; as in panel A, the cfDNA coverage drops sharply around the transcription start site, which corresponds to the promoter region that can be recognized by transcription factors.
D. The figure shows twenty thousand protein-coding genes sorted by the mean cfDNA coverage of the 81 gastric cancer samples in the 2k region. The lower the mean coverage, the more open the transcription start site of the gene; after sorting by mean coverage, the higher a gene is placed, the more open its promoter.
E. The figure shows twenty thousand protein-coding genes sorted by the mean cfDNA coverage of the 81 gastric cancer samples in the nucleosome-depleted region (NDR). The lower the mean coverage, the more open the transcription start site of the gene; after sorting by mean coverage, the higher a gene is placed, the more open its promoter.
F. For the 2k region and the nucleosome-depleted region (NDR), if the mean coverage at the transcription start site of a gene is less than 1 in more than 80% of the gastric cancer samples, the promoter of the gene is judged to be an open region. KEGG pathway enrichment analysis of these genes showed that the enriched pathways are related to cell proliferation, autophagy and migration and are significantly associated with the occurrence and development of cancer.
G. The model was trained on the training set with a stochastic gradient descent algorithm, features were extracted using ten-fold cross-validation, and the performance of the model was finally evaluated on the test set. On the test set, the AUC was 0.98947, the sensitivity was 1, the specificity was 0.895, the accuracy was 0.952, and the recall was 1.
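The open-promoter rule in item F can be written as a short Python sketch. The function name, the dictionary input and the gene labels in the usage note are hypothetical; the sketch applies the rule to one region at a time (2k region or NDR), which is an assumption about how the two regions are combined.

```python
def open_promoters(coverage_by_gene, threshold=1.0, min_fraction=0.8):
    """Return genes whose normalized TSS coverage is below `threshold`
    in more than `min_fraction` of the samples.

    coverage_by_gene : dict mapping a gene name to a list of normalized
                       coverage values, one per gastric cancer sample
    """
    selected = []
    for gene, values in coverage_by_gene.items():
        below = sum(1 for v in values if v < threshold)
        # "More than 80%" is read as a strict inequality
        if below / len(values) > min_fraction:
            selected.append(gene)
    return selected
```

For example, a hypothetical gene with coverage below 1 in 9 of 10 samples would be selected, whereas one with coverage below 1 in exactly 8 of 10 samples would not; the selected genes would then be passed to the pathway enrichment analysis.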
The fourth step: cfDNA TSS coverage combined with single-cell analysis identified MUC2 as a target gene for early gastric cancer; the results are shown in FIG. 5.
After the cfDNA TSS coverage features were analyzed in combination with gastric cancer single-cell transcriptome data, it was found that the MUC2 promoter region is more open in early-stage gastric cancer patients than in healthy people. A lower normalized coverage indicates greater openness; the greater openness was observed in both the 2k region (panel A) and the NDR (panel B), and the expression of MUC2 was also high. HE staining and IF staining showed that the tumor area of early-stage gastric cancer patients was positive for MUC2 fluorescence staining, as shown in panel C, whereas the tumor area of late-stage gastric cancer patients was only partially positive or entirely negative for MUC2 fluorescence staining, as shown in panel D.
The above description is only a preferred embodiment of the present invention and is not intended to limit the scope of the present invention; it will be apparent to those skilled in the art that various modifications and variations can be made to the present invention. Variations, modifications, substitutions, integrations and parameter changes of the embodiments that are made without departing from the principle and spirit of the invention, whether by conventional substitution or by realizing the same function, fall within the scope of the invention.

Claims (2)

1. A cancer noninvasive early screening method based on cfDNA omics characteristics comprises a cfDNA omics characteristic model established through a machine learning training model, and is characterized by comprising the following steps:
S101, establishing a cfDNA omics characteristic model; S102, blood collection; S103, extracting cfDNA; S104, performing library construction and sequencing on the extracted cfDNA; S105, extracting cfDNA omics characteristics and comparing the cfDNA omics characteristics;
the establishing of the cfDNA omics characteristic model in the step S101 specifically comprises the following steps:
S201, blood collection; S202, extracting cfDNA; S203, performing library construction and sequencing on the extracted cfDNA; S204, extracting cfDNA omics characteristics; S205, training a machine learning model;
in the step S102 and the step S201, a blood collection tube is used for whole blood collection; the blood collection tube contains a preservative which stabilizes nucleated blood cells, prevents the release of cellular genomic DNA, inhibits nuclease-mediated degradation of cfDNA and contributes to the overall stability of the cfDNA;
the cfDNA omics characteristics comprise fragment pattern, cnv diversity and TSS coverage;
the fragment pattern is as follows: first, the cfDNA sequence files are aligned to the reference genome hg19, low-quality sequences are discarded and duplicate sequences are filtered out of the resulting BAM file, and sequences in low-coverage regions of the hg19 reference genome and in Duke blacklisted regions are then excluded; the processed BAM file is then divided into 504 adjacent, non-overlapping bins; the number of cfDNA fragments longer than 150bp and the number of cfDNA fragments shorter than 150bp are counted in each bin; the cfDNA counts are corrected for GC content using a LOESS regression method, and the GC-corrected cfDNA counts are processed by mean normalization; finally, the numbers of long and short cfDNA fragments in each bin are obtained, the ratio of short fragments to long fragments is calculated, and the fragment pattern is quantified by this ratio;
the cnv diversity is as follows: low-quality and duplicate sequences are removed from the aligned BAM file, and the processed BAM file is divided into 51120 adjacent, non-overlapping bins; GC content correction is performed in the same way as for the fragment pattern, the median of the GC-corrected cfDNA count of each bin in healthy people is taken as the gold standard, and the cfDNA count in each bin of the patient is median-normalized against the gold standard; intervals with a value greater than 0.2 are set as gene-fragment amplification intervals and intervals with a value less than -0.2 as gene-fragment deletion intervals, and the proportion of amplification intervals and deletion intervals is counted for all samples, giving the density distribution of copy number variation;
the TSS coverage is as follows: low-quality sequences are removed from the BAM file, the coordinates of the transcription start sites of the hg19 reference genome are downloaded from the ENSEMBL database, and the coordinates are matched against the BAM file to obtain the coverage of the sequences near these sites; first, the coverage of the nucleosome-depleted region (NDR) near the start site is calculated, and then the coverage of the 2k region from 1000bp upstream to 1000bp downstream of the start site is calculated; then, in order to normalize the two coverages, the mean coverage of the segments from 3000bp upstream to 1000bp upstream and from 1000bp downstream to 3000bp downstream of the start site is calculated as the gold standard; the NDR coverage and the 2k-region coverage are each divided by the gold standard to give the final TSS coverage.
2. The cancer noninvasive early screening method based on cfDNA omics characteristics as claimed in claim 1, wherein the step S103 and the step S202 for extracting cfDNA comprise the following steps:
S301, placing the blood collection tube in a centrifuge, centrifuging until the plasma is separated, and transferring the plasma to a centrifuge tube;
S302, adding proteinase K and ACL buffer to the centrifuge tube containing the plasma, mixing thoroughly and incubating;
S303, performing suction filtration on the contents of the incubated centrifuge tube using a vacuum pump, washing away impurities, and placing the filtrate in a collection tube;
S304, placing the mixture in a centrifuge for centrifugation;
S306, placing the collection tube in a metal bath to evaporate the ethanol, adding AVE buffer, and incubating;
S307, placing the collection tube in a centrifuge for centrifugation, measuring the DNA concentration of the centrifuged filtrate, and detecting the fragment size distribution of the cfDNA.
CN202110118814.5A 2021-01-28 2021-01-28 Cancer noninvasive early screening method based on cfDNA omics characteristics Active CN113160889B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110118814.5A CN113160889B (en) 2021-01-28 2021-01-28 Cancer noninvasive early screening method based on cfDNA omics characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110118814.5A CN113160889B (en) 2021-01-28 2021-01-28 Cancer noninvasive early screening method based on cfDNA omics characteristics

Publications (2)

Publication Number Publication Date
CN113160889A CN113160889A (en) 2021-07-23
CN113160889B true CN113160889B (en) 2022-07-19

Family

ID=76879009

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110118814.5A Active CN113160889B (en) 2021-01-28 2021-01-28 Cancer noninvasive early screening method based on cfDNA omics characteristics

Country Status (1)

Country Link
CN (1) CN113160889B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113838533B (en) * 2021-08-17 2024-03-12 福建和瑞基因科技有限公司 Cancer detection model, construction method thereof and kit
CN114242164B (en) * 2021-12-21 2023-03-28 苏州吉因加生物医学工程有限公司 Analysis method, device and storage medium for whole genome replication
CN114613436B (en) * 2022-05-11 2022-08-02 北京雅康博生物科技有限公司 Blood sample Motif feature extraction method and cancer early screening model construction method
CN115662519B (en) * 2022-09-29 2023-11-03 南京医科大学 cfDNA fragment characteristic combination and system for predicting cancer based on machine learning
CN115691667B (en) * 2022-12-30 2023-04-18 北京橡鑫生物科技有限公司 Urology early screening device, model construction method and equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110189798A (en) * 2019-06-26 2019-08-30 广州市雄基生物信息技术有限公司 A kind of clustering method and application based on peripheral blood plasma DNA nucleosome footprint difference
CN110739027A (en) * 2019-10-23 2020-01-31 深圳吉因加医学检验实验室 cancer tissue positioning method and system based on chromatin region coverage depth

Family Cites Families (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104560697A (en) * 2015-01-26 2015-04-29 上海美吉生物医药科技有限公司 Detection device for instability of genome copy number
CN106295245B (en) * 2016-07-27 2019-08-30 广州麦仑信息科技有限公司 Method of the storehouse noise reduction based on Caffe from coding gene information feature extraction
WO2018081130A1 (en) * 2016-10-24 2018-05-03 The Chinese University Of Hong Kong Methods and systems for tumor detection
EP3548632A4 (en) * 2016-11-30 2020-06-24 The Chinese University Of Hong Kong Analysis of cell-free dna in urine and other samples
CN107099577A (en) * 2017-03-06 2017-08-29 华南理工大学 Vaginal fluid humidity strip candida albicans detection method based on Hough loop truss and depth convolutional network
CN107133496B (en) * 2017-05-19 2020-08-25 浙江工业大学 Gene feature extraction method based on manifold learning and closed-loop deep convolution double-network model
EP3743518A4 (en) * 2018-01-24 2021-09-29 Freenome Holdings, Inc. Methods and systems for abnormality detection in the patterns of nucleic acids
US20210082111A1 (en) * 2018-03-29 2021-03-18 Sony Corporation Information processing device, information processing method, and program
CN108949979A (en) * 2018-07-11 2018-12-07 深圳市海普洛斯生物科技有限公司 A method of judging that Lung neoplasm is good pernicious by blood sample
CN109182526A (en) * 2018-10-10 2019-01-11 杭州翱锐生物科技有限公司 Kit and its detection method for early liver cancer auxiliary diagnosis
CN109360604B (en) * 2018-11-21 2021-09-24 南昌大学 Ovarian cancer molecular typing prediction system
CN109680049A (en) * 2018-12-03 2019-04-26 东南大学 A kind of method and its application based on the dissociative DNA in blood high-flux sequence analysis affiliated individual physiological state of cfDNA
CN109652513B (en) * 2019-02-25 2022-08-23 元码基因科技(北京)股份有限公司 Method and kit for accurately detecting individual mutation of liquid biopsy based on second-generation sequencing technology
CN110211632A (en) * 2019-05-06 2019-09-06 西安电子科技大学 A kind of nucleotide unit point mutation detection method neural network based
CN111081317B (en) * 2019-12-10 2023-06-02 山东大学 Gene spectrum-based breast cancer lymph node metastasis prediction method and prediction system
CN111243673B (en) * 2019-12-25 2021-11-19 北京橡鑫生物科技有限公司 Tumor screening model, and construction method and device thereof
CN111254211A (en) * 2020-02-28 2020-06-09 广东药科大学 Aquilaria plant identification method based on ITS sequence and machine learning
CN112086129B (en) * 2020-09-23 2021-04-06 深圳吉因加医学检验实验室 Method and system for predicting cfDNA of tumor tissue

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Cell-free DNA analysis reveals POLR1Dmediated resistance to bevacizumab in colorectal cancer;Qing Zhou 等;《genome medicine》;20201231;第1-17页 *

Also Published As

Publication number Publication date
CN113160889A (en) 2021-07-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
Inventor after: Lan Xun, Ji Jiafu, Bu Zhaode, Li Jie, Chen Jiahui, Sun Keyong, Sun Xin
Inventor before: Lan Xun, Ji Jiafu, Bu Zhaode, Li Jie, Chen Jiahui, Sun Keyong, Sun Xin
TA01 Transfer of patent application right
Effective date of registration: 20220218
Address after: B215, floor 2, No. 5, Kaifa Road, Haidian District, Beijing 100089
Applicant after: Renke (Beijing) Biotechnology Co.,Ltd.
Address before: No. 30 Shuangqing Road, Haidian District, Beijing 100084
Applicant before: TSINGHUA University
Applicant before: BEIJING CANCER HOSPITAL (BEIJING CANCER Hospital)
GR01 Patent grant