CN113257350B - ctDNA mutation degree analysis method and device based on liquid biopsy and ctDNA performance analysis device - Google Patents

Publication number
CN113257350B
Authority
CN
China
Prior art keywords
mutation
model
level
module
analysis
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110650420.4A
Other languages
Chinese (zh)
Other versions
CN113257350A (en
Inventor
李庆原
谢泓禹
刘异倩
刘小莉
洪媛媛
王小庆
韩天澄
杨顺莉
于佳宁
陈维之
何骥
杜波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuxi Precision Medical Laboratory Co.,Ltd.
Wuxi Zhenhe Biotechnology Co.,Ltd.
Zhenhe (Beijing) Biotechnology Co.,Ltd.
Original Assignee
Wuxi Precision Medical Laboratory Co ltd
Wuxi Zhenhe Biotechnology Co ltd
Zhenhe Beijing Biotechnology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuxi Precision Medical Laboratory Co., Ltd., Wuxi Zhenhe Biotechnology Co., Ltd. and Zhenhe (Beijing) Biotechnology Co., Ltd.
Priority to CN202110650420.4A
Publication of CN113257350A
Application granted
Publication of CN113257350B
Legal status: Active
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/50: Mutagenesis
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00: ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30: Data warehousing; Computing architectures

Abstract

The invention provides a ctDNA mutation degree analysis method and device based on liquid biopsy, and a ctDNA performance analysis device. The mutation degree analysis method comprises the following steps: performing capture sequencing on a plasma sample to be tested to obtain a FASTQ file; extracting the molecular tags from the paired reads and storing them as a uBAM file; aligning the sequences in the FASTQ file to a reference genome, removing duplicates, and merging the result with the uBAM file to obtain a BAM file containing the molecular tags; aggregating and de-duplicating the reads in the BAM file; obtaining the sample's original mutation set within the gene mutation panel region and counting the gene mutation parameters in that set; filtering the original mutation set and counting the gene mutation parameters of each sample; and evaluating the mutation degree of the plasma sample to be tested according to the sample's gene mutation parameters, thereby improving the sensitivity of ctDNA mutation detection.

Description

ctDNA mutation degree analysis method and device based on liquid biopsy and ctDNA performance analysis device
Technical Field
The invention relates to the technical field of biomedicine, in particular to a ctDNA mutation degree analysis method and device based on liquid biopsy and a ctDNA performance analysis device.
Background
Early screening, early diagnosis and timely treatment are effective ways to reduce cancer mortality. The European Society for Medical Oncology (ESMO) states that cancer incidence and mortality in Western countries have decreased year by year, mainly owing to early cancer screening, early resection of benign adenomas and early treatment of cancerous lesions. Discovering and using tumor-specific biomarkers, and applying high-precision detection and analysis methods to locate the organ of origin and begin treatment at an early stage of tumorigenesis, are key to improving treatment outcomes and prolonging patients' lives. Early tumor screening and diagnosis therefore have far-reaching social and economic significance for improving the population's quality of life and reducing society's overall medical costs.
Currently, typical tumor early screening and early diagnosis approaches can be roughly divided into two categories: the first type introduces a more sensitive electronic data analysis means on the basis of the existing clinical detection platform (such as pathological section, CT image, enteroscope, gastroscope, and the like) so as to improve the detection sensitivity, reduce the dependence on manual interpretation, reduce human errors and assist clinical decision; the second type researches tumor markers of somatic cells, genetics, epigenetics, metabolites and other types at clinical level and molecular level, which are potentially related to tumorigenesis and development from a mechanism angle, and develops a new detection platform and a new detection means based on the screening sites.
In the first category of research, researchers have successfully applied machine learning algorithms such as artificial neural networks and multi-objective optimization to the interpretation of CT colonography images, detecting colon polyps more sensitively and identifying possible canceration earlier. However, image recognition for smaller colon polyps (6-9 mm in diameter) still needs improvement. Based on similar concepts, machine learning algorithms have also been used to interpret PET/CT images of the lung automatically and to distinguish benign from malignant lung nodules for early diagnosis of lung cancer. Representative algorithms include support vector machines, random forests, convolutional neural networks and other deep learning methods, and these have found some application in early lung cancer detection. Although machine-learning-based interpretation of low-dose PET images is generally more specific, its sensitivity still needs improvement. In liver cancer detection, machine learning algorithms are likewise used to distinguish different types of liver lesions in CT images, including liver cysts, focal nodular hyperplasia, hepatic hemangioma, chronic hepatitis, cirrhosis and hepatocellular carcinoma; early and accurate identification of liver cancer lesions from CT images greatly benefits treatment. Similar applications include mammography for early breast cancer screening and the interpretation of H&E-stained prostate biopsies to exclude cancer-negative samples effectively.
In the second category of research, tumor markers commonly used in clinical practice, such as carcinoembryonic antigen (CEA), alpha-fetoprotein (AFP), cancer antigen 125 (CA125), carbohydrate antigen 19-9 (CA19-9) and prostate-specific antigen (PSA), offer some guidance for tumor screening, but their sensitivity or specificity is often inadequate for clinical diagnosis. In practice, clinicians therefore usually measure several markers at once and also take clinical symptoms, imaging examinations and other findings into account. Consequently, large-scale screening of healthy populations on the basis of tumor markers alone does not generalize well.
Liquid biopsy, especially detection based on cell-free DNA (cfDNA) extracted from plasma, has become an important, minimally invasive tumor detection method in recent years and is widely used in tumor diagnosis, disease monitoring, efficacy evaluation and prognosis prediction. In recent studies, liquid biopsy based on detecting genetic variation in cfDNA has shown great potential for early cancer detection, and the detection of plasma ctDNA mutation signals is an important branch of this work.
ctDNA mutation analysis for early screening typically uses a combination of tumor-characteristic hot-spot mutations as markers. However, the marker sites differ among cancer types, and even in lung cancer and intestinal cancer, where hot-spot mutations are concentrated, hundreds of sites in at least a dozen genes must be tested to cover most patients with those cancers and so achieve the purpose of screening. If the commonly used PCR (polymerase chain reaction) method were used to test these sites, hundreds of milliliters of blood would be needed, which makes it impractical for routine early screening; in addition, PCR-based mutation detection yields more false positives of technical origin and of clonal hematopoiesis origin. It can be seen that early screening with DNA mutations as markers is not feasible with PCR methods.
Disclosure of Invention
In order to solve the problems, the invention provides a ctDNA mutation degree analysis method and device based on liquid biopsy and a ctDNA performance analysis device, which are used for analyzing the ctDNA mutation degree of a plasma sample to be detected and improving the detection sensitivity.
The technical scheme provided by the invention is as follows:
In one aspect, the present invention provides a ctDNA mutation degree analysis method based on liquid biopsy, comprising:
performing capture sequencing on a plasma sample to be tested according to a pre-created gene mutation panel to obtain a FASTQ file, wherein the cfDNA in the plasma sample to be tested carries pre-attached molecular tags;
extracting the molecular tags from the paired reads in the FASTQ file and storing them as a uBAM file;
aligning the sequences in the FASTQ file to a reference genome and de-duplicating to obtain a BAM file, and merging the BAM file with the uBAM file to obtain a BAM file containing the molecular tags;
aggregating and de-duplicating the reads in the BAM file according to the molecular tags;
obtaining the sample's original mutation set within the gene mutation panel region by using a pileup method;
counting the gene mutation parameters in the sample's original mutation set, wherein the gene mutation parameters comprise: the gene mutation levels, and the number of gene mutations and the mutation frequency at each level;
filtering the sample's original mutation set according to pre-constructed filtering rules, and counting the gene mutation parameters of each sample;
evaluating the mutation degree of the plasma sample to be tested with a pre-constructed mutation analysis model according to the sample's gene mutation parameters.
Further preferably, the gene mutation levels include level I, level II, level III and level IV, wherein level I includes oncogenes in a preset cancer database; level II includes tumor suppressor genes in the preset cancer database, or other genes functionally determined to be harmful, that are not level I; level III includes exon-region genes that are not level I or level II; and level IV includes genes that are not level I, level II or level III;
the gene mutation parameters include: number of mutations of class I, maximum mutation frequency value of class I, number of mutations of class II, maximum mutation frequency value of class II, number of mutations of class III, maximum mutation frequency value of class III, number of mutations of class IV, and maximum mutation frequency value of class IV.
Further preferably, in filtering the sample's original mutation set according to the pre-constructed filtering rules and counting the gene mutation parameters of each sample, the rules for filtering the sample's original mutation set include:
germline mutations whose frequency in peripheral blood leukocytes is above a given value;
blacklisted sites that occur repeatedly in a database built from a large number of historical samples sequenced with the given panel, and sites in the database whose population frequency exceeds a set threshold;
a background noise baseline constructed from the cfDNA of more than a given number of healthy human plasma samples under the same sequencing conditions.
Further preferably, aggregating and de-duplicating the reads in the BAM file according to the molecular tags comprises:
forming gene families based on the molecular tags, wherein, within each gene family, the edit distance between the molecular tags is smaller than a first preset value, and the difference between the start positions of reads with the same molecular tag is within a second preset value;
filtering the gene families according to rules, wherein the rules comprise: for a family with a corresponding double-stranded molecular tag, the number of reads contained in the family is not less than a third preset value; for a family without a corresponding double-stranded molecular tag, the number of reads contained in the family is not less than a fourth preset value.
In another aspect, the present invention provides a ctDNA mutation analysis device based on liquid biopsy, comprising:
a capture sequencing module, configured to perform capture sequencing on a plasma sample to be tested according to a pre-created gene mutation panel to obtain a FASTQ file, wherein the cfDNA in the plasma sample to be tested carries pre-attached molecular tags;
a molecular tag extraction module, configured to extract the molecular tags from the paired reads in the FASTQ file and store them as a uBAM file;
a file forming module, configured to align the sequences in the FASTQ file to a reference genome and de-duplicate to obtain a BAM file, and to merge the BAM file with the uBAM file to obtain a BAM file containing the molecular tags;
an identification module, configured to aggregate and de-duplicate the reads in the BAM file according to the molecular tags, and to obtain the sample's original mutation set within the gene mutation panel region by using a pileup method;
a parameter statistics module, configured to count the gene mutation parameters in the sample's original mutation set, wherein the gene mutation parameters include: the gene mutation levels, and the number of gene mutations and the mutation frequency at each level; and further configured to filter the sample's original mutation set according to pre-constructed filtering rules and count the gene mutation parameters of each sample;
a mutation evaluation module, configured to evaluate the mutation degree of the plasma sample to be tested with a pre-constructed mutation analysis model according to the sample's gene mutation parameters.
Further preferably, the gene mutation levels include level I, level II, level III and level IV, wherein level I includes oncogenes in the cancer-related database; level II includes tumor suppressor genes in the cancer-related database, or other genes functionally judged to be harmful, that are not level I; level III includes exon-region genes that are not level I or level II; and level IV includes genes that are not level I, level II or level III;
the gene mutation parameters include: number of mutations of class I, maximum mutation frequency value of class I, number of mutations of class II, maximum mutation frequency value of class II, number of mutations of class III, maximum mutation frequency value of class III, number of mutations of class IV, and maximum mutation frequency value of class IV.
Further preferably, the ctDNA mutation analysis device further includes a filtering module connected to the identification module and to the parameter statistics module, and configured to filter the sample's original mutation set, where the filtering conditions include:
germline mutations whose frequency in peripheral blood leukocytes is above a given value;
blacklisted sites that occur repeatedly in a database built from a large number of historical samples sequenced with the given panel, and sites in the database whose population frequency exceeds a set threshold;
a background noise baseline constructed from the cfDNA of more than a given number of healthy human plasma samples under the same sequencing conditions.
Further preferably, the identification module includes:
a gene family forming unit, configured to form gene families based on the molecular tags, wherein, within each gene family, the edit distance between the molecular tags is smaller than a first preset value, and the difference between the start positions of reads with the same molecular tag is within a second preset value;
a filtering unit, configured to filter the gene families according to rules, where the rules include: for a family with a corresponding double-stranded molecular tag, the number of reads contained in the family is not less than a third preset value; for a family without a corresponding double-stranded molecular tag, the number of reads contained in the family is not less than a fourth preset value.
In yet another aspect, the present invention further provides a ctDNA performance analysis device based on liquid biopsy, comprising:
a data preprocessing module, configured to preprocess the multidimensional performance parameters to be analyzed for the plasma sample to be tested, the multidimensional performance parameters comprising the gene mutation parameters;
a feature selection module, connected to the data preprocessing module and configured to perform feature screening on each of the multidimensional performance parameters to be analyzed;
a model construction module, connected to the feature selection module and configured to construct a performance analysis model for each of the multidimensional performance parameters to be analyzed and to construct a multidimensional omics integrated model, wherein the output of each performance analysis model is connected to the input of the multidimensional omics integrated model, and the performance analysis models include the mutation analysis model, which analyzes the gene mutation parameters;
a performance analysis module, connected to the model construction module and configured to input the features of the multidimensional performance parameters screened by the feature selection module into the corresponding trained performance analysis models for preliminary analysis, after which the multidimensional omics integrated model further analyzes the preliminary results of the performance analysis models to obtain a comprehensive analysis result for the multidimensional performance parameters, completing the performance analysis of the ctDNA of the plasma sample to be tested.
Further preferably, the multidimensional performance parameters further include clinical data and/or methylation level and/or tumor marker concentration, the features obtained by screening by the feature selection module include clinical data features and/or methylation level features and/or preset tumor marker concentration features, and the multiple performance analysis models constructed by the model construction module include clinical models and/or methylation analysis models and/or tumor marker models;
in the performance analysis module, after the created mutation analysis model, clinical model and/or methylation analysis model and/or tumor marker model performs preliminary analysis on corresponding parameters, the multidimensional omics integrated model further analyzes the preliminary analysis results of the performance analysis model to obtain comprehensive analysis results of multidimensional performance parameters.
Further preferably, in the model construction module, after the performance analysis models and the multidimensional omics integrated model have been constructed, the parameters giving the largest AUC among the prediction results of each performance analysis model are selected as the trained model, and the multidimensional omics integrated model is then trained on the basis of these trained models; the multidimensional omics integrated model is constructed by sample splitting and cross-validation.
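The selection step described above can be illustrated with a minimal sketch. It assumes that each candidate parameter setting of a performance analysis model has already produced one score per validation sample; the function names `auc`, `select_best` and `stack_inputs` are hypothetical and do not come from the patent. The AUC is computed with the rank-based (Mann-Whitney) formulation, and the integrated model is represented only by the feature rows it would receive, not by a concrete learner.

```python
from typing import Dict, List, Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def select_best(labels: Sequence[int],
                candidates: Dict[str, Sequence[float]]) -> Tuple[str, float]:
    """Return the candidate parameter setting with the largest AUC."""
    scored = ((name, auc(labels, s)) for name, s in candidates.items())
    return max(scored, key=lambda item: item[1])


def stack_inputs(per_model_scores: Dict[str, Sequence[float]]
                 ) -> Tuple[List[str], List[List[float]]]:
    """Arrange each model's per-sample output as one input row per sample
    for the integrated model (hypothetical layout)."""
    names = sorted(per_model_scores)
    rows = [list(row) for row in zip(*(per_model_scores[n] for n in names))]
    return names, rows
```

In this layout each column of the stacked rows corresponds to one single-omics model, which mirrors the statement that the outputs of the performance analysis models are connected to the input of the integrated model.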
The ctDNA mutation degree analysis method and device based on liquid biopsy and the ctDNA performance analysis device based on liquid biopsy provided by the invention can at least bring the following beneficial effects:
1. Gene mutations in the plasma sample to be tested are detected with a purpose-designed pan-cancer gene mutation panel on an NGS (next-generation sequencing) platform, and the number and mutation frequency of each class of gene mutation are counted according to the defined mutation classes and used as gene mutation analysis indexes. Compared with mutation detection by PCR, and with NGS methods that identify only single mutations, the liquid-biopsy-based ctDNA mutation degree analysis method and device combine multiple gene mutation analysis indexes and analyze the mutation degree with a mutation analysis model. This addresses the loss of detection sensitivity caused by false positives at a single mutation site, overcomes the limit on the number of sites that a PCR platform can test, and provides a basis for subsequently determining whether the plasma sample to be tested derives from cancer tissue. In particular, it can improve detection sensitivity for certain benign nodules and early-stage cancer patients, effectively assisting early diagnosis and early screening of cancer and improving screening efficiency and precision. Compared with the traditional single clinical tumor marker protein CEA and with routine clinical PET-CT screening, ctDNA mutation modeling based on multiple gene mutation analysis index features is more sensitive.
2. In addition to the mutation analysis model for ctDNA mutations, corresponding performance analysis models are created for the other performance parameters (clinical data and/or methylation level and/or preset tumor marker concentration), together with a multidimensional omics integrated model that makes a further comprehensive evaluation from each performance analysis model's preliminary analysis of its own parameters (different dimensions). A single performance analysis model has clear limits: when serum tumor marker proteins such as CA125, CEA and AFP are used alone to assist diagnosis, sensitivity and specificity are low because these proteins can also be detected in non-tumor patients; when ctDNA mutation detection is used alone, sensitivity and specificity are poor because human tissues release trace amounts of mutated DNA into plasma and the instrument can introduce technical errors that cause false positives; and when methylation signals at ctDNA CpG sites in plasma are used alone, sensitivity and specificity are also low. By contrast, in the integrated model the different omics data supplement and correct one another, which can improve the specificity of the model and the overall specificity and predictive performance of the diagnosis, and assists the physician's subsequent diagnosis. In addition, every omics layer in the model is measured non-invasively, which is convenient for clinical use, and as much information as possible about the plasma sample to be tested is collected so that its condition can be understood more comprehensively.
In addition, in the practical application of assisting in lung cancer screening, a sample with positive LDCT (low-dose spiral CT) detection can be used as a plasma sample to be detected, so that the characteristic of high LDCT detection sensitivity is kept, and the overall diagnosis sensitivity is improved.
Drawings
The foregoing features, technical features, advantages and embodiments are further described in the following detailed description of the preferred embodiments, which is to be read in connection with the accompanying drawings.
FIG. 1 is a schematic flow chart of an embodiment of the ctDNA mutation degree analysis method based on liquid biopsy according to the present invention;
FIG. 2 is a graph of the ctDNA mutation ROC in one example;
FIG. 3 is a schematic structural diagram of an embodiment of a ctDNA mutation analysis device based on liquid biopsy according to the present invention;
FIG. 4 is a schematic structural diagram of a ctDNA performance analysis device based on liquid biopsy according to the present invention;
FIG. 5 is a schematic structural diagram of a terminal device in the present invention.
Reference numerals:
110-capture sequencing module, 120-molecular tag extraction module, 130-file formation module, 140-identification module, 150-parameter statistics module, 160-mutation evaluation module, 310-data preprocessing module, 320-feature selection module, 330-model construction module and 340-performance analysis module.
Detailed Description
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following description will be made with reference to the accompanying drawings. It is obvious that the drawings in the following description are only some examples of the invention, and that for a person skilled in the art, other drawings and embodiments can be derived from them without inventive effort.
In a first embodiment of the present invention, a ctDNA mutation degree analysis method based on liquid biopsy, as shown in FIG. 1, includes: S10, performing capture sequencing on a plasma sample to be tested according to a pre-created gene mutation panel to obtain a FASTQ file, wherein the cfDNA in the plasma sample to be tested carries pre-attached molecular tags; S20, extracting the molecular tags from the paired reads in the FASTQ file and storing them as a uBAM file; S30, aligning the sequences in the FASTQ file to a reference genome and de-duplicating to obtain a BAM file, and merging the BAM file with the uBAM file to obtain a BAM file containing the molecular tags; S40, aggregating and de-duplicating the reads in the BAM file according to the molecular tags; S50, obtaining the sample's original mutation set within the gene mutation panel region by using a pileup method; S60, counting the gene mutation parameters in the sample's original mutation set, wherein the gene mutation parameters comprise: the gene mutation levels, and the number of gene mutations and the mutation frequency at each level; S70, filtering the sample's original mutation set according to pre-constructed filtering rules, and counting the gene mutation parameters of each sample; S80, evaluating the mutation degree of the plasma sample to be tested with a pre-constructed mutation analysis model according to the sample's gene mutation parameters.
In this example, after the BCL files of the NovaSeq sequencing run are demultiplexed, FASTQ files are obtained and the residual adapter sequence at the 3' end is processed; the molecular tag (UMI) sequences at the 5' ends of read1 and read2 are extracted and stored in a uBAM file. Meanwhile, the FASTQ file is aligned to the reference genome (hg19) with BWA and de-duplicated to obtain a BAM file, which is merged with the uBAM file to obtain a BAM file carrying the UMIs (the UMI tag information, RX, is merged into the BAM file). This facilitates the subsequent consensus sequence identification, in which reads derived from the same molecule are merged into one read sequence.
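As a rough illustration of the UMI extraction step, the sketch below takes the first few bases at the 5' end of each read in a pair as the UMI and joins the two UMIs into an RX-style tag. It is a simplified stand-in and not the patent's implementation: the UMI length (`UMI_LEN = 3`), the `UMI1-UMI2` tag layout and all function names are assumptions, and a real pipeline would write the result to a uBAM file rather than a dictionary.

```python
from typing import Dict, Iterable, Iterator, Tuple

UMI_LEN = 3  # hypothetical UMI length; the source does not state one

FastqRecord = Tuple[str, str, str]  # (read name, sequence, quality string)


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield (name, sequence, quality) from an iterable of FASTQ lines."""
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        next(it)  # skip the '+' separator line
        qual = next(it).strip()
        name = header.strip()[1:].split()[0]
        yield name, seq, qual


def extract_umi(seq: str, qual: str, umi_len: int = UMI_LEN) -> Tuple[str, str, str]:
    """Split a read into its 5' UMI and the remaining template sequence/quality."""
    return seq[:umi_len], seq[umi_len:], qual[umi_len:]


def tag_pair(r1: FastqRecord, r2: FastqRecord, umi_len: int = UMI_LEN) -> Dict[str, object]:
    """Trim the UMI from both reads of a pair and build an RX-style tag."""
    name1, seq1, qual1 = r1
    name2, seq2, qual2 = r2
    if name1 != name2:
        raise ValueError(f"read names differ: {name1} vs {name2}")
    umi1, tmpl1, q1 = extract_umi(seq1, qual1, umi_len)
    umi2, tmpl2, q2 = extract_umi(seq2, qual2, umi_len)
    return {"name": name1, "RX": f"{umi1}-{umi2}",
            "read1": (tmpl1, q1), "read2": (tmpl2, q2)}
```

Keeping the UMI in a per-pair tag, separate from the template sequence, is what allows it to be merged back onto the aligned reads later.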
In consensus sequence identification (Call Consensus Reads), the reads in the BAM file are aggregated and de-duplicated according to the molecular tags. In this process, gene families are formed based on the molecular tags, and within each gene family the edit distance between the molecular tags is smaller than a first preset value (such as 2 or 3, set according to the actual situation), and the difference between the start positions of reads with the same molecular tag is within a second preset value (such as 1 bp or 2 bp, set according to the actual situation). The gene families are then filtered according to rules, with families that are supported by a double-stranded molecular tag (duplex UMI) treated differently from those that are not: for a family with a corresponding double-stranded molecular tag, the number of reads contained in the family is not less than a third preset value (for example 1, set according to the actual situation); for a family without a corresponding double-stranded molecular tag, the number of reads contained in the family is not less than a fourth preset value (for example 3, set according to the actual situation). Consensus sequences are identified in each gene family that passes the filter, the two groups of consensus sequences are merged, and the result is aligned to the reference genome again to obtain the final BAM file.
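The grouping and filtering rules can be sketched as follows. This is a simplified illustration under stated assumptions, not the patent's or fgbio's algorithm: Hamming distance stands in for edit distance (adequate only for equal-length UMIs), families are built greedily against each family's first read, and the `Read`, `Family`, `build_families` and `filter_families` names are hypothetical. The default thresholds reuse the example values given in the text (edit distance below 2, start difference of 1 bp, at least 1 read for duplex families and 3 for the others).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Read:
    umi: str
    start: int
    duplex: bool = False  # True if the opposite strand of the molecule was also seen


@dataclass
class Family:
    umi: str      # UMI of the first read assigned to the family
    start: int    # start position of the first read assigned to the family
    reads: List[Read] = field(default_factory=list)


def hamming(a: str, b: str) -> int:
    """Mismatch count for equal-length UMIs; unequal lengths count as fully different."""
    if len(a) != len(b):
        return max(len(a), len(b))
    return sum(x != y for x, y in zip(a, b))


def build_families(reads: List[Read], max_umi_dist: int = 2,
                   max_start_diff: int = 1) -> List[Family]:
    """Greedily assign each read to the first family whose UMI distance is
    below max_umi_dist and whose start differs by at most max_start_diff."""
    families: List[Family] = []
    for read in sorted(reads, key=lambda r: (r.start, r.umi)):
        for fam in families:
            if (hamming(read.umi, fam.umi) < max_umi_dist
                    and abs(read.start - fam.start) <= max_start_diff):
                fam.reads.append(read)
                break
        else:
            families.append(Family(read.umi, read.start, [read]))
    return families


def filter_families(families: List[Family], min_duplex: int = 1,
                    min_simplex: int = 3) -> List[Family]:
    """Keep duplex-supported families with >= min_duplex reads and
    other families with >= min_simplex reads."""
    kept = []
    for fam in families:
        is_duplex = any(r.duplex for r in fam.reads)
        if len(fam.reads) >= (min_duplex if is_duplex else min_simplex):
            kept.append(fam)
    return kept
```

The lower read requirement for duplex families reflects the idea that agreement between both strands is itself strong evidence, so fewer reads are needed.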
The sample's original mutation set (Variant Calling), including SNVs (single nucleotide variants), InDels (insertions and deletions), MNVs (multi-nucleotide variants) and the like, is then obtained with the pileup method (samtools pileup) over the gene mutation panel region of the final BAM file. To improve detection efficiency, gene families that were filtered out for containing only 1 read may optionally be added back when the mutation they cover is supported by other consensus sequences. In practice, steps such as UMI extraction, alignment, consensus sequence identification and mutation identification can be carried out with the fgbio software.
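To make the pileup idea concrete, the following toy sketch stacks ungapped reads over a reference string, counts the bases in each column inside a region, and reports every non-reference base with its depth and allele frequency. It handles SNVs only and is an assumption-laden illustration: it does not call samtools, and the `pileup` and `raw_variants` functions and their data layout are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def pileup(reads: Iterable[Tuple[int, str]], region_start: int,
           region_end: int) -> Dict[int, Counter]:
    """Count bases per reference position for ungapped reads.

    reads: (0-based start, sequence) pairs; region is [region_start, region_end).
    """
    columns: Dict[int, Counter] = defaultdict(Counter)
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if region_start <= pos < region_end:
                columns[pos][base] += 1
    return columns


def raw_variants(columns: Dict[int, Counter], reference: str,
                 ref_offset: int = 0) -> List[dict]:
    """List every non-reference base with its depth and variant allele frequency."""
    variants = []
    for pos in sorted(columns):
        counts = columns[pos]
        depth = sum(counts.values())
        ref_base = reference[pos - ref_offset]
        for base, n in sorted(counts.items()):
            if base != ref_base:
                variants.append({"pos": pos, "ref": ref_base, "alt": base,
                                 "depth": depth, "vaf": n / depth})
    return variants
```

The output is deliberately unfiltered: in the workflow described here, noise and germline sites are removed in a later filtering step rather than at the pileup stage.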
Next, the gene mutation parameters in the sample's original mutation set are counted, including but not limited to the gene mutation levels and the number and mutation frequency of the gene mutations at each level; in practice, the gene mutation levels can be defined according to the actual situation to improve identification sensitivity. In one example, the gene mutation levels include level I, level II, level III and level IV, wherein level I includes oncogenes in a preset cancer database; level II includes tumor suppressor genes in the preset cancer database, or other genes functionally determined to be harmful, that are not level I; level III includes exon-region genes that are not level I or level II; and level IV includes genes that are not level I, level II or level III. The gene mutation parameters then comprise 8 indexes: the number of level I mutations, the maximum level I mutation frequency, the number of level II mutations, the maximum level II mutation frequency, the number of level III mutations, the maximum level III mutation frequency, the number of level IV mutations, and the maximum level IV mutation frequency.
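The eight indexes can be derived from a list of level-annotated variants as in the sketch below. The dictionary keys (`level`, `vaf`), the feature names and the choice of 0.0 as the maximum frequency for an empty level are illustrative assumptions; the patent does not specify a data layout.

```python
from typing import Dict, Iterable

LEVELS = ("I", "II", "III", "IV")


def mutation_features(variants: Iterable[dict]) -> Dict[str, float]:
    """Return the count and maximum mutation frequency for each of the four levels.

    variants: dicts with a 'level' (one of LEVELS) and a 'vaf' (mutation frequency).
    A level with no mutations gets count 0 and maximum frequency 0.0 (assumption).
    """
    variants = list(variants)
    features: Dict[str, float] = {}
    for level in LEVELS:
        vafs = [v["vaf"] for v in variants if v["level"] == level]
        features[f"n_level_{level}"] = len(vafs)
        features[f"max_vaf_level_{level}"] = max(vafs, default=0.0)
    return features
```

Each sample is thus reduced to a fixed-length vector of eight numbers, which is the form a downstream classifier expects regardless of how many mutations the sample carries.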
After counting the gene mutation parameters in the original mutation set of the sample, further filtering the original mutation set of the sample according to a pre-constructed filtering rule, and counting the gene mutation parameters of each sample. Specifically, the filtering rules include: germline mutations in peripheral blood leukocytes (BC) above a given frequency (e.g., 10%, 15%, 20%, etc.); blacklisted sites that occur repeatedly in a database specific for a large historical sample of a given panel (embodied as non-somatic mutations that are judged to occur repeatedly in a normal sample that has been sequenced using the same panel), and sites in the database where the population frequency exceeds a set threshold (e.g., 15%, 20%, 25%, etc.); background noise baselines were constructed from cfDNA of more than a given number (e.g., 20, 30, etc., or even more) of healthy human plasma samples under the same sequencing conditions (calculating the probability that the mutation to be tested differs significantly from the baseline, e.g., below a given threshold (e.g., 0.05, 0.06, etc.) is considered background noise). After the filtering is finished, the gene mutation parameters of each sample are further counted from the sample level, and the gene mutation grades, the gene mutation quantity of each grade and the mutation frequency of the samples are counted.
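The three filtering rules can be expressed as a single predicate, sketched below under stated assumptions. The field names `bc_vaf`, `site`, `population_freq` and `baseline_p` are hypothetical, the default cutoffs are the example values from the text, and the background-noise rule follows the wording above literally (a probability below the threshold is treated as noise).

```python
def passes_filters(variant, blacklist,
                   germline_vaf_cutoff=0.15,
                   population_freq_cutoff=0.20,
                   noise_p_cutoff=0.05):
    """Return True if a variant survives all three filtering rules."""
    # Rule 1: germline mutation seen in peripheral blood leukocytes (BC).
    if variant["bc_vaf"] > germline_vaf_cutoff:
        return False
    # Rule 2: panel-specific blacklist site or high population frequency.
    if variant["site"] in blacklist:
        return False
    if variant["population_freq"] > population_freq_cutoff:
        return False
    # Rule 3: background noise relative to the healthy-plasma baseline,
    # applied as worded in the text (below the threshold means noise).
    if variant["baseline_p"] < noise_p_cutoff:
        return False
    return True
```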
Finally, the mutation degree of the plasma sample to be detected is evaluated with a pre-constructed mutation analysis model according to the gene mutation parameters of the sample. The output of the mutation analysis model, namely its prediction of the attributes of the plasma sample to be detected and the associated prediction probability (for example, the probability that the sample comes from a subject with malignant nodules and the probability that it comes from a subject with benign nodules), can provide part of the basis for a physician's subsequent diagnosis. A suitable model can be selected for the mutation analysis model according to the actual situation; in one example, a Support Vector Machine (SVM) model is selected to evaluate the mutation degree of the plasma sample to be detected. In model training, a linear-kernel SVM is used, and the performance of the mutation analysis model is evaluated based on 13-fold cross validation. In each fold, 70% of the samples are randomly selected as the training set and 30% as the test set, and the optimal mutation analysis model is obtained by searching for the optimal hyper-parameters with an exhaustive cross-validated grid search. Finally, an independent sample set is used as a validation set to verify the trained mutation analysis model. It should be clear that the structure of the mutation analysis model and its training method are given only by way of example; in other examples, the structure of the mutation analysis model and its training parameters may be adjusted according to the actual situation and are not specifically limited herein, so long as the purpose of the embodiment can be achieved.
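The resampling and grid-search scheme described above can be sketched independently of any particular classifier. In the sketch below, `fit_and_score` is a hypothetical callable supplied by the user; for the SVM example it would train a linear-kernel SVM with the given hyper-parameter on the training indices and return a score such as AUC on the test indices. The function names, the seed and the use of the mean score to rank candidates are assumptions.

```python
import random


def repeated_split_grid_search(n_samples, grid, fit_and_score,
                               n_rounds=13, train_fraction=0.7, seed=0):
    """Exhaustive search over `grid` using repeated random 70/30 splits."""
    rng = random.Random(seed)
    n_train = round(n_samples * train_fraction)
    splits = []
    for _ in range(n_rounds):
        shuffled = rng.sample(range(n_samples), n_samples)
        splits.append((shuffled[:n_train], shuffled[n_train:]))
    # Score every candidate on the same splits so the means are comparable.
    mean_score = {
        params: sum(fit_and_score(params, train, test)
                    for train, test in splits) / n_rounds
        for params in grid
    }
    best = max(mean_score, key=mean_score.get)
    return best, mean_score
```

Reusing the same splits for every candidate keeps the comparison between hyper-parameter values fair; the model finally selected would still be checked on the independent validation set, as the text describes.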
In other embodiments, before the step S10 of performing capture sequencing on the plasma sample to be detected according to the pre-created gene mutation panel to obtain the FASTQ file, a step of screening gene mutation sites based on big data to form pan-oncogene mutation panel is further included.
Pan-oncogene mutant panel was screened from large data including the tumor genomic database. In one example, the pan-oncogene mutant panel is based on TCGA, ICGC, COSMIC, MSK, etc. foreign mainstream databases and self-constructed Chinese tumor genome data from over 10 million samples. The mutation sites screened by each database are shown in table 1.
Table 1: Mutation sites in each database
Database                       TCGA       ICGC       COSMIC     MSK
Number of cohorts              32         81         48         31
Number of samples              7120       19729      1434488    24592
Number of somatic mutations    1283201    2986166    4138403    3499
In addition to the public databases TCGA, ICGC, COSMIC and MSK, the panel design also incorporated the cancer hotspot mutations reported by CancerSEEK and Genetron_HCC. The finally obtained pan-oncogene mutation panel covered 23 chromosomes, contained 915 tumor hotspot genes, and had a size of 200 kb.
The method for analyzing the degree of ctDNA mutation based on liquid biopsy and the beneficial effects thereof are described below by way of an example:
I. Experimental procedure:
1. plasma cfDNA extraction
BC DNA and cfDNA were extracted according to the protocols of the blood genomic DNA extraction kit and the free nucleic acid extraction kit (magnetic bead method), respectively, using 200 μL of blood cells and 4 mL of plasma, respectively. DNA was quantified with a Qubit analyzer. The quality of the cfDNA was checked on an Agilent Bioanalyzer 2100, and samples without large-fragment genomic contamination (fragments > 600 bp accounting for less than 30%) were selected for subsequent experiments. DNA samples were stored at -80 °C.
2. cfDNA library construction
Library construction was performed according to the KAPA Hyper Prep Kit protocol. The initial amount of cfDNA was not more than 30 ng; adapter ligation was carried out using a Duplex Seq Adapter carrying a specific molecular tag, and the library was constructed at a ratio of 1:200 according to the initial amount. The BC DNA was fragmented by sonication (Covaris M220) into fragments of 100-200 bp for library construction. Libraries were quantified with the Qubit dsDNA HS Assay Kit and quality-controlled on the Agilent Bioanalyzer 2100. Library samples were stored at -20 °C.
3. Library Capture
The capture operation was performed with an equal amount of 1 μg of library (no more than 12 libraries per capture). Targeted region capture was performed with the Panel9_IDT probe by hybridization at 65 °C for 16 h; the captured samples were purified, eluted and amplified by 14 cycles of PCR to obtain the final captured library, which was quantified with Qubit.
4. Sequencing after capture
The captured sample is loaded onto the Illumina platform for sequencing.
II. Data analysis procedure:
1) Data splitting and preparation. The bcl file of the NovaSeq sequencing result is split to obtain FASTQ files, the residual adapter sequence at the 3' end is processed, and the UMI sequences at the 5' ends of read1 and read2 are extracted and stored in a uBAM file. Meanwhile, the FASTQ files are aligned to the reference genome using BWA, and the resulting BAM file is merged with the uBAM file.
2) Consensus sequence identification.
3) Gene mutation identification. Gene mutation identification is performed on the generated BAM file using a customized identification tool.
III. Machine learning modeling
3.1 Two groups of samples were selected: one group of cancer patients (N = 70) and one group of benign nodule patients (N = 70). After data preprocessing, the 8 indexes (the number of gene mutations and the mutation frequency for level I, level II, level III and level IV) were obtained for each sample and used to build the mutation analysis model.
3.2 An independent validation set, including known cancer patients (N = 30) and benign nodule patients (N = 30), was used to validate the constructed mutation analysis model, and the results were counted. As shown in fig. 2, the final area under the ROC curve (AUC) was 0.85, and the sensitivity was 68% at a specificity of 95%.
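For readers who want to reproduce this kind of summary from model scores, the sketch below computes an AUC and the sensitivity at a fixed specificity. It is a generic illustration, not the evaluation code of the embodiment: the function names are hypothetical, the AUC uses the rank-based (Mann-Whitney) estimate, and a sample is called positive when its score is strictly above the threshold.

```python
import math


def auc(pos_scores, neg_scores):
    """Probability that a positive scores higher than a negative (ties count 0.5)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


def sensitivity_at_specificity(pos_scores, neg_scores, specificity=0.95):
    """Pick the lowest threshold that keeps at least `specificity` of the
    negatives at or below it, then report the fraction of positives above it."""
    ranked = sorted(neg_scores)
    # Small epsilon guards against floating-point overshoot in the product.
    k = max(1, math.ceil(specificity * len(ranked) - 1e-9))
    threshold = ranked[k - 1]
    sensitivity = sum(p > threshold for p in pos_scores) / len(pos_scores)
    return sensitivity, threshold
```

Here `pos_scores` would hold the predicted malignancy probabilities of the cancer patients and `neg_scores` those of the benign nodule patients in the validation set.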
The present invention also provides a ctDNA mutation analysis apparatus 100 based on liquid biopsy, as shown in fig. 3, including: a capture sequencing module 110, configured to perform capture sequencing on a plasma sample to be detected according to a pre-created gene mutation panel to obtain a FASTQ file, where the cfDNA in the plasma sample to be detected carries a pre-ligated molecular tag; a molecular tag extraction module 120, configured to extract the molecular tags in the paired reads of the FASTQ file and store them as a uBAM file; a file forming module 130, configured to align the gene sequences of the FASTQ file to a reference genome and perform de-duplication to obtain a BAM file, and to merge the BAM file with the uBAM file to obtain a BAM file containing the molecular tags; an identification module 140, configured to aggregate and de-duplicate the reads in the BAM file according to the molecular tags, and to obtain the original mutation set of the sample in the gene mutation panel region using the pileup method; a parameter statistics module 150, configured to count the gene mutation parameters in the original mutation set of the sample, where the gene mutation parameters include the gene mutation levels, the number of gene mutations at each level and the mutation frequency, and further configured to filter the original mutation set of the sample according to a pre-constructed filtering rule and count the gene mutation parameters of each sample; and a mutation evaluation module 160, configured to evaluate the mutation degree of the plasma sample to be detected using a pre-constructed mutation analysis model according to the gene mutation parameters of the sample.
In this example, the bcl file of the NovaSeq sequencing result is split to obtain FASTQ files, and the residual adapter sequence at the 3' end is processed; the molecular tag (UMI) sequences at the 5' ends of read1 and read2 are extracted and stored in a uBAM file. Meanwhile, the FASTQ files are aligned to the reference genome (hg19) using BWA and de-duplicated to obtain a BAM file, which is merged with the uBAM file to obtain a BAM file carrying the UMIs (the UMI tag information RX is merged into the BAM file). This facilitates the subsequent consensus sequence identification, in which reads derived from the same molecule are merged into a single read sequence.
The identification module comprises: a gene family forming unit for forming a gene family based on the molecular tag, wherein: the editing distance between the molecular labels is smaller than a first preset value, and the difference between the initial positions of reads with the same molecular labels is a second preset value; and the gene family filtering unit is used for filtering the gene family according to a rule, wherein the rule comprises the following components: for family with a corresponding double-stranded molecular label, the number of reads contained in the family is not less than a third preset value; for family without corresponding double-stranded molecular labels, the number of reads contained in the family is not less than a fourth preset value; the consistent sequence identification unit is used for identifying consistent sequences in each gene family; and the alignment unit is used for aligning the gene sequence of the identified consistent sequence with the reference genome.
In consensus sequence identification (Call Consensus Reads), reads in the BAM file are aggregated and de-duplicated according to their molecular tags. Specifically, gene families are formed based on the molecular tags, and in each gene family: the edit distance between the molecular tags is smaller than a first preset value (such as 2, 3, and the like, which can be set according to the actual situation), and the difference between the start positions of reads with the same molecular tag is a second preset value (such as 1 bp, 2 bp, and the like, which can be set according to the actual situation). The gene families are then filtered according to rules that treat families with and without double-stranded molecular tag (duplex UMI) support differently: for a family with a corresponding double-stranded molecular tag, the number of reads contained in the family is not less than a third preset value (for example, 1, which can be set according to actual conditions); for a family without a corresponding double-stranded molecular tag, the number of reads contained in the family is not less than a fourth preset value (for example, 3, which can be set according to actual conditions). Consensus sequences are identified in each filtered gene family, the two groups of consensus sequences are merged, and the merged sequences are aligned to the reference genome again to obtain the final BAM file.
The identification module then uses the pileup method (samtools pileup) to obtain the original mutation set of the sample (Variant Calling) for the gene mutation panel region of the final BAM file, including SNVs (single nucleotide variants), InDels (insertions and deletions), MNVs (multi-nucleotide variants), and the like. In this process, to improve detection efficiency, gene families that were filtered out because they contain only 1 read, but whose covered mutation is supported by other consensus sequences, can optionally be added back. In practical application, UMI extraction, alignment, consensus sequence identification, gene mutation identification and the like can be carried out with the fgbio software.
Then, the parameter statistics module counts the identified gene mutation parameters, including but not limited to the gene mutation levels, the number of gene mutations at each level and the mutation frequency; in practical application, the gene mutation levels can be set according to the actual situation in order to improve identification sensitivity. In one example, the gene mutation levels include level I, level II, level III and level IV, wherein level I includes oncogenes in a predetermined cancer database, level II includes cancer suppressor genes or other cancer suppressor genes that are functionally determined to be harmful and are not in the predetermined cancer database, level III includes exon region genes in level I and not in level II, and level IV includes genes in level I, not in level II and not in level III. The gene mutation parameters then comprise 8 indexes: the number of mutations in level I, the maximum mutation frequency in level I, the number of mutations in level II, the maximum mutation frequency in level II, the number of mutations in level III, the maximum mutation frequency in level III, the number of mutations in level IV, and the maximum mutation frequency in level IV. After the gene mutation parameters in the original mutation set of the sample are counted, the original mutation set of the sample is further filtered according to a pre-constructed filtering rule, and the gene mutation parameters of each sample are counted.
Specifically, the filtering rules include: germline mutations in peripheral blood leukocytes (BC) at more than a given frequency (example: 15%); blacklisted sites that occur repeatedly in a database specific for a large number of historical samples of a given panel, and sites in the database where the population frequency exceeds a set threshold (example: 20%); background noise baselines were constructed from cfDNA of more than a given number of healthy human plasma samples under the same sequencing conditions (calculating the probability that the mutation to be tested is significantly different from the baseline, and considered background noise if below a given threshold (e.g.: 0.05)). After the filtering is finished, the gene mutation parameters of each sample are further counted from the sample level, and the gene mutation grades, the gene mutation quantity of each grade and the mutation frequency of the samples are counted.
Finally, the mutation evaluation module evaluates the mutation degree of the plasma sample to be detected with a pre-constructed mutation analysis model according to the gene mutation parameters of the sample. The output of the mutation analysis model, namely its prediction of the attributes of the plasma sample to be detected and the associated prediction probability (for example, the probability that the sample comes from a subject with malignant nodules and the probability that it comes from a subject with benign nodules), can provide part of the basis for a physician's subsequent diagnosis. A suitable model can be selected for the mutation analysis model according to the actual situation; in one example, a Support Vector Machine (SVM) model is selected to evaluate the mutation degree of the plasma sample to be detected. In model training, a linear-kernel SVM is used, and the performance of the mutation analysis model is evaluated based on 13-fold cross validation. In each fold, 70% of the samples are randomly selected as the training set and 30% as the test set, and the optimal mutation analysis model is obtained by searching for the optimal hyper-parameters with an exhaustive cross-validated grid search. Finally, an independent sample set is used as a validation set to verify the trained mutation analysis model. It should be clear that the structure of the mutation analysis model and its training method are given only by way of example; in other examples, the structure of the mutation analysis model and its training parameters may be adjusted according to the actual situation and are not specifically limited herein, so long as the purpose of the embodiment can be achieved.
In other embodiments, the ctDNA mutation analysis device further comprises a gene mutation panel creation module for screening gene mutation sites based on big data to form a pan-oncogene mutation panel. The pan-oncogene mutation panel is screened from big data including tumor genome databases. In one example, the pan-oncogene mutation panel is based on mainstream foreign databases such as TCGA, ICGC, COSMIC and MSK, together with self-constructed Chinese tumor genome data from over 10 million samples. The mutation sites screened from each database are shown in Table 1. To improve the accuracy of the pan-oncogene mutation panel, the panel design also incorporated the cancer hotspot mutations reported by CancerSEEK and Genetron_HCC, in addition to the public databases TCGA, ICGC, COSMIC and MSK. The finally obtained pan-oncogene mutation panel covered 23 chromosomes, contained 915 tumor hotspot genes, and had a size of 200 kb.
On this basis, the present invention also provides a ctDNA performance analysis device 300 based on liquid biopsy, as shown in fig. 4, comprising: the data preprocessing module 310 is configured to perform preprocessing operation on multidimensional performance parameters to be analyzed in a plasma sample to be detected, where the multidimensional performance parameters include the gene mutation parameters; the characteristic selection module 320 is connected with the data preprocessing module 310 and is used for respectively performing characteristic screening on the multidimensional performance parameters to be analyzed; the model building module 330 is connected to the feature selection module 320, and is configured to build performance analysis models and multidimensional omics integrated models respectively for the multidimensional performance parameters to be analyzed, wherein outputs of the multiple built performance analysis models are respectively connected to inputs of the multidimensional omics integrated models, and the multiple built performance analysis models include the mutation analysis model and are used for analyzing gene mutation parameters; and the performance analysis module 340 is connected with the model construction module 330, and is configured to input the features of the multidimensional performance parameters screened by the feature selection module into the trained corresponding performance analysis models respectively for preliminary analysis, and the multidimensional omics integrated model further analyzes the preliminary analysis results of the performance analysis models to obtain comprehensive analysis results for the multidimensional performance parameters, so as to complete performance analysis of the ctDNA of the plasma sample to be detected.
In the ctDNA performance analysis device, the multidimensional performance parameters comprise clinical data and/or methylation level and/or tumor marker concentration besides the gene mutation parameters, the characteristics obtained by screening by the characteristic selection module comprise clinical data characteristics and/or methylation level characteristics and/or preset tumor marker concentration characteristics, and the multiple performance analysis models constructed by the model construction module comprise clinical models and/or methylation analysis models and/or tumor marker models. The method comprises the steps of establishing corresponding models aiming at clinical data and/or methylation levels and/or preset tumor marker concentrations, and simultaneously establishing a multi-dimensional omics integrated model to comprehensively evaluate ctDNA performance from different dimensions. In order to improve the specificity of the model, in practical application, a mutation analysis model, a clinical model, a methylation analysis model, a tumor marker model and a multidimensional omics integrated model can be created simultaneously, wherein the mutation analysis model analyzes ctDNA mutation of a plasma sample to be detected based on gene mutation parameters, the clinical model analyzes ctDNA performance of the plasma sample to be detected based on clinical data, the methylation analysis model analyzes the methylation degree of the plasma sample to be detected based on methylation level characteristics, the tumor marker model analyzes the ctDNA performance of the plasma sample to be detected based on tumor marker concentration, and finally output results of the four models are input into the multidimensional omics integrated model for further comprehensive analysis. 
For each performance parameter, the analysis result of the created performance analysis model includes, but is not limited to, the property of the plasma to be detected and its probability, such as probability from a healthy person.
When the methylation level is included in the multi-dimensional performance parameters, the panel creating module is included in the feature selection module, and before the methylation analysis model is created, a panel (gene mutation panel, methylation panel, tumor marker protein panel and the like) is created for a corresponding index according to application requirements, wherein the pan-oncogene mutation panel is created through the process of creating the panel, such as the process of creating the panel in the ctDNA mutation analysis method and the device, and is screened by big data including a tumor genome database. In one example, the pan-oncogene mutant panel is based on TCGA, ICGC, COSMIC, MSK, etc. foreign mainstream databases and self-constructed Chinese tumor genome data from over 10 million samples. The pan-oncogene mutant panel obtained covered 23 chromosomes, contained 915 tumor hotspot genes, and had a size of 200kb and a sequencing depth of 35,000X.
When creating a methylated panel, the panel creation module comprises: the sample selecting unit is used for acquiring methylation modification data of tumor tissues and normal tissues of the pan-cancer cohort recorded in the public database and methylation modification data of peripheral blood of the healthy person recorded in the public data set, and selecting a tissue sample of the healthy person and a tissue sample of the cancer tissue from the methylation modification data; the significant difference site screening module is used for screening a first significant methylation level difference site between the cancer tissue and the tissue beside the cancer and screening a second significant methylation level difference site between the cancer tissue and the blood cells of the healthy human; and the core site acquisition module is used for combining the first methylation level difference significant site and the second methylation level difference significant site to obtain a core site of the methylated panel so as to finish the creation of the methylated panel.
In the process, because the cfDNA in the blood plasma of the healthy person is mainly derived from blood cells, and the blood plasma of the cancer patient also contains ctDNA released by cancer tissues, in addition to screening a first significant methylation level difference site (DMP) between the cancer tissues and the paracancerous tissues, a second significant methylation level difference site between the cancer tissues and the blood cells of the healthy person is further screened, and then two significant methylation level difference sites are combined to obtain a difference interval DMR which is used as a core site of the methylated panel, so that the difference of the methylated panel between the cancer patients and the healthy persons is maximized. In other embodiments, for convenience of panel design, the difference intervals DMR obtained by combining may be further combined, for example, two DMPs having a spacing of not more than 250bp may be combined in one difference interval DMR. In one example, a methylated panel was created using 450K methylation data of the public database TCGA combining sites of significant differences in methylation levels of cancer and paracancerous tissues, the panel being 1.1Mb in size and 1500X in sequencing depth.
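The merging of neighbouring differential sites into difference intervals can be sketched as below. The input is assumed to be the union of the first and second sets of significant sites as `(chromosome, position)` pairs, and the function name and return format are hypothetical; the 250 bp default comes from the example in the text.

```python
def merge_dmps(positions, max_gap=250):
    """Merge (chrom, pos) sites into (chrom, start, end) intervals, joining
    consecutive sites on the same chromosome that are at most max_gap bp apart."""
    regions = []
    for chrom, pos in sorted(set(positions)):
        last = regions[-1] if regions else None
        if last and last[0] == chrom and pos - last[2] <= max_gap:
            last[2] = pos            # extend the current interval
        else:
            regions.append([chrom, pos, pos])
    return [tuple(r) for r in regions]
```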
In order to further improve the detection efficiency, before screening a first site with significant methylation level difference between the cancer tissue and the para-cancer tissue and screening a second site with significant methylation level difference between the cancer tissue and the blood cells of a healthy person, the method further comprises the step of screening CpG sites in the cancer tissue, and specifically comprises the following steps: selecting CpG sites meeting preset conditions from randomly selected partial cancer tissue samples (such as 1/2 samples, 2/3 samples, 3/4 samples and the like) in a plurality of times (such as 5 times, 10 times, 15 times or more); and further screening the CpG sites obtained by each screening, and taking the intersection as the final selected CpG site. In this way, a first number (e.g., 400, 500, 600, etc. or even more) of CpG sites that are most significantly differentiated between the cancer tissue and the paracarcinoma tissue are screened based on all cancer tissue samples and the selected CpG sites as first sites with significant methylation level differences; screening a second number (such as 4500, 5000, 5500 and more) of CpG sites with the most significant differences between cancer tissues and healthy human blood cells based on all cancer tissue samples and the selected CpG sites as a second significant methylation level difference site, and finally combining the two parts to obtain the significant methylation level difference site which is the core site of the methylated panel.
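The repeated subsampling and intersection step described above might look like the following sketch. Each sample is assumed to be a mapping from CpG site to methylation level, and `passes` is a hypothetical user-supplied predicate standing in for the preset conditions; the defaults of 5 rounds and 2/3 of the samples are example values from the text.

```python
import random


def stable_cpg_sites(samples, passes, n_rounds=5, fraction=2 / 3, seed=0):
    """Keep only the CpG sites that satisfy `passes` in every random subset."""
    rng = random.Random(seed)
    n_pick = max(1, round(len(samples) * fraction))
    all_sites = set().union(*samples)
    stable = None
    for _ in range(n_rounds):
        subset = rng.sample(samples, n_pick)   # same subset size every round
        passed = {site for site in all_sites if passes(site, subset)}
        stable = passed if stable is None else stable & passed
    return stable
```

Taking the intersection across rounds favours sites whose signal does not depend on a few individual samples.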
In the screening of CpG sites satisfying the predetermined condition in this embodiment, the number of cancer tissue samples selected each time is the same for the same methylated panel, for example, CpG sites satisfying the predetermined condition are sequentially screened from 2/3 randomly selected cancer tissue samples in 5 times. Specifically, the preset conditions for screening CpG sites include: a false discovery rate FDR of the statistical test between the healthy human sample and the cancer tissue sample is less than a first preset threshold (e.g., 0.001, 0.005, 0.01, 0.05, 0.1, 0.2, etc.); the sum of the mean value and the standard deviation of the blood cells of the healthy person is less than a second preset threshold (such as 0.05, 0.1, 0.2, 0.5 and the like); filtering CpG sites of non-CpG islands and related areas (such as filtering Open Sea areas, etc.); the mean value in the cancer tissue is not less than a third predetermined threshold (e.g., 0.1, 0.2, 0.3, 0.5, etc.); and the sum of the mean and the standard deviation of the paracancer normal tissues (the normal tissues corresponding to the cancer species should be selected as much as possible) is less than a fourth preset threshold (such as 0.05, 0.1, 0.2, 0.5 and the like). It should be clear that in practical applications, the selection conditions for CpG sites can be set according to practical situations, and even some of the conditions can be selected as the basis for selection. When the tumor marker protein panel is created, the panel is combined and designed according to different types of tumor markers (embryonic antigen, glycoprotein antigen, protein, enzyme or isoenzyme, hormone and the like) and related cancer types of the tumor markers, so that the panel for detection comprises multiple types of indexes and covers wider cancer types, and the sensitivity and specificity of panel detection are improved to the greatest extent. 
In practical application, the tumor markers can be selected according to specific conditions, and corresponding types can be selected as much as possible in order to improve the detection precision.
After the creation of the panel is completed, the corresponding models are further created according to the indexes. For the clinical model, clinical data can be selected according to actual conditions, such as the patient's age, sex, nodule size, drinking history and other clinical characteristics. In the process of training the clinical model, the features that differ substantially between benign and malignant nodules are retained, the model with the largest AUC is selected, and the prediction result of the test sample (the probability of being predicted as a benign or malignant nodule) is retained for subsequent training of the multidimensional omics integrated model. The clinical model can be created by methods such as logistic regression or SVM, and is not limited herein.
For the tumor marker model, feature screening is performed on the concentrations of the various blood tumor markers of the patients in the training set. In the process of training the tumor marker model, the features that differ substantially between benign and malignant nodules are retained (these can be chosen according to the actual situation by degree of difference, for example the top 20% or 30% of features ranked by degree of difference), the model with the largest AUC is selected, and the prediction result of the test sample (the probability of being predicted as a malignant nodule) is retained for subsequent training of the multidimensional omics integrated model. The tumor marker model can be created by methods such as logistic regression or SVM, and is not particularly limited herein.
For the mutation analysis model, cfDNA mutations and their mutation frequencies are detected for each patient based on the ctDNA somatic mutation detection panel. Feature screening or feature transformation is performed on the cfDNA mutations of the patients in the training set, the model with the largest AUC is selected, and the prediction result of the test sample (the probability of being predicted as a malignant nodule) is retained for subsequent training of the multidimensional omics integrated model. The mutation analysis model can be created by methods such as logistic regression or SVM, and is not particularly limited herein.
For the methylation analysis model, cfDNA methylation sites and their degree of methylation are detected for each patient based on the ctDNA methylation modification detection panel. Feature screening is performed on the cfDNA methylation sites of the patients in the training set, the features that differ substantially between benign and malignant nodules are retained (these can be chosen according to the actual situation by degree of difference, for example the top 20% or 30% of features ranked by degree of difference), the model with the largest AUC is selected, and the prediction result of the test sample (the probability of being predicted as a malignant nodule) is retained for subsequent training of the multidimensional omics integrated model. The methylation analysis model can be created by methods such as logistic regression or SVM, and is not particularly limited herein. Before the methylation analysis model is constructed and trained, a log2(x + 1) transformation is applied to the methylation level of each methylation linkage region, with missing data filled with the median of the same sample group for that methylation linkage region, where x represents the methylation level of the methylation linkage region; normalization is then performed according to the formula z = (x - mean(X)) / std(X) to calculate the z-score, where X represents the methylation levels of the corresponding MCB in the same sample group. Feature screening is then further performed on the methylation linkage regions using recursive feature elimination with cross-validation (RFECV) to optimize the model.
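The preprocessing of one methylation linkage region across the samples of a group can be sketched as below. The text does not state whether the median fill happens before or after the log transform, nor which standard deviation estimator is used; this sketch assumes filling first and the population standard deviation, and the function name is hypothetical.

```python
import math
import statistics


def zscore_log2(levels):
    """levels: methylation levels of one region across a sample group,
    with None marking missing data. Returns the z-scores."""
    observed = [x for x in levels if x is not None]
    median = statistics.median(observed)
    filled = [median if x is None else x for x in levels]   # median fill
    logged = [math.log2(x + 1) for x in filled]             # log2(x + 1)
    mean = statistics.mean(logged)
    std = statistics.pstdev(logged)
    # A constant region has no spread; report zeros instead of dividing by 0.
    return [(v - mean) / std if std > 0 else 0.0 for v in logged]
```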
To further improve detection accuracy, a step of screening the methylation-linked regions is included before the log2(x + 1) transformation of the methylation level of each methylation-linked region. The step comprises: performing capture sequencing on cancer tissue samples and healthy human tissue samples with the pre-established methylation panel; for one cancer type, calculating the degree of difference of each methylation-linked region between the cancer tissue samples and the healthy human tissue samples with six tests, namely analysis of variance (ANOVA), Fisher's exact test, the chi-square test, the Wilcoxon rank-sum test, the Mann-Whitney test, and Student's t-test; and screening the methylation-linked regions according to the results, retaining a region as significantly different when at least 4 of the 6 tests give a p value between the cancer tissue samples and the healthy human tissue samples below a preset value (which can be set according to the actual situation, for example 0.1). The methylation analysis model is then trained on the retained methylation-linked regions. In other embodiments, the tests used to calculate the degree of difference can be adjusted to the practical application, for example to tests based on the binomial or Poisson distribution, as long as the object of the invention is achieved.
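The "at least 4 of 6 tests" rule amounts to a simple vote over precomputed p values. The sketch below assumes the six p values per region have already been produced by a statistics package; the region names, test keys, and p values are hypothetical.

```python
def screen_regions(p_values, alpha=0.1, min_votes=4):
    """Keep regions for which at least `min_votes` tests give p < alpha.

    p_values: {region_id: {test_name: p_value}}
    Returns the retained region ids in input order.
    """
    kept = []
    for region, tests in p_values.items():
        votes = sum(1 for p in tests.values() if p < alpha)
        if votes >= min_votes:
            kept.append(region)
    return kept


# Hypothetical p values for two methylation-linked regions.
p_values = {
    "MCB_1": {"anova": 0.01, "fisher": 0.03, "chi2": 0.04,
              "wilcoxon": 0.02, "mann_whitney": 0.02, "t_test": 0.20},
    "MCB_2": {"anova": 0.05, "fisher": 0.30, "chi2": 0.50,
              "wilcoxon": 0.08, "mann_whitney": 0.09, "t_test": 0.40},
}
kept = screen_regions(p_values)
```

Here MCB_1 passes with five significant tests, while MCB_2 has only three and is dropped. Requiring agreement between several tests makes the selection less sensitive to the assumptions of any single test.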
Further, when the ctDNA performance analysis device includes a methylation analysis model, it further includes a methylation feature screening module comprising: a plasma sample processing unit, configured to perform capture sequencing on the plasma sample to be tested with the pre-established methylation panel and to carry out preprocessing operations (including de-duplication, filtering, sorting, and index building) to obtain a BAM file; a linkage region dividing unit, configured to divide the BAM file according to a predefined division rule to obtain methylation-correlated blocks (MCBs), where the division rule includes: the Pearson correlation coefficient between any two adjacent CpG sites in the same methylation-linked region is greater than a preset value, and the number of CpG sites in the same methylation-linked region is greater than a preset number; a methylation level calculation unit, configured to calculate the methylation level of each methylation-linked region; and a methylation degree evaluation unit, configured to evaluate the degree of methylation of the plasma sample to be tested from the methylation levels using the pre-constructed methylation analysis model.
When the methylation-linked regions are divided, the Pearson correlation coefficient between any two adjacent CpG sites in the same MCB is greater than a preset value, the number of CpG sites in the same MCB is greater than a preset number, and the mean of the Beta values of all CpG sites in the MCB is taken as the methylation level of the MCB. Finally, the degree of methylation of the plasma sample to be tested is evaluated from the methylation levels with the pre-constructed methylation analysis model (a logistic model, an SVM model, etc.). If the degree of methylation is judged to be high, the plasma sample may come from a cancer patient; if it is judged to be low, the plasma sample may come from a healthy person. Whether the degree is high or low is decided by the trained methylation analysis model. On this basis, the result can help doctors reach a comprehensive judgment in the subsequent diagnosis, provide part of the basis for the diagnosis, and assist cancer screening, in particular the diagnosis and screening of early cancers. The output of the methylation analysis model, that is, its predicted class for the plasma sample to be tested together with the prediction probability (for example, the probability that the sample corresponds to a malignant nodule and the probability that it corresponds to a benign nodule), can also be used further to provide part of the basis for the doctor's subsequent diagnosis.
The preset value of the Pearson correlation coefficient and the preset number of CpG sites in the same MCB can be set according to the actual application. For example, the preset value of the Pearson correlation coefficient can be set to 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, etc., and the preset number of CpG sites in the same MCB can be set to 3, 4, 5, 6, etc. In one example, the preset value of the Pearson correlation coefficient is 0.9 and the preset number of CpG sites in the same MCB is 3.
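The division rule and the MCB methylation level can be sketched as below, using the example presets (correlation 0.9, 3 CpG sites). The input layout (one list of Beta values per CpG site, sites in genomic order) is hypothetical. The sketch treats the preset number as a minimum block size, which is an assumption; a strict "greater than" reading would change `>=` to `>`.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0  # a constant site is treated as uncorrelated (assumption)
    return cov / math.sqrt(va * vb)


def split_into_mcbs(beta, min_r=0.9, min_sites=3):
    """Group adjacent CpG sites into blocks; returns (start, end) pairs, end exclusive.

    beta[i][j] is the Beta value of CpG site i (genomic order) in sample j.
    """
    blocks, start = [], 0
    for i in range(1, len(beta) + 1):
        # Close the block at the last site or where adjacent correlation drops.
        if i == len(beta) or pearson(beta[i - 1], beta[i]) <= min_r:
            if i - start >= min_sites:
                blocks.append((start, i))
            start = i
    return blocks


def mcb_level(beta, block, sample):
    """Methylation level of one MCB in one sample: mean Beta over its sites."""
    start, end = block
    return sum(beta[i][sample] for i in range(start, end)) / (end - start)


# Toy data: 5 CpG sites x 4 samples; only the first three sites co-vary.
beta = [
    [0.10, 0.20, 0.30, 0.40],
    [0.20, 0.40, 0.60, 0.80],
    [0.15, 0.25, 0.35, 0.45],
    [0.90, 0.10, 0.80, 0.20],
    [0.50, 0.50, 0.60, 0.40],
]
blocks = split_into_mcbs(beta)
level = mcb_level(beta, blocks[0], sample=0)
```

With these toy values a single block covering the first three sites is found, and its level in the first sample is the mean of 0.10, 0.20, and 0.15.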
For the multidimensional omics integrated model, the predictions of each patient's optimal single-omics models are used to build the model on the training set, and the effect of the model is evaluated on an independent validation set. The specific method comprises the following steps:
Data arrangement: for each sample, the predictions of the optimal single-omics models (the mutation analysis model and the clinical model, methylation analysis model, and/or tumor marker model) are collected together with the true diagnosis.
Model construction: the predictions of the optimal single-omics models in the training set are used as the features of the integrated model, and the model is built with 13 rounds of split cross-validation on the training set (10 rounds of StratifiedShuffleSplit plus 3 rounds of StratifiedKFold). In each split, 70% of the training-set samples are randomly selected for hyperparameter tuning and model fitting, and the remaining 30% serve as a test set for evaluating the model. The model is built with an ensemble algorithm such as Random Forest, AdaBoost (Adaptive Boosting), or GBDT (Gradient Boosting Decision Tree). For each sample, the predicted probability of a malignant lung nodule is obtained in each split, and the mean over the 13 splits is taken as the final prediction for that sample. The optimal threshold of the model is obtained from the ROC curve, and the sensitivity, specificity, and accuracy of the model on the training set are calculated.
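The last two steps (averaging the per-split probabilities and deriving a threshold from the ROC curve) can be sketched as follows. The text does not say how the optimal threshold is read off the ROC curve; Youden's J statistic is used here as an assumption. The per-split predictions are hypothetical and would normally come from the fitted ensemble model; three splits are shown instead of 13 for brevity.

```python
def average_over_splits(split_predictions):
    """Mean predicted probability per sample over the splits that scored it.

    split_predictions: list of {sample_id: probability}, one dict per split.
    """
    totals, counts = {}, {}
    for predictions in split_predictions:
        for sample_id, prob in predictions.items():
            totals[sample_id] = totals.get(sample_id, 0.0) + prob
            counts[sample_id] = counts.get(sample_id, 0) + 1
    return {s: totals[s] / counts[s] for s in totals}


def metrics_at(labels, scores, threshold):
    """Sensitivity, specificity, accuracy; score >= threshold means malignant.

    Both classes must be present, otherwise a division by zero occurs.
    """
    tp = sum(1 for s in scores if labels[s] == 1 and scores[s] >= threshold)
    tn = sum(1 for s in scores if labels[s] == 0 and scores[s] < threshold)
    n_pos = sum(1 for s in scores if labels[s] == 1)
    n_neg = len(scores) - n_pos
    return tp / n_pos, tn / n_neg, (tp + tn) / len(scores)


def youden_threshold(labels, scores):
    """Observed score maximizing sensitivity + specificity - 1 (an assumption)."""
    def j(t):
        sens, spec, _ = metrics_at(labels, scores, t)
        return sens + spec - 1
    return max(sorted(set(scores.values())), key=j)


# Hypothetical labels (1 = malignant) and predictions from three splits.
labels = {"a": 1, "b": 1, "c": 0, "d": 0}
splits = [
    {"a": 0.9, "b": 0.6, "c": 0.4, "d": 0.1},
    {"a": 0.7, "b": 0.8, "c": 0.2, "d": 0.3},
    {"a": 0.8, "b": 0.4, "c": 0.6, "d": 0.2},
]
final = average_over_splits(splits)
threshold = youden_threshold(labels, final)
sensitivity, specificity, accuracy = metrics_at(labels, final, threshold)
```

Averaging over the splits smooths out the variation of any single split: sample "b" falls below sample "c" in the third split, but the averaged scores still separate the two classes.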
It should be clear that, in the present invention, models are established for parameters such as clinical data, methylation level, tumor marker concentration, etc., and in other embodiments, in order to further improve the detection accuracy, modeling may also be performed for other parameters, such as plasma miRNA, plasma cfDNA fragment length, etc.
In one example, the liquid-biopsy-based ctDNA performance analysis device is applied to distinguish 100 patients with benign lung nodules from 100 patients with malignant lung nodules, all pathologically confirmed. 70% of the samples are used as a training set to build the mutation analysis model, clinical model, methylation analysis model, tumor marker model, and multidimensional omics integrated model, and 30% are used as an independent validation set to evaluate the models. All patients had blood drawn before surgery to detect plasma DNA mutations, methylation levels, and tumor markers.
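A 70/30 split of the 200 patients could look like the sketch below. Whether the split in the example was stratified by diagnosis is not stated; stratification is an assumption here, as are the sample identifiers and the random seed.

```python
import random


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Split sample ids into training and validation sets, class by class.

    labels: {sample_id: class}; returns (train_ids, validation_ids).
    """
    rng = random.Random(seed)
    by_class = {}
    for sample_id, y in labels.items():
        by_class.setdefault(y, []).append(sample_id)
    train, valid = [], []
    for y in sorted(by_class):
        ids = sorted(by_class[y])  # fixed order so the seed is reproducible
        rng.shuffle(ids)
        cut = round(len(ids) * train_fraction)
        train.extend(ids[:cut])
        valid.extend(ids[cut:])
    return train, valid


# Hypothetical ids: 100 benign (B...) and 100 malignant (M...) patients.
labels = {f"B{i:03d}": 0 for i in range(100)}
labels.update({f"M{i:03d}": 1 for i in range(100)})
train_ids, valid_ids = stratified_split(labels)
```

With 100 patients per class this yields 140 training samples and 60 validation samples, 30 of them malignant, which matches the validation set size of 60 mentioned in the example.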
In the test, a machine learning method is first applied separately to the data of four dimensions (plasma DNA mutation, methylation level, tumor markers, and clinical information) to build the corresponding models. The predictions of the 4 models are then integrated to construct the final multidimensional omics integrated model. Finally, the predictive performance of the 4 independent models and of the multidimensional omics integrated model is verified on the independent validation set of 60 samples. In the independent validation set, the AUCs of the 4 independent models for clinical information, tumor markers, plasma DNA mutation, and methylation level are 0.73, 0.67, 0.85, and 0.9 respectively, and the AUC of the multidimensional omics integrated model is 0.95. These results show that the integrated model predicts better than each independent model, and that the multi-omics model reaches a sensitivity of 85% at a specificity of 100%. Compared with a single-omics model, the ctDNA performance analysis device can therefore greatly improve the accuracy of early lung cancer screening, effectively assisting early cancer diagnosis and screening and improving screening efficiency and precision.
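A figure such as "sensitivity at 100% specificity" is obtained by restricting the threshold to values at which no benign sample is called malignant. The following sketch shows one way to compute it; the labels and scores are toy values, not the validation data of the example.

```python
def sensitivity_at_specificity(labels, scores, min_specificity=1.0):
    """Highest sensitivity among thresholds whose specificity is high enough.

    labels: 1 = malignant, 0 = benign; a sample is called malignant when its
    score is >= the threshold. Thresholds are taken from the observed scores.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    best = 0.0
    for t in sorted(set(scores)):
        specificity = sum(1 for s in neg if s < t) / len(neg)
        if specificity >= min_specificity:
            sensitivity = sum(1 for s in pos if s >= t) / len(pos)
            best = max(best, sensitivity)
    return best


# Toy scores: one malignant sample scores below the highest benign sample.
labels = [1, 1, 1, 1, 0, 0, 0]
scores = [0.95, 0.80, 0.70, 0.30, 0.60, 0.20, 0.10]
strict = sensitivity_at_specificity(labels, scores, 1.0)
relaxed = sensitivity_at_specificity(labels, scores, 0.6)
```

In the toy data, three of the four malignant samples score above every benign sample, so the sensitivity at 100% specificity is 0.75; relaxing the specificity requirement to 0.6 recovers the fourth.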
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of program modules is illustrated, and in practical applications, the above-described distribution of functions may be performed by different program modules, that is, the internal structure of the apparatus may be divided into different program units or modules to perform all or part of the above-described functions. Each program module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one processing unit, and the integrated unit may be implemented in a form of hardware, or may be implemented in a form of software program unit. In addition, the specific names of the program modules are only used for distinguishing the program modules from one another, and are not used for limiting the protection scope of the application.
Fig. 5 is a schematic structural diagram of a terminal device provided in an embodiment of the present invention. As shown, the terminal device 200 includes a processor 220, a memory 210, and a computer program 211 stored in the memory 210 and executable on the processor 220, for example a program implementing the liquid-biopsy-based ctDNA mutation degree analysis method. When executing the computer program 211, the processor 220 implements the steps of the embodiments of the liquid-biopsy-based ctDNA mutation degree analysis method described above, or implements the functions of the modules in the embodiments of the liquid-biopsy-based ctDNA mutation degree analysis device described above.
The terminal device 200 may be a notebook, a palm computer, a tablet computer, a mobile phone, or the like. Terminal device 200 may include, but is not limited to, processor 220, memory 210. Those skilled in the art will appreciate that fig. 5 is merely an example of terminal device 200, does not constitute a limitation of terminal device 200, and may include more or fewer components than shown, or some components may be combined, or different components, such as: terminal device 200 may also include input-output devices, display devices, network access devices, buses, and the like.
The Processor 220 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. The general purpose processor 220 may be a microprocessor or the processor may be any conventional processor or the like.
The memory 210 may be an internal storage unit of the terminal device 200, such as a hard disk or a memory of the terminal device 200. The memory 210 may also be an external storage device of the terminal device 200, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash memory card (Flash Card) provided on the terminal device 200. Further, the memory 210 may include both an internal storage unit of the terminal device 200 and an external storage device. The memory 210 is used to store the computer program 211 and other programs and data required by the terminal device 200. The memory 210 may also be used to temporarily store data that has been output or is to be output.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or recited in detail in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the apparatus/terminal device embodiments described above are merely illustrative: the division into modules or units is only a division by logical function, and other divisions are possible in an actual implementation, for example multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in another form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the methods of the embodiments of the present invention may also be implemented by the computer program 211 instructing the relevant hardware. The computer program 211 may be stored in a computer-readable storage medium, and when it is executed by the processor 220, the steps of the method embodiments can be implemented. The computer program 211 comprises computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable storage medium may include: any entity or device capable of carrying the code of the computer program 211, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable storage medium can be increased or decreased as required by legislation and patent practice in the jurisdiction; for example, in certain jurisdictions the computer-readable medium does not include electrical carrier signals and telecommunications signals.
It should be noted that the above embodiments can be freely combined as necessary. The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for persons skilled in the art, numerous modifications and adaptations can be made without departing from the principle of the present invention, and such modifications and adaptations should be considered as within the scope of the present invention.

Claims (6)

1. A ctDNA mutation analysis device based on liquid biopsy, characterized by comprising:
the capture sequencing module is used for performing capture sequencing on a plasma sample to be detected according to a pre-established gene mutation panel to obtain a FASTQ file, wherein cfDNA in the plasma sample to be detected carries a pre-accessed molecular tag;
the molecular tag extraction module is used for respectively extracting the molecular tags in the paired reads in the FASTQ file and storing the molecular tags as uBAM files;
the file forming module is used for aligning the sequences in the FASTQ file to a reference genome and performing de-duplication to obtain a BAM file, and merging the BAM file with the uBAM file to obtain a BAM file containing the molecular tags;
the identification module is used for aggregating and de-duplicating reads in the BAM file according to the molecular label, and obtaining a sample original mutation set in the gene mutation panel region by using a pileup method;
a parameter statistics module, configured to count gene mutation parameters in the sample original mutation set, where the gene mutation parameters include the gene mutation level, and the number of mutations and the mutation frequency at each level; and further configured to filter the sample original mutation set according to a pre-constructed filtering rule and to count the gene mutation parameters of each sample;
the mutation evaluation module is used for evaluating the mutation degree of the plasma sample to be detected by using a pre-constructed mutation analysis model according to the gene mutation parameters of the sample;
the ctDNA mutation analysis device further comprises a filtering module respectively connected with the identification module and the parameter statistics module, and is used for filtering the obtained original mutation set of the sample, wherein the filtering conditions comprise:
germline mutations in peripheral blood leukocytes above a given frequency;
blacklist sites that recur in a database built from a large number of historical samples for the given panel, and sites whose population frequency in the database exceeds a set threshold;
a background noise baseline constructed from the cfDNA of more than a given number of healthy human plasma samples under the same sequencing conditions.
2. The ctDNA mutation analysis device according to claim 1, wherein the gene mutation levels include level I, level II, level III, and level IV, wherein level I includes oncogenes in the cancer-related database, level II includes cancer suppressor genes other than level I or other cancer suppressor genes functionally judged to be harmful in the cancer-related data, level III includes exon region genes other than level I and level II, and level IV includes genes other than level I, level II, and level III;
the gene mutation parameters include: number of mutations of class I, maximum mutation frequency value of class I, number of mutations of class II, maximum mutation frequency value of class II, number of mutations of class III, maximum mutation frequency value of class III, number of mutations of class IV, and maximum mutation frequency value of class IV.
3. The ctDNA mutation analysis apparatus according to claim 1 or 2, comprising, in the recognition module:
a gene family forming unit for forming gene families based on the molecular tags, wherein: the edit distance between the molecular tags is smaller than a first preset value, and the difference between the start positions of reads with the same molecular tag is a second preset value;
a filtering unit, configured to filter the gene families according to a rule, where the rule includes: for a family with a corresponding double-stranded molecular tag, the number of reads in the family is not less than a third preset value; for a family without a corresponding double-stranded molecular tag, the number of reads in the family is not less than a fourth preset value.
4. A ctDNA performance analysis device based on liquid biopsy, characterized by comprising:
a data preprocessing module, configured to perform preprocessing operation on multidimensional performance parameters to be analyzed in a plasma sample to be tested, where the multidimensional performance parameters include a gene mutation parameter according to any one of claims 1 to 3;
the characteristic selection module is connected with the data preprocessing module and is used for respectively carrying out characteristic screening on the multi-dimensional performance parameters to be analyzed;
a model construction module connected with the feature selection module and used for respectively constructing a performance analysis model and a multidimensional omics integrated model aiming at multidimensional performance parameters to be analyzed, wherein the outputs of the constructed performance analysis models are respectively connected with the input of the multidimensional omics integrated model, and the constructed performance analysis models comprise a mutation analysis model as claimed in any one of claims 1 to 3 and are used for analyzing the gene mutation parameters;
and the performance analysis module is connected with the model construction module and is used for respectively inputting the characteristics of the multidimensional performance parameters screened by the characteristic selection module into the corresponding trained performance analysis model for preliminary analysis, and the multidimensional omics integrated model further analyzes the preliminary analysis result of the performance analysis model to obtain a comprehensive analysis result aiming at the multidimensional performance parameters so as to complete the performance analysis of the ctDNA of the plasma sample to be detected.
5. The ctDNA performance analysis device according to claim 4, wherein the multi-dimensional performance parameters further comprise clinical data and/or methylation level and/or tumor marker concentration, the features screened by the feature selection module comprise clinical data features and/or methylation level features and/or preset tumor marker concentration features, and the plurality of performance analysis models constructed by the model construction module comprise clinical models and/or methylation analysis models and/or tumor marker models;
in the performance analysis module, after the created mutation analysis model, clinical model and/or methylation analysis model and/or tumor marker model performs preliminary analysis on corresponding parameters, the multidimensional omics integrated model further analyzes the preliminary analysis results of the performance analysis model to obtain comprehensive analysis results of multidimensional performance parameters.
6. The ctDNA performance analysis device according to claim 4 or 5, wherein after the plurality of performance analysis models and the integrated multidimensional omics model are built in the model building module, the parameter with the largest AUC is selected as the trained model from the prediction results of the performance analysis models, and the integrated multidimensional omics model is further trained based on the trained model, and the integrated multidimensional omics model is built by a sample splitting cross validation method.
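The filtering conditions of claim 1 and the eight gene mutation parameters of claim 2 can be illustrated with a minimal sketch. The record layout (`site`, `level`, `vaf`), the feature names, and the way the noise baseline is applied (a call must exceed the background frequency at its site) are assumptions made for illustration; the claims do not specify them.

```python
LEVELS = ("I", "II", "III", "IV")


def filter_mutations(mutations, germline, blacklist, noise_baseline):
    """Drop calls that match any of the three filtering conditions.

    mutations: list of {"site": str, "level": str, "vaf": float}
    germline: sites seen in peripheral blood leukocytes above a given frequency
    blacklist: recurrent or high-population-frequency sites for the panel
    noise_baseline: {site: background frequency in healthy plasma cfDNA}
    """
    kept = []
    for m in mutations:
        site = m["site"]
        if site in germline or site in blacklist:
            continue
        if m["vaf"] <= noise_baseline.get(site, 0.0):
            continue  # not above background noise (assumed rule)
        kept.append(m)
    return kept


def mutation_parameters(mutations):
    """Number of mutations and maximum mutation frequency for levels I to IV."""
    features = {}
    for level in LEVELS:
        vafs = [m["vaf"] for m in mutations if m["level"] == level]
        features[f"n_level_{level}"] = len(vafs)
        features[f"max_vaf_level_{level}"] = max(vafs, default=0.0)
    return features


# Hypothetical raw mutation set of one plasma sample.
raw = [
    {"site": "chr7:1", "level": "I", "vaf": 0.012},
    {"site": "chr7:2", "level": "I", "vaf": 0.004},
    {"site": "chr17:3", "level": "II", "vaf": 0.020},
    {"site": "chr1:4", "level": "III", "vaf": 0.450},  # germline
    {"site": "chr2:5", "level": "IV", "vaf": 0.001},   # below noise
]
filtered = filter_mutations(
    raw,
    germline={"chr1:4"},
    blacklist=set(),
    noise_baseline={"chr2:5": 0.002},
)
features = mutation_parameters(filtered)
```

The resulting eight values form the input vector of the mutation analysis model; levels without any remaining mutation contribute a count of 0 and a maximum frequency of 0.0.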
CN202110650420.4A 2021-06-10 2021-06-10 ctDNA mutation degree analysis method and device based on liquid biopsy and ctDNA performance analysis device Active CN113257350B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110650420.4A CN113257350B (en) 2021-06-10 2021-06-10 ctDNA mutation degree analysis method and device based on liquid biopsy and ctDNA performance analysis device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110650420.4A CN113257350B (en) 2021-06-10 2021-06-10 ctDNA mutation degree analysis method and device based on liquid biopsy and ctDNA performance analysis device

Publications (2)

Publication Number Publication Date
CN113257350A (en) 2021-08-13
CN113257350B (en) 2021-10-08

Family

ID=77187579

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110650420.4A Active CN113257350B (en) 2021-06-10 2021-06-10 ctDNA mutation degree analysis method and device based on liquid biopsy and ctDNA performance analysis device

Country Status (1)

Country Link
CN (1) CN113257350B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113643759B (en) * 2021-10-15 2022-01-11 臻和(北京)生物科技有限公司 Chromosome stability evaluation method and device based on liquid biopsy, terminal equipment and storage medium
CN113903401B (en) * 2021-12-10 2022-04-08 臻和(北京)生物科技有限公司 ctDNA length-based analysis method and system
CN115578307B (en) * 2022-05-25 2023-09-15 广州市基准医疗有限责任公司 Lung nodule benign and malignant classification method and related products
CN115064211B (en) * 2022-08-15 2023-01-24 臻和(北京)生物科技有限公司 ctDNA prediction method and device based on whole genome methylation sequencing
CN115497561B (en) * 2022-09-01 2023-08-29 北京吉因加医学检验实验室有限公司 Methylation marker layered screening method and device
CN115148283B (en) * 2022-09-05 2022-12-20 北京泛生子基因科技有限公司 Device and computer-readable storage medium for predicting DLBCL patient prognosis based on first-line treatment mid-term peripheral blood ctDNA
CN115565606B (en) * 2022-09-19 2024-02-06 深圳市海普洛斯生物科技有限公司 Detection method, equipment and computer readable storage medium for automatically screening mutation subset
CN116364178B (en) * 2023-04-18 2024-01-30 哈尔滨星云生物信息技术开发有限公司 Somatic cell sequence data classification method and related equipment
CN116994656B (en) * 2023-09-25 2024-01-02 北京求臻医学检验实验室有限公司 Method for improving second generation sequencing detection accuracy

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109652513A (en) * 2019-02-25 2019-04-19 元码基因科技(北京)股份有限公司 The method and kit of liquid biopsy idiovariation are accurately detected based on two generation sequencing technologies
CN112111565A (en) * 2019-06-20 2020-12-22 上海其明信息技术有限公司 Mutation analysis method and device for cell free DNA sequencing data
CN112601826A (en) * 2018-02-27 2021-04-02 康奈尔大学 Ultrasensitive detection of circulating tumor DNA by whole genome integration

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105063208B (en) * 2015-08-10 2018-03-06 北京吉因加科技有限公司 A kind of target dna low frequency mutation enrichment sequence measurement to dissociate in blood plasma
US20190385700A1 (en) * 2018-06-04 2019-12-19 Guardant Health, Inc. METHODS AND SYSTEMS FOR DETERMINING The CELLULAR ORIGIN OF CELL-FREE NUCLEIC ACIDS
CN112029861B (en) * 2020-09-07 2021-09-21 臻悦生物科技江苏有限公司 Tumor mutation load detection device and method based on capture sequencing technology

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Analysis of solid tumor mutation profiles in liquid biopsy"; Sai A. Balaji et al.; Cancer Medicine; 2018-09-27; Vol. 7, No. 11; full text *
"Multilaboratory Assessment of a New Reference Material for Quality Assurance of Cell-Free Tumor DNA Measurements"; Hua-Jun He et al.; The Journal of Molecular Diagnostics; 2019-07-31; Vol. 21, No. 4; full text *

Also Published As

Publication number Publication date
CN113257350A (en) 2021-08-13

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 100191 room 205, 2nd floor, building 9, 35 Huayuan North Road, Haidian District, Beijing

Patentee after: Zhenhe (Beijing) Biotechnology Co.,Ltd.

Patentee after: Wuxi Precision Medical Laboratory Co.,Ltd.

Patentee after: Wuxi Zhenhe Biotechnology Co.,Ltd.

Address before: 100191 room 205, 2nd floor, building 9, 35 Huayuan North Road, Haidian District, Beijing

Patentee before: Zhenhe (Beijing) Biotechnology Co.,Ltd.

Patentee before: Wuxi Precision Medical Laboratory Co.,Ltd.

Patentee before: Wuxi Zhenhe Biotechnology Co.,Ltd.
