CN115662519B - cfDNA fragment characteristic combination and system for predicting cancer based on machine learning - Google Patents

Info

Publication number
CN115662519B
Authority
CN
China
Prior art keywords
cancer
fragment
cfdna
cfdna fragment
algorithms
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211201445.7A
Other languages
Chinese (zh)
Other versions
CN115662519A (en)
Inventor
汪强虎
吴维
吴玲祥
张若寒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ankai Life Technology Suzhou Co ltd
Original Assignee
Nanjing Medical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2022-09-29
Filing date
2022-09-29
Publication date
2023-11-03
Application filed by Nanjing Medical University
Priority to CN202211201445.7A
Publication of CN115662519A
Application granted
Publication of CN115662519B
Legal status: Active
Anticipated expiration

Landscapes

  • Investigating Or Analysing Biological Materials (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The application discloses a cfDNA fragment feature combination, a system, and a use thereof for predicting cancer based on machine learning, and belongs to the technical field of cancer genomics. The fragment feature combination comprises at least one of 41 fragment features. Using the cfDNA fragment feature combination and system for cancer prediction reduces the requirements that cfDNA fragment analysis-based cancer prediction methods place on the upstream experimental end and their dependence on it, substantially broadens the interpretability and utilization of other omics sequencing data, greatly reduces the experimental cost of cfDNA-based tumor diagnosis, and improves the accuracy of cfDNA-based cancer prediction.

Description

cfDNA fragment characteristic combination and system for predicting cancer based on machine learning
Technical Field
The application belongs to the technical field of cancer genomics, and particularly relates to a cfDNA fragment feature combination, a system, and a use thereof for predicting cancer based on machine learning.
Background
Cell-free DNA (cfDNA, also called circulating free DNA) in blood changes in concentration with tissue damage, cancer, inflammatory response and the like, and has important potential value in the early diagnosis, prognosis and monitoring of diseases. In recent years, cfDNA has been widely used in research fields such as early cancer screening. Studies have shown that tumor tissue sources can be classified using specific cfDNA fragment features, and that cfDNA fragment lengths can also reveal tissue origin or tumor origin.
However, most liquid biopsy methods currently focus on detecting genetic mutations or chromosomal abnormalities in blood, and existing fragmentomics methods rely on Whole Genome Sequencing (WGS), which cannot fully exploit the information in other omics sequencing data.
Disclosure of Invention
In order to solve at least one of the above technical problems, the present application develops a fragmentomics analysis system based on multi-omics data to identify cfDNA fragment distribution tumor markers, and thereby to identify whether a sample is a tumor sample. Specifically, the technical scheme adopted by the application is as follows:
The first aspect of the present application provides a cfDNA fragment signature combination comprising at least one of the following fragment signatures: 163-164bp, 157-159bp, 157-160bp, 159-160bp, 147-148bp, 151-153bp, 277-279bp, 277-278bp, 137-138bp, 283-284bp, 142-144bp, 107-108bp, 141-144bp, 267-268bp, 117-118bp, 141-142bp, 298-300bp, 339-340bp, 337-338bp, 327-328bp, 375-376bp, 217-218bp, 382-384bp, 383-389bp, 385-387bp, 386-390bp, 195-196bp, 191-192bp, 227-228bp, 189-192bp, 319-320bp, 187-189bp, 189-190bp, 61-62bp, 64-66bp, 239-240bp, 67-68bp, 69-70bp and 67-72bp.
In some embodiments of the application, the cfDNA fragment signature combination includes at least one of the following fragment signatures: 163-164bp, 157-159bp, 157-160bp, 159-160bp, 151-153bp, 277-279bp, 137-138bp, 283-284bp, 142-144bp, 107-108bp, 141-144bp, 267-268bp, 117-118bp, 141-142bp, 298-300bp, 339-340bp, 375-376bp, 217-218bp, 383-384bp, 383-389bp, 385-387bp, 386-390bp, 195-196bp, 191-192bp, 227-228bp, 189-192bp, 319-320bp, 187-189bp, 61-62bp, 64-66bp, 239-240bp, 67-68bp, 69-70bp and 67-72bp.
In other embodiments of the application, the cfDNA fragment signature combination includes at least one of the following fragment signatures: 163-164bp, 157-159bp, 159-160bp, 147-148bp, 151-153bp, 277-279bp, 277-278bp, 107-108bp, 267-268bp, 117-118bp, 141-142bp, 298-300bp, 339-340bp, 337-338bp, 327-328bp, 217-218bp, 382-384bp, 383-384bp, 195-196bp, 191-192bp, 189-190bp, 61-62bp, 64-66bp, 239-240bp, 67-68bp, 69-70bp.
In some preferred embodiments of the application, the cfDNA fragment signature combination comprises at least one of the following fragment signatures: 163-164bp, 157-159bp, 159-160bp, 151-153bp, 277-279bp, 107-108bp, 267-268bp, 117-118bp, 141-142bp, 298-300bp, 339-340bp, 217-218bp, 383-384bp, 195-196bp, 191-192bp, 189-192bp, 61-62bp, 64-66bp, 239-240bp, 67-68bp and 69-70bp.
In the present application, "at least one" means one, two, three or more, up to all, of the listed cfDNA fragment features. Because these cfDNA fragment feature combinations share the same properties, they belong to a single general inventive concept.
In the present application, the definition of the relevant terms is as follows:
Fragment feature: cfDNA fragments are divided into fragment intervals according to their lengths, and all cfDNA fragments within one fragment interval constitute one fragment feature. For example, the fragment feature 61-65bp includes cfDNA fragments with lengths of 61bp, 62bp, 63bp, 64bp and 65bp, and the fragment feature 74-75bp includes cfDNA fragments with lengths of 74bp and 75bp.
Fragment count ratio: the ratio of the number of cfDNA fragments in a fragment feature to the total number of fragments.
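The two definitions above can be illustrated with a minimal Python sketch. The function name `fragment_ratios` and the toy fragment lengths are hypothetical and not part of the application; the sketch assumes that both ends of a fragment interval are inclusive (as in the 61-65bp example) and that the denominator is the number of all fragments supplied.

```python
from typing import Dict, Iterable, List, Tuple


def fragment_ratios(lengths: Iterable[int],
                    features: List[Tuple[int, int]]) -> Dict[str, float]:
    """Return, for each fragment feature, its share of all fragments."""
    lengths = list(lengths)
    total = len(lengths)
    if total == 0:
        raise ValueError("no fragments supplied")
    ratios = {}
    for low, high in features:
        # Both interval ends are treated as inclusive (assumption).
        count = sum(1 for n in lengths if low <= n <= high)
        ratios[f"{low}-{high}bp"] = count / total
    return ratios


# Hypothetical toy input: ten fragment lengths in bp.
toy_lengths = [61, 62, 65, 74, 75, 75, 120, 163, 164, 300]
toy_ratios = fragment_ratios(toy_lengths, [(61, 65), (74, 75), (163, 164)])
```

Because each feature is counted independently, overlapping features such as 157-159bp and 157-160bp can be passed in the same call.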
In the present application, the cfDNA fragment length and count data refer to data obtained using any one sequencing method selected from the group consisting of WGS sequencing, WES sequencing, MeDIP and MBD-Seq. In fact, one skilled in the art may use any method, sequencing-based or not, provided that the lengths and numbers of cfDNA fragments can be obtained.
A second aspect of the application provides a system for predicting whether a subject has or is at risk of having cancer, comprising the following modules:
a data input module for inputting the subject's cfDNA fragment length and count data;
a distribution profile analysis module, connected to the data input module, for obtaining the fragment count ratio of each cfDNA fragment feature in the cfDNA fragment feature combination according to the first aspect of the application; and
a cancer prediction module, connected to the distribution profile analysis module, for judging whether the subject has cancer or is at risk of having cancer according to the fragment count ratios of the cfDNA fragment features.
In some embodiments of the application, the cancer prediction module uses a machine learning model to determine whether the subject has cancer or is at risk of having cancer.
Further, the machine learning model is trained by any one of the following algorithms: random forest algorithms, support vector machine algorithms, linear regression algorithms, logistic regression algorithms, Bayesian classifiers, and neural network algorithms.
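As a rough illustration of how the three modules of the second aspect could be connected, the following Python sketch chains a data input step, a distribution profile step and a prediction step. All class names, the scoring function, its weights and the threshold are hypothetical placeholders; a trained machine learning model would take the place of `toy_score`.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Feature = Tuple[int, int]  # inclusive (low, high) fragment interval in bp


@dataclass
class DataInput:
    """Data input module: holds the subject's cfDNA fragment lengths."""
    lengths: List[int]


@dataclass
class ProfileAnalysis:
    """Distribution profile analysis module: fragment count ratio per feature."""
    features: List[Feature]

    def run(self, data: DataInput) -> List[float]:
        total = len(data.lengths)
        if total == 0:
            raise ValueError("no fragments supplied")
        return [sum(1 for n in data.lengths if lo <= n <= hi) / total
                for lo, hi in self.features]


@dataclass
class CancerPrediction:
    """Cancer prediction module: wraps any scoring model and a threshold."""
    score: Callable[[List[float]], float]
    threshold: float

    def run(self, ratios: List[float]) -> bool:
        return self.score(ratios) > self.threshold


def toy_score(ratios: List[float]) -> float:
    # Hypothetical linear score; the weights are NOT from the application.
    return 2.0 * ratios[1] - 1.0 * ratios[0]


data = DataInput([61, 61, 62, 163, 200, 250, 300, 350, 380, 390])
ratios = ProfileAnalysis([(163, 164), (61, 62)]).run(data)
flagged = CancerPrediction(toy_score, threshold=0.1).run(ratios)
```

Keeping the scoring function as a plain callable means any of the listed algorithms could be plugged into the prediction module without changing the other two modules.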
In some embodiments of the application, the machine learning model is obtained using Lasso regression, a regularized generalized linear model.
In some embodiments of the application, the Lasso regression is performed using glmnet in the R language. glmnet is mainly used for fitting generalized linear models. Variable screening is tuned through the regularization parameter λ so as to minimize the loss. The algorithm is very fast and can take a sparse matrix as input. Lasso regression performs variable selection and complexity adjustment (regularization) while fitting a generalized linear model. Therefore, whether the dependent (response) variable is continuous, binary, or multi-class discrete, it can be modeled and predicted using Lasso regression. Variable selection here means that not all variables are fitted into the model; variables are included selectively to obtain better performance. Complexity adjustment means controlling the complexity of the model through a set of parameters, thereby avoiding overfitting.
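To make the idea of simultaneous fitting and variable selection concrete, here is a minimal Python sketch of L1-penalized (Lasso) logistic regression. It is not glmnet: glmnet uses coordinate descent and standardizes features by default, whereas this sketch uses plain proximal gradient descent with soft-thresholding on unstandardized inputs. The function names, step size, iteration count and toy data are all assumptions made for illustration.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def soft_threshold(value: float, amount: float) -> float:
    """Shrink value toward zero by amount (proximal step of the L1 penalty)."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0


def fit_lasso_logistic(X: List[List[float]], y: List[int], lam: float,
                       step: float = 0.5,
                       n_iter: int = 2000) -> Tuple[List[float], float]:
    """Minimize mean binomial loss + lam * sum(|w|); intercept is unpenalized."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err / n
            for j in range(p):
                grad_w[j] += err * xi[j] / n
        b -= step * grad_b
        # Gradient step followed by soft-thresholding sets weak weights to 0.
        w = [soft_threshold(wj - step * gj, step * lam)
             for wj, gj in zip(w, grad_w)]
    return w, b


# Hypothetical toy data: column 0 tracks the label, column 1 does not.
X_toy = [[0.0, 1.0], [0.1, 0.0], [0.9, 1.0], [1.0, 0.0]]
y_toy = [0, 0, 1, 1]
weights, intercept = fit_lasso_logistic(X_toy, y_toy, lam=0.05)
null_weights, _ = fit_lasso_logistic(X_toy, y_toy, lam=10.0)
```

With a small λ the informative column keeps a non-zero weight while the uninformative one is shrunk away; with a large λ every weight is removed, which is the complexity control described above.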
Further, in some embodiments of the application, predictions are made using the Lasso score obtained by Lasso regression. When the Lasso score exceeds a preset threshold, the subject is judged to have cancer or to be at risk of having cancer.
In some embodiments of the application, the prediction threshold is determined from the Lasso score values of a population of cancer samples and/or the Lasso score values of a population of normal samples.
Optionally, the prediction threshold is determined from a representative value of the Lasso score values of the population of cancer samples.
Optionally, the prediction threshold is determined from a representative value of the Lasso score values of the population of normal samples.
Optionally, the prediction threshold is determined from a representative value of the increase of the Lasso score values of the population of cancer samples relative to the Lasso score values of the population of normal samples. In this case the cancer samples and the normal samples are paired samples, so that the increase has clinical significance.
In some embodiments of the application, the population of cancer samples refers to 10 or more cancer samples, such as 10, 20, 50, 100, 200, 500 or more.
In some embodiments of the application, the representative value is one of the mean, the mode, the median, the first quartile (1/4 quantile) and the third quartile (3/4 quantile).
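The threshold options above can be sketched in Python with the standard `statistics` module. The score lists are hypothetical toy values, and using the third quartile of the normal-sample scores as the threshold is only one of the listed options, chosen here for illustration. Note that `statistics.quantiles` uses its default "exclusive" method, which may differ slightly from other quartile conventions.

```python
import statistics
from typing import List


def representative(values: List[float], kind: str = "median") -> float:
    """Return one of the representative values named in the text."""
    if kind == "mean":
        return statistics.mean(values)
    if kind == "mode":
        return statistics.mode(values)
    if kind == "median":
        return statistics.median(values)
    if kind in ("q1", "q3"):
        q1, _, q3 = statistics.quantiles(values, n=4)
        return q1 if kind == "q1" else q3
    raise ValueError(f"unknown representative value: {kind}")


def paired_increase(cancer: List[float], normal: List[float]) -> List[float]:
    """Increase of each cancer score over its paired normal score."""
    return [c - n for c, n in zip(cancer, normal)]


def has_cancer_or_risk(score: float, threshold: float) -> bool:
    """Apply the decision rule: positive when the score exceeds the threshold."""
    return score > threshold


# Hypothetical Lasso scores for illustration only.
cancer_scores = [0.62, 0.71, 0.80, 0.85, 0.93]
normal_scores = [0.10, 0.15, 0.22, 0.30, 0.35]
threshold = representative(normal_scores, "q3")
mean_increase = representative(paired_increase(cancer_scores, normal_scores), "mean")
```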
In the present application, the cancers include, but are not limited to, solid tumors and blood cancers, such as fibrosarcoma, myosarcoma, liposarcoma, chondrosarcoma, osteogenic sarcoma, chordoma, angiosarcoma, endotheliosarcoma, lymphangiosarcoma, lymphangioendothelioma, synovioma, mesothelioma, Ewing's tumor, leiomyosarcoma, rhabdomyosarcoma, colon cancer, pancreatic cancer, prostate cancer, squamous cell carcinoma, basal cell carcinoma, adenocarcinoma, sweat gland carcinoma, sebaceous gland carcinoma, papillary adenocarcinoma, cystadenocarcinoma, medullary carcinoma, bronchogenic carcinoma, hepatoma, cholangiocarcinoma, choriocarcinoma, seminoma, embryonal carcinoma, nephroblastoma, cervical cancer, testicular tumor, lung cancer, small cell lung cancer, epithelial carcinoma, glioma, astrocytoma, medulloblastoma, craniopharyngioma, ependymoma, pineal tumor, hemangioblastoma, acoustic neuroma, oligodendroglioma, meningioma, melanoma, neuroblastoma and glioblastoma; leukemias such as acute lymphoblastic leukemia and acute myeloblastic leukemia (myeloblastic, promyelocytic, myelomonocytic, monocytic and erythroleukemia); chronic leukemia (chronic myelogenous (granulocytic) leukemia and chronic lymphocytic leukemia); and polycythemia vera, lymphomas (Hodgkin's and non-Hodgkin's), multiple myeloma, Waldenström's macroglobulinemia and heavy chain disease.
In the present application, the identification, classification, judgment, and prediction have the same or similar meaning, and refer to distinguishing cancer samples from normal samples.
In a third aspect, the present application provides the use of a detection reagent of a cfDNA fragment feature combination according to the first aspect of the present application for the preparation of a kit for predicting whether a subject has cancer or is at risk of having cancer.
In some embodiments of the application, the detection reagent comprises a capture reagent and/or a sequencing reagent.
In some embodiments of the application, the kit further comprises cfDNA extraction reagents.
Beneficial Effects
Compared with the prior art, the application has the following beneficial effects:
The cfDNA fragment feature combination and system of the present application can be used for cancer prediction not only with data from any one sequencing method selected from the group consisting of WGS sequencing, WES sequencing, MeDIP and MBD-Seq, but also with data obtained by any sequencing or non-sequencing method, as long as the lengths and numbers of cfDNA fragments can be obtained.
Cancer prediction using the cfDNA fragment feature combination and system draws on a comprehensive analysis of cfDNA fragment features and achieves better predictive performance for cancer.
Cancer prediction using the cfDNA fragment feature combination and system reduces the requirements that cfDNA fragment analysis-based cancer prediction methods place on the upstream experimental end and their dependence on it, substantially broadens the interpretability and utilization of other omics sequencing data, greatly reduces the experimental cost of cfDNA-based tumor diagnosis, and improves the accuracy of cfDNA-based cancer prediction.
Drawings
FIG. 1 shows a Lasso regression cvfit curve in example 2 of the present application.
FIG. 2 shows the results of tumor recognition by the lambda.min model in the training set and validation set.
FIG. 3 shows the results of tumor recognition by the lambda.1se model in the training set and validation set.
FIG. 4 shows the results of tumor recognition by the lambda.min model in an external dataset.
FIG. 5 shows the results of tumor recognition by the lambda.1se model in an external dataset.
FIG. 6 shows the classification performance of the 21-feature classification model in the training set and the test set.
FIG. 7 shows the classification performance of the 21-feature classification model in the external dataset.
Detailed Description
Unless otherwise indicated, implied from the context, or customary in the art, all parts and percentages in the present application are based on weight, and the test and characterization methods used are current as of the filing date of the present application. Where applicable, the disclosure of any patent, patent application, or publication referred to in this specification is incorporated herein by reference in its entirety, and the equivalent family patents are likewise incorporated by reference, particularly with respect to the definitions of terms in the art that they disclose. If the definition of a particular term disclosed in the prior art is inconsistent with any definition provided in the present application, the definition provided in the present application controls.
The numerical ranges in the present application are approximations and may therefore include values outside the stated range unless otherwise indicated. A numerical range includes all values from the lower value to the upper value in increments of 1 unit, provided that there is a separation of at least 2 units between any lower value and any higher value. These are merely specific examples of what is intended, and all possible combinations of numerical values between the lowest value and the highest value enumerated are to be considered expressly stated in this disclosure.
In order to make the technical problems, technical schemes and beneficial effects solved by the application more clear, the application is further described in detail below with reference to the embodiments.
Examples
The following examples are presented herein to demonstrate preferred embodiments of the present application. It will be appreciated by those skilled in the art that the techniques disclosed in the examples which follow represent techniques discovered by the inventor to function in the practice of the application, and thus can be considered to constitute preferred modes for its practice. Those of skill in the art should, in light of the present disclosure, appreciate that many changes can be made in the specific embodiments which are disclosed and still obtain a like or similar result without departing from the spirit or scope of the application.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
Those skilled in the art will recognize, or be able to ascertain using no more than routine experimentation, many equivalents to the specific embodiments of the application described herein. Such equivalents are intended to be encompassed by the claims.
The molecular biology experiments in the following examples that are not specifically described were performed according to the methods listed in "Molecular Cloning: A Laboratory Manual" (fourth edition) (J. Sambrook, M. R. Green, 2017) or according to the kit and product instructions. Other experimental methods, unless otherwise specified, are conventional. The instruments used in the following examples are conventional laboratory instruments unless otherwise specified; the test materials used in the following examples, unless otherwise specified, were purchased from conventional biochemical reagent suppliers.
Example 1 Identification of cfDNA fragment distribution tumor markers
1. cfDNA sequencing
To obtain cfDNA fragment distribution tumor markers, the inventors obtained blood samples from 417 tumor patients (183 colorectal cancer, 40 liver cancer, 92 gastric cancer, 68 pancreatic cancer, 9 esophageal cancer and 25 glioblastoma) and 813 healthy individuals. cfDNA was extracted and sequenced using methylated DNA enrichment sequencing (MBD-Seq, Methyl-CpG Binding Domain Sequencing).
2. Data preprocessing
a) Data cleaning: adapter sequences introduced during library construction were removed using fastp-0.20.0, and low-quality reads were filtered out (reads in which more than 40% of bases had a quality value below Q15, or which contained more than 5 N bases; read ends were trimmed with a 4-base sliding window when the mean quality of the window fell below Q20).
b) Alignment: the reads in the fastq files were aligned to the human reference genome hg19 (GRCh37) using bowtie2-2.3.4.2 to generate bam files, which were sorted by genomic coordinate. The sorted bam files were deduplicated using Picard MarkDuplicates (2.18.25-SNAPSHOT), and only paired reads aligned to the reference genome with MAPQ > 20 were retained.
c) cfDNA screening: to remove cfDNA fragments captured non-specifically by the MBD protein, fragments in the bam file that contain no CG dinucleotide were filtered out. cfDNA with fragment length in (60, 400) was then retained for subsequent analysis.
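The fragment-level screening in steps b) and c) can be sketched in Python as a simple predicate. The `Fragment` record and its field names are hypothetical stand-ins for information that would in practice be read from a bam file. The sketch assumes the length window means 61 to 400 bp inclusive, which matches the fragment intervals used later but is an interpretation of "(60, 400)".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """Hypothetical, simplified view of one aligned cfDNA fragment."""
    sequence: str          # fragment sequence
    mapq: int              # mapping quality of the alignment
    properly_paired: bool  # both mates aligned to the reference


def keep_fragment(frag: Fragment, min_mapq_exclusive: int = 20,
                  min_len: int = 61, max_len: int = 400) -> bool:
    """Return True if the fragment passes the alignment, CG and length filters."""
    if not frag.properly_paired or frag.mapq <= min_mapq_exclusive:
        return False
    # Fragments without any CG dinucleotide are treated as
    # non-specifically captured and dropped.
    if "CG" not in frag.sequence.upper():
        return False
    # Length window is assumed to be inclusive on both ends.
    return min_len <= len(frag.sequence) <= max_len


kept = keep_fragment(Fragment("ACGT" * 20, mapq=30, properly_paired=True))
```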
3. cfDNA fragment distribution profile
The final processed bam file was analyzed using the R package Rsamtools to calculate the length of each cfDNA fragment. The fragment lengths were then divided into fragment intervals using step sizes of 2bp, 3bp, 4bp, 5bp, …… bp respectively (with a step size of 2bp, the intervals are 61-62bp, 63-64bp, ……, 398-400bp; with a step size of 3bp, the intervals are 61-63bp, 64-66bp, ……, 396-399bp; with a step size of 10bp, the intervals are 61-70bp, 71-80bp, ……, 391-400bp). All cfDNA fragments falling within one fragment interval were defined as one fragment feature, and the proportion of cfDNA fragments in each fragment feature relative to the total number of fragments was calculated to generate the cfDNA fragment distribution profile.
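The multi-step binning described above can be sketched in Python as follows. The function names are hypothetical. The sketch assumes every interval contains exactly `step` consecutive lengths starting at 61 bp and simply drops any shorter remainder near 400 bp; how the source treats that remainder is not modeled here.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def make_intervals(step: int, start: int = 61,
                   end: int = 400) -> List[Tuple[int, int]]:
    """Consecutive inclusive intervals of width `step` covering start..end."""
    return [(lo, lo + step - 1) for lo in range(start, end - step + 2, step)]


def build_profile(lengths: Iterable[int],
                  steps: Iterable[int]) -> Dict[str, float]:
    """Fragment count ratio for every interval of every step size."""
    counts = Counter(lengths)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no fragments supplied")
    profile = {}
    for step in steps:
        for lo, hi in make_intervals(step):
            in_bin = sum(counts[n] for n in range(lo, hi + 1))
            profile[f"{lo}-{hi}bp"] = in_bin / total
    return profile


# Hypothetical toy input: six fragment lengths, binned with 2 bp and 10 bp steps.
toy_profile = build_profile([61, 62, 63, 64, 70, 400], steps=(2, 10))
```

Running several step sizes over the same fragments yields overlapping features (for example 157-159bp from the 3 bp step and 157-160bp from the 4 bp step), which is why the feature lists in this application contain nested intervals.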
Example 2 Establishment of the tumor recognition model
For each cfDNA fragment feature, a Wilcoxon rank-sum test was performed between tumor and healthy samples, and Benjamini-Hochberg (BH) correction was applied to obtain adjusted p-values. The area under the ROC curve (AUC) of each fragment feature for distinguishing tumor from healthy samples was then calculated. Fragment features with an adjusted p-value < 0.05 and AUC > 0.6 were identified as differentially distributed between tumor and healthy samples.
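The screening rule above can be sketched in Python. The sketch takes raw p-values as given (for instance from a rank-sum test computed elsewhere), applies the Benjamini-Hochberg adjustment, and computes the AUC as the fraction of tumor/healthy pairs in which the tumor value is higher, counting ties as one half. Function names and toy numbers are hypothetical, and the direction of the AUC (tumor higher than healthy) is an assumption.

```python
from typing import List, Sequence


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Return BH-adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def auc(tumor: Sequence[float], healthy: Sequence[float]) -> float:
    """Probability that a tumor value exceeds a healthy value (ties = 0.5)."""
    wins = 0.0
    for t in tumor:
        for h in healthy:
            if t > h:
                wins += 1.0
            elif t == h:
                wins += 0.5
    return wins / (len(tumor) * len(healthy))


def select_features(names: Sequence[str], pvalues: Sequence[float],
                    aucs: Sequence[float], alpha: float = 0.05,
                    min_auc: float = 0.6) -> List[str]:
    """Keep features with adjusted p < alpha and AUC > min_auc."""
    adjusted = benjamini_hochberg(pvalues)
    return [n for n, q, a in zip(names, adjusted, aucs)
            if q < alpha and a > min_auc]


# Hypothetical toy values for four features.
toy_adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
toy_auc = auc([0.3, 0.4, 0.5], [0.1, 0.2, 0.35])
toy_selected = select_features(["a", "b", "c", "d"],
                               [0.01, 0.04, 0.03, 0.20],
                               [0.7, 0.9, 0.8, 0.65])
```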
To obtain a fragment feature combination that predicts tumors more accurately, the inventors built a tumor recognition model using Lasso regression, a generalized linear model. Specifically, based on the fragment features identified above, the R package glmnet was used to fit the model, and the optimal lambda value was then selected. To avoid overfitting, the inventors used 10-fold cross-validation to select the model.
The cvfit curve of the Lasso regression is shown in FIG. 1. The two dashed lines on the right mark two particular lambda values, lambda.min and lambda.1se, which are 0.0004348304 and 0.001063428, respectively.
lambda.min is the lambda value that gives the minimum mean cross-validated binomial deviance among all lambda values. lambda.1se is the lambda value that gives the simplest model whose cross-validated error lies within one standard error of the error at lambda.min.
Once lambda has fallen to a certain value, continuing to add model variables, that is, continuing to decrease lambda, no longer improves model performance significantly; lambda.1se therefore gives a model with excellent performance but the smallest number of fragment features.
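The choice between lambda.min and lambda.1se can be sketched in Python. The function name and the cross-validation numbers are hypothetical; the rule itself follows the definitions above: lambda.min minimizes the mean cross-validated deviance, and lambda.1se is the largest lambda whose mean deviance stays within one standard error of that minimum.

```python
from typing import Sequence, Tuple


def pick_lambdas(lambdas: Sequence[float], cv_mean: Sequence[float],
                 cv_se: Sequence[float]) -> Tuple[float, float]:
    """Return (lambda_min, lambda_1se) from cross-validation summaries."""
    best = min(cv_mean)
    # If several lambdas tie for the minimum, take the largest (simplest model).
    lambda_min = max(lam for lam, m in zip(lambdas, cv_mean) if m == best)
    i_min = list(lambdas).index(lambda_min)
    limit = cv_mean[i_min] + cv_se[i_min]
    # Largest lambda (strongest penalty) still within one standard error.
    lambda_1se = max(lam for lam, m in zip(lambdas, cv_mean) if m <= limit)
    return lambda_min, lambda_1se


# Hypothetical cross-validation results for four candidate lambdas.
toy_lambdas = [0.1, 0.01, 0.001, 0.0001]
toy_mean_deviance = [0.90, 0.52, 0.50, 0.51]
toy_se = [0.03, 0.03, 0.03, 0.03]
toy_min, toy_1se = pick_lambdas(toy_lambdas, toy_mean_deviance, toy_se)
```

Because a larger lambda penalizes coefficients more strongly, lambda.1se is never smaller than lambda.min, which is consistent with the two values reported for FIG. 1.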
The model corresponding to lambda.min includes 34 fragment features, namely: 163-164bp, 157-159bp, 157-160bp, 159-160bp, 151-153bp, 277-279bp, 137-138bp, 283-284bp, 142-144bp, 107-108bp, 141-144bp, 267-268bp, 117-118bp, 141-142bp, 298-300bp, 339-340bp, 375-376bp, 217-218bp, 383-384bp, 383-389bp, 385-387bp, 386-390bp, 195-196bp, 191-192bp, 227-228bp, 189-192bp, 319-320bp, 187-189bp, 61-62bp, 64-66bp, 239-240bp, 67-68bp, 69-70bp and 67-72bp.
The model corresponding to lambda.1se includes 28 fragment features, respectively: 163-164bp, 157-159bp, 159-160bp, 147-148bp, 151-153bp, 277-279bp, 277-278bp, 107-108bp, 267-268bp, 117-118bp, 141-142bp, 298-300bp, 339-340bp, 337-338bp, 327-328bp, 217-218bp, 382-384bp, 383-384bp, 195-196bp, 191-192bp, 189-190bp, 61-62bp, 64-66bp, 239-240bp, 67-68bp, 69-70bp.
The classification performance of the two classification models in the training set and the test set is shown in Table 1 and FIGS. 2 to 3.
Table 1 Classification efficacy of the two Lasso classification models
It can be seen that both the lambda.min model and the lambda.1se model achieve good tumor recognition.
Example 4 Determination of overfitting of the tumor classification model
For a linear model, complexity is directly related to the number of variables: the more variables, the higher the model complexity. Fitting more variables tends to give a seemingly better model but also raises the risk of overfitting, in which case performance is generally poor when the model is validated on completely new data. Overfitting is typically likely when the number of variables is much greater than the number of data points, or when a discrete variable has too many unique values.
To determine whether the tumor models of Example 3 were overfitted, the inventors used external data (i.e., data outside the training set and the test set, comprising 73 tumor samples and 79 normal samples); the results are shown in Table 2 and FIGS. 4 to 5.
Table 2 Classification efficacy of the two Lasso classification models in the external data
It can be seen that neither classification model of Example 3 is overfitted and that both discriminate very well.
Example 5 21-fragment-feature tumor recognition model
In Example 3, the model corresponding to lambda.min includes 34 fragment features and the model corresponding to lambda.1se includes 28 fragment features; the two models share 21 intersection features, namely: 163-164bp, 157-159bp, 159-160bp, 151-153bp, 277-279bp, 107-108bp, 267-268bp, 117-118bp, 141-142bp, 298-300bp, 339-340bp, 217-218bp, 383-384bp, 195-196bp, 191-192bp, 189-192bp, 61-62bp, 64-66bp, 239-240bp, 67-68bp, 69-70bp.
The inventors examined whether the 21 intersection features alone can also distinguish tumor samples from normal samples. Specifically, a new 21-feature Lasso model was built using the 21 intersection features, with the training set, test set, and external dataset unchanged.
The classification performance of the 21 feature classification model in the training set and the test set is shown in table 3 and fig. 6.
Table 3 Classification efficacy of the 21-feature Lasso classification model
The classification performance of the 21 feature classification model in the external dataset is shown in table 4 and fig. 7.
It follows that using only 21 fragment features, a tumor can also be well identified and can be used to predict whether a subject has a tumor or is at risk of having a tumor.
All documents mentioned in this disclosure are incorporated by reference in this disclosure as if each were individually incorporated by reference. Further, it will be appreciated that various changes and modifications may be made by those skilled in the art after reading the above teachings, and such equivalents are intended to fall within the scope of the application as defined in the appended claims.

Claims (5)

1. Use of a detection reagent of a cfDNA fragment signature combination in the preparation of a kit for predicting whether a subject has cancer, characterized in that the cfDNA fragment signature combination comprises the following fragment signatures:
(1) 163-164bp, 157-159bp, 159-160bp, 151-153bp, 277-279bp, 107-108bp, 267-268bp, 117-118bp, 141-142bp, 298-300bp, 339-340bp, 217-218bp, 383-384bp, 195-196bp, 191-192bp, 189-192bp, 61-62bp, 64-66bp, 239-240bp, 67-68bp and 69-70bp; or alternatively
(2) 163-164bp, 157-159bp, 157-160bp, 159-160bp, 151-153bp, 277-279bp, 137-138bp, 283-284bp, 142-144bp, 107-108bp, 141-144bp, 267-268bp, 117-118bp, 141-142bp, 298-300bp, 339-340bp, 375-376bp, 217-218bp, 383-384bp, 383-389bp, 385-387bp, 386-390bp, 195-196bp, 191-192bp, 227-228bp, 189-192bp, 319-320bp, 187-189bp, 61-62bp, 64-66bp, 239-240bp, 67-68bp, 69-70bp and 67-72bp; or alternatively
(3) 163-164bp, 157-159bp, 159-160bp, 147-148bp, 151-153bp, 277-279bp, 277-278bp, 107-108bp, 267-268bp, 117-118bp, 141-142bp, 298-300bp, 339-340bp, 337-338bp, 327-328bp, 217-218bp, 382-384bp, 383-384bp, 195-196bp, 191-192bp, 189-190bp, 61-62bp, 64-66bp, 239-240bp, 67-68bp and 69-70bp,
the cancer is colorectal cancer, liver cancer, gastric cancer, pancreatic cancer, esophageal cancer or glioblastoma.
2. The use according to claim 1, wherein the detection reagent further comprises a capture reagent and/or a sequencing reagent.
3. The use of claim 1, wherein the detection reagent further comprises a cfDNA extraction reagent.
4. A system for predicting whether a subject has cancer, comprising the following modules:
the data input module is used for inputting the cfDNA fragment length and quantity data of the subject;
the distribution spectrum analysis module is connected with the data input module and is used for obtaining the fragment quantity proportion of each cfDNA fragment characteristic in the cfDNA fragment characteristic combination of claim 1;
and the cancer prediction module is connected with the distribution spectrum analysis module and is used for judging whether the subject suffers from cancer or not by utilizing a machine learning model according to the fragment quantity proportion of the cfDNA fragment characteristics, wherein the cancer is colorectal cancer, liver cancer, gastric cancer, pancreatic cancer, esophageal cancer or glioblastoma.
5. The system of claim 4, wherein the machine learning model is trained using any one of the following algorithms:
random forest algorithms, support vector machine algorithms, linear regression algorithms, logistic regression algorithms, bayesian classifiers, and neural network algorithms.
CN202211201445.7A 2022-09-29 2022-09-29 cfDNA fragment characteristic combination and system for predicting cancer based on machine learning Active CN115662519B (en)

Publications (2)

Publication Number Publication Date
CN115662519A CN115662519A (en) 2023-01-31
CN115662519B true CN115662519B (en) 2023-11-03

Family

ID=84985906

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3776555A2 (en) * 2018-04-13 2021-02-17 Grail, Inc. Multi-assay prediction model for cancer detection

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Ultrasensitive and affordable assay for early detection of primary liver cancer using plasma cell-free DNA fragmentomics; Xiangyu Zhang et al.; Hepatology; Vol. 76 (No. 2); pp. 317-329 *
Research progress on cell-free DNA fragmentomics for the diagnosis of chronic liver disease; Zeng Weilan (曾伟兰) et al.; Chinese Hepatology; Vol. 27 (No. 8); pp. 842-844 *

Also Published As

Publication number Publication date
CN115662519A (en) 2023-01-31

Similar Documents

Publication Publication Date Title
Li et al. MetaRNN: differentiating rare pathogenic and rare benign missense SNVs and InDels using deep learning
Pavlidis et al. Searching for footprints of positive selection in whole-genome SNP data from nonequilibrium populations
CN109767810B (en) High-throughput sequencing data analysis method and device
US20230222311A1 (en) Generating machine learning models using genetic data
CN104302781B (en) Method and device for detecting chromosomal structural abnormality
CN112750502A (en) Single cell transcriptome sequencing data clustering recommendation method based on two-dimensional distribution structure judgment
JP6066924B2 (en) DNA sequence data analysis method
CA2877436C (en) Systems and methods for generating biomarker signatures
CN111304308A (en) Method for auditing detection result of high-throughput sequencing gene variation
Pool Genetic mapping by bulk segregant analysis in Drosophila: experimental design and simulation-based inference
CN112289376A (en) Method and device for detecting somatic cell mutation
CN107760783B (en) Gastric cancer peritoneal metastasis prediction model based on 108 genes and application thereof
CN115274136A (en) Tumor cell line drug response prediction method integrating multiomic and essential genes
Chen et al. A nonparametric approach to detect nonlinear correlation in gene expression
CN112599190B (en) Method for identifying deafness-related genes based on mixed classifier
WO2014083018A1 (en) Method and system for processing data for evaluating a quality level of a dataset
CN115662519B (en) cfDNA fragment characteristic combination and system for predicting cancer based on machine learning
Vijayan et al. Blood-based transcriptomic signature panel identification for cancer diagnosis: benchmarking of feature extraction methods
JP2007526979A (en) Method for characterization of biomolecular samples
EP2710152A1 (en) Computer-implemented method and system for detecting interacting dna loci
CN114694752B (en) Method, computing device and medium for predicting homologous recombination repair defects
CN115558716B (en) cfDNA fragment characteristic combination, system and application for predicting cancer
KR20220133516A (en) Method for detecting tumor derived mutation from cell-free DNA based on artificial intelligence and Method for early diagnosis of cancer using the same
US11535896B2 (en) Method for analysing cell-free nucleic acids
US20220292363A1 (en) Method for automatically determining disease type and electronic apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230909

Address after: 140 Hanzhong Road, Nanjing, Jiangsu 210000

Applicant after: Nanjing Medical University

Address before: 215004 Room 301, Building 12, No. 8, Jinfeng Road, High-tech Zone, Suzhou, Jiangsu Province

Applicant before: Ankai Life Technology (Suzhou) Co.,Ltd.

GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240718

Address after: 215000 Room 301, Building 12, No. 8, Jinfeng Road, High-tech Zone, Suzhou, Jiangsu

Patentee after: Ankai Life Technology (Suzhou) Co.,Ltd.

Country or region after: China

Address before: 140 Hanzhong Road, Nanjing, Jiangsu 210000

Patentee before: Nanjing Medical University

Country or region before: China