CN113903398A - Intestinal cancer early-screening marker, detection method, detection device, and computer-readable medium - Google Patents

Intestinal cancer early-screening marker, detection method, detection device, and computer-readable medium

Info

Publication number
CN113903398A
Authority
CN
China
Prior art keywords
model
artificial sequence
dna
characteristic value
taking
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111053742.7A
Other languages
Chinese (zh)
Inventor
邵阳
彭俊杰
李雅琪
吴雪
刘睿
包华
吴舒雨
鲍海蓉
唐皖湘夫
常双
杨珊珊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Shihe Medical Devices Co ltd
Nanjing Shihe Gene Biotechnology Co Ltd
Original Assignee
Nanjing Shihe Medical Devices Co ltd
Nanjing Shihe Gene Biotechnology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Shihe Medical Devices Co ltd, Nanjing Shihe Gene Biotechnology Co Ltd filed Critical Nanjing Shihe Medical Devices Co ltd
Priority to CN202111053742.7A priority Critical patent/CN113903398A/en
Publication of CN113903398A publication Critical patent/CN113903398A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10Sequence alignment; Homology search
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems

Landscapes

  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Bioethics (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Artificial Intelligence (AREA)
  • Pathology (AREA)
  • Primary Health Care (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention provides a colorectal cancer early-screening marker, a detection method, a detection device and a computer-readable medium. High-throughput sequencing results are analyzed for three features that differ between colorectal cancer patients and healthy people: the high-resolution DNA fragment length distribution, the proportion of reads by sequence at the breakpoint at the 5' end of the read, and copy number variation in 1 Mb windows. A gradient boosting machine, a random forest and a deep learning network are each used for training and modeling, and a multi-feature, multi-algorithm ensemble model is finally constructed through a generalized linear model, thereby achieving noninvasive and accurate diagnosis of colorectal cancer.

Description

Intestinal cancer early-screening marker, detection method, detection device, and computer-readable medium
Technical Field
The invention relates to early screening for colorectal cancer (CRC) and belongs to the technical field of molecular biomedicine.
Background
Colorectal cancer is a common malignant tumor. According to the White Paper on Colorectal Cancer and Precancerous Lesions in the Chinese Physical Examination Population, the five-year survival rates of colorectal cancer in the Chinese population are 90.1%, 72.6%, 53.8% and 10.4% for stages I, II, III and IV, respectively; in 2009-2015, the five-year survival rate of patients with localized cancer was 89%, far higher than the 21% of patients with distant metastatic cancer. Early detection and early diagnosis are therefore crucial to improving the survival of colorectal cancer patients, and can also reduce the burden of medical expenditure.
Colonoscopy is the 'gold standard' for colorectal cancer diagnosis, with a detection rate of up to 95%. However, it is an invasive examination: the procedure is painful, places high demands on the patient's physical condition, and carries procedural and complication risks, so patient compliance is low. In addition, because colonoscopy resources in China are insufficient, screening penetration is low, and colonoscopy cannot be rolled out at scale as a screening tool in the short term; non-invasive detection methods are urgently needed to support and supplement it.
The main non-invasive examination in China is fecal occult blood testing, chiefly the guaiac fecal occult blood test (gFOBT) and the fecal immunochemical test (FIT). These tests have high sensitivity, and samples are easy to obtain and store, but they are generally expensive and not covered by medical insurance. An effective, economical and practical screening method suitable for a wide population is therefore urgently needed in China.
Disclosure of Invention
The invention provides a method in which whole-genome sequencing (WGS) is performed on cfDNA from plasma samples, and three features that differ between colorectal cancer patients and healthy people are analyzed: the high-resolution fragment size distribution, the proportion of reads by sequence at the breakpoint at the 5' end of the read (motif breakpoint 8-mer), and copy number variation in 1 Mb windows (1Mb-bin copy number variation). A gradient boosting machine (GBM), a random forest (Random Forest) and deep learning (Deep Learning) are each used for training and modeling, and a multi-feature, multi-algorithm ensemble model is finally constructed through a generalized linear model (GLM) to achieve noninvasive, accurate diagnosis of colorectal cancer.
A method for constructing an intestinal cancer early-screening model comprises the following steps:
step 1, extracting cfDNA from samples of a positive group and a control group and sequencing it to obtain read data;
step 2, aligning the read data to a reference genome, and obtaining the number of reads in different length intervals within different window ranges on the reference genome as a first characteristic value;
step 3, aligning the read data to the reference genome to obtain the position of the 5' end of each read on the reference genome; obtaining the sequence of the m bp upstream and the m bp downstream of that position as a set of base fragments; and taking the proportion of each obtained base fragment among all fragments as a second characteristic value;
step 4, dividing the reference genome into a plurality of windows, and obtaining the copy number data within each window as a third characteristic value;
step 5, taking the first, second and third characteristic values together as initial characteristic values, and screening out those initial characteristic values that differ significantly between the positive-group and control-group samples as model characteristic vectors;
and step 6, inputting the model characteristic vectors of the positive-group and control-group samples into the model, and training the model with the probability of intestinal cancer as the model output value to obtain the early-screening model.
The step 2 comprises the following steps:
step 2-1, dividing the reference genome into a plurality of windows, and respectively obtaining the number of all reads, the number of short reads and the number of ultra-long reads within the range of each window;
step 2-2, respectively taking the long arm and the short arm on each chromosome as region ranges, and obtaining the number of reads in gradient intervals with different lengths in each range;
and 2-3, taking the data obtained in the steps 2-1 and 2-2 together as a first characteristic value.
A short read is 40-80 bp in length, and an ultra-long read is 200-300 bp in length; "all reads" refers to reads with a length in the range of 40-300 bp.
The window size in step 2-1 is in the range of 2-7 Mb.
The different length gradient intervals in the step 2-2 refer to different length gradient ranges obtained by increasing steps of 8-12bp within a range of 40-300 bp.
The number of reads is normalized.
And m is any integer between 2 and 5.
Step 5 comprises: taking the first, second and third characteristic values respectively as input values of a gradient boosting model, a random forest model and a deep learning network model, training on the samples with the presence or absence of intestinal cancer as the output value, and obtaining from each model the feature vectors with significant differences.
Step 6 comprises: taking the feature vectors with significant differences as input values of a classifier model, taking the probability of having intestinal cancer as the output value, and training the model on the sample data of the positive group and the control group to obtain the early-screening model.
The classifier model is a linear model whose variables are the outputs obtained by inputting the feature vectors with significant differences in the first, second and third characteristic values into the gradient boosting model, the random forest model and the deep learning network model trained in step 5.
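By way of illustration only, the second-level linear classifier described above can be sketched as a logistic regression (one form of generalized linear model) fitted on the probabilities output by the base models. The function name `fit_logistic`, the plain gradient-descent fit, the use of three input columns instead of one per feature-algorithm pair, and all numbers below are assumptions made for this example; they are not taken from the invention.

```python
import math

def fit_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit a logistic-regression stacker by batch gradient descent.

    rows   : one list per sample holding the base-model probabilities
    labels : 1 for the positive (cancer) group, 0 for the control group
    """
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, y in zip(rows, labels):
            z = bias + sum(w * x for w, x in zip(weights, row))
            err = 1.0 / (1.0 + math.exp(-z)) - y  # gradient of log-loss w.r.t. z
            grad_b += err
            for j, x in enumerate(row):
                grad_w[j] += err * x
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]

    def predict(row):
        z = bias + sum(w * x for w, x in zip(weights, row))
        return 1.0 / (1.0 + math.exp(-z))

    return predict

# Made-up base-model probabilities (columns: hypothetical GBM, RF, DL outputs).
base_probs = [
    [0.9, 0.8, 0.7], [0.8, 0.9, 0.6], [0.7, 0.6, 0.9],  # positive group
    [0.2, 0.1, 0.3], [0.1, 0.3, 0.2], [0.3, 0.2, 0.1],  # control group
]
labels = [1, 1, 1, 0, 0, 0]
stacker = fit_logistic(base_probs, labels)
```

Calling `stacker([0.85, 0.8, 0.75])` returns the combined probability for a new sample whose base-model outputs are all high.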
An intestinal cancer early-screening detection device comprising:
the sequencing module is used for extracting and sequencing cfDNA of the samples of the positive group and the control group to obtain reading data; the comparison module is used for comparing the reading data result to a reference genome;
the first characteristic value acquisition module is used for acquiring the number of reads in different length intervals in different window ranges on a reference genome as a first characteristic value;
a second eigenvalue acquisition module for acquiring the position of the 5' end of the read on the reference genome; obtaining sequence data of m bp bases at the upstream and downstream of the position as a base fragment set; taking the proportion of all the obtained base fragments in all the fragments as a second characteristic value;
the third characteristic value acquisition module is used for dividing the reference genome into a plurality of windows and respectively acquiring copy number data in each window range as a third characteristic value;
the screening module is used for screening out a characteristic value with a significant difference between the samples of the positive group and the control group in the initial characteristic values as a model characteristic vector by taking the first characteristic value, the second characteristic value and the third characteristic value as the initial characteristic values;
and the model construction module is used for inputting the model characteristic vectors of the samples of the positive group and the control group into the model, taking the probability of suffering from intestinal cancer as a model output value, training the model and obtaining the early-screening model.
The first characteristic value obtaining module comprises:
the first reading number counting module is used for dividing the reference genome into a plurality of windows and respectively obtaining the total reading number, the short reading number and the ultra-long reading number in each window range;
a second read number counting module, configured to take the long arm and the short arm on each chromosome as region ranges, and obtain the number of reads in gradient intervals of different lengths in each range;
and the merging module is used for taking the data obtained in the first reading number counting module and the second reading number counting module as the first characteristic value together.
A computer-readable medium storing a computer program capable of executing the above method for constructing an intestinal cancer early-screening model.
The above model can also use sub-models among them alone:
a construction method of a colon cancer early-screening model comprises the following steps:
step 1, extracting cfDNA from samples of a positive group and a control group and sequencing to obtain reading data;
step 2, comparing the reading data results to a reference genome, and obtaining the number of the reading in different length intervals in different window ranges on the reference genome as an initial characteristic value;
step 3, screening out characteristic values with significant difference between the samples of the positive group and the control group in the initial characteristic values as model characteristic vectors;
and 4, inputting the model characteristic vectors of the samples of the positive group and the control group into the model, and training the model by taking the intestinal cancer probability as a model output value to obtain the early-screening model.
The step 2 comprises the following steps:
step 2-1, dividing the reference genome into a plurality of windows, and respectively obtaining the number of all reads, the number of short reads and the number of ultra-long reads within the range of each window;
step 2-2, respectively taking the long arm and the short arm on each chromosome as region ranges, and obtaining the number of reads in gradient intervals with different lengths in each range;
and 2-3, taking the data obtained in the steps 2-1 and 2-2 together as an initial characteristic value.
Step 3 comprises: classifying the samples by taking the initial characteristic values as the input values of each model and the presence or absence of intestinal cancer as the output value, and ranking the features by their contribution values to obtain, for each model, the feature vectors with significant differences;
the model is selected from a gradient lifting algorithm model, a random forest model or a deep network learning model.
A construction method of a colon cancer early-screening model comprises the following steps:
step 1, extracting cfDNA from samples of a positive group and a control group and sequencing to obtain reading data;
step 2, comparing the reading data result to a reference genome to obtain the position of the 5' end of the reading on the reference genome; obtaining sequence data of m bp bases at the upstream and downstream of the position as a base fragment set; taking the proportion of all the obtained base fragments in all the fragments as an initial characteristic value;
step 3, screening out characteristic values with significant difference between the samples of the positive group and the control group in the initial characteristic values as model characteristic vectors;
and 4, inputting the model characteristic vectors of the samples of the positive group and the control group into the model, and training the model by taking the intestinal cancer probability as a model output value to obtain the early-screening model.
Step 3 comprises: classifying the samples by taking the initial characteristic values as the input values of each model and the presence or absence of intestinal cancer as the output value, and ranking the features by their contribution values to obtain, for each model, the feature vectors with significant differences;
the model is selected from a gradient lifting algorithm model, a random forest model or a deep network learning model.
A construction method of a colon cancer early-screening model comprises the following steps:
step 1, extracting cfDNA from samples of a positive group and a control group and sequencing to obtain reading data;
step 2, dividing the reference genome into a plurality of windows, and respectively obtaining copy number data in each window range as an initial characteristic value;
step 3, screening out characteristic values with significant difference between the samples of the positive group and the control group in the initial characteristic values as model characteristic vectors;
and 4, inputting the model characteristic vectors of the samples of the positive group and the control group into the model, and training the model by taking the intestinal cancer probability as a model output value to obtain the early-screening model.
Step 3 comprises: classifying the samples by taking the initial characteristic values as the input values of each model and the presence or absence of intestinal cancer as the output value, and ranking the features by their contribution values to obtain, for each model, the feature vectors with significant differences;
the model is selected from a gradient lifting algorithm model, a random forest model or a deep network learning model.
Advantageous effects
The read length distribution, the sequence proportions at breakpoints and the regional copy number variation of WGS cfDNA from 115 healthy people and 195 patients with intestinal cancer or advanced intestinal adenoma were analyzed statistically. Models were constructed with three different learning algorithms, and second-level ensemble training was then performed on all the models, improving their ability to distinguish the healthy and cancer groups. The invention provides, for the first time, a multi-molecular-feature, multi-training-algorithm, second-level ensemble diagnostic model based on high-throughput, low-depth sequencing of plasma cfDNA. The model can diagnose early intestinal cancer and advanced intestinal adenoma, and has the advantages of non-invasive detection, low sequencing throughput, high specificity and high sensitivity.
Drawings
FIG. 1 is a schematic diagram of a model building process;
FIG. 2 is a schematic diagram of a quadratic set model construction process;
FIG. 3 is a heat map of the differences between the cancer and healthy groups in the top 50 high-resolution DNA fragment length proportion distribution features;
FIG. 4 is a heat map of the differences between the cancer and healthy groups in the top 50 features of sequence proportion at the breakpoint at the 5' end of reads;
FIG. 5 is a heat map of the differences between the cancer and healthy groups in the top 50 1 Mb-window copy number variation features;
FIG. 6 shows the predicted AUC curves, on the validation set and the test set, of classifiers trained with different algorithms on the high-resolution DNA fragment length proportion distribution feature;
FIG. 7 shows the predicted AUC curves, on the validation set and the test set, of classifiers trained with different algorithms on the feature of sequence proportion at the breakpoint at the 5' end of reads;
FIG. 8 shows the predicted AUC curves, on the validation set and the test set, of classifiers trained with different algorithms on the copy number variation feature;
FIG. 9 is a graph of the AUC predicted by a quadratic set training classifier for different features of the classifier on the validation set;
FIG. 10 is a graph of the secondary set of different features of a classifier on a validation set training the classifier predictors;
FIG. 11 is a different set of predicted AUC curves for a classifier after a full set of models on the validation set and test set;
FIG. 12 is a different set of predicted AUC curves for a classifier after a full set of models on the validation set and test set;
FIG. 13 is a graph of the predicted outcome of the classifier after the entire set of models on the validation set and the test set;
FIG. 14 is a graph of the predicted outcome of the classifier after the entire set of models on the validation set;
FIG. 15 is a graph of the predicted results of a classifier after a full set of models on a test set;
Detailed Description
The calculation method of the invention is described in detail below.
The invention first requires extraction of cfDNA from blood samples, library construction and sequencing. The extraction and library construction methods are not particularly limited and can be adapted from extraction methods in the prior art. In the sequencing step, the base information of the cfDNA can be obtained using existing sequencing techniques.
The data set conditions adopted in the model construction process of the invention are as follows:
[Figure BDA0003253394800000061: table of the data sets used in model construction]
Method for extracting and sequencing plasma cfDNA samples
An 8 ml whole blood sample is collected from the patient in a purple-top blood collection tube (EDTA anticoagulant tube), and the plasma is separated by centrifugation promptly (within 2 hours). After transport to the laboratory, cfDNA is extracted from the plasma sample with a QIAGEN plasma DNA extraction kit according to the manufacturer's instructions. A library is constructed from the extracted cfDNA, and WGS is performed at 2× depth. The resulting raw sequencing data are aligned to the human reference genome to obtain the base information of each read.
Data processing
The marker data of the invention mainly use three molecular features:
1. High-resolution DNA fragment size distribution (High Resolution Fragmentation Size Distribution, HRFSD). The DNA fragment size distribution reflects the distribution of cfDNA read lengths. It is used in machine learning to build a prediction model that distinguishes non-cancer subjects (healthy people) from intestinal cancer patients (advanced colorectal adenoma, CRA, and colorectal cancer, CRC). A comparison of the cfDNA read lengths of 115 healthy people and 195 patients with intestinal cancer or advanced intestinal adenoma showed that the numbers of fragments of 40-80 bp and of 200-300 bp differ between the two groups and can therefore serve as a distinguishing feature.
The cfDNA read length data are obtained as follows. The aligned BAM file records the quality, length and alignment position of each read; the human reference genome is the hg19 sequence provided by the University of California, Santa Cruz (UCSC). The human reference genome is cut into 572 windows of 5 Mb, and the total number of reads (40-300 bp), the number of short reads (40-80 bp) and the number of ultra-long reads (200-300 bp) in each window are counted. Each read count is then standardized across all windows, i.e. standardized value = (original value - mean) / standard deviation. This yields three sets of 572 standardized read counts, one for each length class.
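The per-window counting and standardization described above can be sketched as follows. This is a minimal illustration only: the input layout (pairs of window index and fragment length), the function names and the use of the population standard deviation are assumptions, since the text does not specify them.

```python
from statistics import mean, pstdev

def length_class_counts(reads, n_windows):
    """Count total (40-300 bp), short (40-80 bp) and ultra-long (200-300 bp)
    reads in each window.

    reads: iterable of (window_index, fragment_length) pairs (assumed layout).
    """
    total = [0] * n_windows
    short = [0] * n_windows
    ultra_long = [0] * n_windows
    for win, length in reads:
        if 40 <= length <= 300:
            total[win] += 1
            if length <= 80:
                short[win] += 1
            if length >= 200:
                ultra_long[win] += 1
    return total, short, ultra_long

def zscore(values):
    """Standardize across windows: (value - mean) / standard deviation."""
    mu = mean(values)
    sd = pstdev(values)  # population SD; the text does not say which SD is used
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]

# Toy input with 3 windows; the 350 bp fragment falls outside 40-300 bp.
reads = [(0, 45), (0, 167), (0, 250), (1, 60), (1, 70), (1, 166), (1, 350), (2, 170)]
total, short, ultra_long = length_class_counts(reads, n_windows=3)
total_z = zscore(total)
```

With 572 windows, the three standardized lists would supply 3 × 572 of the feature values.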
Meanwhile, to obtain high-resolution read results, 41 regions consisting of the long and short arms of the chromosomes of the human reference genome are used as windows, as listed below:
chr1_p chr4_q chr8_p chr11_q chr16_q chr20_p
chr1_q chr5_p chr8_q chr12_p chr17_p chr20_q
chr2_p chr5_q chr9_p chr12_q chr17_q chr21_q
chr2_q chr6_p chr9_q chr13_q chr18_p chr22_q
chr3_p chr6_q chr10_p chr14_q chr18_q chrX_p
chr3_q chr7_p chr10_q chr15_q chr19_p chrX_q
chr4_p chr7_q chr11_p chr16_p chr19_q
Fragments of 40-300 bp are divided into 27 length gradients in 10 bp increments (for example, 40-49 bp, 50-59 bp, ... on the q arm of chr1), the number of fragments in each length gradient is counted in each chromosome-arm window, and the counts are standardized. This gives 2823 high-resolution DNA fragment size distribution features in total (2823 = 572 standardized total read counts + 572 standardized short read counts + 572 standardized ultra-long read counts + 41 × 27 standardized length-gradient counts).
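As an illustrative check of the arithmetic above, the length gradients and the total feature count can be sketched as below. The helper names are hypothetical, and placing the single length 300 bp in its own final gradient is an assumption about how 27 gradients arise from 10 bp steps over 40-300 bp.

```python
def length_bin(length, lo=40, hi=300, step=10):
    """Return the 0-based length-gradient index, or None if out of range."""
    if not lo <= length <= hi:
        return None
    return (length - lo) // step

N_BINS = length_bin(300) + 1   # 40-49, 50-59, ..., 290-299, 300
N_ARMS = 41                    # chromosome-arm windows listed in the text
N_WINDOWS = 572                # 5 Mb windows
n_features = 3 * N_WINDOWS + N_ARMS * N_BINS

def arm_gradient_counts(fragments, arms):
    """Count fragments per (arm, length gradient).

    fragments: iterable of (arm_name, fragment_length) pairs (assumed layout).
    """
    counts = {arm: [0] * N_BINS for arm in arms}
    for arm, length in fragments:
        b = length_bin(length)
        if b is not None and arm in counts:
            counts[arm][b] += 1
    return counts

counts = arm_gradient_counts(
    [("chr1_q", 42), ("chr1_q", 49), ("chr1_q", 55), ("chr1_p", 300)],
    ["chr1_p", "chr1_q"],
)
```

Here `n_features` reproduces the 2823 total stated in the text, and `counts["chr1_q"][0]` holds the two toy fragments in the 40-49 bp gradient.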
2. Proportion of reads by sequence at the breakpoint at the 5' end of the read (Motif Breakpoint 8-mer, MTBK)
Human genomic DNA has a double-helix structure held together by hydrogen bonds between complementary base pairs. During normal aging and cancer progression, the pH of the environment around the cells changes, so that the hydrogen bonds between complementary bases are disrupted and the DNA breaks. Because the base sequences at the breakpoints differ, the proportion of reads carrying each breakpoint sequence also varies. The collection method is as follows: the aligned BAM file records the basic information and aligned position of each read; the 4 bp sequences upstream and downstream of the human reference genome coordinate of the 5' end of each read are determined (8 bp in total); the number of reads for each breakpoint sequence (4^8 = 65536 sequences in total) is counted; and the read proportion of each of the 65536 breakpoint sequences is calculated, for example: AAAAAAAA read proportion = number of AAAAAAAA reads / total number of reads over all breakpoint sequences.
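The breakpoint-motif proportions can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the reference is a plain string, positions are 0-based indices of the first base of each read, strand orientation is ignored, and the function names are hypothetical.

```python
from collections import Counter

def breakpoint_motif(reference, five_prime_pos, m=4):
    """Return the 2*m reference bases around a read's 5' end.

    The motif is the m bases upstream of the 5' end followed by the m bases
    starting at the 5' end. Returns None near the sequence edges or when the
    motif contains a base other than A, C, G, T.
    """
    start = five_prime_pos - m
    end = five_prime_pos + m
    if start < 0 or end > len(reference):
        return None
    motif = reference[start:end].upper()
    if set(motif) - set("ACGT"):
        return None
    return motif

def motif_proportions(reference, positions, m=4):
    """Proportion of reads per breakpoint motif: count / total counted reads."""
    counts = Counter()
    for pos in positions:
        motif = breakpoint_motif(reference, pos, m)
        if motif is not None:
            counts[motif] += 1
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}

N_MOTIFS = 4 ** 8  # number of possible 8-mers

# Toy reference and three read start positions.
reference = "AAAAAAAAAACGTACGTACGT"
props = motif_proportions(reference, [4, 5, 12])
```

In this toy case two of the three reads share the motif AAAAAAAA, so its proportion is 2/3; motifs that never occur would take the value 0 in the full 65536-element vector.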
3. 1 Mb window copy number variation (1Mb-Bin Copy Number Variation, CNV)
Copy number variation is highly correlated with individual cancers. Although cancers can be distinguished by detecting copy number variation in a subset of cancer-associated genes or in particular genomic intervals, other rare or unknown genes or intervals can also provide information about potential copy number variation. The collection method is as follows. First, WGS data from 30 healthy people are collected, and chromosomes 1-22 of the reference genome are divided into non-overlapping 1 Mb windows. For each sample, the read depth in each window is calculated with bedtools coverage and corrected for the GC content and average mappability (UCSC BigWig file) of the window; the median depth of the 30 healthy people in each window is taken as the representative value, giving a population baseline of read depths for 2475 windows. For each sample to be tested, the read depths of the 2475 windows are obtained, and the log copy-number change of each window, i.e. log2(corrected and normalized depth of the sample / corrected and normalized depth of the population baseline), is computed using a hidden Markov model (HMM) and the population baseline depth of each window, giving the copy number variation information of the sample.
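The baseline and log2-ratio step can be sketched as below. The sketch assumes the depths have already been corrected and normalized, and it deliberately omits the GC and mappability correction and the HMM step; the function names and the small `eps` guard against zero depth are assumptions for illustration.

```python
from math import log2
from statistics import median

def population_baseline(healthy_depths):
    """Per-window median depth across healthy samples (one row per sample)."""
    return [median(column) for column in zip(*healthy_depths)]

def log2_ratios(sample_depths, baseline, eps=1e-9):
    """log2(sample depth / baseline depth) for each window."""
    return [log2((d + eps) / (b + eps)) for d, b in zip(sample_depths, baseline)]

# Toy data: 3 healthy samples x 3 windows (the text uses 30 samples x 2475 windows).
healthy = [[100, 200, 50], [110, 190, 50], [90, 210, 50]]
baseline = population_baseline(healthy)
ratios = log2_ratios([200, 200, 25], baseline)
```

A value near 1 indicates roughly doubled depth relative to the baseline, a value near 0 no change, and a value near -1 roughly halved depth.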
Through the data acquisition described above, initial data vectors of the three types of data are obtained. The corresponding calculation methods are then designed:
The invention mainly uses three single-feature machine learning algorithms:
1. Gradient boosting machine (GBM)
The gradient boosting algorithm is a common machine learning algorithm. Its basic principle is that each newly added weak classifier is trained on the negative gradient of the current model's loss function, and the trained weak classifier is then added cumulatively to the existing model to obtain an optimal model. To prevent the GBM from over- or under-fitting during learning, the GBM parameters are set as follows: ntrees = 200, max_depth = 9, learning_rate = 0.01, subsample = 0.8, cross validation = 10.
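The principle stated above (fit each new weak learner to the negative gradient of the loss, then add it to the model) can be illustrated with a deliberately small sketch. It uses single-threshold stumps on one feature and log-loss; tree depth, row subsampling and cross-validation are omitted, and all names and data are assumptions rather than the implementation used in the invention.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(xs, residuals):
    """Least-squares regression stump (one threshold) on a 1-D feature."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm = sum(left) / len(left)
        rm = sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gradient_boost(xs, ys, n_trees=200, learning_rate=0.01):
    """Additive model for log-loss: each stump fits the negative gradient y - p."""
    stumps = []

    def score(x):
        return sum(learning_rate * s(x) for s in stumps)

    for _ in range(n_trees):
        residuals = [y - sigmoid(score(x)) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return lambda x: sigmoid(score(x))

# Toy one-feature data set: low values are controls (0), high values cases (1).
xs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
model = gradient_boost(xs, ys, n_trees=200, learning_rate=0.01)
```

With the small learning rate from the text, each stump moves the prediction only slightly, which is why many trees are combined.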
2. Random Forest (Random Forest, RF)
Random forests are a powerful classification and regression tool. Given a data set, a random forest randomly samples part of the information to generate a forest of decision trees for classification or regression, choosing a splitting attribute at each node and repeating the random sampling until no further split is possible; finally, the results of all the trees are combined to give the final prediction. To prevent the RF from over- or under-fitting during learning, the RF parameters are set as follows: ntrees = 200, max_depth = 9, cross validation = 10.
3. Deep learning algorithm (Deep Learning, DL)
Deep learning is based on a multi-layer feedforward artificial neural network that is trained with stochastic gradient descent using back-propagation. The network may comprise a large number of hidden layers consisting of neurons with hyperbolic tangent, rectifier and maxout activation functions. Advanced features such as adaptive learning rate, rate annealing, momentum training, dropout, L1 or L2 regularization, checkpointing and grid search can achieve higher prediction accuracy. During training, each compute node trains a copy of the global model parameters on its local data using multiple threads (asynchronously), and periodically contributes to the global model by model averaging over the network. Feedforward artificial neural network (ANN) models, also known as deep neural networks (DNNs) or multi-layer perceptrons (MLPs), are the most common type of deep neural network. The main principle is to design, for perceptrons with multiple inputs and multiple outputs, a suitable number of neuron computing nodes and a multi-layer hierarchical structure, to select suitable input and output layers, and to establish a functional relationship from input to output through network learning and tuning, so that the actual relationship is approximated as closely as possible. To prevent over- or under-fitting during DL learning, the DL parameters are set as follows: epoch = 300, hidden = (100, 100, 100), input_drop_rates = 0.05, rho = 0.95, mini_batch_size = 10, cross validation = 10.
After the three types of initial data of 115 healthy people and 195 patients with intestinal cancer or advanced intestinal adenoma are obtained, the statistics of the high-resolution DNA fragment size distribution (HRFSD) are used as input values (the input vector of each sample consists of 2823 read-ratio feature values), and the samples to be tested and the normal samples are classified by each of the three classification models. Similarly, after the read-count proportions of the breakpoint sequences at the 5' ends of the DNA fragments (MTBK) are collected for the patients and the healthy people, the proportions of the 65536 5'-end breakpoint sequences are used as input values, and the three classification models are used to classify the samples to be tested against the normal samples. Likewise, the copy number (CNV) information of 2475 windows is used as the input vector for sample classification by the three classification models. In this way, each of the three types of data is fed into each of the three models, giving 3 × 3 = 9 model results. From each result, the contribution value of every feature to the classification can be obtained. After collecting, for each molecular feature, the feature columns whose contribution value is not 0 under the different training algorithms, a significant union of 1368 columns for HRFSD, 958 columns for MTBK, and 1073 columns for CNV is finally obtained. The top 50 feature columns by contribution for each molecular feature were analyzed for differences; as the heat map shows, the top 50 columns of each feature carry differential signals between the cancer group and the healthy group.
For the MTBK data set, the top 50 sequences by contribution value under each of the three models, together with their contribution values, are as follows:
GBM model data:
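The 65536-dimensional breakpoint feature can be sketched as follows. This is a minimal illustration, assuming that each breakpoint sequence is the 8-base string formed by 4 bases on either side of a read's 5' end (4^8 = 65536, consistent with the 8-base sequences in the sequence listing). The toy reference string, the positions, and the function name are hypothetical, and strand handling is omitted.

```python
from collections import Counter
from itertools import product


def breakpoint_motif_proportions(reference, five_prime_positions, m=4):
    """Return the proportion of every 2m-mer spanning the given 5' end positions.

    reference: reference sequence string (toy stand-in for a genome).
    five_prime_positions: 0-based positions where read 5' ends align.
    """
    all_motifs = ["".join(p) for p in product("ACGT", repeat=2 * m)]
    counts = Counter()
    for pos in five_prime_positions:
        start, end = pos - m, pos + m
        if start < 0 or end > len(reference):
            continue  # skip ends too close to the sequence boundary
        motif = reference[start:end].upper()
        if set(motif) <= set("ACGT"):  # ignore N and other ambiguous bases
            counts[motif] += 1
    total = sum(counts.values())
    return {
        motif: (counts[motif] / total if total else 0.0) for motif in all_motifs
    }


reference = "ACGTACGTTGCAACGTACGTTGCA"
features = breakpoint_motif_proportions(reference, [4, 8, 12, 16], m=4)
```

Enumerating all possible motifs up front keeps the feature vector the same length for every sample, with unobserved motifs set to 0.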
Figure BDA0003253394800000101
RF model data:
Figure BDA0003253394800000102
Figure BDA0003253394800000111
DL model data:
Figure BDA0003253394800000112
Figure BDA0003253394800000121
To further improve the prediction performance of the classifier, secondary ensemble training (stacking) is performed on the results of the 9 trained models above. Stacking is an ensemble learning technique that performs meta-learning (2nd-level meta-learning) on top of several underlying weak classifiers (1st-level base models): it collects the characteristics of each base classifier and finds the optimal way to integrate them, thereby improving the prediction performance of the model. The training algorithm used for stacking is a generalized linear model (GLM), which relates the mathematical expectation of the response variable to a linear combination of the predictor variables through a link function. The 9 trained models are thereby converted into the final linear equation:
ALL Stacked = Intercept + A × HRFSD_GBM + B × HRFSD_RF + C × HRFSD_DL + D × MTBK_GBM + E × MTBK_RF + F × MTBK_DL + G × CNV_GBM + H × CNV_RF + I × CNV_DL,
where Intercept and A to I are the parameters of the linear equation. HRFSD_GBM and the other terms refer to the output values (probability of disease) produced by the corresponding model from its input data.
The specific coefficients are as follows:
Name Coefficient
Intercept -0.95688
A(HRFSD_GBM) 0.004297
B(HRFSD_RF) 0.139366
C(HRFSD_DL) 0.733057
D(MTBK_GBM) 0.788211
E(MTBK_RF) -0.08808
F(MTBK_DL) 0.944454
G(CNV_GBM) 0.337852
H(CNV_RF) -0.02318
I(CNV_DL) 0.612503
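As a worked illustration, the linear combination with the coefficients listed above can be evaluated as follows. The sketch computes only the linear predictor; whether a link function (for example a logistic function) is then applied to turn it into a probability is not stated here, so that step is left out. The function name and the example inputs are hypothetical.

```python
# Coefficients copied from the table above.
INTERCEPT = -0.95688
COEFFICIENTS = {
    "HRFSD_GBM": 0.004297,
    "HRFSD_RF": 0.139366,
    "HRFSD_DL": 0.733057,
    "MTBK_GBM": 0.788211,
    "MTBK_RF": -0.08808,
    "MTBK_DL": 0.944454,
    "CNV_GBM": 0.337852,
    "CNV_RF": -0.02318,
    "CNV_DL": 0.612503,
}


def stacked_linear_score(base_outputs):
    """Combine the 9 base-model outputs into the stacked linear predictor."""
    missing = set(COEFFICIENTS) - set(base_outputs)
    if missing:
        raise KeyError(f"missing base-model outputs: {sorted(missing)}")
    return INTERCEPT + sum(
        COEFFICIENTS[name] * base_outputs[name] for name in COEFFICIENTS
    )


# Hypothetical base-model outputs for two extreme samples.
all_zero = {name: 0.0 for name in COEFFICIENTS}
all_one = {name: 1.0 for name in COEFFICIENTS}
score_low = stacked_linear_score(all_zero)
score_high = stacked_linear_score(all_one)
```

The DL and MTBK_GBM outputs carry the largest weights, while the two negative RF coefficients slightly reduce the score.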
With the different features used as input vectors for the different training algorithms, the model prediction performance is as follows:
Figure BDA0003253394800000122
Figure BDA0003253394800000131
The HRFSD Stacked model is a linear equation model consisting of the HRFSD GBM model, the HRFSD RF model and the HRFSD DL model. The MTBK Stacked model is a linear equation model consisting of the MTBK GBM model, the MTBK RF model and the MTBK DL model. The CNV Stacked model is a linear equation model consisting of the CNV GBM model, the CNV RF model and the CNV DL model. The results are shown in the figure and in the following table:
Figure BDA0003253394800000132
Here, the Stacked model of HRFSD and MTBK is a linear model formed from the three HRFSD models and the three MTBK models; the Stacked model of HRFSD and CNV is a linear model formed from the three HRFSD models and the three CNV models; and the Stacked model of MTBK and CNV is a linear model formed from the three MTBK models and the three CNV models.
Figure BDA0003253394800000133
Each feature has a certain predictive effect under the different training algorithms, and secondary ensemble training on a single feature improves the predictive effect of that feature. When secondary ensemble training is performed on all 9 models, the classifier performs best, with an AUC of up to 0.988. It was also found that the ensemble model can effectively distinguish healthy people from intestinal cancer patients and healthy people from advanced intestinal adenoma patients, but intestinal cancer and advanced intestinal adenoma have similar molecular characteristics and are difficult to tell apart (AUC = 0.594).
The results obtained with the final ensemble model above are shown in the following table:
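For readers unfamiliar with the metric, the AUC can be computed as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below uses this pairwise definition on made-up labels and scores; it is not the evaluation code used to obtain the values reported here.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data: 1 = patient, 0 = healthy; scores are model output probabilities.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
auc = roc_auc(labels, scores)
```

An AUC near 0.5, such as the 0.594 reported for cancer versus advanced adenoma, means the scores rank the two groups little better than chance.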
Figure BDA0003253394800000134
Figure BDA0003253394800000141
The prediction results show that the multi-feature ensemble classifier can correct the misjudgments of the single-feature ensemble classifiers: at a specificity of 94.83% in the validation set and the test set, the sensitivity reaches 97.44%.
Sequence listing
<110> Nanjing Shihe Gene Biotechnology Co., Ltd.
NANJING SHIHE MEDICAL DEVICES Co.,Ltd.
<120> intestinal cancer early-screening marker, detection method, detection device, and computer-readable medium
<130> None
<160> 150
<170> SIPOSequenceListing 1.0
<210> 1
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 1
ccccattg 8
<210> 2
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 2
cttaatag 8
<210> 3
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 3
gtcccagt 8
<210> 4
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 4
accccgtg 8
<210> 5
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 5
ccgatttg 8
<210> 6
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 6
tgcggtgc 8
<210> 7
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 7
tacggtga 8
<210> 8
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 8
gcgggttg 8
<210> 9
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 9
atcgcgtg 8
<210> 10
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 10
gcgattcg 8
<210> 11
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 11
tgaaaccg 8
<210> 12
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 12
cccattca 8
<210> 13
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 13
gttcgttt 8
<210> 14
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 14
ccctgtgt 8
<210> 15
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 15
gccgatcc 8
<210> 16
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 16
gcacagtt 8
<210> 17
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 17
atagtgcg 8
<210> 18
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 18
cccagtac 8
<210> 19
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 19
gcccaatg 8
<210> 20
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 20
gggtttca 8
<210> 21
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 21
ccctcgaa 8
<210> 22
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 22
gcctagtc 8
<210> 23
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 23
gattctca 8
<210> 24
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 24
cggccgta 8
<210> 25
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 25
aattcgct 8
<210> 26
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 26
gaatggat 8
<210> 27
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 27
acagtgtt 8
<210> 28
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 28
tctcacgt 8
<210> 29
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 29
cttggaaa 8
<210> 30
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 30
atcacgct 8
<210> 31
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 31
aacttcgg 8
<210> 32
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 32
ctttcgtg 8
<210> 33
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 33
attaatgt 8
<210> 34
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 34
gctgatct 8
<210> 35
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 35
gtaggacc 8
<210> 36
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 36
cggtacgc 8
<210> 37
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 37
tcaattcg 8
<210> 38
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 38
ccgccgta 8
<210> 39
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 39
catagaaa 8
<210> 40
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 40
gcgtacaa 8
<210> 41
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 41
aggcataa 8
<210> 42
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 42
gcagcgaa 8
<210> 43
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 43
caagcgta 8
<210> 44
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 44
cacgacgc 8
<210> 45
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 45
acaagaag 8
<210> 46
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 46
acccggct 8
<210> 47
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 47
ttgtatac 8
<210> 48
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 48
gcgcgaaa 8
<210> 49
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 49
tatagccg 8
<210> 50
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 50
tcacaccc 8
<210> 51
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 51
atgaattc 8
<210> 52
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 52
agtactag 8
<210> 53
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 53
cattctct 8
<210> 54
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 54
agctgaac 8
<210> 55
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 55
gagactcc 8
<210> 56
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 56
cgcggtgt 8
<210> 57
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 57
cttaatat 8
<210> 58
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 58
gttaatga 8
<210> 59
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 59
tctaatga 8
<210> 60
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 60
tttaatta 8
<210> 61
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 61
cgcagcag 8
<210> 62
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 62
tataatcg 8
<210> 63
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 63
ttttataa 8
<210> 64
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 64
gcccatta 8
<210> 65
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 65
atttgtaa 8
<210> 66
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 66
catttagg 8
<210> 67
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 67
aacagcac 8
<210> 68
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 68
ttcccagc 8
<210> 69
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 69
atgaatac 8
<210> 70
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 70
tacttccg 8
<210> 71
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 71
accactgc 8
<210> 72
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 72
agaagcag 8
<210> 73
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 73
atcggcag 8
<210> 74
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 74
cgtactca 8
<210> 75
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 75
gcctgcac 8
<210> 76
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 76
agtgctct 8
<210> 77
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 77
cccactac 8
<210> 78
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 78
tctgatct 8
<210> 79
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 79
ggagcgta 8
<210> 80
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 80
gaggcgtc 8
<210> 81
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 81
ttgagcaa 8
<210> 82
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 82
cataatgt 8
<210> 83
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 83
cccagcac 8
<210> 84
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 84
ttgggcag 8
<210> 85
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 85
aaaagccg 8
<210> 86
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 86
aacggtgc 8
<210> 87
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 87
cggaatct 8
<210> 88
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 88
ttggcgta 8
<210> 89
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 89
gcttatgg 8
<210> 90
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 90
gggtcaga 8
<210> 91
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 91
ggcaatga 8
<210> 92
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 92
cccccgta 8
<210> 93
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 93
tgcccgtg 8
<210> 94
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 94
ataagtat 8
<210> 95
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 95
ttcagcac 8
<210> 96
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 96
tccggcaa 8
<210> 97
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 97
cattgcag 8
<210> 98
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 98
tagagcac 8
<210> 99
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 99
tctagtaa 8
<210> 100
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 100
acaaattc 8
<210> 101
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 101
cacggtga 8
<210> 102
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 102
tcggacgt 8
<210> 103
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 103
ttcggtgt 8
<210> 104
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 104
tttcgtgg 8
<210> 105
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 105
attcgttc 8
<210> 106
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 106
acgcacca 8
<210> 107
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 107
ccccgtat 8
<210> 108
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 108
agcggtgc 8
<210> 109
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 109
ggcggtac 8
<210> 110
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 110
ttcaacgc 8
<210> 111
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 111
gccggtcg 8
<210> 112
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 112
actcgacc 8
<210> 113
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 113
ctcacgca 8
<210> 114
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 114
cctagtaa 8
<210> 115
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 115
atggatcg 8
<210> 116
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 116
ccgaatcc 8
<210> 117
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 117
cggaacga 8
<210> 118
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 118
tccgttct 8
<210> 119
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 119
aggtacgg 8
<210> 120
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 120
tcgcggga 8
<210> 121
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 121
cggcgtgc 8
<210> 122
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 122
acgtatac 8
<210> 123
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 123
ccccgaac 8
<210> 124
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 124
acctggag 8
<210> 125
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 125
tggaggac 8
<210> 126
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 126
gaccaaag 8
<210> 127
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 127
ccctaagt 8
<210> 128
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 128
atcggtag 8
<210> 129
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 129
accattcc 8
<210> 130
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 130
cccggatt 8
<210> 131
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 131
tcaggact 8
<210> 132
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 132
acggatcg 8
<210> 133
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 133
atcggtcg 8
<210> 134
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 134
tcctcggg 8
<210> 135
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 135
tgtcgtag 8
<210> 136
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 136
acgggcgg 8
<210> 137
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 137
caagcgaa 8
<210> 138
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 138
gcccgtgt 8
<210> 139
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 139
ctatatca 8
<210> 140
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 140
aggagttt 8
<210> 141
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 141
cgctgtgt 8
<210> 142
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 142
cccgatgt 8
<210> 143
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 143
agccgtgc 8
<210> 144
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 144
atatacgg 8
<210> 145
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 145
caaggtga 8
<210> 146
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 146
ttctagtt 8
<210> 147
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 147
actacgga 8
<210> 148
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 148
cacgggac 8
<210> 149
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 149
gcgtgata 8
<210> 150
<211> 8
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 150
ttagatca 8

Claims (10)

1. A method for constructing an intestinal cancer early-screening model, characterized by comprising the following steps:
step 1, extracting cfDNA from samples of a positive group and a control group and sequencing it to obtain read data;
step 2, aligning the read data to a reference genome, and obtaining the number of reads in different length intervals within different window ranges on the reference genome as a first characteristic value;
step 3, aligning the read data to the reference genome to obtain the position of the 5' end of each read on the reference genome; obtaining the sequence of m bp of bases upstream and downstream of the position as a set of base fragments; and taking the proportion of each obtained base fragment among all fragments as a second characteristic value;
step 4, dividing the reference genome into a plurality of windows, and obtaining the copy number data within each window as a third characteristic value;
step 5, taking the first characteristic value, the second characteristic value and the third characteristic value together as initial characteristic values, and screening out, from the initial characteristic values, those that differ significantly between the samples of the positive group and the samples of the control group as model feature vectors;
step 6, inputting the model feature vectors of the samples of the positive group and the control group into a model, and training the model with the probability of intestinal cancer as the model output value to obtain the early-screening model.
2. The method for constructing an intestinal cancer early-screening model according to claim 1, wherein step 2 comprises: step 2-1, dividing the reference genome into a plurality of windows, and obtaining, within each window, the number of all reads, the number of short reads and the number of ultra-long reads;
step 2-2, taking the long arm and the short arm of each chromosome as regions, and obtaining the number of reads in each length gradient interval within each region;
step 2-3, taking the data obtained in steps 2-1 and 2-2 together as the first characteristic value.
3. The method for constructing an intestinal cancer early-screening model according to claim 1, wherein the short reads are 40-80 bp long and the ultra-long reads are 200-300 bp long; all reads are 40-300 bp long;
the window size in step 2-1 is 2-7 Mb;
the different length gradient intervals in step 2-2 are the length ranges obtained by stepping through the 40-300 bp range in increments of 8-12 bp.
4. The method for constructing an intestinal cancer early-screening model according to claim 1, wherein the number of reads is normalized;
and m is any integer between 2 and 5.
5. The method for constructing an intestinal cancer early-screening model according to claim 1, wherein step 5 comprises: taking the first characteristic value, the second characteristic value and the third characteristic value respectively as input values of a gradient boosting model, a random forest model and a deep network learning model, training on the samples with the presence or absence of intestinal cancer as the output value, and obtaining from each model the feature vectors with significant differences.
6. The method for constructing an intestinal cancer early-screening model according to claim 1, wherein step 6 comprises: taking the feature vectors with significant differences as input values of a classifier model and the probability of having intestinal cancer as the output value, and training the model with the sample data of the positive group and the control group to obtain the early-screening model.
7. The method of claim 6, wherein the classifier model is a linear model, and the variables of the model are the outputs obtained by inputting the feature vectors with significant differences among the first, second and third characteristic values into the gradient boosting model, the random forest model and the deep network learning model trained in step 5.
8. An intestinal cancer early-screening detection device, comprising:
a sequencing module, used for extracting and sequencing cfDNA from the samples of the positive group and the control group to obtain read data;
an alignment module, used for aligning the read data to a reference genome;
a first characteristic value acquisition module, used for obtaining the number of reads in different length intervals within different window ranges on the reference genome as a first characteristic value;
a second characteristic value acquisition module, used for obtaining the position of the 5' end of each read on the reference genome; obtaining the sequence of m bp of bases upstream and downstream of the position as a set of base fragments; and taking the proportion of each obtained base fragment among all fragments as a second characteristic value;
a third characteristic value acquisition module, used for dividing the reference genome into a plurality of windows and obtaining the copy number data within each window as a third characteristic value;
a screening module, used for taking the first characteristic value, the second characteristic value and the third characteristic value together as initial characteristic values, and screening out, from the initial characteristic values, those that differ significantly between the samples of the positive group and the control group as model feature vectors;
and a model construction module, used for inputting the model feature vectors of the samples of the positive group and the control group into a model, and training the model with the probability of having intestinal cancer as the model output value to obtain the early-screening model.
9. The intestinal cancer early-screening detection device according to claim 8, wherein the first characteristic value acquisition module comprises:
a first read counting module, used for dividing the reference genome into a plurality of windows and obtaining, within each window, the number of all reads, the number of short reads and the number of ultra-long reads;
a second read counting module, used for taking the long arm and the short arm of each chromosome as regions and obtaining the number of reads in each length gradient interval within each region;
and a merging module, used for taking the data obtained by the first read counting module and the second read counting module together as the first characteristic value.
10. A computer-readable medium storing a computer program capable of executing the method for constructing an intestinal cancer early-screening model according to claim 1.
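The read-count features of claims 2 and 3 can be sketched as follows. This is a minimal illustration that assumes a 10 bp step (within the claimed 8-12 bp range) and takes fragment lengths for one window or chromosome arm as a plain list; the function names and the toy lengths are hypothetical, and the normalization of claim 4 is not shown.

```python
def length_gradient_counts(fragment_lengths, min_len=40, max_len=300, step=10):
    """Count fragments per length interval between min_len and max_len.

    Intervals are [40, 50), [50, 60), ..., with the last one closed at max_len.
    """
    n_bins = (max_len - min_len) // step
    counts = [0] * n_bins
    for length in fragment_lengths:
        if min_len <= length <= max_len:
            # Fragments of exactly max_len fall into the last interval.
            index = min((length - min_len) // step, n_bins - 1)
            counts[index] += 1
    return counts


def window_read_counts(fragment_lengths):
    """All, short and ultra-long read counts for one window (claim 3 ranges)."""
    return {
        "all": sum(40 <= n <= 300 for n in fragment_lengths),
        "short": sum(40 <= n <= 80 for n in fragment_lengths),
        "ultra_long": sum(200 <= n <= 300 for n in fragment_lengths),
    }


# Toy fragment lengths (bp) for a single region.
lengths = [35, 40, 45, 79, 166, 167, 210, 300, 320]
gradient = length_gradient_counts(lengths)
window = window_read_counts(lengths)
```

Fragments outside the 40-300 bp range (35 and 320 in the toy list) are ignored by both functions.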
CN202111053742.7A 2021-09-08 2021-09-08 Intestinal cancer early-screening marker, detection method, detection device, and computer-readable medium Pending CN113903398A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111053742.7A CN113903398A (en) 2021-09-08 2021-09-08 Intestinal cancer early-screening marker, detection method, detection device, and computer-readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111053742.7A CN113903398A (en) 2021-09-08 2021-09-08 Intestinal cancer early-screening marker, detection method, detection device, and computer-readable medium

Publications (1)

Publication Number Publication Date
CN113903398A true CN113903398A (en) 2022-01-07

Family

ID=79188801

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111053742.7A Pending CN113903398A (en) 2021-09-08 2021-09-08 Intestinal cancer early-screening marker, detection method, detection device, and computer-readable medium

Country Status (1)

Country Link
CN (1) CN113903398A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114927213A (en) * 2022-04-15 2022-08-19 南京世和基因生物技术股份有限公司 Construction method and detection device of multiple-cancer early screening model
WO2023197825A1 (en) * 2022-04-15 2023-10-19 南京世和基因生物技术股份有限公司 Multi-cancer early screening model construction method and detection device
CN115295074A (en) * 2022-10-08 2022-11-04 南京世和基因生物技术股份有限公司 Application of gene marker in malignant pulmonary nodule screening, construction method of screening model and detection device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination