WO2023197825A1 - Multi-cancer early screening model construction method and detection device

Publication number
WO2023197825A1
Authority
WO
WIPO (PCT)
Prior art keywords
cancer
model
feature
reads
feature set
Application number
PCT/CN2023/082118
Other languages
French (fr)
Chinese (zh)
Inventor
邵阳
吴雪
包华
刘睿
吴舒雨
唐皖湘夫
杨珊珊
刘思思
孟齐
王婷婷
Original Assignee
南京世和基因生物技术股份有限公司
南京世和医疗器械有限公司
Application filed by 南京世和基因生物技术股份有限公司, 南京世和医疗器械有限公司

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869 Methods for sequencing
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • C12Q1/6886 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Definitions

  • The invention relates to detecting the tissue of origin of multiple cancer types, including lung cancer (lung adenocarcinoma, LUAD), colorectal cancer (CRC), and liver cancer (primary liver cancer, PLC), and belongs to the field of molecular biomedicine.
  • LUAD: lung adenocarcinoma (lung cancer)
  • CRC: colorectal cancer
  • PLC: primary liver cancer
  • Lung cancer, colorectal cancer, and liver cancer are the three malignant tumors with the highest mortality rates worldwide.
  • Lung cancer, liver cancer and colorectal cancer have low early diagnosis rates due to lack of obvious symptoms or difficulty in detection.
  • Most of the early cancer screening products currently on the market predict a single cancer type. If patients must undergo a separate screening test for each cancer type, the time, effort and cost involved may hinder the implementation and promotion of early screening across a wide population.
  • Early screening for multiple cancer types not only covers the early screening of various cancer types, but also accurately detects their tissue origin, preventing unknown primary cancers that may appear during the development of cancer, which may complicate the condition and delay diagnosis and treatment. Therefore, our country urgently needs an early screening product that simultaneously covers the above three malignant tumors with the highest mortality rate, so as to be more efficient, economical and practical and applicable to a wider range of people.
  • The present invention provides a method based on low-depth whole-genome sequencing (WGS) of cfDNA from plasma samples. The high-throughput sequencing results are used to analyze five differential features of cfDNA fragments across cancer types: the genome-wide fragment length distribution, the fragment length distribution on the long and short arm of each chromosome, the sequence at the fragment breakpoint (8-mer Breakpoint Motif), the 5' end sequence of the fragment (8-mer End Motif), and fragment copy number changes in 1 Mb windows.
  • Each feature is used for training and modeling with five algorithms: generalized linear model (GLM), gradient boosting machine (GBM), Random Forest, Deep Learning and XGBoost.
  • A multi-feature, multi-algorithm integrated model is then built with a generalized linear model (GLM) to achieve multi-cancer diagnosis.
  • TOO: Tissue of Origin
  • The first object of the present invention is to provide:
  • A method for constructing a multi-cancer early screening model, the model being used to classify whether a sample has intestinal cancer, lung cancer or liver cancer, including the following steps:
  • Step 1: Extract and sequence cfDNA from the samples of the positive group and the control group to obtain read data;
  • Step 2: Align the read data to the reference genome, divide the reference genome into multiple windows, and obtain the number of all reads, the number of short reads, and the number of ultra-long reads within each window as the first feature set;
  • Step 3: Align the read data to the reference genome, take the long arm and short arm of each chromosome as regions, and obtain the number of reads in gradient intervals of different lengths within each region as the second feature set;
  • Step 4: Take the m bases at the 5' end of each read as a set of base fragments, and obtain the proportion of each kind of base fragment among all fragments as the third feature set;
  • Step 5: Align the read data to the reference genome to obtain the position of the 5' end of each read on the reference genome; take the n bp of sequence upstream and downstream of that position as a set of base fragments; obtain the proportion of each kind of base fragment among all fragments as the fourth feature set;
  • Step 6: Divide the reference genome into multiple windows and obtain copy number data within each window as the fifth feature set;
  • Step 7: Use the first, second, third, fourth and fifth feature sets together as the initial feature values, input them into the classification model as the model feature vector, and train the model with whether the sample has cancer as the output value to obtain the early screening model.
  • Suffering from cancer refers to suffering from any one of intestinal cancer, lung cancer or liver cancer.
  • The simplification refers to screening, from each of the first, second, third, fourth and fifth feature sets, the features that differ significantly between the positive-group samples and the control-group samples.
  • The short reads are 40-80 bp in length, the ultra-long reads are 200-300 bp in length, and all reads are in the range of 40-300 bp in length.
  • the window size range in step 2 is 2-7Mb.
  • the gradient intervals of different lengths in step 3 refer to the gradient ranges of different lengths obtained by increasing in steps of 8-12 bp in the range of 40-300 bp.
  • m is any integer between 6 and 10.
  • n is any integer between 2 and 5.
  • The windows in step 6 are obtained by dividing chromosomes 1-22 of the reference genome into non-overlapping windows 0.8-1.2 Mb in length.
  • Inputting to the classification model in step 7 means inputting the first, second, third, fourth and fifth feature sets to the generalized linear model, gradient boosting algorithm model, random forest model, deep learning model and extreme gradient boosting model respectively to obtain multiple sub-models, which are then combined into a linear relationship model.
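Steps 4 and 5 above (the end motif and breakpoint motif feature sets) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the 0-based position convention, and the use of a single in-memory reference string are all assumptions, and the defaults m=8 and n=4 are chosen only because they fall in the stated ranges and give the 8-mer motifs mentioned earlier.

```python
from collections import Counter

def end_motif_proportions(reads, m=8):
    """Proportion of each m-base 5' end sequence among all reads (third feature set)."""
    motifs = [r[:m].upper() for r in reads if len(r) >= m]
    # Keep only unambiguous motifs (hypothetical filtering rule).
    counts = Counter(x for x in motifs if set(x) <= set("ACGT"))
    total = sum(counts.values())
    return {motif: c / total for motif, c in counts.items()} if total else {}

def breakpoint_motif_proportions(reference, five_prime_positions, n=4):
    """Proportion of each 2n-base reference sequence spanning a read's 5' end
    position: n bases upstream plus n bases downstream (fourth feature set)."""
    counts = Counter()
    for pos in five_prime_positions:
        if pos - n < 0 or pos + n > len(reference):
            continue  # skip positions too close to the ends of the reference
        counts[reference[pos - n:pos + n].upper()] += 1
    total = sum(counts.values())
    return {motif: c / total for motif, c in counts.items()} if total else {}
```

In practice the 5' end positions would come from the aligned reads, and the reference would be queried per chromosome rather than held as one string.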
  • A multi-cancer detection device, the device being used to classify whether a sample has intestinal cancer, lung cancer or liver cancer, including:
  • a sequencing module, used to extract and sequence cfDNA from the samples of the positive group and the control group to obtain read data;
  • a first feature set acquisition module, used to align the read data to the reference genome, divide the reference genome into multiple windows, and obtain the number of all reads, the number of short reads, and the number of ultra-long reads within each window as the first feature set;
  • a second feature set acquisition module, used to align the read data to the reference genome, take the long arm and short arm of each chromosome as regions, and obtain the number of reads in gradient intervals of different lengths within each region as the second feature set;
  • a third feature set acquisition module, used to take the m bases at the 5' end of each read as a set of base fragments, and obtain the proportion of each kind of base fragment among all fragments as the third feature set;
  • a fourth feature set acquisition module, used to align the read data to the reference genome and obtain the position of the 5' end of each read on the reference genome; take the n bp of sequence upstream and downstream of that position as a set of base fragments; and use the proportion of each kind of base fragment among all fragments as the fourth feature set;
  • a fifth feature set acquisition module, used to divide the reference genome into multiple windows and obtain copy number data within each window as the fifth feature set;
  • a model building module, used to take the first, second, third, fourth and fifth feature sets as the initial feature values, input them into the classification model as the model feature vector, and train the model with whether the sample has cancer as the output value to obtain the early screening model.
  • The third object of the present invention is to provide:
  • A computer-readable medium recording a computer program that can run the method for constructing a multi-cancer early screening model.
  • The fourth object of the present invention is to provide:
  • A method for constructing a multi-cancer early screening model, the model being used to distinguish intestinal cancer, lung cancer or liver cancer among cancer samples, including the following steps:
  • Step 1: Extract and sequence cfDNA from intestinal cancer, lung cancer, and liver cancer samples to obtain read data;
  • Step 2: Align the read data to the reference genome, divide the reference genome into multiple windows, and obtain the number of all reads, the number of short reads, and the number of ultra-long reads within each window as the first feature set;
  • Step 3: Align the read data to the reference genome, take the long arm and short arm of each chromosome as regions, and obtain the number of reads in gradient intervals of different lengths within each region as the second feature set;
  • Step 4: Take the m bases at the 5' end of each read as a set of base fragments, and obtain the proportion of each kind of base fragment among all fragments as the third feature set;
  • Step 5: Align the read data to the reference genome to obtain the position of the 5' end of each read on the reference genome; take the n bp of sequence upstream and downstream of that position as a set of base fragments; use the proportion of each kind of base fragment among all fragments as the fourth feature set;
  • Step 6: Divide the reference genome into multiple windows and obtain copy number data within each window as the fifth feature set;
  • Step 7: Establish three control experimental groups.
  • The positive samples in each group are intestinal cancer, lung cancer or liver cancer samples.
  • The control samples in each group are the samples of the remaining two cancer types.
  • In each of the three groups, the first, second, third, fourth and fifth feature sets are used together as the initial feature values, and the feature values with significant differences between the positive samples and the control samples are screened out.
  • The significantly different feature values from the three control experimental groups are then merged and input into the classification model as the model feature vector, and the model is trained with the probability of having intestinal cancer, lung cancer or liver cancer as the output value to obtain the early screening model.
  • Inputting to the classification model means inputting the first, second, third, fourth and fifth feature sets to the gradient boosting algorithm model, random forest model, deep learning model and extreme gradient boosting model respectively to obtain multiple sub-models, which are then combined into a linear relationship model.
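The three control experimental groups of step 7 amount to a one-versus-rest split followed by a union of the features selected in each group. The sketch below illustrates only that bookkeeping; the function names, the label strings and the idea of representing selected features as lists of names are assumptions for illustration, and the significance test itself is left out.

```python
def one_vs_rest_groups(labels, cancer_types=("intestinal", "lung", "liver")):
    """Build the three control experimental groups.

    Returns {cancer_type: (positive sample indices, control sample indices)},
    where the controls are the samples of the other two cancer types.
    """
    groups = {}
    for cancer in cancer_types:
        positive = [i for i, y in enumerate(labels) if y == cancer]
        control = [i for i, y in enumerate(labels)
                   if y in cancer_types and y != cancer]
        groups[cancer] = (positive, control)
    return groups

def merge_selected_features(selected_per_group):
    """Union of the features selected in each group, keeping first-seen order."""
    merged = []
    for features in selected_per_group.values():
        for name in features:
            if name not in merged:
                merged.append(name)
    return merged
```

For example, with labels `["lung", "liver", "intestinal", "lung"]` the lung group has positives `[0, 3]` and controls `[1, 2]`.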
  • The short reads are 40-80 bp in length, the ultra-long reads are 200-300 bp in length, and all reads are in the range of 40-300 bp in length.
  • the window size range in step 2 is 2-7Mb.
  • the gradient intervals of different lengths in step 3 refer to the gradient ranges of different lengths obtained by increasing in steps of 8-12 bp in the range of 40-300 bp.
  • m is any integer between 6 and 10.
  • n is any integer between 2 and 5.
  • The windows in step 6 are obtained by dividing chromosomes 1-22 of the reference genome into non-overlapping windows 0.8-1.2 Mb in length.
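The length-gradient feature of step 3 (reads counted per chromosome arm in intervals that advance in steps of 8-12 bp over 40-300 bp) could be computed along these lines. This is a sketch under assumptions: a 10 bp step, half-open intervals, arm labels such as "chr1p", and the function name are all illustrative choices not fixed by the text.

```python
from collections import Counter

def length_bin_counts(fragments, lo=40, hi=300, step=10):
    """Count fragments per (chromosome arm, length interval).

    fragments: iterable of (arm, length) pairs, e.g. ("chr1p", 167).
    Intervals are [lo, lo+step), [lo+step, lo+2*step), ...; lengths outside
    [lo, hi] are ignored, and a length equal to hi goes into the last interval.
    """
    n_bins = (hi - lo + step - 1) // step
    counts = Counter()
    for arm, length in fragments:
        if lo <= length <= hi:
            idx = min((length - lo) // step, n_bins - 1)
            counts[(arm, idx)] += 1
    return counts
```

With the defaults this yields 26 intervals per arm; each (arm, interval) count is one entry of the second feature set.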
  • The fifth object of the present invention is to provide:
  • A multi-cancer detection device, the device being used to distinguish intestinal cancer, lung cancer or liver cancer among cancer samples, including:
  • a sequencing module, used to extract and sequence cfDNA from intestinal cancer, lung cancer, and liver cancer samples to obtain read data;
  • a first feature set acquisition module, used to align the read data to the reference genome, divide the reference genome into multiple windows, and obtain the number of all reads, the number of short reads, and the number of ultra-long reads within each window as the first feature set;
  • a second feature set acquisition module, used to align the read data to the reference genome, take the long arm and short arm of each chromosome as regions, and obtain the number of reads in gradient intervals of different lengths within each region as the second feature set;
  • a third feature set acquisition module, used to take the m bases at the 5' end of each read as a set of base fragments, and obtain the proportion of each kind of base fragment among all fragments as the third feature set;
  • a fourth feature set acquisition module, used to align the read data to the reference genome and obtain the position of the 5' end of each read on the reference genome; take the n bp of sequence upstream and downstream of that position as a set of base fragments; and use the proportion of each kind of base fragment among all fragments as the fourth feature set;
  • a fifth feature set acquisition module, used to divide the reference genome into multiple windows and obtain copy number data within each window as the fifth feature set;
  • a model building module, used to establish three control experimental groups.
  • The positive samples in each group are intestinal cancer, lung cancer or liver cancer samples.
  • The control samples in each group are the samples of the remaining two cancer types.
  • In each of the three control experimental groups, the first, second, third, fourth and fifth feature sets are used as the initial feature values, and the feature values with significant differences between the positive samples and the control samples are screened out. The significantly different feature values from the three groups are then merged and input into the classification model as the model feature vector, and the model is trained with the probability of having intestinal cancer, lung cancer or liver cancer as the output value to obtain the early screening model.
  • The sixth object of the present invention is to provide:
  • A computer-readable medium recording a computer program that can run the method for constructing a multi-cancer early screening model.
  • This invention provides a secondary integrated diagnostic model that combines multiple molecular features and multiple training algorithms, based on high-throughput, low-depth sequencing of plasma cfDNA. The model can diagnose multiple early-stage cancers and their tissue of origin, and is characterized by non-invasive detection, low sequencing depth, and high detection specificity and sensitivity.
  • Figure 1 is a schematic diagram of the model construction process
  • Figure 2 is a schematic diagram of the process of building a multi-cancer early detection model
  • Figure 3 is a schematic diagram of the construction process of the tissue origin model of multiple cancer types
  • Figure 4 shows the distribution of the largest difference feature among the five features between the cancer group and the non-cancer group
  • Figure 5 shows the AUC performance of the multi-cancer early detection model in the training set
  • Figure 6 shows the AUC performance of the multi-cancer early detection model in the test set
  • Figure 7 shows the distribution of the unique largest difference among the five characteristics of liver cancer between liver cancer and other cancer types
  • Figure 8 shows the distribution of the unique largest differential features among the five features of colorectal cancer between colorectal cancer and other cancer types
  • Figure 9 shows the distribution of the unique largest difference feature among the five features of lung cancer between lung cancer and other cancer types
  • the present invention relates to early detection of multiple cancer types (lung cancer, intestinal cancer and liver cancer) and cancer type prediction markers, detection methods, detection devices and computer-readable media.
  • The present invention provides a method based on low-depth WGS of cfDNA from plasma samples, and uses the high-throughput sequencing results to analyze five differential features of cfDNA fragments across cancer types: the genome-wide fragment length coverage distribution, the fragment length distribution on the long and short arm of each chromosome, the sequence at the fragment breakpoint (8-mer Breakpoint Motif), the 5' end sequence of the fragment (8-mer End Motif), and fragment copy number changes in 1 Mb windows.
  • the present invention first requires steps such as extraction of cfDNA from blood samples, library construction, and sequencing.
  • The extraction and library construction methods here are not particularly limited and can be adapted from methods in the prior art.
  • existing sequencing technology can be used to obtain the base information of cfDNA.
  • the data sets used in the model building process in this invention are as follows:
  • Use purple-top blood collection tubes (EDTA anticoagulant tubes) to collect 8 ml of whole blood from the patient, and separate the plasma by centrifugation promptly (within 2 hours). After transport to the laboratory, cfDNA is extracted from the plasma samples with the QIAGEN plasma DNA extraction kit according to the instructions. After library construction, the cfDNA samples undergo WGS at approximately 5-fold depth. The resulting raw sequencing data are aligned to the human reference genome to obtain the base information of the corresponding reads.
  • DNA fragment size coverage: Fragmentation Size Coverage (FSC)
  • The DNA fragment size coverage reflects the length distribution of cfDNA reads. It is used for machine learning to build a prediction model that distinguishes patients with lung cancer, bowel cancer and liver cancer. Comparing cfDNA read lengths in 486 patients with lung cancer, intestinal cancer or liver cancer showed that the numbers of fragments of 40-80 bp, 81-300 bp and 40-300 bp are distributed differently along the chromosomes, which can serve as a distinguishing feature.
  • The cfDNA read length data are obtained from the aligned BAM file, which records the quality, length and alignment position of each read.
  • The human reference genome is the hg19 sequence provided by the University of California, Santa Cruz (UCSC).
  • The human reference genome was divided into 572 windows of 5 Mb each, and the number of all reads (40-300 bp), short reads (40-80 bp) and long reads (81-300 bp) in each window was counted.
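The window counting described above can be sketched as below. The length classes and the 5 Mb window come from the text; the function name, the tuple layout of a fragment, and assigning a fragment to a window by its 0-based start coordinate are assumptions made for the illustration.

```python
from collections import defaultdict

WINDOW = 5_000_000  # 5 Mb windows, as in the text

def fsc_counts(fragments, window=WINDOW):
    """Count all (40-300 bp), short (40-80 bp) and long (81-300 bp) fragments per window.

    fragments: iterable of (chrom, start, length) with a 0-based start.
    Returns {(chrom, window_index): {"total": int, "short": int, "long": int}}.
    """
    counts = defaultdict(lambda: {"total": 0, "short": 0, "long": 0})
    for chrom, start, length in fragments:
        if not 40 <= length <= 300:
            continue  # outside the length range used for this feature
        cell = counts[(chrom, start // window)]
        cell["total"] += 1
        cell["short" if length <= 80 else "long"] += 1
    return dict(counts)
```

Each window then contributes three values (total, short, long) to the FSC feature vector.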
  • the human reference genome is a DNA double helix structure, linked by hydrogen bonding based on complementary base pairing; during normal aging and cancer progression, the pH of the environment around cells changes, thereby destroying base complementary hydrogen bonds and causing rupture; due to rupture
  • Copy number changes are highly correlated with individual cancers. Although cancers can be distinguished by detecting copy number changes in certain cancer-related genes or specific genomic regions, other rare or unknown genes or regions can also provide potential copy number information. Collection method: for the WGS data of each sample to be tested, divide chromosomes 1-22 of the reference genome into non-overlapping 1 Mb windows, use bedtools coverage to calculate the read depth in each window, and correct it according to each window's GC content and average mappability (UCSC BigWig files), yielding read depth information for 2475 windows.
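The text does not say how the GC correction is performed. One common approach, shown here purely as an assumed stand-in, rescales each window's depth by the median depth of windows with similar GC content; the function name, the bin width and the median-ratio rule are all hypothetical, and the mappability correction is not covered.

```python
from collections import defaultdict
from statistics import median

def gc_correct(depths, gc_fractions, bin_width=0.05):
    """Scale each window's depth by (global median / median of its GC bin).

    depths: read depth per window; gc_fractions: GC fraction (0-1) per window.
    """
    n_bins = round(1 / bin_width)

    def gc_bin(gc):
        return min(int(gc / bin_width), n_bins - 1)

    bins = defaultdict(list)
    for depth, gc in zip(depths, gc_fractions):
        bins[gc_bin(gc)].append(depth)

    global_median = median(depths)
    corrected = []
    for depth, gc in zip(depths, gc_fractions):
        bin_median = median(bins[gc_bin(gc)])
        corrected.append(depth * global_median / bin_median if bin_median else 0.0)
    return corrected
```

After such a correction, windows that differ only in GC-driven coverage bias end up with comparable depths, so the remaining variation is more likely to reflect copy number.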
  • HMM: Hidden Markov Model
  • The feature data in this invention are modeled mainly with five single-feature machine learning algorithms:
  • the generalized linear model is an extension of the linear model that establishes the relationship between the mathematical expectation value of the response variable and the linear combination of predictor variables through a link function.
  • the main feature is that it does not forcibly change the natural measurement of the data, and it is a commonly used two-class classification strategy.
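As a concrete instance of the link-function idea, the sketch below evaluates a binomial GLM with a logit link, which is the usual choice for two-class problems. The text does not name the family or link, so both are assumptions here, as are the function name and argument layout.

```python
import math

def glm_predict(intercept, coefficients, features):
    """Binomial GLM with a logit link (assumed):
    E[y] = 1 / (1 + exp(-(intercept + sum(b_i * x_i))))."""
    eta = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-eta))
```

The linear predictor `eta` is the "linear combination of predictor variables"; the inverse link maps it to a probability between 0 and 1.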
  • Random forest is a powerful classification and regression tool. Given a data set, it repeatedly draws random subsets of the samples and features to grow a set of decision trees, splitting each node on a selected attribute until no further split is possible; the predictions of all trees are then combined to obtain the final result.
  • Deep learning is based on multi-layer feedforward artificial neural networks trained with stochastic gradient descent using backpropagation.
  • The network can contain a large number of hidden layers consisting of neurons with tanh, rectifier and maxout activation functions. Advanced features such as adaptive learning rate, rate annealing, momentum training, dropout, L1 or L2 regularization, checkpointing and grid search enable high prediction accuracy.
  • During training, each compute node trains a copy of the global model parameters on its local data with multi-threading (asynchronously) and periodically contributes to the global model through model averaging across the network.
  • Feedforward artificial neural network (ANN) models, also known as deep neural networks (DNN) or multilayer perceptrons (MLP), are the most common type of deep neural network.
  • DNN: deep neural networks
  • MLP: multilayer perceptrons
  • The main principle is to design perceptrons with multiple inputs and outputs, establish an appropriate number of neuron computing nodes and computing layers, select appropriate input and output layers, and, through network learning and tuning, establish a functional relationship from input to output that approximates the real relationship as closely as possible.
  • Extreme Gradient Boosting (XGBoost) is an efficient open source implementation of the gradient boosting algorithm.
  • XGBoost introduces parallelization, so it is faster;
  • XGBoost applies a second-order approximation to the objective function, obtains an analytical solution, and uses that solution to build decision trees that optimize the objective;
  • the regularization term controls the complexity of the model and prevents overfitting;
  • XGBoost introduces feature subsampling, similar to random forest, which reduces overfitting and computation.
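The second-order approximation and its analytical solution mentioned above can be written out as follows. The notation ($g_i$, $h_i$, $\gamma$, $\lambda$, $T$, $w_j$) follows the standard published formulation of XGBoost and does not appear in the patent text, so it should be read as an assumed illustration; terms that are constant at step $t$ are dropped.

```latex
% Second-order (Taylor) approximation of the objective when adding tree f_t
\mathcal{L}^{(t)} \approx \sum_{i=1}^{n}\left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \right]
  + \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2,
\qquad
g_i = \frac{\partial\, l\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial \hat{y}_i^{(t-1)}},\quad
h_i = \frac{\partial^2 l\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial \big(\hat{y}_i^{(t-1)}\big)^2}

% Analytical solution for leaf j, with G_j = \sum_{i \in I_j} g_i and H_j = \sum_{i \in I_j} h_i
w_j^{*} = -\frac{G_j}{H_j + \lambda},
\qquad
\mathcal{L}^{(t)*} = -\frac{1}{2}\sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T
```

Here $T$ is the number of leaves and $w_j$ the weight of leaf $j$; the terms $\gamma T$ and $\tfrac{1}{2}\lambda \sum_j w_j^2$ are the regularization that penalizes model complexity.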
  • Analysis of variance is performed on the feature values between groups to select the features that differ most between groups. This step uses the F value (Fvalue) from the aov() function of the R package stats and the corrected p-value from the p.adjust() function.
  • The training set was divided into a cancer group and a non-cancer group, analysis of variance was performed on each of the five features, the results were sorted in descending order of Fvalue, and the top 200 features were retained as prediction input values.
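The ranking step can be illustrated with a plain one-way ANOVA F statistic. The source performs this in R with aov(); the Python functions below are only an assumed equivalent for illustration, their names are hypothetical, and the p-value correction done by p.adjust() is not reproduced.

```python
from statistics import mean

def f_value(groups):
    """One-way ANOVA F statistic for a list of groups of observations."""
    all_values = [x for g in groups for x in g]
    grand = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def top_features(cancer, non_cancer, names, keep=200):
    """Rank features by F value (descending) and keep the top `keep` names.

    cancer / non_cancer: lists of samples, each sample a list of feature values
    in the same order as `names`.
    """
    scores = []
    for j, name in enumerate(names):
        groups = [[s[j] for s in cancer], [s[j] for s in non_cancer]]
        scores.append((f_value(groups), name))
    scores.sort(key=lambda t: t[0], reverse=True)
    return [name for _, name in scores[:keep]]
```

Calling `top_features` once per feature type (FSC, FSD, and so on) with `keep=200` mirrors the selection described above.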
  • The top 200 significantly different feature values of FSC between the cancer group and the non-cancer group are as follows; long/short/total denote long reads, short reads and all reads respectively, and the number denotes the window position:
  • The top 200 significantly different feature values of FSD between the cancer group and the non-cancer group are as follows; chrx and xp/q denote the short arm (p) or long arm (q) of chromosome x, and the number denotes the index of the length gradient interval:
  • The top 200 significantly different feature values of EDM between the cancer group and the non-cancer group are as follows; each 8-base string of A, T, C and G is the base sequence of a feature value:
  • The top 200 significantly different feature values of BKM between the cancer group and the non-cancer group are as follows; each 8-base string of A, T, C and G is the base sequence of a feature value:
  • chrx denotes chromosome x, and the number denotes the position range on the chromosome;
  • The top 100 characteristic values of liver cancer FSC that are significantly different from other cancer types are as follows:
  • The top 100 characteristic values of liver cancer FSD that are significantly different from other cancer types are as follows:
  • The top 100 characteristic values of liver cancer EDM that are significantly different from other cancer types are as follows:
  • The top 100 characteristic values of liver cancer BPM that are significantly different from other cancer types are as follows:
  • The top 100 characteristic values of liver cancer CNVs that are significantly different from other cancer types are as follows:
  • the top 100 characteristic values of intestinal cancer FSC that are significantly different from other cancer types are as follows:
  • the top 100 characteristic values of intestinal cancer FSD that are significantly different from other cancer types are as follows:
  • the top 100 characteristic values of intestinal cancer EDM that are significantly different from other cancer types are as follows:
  • the top 100 characteristic values of intestinal cancer BPM that are significantly different from other cancer types are as follows:
  • the top 100 characteristic values of intestinal cancer CNVs that are significantly different from other cancer types are as follows:
  • the top 100 characteristic values of lung cancer FSC that are significantly different from other cancer types are as follows:
  • the top 100 characteristic values of lung cancer FSD that are significantly different from other cancer types are as follows:
  • the top 100 characteristic values of lung cancer EDM that are significantly different from other cancer types are as follows:
  • the top 100 characteristic values of lung cancer BPM that are significantly different from other cancer types are as follows:
  • the top 100 characteristic values of lung cancer CNVs that are significantly different from other cancer types are as follows:
  • Stacking is an ensemble learning technique that performs meta-learning (2nd-level meta-learner) on multiple underlying weak classifiers (1st-level base models), collecting the characteristics of each underlying classifier to find the best way of combining them and thereby improve the prediction performance of the model.
  • the training algorithm used for Stacking in this patent is the Generalized Linear Model (GLM). Through a link function it relates the mathematical expectation of the response variable to a linear combination of the predictor variables, and it converts the trained base models into the final linear equation:
  • Intercept and A1-E5 are all linear equation parameters.
  • FSC_GLM and similar terms refer to the output value produced by the corresponding model for the input data.
  • the characters before the symbol "_" represent the type of feature set, and the characters after the symbol "_" represent the algorithm type.
  • the output value of the multi-cancer early screening model is the cancer probability.
  • the multi-cancer tissue origin model is mainly used to further determine the specific cancer type for samples already confirmed to have one of the above three cancers. Therefore, when classifying samples, three groups of training samples are established:
  • in the first group of training samples, the positive samples are intestinal cancer and the controls are lung cancer and liver cancer; the judgment has two categories: intestinal cancer and the other two cancers.
  • in the second group of training samples, the positive samples are lung cancer and the controls are intestinal cancer and liver cancer; the judgment has two categories: lung cancer and the other two cancers.
  • in the third group of training samples, the positive samples are liver cancer and the controls are lung cancer and intestinal cancer; the judgment has two categories: liver cancer and the other two cancers.
  • analysis of variance is performed on each group of samples separately, which yields, for each group, the feature values with significant differences in each feature set. Once all three groups have been analyzed, the significantly different feature values of the groups overlap, so the feature values selected in each group are merged and deduplicated to obtain the feature values required by the final model.
  • Intercept and A2-E5 are all linear equation parameters.
  • FSC_GBM and similar terms refer to the output value produced by the corresponding model for the input data.
  • the characters before the symbol "_" represent the type of feature set, and the characters after the symbol "_" represent the algorithm type.
  • the output value of the multi-cancer early screening model is the cancer probability.
  • the output value of the multi-cancer tissue origin model is the cancer type probability (the integrated multi-cancer tissue origin model predicts the probability of liver cancer, intestinal cancer and lung cancer for each sample to be predicted, and the cancer type with the maximum of the three predicted values is taken as the final judgment).
  • the integrated model for early detection of multiple cancer types can effectively distinguish cancer patients from healthy people.
  • the sensitivity and specificity in the training set both reached 94%.
  • the integrated model was verified on the test set, where the sensitivity and specificity reached 95%, with no difference between the results of the two sets. The specific results are shown in the table below:
  • the integrated multi-cancer tissue origin model can effectively distinguish the tissue origins of lung cancer, liver cancer and intestinal cancer.
  • the overall accuracy in the training set reached 95.1%, and the overall accuracy for the test set samples that were successfully predicted as cancer was 93.1%. The specific results are shown in the table below:
  • the proportion of fragment 5' end sequences (EDM) is not included, and only the other four feature sets are used.
  • the model establishment process is the same as above.
  • the cancer type origin model is established, and the final calculation results for the test set samples are obtained as follows:
  • GLM is a binary classification algorithm whose advantages are not apparent in multi-class tasks, and it does not show good classification performance in cancer type classification. Therefore, the GLM base model is not used in the classification model of this part; it is used only when classifying cancer versus healthy samples.
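
The sensitivity and specificity figures quoted above follow the usual confusion-matrix definitions. The following is a minimal sketch of that arithmetic only; the function name and the 0/1 label encoding (1 = cancer, 0 = non-cancer) are assumptions made for illustration and are not taken from the patent.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels (1 = cancer, 0 = non-cancer)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # cancer called cancer
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # cancer missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # healthy called healthy
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # healthy called cancer
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity
```

Predicted labels would be obtained by thresholding the model's cancer probability; the threshold itself is not stated in this part of the text.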

Abstract

The present invention relates to a method for early detection and cancer type prediction of multiple cancers (lung cancer, intestinal cancer and liver cancer), a detection device, and a computer-readable medium. The present invention provides: performing low-pass whole-genome sequencing (WGS) on cfDNA from plasma samples; using the high-throughput sequencing results to analyze five discriminative features of the cfDNA fragments of each cancer, namely the genome-wide fragment length coverage distribution, the fragment length distribution on the long and short arms of the chromosomes, the fragment breakpoint sequence, the fragment 5' end sequence, and the fragment copy number variation in 1 Mb windows; then using a generalized linear model, a gradient boosting machine, a random forest, a deep learning algorithm and an extreme gradient boosting algorithm to train models separately; and finally using the generalized linear model to perform second-level ensemble training to construct a multi-feature, multi-algorithm integrated model. This achieves low-depth, highly specific and highly sensitive non-invasive, precise early detection and tissue-of-origin detection for multiple cancers.

Description

Multi-cancer early screening model construction method and detection device
Technical field
The present invention relates to the detection of the tissue of origin of multiple cancer types, including lung cancer (Lung Adenocarcinoma, LUAD), colorectal cancer (Colorectal Carcinoma, CRC) and liver cancer (Primary Liver Cancer, PLC), and belongs to the technical field of molecular biomedicine.
Background
Lung cancer, colorectal cancer and liver cancer are the three malignant tumors with the highest mortality worldwide.
Lung cancer, liver cancer and colorectal cancer have low early diagnosis rates because they show no obvious early symptoms or are difficult to detect. However, most early cancer screening products currently on the market predict a single cancer type. If a patient has to undergo several different single-cancer screening tests, the time, effort and cost involved may hinder the adoption of early screening for each cancer type in the general population. Early screening for multiple cancer types not only covers the early screening of each cancer type but also accurately detects the tissue of origin, which helps avoid situations in which a cancer of unknown primary arising during cancer development complicates the condition and delays diagnosis and treatment. There is therefore an urgent need in China for an early screening product that covers the above three highest-mortality malignant tumors at the same time, so that screening can be applied to a wider population more efficiently, economically and practically.
Summary of the invention
The present invention provides a method in which low-depth WGS is performed on cfDNA from plasma samples and the high-throughput sequencing results are used to analyze five differential features of the cfDNA fragments of each cancer type: the genome-wide fragment length coverage distribution, the fragment length distribution on the long and short arms of each chromosome, the sequence at the fragment breakpoint (8-mer Breakpoint Motif), the fragment 5' end sequence (8-mer End Motif), and the fragment copy number variation in 1 Mb windows. Five algorithms, namely the generalized linear model (GLM), gradient boosting machine (GBM), random forest (Random Forest), deep learning (Deep Learning) and extreme gradient boosting (XGBoost), are used to train models separately, and a multi-feature, multi-algorithm integrated model is finally built with the generalized linear model (GLM). This achieves low-depth, highly specific and highly sensitive non-invasive, precise tissue of origin (TOO) detection for multiple cancer types.
First object of the present invention:
A method for constructing a multi-cancer early screening model, the model being used to classify whether a sample has intestinal cancer, lung cancer or liver cancer, comprising the following steps:
Step 1: extract and sequence cfDNA from the samples of the positive group and the control group to obtain read data;
Step 2: align the read data to the reference genome, divide the reference genome into multiple windows, and obtain the number of all reads, the number of short reads and the number of ultra-long reads within each window as the first feature set;
Step 3: align the read data to the reference genome, take the long arm and short arm of each chromosome as regions, and obtain the number of reads in each length gradient interval within each region as the second feature set;
Step 4: take the m bases at the 5' end of each read as a set of base fragments, and use the proportion of each kind of base fragment among all fragments as the third feature set;
Step 5: align the read data to the reference genome to obtain the position of the 5' end of each read on the reference genome; obtain the sequence of the n bp upstream and n bp downstream of that position as a set of base fragments; and use the proportion of each kind of base fragment among all fragments as the fourth feature set;
Step 6: divide the reference genome into multiple windows and obtain the copy number data within each window as the fifth feature set;
Step 7: use the first, second, third, fourth and fifth feature sets together as the initial feature values, input them into the classification model as the model feature vector, use whether the sample has cancer as the output value, and train the model to obtain the early screening model.
In step 7, having cancer means having any one of intestinal cancer, lung cancer or liver cancer.
In step 7, the initial feature values are further simplified before being used as the model feature vector; the simplification consists of selecting, from each of the first, second, third, fourth and fifth feature sets, the feature values that differ significantly between the samples of the positive group and the control group.
The selection is performed by analysis of variance (ANOVA).
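
The ANOVA-based selection can be sketched as follows: compute a one-way ANOVA F value per feature between the two groups, sort in descending order, and keep the top k. This is a minimal pure-Python illustration; the function names, the row-per-sample input layout and the default `k=200` (taken from the top-200 selection described for the early screening model) are assumptions, and the patent does not state which software was used.

```python
from statistics import mean

def f_value(group_a, group_b):
    """One-way ANOVA F statistic for one feature measured in two groups."""
    groups = [list(group_a), list(group_b)]
    means = [mean(g) for g in groups]
    grand_mean = mean([v for g in groups for v in g])
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = sum(len(g) for g in groups) - len(groups)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / df_between) / (ss_within / df_within)

def top_features(positive_rows, control_rows, names, k=200):
    """Rank features by F value (descending) and return the top k feature names."""
    scored = []
    for j, name in enumerate(names):
        f = f_value([row[j] for row in positive_rows],
                    [row[j] for row in control_rows])
        scored.append((f, name))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [name for _, name in scored[:k]]
```

In this sketch the ranking is run once per feature set, so each of the five sets contributes its own top-k list.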
The short reads are 40-80 bp in length and the ultra-long reads are 200-300 bp in length; all reads refers to reads 40-300 bp in length.
The window size in step 2 is in the range of 2-7 Mb.
The length gradient intervals in step 3 are the length ranges obtained by incrementing in steps of 8-12 bp within the range of 40-300 bp.
The read counts are standardized.
In step 4, m is any integer from 6 to 10.
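
The end-motif feature of step 4 can be illustrated with a short sketch that takes the first m bases of each read and converts the motif counts into proportions. The function name is hypothetical, the default `m=8` follows the 8-mer End Motif named in the description, and the sketch assumes that read sequences are already oriented 5' to 3' and that motifs containing bases other than A, C, G and T are skipped.

```python
from collections import Counter

def end_motif_proportions(reads, m=8):
    """Proportion of each m-base 5' end motif among all usable reads."""
    motifs = Counter()
    for seq in reads:
        motif = seq[:m].upper()
        # Skip reads shorter than m or containing ambiguous bases (assumption).
        if len(motif) == m and set(motif) <= set("ACGT"):
            motifs[motif] += 1
    total = sum(motifs.values())
    if total == 0:
        return {}
    return {motif: count / total for motif, count in motifs.items()}
```

With m = 8 there are 4^8 = 65,536 possible motifs, so most entries are small proportions and the subsequent ANOVA selection does the heavy lifting.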
In step 5, n is any integer from 2 to 5.
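
The breakpoint-motif feature of step 5 differs from the end motif in that the sequence is read from the reference genome around the aligned 5' end, not from the read itself. A minimal sketch follows, assuming a 0-based breakpoint coordinate, forward-strand reads only and an in-memory dictionary of reference sequences; all names are illustrative. With `n=4` the motif is 8 bases long, matching the 8-mer Breakpoint Motif named in the description.

```python
from collections import Counter

def breakpoint_motif(reference, chrom, pos, n=4):
    """Reference bases from n bp upstream to n bp downstream of a 5' end position.

    `pos` is a 0-based index of the first base of the fragment (assumption).
    Returns None if the window runs off the end of the chromosome.
    """
    seq = reference[chrom]
    if pos - n < 0 or pos + n > len(seq):
        return None
    return seq[pos - n:pos + n].upper()

def breakpoint_motif_proportions(reference, ends, n=4):
    """Proportion of each breakpoint motif over all (chrom, pos) fragment ends."""
    motifs = Counter()
    for chrom, pos in ends:
        motif = breakpoint_motif(reference, chrom, pos, n)
        if motif is not None:
            motifs[motif] += 1
    total = sum(motifs.values())
    return {m: c / total for m, c in motifs.items()} if total else {}
```

Reverse-strand reads would need the reverse complement of the window; that handling is left out here to keep the sketch short.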
The windows in step 6 are obtained by dividing chromosomes 1-22 of the reference genome into non-overlapping windows 0.8-1.2 Mb in length.
In step 7, inputting into the classification model means inputting each of the first, second, third, fourth and fifth feature sets into a generalized linear model, a gradient boosting model, a random forest model, a deep learning model and an extreme gradient boosting model to obtain multiple sub-models, and combining the sub-models into a linear relationship model.
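
Combining the sub-models into a linear relationship model can be sketched as a second-level GLM over the sub-model outputs. The sketch below assumes a logistic (logit) link, which is a common choice for a binary outcome but is not stated in the text; the coefficient values are supplied by the caller, and the feature-set labels `BPM` and `CNV` are placeholders. Sub-model names follow the `FSC_GLM` pattern used elsewhere in the document (feature set before the underscore, algorithm after it).

```python
import math

FEATURE_SETS = ["FSC", "FSD", "EDM", "BPM", "CNV"]   # labels are illustrative
ALGORITHMS = ["GLM", "GBM", "RF", "DL", "XGBoost"]

def base_model_names(feature_sets=FEATURE_SETS, algorithms=ALGORITHMS):
    """All feature-set/algorithm combinations, e.g. 'FSC_GLM'."""
    return [f"{f}_{a}" for f in feature_sets for a in algorithms]

def stacked_probability(base_outputs, coefficients, intercept):
    """Second-level GLM: logistic link over a linear combination of sub-model outputs."""
    linear = intercept + sum(
        coefficients[name] * base_outputs[name] for name in coefficients
    )
    # Numerically stable logistic function.
    if linear >= 0:
        return 1.0 / (1.0 + math.exp(-linear))
    e = math.exp(linear)
    return e / (1.0 + e)
```

Five feature sets and five algorithms give 25 sub-models, which matches the parameter range A1-E5 plus an intercept.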
Second object of the present invention:
A multi-cancer detection device, the device being used to classify whether a sample has intestinal cancer, lung cancer or liver cancer, comprising:
a sequencing module, used to extract and sequence cfDNA from the samples of the positive group and the control group to obtain read data;
a first feature set acquisition module, used to align the read data to the reference genome, divide the reference genome into multiple windows, and obtain the number of all reads, the number of short reads and the number of ultra-long reads within each window as the first feature set;
a second feature set acquisition module, used to align the read data to the reference genome, take the long arm and short arm of each chromosome as regions, and obtain the number of reads in each length gradient interval within each region as the second feature set;
a third feature set acquisition module, used to take the m bases at the 5' end of each read as a set of base fragments, and use the proportion of each kind of base fragment among all fragments as the third feature set;
a fourth feature set acquisition module, used to align the read data to the reference genome to obtain the position of the 5' end of each read on the reference genome, obtain the sequence of the n bp upstream and n bp downstream of that position as a set of base fragments, and use the proportion of each kind of base fragment among all fragments as the fourth feature set;
a fifth feature set acquisition module, used to divide the reference genome into multiple windows and obtain the copy number data within each window as the fifth feature set;
a model building module, used to take the first, second, third, fourth and fifth feature sets together as the initial feature values, input them into the classification model as the model feature vector, use whether the sample has cancer as the output value, and train the model to obtain the early screening model.
Third object of the present invention:
A computer-readable medium storing a computer program capable of running the method for constructing the multi-cancer early screening model.
Fourth object of the present invention:
A method for constructing a multi-cancer early screening model, the model being used to distinguish intestinal cancer, lung cancer or liver cancer among cancer samples;
comprising the following steps:
Step 1: extract and sequence cfDNA from intestinal cancer, lung cancer and liver cancer samples to obtain read data;
Step 2: align the read data to the reference genome, divide the reference genome into multiple windows, and obtain the number of all reads, the number of short reads and the number of ultra-long reads within each window as the first feature set;
Step 3: align the read data to the reference genome, take the long arm and short arm of each chromosome as regions, and obtain the number of reads in each length gradient interval within each region as the second feature set;
Step 4: take the m bases at the 5' end of each read as a set of base fragments, and use the proportion of each kind of base fragment among all fragments as the third feature set;
Step 5: align the read data to the reference genome to obtain the position of the 5' end of each read on the reference genome; obtain the sequence of the n bp upstream and n bp downstream of that position as a set of base fragments; and use the proportion of each kind of base fragment among all fragments as the fourth feature set;
Step 6: divide the reference genome into multiple windows and obtain the copy number data within each window as the fifth feature set;
Step 7: establish three control experiment groups, in which the positive samples of each group are intestinal cancer, lung cancer or liver cancer samples respectively and the control samples of each group are the samples of the remaining two cancers; in each of the three groups, use the first, second, third, fourth and fifth feature sets together as the initial feature values and select the feature values that differ significantly between the positive samples and the control samples; then merge the significantly different feature values of the three groups, input them into the classification model as the model feature vector, use the probability of having intestinal cancer, lung cancer or liver cancer as the output value, and train the model to obtain the early screening model.
In step 7, inputting into the classification model means inputting each of the first, second, third, fourth and fifth feature sets into a gradient boosting model, a random forest model, a deep learning model and an extreme gradient boosting model to obtain multiple sub-models, and combining the sub-models into a linear relationship model.
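
The three one-versus-rest groups of step 7, the merging of the selected features and the final choice of the most probable cancer type can be sketched as below. The helper names, the cancer type labels and the input layouts are all assumptions for illustration; the feature selection inside each group is left to whatever ANOVA routine is in use.

```python
CANCER_TYPES = ["intestinal", "lung", "liver"]

def one_vs_rest_groups(samples):
    """Build the three control experiment groups.

    `samples` is a list of (sample_id, cancer_type) pairs. For each cancer type
    the positives are samples of that type and the controls are the other two.
    """
    groups = {}
    for positive in CANCER_TYPES:
        positives = [sid for sid, ctype in samples if ctype == positive]
        controls = [sid for sid, ctype in samples if ctype != positive]
        groups[positive] = (positives, controls)
    return groups

def merge_selected(selected_per_group):
    """Union of the features selected in each group, duplicates removed, order kept."""
    merged, seen = [], set()
    for positive in CANCER_TYPES:
        for feature in selected_per_group.get(positive, []):
            if feature not in seen:
                seen.add(feature)
                merged.append(feature)
    return merged

def predict_origin(probabilities):
    """Return the cancer type with the highest predicted probability."""
    return max(probabilities, key=probabilities.get)
```

Because the same feature can be significant in more than one group, the merged list is usually shorter than the sum of the three per-group lists.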
The selection is performed by analysis of variance (ANOVA).
The short reads are 40-80 bp in length and the ultra-long reads are 200-300 bp in length; all reads refers to reads 40-300 bp in length.
The window size in step 2 is in the range of 2-7 Mb.
The length gradient intervals in step 3 are the length ranges obtained by incrementing in steps of 8-12 bp within the range of 40-300 bp.
The read counts are standardized.
In step 4, m is any integer from 6 to 10.
In step 5, n is any integer from 2 to 5.
The windows in step 6 are obtained by dividing chromosomes 1-22 of the reference genome into non-overlapping windows 0.8-1.2 Mb in length.
Fifth object of the present invention:
A multi-cancer detection device, the device being used to distinguish intestinal cancer, lung cancer or liver cancer among cancer samples, comprising:
a sequencing module, used to extract and sequence cfDNA from intestinal cancer, lung cancer and liver cancer samples to obtain read data;
a first feature set acquisition module, used to align the read data to the reference genome, divide the reference genome into multiple windows, and obtain the number of all reads, the number of short reads and the number of ultra-long reads within each window as the first feature set;
a second feature set acquisition module, used to align the read data to the reference genome, take the long arm and short arm of each chromosome as regions, and obtain the number of reads in each length gradient interval within each region as the second feature set;
a third feature set acquisition module, used to take the m bases at the 5' end of each read as a set of base fragments, and use the proportion of each kind of base fragment among all fragments as the third feature set;
a fourth feature set acquisition module, used to align the read data to the reference genome to obtain the position of the 5' end of each read on the reference genome, obtain the sequence of the n bp upstream and n bp downstream of that position as a set of base fragments, and use the proportion of each kind of base fragment among all fragments as the fourth feature set;
a fifth feature set acquisition module, used to divide the reference genome into multiple windows and obtain the copy number data within each window as the fifth feature set;
a model building module, used to establish three control experiment groups, in which the positive samples of each group are intestinal cancer, lung cancer or liver cancer samples respectively and the control samples of each group are the samples of the remaining two cancers; to use, in each of the three groups, the first, second, third, fourth and fifth feature sets together as the initial feature values and select the feature values that differ significantly between the positive samples and the control samples; and then to merge the significantly different feature values of the three groups, input them into the classification model as the model feature vector, use the probability of having intestinal cancer, lung cancer or liver cancer as the output value, and train the model to obtain the early screening model.
Sixth object of the present invention:
A computer-readable medium storing a computer program capable of running the method for constructing the multi-cancer early screening model.
Beneficial effects
Low-depth WGS (~5X) cfDNA reads from 191 liver cancer patients, 149 colorectal cancer patients and 146 lung cancer patients were analyzed statistically for genome-wide length distribution, length distribution within the long and short arms of each chromosome, proportion of fragment end sequences, proportion of breakpoint sequences, and regional copy number variation. Models were built with five different training algorithms, and all models then underwent second-level ensemble training to improve the prediction performance for early cancer detection and cancer type prediction. The present invention provides, for the first time, a second-level integrated diagnostic model with multiple molecular features and multiple training algorithms based on high-throughput, low-depth sequencing of plasma cfDNA. The model can diagnose multiple early-stage cancers and their tissue of origin, and it offers non-invasive detection, low throughput, and high detection specificity and sensitivity.
Description of the drawings
Figure 1 is a schematic diagram of the model construction process;
Figure 2 is a schematic diagram of the construction process of the multi-cancer early detection model;
Figure 3 is a schematic diagram of the construction process of the multi-cancer tissue origin model;
Figure 4 shows the distribution, between the cancer group and the non-cancer group, of the most discriminative feature in each of the five feature types;
Figure 5 shows the AUC performance of the multi-cancer early detection model on the training set;
Figure 6 shows the AUC performance of the multi-cancer early detection model on the test set;
Figure 7 shows the distribution, between liver cancer and the other cancer types, of the most discriminative feature unique to liver cancer in each of the five feature types;
Figure 8 shows the distribution, between intestinal cancer and the other cancer types, of the most discriminative feature unique to intestinal cancer in each of the five feature types;
Figure 9 shows the distribution, between lung cancer and the other cancer types, of the most discriminative feature unique to lung cancer in each of the five feature types;
Detailed description
The present invention relates to markers for early detection and cancer type prediction of multiple cancer types (lung cancer, intestinal cancer and liver cancer), a detection method, a detection device and a computer-readable medium. The present invention provides a method in which low-depth WGS is performed on cfDNA from plasma samples and the high-throughput sequencing results are used to analyze five differential features of the cfDNA fragments of each cancer type: the genome-wide fragment length coverage distribution, the fragment length distribution on the long and short arms of each chromosome, the sequence at the fragment breakpoint (8-mer Breakpoint Motif), the fragment 5' end sequence (8-mer End Motif), and the fragment copy number variation in 1 Mb windows. Five algorithms, namely the generalized linear model (Generalized Linear Model, GLM), gradient boosting machine (Gradient Boosting Machine, GBM), random forest (Random Forest, RF), deep learning (Deep Learning, DL) and extreme gradient boosting (XGBoost), are used to train models separately, and the generalized linear model (GLM) is then used for second-level ensemble training to build a multi-feature, multi-algorithm integrated model. This achieves low-depth, highly specific and highly sensitive non-invasive, precise early detection and tissue-of-origin detection for multiple cancer types.
The calculation method of the present invention is described in detail as follows:
The present invention first requires steps such as extraction of cfDNA from blood samples, library construction and sequencing. The extraction and library construction methods are not particularly limited and may be adapted from methods in the prior art. Sequencing may use existing sequencing technology to obtain the base information of the cfDNA.
The data sets used in the model construction process of the present invention are as follows:
Extraction and sequencing of plasma cfDNA samples:
8 ml whole blood samples were collected from patients in purple-top blood collection tubes (EDTA anticoagulant tubes), and plasma was separated by centrifugation promptly (within 2 hours). After transport to the laboratory, cfDNA was extracted from the plasma samples with a QIAGEN plasma DNA extraction kit according to the instructions. Libraries were constructed from the collected cfDNA samples and sequenced by WGS at about 5× depth. After the raw sequencing data were obtained, they were aligned to the human reference genome to obtain the base information of the corresponding reads.
数据处理data processing
本发明中的标志数据,主要是利用五种分子特征:The marker data in the present invention mainly utilizes five molecular characteristics:
1.DNA片段大小占比(Fragmentation Size Coverage,FSC)1. DNA fragment size ratio (Fragmentation Size Coverage, FSC)
对于DNA片段大小占比,其反映的是cfDNA读段的长度大小的占比特征。利用DNA片段大小覆盖深度(fragmentation size ratio)进行机器学习建立预测模型,从而区分肺癌、肠癌和肝癌患者。通过比较486例肺癌、肠癌或肝癌患者的cfDNA读段的长度,发现40-80bp,81-300bp和40-300bp间的片段数量在染色体上的分布存在差异,可以作为区分特征。As for the DNA fragment size ratio, it reflects the length and size ratio characteristics of cfDNA reads. Use DNA fragment size coverage depth (fragmentation size ratio) for machine learning to establish a prediction model to distinguish patients with lung cancer, bowel cancer and liver cancer. By comparing the lengths of cfDNA reads in 486 patients with lung cancer, intestinal cancer or liver cancer, it was found that the number of fragments between 40-80bp, 81-300bp and 40-300bp was distributed differently on the chromosome, which can be used as a distinguishing feature.
The cfDNA read length data are obtained as follows. The aligned BAM file records the quality, length, and alignment position of each read; the human reference genome is the hg19 sequence provided by the University of California, Santa Cruz (UCSC). The human reference genome is divided into 572 windows of 5 Mb each, and for each window the number of all reads (40-300 bp), the number of short reads (40-80 bp), and the number of long reads (81-300 bp) are counted. Based on the counts across all windows, each type of read count is standardized separately: standardized value = (raw value - mean) / standard deviation. This yields a data set of 572 groups of read counts for the different length classes.
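The window counting and standardization described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the representation of fragments as (start, length) pairs on a single concatenated coordinate axis, and the use of the population standard deviation are all assumptions made for the example.

```python
from statistics import mean, pstdev

WINDOW = 5_000_000  # 5 Mb window size, as stated in the text


def fsc_counts(fragments, n_windows):
    """Count total (40-300 bp), short (40-80 bp) and long (81-300 bp) fragments per window.

    `fragments` is an iterable of (start, length) pairs; `start` is a 0-based
    coordinate on one concatenated axis (a simplification for this sketch).
    """
    counts = {key: [0] * n_windows for key in ("total", "short", "long")}
    for start, length in fragments:
        if not 40 <= length <= 300:
            continue  # outside the size range used by FSC
        window = start // WINDOW
        if not 0 <= window < n_windows:
            continue
        counts["total"][window] += 1
        counts["short" if length <= 80 else "long"][window] += 1
    return counts


def standardize(values):
    """Standardized value = (raw value - mean) / standard deviation."""
    mu = mean(values)
    sd = pstdev(values)  # assumption: population standard deviation
    return [(v - mu) / sd if sd else 0.0 for v in values]
```

Applying `standardize` separately to the `total`, `short`, and `long` lists gives three standardized values per window, which matches the three read-count types described in the text.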
2. DNA fragment size distribution (Fragmentation Size Distribution, FSD)
Building on the DNA fragment size coverage, and in order to obtain higher-resolution read results, the 41 long-arm and short-arm regions of the chromosomes of the human reference genome are used as windows, as shown below:

Fragments of 40-300 bp are divided into 27 length gradients in 10 bp increments (for example, 40-49 bp, 50-59 bp, ... on the 1q arm of chr1). The number of fragments in each length gradient is counted within each chromosome-arm window and standardized, giving a high-resolution DNA fragment size distribution with a total of 1107 features (1107 = 41 arm windows × 27 standardized length gradients).
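The binning into length gradients per chromosome arm can be sketched as below. The arm names and function names are hypothetical, and the sketch assumes that a 300 bp fragment falls into its own final bin, which is one way to arrive at the 27 gradients stated in the text.

```python
def length_bin(length, lo=40, hi=300, step=10):
    """Return the 0-based length-gradient index (40-49 -> 0, 50-59 -> 1, ...).

    Returns None when the fragment length is outside the 40-300 bp range.
    """
    if not lo <= length <= hi:
        return None
    return (length - lo) // step


N_BINS = length_bin(300) + 1  # number of length gradients


def fsd_counts(fragments, arms):
    """Count fragments per (chromosome arm, length gradient).

    `fragments` is an iterable of (arm_name, length) pairs; `arms` lists the
    arm windows to keep (41 in the text).
    """
    table = {arm: [0] * N_BINS for arm in arms}
    for arm, length in fragments:
        index = length_bin(length)
        if arm in table and index is not None:
            table[arm][index] += 1
    return table
```

With 41 arm windows, the table has 41 × 27 = 1107 cells, one per feature, before standardization.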
3. Proportion of fragment 5' end sequences (8-mer End Motif, EDM)
Human genomic DNA has a double-helix structure held together by hydrogen bonds between complementary base pairs. During normal aging and cancer progression, the pH of the environment surrounding the cell changes, which disrupts the hydrogen bonds between complementary bases and causes the DNA to break. Because the terminal base sequences of the resulting fragments differ, the proportions of fragments carrying different end sequences also differ. Collection method: after alignment, the 8 bp sequence at the 5' end of each read is extracted, and the number of reads with each end sequence (4^8 = 65536 types in total) is counted, from which the proportion of reads for each of the 65536 end sequences is calculated. For example, proportion of AAAAAAAA = number of reads with the end sequence AAAAAAAA / total number of reads over all end sequences.
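The end-motif proportion can be sketched as follows. The function name is hypothetical, and the sketch assumes each read sequence is supplied in its 5' to 3' orientation; strand handling for reads stored reverse-complemented in a BAM file is left out.

```python
from collections import Counter


def end_motif_fractions(read_sequences, k=8):
    """Return {motif: fraction of reads} for the first k bases of each read.

    Reads shorter than k or containing bases other than A/C/G/T in the
    motif are skipped (an assumption made for this sketch).
    """
    counts = Counter()
    for seq in read_sequences:
        motif = seq[:k].upper()
        if len(motif) == k and set(motif) <= set("ACGT"):
            counts[motif] += 1
    total = sum(counts.values())
    return {motif: n / total for motif, n in counts.items()} if total else {}
```

Motifs that never occur are simply absent from the returned dictionary; a full 65536-element feature vector would fill those in with 0.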
4. Proportion of reads by fragment 5' end breakpoint sequence (8-mer Breakpoint Motif, BKM)
As with the end-sequence proportion, the base sequences at the break sites differ, so the proportions of reads with different breakpoint sequences also differ. Collection method: the aligned BAM file records the basic information and aligned position of each read. For each read, the coordinate of its 5' end on the human reference genome is determined, and the 4 bp of reference sequence on each side of that coordinate are extracted. The number of reads with each breakpoint sequence (4^8 = 65536 types in total) is counted, from which the proportion of reads for each of the 65536 breakpoint sequences is calculated. For example, proportion of AAAAAAAA = number of reads with the breakpoint sequence AAAAAAAA / total number of reads over all breakpoint sequences.
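A minimal sketch of the breakpoint-motif extraction is given below. The function names are hypothetical. The sketch assumes forward-strand reads, a reference supplied as a plain string, and a breakpoint located immediately before the read's 5' base, so that the motif is the 4 reference bases upstream plus the 4 bases starting at the 5' position; the source does not specify these conventions.

```python
from collections import Counter


def breakpoint_motif(reference, pos, flank=4):
    """Return the reference bases flanking the breakpoint at 0-based `pos`.

    The motif is reference[pos - flank : pos + flank]; None is returned when
    the window would run off either end of the reference.
    """
    start, end = pos - flank, pos + flank
    if start < 0 or end > len(reference):
        return None
    return reference[start:end].upper()


def breakpoint_motif_fractions(reference, five_prime_positions, flank=4):
    """Return {motif: fraction of reads} over all reads with a valid motif."""
    counts = Counter()
    for pos in five_prime_positions:
        motif = breakpoint_motif(reference, pos, flank)
        if motif is not None:
            counts[motif] += 1
    total = sum(counts.values())
    return {motif: n / total for motif, n in counts.items()} if total else {}
```

Unlike the end motif, which is read from the fragment itself, this motif is read from the reference and therefore includes bases that lie outside the fragment.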
5. 1 Mb window copy number variation (1Mb-Bin Copy Number Variation, CNV)
Copy number variation is highly correlated with individual cancers. Although cancers can already be distinguished by detecting copy number changes in some cancer-related genes or specific genomic regions, other rare or unknown genes or regions can still provide potential copy number information. Collection method: for the WGS data of each test sample, chromosomes 1-22 of the reference genome are divided into non-overlapping 1 Mb windows, and bedtools coverage is used to calculate the read depth in each window for each sample. The depth is corrected according to the GC content and average mappability (UCSC BigWig file) of each window, giving the per-sample read depth for 2475 windows. A Hidden Markov Model (HMM) and the population control baseline depth of each window are then used to compute the log copy-number ratio of each window, that is, log2(corrected and normalized depth of the test sample / corrected and normalized depth of the population baseline), which gives the copy number variation information of each test sample.
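The log2 ratio at the end of this procedure can be illustrated with the simplified sketch below. It deliberately omits the GC and mappability correction and the HMM step, and it normalizes each depth profile by its own mean as a stand-in for the normalization the text mentions; the function names and the small pseudocount are assumptions.

```python
import math


def normalize_depth(depths):
    """Scale per-window depths so that their mean is 1 (simplified normalization)."""
    average = sum(depths) / len(depths)
    return [d / average for d in depths]


def log2_ratios(sample_depths, baseline_depths, pseudocount=1e-9):
    """Per-window log2(normalized sample depth / normalized baseline depth).

    The pseudocount avoids log(0) and division by zero for empty windows.
    """
    sample = normalize_depth(sample_depths)
    baseline = normalize_depth(baseline_depths)
    return [
        math.log2((s + pseudocount) / (b + pseudocount))
        for s, b in zip(sample, baseline)
    ]
```

A window with a positive value has relatively more coverage in the test sample than in the baseline, and a negative value indicates relatively less.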
From the data acquisition described above, an initial data vector is obtained for each of the five feature types. The corresponding computation methods are then designed. The present invention mainly uses five machine learning algorithms, each applied to a single feature type:
1. Generalized Linear Model (GLM)
The generalized linear model is an extension of the linear model that relates the mathematical expectation of the response variable to a linear combination of predictor variables through a link function. Its main characteristic is that it does not force a change in the natural scale of the data, and it is a commonly used binary classification strategy.
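For reference, the relationship described above can be written out as follows. The source does not name the link function; the logit link shown in the second line is an assumption, chosen because it is the usual choice for binary classification.

```latex
g\bigl(\mathbb{E}[Y \mid x]\bigr) = \beta_0 + \sum_{j=1}^{p} \beta_j x_j
\\[4pt]
g(\mu) = \ln\frac{\mu}{1-\mu}
\quad\Longrightarrow\quad
P(Y = 1 \mid x) = \frac{1}{1 + \exp\!\bigl(-(\beta_0 + \sum_{j=1}^{p} \beta_j x_j)\bigr)}
```

Here $x_j$ are the feature values of one sample, $\beta_j$ are the fitted coefficients, and $g$ is the link function.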
2. Gradient Boosting Machine (GBM)
Gradient boosting is a common class of machine learning algorithms. Its basic principle is to train each newly added weak classifier on the negative gradient of the current model's loss function and then add the trained weak classifier to the existing model in an additive fashion, so as to obtain the optimal model. The resulting model trains well and is not prone to overfitting. To prevent GBM from overfitting or underfitting during training, the GBM parameters are set as follows: ntrees=300, max_depth=9, learning_rate=0.01, subsample=0.8, cross_validation=10.
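The additive update described above can be written in the standard textbook form below. The notation is generic rather than taken from the source; mapping $\nu$ to learning_rate=0.01 and $M$ to ntrees=300 is an interpretation of the listed parameters.

```latex
r_{i,m} = -\left[\frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)}\right]_{F = F_{m-1}},
\qquad
F_m(x) = F_{m-1}(x) + \nu\, h_m(x),
\qquad m = 1, \dots, M
```

Here $L$ is the loss function, $h_m$ is the weak learner fitted to the negative gradients $r_{i,m}$ at round $m$, and $\nu$ is the shrinkage factor applied to each new learner.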
3. Random Forest (RF)
Random forest is a powerful tool for classification and regression. Given a data set, a random forest randomly samples part of the information to generate a set of decision trees for classification or regression, selects attributes for splitting nodes, and repeats the random sampling until no further split is possible; the results of all splits are then combined to obtain the final prediction. To prevent RF from overfitting or underfitting during training, the RF parameters are set as follows: ntrees=300, max_depth=9, cross_validation=10.
4. Deep Learning (DL)
Deep learning here is based on a multi-layer feedforward artificial neural network trained with stochastic gradient descent using backpropagation. The network can contain a large number of hidden layers consisting of neurons with tanh, rectifier, and maxout activation functions. Advanced features such as adaptive learning rate, rate annealing, momentum training, dropout, L1 or L2 regularization, checkpointing, and grid search enable high predictive accuracy. During training, each compute node trains a copy of the global model parameters on its local data with multi-threading (asynchronously) and contributes periodically to the global model through model averaging across the network. The feedforward artificial neural network (ANN), also known as a deep neural network (DNN) or multilayer perceptron (MLP), is the most common type of deep neural network. Its main principle is to combine multiple perceptrons with multiple inputs and outputs into a suitable number of neuron nodes arranged in a multi-layer hierarchy, with appropriately chosen input and output layers; through learning and tuning, the network establishes a functional relationship from input to output that approximates the real relationship as closely as possible. To prevent DL from overfitting or underfitting during training, the DL parameters are set as follows: epochs=300, hidden={100,100,100}, input_dropout_ratios=0.05, rho=0.95, mini_batch_size=10, cross_validation=10.
5. Extreme Gradient Boosting (XGBoost)
Extreme gradient boosting is an efficient open-source implementation of the gradient boosting algorithm. Compared with traditional GBM, XGBoost introduces parallelization and is therefore faster. XGBoost applies a second-order approximation to the objective function, derives an analytical solution, and uses it to build the decision trees so as to optimize the objective function. XGBoost adds a regularization term that controls model complexity and prevents overfitting. XGBoost also introduces feature subsampling, similar to random forest, which both reduces overfitting and reduces computation. To prevent XGBoost from overfitting or underfitting during training, the XGBoost parameters are set as follows: ntrees=300, max_depth=9, cross_validation=10.
To improve machine learning efficiency and reduce interference from uninformative features, analysis of variance (ANOVA) is performed on the feature values between groups, and the features with large between-group differences are selected. This step uses the F value from the aov() function of the R package stats and the p value corrected with the p.adjust() function.
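The F-value ranking can be illustrated with the pure-Python sketch below. The source performs this step in R with aov() and p.adjust(); this sketch only reproduces the one-way ANOVA F statistic and the descending sort, omits the p value correction, and uses hypothetical function and feature names.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups (each a list of numbers)."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def top_features(feature_table, labels, top_n):
    """Rank features by F value (descending) and keep the first `top_n` names.

    `feature_table` maps a feature name to one value per sample; `labels`
    gives the group of each sample in the same order.
    """
    scores = {}
    for name, values in feature_table.items():
        by_group = {}
        for value, label in zip(values, labels):
            by_group.setdefault(label, []).append(value)
        scores[name] = anova_f(list(by_group.values()))
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

For example, calling `top_features(table, labels, 200)` on one feature type with cancer and non-cancer labels would correspond to keeping the top 200 features of that type.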
Construction of the multi-cancer early detection model
To build the multi-cancer early detection model, the training set is divided into a cancer group and a non-cancer group, analysis of variance is performed on each of the five feature types, the features are sorted in descending order of F value, and the top 200 features of each type are retained as prediction inputs.
Construction of the cancer tissue-of-origin model
To build the multi-cancer tissue-of-origin model, the 191 liver cancer, 149 intestinal cancer, and 146 lung cancer patients in the training set are divided into three groups. For each cancer type, analysis of variance is performed on each of the five feature types in a "single cancer type vs. other cancer types" mode, the features are sorted in descending order of F value, and the top 100 features are retained as prediction inputs.
Data results of the feature screening process
The top 200 FSC features that differ significantly between the cancer group and the non-cancer group are as follows, where long/short/total denote long reads, short reads, and all reads respectively, and the number denotes the window position:

The top 200 FSD features that differ significantly between the cancer group and the non-cancer group are as follows, where chrx and xp/xq denote the short (p) or long (q) arm of chromosome x, and the number denotes the index of the length gradient:

The top 200 EDM features that differ significantly between the cancer group and the non-cancer group are as follows, where each identifier composed of eight A/T/C/G letters is the base sequence of the feature:

The top 200 BKM features that differ significantly between the cancer group and the non-cancer group are as follows, where each identifier composed of eight A/T/C/G letters is the base sequence of the feature:

The top 200 CNV features that differ significantly between the cancer group and the non-cancer group are as follows, where chrx denotes chromosome x and the numbers give the position range on the chromosome:







The top 100 FSC features of liver cancer that differ significantly from the other cancer types are as follows:

The top 100 FSD features of liver cancer that differ significantly from the other cancer types are as follows:

The top 100 EDM features of liver cancer that differ significantly from the other cancer types are as follows:

The top 100 BKM features of liver cancer that differ significantly from the other cancer types are as follows:

The top 100 CNV features of liver cancer that differ significantly from the other cancer types are as follows:



The top 100 FSC features of intestinal cancer that differ significantly from the other cancer types are as follows:

The top 100 FSD features of intestinal cancer that differ significantly from the other cancer types are as follows:

The top 100 EDM features of intestinal cancer that differ significantly from the other cancer types are as follows:

The top 100 BKM features of intestinal cancer that differ significantly from the other cancer types are as follows:

The top 100 CNV features of intestinal cancer that differ significantly from the other cancer types are as follows:



The top 100 FSC features of lung cancer that differ significantly from the other cancer types are as follows:

The top 100 FSD features of lung cancer that differ significantly from the other cancer types are as follows:

The top 100 EDM features of lung cancer that differ significantly from the other cancer types are as follows:

The top 100 BKM features of lung cancer that differ significantly from the other cancer types are as follows:

The top 100 CNV features of lung cancer that differ significantly from the other cancer types are as follows:


After the significantly different features are screened, 200 features of each of the five types are obtained for the multi-cancer early detection model. For all samples in the training set, each feature type is used as input and the "cancer/healthy" label as the prediction target, and the generalized linear model, gradient boosting model, random forest model, deep learning model, and extreme gradient boosting model are each trained, giving 25 binary base models.
To further improve classifier performance, a second round of ensemble training (stacking) is performed on the outputs of the base models above. Stacking is an ensemble learning technique: second-level meta-learning is applied to multiple first-level base models (weak classifiers) to capture the characteristics of each base classifier and find the optimal way to combine them, thereby improving prediction performance. The training algorithm used for stacking in this patent is the Generalized Linear Model (GLM), which relates the mathematical expectation of the response variable to a linear combination of predictor variables through a link function and combines the trained base models into the final linear equation:
ALLStacked = Intercept
  + A1*FSC_GLM + A2*FSC_GBM + A3*FSC_RF + A4*FSC_DL + A5*FSC_XGBoost
  + B1*FSD_GLM + B2*FSD_GBM + B3*FSD_RF + B4*FSD_DL + B5*FSD_XGBoost
  + C1*EDM_GLM + C2*EDM_GBM + C3*EDM_RF + C4*EDM_DL + C5*EDM_XGBoost
  + D1*BPM_GLM + D2*BPM_GBM + D3*BPM_RF + D4*BPM_DL + D5*BPM_XGBoost
  + E1*CNV_GLM + E2*CNV_GBM + E3*CNV_RF + E4*CNV_DL + E5*CNV_XGBoost
Here, Intercept and A1-E5 are parameters of the linear equation. FSC_GLM and the other terms are the output values that the corresponding base model produces from its input data: the characters before the "_" symbol denote the feature set type (BPM denotes the breakpoint motif feature set, BKM), and the characters after it denote the algorithm type. The output value of the multi-cancer early screening model is the probability of cancer.
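How the 25 base-model outputs combine into a single probability can be sketched as follows. The term names follow the equation above; the coefficient values used in any call are hypothetical, and the logistic mapping assumes a logit link, which the source does not state explicitly.

```python
import math

FEATURE_SETS = ["FSC", "FSD", "EDM", "BPM", "CNV"]   # prefixes used in the equation
ALGORITHMS = ["GLM", "GBM", "RF", "DL", "XGBoost"]   # suffixes used in the equation
TERMS = [f"{f}_{a}" for f in FEATURE_SETS for a in ALGORITHMS]  # 25 base-model outputs


def all_stacked(intercept, weights, base_outputs):
    """Linear predictor: Intercept plus the weighted sum of the base-model outputs."""
    return intercept + sum(weights[t] * base_outputs[t] for t in TERMS)


def cancer_probability(intercept, weights, base_outputs):
    """Map the linear predictor to (0, 1) with the logistic function (assumed logit link)."""
    return 1.0 / (1.0 + math.exp(-all_stacked(intercept, weights, base_outputs)))
```

Both `weights` and `base_outputs` are dictionaries keyed by term name, such as `"FSC_GLM"`, so the code mirrors the equation term by term.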
The multi-cancer tissue-of-origin model is mainly used to further determine the specific cancer type of samples already confirmed to have one of the three cancers above. Therefore, three groups of training samples are set up for classification:
Group 1 training samples: positive is intestinal cancer, and controls are lung cancer and liver cancer; the call is one of two classes: intestinal cancer or the other two cancers.
Group 2 training samples: positive is lung cancer, and controls are intestinal cancer and liver cancer; the call is one of two classes: lung cancer or the other two cancers.
Group 3 training samples: positive is liver cancer, and controls are lung cancer and intestinal cancer; the call is one of two classes: liver cancer or the other two cancers.
Analysis of variance is performed separately within each group of samples, so that the features with significant differences in each feature set are found for each group. After all three groups have been analyzed, each group has its own set of significantly different features, and these sets overlap. The features selected for each group are therefore merged and de-duplicated to obtain the features required by the final model.
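The merge-and-de-duplicate step can be sketched as below. The function name and the feature names in the usage are hypothetical; preserving first-seen order is a choice made for the example, not something the source specifies.

```python
def merge_selected(*selections):
    """Merge several feature-name lists and drop duplicates, keeping first-seen order."""
    merged = []
    seen = set()
    for selection in selections:
        for name in selection:
            if name not in seen:
                seen.add(name)
                merged.append(name)
    return merged
```

With three lists of 100 features each, the merged list contains between 100 and 300 names, depending on how much the three selections overlap.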
Finally, 180 FSC features, 205 FSD features, 295 EDM features, 297 BKM features, and 204 CNV features are obtained for the multi-cancer tissue-of-origin model. For the cancer samples in the training set, each feature set is used as input and the "intestinal cancer/liver cancer/lung cancer" label as the prediction target, and the gradient boosting model, random forest model, deep learning model, and extreme gradient boosting model, all of which are suited to multi-class classification, are each trained, giving 20 multi-class base models.
To improve prediction performance, second-round ensemble training is also used. The method is essentially the same as above, except that the linear equation is:
ALLStacked = Intercept
  + A2*FSC_GBM + A3*FSC_RF + A4*FSC_DL + A5*FSC_XGBoost
  + B2*FSD_GBM + B3*FSD_RF + B4*FSD_DL + B5*FSD_XGBoost
  + C2*EDM_GBM + C3*EDM_RF + C4*EDM_DL + C5*EDM_XGBoost
  + D2*BPM_GBM + D3*BPM_RF + D4*BPM_DL + D5*BPM_XGBoost
  + E2*CNV_GBM + E3*CNV_RF + E4*CNV_DL + E5*CNV_XGBoost
Here, Intercept and A2-E5 are parameters of the linear equation. FSC_GBM and the other terms are the output values that the corresponding base model produces from its input data: the characters before the "_" symbol denote the feature set type, and the characters after it denote the algorithm type. The output value of the multi-cancer early screening model is the probability of cancer, and the output of the multi-cancer tissue-of-origin model is the cancer type probability (the stacked tissue-of-origin model predicts the likelihood of liver cancer, intestinal cancer, and lung cancer separately for each sample, and the cancer type with the highest of the three predictions is taken as the final call).
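The final call described in the parentheses above amounts to taking the largest of the three predicted likelihoods. A minimal sketch, with a hypothetical function name and made-up example values:

```python
def call_tissue_of_origin(likelihoods):
    """Return the cancer type whose predicted likelihood is the largest.

    `likelihoods` maps a cancer type name to the value predicted for it.
    """
    return max(likelihoods, key=likelihoods.get)


# Illustrative values only, not results from the source.
example = {"liver cancer": 0.2, "intestinal cancer": 0.7, "lung cancer": 0.1}
predicted = call_tissue_of_origin(example)
```

The three values need not sum to 1, since each comes from a separate one-vs-rest prediction.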
The stacked multi-cancer early detection model effectively distinguishes cancer patients from healthy individuals. Sensitivity and specificity both reached 94% in the training set; when the stacked model was validated on the test set, sensitivity and specificity reached 95%, with no discrepancy between the two sets. The specific results are shown in the table below:

The stacked multi-cancer tissue-of-origin model effectively distinguishes the tissue of origin of lung cancer, liver cancer, and intestinal cancer. The overall accuracy reached 95.1% in the training set and 93.1% for the test set samples that were correctly predicted as cancer. The specific results are shown in the table below:
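For readers who want to reproduce metrics of this kind on their own predictions, the sketch below shows how sensitivity, specificity, and overall accuracy are conventionally computed. The function names and the 1/0 label encoding are assumptions; the code does not reproduce the figures reported in this document.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for labels coded 1 = cancer, 0 = healthy.

    Assumes both classes occur at least once in `y_true`.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)
```

Sensitivity and specificity apply to the binary cancer/healthy call, while accuracy is the natural summary for the three-class tissue-of-origin call.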

Control experiment 1:
Among the feature types used by the model, the proportion of fragment 5' end sequences (EDM) is excluded and only the other four are used. The model is built by the same process as above to establish the tissue-of-origin model, and the results obtained for the test set samples are as follows:

GLM is a binary classification algorithm; its advantages are less evident in multi-class classification, and it does not show good performance in classifying cancer types. The GLM base model is therefore not used in the classification model of this part and is used only for classifying cancer versus healthy samples.
The above embodiments explain and illustrate the technical solution of this patent but do not limit its scope of protection.

Claims (10)

  1. A method for constructing a multi-cancer early screening model, the model being used to classify whether a sample has intestinal cancer, lung cancer, or liver cancer, characterized by comprising the following steps:
    Step 1: extract and sequence cfDNA from the samples of the positive group and the control group to obtain read data;
    Step 2: align the read data to the reference genome, divide the reference genome into multiple windows, and obtain the number of all reads, the number of short reads, and the number of ultra-long reads within each window as the first feature set;
    Step 3: align the read data to the reference genome, take the long arm and short arm of each chromosome as regions, and obtain the number of reads in each length gradient interval within each region as the second feature set;
    Step 4: take the m bases at the 5' end of each read as a set of base fragments, and obtain the proportion of each base fragment among all fragments as the third feature set;
    Step 5: align the read data to the reference genome to obtain the position of the 5' end of each read on the reference genome; obtain the sequence of n bp upstream and n bp downstream of that position as a set of base fragments; and take the proportion of each base fragment among all fragments as the fourth feature set;
    Step 6: divide the reference genome into multiple windows, and obtain the copy number data within each window as the fifth feature set;
    Step 7: take the first, second, third, fourth, and fifth feature sets together as initial feature values, input them into a classification model as the model feature vector with cancer status as the output value, and train the model to obtain the early screening model.
  2. The method for constructing a multi-cancer early screening model according to claim 1, characterized in that, in step 6, having cancer means having any one of intestinal cancer, lung cancer, or liver cancer; in step 6, the initial feature values are further simplified before being used as the model feature vector, the simplification consisting of selecting, from each of the first, second, third, fourth, and fifth feature sets, the features that differ significantly between the samples of the positive group and the control group; and the selection is performed by analysis of variance.
  3. The method for constructing a multi-cancer early screening model according to claim 1, characterized in that the short reads are 40-80 bp in length, the ultra-long reads are 200-300 bp in length, and all reads are those with lengths in the range of 40-300 bp; and the window size in step 2 is in the range of 2-7 Mb.
  4. The method for constructing a multi-cancer early screening model according to claim 1, characterized in that the length gradient intervals in step 3 are the intervals obtained by incrementing in steps of 8-12 bp over the range of 40-300 bp; and the read counts are standardized.
  5. The method for constructing a multi-cancer early screening model according to claim 1, characterized in that, in step 4, m is any integer from 6 to 10; and in step 5, n is any integer from 2 to 5.
  6. The method for constructing a multi-cancer early screening model according to claim 1, characterized in that the windows in step 6 are obtained by dividing chromosomes 1-22 of the reference genome into non-overlapping windows of 0.8-1.2 Mb; and inputting into the classification model in step 7 means inputting the first, second, third, fourth, and fifth feature sets separately into the generalized linear model, gradient boosting model, random forest model, deep learning model, and extreme gradient boosting model to obtain multiple sub-models, and combining the sub-models into a linear relationship model.
  7. A multi-cancer detection device, characterized in that the device is used to classify whether a sample has intestinal cancer, lung cancer or liver cancer, and comprises:
    a sequencing module, used to extract and sequence cfDNA from the samples of the positive group and the control group to obtain read data;
    a first feature set acquisition module, used to align the read data to the reference genome, divide the reference genome into multiple windows, and obtain the number of all reads, the number of short reads and the number of ultra-long reads within each window as the first feature set;
    a second feature set acquisition module, used to align the read data to the reference genome, take the long arm and the short arm of each chromosome as regions, and obtain the number of reads in each of the different length gradient intervals within each region as the second feature set;
    a third feature set acquisition module, used to take the m bases at the 5' end of each read as a set of base fragments, and obtain the proportion of each kind of base fragment among all fragments as the third feature set;
    a fourth feature set acquisition module, used to align the read data to the reference genome to obtain the position of the 5' end of each read on the reference genome; obtain the sequence of the n bp upstream and the n bp downstream of that position as a set of base fragments; and take the proportion of each kind of base fragment among all fragments as the fourth feature set;
    a fifth feature set acquisition module, used to divide the reference genome into multiple windows and obtain the copy number data within each window as the fifth feature set;
    a model construction module, used to take the first, second, third, fourth and fifth feature sets together as initial feature values, input them into a classification model as the model feature vector, and train the model with whether cancer is present as the output value, to obtain the early screening model.
  8. A method for constructing a multi-cancer early screening model, characterized in that the model is used to distinguish among intestinal cancer, lung cancer and liver cancer in cancer samples;
    Step 1: extract and sequence cfDNA from intestinal cancer, lung cancer and liver cancer samples to obtain read data;
    Step 2: align the read data to the reference genome, divide the reference genome into multiple windows, and obtain the number of all reads, the number of short reads and the number of ultra-long reads within each window as the first feature set;
    Step 3: align the read data to the reference genome, take the long arm and the short arm of each chromosome as regions, and obtain the number of reads in each of the different length gradient intervals within each region as the second feature set;
    Step 4: take the m bases at the 5' end of each read as a set of base fragments, and obtain the proportion of each kind of base fragment among all fragments as the third feature set;
    Step 5: align the read data to the reference genome to obtain the position of the 5' end of each read on the reference genome; obtain the sequence of the n bp upstream and the n bp downstream of that position as a set of base fragments; and take the proportion of each kind of base fragment among all fragments as the fourth feature set;
    Step 6: divide the reference genome into multiple windows and obtain the copy number data within each window as the fifth feature set;
    Step 7: establish three comparison groups, in which the positive samples of each group are intestinal cancer, lung cancer or liver cancer samples respectively and the control samples of each group are the samples of the remaining two cancers; in each of the three groups, take the first, second, third, fourth and fifth feature sets together as initial feature values and screen out the feature values that differ significantly between the positive samples and the control samples; then merge the significantly different feature values from the three groups, input them into a classification model as the model feature vector, and train the model with the probability of having intestinal cancer, lung cancer or liver cancer as the output value, to obtain the early screening model.
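The one-versus-the-other-two screening in step 7 can be sketched as follows. The claim does not name the significance test, so the Welch t statistic, the cut-off of 3.0 and the label strings are assumed stand-ins chosen only to make the example concrete.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two samples with at least two values each."""
    diff = mean(a) - mean(b)
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        # Zero spread in both groups: perfectly separated if the means differ.
        return math.copysign(math.inf, diff) if diff else 0.0
    return diff / se


def select_features(samples, labels, cancers=("intestinal", "lung", "liver"), threshold=3.0):
    """Union of feature indices that separate each cancer type from the other two.

    `samples` is a list of equal-length feature vectors and `labels` gives the
    cancer type of each sample; |t| > threshold is an assumed significance rule.
    """
    selected = set()
    for positive in cancers:
        pos = [s for s, y in zip(samples, labels) if y == positive]
        neg = [s for s, y in zip(samples, labels) if y != positive]
        for j in range(len(samples[0])):
            if abs(welch_t([s[j] for s in pos], [s[j] for s in neg])) > threshold:
                selected.add(j)
    return sorted(selected)
```

The returned indices would then pick the columns that form the model feature vector.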
  9. The method for constructing a multi-cancer early screening model according to claim 8, characterized in that, in step 7, inputting into the classification model means inputting the first, second, third, fourth and fifth feature sets separately into a gradient boosting model, a random forest model, a deep learning model and an extreme gradient boosting model to obtain multiple sub-models, and combining the sub-models into a linear relationship model.
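One way to read "combining the sub-models into a linear relationship model" is as a stacking step that learns a weighted sum of the sub-model scores. The sketch below is an assumption-laden illustration, not the applicant's procedure: it fits the weights by plain gradient descent and passes the linear combination through a sigmoid so that the output stays a probability.

```python
import math


def stack_predict(scores, weights, bias):
    """Sigmoid of the weighted sum of one sample's sub-model scores."""
    z = bias + sum(w * s for w, s in zip(weights, scores))
    return 1.0 / (1.0 + math.exp(-z))


def fit_linear_stack(sub_scores, labels, lr=0.5, epochs=2000):
    """Fit weights and bias of the combination by batch gradient descent.

    `sub_scores[i]` holds the scores of every sub-model for sample i and
    `labels[i]` is 1 for the positive class and 0 otherwise.
    """
    k, n = len(sub_scores[0]), len(sub_scores)
    weights, bias = [0.0] * k, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * k, 0.0
        for x, y in zip(sub_scores, labels):
            err = stack_predict(x, weights, bias) - y  # gradient of the log loss
            grad_b += err
            for j in range(k):
                grad_w[j] += err * x[j]
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return weights, bias
```

In practice the sub-model scores used for fitting would come from held-out samples, so that the combination does not simply reward the most overfitted sub-model.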
  10. A multi-cancer detection device, characterized in that the device is used to distinguish among intestinal cancer, lung cancer and liver cancer in cancer samples, and comprises:
    a sequencing module, used to extract and sequence cfDNA from intestinal cancer, lung cancer and liver cancer samples to obtain read data;
    a first feature set acquisition module, used to align the read data to the reference genome, divide the reference genome into multiple windows, and obtain the number of all reads, the number of short reads and the number of ultra-long reads within each window as the first feature set;
    a second feature set acquisition module, used to align the read data to the reference genome, take the long arm and the short arm of each chromosome as regions, and obtain the number of reads in each of the different length gradient intervals within each region as the second feature set;
    a third feature set acquisition module, used to take the m bases at the 5' end of each read as a set of base fragments, and obtain the proportion of each kind of base fragment among all fragments as the third feature set;
    a fourth feature set acquisition module, used to align the read data to the reference genome to obtain the position of the 5' end of each read on the reference genome; obtain the sequence of the n bp upstream and the n bp downstream of that position as a set of base fragments; and take the proportion of each kind of base fragment among all fragments as the fourth feature set;
    a fifth feature set acquisition module, used to divide the reference genome into multiple windows and obtain the copy number data within each window as the fifth feature set;
    a model construction module, used to establish three comparison groups, in which the positive samples of each group are intestinal cancer, lung cancer or liver cancer samples respectively and the control samples of each group are the samples of the remaining two cancers; in each of the three groups, take the first, second, third, fourth and fifth feature sets together as initial feature values and screen out the feature values that differ significantly between the positive samples and the control samples; then merge the significantly different feature values from the three groups, input them into a classification model as the model feature vector, and train the model with the probability of having intestinal cancer, lung cancer or liver cancer as the output value, to obtain the early screening model.
PCT/CN2023/082118 2022-04-15 2023-03-17 Multi-cancer early screening model construction method and detection device WO2023197825A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210392412.9A CN114927213A (en) 2022-04-15 2022-04-15 Construction method and detection device of multiple-cancer early screening model
CN202210392412.9 2022-04-15

Publications (1)

Publication Number Publication Date
WO2023197825A1 (en) 2023-10-19

Family

ID=82807125

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/082118 WO2023197825A1 (en) 2022-04-15 2023-03-17 Multi-cancer early screening model construction method and detection device

Country Status (2)

Country Link
CN (1) CN114927213A (en)
WO (1) WO2023197825A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114927213A (en) * 2022-04-15 2022-08-19 南京世和基因生物技术股份有限公司 Construction method and detection device of multiple-cancer early screening model
CN115595372B (en) * 2022-12-16 2023-03-14 南京世和基因生物技术股份有限公司 Methylation detection method of plasma free DNA source, lung cancer diagnosis marker and kit
CN116153420B (en) * 2023-04-24 2023-08-18 南京世和基因生物技术股份有限公司 Application of gene marker in early screening of malignant breast cancer and benign breast nodule and construction method of screening model

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110706749A (en) * 2019-09-10 2020-01-17 至本医疗科技(上海)有限公司 Cancer type prediction system and method based on tissue and organ differentiation hierarchical relation
WO2021110987A1 (en) * 2019-12-06 2021-06-10 Life & Soft Methods and apparatuses for diagnosing cancer from cell-free nucleic acids
CN112941181A (en) * 2017-06-07 2021-06-11 深圳市海普洛斯生物科技有限公司 Method for detecting cfDNA mutation information in peripheral blood of subject
CN113903398A (en) * 2021-09-08 2022-01-07 南京世和基因生物技术股份有限公司 Intestinal cancer early-screening marker, detection method, detection device, and computer-readable medium
CA3189557A1 (en) * 2020-08-05 2022-02-10 Inivata Ltd. Highly sensitive method for detecting cancer dna in a sample
CN114927213A (en) * 2022-04-15 2022-08-19 南京世和基因生物技术股份有限公司 Construction method and detection device of multiple-cancer early screening model

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113826167A (en) * 2019-05-13 2021-12-21 格瑞尔公司 Model-based characterization and classification
CN113436684B (en) * 2021-07-02 2022-07-15 南昌大学 Cancer classification and characteristic gene selection method

Also Published As

Publication number Publication date
CN114927213A (en) 2022-08-19

Similar Documents

Publication Publication Date Title
WO2023197825A1 (en) Multi-cancer early screening model construction method and detection device
CN109872776B (en) Screening method for potential biomarkers of gastric cancer based on weighted gene co-expression network analysis and application thereof
CN109801680B (en) Tumor metastasis and recurrence prediction method and system based on TCGA database
CN115295074B (en) Application of gene marker in malignant pulmonary nodule screening, construction method of screening model and detection device
CN112927757B (en) Gastric cancer biomarker identification method based on gene expression and DNA methylation data
CN113355421B (en) Lung cancer early screening marker, model construction method, detection device and computer readable medium
US20220277811A1 (en) Detecting False Positive Variant Calls In Next-Generation Sequencing
CN110853756A (en) Esophagus cancer risk prediction method based on SOM neural network and SVM
CN106156541B (en) The method and apparatus of the immunity difference of the individual two class states of analysis
CN113903398A (en) Intestinal cancer early-screening marker, detection method, detection device, and computer-readable medium
CN116153420B (en) Application of gene marker in early screening of malignant breast cancer and benign breast nodule and construction method of screening model
CN111944902A (en) Early prediction method of renal papillary cell carcinoma based on lincRNA expression profile combination characteristics
CN111748634A (en) Characteristic lincRNA expression profile combination and early prediction method of colon cancer
CN111763738A (en) Characteristic mRNA expression profile combination and liver cancer early prediction method
CN111944900A (en) Characteristic lincRNA expression profile combination and early endometrial cancer prediction method
TW202121223A (en) Methods for training an artificial neural network to predict whether a subject will exhibit a characteristic gene expression and systems for executing the same
KR20200109544A (en) Multi-cancer classification method by common significant genes
CN112382341B (en) Method for identifying biomarkers related to prognosis of esophageal squamous carcinoma
CN114550831A (en) Gastric cancer proteomics typing framework identification method based on deep learning feature extraction
CN111733252A (en) Characteristic miRNA expression profile combination and early gastric cancer prediction method
Swain et al. A Comparative Analysis of Machine Learning Models for Colon Cancer Classification
CN111383717A (en) Method and system for constructing biological information analysis reference data set
CN115881218B (en) Gene automatic selection method for whole genome association analysis
Joshi et al. Sparse superlayered neural network-based multi-omics cancer subtype classification
CN111718997A (en) Characteristic mRNA expression profile combination and early gastric cancer prediction method

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 23787473; Country of ref document: EP; Kind code of ref document: A1