CN113223613A - Cancer detection method based on multi-dimensional single nucleotide variation characteristics - Google Patents
- Publication number
- CN113223613A (application number CN202110524968.4A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
Abstract
The invention provides a cancer detection method based on multi-dimensional single nucleotide variation (SNV) features, which addresses the low detection accuracy and narrow detection range of the prior art and comprises the following steps: (1) obtaining the multi-dimensional features of SNV sites; (2) obtaining a training sample set and a test sample set; (3) constructing the distance calculation function Dist(X, Y) of a classifier G; (4) iteratively training the classifier G; (5) obtaining the cancer detection result. The invention uses larger training and test sample sets and collects multi-dimensional SNV features from different feature dimensions, which increases the information about the cancer detection samples carried by the features. Because the SNV data of multiple cancers are used, the trained detection model can detect multiple cancers simultaneously, which simplifies the otherwise repeated detection process.
Description
Technical Field
The invention belongs to the field of bioinformatics and relates to a cancer detection method, in particular to a cancer detection method based on multi-dimensional single nucleotide variation (SNV) features, which can be used to classify the SNV data of cancers.
Background
In recent years, cancer has threatened human health as a major cause of shortened life expectancy worldwide. Atypical clinical manifestations or histopathology make cancer difficult to detect. Because uniform definitions and related indicators are lacking, early cancer detection mostly relies on the experience of doctors or on the results of a large number of test items. This makes individual bias hard to avoid, and the detection cycle is long, costly, and not very accurate. A high-performance detection method applicable to multiple cancers is therefore important: it can provide knowledge support for doctors, help them monitor changes such as improvement, deterioration, and relapse, and reduce the time and monetary cost of a large number of complex test items. With the widespread application of machine learning in various fields, a variety of machine-learning-based cancer detection methods have emerged.
Bockmayr T et al. published the article "Multiclass cancer classification in fresh frozen and formalin-fixed paraffin-embedded tissue by DigiWest multiplex protein profiling" in Laboratory Investigation in 2020, which discloses a cancer detection method based on multiplex protein analysis. The method first tests a number of antibodies on a set of formalin-fixed paraffin-embedded (FFPE) samples, selects as features the antibodies that produce clearly correlated signals in fresh frozen and FFPE primary tumor samples, and uses these features to develop a support vector machine algorithm applicable to 5 cancers. Its disadvantages are that the available data volume is small and the features are acquired in a single way, so the detection accuracy is low. In addition, the study targets a small number of specific cancers, which inevitably limits its results: more cancers cannot be detected simultaneously, and a large number of repeated tests are required.
Disclosure of Invention
The invention aims to overcome the defects in the prior art, and provides a cancer detection method based on multi-dimensional single nucleotide variation characteristics, which is used for solving the technical problems of low detection accuracy and narrow detection range in the prior art.
In order to achieve the purpose, the technical scheme adopted by the invention comprises the following steps:
(1) obtaining the multi-dimensional features of the single nucleotide variation (SNV) sites:
(1a) randomly selecting the SNV sites of $C$ cancers from the TCGA database to form the SNV site set $S_{site}=\{X_c\mid 1\le c\le C\}$, where $C>0$, $X_c$ represents the SNV sites of the $c$-th cancer, $X_c=\{x_{cn}\mid 1\le n\le N_c\}$, $x_{cn}$ represents the SNV sites of the $n$-th cancer sequencing sample, $N_c$ is the number of cancer sequencing samples, $N_c>100$, $a_{cnm}$ represents the $m$-th SNV site, and [symbol not preserved in source] indicates the number of SNV sites; and obtaining the sample tag set $Tag=\{Tag_l\mid 1<l<N\}$ of the cancer sequencing samples, where $Tag_l$ is the cancer class tag of the $l$-th cancer sequencing sample, [formula not preserved in source], and $\Sigma$ denotes summation;
(1b) performing sequence editing on each SNV site $a_{cnm}$ to obtain the SNV sequence set $S_{seq}=\{X_c'\mid 1\le c\le C\}$ corresponding to $S_{site}$, $X_c'=\{x_{cn}'\mid 1\le n\le N_c\}$, where $X_c'$ represents the SNV sequences corresponding to $X_c$, $x_{cn}'$ represents the SNV sequences corresponding to $x_{cn}$, and $a_{cnm}'$ represents the SNV sequence corresponding to $a_{cnm}$;
(1c) initializing the number of sampling passes as $I$, $I\ge 3$, and the feature dimension as $d$, and letting $d=1$;
(1d) sampling each SNV sequence $a_{cnm}'$ from front to back through a sliding window of size $d\times 1$ to obtain the feature set $S_{temp}=\{F_h\mid 1\le h\le d\}$ containing $d$ groups of features, where $F_h$ represents the $h$-th group feature set containing $N$ samples, [symbol not preserved in source] is the feature of the $l$-th sample in the $h$-th group of features, and the number of feature types of $F_h$ is $f_d=6\times 4^{d-1}$;
(1e) judging whether $d<I$ holds; if so, letting $d=d+1$ and executing step (1d); otherwise, calculating the average number of SNV sites $M_{equal}$ of the cancer sequencing samples in $S_{site}$ and executing step (1f), where [formula not preserved in source];
(1f) judging whether $M_{equal}<f_d$ holds; if so, obtaining the multi-dimensional feature set $S_{di}=\{F_i\mid 1\le i\le f_e\}$ containing $f_e$ groups of features, $F_i=\{f_i^{(l)}\mid 1\le l\le N\}$; otherwise, letting $d=d+1$ and executing step (1d), where [formula not preserved in source], $F_i$ represents the $i$-th group feature set containing $N$ samples, and $f_i^{(l)}$ is the feature of the $l$-th sample in $F_i$;
(2) obtaining a training sample set and a test sample set:
(2a) counting the number of feature types of each $f_i^{(l)}$ in $S_{di}$ to obtain the feature vector set $S_{vec}=\{S_i\mid 1\le i\le f_e\}$ containing $f_e$ groups of feature vectors, [formula not preserved in source], where $S_i$ is the $i$-th group feature vector set containing $N$ samples, and [symbol not preserved in source] is the $l$-th feature vector in the $i$-th group of feature vectors;
(2b) combining each feature vector in $S_{vec}$ with the corresponding sample tag in $Tag$ to form a sample to be classified, obtaining the set of samples to be classified $S_{sam}$; randomly selecting more than half of the samples to be classified in $S_{sam}$ as the training sample set $S_{train}$ containing $C$ cancers, and using the remaining samples to be classified in $S_{sam}$ as the test sample set $S_{test}$ containing $C$ cancers [set definitions not preserved in source], where the training sample set of the $c$-th cancer contains $N_c'$ training samples, $p_{n'}$ is its $n'$-th training sample, the test sample set of the $c$-th cancer contains $N_c-N_c'$ test samples, and $q_{n''}$ is its $n''$-th test sample;
(3) constructing the distance calculation function $Dist(X,Y)$ of the classifier $G$:
where $X$ and $Y$ represent the sets of $f_e$ feature vectors of any two samples to be classified in $S_{sam}$, $X=\{x_i\mid 1\le i\le f_e\}$, $Y=\{y_i\mid 1\le i\le f_e\}$, $x_i$ denotes the feature vector belonging to $X$ in the $i$-th group of feature vectors of $S_{vec}$, $y_i$ denotes the feature vector belonging to $Y$ in the $i$-th group of feature vectors of $S_{vec}$, [symbol not preserved in source] denotes the dimension of the $i$-th group of feature vectors, $x_{ij}$ denotes the $j$-th element of $x_i$, and $y_{ij}$ denotes the $j$-th element of $y_i$ [defining formulas not preserved in source];
(4) iteratively training the classifier $G$:
(4a) initializing the iteration counter as $r$, the maximum number of iterations as $R$, $R\ge 200$, the hyperparameter of the classifier $G$ as $\theta$ with initial value $\theta_0$ and update step $w$, the maximum accuracy as $T_m$, and the hyperparameter corresponding to $T_m$ as $\theta_m$, and letting $T_m=0$, $\theta_m=\theta_0$, $r=0$;
(4b) taking the training sample set $S_{train}$ as the input of the classifier $G$, computing the distance between every two training samples in $S_{train}$ with the distance calculation function $Dist(X,Y)$ to obtain the training sample distance set, and classifying each training sample $p_{n'}$ by means of this distance set to obtain the detection categories of the $C$ cancers [set definitions not preserved in source], where the distance set contains the distance between the $x_{tr}$-th and the $y_{tr}$-th training samples in $S_{train}$, and $t_{n'}$ is the detection category corresponding to $p_{n'}$;
(4c) judging whether the tag of each $p_{n'}$ is consistent with the corresponding cancer detection category $t_{n'}$; if so, the detection result of the training sample is considered correct, otherwise it is considered wrong; obtaining the detection accuracy set and calculating the average accuracy $T_r$ of the $r$-th iteration [formulas not preserved in source], where the quantities defined are the detection accuracy of each training sample subset and the number of correctly classified training samples in it;
(4d) judging whether $T_m<T_r$ holds; if so, letting $T_m=T_r$ and $\theta_m=\theta$ and executing step (4e); otherwise, executing step (4e) directly;
(4e) judging whether $r<R$ holds; if so, letting $r=r+1$ and $\theta=\theta+w$ and executing step (4b); otherwise, obtaining the trained classifier $G'$;
(5) obtaining the cancer detection result:
taking the test sample set $S_{test}$ as the input of the trained classifier $G'$, computing the distance between every two test samples in $S_{test}$ with the distance calculation function $Dist(X,Y)$ to obtain the test sample distance set, and classifying each test sample $q_{n''}$ by means of this distance set to obtain the detection categories of the $C$ cancers [set definitions not preserved in source], where the distance set contains the distance between the $x_{te}$-th and the $y_{te}$-th test samples in $S_{test}$, and $t'_{n''}$ is the detection category corresponding to $q_{n''}$.
Compared with the prior art, the invention has the following advantages:
1. The invention uses a rich volume of SNV data and collects multi-dimensional SNV features from different feature dimensions, which increases the information about the cancer detection samples carried by the features and improves the accuracy of the detection result.
2. The invention uses the SNV data of multiple cancers, so the trained detection model can detect multiple cancers simultaneously. Compared with the prior art, which can detect only a few specific cancers, this simplifies the otherwise repeated detection process and expands the range of detectable cancers.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and specific embodiments. It should be understood that the invention does not fall under the non-patentable subject matter defined in Article 25 of the Patent Law and complies with Article 2 of the Patent Law:
referring to fig. 1, the present invention includes the steps of:
Step 1) obtaining the multi-dimensional features of the single nucleotide variation (SNV) sites:
Step 1a) randomly selecting the SNV sites of $C$ cancers from the TCGA database to form the SNV site set $S_{site}=\{X_c\mid 1\le c\le C\}$, where $C>0$, $X_c$ represents the SNV sites of the $c$-th cancer, $X_c=\{x_{cn}\mid 1\le n\le N_c\}$, $x_{cn}$ represents the SNV sites of the $n$-th cancer sequencing sample, $N_c$ is the number of cancer sequencing samples, $N_c>100$, $a_{cnm}$ represents the $m$-th SNV site, and [symbol not preserved in source] indicates the number of SNV sites; and obtaining the sample tag set $Tag=\{Tag_l\mid 1<l<N\}$ of the cancer sequencing samples, where $Tag_l$ is the cancer class tag of the $l$-th cancer sequencing sample, [formula not preserved in source], and $\Sigma$ denotes summation; in this example, $C=12$ and $N=2761$;
When collecting SNV sites, only the SNV sites of 12 cancers are downloaded; the data are screened in this way to ensure data quality;
Step 1b) performing sequence editing on each SNV site $a_{cnm}$ to obtain the SNV sequence set $S_{seq}=\{X_c'\mid 1\le c\le C\}$ corresponding to $S_{site}$, $X_c'=\{x_{cn}'\mid 1\le n\le N_c\}$, where $X_c'$ represents the SNV sequences corresponding to $X_c$, $x_{cn}'$ represents the SNV sequences corresponding to $x_{cn}$, and $a_{cnm}'$ represents the SNV sequence corresponding to $a_{cnm}$;
The sequence editing of each SNV site $a_{cnm}$ is implemented as follows: let the sequence containing the SNV site $a_{cnm}$ be $Seq=b_1b_2b_3b_4b_5$ and the minor allele be $b_3'$; the single nucleotide variation of the major allele $b_3$ is then written as $Q=b_3\text{->}b_3'$, and $b_1$, $b_2$, $Q$, $b_4$, and $b_5$ are concatenated as strings to obtain the SNV sequence corresponding to $a_{cnm}$, $a_{cnm}'=b_1b_2Qb_4b_5=b_1b_2b_3\text{->}b_3'b_4b_5$, where $b_1$, $b_2$, $b_4$, and $b_5$ are the bases flanking $b_3$, and "->" denotes the single nucleotide variation;
The sequence editing step avoids the time lost to repeated operations such as string concatenation during feature collection;
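The sequence editing of step 1b) can be illustrated with a minimal Python sketch. The function name `edit_snv_sequence` and the example bases are hypothetical and are not part of the patent; only the concatenation rule $a_{cnm}'=b_1b_2b_3\text{->}b_3'b_4b_5$ is taken from the text above.

```python
def edit_snv_sequence(seq: str, minor_allele: str) -> str:
    """Build a_cnm' = b1 b2 (b3->b3') b4 b5 from a five-base context."""
    if len(seq) != 5:
        raise ValueError("expected a five-base context b1b2b3b4b5")
    b1, b2, b3, b4, b5 = seq
    q = f"{b3}->{minor_allele}"  # Q = b3->b3'
    return b1 + b2 + q + b4 + b5


# Hypothetical example: context ACGTA with the central G replaced by T.
print(edit_snv_sequence("ACGTA", "T"))  # prints ACG->TTA
```

Building the edited string once per site means the later sampling steps only need to slice it, which is the time saving the text describes.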
Step 1c) initializing the number of sampling passes as $I$, $I\ge 3$, and the feature dimension as $d$, and letting $d=1$; in this example, $I=3$;
When initializing the number of sampling passes, the value of $I$ should be controlled reasonably, which avoids overfitting to a certain extent;
Step 1d) sampling each SNV sequence $a_{cnm}'$ from front to back through a sliding window of size $d\times 1$ to obtain the feature set $S_{temp}=\{F_h\mid 1\le h\le d\}$ containing $d$ groups of features, where $F_h$ represents the $h$-th group feature set containing $N$ samples, [symbol not preserved in source] is the feature of the $l$-th sample in the $h$-th group of features, and the number of feature types of $F_h$ is $f_d=6\times 4^{d-1}$;
Step 1e) judging whether $d<I$ holds; if so, letting $d=d+1$ and executing step 1d); otherwise, calculating the average number of SNV sites $M_{equal}$ of the cancer sequencing samples in $S_{site}$ and executing step 1f), where [formula not preserved in source]; in this example, $M=528098$ and $M_{equal}=191$;
Step 1f) judging whether $M_{equal}<f_d$ holds; if so, obtaining the multi-dimensional feature set $S_{di}=\{F_i\mid 1\le i\le f_e\}$ containing $f_e$ groups of features, $F_i=\{f_i^{(l)}\mid 1\le l\le N\}$; otherwise, letting $d=d+1$ and executing step 1d), where [formula not preserved in source], $F_i$ represents the $i$-th group feature set containing $N$ samples, and $f_i^{(l)}$ is the feature of the $l$-th sample in $F_i$;
Judging whether $M_{equal}<f_d$ holds avoids overly sparse features: once $M_{equal}<f_d$, continuing to collect features would produce feature vectors with a large number of 0 values and reduce the detection accuracy. Collecting features from multiple dimensions improves data utilization and thereby the accuracy of the detection result;
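The control flow of steps 1c) to 1f) can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that $M_{equal}$ is the integer part of $M/N$; the source formula for $M_{equal}$ is not preserved, but this assumption reproduces the example values $M=528098$, $N=2761$, $M_{equal}=191$. The sampling itself (step 1d) is left as a comment.

```python
def feature_type_count(d: int) -> int:
    """Number of feature types f_d = 6 * 4**(d - 1) for feature dimension d."""
    return 6 * 4 ** (d - 1)


def choose_dimension(num_passes: int, m_equal: int) -> int:
    """Follow steps 1c) to 1f) and return the dimension d at which sampling stops."""
    d = 1  # step 1c)
    while True:
        # step 1d) would sample every SNV sequence with a window of size d here
        if d < num_passes:  # step 1e): keep sampling until d reaches I
            d += 1
            continue
        if m_equal < feature_type_count(d):  # step 1f): stop before features get sparse
            return d
        d += 1


# Assumption: M_equal is the integer part of M / N (the source formula is lost).
m_equal = 528098 // 2761
print(m_equal, choose_dimension(3, m_equal))  # prints 191 4
```

Under these assumptions, with $I=3$ the loop stops at $d=4$, because $f_3=96$ is still below 191 while $f_4=384$ exceeds it.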
Step 2) obtaining a training sample set and a test sample set:
Step 2a) counting the number of feature types of each $f_i^{(l)}$ in $S_{di}$ to obtain the feature vector set $S_{vec}=\{S_i\mid 1\le i\le f_e\}$ containing $f_e$ groups of feature vectors, [formula not preserved in source], where $S_i$ is the $i$-th group feature vector set containing $N$ samples, and [symbol not preserved in source] is the $l$-th feature vector in the $i$-th group of feature vectors;
Counting the number of feature types of each $f_i^{(l)}$ in $S_{di}$ is implemented as follows: let the feature dimension of the sample feature $f_i^{(l)}$ be $d_v$; establish a feature vector whose dimension is [formula not preserved in source]; initialize all of its elements to 0; and count the number of occurrences of the feature type corresponding to each element to obtain the feature vector corresponding to $f_i^{(l)}$ [formulas not preserved in source];
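The counting procedure can be sketched as below. This is an illustration under stated assumptions, not the patent's implementation: the helper name `count_vector` is hypothetical, and the six one-dimensional feature types are assumed to be the six pyrimidine-centred substitution classes commonly used for SNV data, which matches $f_1=6$ but is not stated in the source.

```python
# Assumption: the six feature types for d = 1 are the usual substitution classes.
SUBSTITUTION_TYPES = ["C->A", "C->G", "C->T", "T->A", "T->C", "T->G"]


def count_vector(sample_features, feature_types):
    """Return a zero-initialised vector holding the count of each feature type."""
    index = {ftype: j for j, ftype in enumerate(feature_types)}
    vector = [0] * len(feature_types)  # initialise all elements to 0
    for feature in sample_features:
        vector[index[feature]] += 1  # count the occurrences of each type
    return vector


# Hypothetical sample with three sampled features.
print(count_vector(["C->T", "C->T", "T->G"], SUBSTITUTION_TYPES))
# prints [0, 0, 2, 0, 0, 1]
```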
Step 2b) combining each feature vector in $S_{vec}$ with the corresponding sample tag in $Tag$ to form a sample to be classified, obtaining the set of samples to be classified $S_{sam}$; randomly selecting more than half of the samples to be classified in $S_{sam}$ as the training sample set $S_{train}$ containing $C$ cancers, and using the remaining samples to be classified in $S_{sam}$ as the test sample set $S_{test}$ containing $C$ cancers [set definitions not preserved in source], where the training sample set of the $c$-th cancer contains $N_c'$ training samples, $p_{n'}$ is its $n'$-th training sample, the test sample set of the $c$-th cancer contains $N_c-N_c'$ test samples, and $q_{n''}$ is its $n''$-th test sample; in this example, 80% of the samples to be classified in $S_{sam}$ are selected as the training sample set $S_{train}$;
Combining each feature vector in $S_{vec}$ with the corresponding sample tag in $Tag$ into a sample to be classified avoids the time lost to looking up the sample tag when judging whether the detection result of a sample is correct;
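A minimal sketch of the split in step 2b) follows. The function name `split_samples`, the fixed seed, and the per-cancer (stratified) split are assumptions made for illustration; the text only states that more than half of the samples (80% in the example) are drawn at random and that both sets contain all $C$ cancers.

```python
import random


def split_samples(samples, train_fraction=0.8, seed=0):
    """Split (feature_vectors, tag) pairs into training and test sets per cancer."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    by_tag = {}
    for sample in samples:
        by_tag.setdefault(sample[1], []).append(sample)
    train, test = [], []
    for group in by_tag.values():
        group = group[:]  # copy so the caller's data is not shuffled
        rng.shuffle(group)
        cut = int(len(group) * train_fraction)  # N_c' training samples
        train.extend(group[:cut])
        test.extend(group[cut:])  # N_c - N_c' test samples
    return train, test


# Hypothetical data: 10 samples for each of two cancers.
samples = [([i], "BLCA") for i in range(10)] + [([i], "LUAD") for i in range(10)]
train, test = split_samples(samples)
print(len(train), len(test))  # prints 16 4
```

Keeping the tag next to the feature vectors in each tuple mirrors the point made above: the correctness check can read the tag directly instead of searching for it.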
Step 3) constructing the distance calculation function $Dist(X,Y)$ of the classifier $G$:
where $X$ and $Y$ represent the sets of $f_e$ feature vectors of any two samples to be classified in $S_{sam}$, $X=\{x_i\mid 1\le i\le f_e\}$, $Y=\{y_i\mid 1\le i\le f_e\}$, $x_i$ denotes the feature vector belonging to $X$ in the $i$-th group of feature vectors of $S_{vec}$, $y_i$ denotes the feature vector belonging to $Y$ in the $i$-th group of feature vectors of $S_{vec}$, [symbol not preserved in source] denotes the dimension of the $i$-th group of feature vectors, $x_{ij}$ denotes the $j$-th element of $x_i$, and $y_{ij}$ denotes the $j$-th element of $y_i$ [defining formulas not preserved in source];
The distance calculation function $Dist(X,Y)$ aggregates the distances between the feature vectors of multiple groups of features and ensures that every group of feature vectors contributes equally to the detection result;
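Because the formula for $Dist(X,Y)$ is not preserved in the source, the following Python sketch shows only one plausible form and should not be read as the patent's definition. It assumes a root-mean-square difference per group, summed over groups, so that dividing by the group dimension keeps long feature vectors from dominating short ones. The name `group_distance` is hypothetical.

```python
import math


def group_distance(x_groups, y_groups):
    """Assumed form: sum over groups of sqrt(mean squared element difference)."""
    total = 0.0
    for x, y in zip(x_groups, y_groups):
        squared = sum((a - b) ** 2 for a, b in zip(x, y))
        total += math.sqrt(squared / len(x))  # normalise by the group dimension
    return total


# Two hypothetical samples with a 2-element group and a 4-element group.
x = [[0, 0], [0, 0, 0, 0]]
y = [[3, 3], [2, 2, 2, 2]]
print(group_distance(x, y))  # prints 5.0
```

In this example the first group contributes 3 and the second contributes 2, even though the second group has twice as many elements.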
Step 4) iteratively training the classifier $G$:
Step 4a) initializing the iteration counter as $r$, the maximum number of iterations as $R$, $R\ge 200$, the hyperparameter of the classifier $G$ as $\theta$ with initial value $\theta_0$ and update step $w$, the maximum accuracy as $T_m$, and the hyperparameter corresponding to $T_m$ as $\theta_m$, and letting $T_m=0$, $\theta_m=\theta_0$, $r=0$; in this example, $R=500$;
Step 4b) taking the training sample set $S_{train}$ as the input of the classifier $G$, computing the distance between every two training samples in $S_{train}$ with the distance calculation function $Dist(X,Y)$ to obtain the training sample distance set, and classifying each training sample $p_{n'}$ by means of this distance set to obtain the detection categories of the $C$ cancers [set definitions not preserved in source], where the distance set contains the distance between the $x_{tr}$-th and the $y_{tr}$-th training samples in $S_{train}$, and $t_{n'}$ is the detection category corresponding to $p_{n'}$;
The classification of each training sample $p_{n'}$ is implemented as follows: for the training sample $p_{n'}$, obtain from the training sample distance set the distances between $p_{n'}$ and the other training samples, select the training samples corresponding to the $\theta$ smallest distances, count their sample tags, and take the cancer class that occurs most often as the cancer detection class $t_{n'}$ of $p_{n'}$.
Step 4c) judging whether the tag of each $p_{n'}$ is consistent with the corresponding cancer detection category $t_{n'}$; if so, the detection result of the training sample is considered correct, otherwise it is considered wrong; obtaining the detection accuracy set and calculating the average accuracy $T_r$ of the $r$-th iteration [formulas not preserved in source], where the quantities defined are the detection accuracy of each training sample subset and the number of correctly classified training samples in it;
Step 4d) judging whether $T_m<T_r$ holds; if so, letting $T_m=T_r$ and $\theta_m=\theta$ and executing step 4e); otherwise, executing step 4e) directly;
Judging whether $T_m<T_r$ holds in the above step yields the hyperparameter value with the highest accuracy, which ensures that the trained classifier $G'$ is the best-performing classifier among the $R$ iterations;
Step 4e) judging whether $r<R$ holds; if so, letting $r=r+1$ and $\theta=\theta+w$ and executing step 4b); otherwise, obtaining the trained classifier $G'$;
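Steps 4a) to 4e) amount to a nearest-neighbour vote whose neighbour count $\theta$ is scanned over a grid. The Python sketch below makes that reading concrete under several assumptions: $\theta$ is an integer neighbour count, each training sample is classified against the other training samples, and $T_r$ is the mean of the per-cancer accuracies (the source formulas are not preserved). All function names and the toy data are hypothetical.

```python
from collections import Counter


def knn_label(dist_row, tags, k, skip):
    """Majority tag among the k nearest samples, excluding the sample itself."""
    others = [j for j in range(len(tags)) if j != skip]
    nearest = sorted(others, key=lambda j: dist_row[j])[:k]
    return Counter(tags[j] for j in nearest).most_common(1)[0][0]


def mean_class_accuracy(tags, preds):
    """Assumed T_r: average of the per-cancer detection accuracies."""
    accuracies = []
    for cancer in sorted(set(tags)):
        idx = [i for i, tag in enumerate(tags) if tag == cancer]
        accuracies.append(sum(preds[i] == cancer for i in idx) / len(idx))
    return sum(accuracies) / len(accuracies)


def train(dist, tags, theta0, w, max_iter):
    """Scan theta = theta0, theta0 + w, ... and keep the best (theta_m, T_m)."""
    theta, r = theta0, 0
    best_acc, best_theta = 0.0, theta0  # T_m and theta_m, step 4a)
    while True:
        preds = [knn_label(dist[i], tags, theta, i) for i in range(len(tags))]  # 4b)
        acc = mean_class_accuracy(tags, preds)  # 4c)
        if best_acc < acc:  # 4d)
            best_acc, best_theta = acc, theta
        if r < max_iter:  # 4e)
            r, theta = r + 1, theta + w
        else:
            return best_theta, best_acc


# Toy one-dimensional data standing in for Dist(X, Y) between training samples.
points = [0, 1, 2, 10, 11, 12]
tags = ["A", "A", "A", "B", "B", "B"]
dist = [[abs(p - q) for q in points] for p in points]
print(train(dist, tags, theta0=1, w=1, max_iter=2))
```

Because step 4d) uses a strict comparison, ties keep the earlier (smaller) $\theta$, which is why the toy run returns the initial value when every setting classifies all samples correctly.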
Step 5) obtaining the cancer detection result:
taking the test sample set $S_{test}$ as the input of the trained classifier $G'$, computing the distance between every two test samples in $S_{test}$ with the distance calculation function $Dist(X,Y)$ to obtain the test sample distance set, and classifying each test sample $q_{n''}$ by means of this distance set to obtain the detection categories of the $C$ cancers [set definitions not preserved in source], where the distance set contains the distance between the $x_{te}$-th and the $y_{te}$-th test samples in $S_{test}$, and $t'_{n''}$ is the detection category corresponding to $q_{n''}$.
The technical effects of the invention are further explained by combining simulation experiments as follows:
1. Simulation conditions:
The hardware platform of the simulation experiment is an Intel(R) Core(TM) i7-8500 CPU with a main frequency of 2.20 GHz and 16 GB of memory; the software platform is MacOS 10.15 with R version 3.6.
The data set used in the simulation was collected from the TCGA database and covers 12 cancers, including bladder urothelial carcinoma (BLCA), head and neck squamous cell carcinoma (HNSC), renal papillary cell carcinoma (KIRP), acute myeloid leukemia (LAML), hepatocellular carcinoma (LIHC), lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC), pancreatic cancer (PAAD), prostate cancer (PRAD), rectal adenocarcinoma (READ), and endometrial cancer (UCEC), with 2761 samples in total. The detection results are verified against the known labels: a detection result is considered correct when it is consistent with the known label and wrong otherwise.
2. Simulation content and result analysis:
The detection accuracy and application range of the invention are simulated, and the simulation result is compared with the prior-art cancer detection method based on multiplex protein analysis. The results are shown in Table 1.
TABLE 1
Method | Accuracy | Cancer detection range (number of cancer types) |
Prior art | 88% | 5 |
The invention | 97.43% | 12 |
Table 1 shows that the detection accuracy of the method of the invention is 97.43% and its cancer detection range is 12 cancer types. Both indicators are higher than those of the prior-art method, which demonstrates that the method obtains better cancer detection results and expands the cancer detection range.
The above simulation experiments show that, when the method is used to detect cancer, the multi-dimensional features of the SNV sites are obtained first, the training and test sample sets are obtained next, the distance calculation function Dist(X, Y) of the classifier G is then constructed, the classifier G is iteratively trained, and finally the cancer detection result is obtained.
Claims (4)
1. A cancer detection method based on multi-dimensional single nucleotide variation characteristics is characterized by comprising the following steps:
(1) obtaining the multi-dimensional features of the single nucleotide variation (SNV) sites:
(1a) randomly selecting the SNV sites of $C$ cancers from the TCGA database to form the SNV site set $S_{site}=\{X_c\mid 1\le c\le C\}$, where $C>0$, $X_c$ represents the SNV sites of the $c$-th cancer, $X_c=\{x_{cn}\mid 1\le n\le N_c\}$, $x_{cn}$ represents the SNV sites of the $n$-th cancer sequencing sample, $N_c$ is the number of cancer sequencing samples, $N_c>100$, $a_{cnm}$ represents the $m$-th SNV site, and [symbol not preserved in source] indicates the number of SNV sites; and obtaining the sample tag set $Tag=\{Tag_l\mid 1<l<N\}$ of the cancer sequencing samples, where $Tag_l$ is the cancer class tag of the $l$-th cancer sequencing sample, [formula not preserved in source], and $\Sigma$ denotes summation;
(1b) performing sequence editing on each SNV site $a_{cnm}$ to obtain the SNV sequence set $S_{seq}=\{X_c'\mid 1\le c\le C\}$ corresponding to $S_{site}$, $X_c'=\{x_{cn}'\mid 1\le n\le N_c\}$, where $X_c'$ represents the SNV sequences corresponding to $X_c$, $x_{cn}'$ represents the SNV sequences corresponding to $x_{cn}$, and $a_{cnm}'$ represents the SNV sequence corresponding to $a_{cnm}$;
(1c) initializing the number of sampling passes as $I$, $I\ge 3$, and the feature dimension as $d$, and letting $d=1$;
(1d) sampling each SNV sequence $a_{cnm}'$ from front to back through a sliding window of size $d\times 1$ to obtain the feature set $S_{temp}=\{F_h\mid 1\le h\le d\}$ containing $d$ groups of features, where $F_h$ represents the $h$-th group feature set containing $N$ samples, [symbol not preserved in source] is the feature of the $l$-th sample in the $h$-th group of features, and the number of feature types of $F_h$ is $f_d=6\times 4^{d-1}$;
(1e) judging whether $d<I$ holds; if so, letting $d=d+1$ and executing step (1d); otherwise, calculating the average number of SNV sites $M_{equal}$ of the cancer sequencing samples in $S_{site}$ and executing step (1f), where [formula not preserved in source];
(1f) judging whether $M_{equal}<f_d$ holds; if so, obtaining the multi-dimensional feature set $S_{di}=\{F_i\mid 1\le i\le f_e\}$ containing $f_e$ groups of features, $F_i=\{f_i^{(l)}\mid 1\le l\le N\}$; otherwise, letting $d=d+1$ and executing step (1d), where [formula not preserved in source], $F_i$ represents the $i$-th group feature set containing $N$ samples, and $f_i^{(l)}$ is the feature of the $l$-th sample in $F_i$;
(2) acquiring a training sample set and a testing sample set:
(2a) counting the number of feature types of each $f_i^{(l)}$ in $S_{di}$ to obtain a feature vector set $S_{vec}=\{S_i \mid 1\le i\le f_e\}$ containing $f_e$ groups of feature vectors, [formula missing], where $S_i$ is a group of feature vectors comprising N samples and [symbol missing] is the l-th feature vector in the i-th group of feature vectors;
(2b) combining the feature vectors in the feature vector set $S_{vec}$ with the corresponding sample tags in Tag to form samples to be classified, obtaining a set $S_{sam}$ of samples to be classified, [formula missing]; randomly selecting more than half of the samples to be classified in $S_{sam}$ as a training sample set $S_{train}$ containing C cancers, [formula missing], and taking the remaining samples to be classified in $S_{sam}$ as a test sample set $S_{test}$ containing C cancers, [formula missing], where [symbol missing] is the training sample set of the c-th cancer comprising $N_c'$ training samples, $p_{n'}$ is the n'-th training sample in [symbol missing], [symbol missing] is the test sample set of the c-th cancer comprising $N_c-N_c'$ test samples, and $q_{n''}$ is the n''-th test sample in [symbol missing];
(3) constructing the distance calculation function Dist(X, Y) of the classifier G:
[formula missing]
where X and Y denote the sets of $f_e$ feature vectors of any two samples to be classified in $S_{sam}$, $X=\{x_i \mid 1\le i\le f_e\}$, $Y=\{y_i \mid 1\le i\le f_e\}$, $x_i$ denotes the feature vector belonging to X in the i-th group of feature vectors of $S_{vec}$, $y_i$ denotes the feature vector belonging to Y in the i-th group of feature vectors of $S_{vec}$, [symbol missing] denotes the dimension of the i-th group of feature vectors, $x_{ij}$ denotes the j-th element of $x_i$, and $y_{ij}$ denotes the j-th element of $y_i$;
(4) performing iterative training on the classifier G:
(4a) initializing the iteration number as r, the maximum number of iterations as R, where $R\ge 200$, the hyperparameter of the classifier G as $\theta$ with initial value $\theta_0$, the update step of $\theta$ as w, the maximum accuracy as $T_m$, and the hyperparameter corresponding to $T_m$ as $\theta_m$, and letting $T_m=0$, $\theta_m=\theta_0$, $r=0$;
(4b) taking the training sample set $S_{train}$ as the input of the classifier G, calculating the distance between every two training samples in $S_{train}$ with the distance calculation function Dist(X, Y) to obtain a training sample distance set, [formula missing], and classifying each training sample $p_{n'}$ in [symbol missing] by means of the sample distances to obtain the detection categories of the C cancers, [formula missing], where [symbol missing] is the distance between the $x_{tr}$-th training sample and the $y_{tr}$-th training sample in $S_{train}$, [formula missing], [symbol missing] is the detection category set corresponding to [symbol missing], and $t_{n'}$ is the detection category corresponding to $p_{n'}$;
(4c) judging whether the tag of each $p_{n'}$ in [symbol missing] is consistent with its cancer detection category $t_{n'}$; if so, the detection result of the training sample is considered correct, otherwise it is considered wrong; obtaining a detection accuracy set, [formula missing], and calculating the average accuracy $T_r$ of the r-th iteration, where [symbol missing] is the detection accuracy of [symbol missing], [symbol missing] is the number of correctly classified training samples in [symbol missing], and [formula missing];
(4d) judging whether $T_m<T_r$ holds; if so, letting $T_m=T_r$ and $\theta_m=\theta$, and executing step (4e); otherwise, executing step (4e) directly;
(4e) judging whether $r<R$ holds; if so, letting $r=r+1$ and $\theta=\theta+w$, and executing step (4b); otherwise, obtaining the trained classifier G';
(5) obtaining the detection result of the cancer:
taking the test sample set $S_{test}$ as the input of the trained classifier G', calculating the distance between every two test samples in $S_{test}$ with the distance calculation function Dist(X, Y) to obtain a test sample distance set, [formula missing], and classifying each test sample $q_{n''}$ in [symbol missing] by means of the sample distances to obtain the detection categories of the C cancers, [formula missing], where [symbol missing] is the distance between the $x_{te}$-th test sample and the $y_{te}$-th test sample in $S_{test}$, [symbol missing] is the detection category set corresponding to [symbol missing], and $t'_{n''}$ is the detection category corresponding to $q_{n''}$.
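The iterative training of steps (4a) to (4e) amounts to a linear sweep over the hyperparameter θ that keeps the value with the best training accuracy. The following Python sketch illustrates only that control flow. The callable `evaluate` is a hypothetical stand-in for steps (4b) and (4c), that is, for classifying the training samples with a given θ and returning the average accuracy T_r; how the classifier G actually uses θ is not specified in this section and is not modelled here.

```python
from typing import Callable, Tuple

def sweep_hyperparameter(
    evaluate: Callable[[float], float],  # hypothetical: theta -> average accuracy T_r
    theta0: float,
    w: float,
    R: int,
) -> Tuple[float, float]:
    """Return (theta_m, T_m), the best hyperparameter and its accuracy."""
    # Step (4a): T_m = 0, theta_m = theta0, r = 0.
    t_m, theta_m = 0.0, theta0
    theta, r = theta0, 0
    while True:
        # Steps (4b)-(4c): classify the training set and score it.
        t_r = evaluate(theta)
        # Step (4d): keep the best accuracy seen so far.
        if t_m < t_r:
            t_m, theta_m = t_r, theta
        # Step (4e): advance theta by w until r reaches R.
        if r < R:
            r += 1
            theta += w
        else:
            break
    return theta_m, t_m

# Toy accuracy curve peaking at theta = 3, used only to exercise the loop.
best_theta, best_acc = sweep_hyperparameter(
    lambda th: 1.0 - abs(th - 3.0) / 10.0, theta0=1.0, w=1.0, R=5
)
```

Because the comparison in step (4d) is strict, ties are resolved in favour of the earliest θ, and the loop evaluates R + 1 values of θ in total.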
2. The cancer detection method based on multi-dimensional single nucleotide variation characteristics according to claim 1, wherein the sequence editing of each SNV site $a_{cnm}$ in step (1b) is implemented as follows:
for an SNV site $a_{cnm}$ located in a sequence $Seq=b_1b_2b_3b_4b_5$ and carrying the minor allele $b_3'$, the single nucleotide variation of the central original allele $b_3$ is expressed as $Q=b_3\text{->}b_3'$, and $b_1$, $b_2$, Q, $b_4$ and $b_5$ are concatenated as character strings to obtain the SNV sequence corresponding to $a_{cnm}$, $a_{cnm}'=b_1b_2Qb_4b_5=b_1b_2b_3\text{->}b_3'b_4b_5$, where $b_1$, $b_2$, $b_4$ and $b_5$ are the bases adjacent to $b_3$, and "->" is the single nucleotide variation symbol.
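The string splicing described in this claim, together with the sliding-window sampling of step (1d), can be sketched in Python as follows. The function names are hypothetical, and the rule that only windows containing the variation token Q are kept as features is an assumption made for this illustration; the claim text itself only states that a window of size d is moved from front to back.

```python
from typing import List

def edit_snv_sequence(seq: str, minor: str) -> List[str]:
    """Build the token list [b1, b2, Q, b4, b5] with Q = 'b3->b3''."""
    if len(seq) != 5:
        raise ValueError("expected a 5-base context b1b2b3b4b5")
    b1, b2, b3, b4, b5 = seq
    if minor == b3:
        raise ValueError("minor allele must differ from the original allele")
    q = f"{b3}->{minor}"  # single nucleotide variation token Q
    return [b1, b2, q, b4, b5]

def window_features(tokens: List[str], d: int) -> List[str]:
    """Slide a window of d tokens front to back.

    Assumption for illustration: only windows that contain the
    variation token are kept, which yields d features per SNV site.
    """
    features = []
    for start in range(len(tokens) - d + 1):
        window = tokens[start:start + d]
        if any("->" in token for token in window):
            features.append("".join(window))
    return features

tokens = edit_snv_sequence("ACGTA", "T")
snv_sequence = "".join(tokens)        # spliced SNV sequence string
pairs = window_features(tokens, 2)    # windows of size d = 2
triples = window_features(tokens, 3)  # windows of size d = 3
```

Keeping Q as a single token, rather than as four characters, lets the window count positions in bases, so a window of size d spans Q plus d - 1 neighbouring bases.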
3. The cancer detection method based on multi-dimensional single nucleotide variation characteristics according to claim 1, wherein counting the number of feature types of each $f_i^{(l)}$ in $S_{di}$ in step (2a) is implemented as follows:
for a feature $f_i^{(l)}$ of a sample with feature dimension $d_v$, establishing a feature vector [symbol missing] of dimension [symbol missing], initializing all elements of [symbol missing] to 0, and counting the number of occurrences of the feature type corresponding to each element of [symbol missing] to obtain the feature vector [symbol missing] corresponding to $f_i^{(l)}$, where [formula missing].
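A minimal Python sketch of this counting step is given below. It assumes that the six base-substitution classes behind the factor 6 in $f_d=6\times 4^{d-1}$ are the pyrimidine-centred classes commonly used in mutation-spectrum analysis, and that features are strings of the form produced by a window containing one "->" token; the names `SUBSTITUTIONS`, `feature_types` and `count_vector`, and the ordering of the vector elements, are hypothetical.

```python
from itertools import product
from typing import List

BASES = "ACGT"
# Assumption: six strand-folded substitution classes.
SUBSTITUTIONS = ["C->A", "C->G", "C->T", "T->A", "T->C", "T->G"]

def feature_types(d: int, pos: int) -> List[str]:
    """Enumerate all feature types for window size d with the
    variation token at 0-based position pos inside the window."""
    types = []
    for flank in product(BASES, repeat=d - 1):
        for sub in SUBSTITUTIONS:
            window = list(flank[:pos]) + [sub] + list(flank[pos:])
            types.append("".join(window))
    return types

def count_vector(observed: List[str], d: int, pos: int) -> List[int]:
    """Zero-initialised vector whose elements count how often each
    feature type occurs among a sample's observed features."""
    types = feature_types(d, pos)
    index = {t: i for i, t in enumerate(types)}
    vector = [0] * len(types)
    for feature in observed:
        # Raises KeyError for a feature outside the assumed classes.
        vector[index[feature]] += 1
    return vector

types3 = feature_types(3, 1)
vec = count_vector(["AC->TG", "AC->TG", "TT->CA"], d=3, pos=1)
```

For d = 3 with the variation in the middle, the enumeration has 6 × 4² = 96 entries, matching the formula for $f_d$ in step (1d).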
4. The cancer detection method based on multi-dimensional single nucleotide variation characteristics according to claim 1, wherein the classification of each training sample $p_{n'}$ in step (4b) is implemented as follows:
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110524968.4A CN113223613A (en) | 2021-05-14 | 2021-05-14 | Cancer detection method based on multi-dimensional single nucleotide variation characteristics |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113223613A true CN113223613A (en) | 2021-08-06 |
Family
ID=77095606
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110524968.4A Pending CN113223613A (en) | 2021-05-14 | 2021-05-14 | Cancer detection method based on multi-dimensional single nucleotide variation characteristics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113223613A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114242158A (en) * | 2022-02-21 | 2022-03-25 | 臻和(北京)生物科技有限公司 | Method, device, storage medium and equipment for detecting ctDNA single nucleotide variation site |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104504305A (en) * | 2014-12-24 | 2015-04-08 | 西安电子科技大学 | Method for monitoring gene expression data classification |
CN110211632A (en) * | 2019-05-06 | 2019-09-06 | 西安电子科技大学 | A kind of nucleotide unit point mutation detection method neural network based |
CN111278993A (en) * | 2017-09-15 | 2020-06-12 | 加利福尼亚大学董事会 | Somatic cell mononucleotide variants from cell-free nucleic acids and applications for minimal residual lesion monitoring |
CN112687329A (en) * | 2019-10-17 | 2021-04-20 | 中国科学技术大学 | Cancer prediction system based on non-cancer tissue mutation information and construction method thereof |
Non-Patent Citations (1)
Title |
---|
Bo Li et al.: "Identification and Validation of the SNV Biomarkers Based on Multi-Dimensional Patterns", arXiv * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20210806 |