CN113223613A - Cancer detection method based on multi-dimensional single nucleotide variation characteristics - Google Patents


Info

Publication number
CN113223613A
Authority
CN
China
Prior art keywords
sample
cancer
snv
training
samples
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110524968.4A
Other languages
Chinese (zh)
Inventor
鱼亮
李博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN202110524968.4A
Publication of CN113223613A
Legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biotechnology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • General Physics & Mathematics (AREA)
  • Chemical & Material Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention provides a cancer detection method based on multi-dimensional single nucleotide variation (SNV) features, which addresses the low detection accuracy and narrow detection range of the prior art. The method comprises the following steps: (1) obtaining multi-dimensional features of SNV sites; (2) obtaining a training sample set and a test sample set; (3) constructing the distance calculation function Dist(X, Y) of a classifier G; (4) iteratively training the classifier G; and (5) obtaining the cancer detection result. The invention uses larger training and test sets and collects SNV features across several feature dimensions, which increases the information carried by the features of each cancer detection sample. Because SNV data from multiple cancers are used, the trained detection model can detect multiple cancers simultaneously, which reduces repeated testing.

Description

Cancer detection method based on multi-dimensional single nucleotide variation characteristics
Technical Field
The invention belongs to the technical field of bioinformatics and relates to a cancer detection method, in particular to a cancer detection method based on multi-dimensional single nucleotide variation characteristics, which can be used to classify single nucleotide variation data of cancers.
Background
In recent years, cancer has threatened human health as a major cause of shortened life expectancy worldwide. Atypical clinical manifestations and histopathological presentations make cancer difficult to detect. Because unified definitions and related indicators are lacking, early cancer detection mostly relies on the experience of doctors or on the results of a large number of test items. This makes individual-specific bias hard to avoid, and detection is slow, costly, and of limited accuracy. A high-performance detection method applicable to many cancers is therefore important: it can provide knowledge support for doctors, help them monitor changes such as improvement, deterioration, and recurrence, and reduce the time and monetary cost of running a large number of complex test items. As machine learning has been applied widely across fields, various machine-learning-based cancer detection methods have emerged.
For example, Bockmayr T. et al., in the article "Multiclass cancer classification in fresh frozen and formalin-fixed paraffin-embedded tissue by DigiWest multiplex protein analysis" published in Laboratory Investigation in 2020, disclose a cancer detection method based on multiplex protein analysis. The method first tests a number of antibodies on a set of formalin-fixed paraffin-embedded (FFPE) samples, selects as features the antibodies that produce clearly correlated signals in fresh frozen and FFPE primary tumor samples, and uses these features to develop a support vector machine algorithm applicable to 5 cancers. Its disadvantages are that the available data volume is small and the features are acquired in only one way, so the detection accuracy is low. Moreover, the study targets only a few specific cancers, so its results are inevitably limited: more cancers cannot be detected simultaneously, and a large number of repeated tests are required.
Disclosure of Invention
The invention aims to overcome the defects in the prior art, and provides a cancer detection method based on multi-dimensional single nucleotide variation characteristics, which is used for solving the technical problems of low detection accuracy and narrow detection range in the prior art.
In order to achieve the purpose, the technical scheme adopted by the invention comprises the following steps:
(1) Obtain multi-dimensional features of single nucleotide variation (SNV) sites:
(1a) Randomly select the SNV sites of C cancers from the TCGA database to form the SNV site set $S_{site}=\{X_c \mid 1\le c\le C\}$, where $C>0$; $X_c=\{x_{cn}\mid 1\le n\le N_c\}$ denotes the SNV sites of the $c$-th cancer; $x_{cn}$ denotes the SNV sites of the $n$-th sequencing sample of that cancer; $N_c>100$ is the number of sequencing samples of the $c$-th cancer; [Figure BDA0003065461830000021]; $a_{cnm}$ denotes the $m$-th SNV site; and [Figure BDA0003065461830000022] denotes the number of SNV sites, [Figure BDA0003065461830000023]. Obtain the sample tag set $Tag=\{Tag_l\mid 1\le l\le N\}$ of the cancer sequencing samples, where $Tag_l$ is the cancer class tag of the $l$-th sequencing sample, $N=\sum_{c=1}^{C}N_c$, and $\Sigma$ denotes summation;
(1b) Perform sequence editing on each SNV site $a_{cnm}$ to obtain the SNV sequence set corresponding to $S_{site}$, $S_{seq}=\{X_c'\mid 1\le c\le C\}$, $X_c'=\{x_{cn}'\mid 1\le n\le N_c\}$, [Figure BDA0003065461830000025], where $X_c'$, $x_{cn}'$ and $a_{cnm}'$ denote the SNV sequences corresponding to $X_c$, $x_{cn}$ and $a_{cnm}$, respectively;
(1c) Initialize the number of sampling rounds as $I$, where $I\ge 3$, and the feature dimension as $d$, and let $d=1$;
(1d) Sample each SNV sequence $a_{cnm}'$ from front to back through a sliding window of size $d\times 1$ to obtain a feature set containing $d$ groups of features, $S_{temp}=\{F_h\mid 1\le h\le d\}$, where $F_h$ denotes the $h$-th group feature set over the $N$ samples, [Figure BDA0003065461830000026] is the feature of the $l$-th sample in the $h$-th group of features, and the number of feature types in $F_h$ is $f_d=6\times 4^{d-1}$;
(1e) Judge whether $d<I$ holds; if so, let $d=d+1$ and perform step (1d); otherwise, calculate the average number of SNV sites per cancer sequencing sample in $S_{site}$, $M_{equal}$, and perform step (1f), where [Figure BDA0003065461830000027], [Figure BDA0003065461830000028];
(1f) Judge whether $M_{equal}<f_d$ holds; if so, obtain the multi-dimensional feature set containing $f_e$ groups of features, $S_{di}=\{F_i\mid 1\le i\le f_e\}$, $F_i=\{f_i^{(l)}\mid 1\le l\le N\}$; otherwise, let $d=d+1$ and perform step (1d), where [Figure BDA0003065461830000029], $F_i$ denotes the $i$-th group feature set over the $N$ samples, and $f_i^{(l)}$ is the feature of the $l$-th sample in $F_i$;
(2) Obtain a training sample set and a test sample set:
(2a) Count the feature types of each $f_i^{(l)}$ in $S_{di}$ to obtain the feature vector set containing $f_e$ groups of feature vectors, $S_{vec}=\{S_i\mid 1\le i\le f_e\}$, [Figure BDA0003065461830000031], where $S_i$ is the $i$-th group of feature vectors over the $N$ samples and [Figure BDA0003065461830000032] is the $l$-th feature vector in the $i$-th group of feature vectors;
(2b) Combine each feature vector in the feature vector set $S_{vec}$ with its corresponding sample tag in $Tag$ to form a sample to be classified, giving the set of samples to be classified [Figure BDA0003065461830000033]. Randomly select more than half of the samples in $S_{sam}$ as the training sample set covering the C cancers [Figure BDA0003065461830000034], and use the remaining samples in $S_{sam}$ as the test sample set covering the C cancers [Figure BDA0003065461830000035], [Figure BDA0003065461830000036], where [Figure BDA0003065461830000037] is the training sample set of the $c$-th cancer, containing $N_c'$ training samples; $p_{n'}$ is the $n'$-th training sample in [Figure BDA0003065461830000038]; [Figure BDA0003065461830000039] is the test sample set of the $c$-th cancer, containing $N_c-N_c'$ test samples; and $q_{n''}$ is the $n''$-th test sample in [Figure BDA00030654618300000310];
(3) Construct the distance calculation function Dist(X, Y) of the classifier G:
[Figure BDA00030654618300000311]
where X and Y denote the sets of $f_e$ feature vectors of any two samples to be classified in $S_{sam}$, $X=\{x_i\mid 1\le i\le f_e\}$, $Y=\{y_i\mid 1\le i\le f_e\}$, [Figure BDA00030654618300000312]; $x_i$ denotes the feature vector belonging to X in the $i$-th group of feature vectors of $S_{vec}$; $y_i$ denotes the feature vector belonging to Y in the $i$-th group of feature vectors of $S_{vec}$; [Figure BDA00030654618300000313] denotes the dimension of the $i$-th group of feature vectors; $x_{ij}$ denotes the $j$-th element of $x_i$; and $y_{ij}$ denotes the $j$-th element of $y_i$, [Figure BDA00030654618300000314];
(4) Iteratively train the classifier G:
(4a) Initialize the iteration count as $r$ and the maximum number of iterations as $R$, $R\ge 200$; let the hyperparameter of the classifier G be $\theta$, with initial value $\theta_0$ and update step $w$; let the maximum accuracy be $T_m$ and its corresponding hyperparameter be $\theta_m$; and set $T_m=0$, $\theta_m=\theta_0$, $r=0$;
(4b) Take the training sample set $S_{train}$ as the input of the classifier G, and use the distance calculation function Dist(X, Y) to compute the distance between every two training samples in $S_{train}$, giving the training sample distance set [Figure BDA0003065461830000041]. Using these distances, classify each training sample $p_{n'}$ in [Figure BDA0003065461830000042] to obtain the detection categories over the C cancers [Figure BDA0003065461830000043], where [Figure BDA0003065461830000044] is the distance between the $x_{tr}$-th training sample and the $y_{tr}$-th training sample in $S_{train}$, [Figure BDA0003065461830000045] is the detection category corresponding to [Figure BDA0003065461830000046], and $t_{n'}$ is the detection category corresponding to $p_{n'}$;
(4c) Judge whether each $p_{n'}$ in [Figure BDA0003065461830000047] is consistent with its cancer detection category $t_{n'}$; if so, the detection result of the training sample is considered correct, and otherwise it is considered wrong. This gives the detection accuracy set [Figure BDA0003065461830000048], from which the average accuracy $T_r$ of the $r$-th iteration is calculated, where [Figure BDA0003065461830000049] is the detection accuracy of [Figure BDA00030654618300000410], and [Figure BDA00030654618300000411] is the number of correctly classified training samples in [Figure BDA00030654618300000412], [Figure BDA00030654618300000413], [Figure BDA00030654618300000414];
(4d) Judge whether $T_m<T_r$ holds; if so, let $T_m=T_r$ and $\theta_m=\theta$, and perform step (4e); otherwise, perform step (4e) directly;
(4e) Judge whether $r<R$ holds; if so, let $r=r+1$ and $\theta=\theta+w$, and perform step (4b); otherwise, obtain the trained classifier G';
(5) Obtain the cancer detection result:
Take the test sample set $S_{test}$ as the input of the trained classifier G', and use the distance calculation function Dist(X, Y) to compute the distance between every two test samples in $S_{test}$, giving the test sample distance set [Figure BDA00030654618300000415]. Using these distances, classify each test sample $q_{n''}$ in [Figure BDA00030654618300000416] to obtain the detection categories over the C cancers [Figure BDA00030654618300000417], where [Figure BDA00030654618300000418] is the distance between the $x_{te}$-th test sample and the $y_{te}$-th test sample in $S_{test}$, [Figure BDA00030654618300000419] is the detection category corresponding to [Figure BDA00030654618300000420], and $t'_{n''}$ is the detection category corresponding to $q_{n''}$.
Compared with the prior art, the invention has the following advantages:
1. The invention uses a large volume of SNV data and collects SNV features across several feature dimensions, which increases the information the features carry about each cancer detection sample and improves the accuracy of the detection result.
2. The invention uses SNV data from multiple cancers, so the trained detection model can detect multiple cancers simultaneously. Compared with the prior art, which can detect only a few specific cancers, this reduces repeated testing and widens the range of detectable cancers.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawing and a specific example, it being understood that the invention does not fall under the non-patentable subject matter defined in Article 25 of the Patent Law and complies with Article 2 of the Patent Law:
Referring to FIG. 1, the invention includes the following steps:
Step 1) Obtain multi-dimensional features of single nucleotide variation (SNV) sites:
Step 1a) Randomly select the SNV sites of C cancers from the TCGA database to form the SNV site set $S_{site}=\{X_c \mid 1\le c\le C\}$, where $C>0$; $X_c=\{x_{cn}\mid 1\le n\le N_c\}$ denotes the SNV sites of the $c$-th cancer; $x_{cn}$ denotes the SNV sites of the $n$-th sequencing sample of that cancer; $N_c>100$ is the number of sequencing samples of the $c$-th cancer; [Figure BDA0003065461830000051]; $a_{cnm}$ denotes the $m$-th SNV site; and [Figure BDA0003065461830000052] denotes the number of SNV sites, [Figure BDA0003065461830000053]. Obtain the sample tag set $Tag=\{Tag_l\mid 1\le l\le N\}$ of the cancer sequencing samples, where $Tag_l$ is the cancer class tag of the $l$-th sequencing sample, $N=\sum_{c=1}^{C}N_c$, and $\Sigma$ denotes summation. In this example, $C=12$ and $N=2761$;
When collecting SNV sites, only the SNV sites of 12 cancers are downloaded; the data are screened in this way to ensure data quality;
Step 1b) Perform sequence editing on each SNV site $a_{cnm}$ to obtain the SNV sequence set corresponding to $S_{site}$, $S_{seq}=\{X_c'\mid 1\le c\le C\}$, $X_c'=\{x_{cn}'\mid 1\le n\le N_c\}$, [Figure BDA0003065461830000055], where $X_c'$, $x_{cn}'$ and $a_{cnm}'$ denote the SNV sequences corresponding to $X_c$, $x_{cn}$ and $a_{cnm}$, respectively;
Sequence editing of each SNV site $a_{cnm}$ is implemented as follows: let the SNV site $a_{cnm}$ contain the sequence $Seq=b_1b_2b_3b_4b_5$, with primary allele $b_3$ and minor allele $b_3'$; express the single nucleotide variation as $Q=b_3\text{->}b_3'$; and concatenate $b_1$, $b_2$, $Q$, $b_4$ and $b_5$ as strings to obtain the SNV sequence corresponding to $a_{cnm}$, $a_{cnm}'=b_1b_2Qb_4b_5=b_1b_2b_3\text{->}b_3'b_4b_5$, where $b_1$, $b_2$, $b_4$ and $b_5$ are the bases flanking $b_3$ and "->" denotes the single nucleotide variation;
Performing this sequence editing step once avoids the time lost to repeated operations such as string concatenation during feature collection;
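The sequence editing described above amounts to one string concatenation per site. A minimal Python sketch, assuming the five-base context and the minor allele are already available as strings; the function name `edit_snv_sequence` and the example bases are illustrative and do not come from the patent:

```python
def edit_snv_sequence(context: str, minor_allele: str) -> str:
    """Build b1 b2 b3->b3' b4 b5 from a 5-base context and the minor allele b3'."""
    if len(context) != 5:
        raise ValueError("expected a 5-base context b1b2b3b4b5")
    b1, b2, b3, b4, b5 = context
    q = f"{b3}->{minor_allele}"  # Q = b3->b3'
    return b1 + b2 + q + b4 + b5


print(edit_snv_sequence("ACGTA", "T"))  # ACG->TTA
```

Storing the edited string once per site means later sampling steps only slice it, which is the time saving the text refers to.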
Step 1c) Initialize the number of sampling rounds as $I$, where $I\ge 3$, and the feature dimension as $d$, and let $d=1$; in this example, $I=3$;
When initializing the number of sampling rounds, keeping the value of $I$ within a reasonable range helps avoid overfitting to some extent;
Step 1d) Sample each SNV sequence $a_{cnm}'$ from front to back through a sliding window of size $d\times 1$ to obtain a feature set containing $d$ groups of features, $S_{temp}=\{F_h\mid 1\le h\le d\}$, where $F_h$ denotes the $h$-th group feature set over the $N$ samples, [Figure BDA0003065461830000061] is the feature of the $l$-th sample in the $h$-th group of features, and the number of feature types in $F_h$ is $f_d=6\times 4^{d-1}$;
Step 1e) Judge whether $d<I$ holds; if so, let $d=d+1$ and perform step 1d); otherwise, calculate the average number of SNV sites per cancer sequencing sample in $S_{site}$, $M_{equal}$, and perform step 1f), where [Figure BDA0003065461830000062], [Figure BDA0003065461830000063]. In this example, $M=528098$ and $M_{equal}=191$;
Step 1f) Judge whether $M_{equal}<f_d$ holds; if so, obtain the multi-dimensional feature set containing $f_e$ groups of features, $S_{di}=\{F_i\mid 1\le i\le f_e\}$, $F_i=\{f_i^{(l)}\mid 1\le l\le N\}$; otherwise, let $d=d+1$ and perform step 1d), where [Figure BDA0003065461830000064], $F_i$ denotes the $i$-th group feature set over the $N$ samples, and $f_i^{(l)}$ is the feature of the $l$-th sample in $F_i$;
Checking whether $M_{equal}<f_d$ holds prevents the features from becoming too sparse: once $M_{equal}<f_d$, continuing to collect features at higher dimensions would produce feature vectors containing a large number of 0 values, which reduces detection accuracy. Collecting features from multiple dimensions improves data utilization and thus the accuracy of the detection result;
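The control flow of steps 1c) to 1f) can be sketched in Python as follows. Several points are assumptions rather than statements from the patent: the edited sequence is treated as five tokens `[b1, b2, Q, b4, b5]`, a window of size d is taken to be any run of d consecutive tokens that contains Q, and $M_{equal}$ is read as the total site count M divided by the sample count N. All function names are hypothetical.

```python
def feature_type_count(d: int) -> int:
    """Number of feature types at dimension d: f_d = 6 * 4**(d - 1)."""
    return 6 * 4 ** (d - 1)


def window_features(tokens, d, variant_index=2):
    """Windows of d consecutive tokens that include the variant token Q.

    Assumption: the edited sequence is tokenised as [b1, b2, Q, b4, b5].
    """
    first = max(0, variant_index - d + 1)
    last = min(variant_index, len(tokens) - d)
    return ["".join(tokens[s:s + d]) for s in range(first, last + 1)]


def choose_dimension(m_equal: float, rounds: int) -> int:
    """Follow steps 1c)-1f): sample at least `rounds` dimensions, then stop
    at the first d for which m_equal < f_d."""
    d = 1
    while True:
        # Step 1d) would sample every SNV sequence with a window of size d here.
        if d < rounds:                         # step 1e)
            d += 1
        elif m_equal < feature_type_count(d):  # step 1f)
            return d
        else:
            d += 1


tokens = ["A", "C", "G->T", "T", "A"]
print(window_features(tokens, 3))   # ['ACG->T', 'CG->TT', 'G->TTA']
print(528098 / 2761)                # about 191.27, in line with M_equal = 191
print(choose_dimension(191, 3))     # 4, since 191 >= 96 but 191 < 384
```

Under this reading, a window of size d yields d groups for d up to 3, which matches the "d groups of features" in step 1d); whether the patent handles larger windows the same way cannot be told from the text.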
Step 2) Obtain a training sample set and a test sample set:
Step 2a) Count the feature types of each $f_i^{(l)}$ in $S_{di}$ to obtain the feature vector set containing $f_e$ groups of feature vectors, $S_{vec}=\{S_i\mid 1\le i\le f_e\}$, [Figure BDA0003065461830000071], where $S_i$ is the $i$-th group of feature vectors over the $N$ samples and [Figure BDA0003065461830000072] is the $l$-th feature vector in the $i$-th group of feature vectors;
Counting the feature types of each $f_i^{(l)}$ in $S_{di}$ is implemented as follows: let the feature dimension of the sample feature $f_i^{(l)}$ be $d_v$; create a feature vector [Figure BDA0003065461830000074] of dimension [Figure BDA0003065461830000073]; initialize all elements of [Figure BDA0003065461830000075] to 0; and count the number of feature types corresponding to each element of [Figure BDA0003065461830000076] to obtain the feature vector [Figure BDA0003065461830000077] corresponding to $f_i^{(l)}$, where [Figure BDA0003065461830000078];
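The counting procedure above (a zero-initialised vector with one element per feature type) can be illustrated with a short Python sketch. It assumes the six substitution classes follow the common pyrimidine-based convention and that each feature is a string such as `"AC->TG"`; both the enumeration order and the helper names are illustrative choices, not details given in the patent.

```python
from itertools import product

BASES = "ACGT"
# Assumption: six substitution classes, the usual pyrimidine-based convention.
SUBSTITUTIONS = ["C->A", "C->G", "C->T", "T->A", "T->C", "T->G"]


def feature_types(d: int, position: int = 0):
    """Enumerate the 6 * 4**(d-1) feature types for a window of d tokens with
    the substitution token at `position` (0-based)."""
    types = []
    for flank in product(BASES, repeat=d - 1):
        for sub in SUBSTITUTIONS:
            tokens = list(flank[:position]) + [sub] + list(flank[position:])
            types.append("".join(tokens))
    return types


def count_vector(sample_features, types):
    """Zero-initialised vector; element k counts occurrences of types[k].

    Features that are not among `types` are ignored in this sketch.
    """
    index = {t: k for k, t in enumerate(types)}
    vector = [0] * len(types)
    for feature in sample_features:
        if feature in index:
            vector[index[feature]] += 1
    return vector


types = feature_types(1)
print(count_vector(["C->T", "C->T", "T->G"], types))  # [0, 0, 2, 0, 0, 1]
```

With this layout the vector length equals $6\times 4^{d_v-1}$ for a feature of dimension $d_v$, so sparsity grows quickly with the dimension.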
Step 2b) Combine each feature vector in the feature vector set $S_{vec}$ with its corresponding sample tag in $Tag$ to form a sample to be classified, giving the set of samples to be classified [Figure BDA0003065461830000079]. Randomly select more than half of the samples in $S_{sam}$ as the training sample set covering the C cancers [Figure BDA00030654618300000710], and use the remaining samples in $S_{sam}$ as the test sample set covering the C cancers [Figure BDA00030654618300000711], [Figure BDA00030654618300000712], where [Figure BDA00030654618300000713] is the training sample set of the $c$-th cancer, containing $N_c'$ training samples; $p_{n'}$ is the $n'$-th training sample in [Figure BDA00030654618300000714]; [Figure BDA00030654618300000715] is the test sample set of the $c$-th cancer, containing $N_c-N_c'$ test samples; and $q_{n''}$ is the $n''$-th test sample in [Figure BDA00030654618300000716]. In this example, 80% of the samples to be classified in $S_{sam}$ are selected as the training sample set $S_{train}$;
Storing each feature vector in $S_{vec}$ together with its sample tag from $Tag$ as one sample to be classified avoids the time that would otherwise be spent looking up the tag when judging whether the detection result of a sample is correct;
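A minimal Python sketch of the split follows. It assumes the 80% selection is applied within each cancer type, which is one reading of the per-cancer counts $N_c'$ and $N_c-N_c'$; the function name, the fixed seed, and the toy data are illustrative.

```python
import random
from collections import defaultdict


def split_per_cancer(samples, train_fraction=0.8, seed=0):
    """samples: list of (feature_vectors, tag) pairs. Returns (train, test)."""
    by_tag = defaultdict(list)
    for sample in samples:
        by_tag[sample[1]].append(sample)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for tag in sorted(by_tag):
        group = by_tag[tag]
        rng.shuffle(group)
        cut = int(len(group) * train_fraction)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


samples = [([i], "BLCA") for i in range(10)] + [([i], "LUAD") for i in range(5)]
train, test = split_per_cancer(samples)
print(len(train), len(test))  # 12 3
```

Splitting within each cancer type keeps every class represented in both sets; a single global shuffle would also satisfy the wording "more than half" but could leave a small class out of the test set.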
Step 3) Construct the distance calculation function Dist(X, Y) of the classifier G:
[Figure BDA00030654618300000717]
where X and Y denote the sets of $f_e$ feature vectors of any two samples to be classified in $S_{sam}$, $X=\{x_i\mid 1\le i\le f_e\}$, $Y=\{y_i\mid 1\le i\le f_e\}$, [Figure BDA00030654618300000718]; $x_i$ denotes the feature vector belonging to X in the $i$-th group of feature vectors of $S_{vec}$; $y_i$ denotes the feature vector belonging to Y in the $i$-th group of feature vectors of $S_{vec}$; [Figure BDA00030654618300000719] denotes the dimension of the $i$-th group of feature vectors; $x_{ij}$ denotes the $j$-th element of $x_i$; and $y_{ij}$ denotes the $j$-th element of $y_i$, [Figure BDA0003065461830000081];
The distance calculation function Dist(X, Y) aggregates the distances between the feature vectors of multiple feature groups and ensures that each group of feature vectors contributes equally to the detection result;
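The formula for Dist(X, Y) is not legible in this text, so the following Python sketch is only an illustrative stand-in and not the patent's function. It assumes a per-group Euclidean distance divided by the group's dimension, summed over the groups, as one possible way to keep high-dimensional groups from dominating.

```python
import math


def group_distance(x, y):
    """Euclidean distance between two vectors, divided by the vector length
    so that long vectors do not dominate (an assumed normalisation)."""
    if len(x) != len(y):
        raise ValueError("vectors in the same group must have equal length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y))) / len(x)


def dist(X, Y):
    """Sum of per-group normalised distances over the f_e feature groups."""
    if len(X) != len(Y):
        raise ValueError("X and Y must contain the same number of groups")
    return sum(group_distance(x, y) for x, y in zip(X, Y))


X = [[3, 0], [1, 1, 1, 1]]
Y = [[0, 4], [1, 1, 1, 1]]
print(dist(X, Y))  # 2.5
```

Any function with the same interface (two lists of per-group vectors in, one non-negative number out) can be substituted without changing the later classification steps.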
Step 4) Iteratively train the classifier G:
Step 4a) Initialize the iteration count as $r$ and the maximum number of iterations as $R$, $R\ge 200$; let the hyperparameter of the classifier G be $\theta$, with initial value $\theta_0$ and update step $w$; let the maximum accuracy be $T_m$ and its corresponding hyperparameter be $\theta_m$; and set $T_m=0$, $\theta_m=\theta_0$, $r=0$. In this example, $R=500$;
Step 4b) Take the training sample set $S_{train}$ as the input of the classifier G, and use the distance calculation function Dist(X, Y) to compute the distance between every two training samples in $S_{train}$, giving the training sample distance set [Figure BDA0003065461830000082]. Using these distances, classify each training sample $p_{n'}$ in [Figure BDA0003065461830000083] to obtain the detection categories over the C cancers [Figure BDA0003065461830000084], where [Figure BDA0003065461830000085] is the distance between the $x_{tr}$-th training sample and the $y_{tr}$-th training sample in $S_{train}$, [Figure BDA0003065461830000086] is the detection category corresponding to [Figure BDA0003065461830000087], and $t_{n'}$ is the detection category corresponding to $p_{n'}$;
Classifying each training sample $p_{n'}$ is implemented as follows: for the training sample $p_{n'}$, obtain from [Figure BDA0003065461830000088] the set of distances between $p_{n'}$ and the other training samples [Figure BDA0003065461830000089]; select the training samples corresponding to the $\theta_m$ smallest [Figure BDA00030654618300000810]; count their sample tags; and take the cancer class that occurs most often as the cancer detection category $t_{n'}$ of $p_{n'}$;
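The classification rule just described is a nearest-neighbour majority vote in which the hyperparameter plays the role of the neighbour count. A minimal Python sketch, assuming a precomputed square distance matrix; the function name, the parameter `k`, and the toy values are illustrative:

```python
from collections import Counter


def classify_by_neighbours(distances, tags, index, k):
    """Predict the tag of sample `index` from its k nearest other samples.

    distances: square matrix, distances[a][b] = Dist between samples a and b.
    tags: known cancer class tag of every sample.
    """
    others = [j for j in range(len(tags)) if j != index]
    nearest = sorted(others, key=lambda j: distances[index][j])[:k]
    votes = Counter(tags[j] for j in nearest)
    return votes.most_common(1)[0][0]


distances = [
    [0.0, 0.2, 0.3, 0.9],
    [0.2, 0.0, 0.1, 0.8],
    [0.3, 0.1, 0.0, 0.7],
    [0.9, 0.8, 0.7, 0.0],
]
tags = ["LUAD", "LUAD", "LUAD", "BLCA"]
print(classify_by_neighbours(distances, tags, index=0, k=2))  # LUAD
print(classify_by_neighbours(distances, tags, index=3, k=2))  # LUAD
```

The second call shows a misclassification: the only BLCA sample has no BLCA neighbours, so the vote returns LUAD. The text does not say how ties between classes are broken; this sketch simply takes the first class that `Counter.most_common` returns.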
Step 4c) Judge whether each $p_{n'}$ in [Figure BDA00030654618300000811] is consistent with its cancer detection category $t_{n'}$; if so, the detection result of the training sample is considered correct, and otherwise it is considered wrong. This gives the detection accuracy set [Figure BDA00030654618300000812], from which the average accuracy $T_r$ of the $r$-th iteration is calculated, where [Figure BDA00030654618300000813] is the detection accuracy of [Figure BDA00030654618300000814], and [Figure BDA00030654618300000815] is the number of correctly classified training samples in [Figure BDA00030654618300000816], [Figure BDA00030654618300000817], [Figure BDA00030654618300000818];
Step 4d) Judge whether $T_m<T_r$ holds; if so, let $T_m=T_r$ and $\theta_m=\theta$, and perform step 4e); otherwise, perform step 4e) directly;
Checking whether $T_m<T_r$ holds in the above step yields the hyperparameter value with the highest accuracy, which ensures that the trained classifier G' is the best-performing classifier found over the R iterations;
Step 4e) Judge whether $r<R$ holds; if so, let $r=r+1$ and $\theta=\theta+w$, and perform step 4b); otherwise, obtain the trained classifier G';
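Steps 4a) to 4e) describe a linear sweep over the hyperparameter that keeps the value with the highest training accuracy. The Python sketch below assumes the loop continues while $r<R$; the `evaluate` callback and the toy accuracy curve are hypothetical stand-ins for steps 4b) and 4c).

```python
def sweep_hyperparameter(evaluate, theta0, w, R):
    """Steps 4a)-4e): evaluate theta0, theta0 + w, ... and keep the best.

    evaluate(theta) is a hypothetical callback returning the average
    training accuracy T_r obtained with hyperparameter theta.
    """
    best_accuracy, best_theta = 0.0, theta0   # T_m, theta_m
    theta, r = theta0, 0
    while True:
        accuracy = evaluate(theta)            # steps 4b)-4c)
        if best_accuracy < accuracy:          # step 4d)
            best_accuracy, best_theta = accuracy, theta
        if r < R:                             # step 4e)
            r += 1
            theta += w
        else:
            return best_theta, best_accuracy


# Toy accuracy curve peaking at theta = 5, for illustration only.
toy = lambda theta: 1.0 - abs(theta - 5) / 10
print(sweep_hyperparameter(toy, theta0=1, w=1, R=10))  # (5, 1.0)
```

Because the comparison in step 4d) is strict, the earliest hyperparameter value wins when several values reach the same accuracy.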
Step 5) Obtain the cancer detection result:
Take the test sample set $S_{test}$ as the input of the trained classifier G', and use the distance calculation function Dist(X, Y) to compute the distance between every two test samples in $S_{test}$, giving the test sample distance set [Figure BDA0003065461830000091]. Using these distances, classify each test sample $q_{n''}$ in [Figure BDA0003065461830000092] to obtain the detection categories over the C cancers [Figure BDA0003065461830000093], where [Figure BDA0003065461830000094] is the distance between the $x_{te}$-th test sample and the $y_{te}$-th test sample in $S_{test}$, [Figure BDA0003065461830000095] is the detection category corresponding to [Figure BDA0003065461830000096], and $t'_{n''}$ is the detection category corresponding to $q_{n''}$.
The technical effects of the invention are further explained by combining simulation experiments as follows:
1. Simulation conditions:
The hardware platform of the simulation experiment is an Intel(R) Core(TM) i7-8500 CPU with a main frequency of 2.20 GHz and 16 GB of memory. The software platform is macOS 10.15 with R version 3.6.
The data set used in the simulation was collected from the TCGA database and covers 12 cancers, including bladder urothelial carcinoma (BLCA), head and neck squamous cell carcinoma (HNSC), renal papillary cell carcinoma (KIRP), acute myeloid leukemia (LAML), hepatocellular carcinoma (LIHC), lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC), pancreatic cancer (PAAD), prostate cancer (PRAD), rectal adenocarcinoma (READ) and endometrial cancer (UCEC), for a total of 2761 samples. The detection results are verified against the known labels: a detection result is considered correct when it is consistent with the known label and wrong otherwise.
2. Simulation content and result analysis:
The detection accuracy and the range of application of the invention are simulated, and the results are compared with the prior-art cancer detection method based on multiplex protein analysis. The results are shown in Table 1.
TABLE 1
Method          Accuracy    Cancer detection range (number of cancers)
Prior art       88%         5
The invention   97.43%      12
In Table 1, the detection accuracy of the method of the invention is 97.43% and its cancer detection range is 12 cancers. Both figures are higher than those of the prior-art method (88% and 5 cancers), which shows that the method obtains better cancer detection results and widens the cancer detection range.
The above simulation experiments show that, when the method is used for cancer detection, the multi-dimensional features of the SNV sites are obtained first; the training sample set and the test sample set are obtained second; the distance calculation function Dist(X, Y) of the classifier G is constructed third; the classifier G is iteratively trained fourth; and the cancer detection result is obtained last.

Claims (4)

1. A cancer detection method based on multi-dimensional single nucleotide variation characteristics is characterized by comprising the following steps:
(1) Obtain multi-dimensional features of single nucleotide variation (SNV) sites:
(1a) Randomly select the SNV sites of C cancers from the TCGA database to form the SNV site set $S_{site}=\{X_c \mid 1\le c\le C\}$, where $C>0$; $X_c=\{x_{cn}\mid 1\le n\le N_c\}$ denotes the SNV sites of the $c$-th cancer; $x_{cn}$ denotes the SNV sites of the $n$-th sequencing sample of that cancer; $N_c>100$ is the number of sequencing samples of the $c$-th cancer; [Figure FDA0003065461820000011]; $a_{cnm}$ denotes the $m$-th SNV site; and [Figure FDA0003065461820000012] denotes the number of SNV sites, [Figure FDA0003065461820000013]. Obtain the sample tag set $Tag=\{Tag_l\mid 1\le l\le N\}$ of the cancer sequencing samples, where $Tag_l$ is the cancer class tag of the $l$-th sequencing sample, $N=\sum_{c=1}^{C}N_c$, and $\Sigma$ denotes summation;
(1b) Perform sequence editing on each SNV site $a_{cnm}$ to obtain the SNV sequence set corresponding to $S_{site}$, $S_{seq}=\{X_c'\mid 1\le c\le C\}$, $X_c'=\{x_{cn}'\mid 1\le n\le N_c\}$, [Figure FDA0003065461820000015], where $X_c'$, $x_{cn}'$ and $a_{cnm}'$ denote the SNV sequences corresponding to $X_c$, $x_{cn}$ and $a_{cnm}$, respectively;
(1c) initializing the number of sampling passes as I, where I ≥ 3, and the feature dimension as d, and letting d = 1;
(1d) sampling each SNV sequence a_cnm′ from front to back through a sliding window of size d × 1 to obtain a feature set S_temp = {F_h | 1 ≤ h ≤ d} containing d groups of features, where F_h denotes the h-th group of features covering N samples,
Figure FDA0003065461820000016
Figure FDA0003065461820000017
is the feature of the i-th sample in the h-th group of features, and the number of feature types of F_h is f_d = 6 × 4^(d−1);
(1e) judging whether d < I holds; if so, letting d = d + 1 and executing step (1d); otherwise, calculating the average number M_equal of SNV sites per cancer sequencing sample in S_site and executing step (1f), wherein
Figure FDA0003065461820000018
(1f) judging whether M_equal < f_d holds; if so, obtaining a multi-dimensional feature set S_di = {F_i | 1 ≤ i ≤ f_e} containing f_e groups of features, F_i = {f_i^(l) | 1 ≤ l ≤ N}; otherwise, letting d = d + 1 and executing step (1d), wherein
Figure FDA0003065461820000021
F_i denotes the i-th group of features covering N samples, and f_i^(l) is the feature of the l-th sample in F_i;
(2) acquiring a training sample set and a testing sample set:
(2a) counting the feature types of each f_i^(l) in S_di to obtain a feature vector set S_vec = {S_i | 1 ≤ i ≤ f_e} containing f_e groups of feature vectors,
Figure FDA0003065461820000022
where S_i is a group of feature vectors covering N samples, and
Figure FDA0003065461820000023
is the l-th feature vector in the i-th group of feature vectors;
(2b) combining each feature vector in the feature vector set S_vec with its corresponding sample tag in Tag to form a sample to be classified, thereby obtaining a set of samples to be classified
Figure FDA0003065461820000024
randomly selecting more than half of the samples to be classified in S_sam as a training sample set containing C cancers
Figure FDA0003065461820000025
and taking the remaining samples to be classified in S_sam as a test sample set containing C cancers
Figure FDA0003065461820000026
wherein
Figure FDA0003065461820000027
is the training sample set of the c-th cancer, containing N_c′ training samples, p_n′ is the n′-th training sample in
Figure FDA0003065461820000028
,
Figure FDA0003065461820000029
is the test sample set of the c-th cancer, containing N_c − N_c′ test samples, and q_n″ is the n″-th test sample in
Figure FDA00030654618200000210
;
(3) constructing the distance calculation function Dist(X, Y) of the classifier G:
Figure FDA00030654618200000211
where X and Y denote the sets of f_e feature vectors of any two samples to be classified in S_sam, X = {x_i | 1 ≤ i ≤ f_e}, Y = {y_i | 1 ≤ i ≤ f_e},
Figure FDA00030654618200000212
x_i denotes the feature vector belonging to X in the i-th group of feature vectors of S_vec, y_i denotes the feature vector belonging to Y in the i-th group of feature vectors of S_vec,
Figure FDA00030654618200000213
denotes the dimension of the i-th group of feature vectors, x_ij denotes the j-th element of x_i, y_ij denotes the j-th element of y_i,
Figure FDA00030654618200000214
(4) performing iterative training on the classifier G:
(4a) initializing the iteration counter as r and the maximum number of iterations as R, where R ≥ 200, the hyperparameter of the classifier G as θ with initial value θ_0 and update step w, the maximum accuracy as T_m, and the hyperparameter corresponding to T_m as θ_m, and letting T_m = 0, θ_m = θ_0, r = 0;
(4b) taking the training sample set S_train as the input of the classifier G, computing the distance between every two training samples in S_train with the distance calculation function Dist(X, Y) to obtain a set of training sample distances
Figure FDA0003065461820000031
and classifying, by means of the sample distances, each training sample p_n′ in
Figure FDA0003065461820000032
to obtain the detection categories of the C cancers
Figure FDA0003065461820000033
wherein
Figure FDA0003065461820000034
is the distance between the x_tr-th and the y_tr-th training samples in S_train,
Figure FDA0003065461820000035
Figure FDA0003065461820000036
is the detection category corresponding to
Figure FDA0003065461820000037
and t_n′ is the detection category corresponding to p_n′;
(4c) judging whether each p_n′ in
Figure FDA0003065461820000038
is consistent with its corresponding cancer detection category t_n′; if so, the detection result of the training sample is deemed correct, otherwise it is deemed wrong, thereby obtaining a set of detection accuracies
Figure FDA0003065461820000039
and calculating the average accuracy T_r of the r-th iteration, wherein
Figure FDA00030654618200000310
is the detection accuracy of
Figure FDA00030654618200000311
,
Figure FDA00030654618200000312
Figure FDA00030654618200000313
is the number of correctly classified training samples in
Figure FDA00030654618200000314
,
Figure FDA00030654618200000315
(4d) judging whether T_m < T_r holds; if so, letting T_m = T_r and θ_m = θ, and executing step (4e); otherwise, executing step (4e) directly;
(4e) judging whether r < R holds; if so, letting r = r + 1 and θ = θ + w, and executing step (4b); otherwise, obtaining the trained classifier G′;
(5) obtaining the cancer detection result:
taking the test sample set S_test as the input of the trained classifier G′, computing the distance between every two test samples in S_test with the distance calculation function Dist(X, Y) to obtain a set of test sample distances
Figure FDA00030654618200000316
and classifying, by means of the sample distances, each test sample q_n″ in
Figure FDA00030654618200000317
to obtain the detection categories of the C cancers
Figure FDA00030654618200000318
wherein
Figure FDA00030654618200000319
is the distance between the x_te-th and the y_te-th test samples in S_test,
Figure FDA00030654618200000320
is the detection category corresponding to
Figure FDA00030654618200000321
and t′_n″ is the detection category corresponding to q_n″.
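Steps (4a) to (4e) amount to a linear sweep over the hyperparameter θ that keeps the value giving the highest average training accuracy. The following Python sketch illustrates that control flow only. The function name `sweep_hyperparameter` and the `evaluate` callback, which stands in for the distance computation and classification of steps (4b) and (4c), are illustrative assumptions and do not appear in the claims.

```python
from typing import Callable, Tuple


def sweep_hyperparameter(
    evaluate: Callable[[int], float],  # hypothetical stand-in: returns T_r for a given theta
    theta0: int,
    w: int,
    R: int,
) -> Tuple[int, float]:
    """Return (theta_m, T_m) after the sweep described in steps (4a)-(4e)."""
    theta, theta_m, T_m, r = theta0, theta0, 0.0, 0  # step (4a)
    while True:
        T_r = evaluate(theta)          # steps (4b)-(4c): average accuracy for this theta
        if T_m < T_r:                  # step (4d): remember the best theta so far
            T_m, theta_m = T_r, theta
        if r < R:                      # step (4e): advance or stop
            r += 1
            theta += w
        else:
            return theta_m, T_m


# Toy accuracy curve that peaks at theta = 5 (made up for illustration).
best_theta, best_acc = sweep_hyperparameter(lambda k: 1.0 - abs(k - 5) / 10, 1, 2, 4)
print(best_theta, best_acc)  # 5 1.0
```

The toy call uses R = 4 for brevity, whereas the claim requires R ≥ 200. Note that `evaluate` is called R + 1 times, once for each r from 0 to R.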
2. The cancer detection method based on multi-dimensional single nucleotide variation characteristics according to claim 1, wherein the sequence editing of each SNV site a_cnm in step (1b) is implemented as follows:
for an SNV site a_cnm with sequence Seq = b_1 b_2 b_3 b_4 b_5 and minor allele b_3′, representing the single nucleotide variation of the central original allele b_3 as Q = b_3->b_3′, and concatenating b_1, b_2, Q, b_4 and b_5 as character strings to obtain the SNV sequence a_cnm′ = b_1 b_2 Q b_4 b_5 = b_1 b_2 b_3->b_3′ b_4 b_5 corresponding to a_cnm, where b_1, b_2, b_4 and b_5 are the bases flanking b_3 and "->" denotes a single nucleotide variation.
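The string splicing in claim 2 can be illustrated with a short Python sketch. The function name `edit_snv_sequence` and the example bases are hypothetical; only the splicing rule b_1 b_2 (b_3->b_3′) b_4 b_5 comes from the claim.

```python
def edit_snv_sequence(seq: str, minor_allele: str) -> str:
    """Splice b1 b2 (b3->b3') b4 b5 from a 5-base context and a minor allele."""
    if len(seq) != 5 or len(minor_allele) != 1:
        raise ValueError("expected a 5-base sequence and a 1-base minor allele")
    b1, b2, b3, b4, b5 = seq
    q = f"{b3}->{minor_allele}"  # Q = b3->b3'
    return b1 + b2 + q + b4 + b5


print(edit_snv_sequence("ACGTA", "T"))  # ACG->TTA
```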
3. The cancer detection method based on multi-dimensional single nucleotide variation characteristics according to claim 1, wherein the counting of the feature types of each f_i^(l) in S_di in step (2a) is implemented as follows:
for a sample feature f_i^(l) with feature dimension d_v, establishing, with dimension
Figure FDA0003065461820000041
a feature vector
Figure FDA0003065461820000042
initializing all elements of
Figure FDA0003065461820000043
to 0, and counting the number of occurrences of the feature type corresponding to each element of
Figure FDA0003065461820000044
to obtain the feature vector corresponding to f_i^(l)
Figure FDA0003065461820000045
wherein
Figure FDA0003065461820000046
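Claim 3 describes turning a sample feature into a count vector: one element per feature type, initialized to zero and then incremented per occurrence. The sketch below shows that counting step in Python. The function name and the six-entry vocabulary `TYPES_D1` are assumptions chosen only to match f_d = 6 × 4^(d−1) = 6 for d = 1; the source gives the vector dimension and the feature types in figures that are not reproduced here.

```python
from typing import List, Sequence


def count_feature_types(observed: Sequence[str], feature_types: Sequence[str]) -> List[int]:
    """Build a zero-initialized vector and count occurrences of each feature type."""
    index = {ftype: pos for pos, ftype in enumerate(feature_types)}
    vector = [0] * len(feature_types)   # initialize all elements to 0
    for ftype in observed:
        vector[index[ftype]] += 1       # raises KeyError on an unknown feature type
    return vector


# Hypothetical d = 1 vocabulary of six substitution classes (assumed, not from the source).
TYPES_D1 = ["C->A", "C->G", "C->T", "T->A", "T->C", "T->G"]
sample = ["C->T", "C->T", "T->G", "C->A"]
print(count_feature_types(sample, TYPES_D1))  # [1, 0, 2, 0, 0, 1]
```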
4. The cancer detection method based on multi-dimensional single nucleotide variation characteristics according to claim 1, wherein the classification of each training sample p_n′ in step (4b) is implemented as follows:
for a training sample p_n′, obtaining from
Figure FDA0003065461820000047
the set of distances between p_n′ and the other training samples
Figure FDA0003065461820000048
selecting the training samples corresponding to the θ_m smallest
Figure FDA0003065461820000049
counting their sample tags, and taking the cancer class that occurs most often as the cancer detection category t_n′ of p_n′.
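The classification rule in claim 4 is a nearest-neighbour majority vote, with the hyperparameter playing the role of the neighbour count. A minimal Python sketch follows. It takes the distances as given rather than computing them with Dist(X, Y), whose formula appears only in a figure; the function name, the parameter `k` and the example tags are illustrative assumptions.

```python
from collections import Counter
from typing import Sequence


def classify_by_neighbours(distances: Sequence[float], tags: Sequence[str], k: int) -> str:
    """Majority vote over the k nearest training samples.

    distances[i] is the distance from the query sample to training sample i,
    and tags[i] is that training sample's cancer class tag.
    """
    if len(distances) != len(tags) or not 1 <= k <= len(distances):
        raise ValueError("need matching inputs and 1 <= k <= number of samples")
    nearest = sorted(range(len(distances)), key=lambda i: distances[i])[:k]
    votes = Counter(tags[i] for i in nearest)
    # On a tie, most_common keeps the tag encountered first among the nearest samples.
    return votes.most_common(1)[0][0]


# Illustrative tags only; the source does not list which cancers were used.
dists = [0.9, 0.1, 0.4, 0.3, 0.8]
labels = ["BRCA", "LUAD", "LUAD", "BRCA", "BRCA"]
print(classify_by_neighbours(dists, labels, k=3))  # LUAD
```

With k = 3 the nearest samples are those at distances 0.1, 0.3 and 0.4, whose tags are LUAD, BRCA and LUAD, so the vote returns LUAD. Choosing k = 5 would instead return BRCA, which is why the sweep over the hyperparameter matters.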
CN202110524968.4A 2021-05-14 2021-05-14 Cancer detection method based on multi-dimensional single nucleotide variation characteristics Pending CN113223613A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110524968.4A CN113223613A (en) 2021-05-14 2021-05-14 Cancer detection method based on multi-dimensional single nucleotide variation characteristics

Publications (1)

Publication Number Publication Date
CN113223613A true CN113223613A (en) 2021-08-06

Family

ID=77095606

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110524968.4A Pending CN113223613A (en) 2021-05-14 2021-05-14 Cancer detection method based on multi-dimensional single nucleotide variation characteristics

Country Status (1)

Country Link
CN (1) CN113223613A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114242158A (en) * 2022-02-21 2022-03-25 臻和(北京)生物科技有限公司 Method, device, storage medium and equipment for detecting ctDNA single nucleotide variation site

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104504305A (en) * 2014-12-24 2015-04-08 西安电子科技大学 Method for monitoring gene expression data classification
CN110211632A (en) * 2019-05-06 2019-09-06 西安电子科技大学 A kind of nucleotide unit point mutation detection method neural network based
CN111278993A (en) * 2017-09-15 2020-06-12 加利福尼亚大学董事会 Somatic cell mononucleotide variants from cell-free nucleic acids and applications for minimal residual lesion monitoring
CN112687329A (en) * 2019-10-17 2021-04-20 中国科学技术大学 Cancer prediction system based on non-cancer tissue mutation information and construction method thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
BO LI 等: "Identification and Validation of the SNV Biomarkers Based on Multi-Dimensional Patterns", 《ARXIV》 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210806