CN113223613A

CN113223613A - Cancer detection method based on multi-dimensional single nucleotide variation characteristics

Info

Publication number: CN113223613A
Application number: CN202110524968.4A
Authority: CN
Inventors: 鱼亮; 李博
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2021-05-14
Filing date: 2021-05-14
Publication date: 2021-08-06

Abstract

The invention provides a cancer detection method based on multi-dimensional single nucleotide variation (SNV) features, which addresses the low detection accuracy and narrow detection range of the prior art. The method comprises the following steps: (1) obtaining the multi-dimensional features of SNV sites; (2) obtaining a training sample set and a test sample set; (3) constructing the distance calculation function Dist(X, Y) of a classifier G; (4) iteratively training the classifier G; and (5) obtaining the cancer detection result. The invention uses larger training and test sets and collects SNV features from several feature dimensions, which increases the information that the features carry about each cancer detection sample. Because SNV data of multiple cancers are used, the trained detection model can detect multiple cancers simultaneously, which reduces repeated testing.

Description

Cancer detection method based on multi-dimensional single nucleotide variation characteristics

Technical Field

The invention belongs to the technical field of bioinformatics and relates to a cancer detection method, in particular to a cancer detection method based on multi-dimensional single nucleotide variation features, which can be used to classify single nucleotide variation data of cancers.

Background

In recent years, cancer has threatened human health as a major cause of shortened life expectancy worldwide. Atypical clinical manifestations and histopathological findings make cancer difficult to detect. Because uniform definitions and related indexes are lacking, early cancer detection mostly depends on the experience of doctors or on the results of a large number of test items. Individual bias is therefore hard to avoid, and detection is slow, costly, and not very accurate. A high-performance cancer detection method applicable to multiple cancers is therefore important: it provides knowledge support for doctors, lets them monitor improvement, deterioration, and relapse of a cancer, and reduces the time and money spent on a large number of complex test items. With the widespread application of machine learning in many fields, various machine-learning-based cancer detection methods have emerged.

In 2020, Bockmayr T et al. published in Laboratory Investigation the article "Multiclass cancer classification in fresh frozen and formalin-fixed paraffin-embedded tissue by DigiWest multiplex protein analysis", which discloses a cancer detection method based on multiplex protein analysis. The method first tests a number of antibodies on a set of formalin-fixed paraffin-embedded (FFPE) samples, selects as features the antibodies that produce clearly correlated signals in fresh frozen and FFPE primary tumor samples, and uses these features to develop a support vector machine algorithm for 5 cancers. Its disadvantages are that the available data volume is small and the features are collected in a single way, so the detection accuracy is low. In addition, the study targets only a few specific cancers, which limits its results: more cancers cannot be detected simultaneously, and a large number of repeated tests is required.

Disclosure of Invention

The invention aims to overcome the defects in the prior art, and provides a cancer detection method based on multi-dimensional single nucleotide variation characteristics, which is used for solving the technical problems of low detection accuracy and narrow detection range in the prior art.

In order to achieve the purpose, the technical scheme adopted by the invention comprises the following steps:

(1) obtaining the multidimensional characteristics of the SNV locus of the single nucleotide variation:

(1a) randomly selecting the SNV sites of C cancers from the TCGA database to form an SNV site set $S_{site}=\{X_c \mid 1\le c\le C\}$, where $C>0$; $X_c=\{x_{cn} \mid 1\le n\le N_c\}$ denotes the SNV sites of the c-th cancer; $x_{cn}$ denotes the SNV sites of the n-th cancer sequencing sample; $N_c$ is the number of cancer sequencing samples of the c-th cancer, $N_c>100$; $a_{cnm}$ denotes the m-th SNV site in $x_{cn}$, with m running from 1 to the number of SNV sites of that sample; and obtaining the sample tag set $Tag=\{Tag_l \mid 1\le l\le N\}$ of the cancer sequencing samples, where $Tag_l$ is the cancer class tag of the l-th cancer sequencing sample, $N=\sum_{c=1}^{C}N_c$, and $\Sigma$ denotes summation;

(1b) performing sequence editing on each SNV site $a_{cnm}$ to obtain the SNV sequence set $S_{seq}=\{X_c' \mid 1\le c\le C\}$ corresponding to $S_{site}$, where $X_c'=\{x_{cn}' \mid 1\le n\le N_c\}$; $X_c'$ denotes the SNV sequences corresponding to $X_c$, $x_{cn}'$ denotes the SNV sequences corresponding to $x_{cn}$, and $a_{cnm}'$ denotes the SNV sequence corresponding to $a_{cnm}$;

(1c) initializing the number of sampling rounds as $I$, $I\ge 3$, and the feature dimension as $d$, and letting $d=1$;

(1d) sampling each SNV sequence $a_{cnm}'$ from front to back with a sliding window of size $d\times 1$ to obtain a feature set $S_{temp}=\{F_h \mid 1\le h\le d\}$ containing d groups of features, where $F_h$ denotes the h-th feature group, covering the N samples; each element of $F_h$ is the feature of the l-th sample in the h-th feature group; and the number of feature types of $F_h$ is $f_d=6\times 4^{d-1}$;

(1e) judging whether $d<I$ holds; if so, letting $d=d+1$ and executing step (1d); otherwise, calculating the average number of SNV sites per cancer sequencing sample in $S_{site}$, $M_{equal}$, that is, the total number of SNV sites divided by the number of samples $N$, and executing step (1f);

(1f) judging whether $M_{equal}<f_d$ holds; if so, obtaining a multi-dimensional feature set $S_{di}=\{F_i \mid 1\le i\le f_e\}$ containing $f_e$ groups of features, where $F_i=\{f_i^{(l)} \mid 1\le l\le N\}$ denotes the i-th feature group, covering the N samples, and $f_i^{(l)}$ is the feature of the l-th sample in $F_i$; otherwise, letting $d=d+1$ and executing step (1d);

(2) acquiring a training sample set and a testing sample set:

(2a) counting the number of feature types of each $f_i^{(l)}$ in $S_{di}$ to obtain a feature vector set $S_{vec}=\{S_i \mid 1\le i\le f_e\}$ containing $f_e$ groups of feature vectors, where $S_i$ is the i-th group of feature vectors, covering the N samples, and its l-th element is the l-th feature vector in the i-th group;

(2b) combining the feature vectors in the feature vector set $S_{vec}$ with the corresponding sample tags in Tag to form samples to be classified, obtaining a set $S_{sam}$ of samples to be classified; randomly selecting more than half of the samples to be classified in $S_{sam}$ as a training sample set $S_{train}$ containing C cancers, and taking the remaining samples to be classified in $S_{sam}$ as a test sample set $S_{test}$ containing C cancers, where the training subset of the c-th cancer contains $N_c'$ training samples and $p_{n'}$ is its n'-th training sample, and the test subset of the c-th cancer contains $N_c-N_c'$ test samples and $q_{n''}$ is its n''-th test sample;

(3) constructing the distance calculation function Dist(X, Y) of the classifier G:

where X and Y denote the sets of $f_e$ groups of feature vectors of any two samples to be classified in $S_{sam}$, $X=\{x_i \mid 1\le i\le f_e\}$, $Y=\{y_i \mid 1\le i\le f_e\}$; $x_i$ denotes the feature vector belonging to X in the i-th group of feature vectors of $S_{vec}$, and $y_i$ denotes the feature vector belonging to Y in the i-th group; Dist(X, Y) also depends on the dimension of the i-th group of feature vectors; $x_{ij}$ denotes the j-th element of $x_i$, and $y_{ij}$ denotes the j-th element of $y_i$;

(4) performing iterative training on the classifier G:

(4a) initializing the iteration count as $r$ and the maximum number of iterations as $R$, $R\ge 200$; the hyperparameter of the classifier G as $\theta$, with initial value $\theta_0$ and update step $w$; the maximum accuracy as $T_m$ and the hyperparameter corresponding to $T_m$ as $\theta_m$; and letting $T_m=0$, $\theta_m=\theta_0$, $r=0$;

(4b) taking the training sample set $S_{train}$ as the input of the classifier G, computing the distance between every two training samples in $S_{train}$ with the distance calculation function Dist(X, Y) to obtain a set of training sample distances, and classifying each training sample $p_{n'}$ according to these distances to obtain the detection categories of the C cancers, where each element of the distance set is the distance between the $x_{tr}$-th training sample and the $y_{tr}$-th training sample in $S_{train}$, and $t_{n'}$ is the detection category corresponding to $p_{n'}$;

(4c) judging whether the tag of each training sample $p_{n'}$ is consistent with its cancer detection category $t_{n'}$; if so, regarding the detection result of the training sample as correct, otherwise regarding it as wrong; obtaining a set of detection accuracies, one per cancer, each computed from the number of correctly classified training samples of that cancer; and calculating the average accuracy $T_r$ of the r-th iteration;

(4d) judging whether $T_m<T_r$ holds; if so, letting $T_m=T_r$ and $\theta_m=\theta$, and executing step (4e); otherwise, executing step (4e) directly;

(4e) judging whether $r<R$ holds; if so, letting $r=r+1$ and $\theta=\theta+w$, and executing step (4b); otherwise, obtaining the trained classifier G';

(5) obtaining the detection result of the cancer:

taking the test sample set $S_{test}$ as the input of the trained classifier G', computing the distance between every two test samples in $S_{test}$ with the distance calculation function Dist(X, Y) to obtain a set of test sample distances, and classifying each test sample $q_{n''}$ according to these distances to obtain the detection categories of the C cancers, where each element of the distance set is the distance between the $x_{te}$-th test sample and the $y_{te}$-th test sample in $S_{test}$, and $t'_{n''}$ is the detection category corresponding to $q_{n''}$.

Compared with the prior art, the invention has the following advantages:

1. The invention uses a large volume of SNV data and collects multi-dimensional SNV features from different feature dimensions, which increases the information that the features carry about each cancer detection sample and thus improves the accuracy of the detection result.

2. The invention uses SNV data of multiple cancers, so the trained detection model can detect multiple cancers simultaneously. Compared with the prior art, which can detect only a few specific cancers, this reduces repeated testing and widens the range of detectable cancers.

Drawings

FIG. 1 is a flow chart of an implementation of the present invention.

Detailed Description

The invention is described in further detail below with reference to the drawing and a specific example. It should be understood that the invention does not fall under the non-patentable subject matter defined in Article 25 of the Patent Law and complies with Article 2 of the Patent Law:

referring to fig. 1, the present invention includes the steps of:

step 1) obtaining the multidimensional characteristics of the SNV locus of the single nucleotide variation:

step 1a) randomly selecting the SNV sites of C cancers from the TCGA database to form an SNV site set $S_{site}=\{X_c \mid 1\le c\le C\}$, where $C>0$; $X_c=\{x_{cn} \mid 1\le n\le N_c\}$ denotes the SNV sites of the c-th cancer; $x_{cn}$ denotes the SNV sites of the n-th cancer sequencing sample; $N_c$ is the number of cancer sequencing samples of the c-th cancer, $N_c>100$; $a_{cnm}$ denotes the m-th SNV site in $x_{cn}$, with m running from 1 to the number of SNV sites of that sample; the total number of samples is $N=\sum_{c=1}^{C}N_c$, where $\Sigma$ denotes summation; in this example, $C=12$ and $N=2761$;

when collecting the SNV sites, the SNV sites of only 12 cancers are downloaded, because the data are screened to ensure data quality;

step 1b) performing sequence editing on each SNV site $a_{cnm}$ to obtain the SNV sequence set $S_{seq}=\{X_c' \mid 1\le c\le C\}$ corresponding to $S_{site}$, where $X_c'=\{x_{cn}' \mid 1\le n\le N_c\}$;

the sequence editing of each SNV site $a_{cnm}$ is implemented as follows: let the SNV site $a_{cnm}$ contain the sequence $Seq=b_1b_2b_3b_4b_5$ and the minor allele $b_3'$; the single nucleotide variation of the original allele $b_3$ is represented as $Q=b_3\texttt{->}b_3'$, and $b_1$, $b_2$, $Q$, $b_4$ and $b_5$ are concatenated as strings to obtain the SNV sequence corresponding to $a_{cnm}$, $a_{cnm}'=b_1b_2Qb_4b_5=b_1b_2b_3\texttt{->}b_3'b_4b_5$, where $b_1$, $b_2$, $b_4$ and $b_5$ are the bases adjacent to $b_3$, and "->" denotes the single nucleotide variation;

this sequence editing step avoids the time lost to repeated operations, such as string concatenation, during feature collection;
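The string splicing of step 1b) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `edit_snv_sequence` and the input checks are assumptions added for the example, and the five-base context and minor allele are passed in as plain strings.

```python
def edit_snv_sequence(seq: str, minor_allele: str) -> str:
    """Build a' = b1 b2 (b3->b3') b4 b5 from Seq = b1b2b3b4b5 and b3'."""
    if len(seq) != 5:
        raise ValueError("expected a five-base context b1..b5")
    b1, b2, b3, b4, b5 = seq
    if minor_allele == b3:
        # Hypothetical sanity check: a variation must change the base.
        raise ValueError("minor allele must differ from the original allele")
    q = f"{b3}->{minor_allele}"      # Q = b3->b3'
    return b1 + b2 + q + b4 + b5     # string splicing of b1, b2, Q, b4, b5


print(edit_snv_sequence("ACGTA", "T"))  # prints ACG->TTA
```

Storing the spliced string once per site means later sampling steps can read it directly instead of rebuilding it.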

step 1c) initializing the number of sampling rounds as $I$, $I\ge 3$, and the feature dimension as $d$, and letting $d=1$; in this example, $I=3$;

choosing a reasonable value of $I$ at initialization helps avoid overfitting to some extent;

step 1d) sampling each SNV sequence $a_{cnm}'$ from front to back with a sliding window of size $d\times 1$ to obtain a feature set $S_{temp}=\{F_h \mid 1\le h\le d\}$ containing d groups of features, where $F_h$ denotes the h-th feature group, covering the N samples, and the number of feature types of $F_h$ is $f_d=6\times 4^{d-1}$;

step 1e) judging whether $d<I$ holds; if so, letting $d=d+1$ and executing step 1d); otherwise, calculating the average number of SNV sites per cancer sequencing sample in $S_{site}$, $M_{equal}$, and executing step 1f); in this example, the total number of SNV sites is $M=528098$ and $M_{equal}=191$;

step 1f) judging whether $M_{equal}<f_d$ holds; if so, obtaining a multi-dimensional feature set $S_{di}=\{F_i \mid 1\le i\le f_e\}$ containing $f_e$ groups of features, where $F_i=\{f_i^{(l)} \mid 1\le l\le N\}$ denotes the i-th feature group, covering the N samples, and $f_i^{(l)}$ is the feature of the l-th sample in $F_i$; otherwise, letting $d=d+1$ and executing step 1d);

judging whether $M_{equal}<f_d$ holds prevents the features from becoming too sparse: once $M_{equal}<f_d$, collecting features at still higher dimensions would produce feature vectors containing a large number of zero values, which reduces the detection accuracy; collecting features from multiple dimensions improves data utilization and thereby the accuracy of the detection result;
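The stopping rule of steps 1c) to 1f) can be traced numerically with the figures of this example. The sketch below only follows the control flow; the actual window sampling of step 1d) is left out, the function names are hypothetical, and rounding $M_{equal}$ down is an assumption that happens to reproduce the stated value 191.

```python
def feature_type_count(d: int) -> int:
    """Number of feature types at dimension d: f_d = 6 * 4^(d-1)."""
    return 6 * 4 ** (d - 1)


def final_dimension(m_equal: int, min_rounds: int) -> int:
    """Follow steps 1c)-1f) and return the dimension d at which sampling stops."""
    d = 1
    while True:
        # Step 1d) would sample every SNV sequence with window size d here.
        if d < min_rounds:                      # step 1e)
            d += 1
        elif m_equal < feature_type_count(d):   # step 1f)
            return d
        else:
            d += 1


total_sites, n_samples = 528098, 2761   # M and N from this example
m_equal = total_sites // n_samples      # assumption: rounded down
print(m_equal)                                        # prints 191
print([feature_type_count(d) for d in range(1, 5)])   # prints [6, 24, 96, 384]
print(final_dimension(m_equal, min_rounds=3))         # prints 4
```

With $I=3$ the loop passes through d = 1, 2, 3 unconditionally, finds 191 is not below $f_3=96$, and stops at d = 4, where $f_4=384$ exceeds the average number of sites per sample.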

step 2) obtaining a training sample set and a testing sample set:

step 2a) counting the number of feature types of each $f_i^{(l)}$ in $S_{di}$ to obtain a feature vector set $S_{vec}=\{S_i \mid 1\le i\le f_e\}$ containing $f_e$ groups of feature vectors, where $S_i$ is the i-th group of feature vectors, covering the N samples, and its l-th element is the l-th feature vector in the i-th group;

counting the number of feature types of each $f_i^{(l)}$ in $S_{di}$ is implemented as follows: let the feature dimension of the sample feature $f_i^{(l)}$ be $d_v$; establish a feature vector whose dimension equals the number of feature types at dimension $d_v$; initialize all elements of the feature vector to 0; and count the number of occurrences of the feature type corresponding to each element to obtain the feature vector corresponding to $f_i^{(l)}$;
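The counting in step 2a) amounts to building a histogram over a fixed list of feature types. The sketch below assumes, for illustration only, that the six feature types at $d_v=1$ are the six substitution classes listed in `FEATURE_TYPES`; the source states only that there are $6\times 4^{d-1}$ types, so this vocabulary and the function name are hypothetical.

```python
# Assumed vocabulary for d_v = 1 (illustrative, not taken from the source).
FEATURE_TYPES = ["C->A", "C->G", "C->T", "T->A", "T->C", "T->G"]


def to_feature_vector(sample_features, feature_types=FEATURE_TYPES):
    """Count how often each feature type occurs in one sample."""
    position = {t: k for k, t in enumerate(feature_types)}
    vector = [0] * len(feature_types)        # initialise all elements to 0
    for feature in sample_features:
        vector[position[feature]] += 1       # count the matching feature type
    return vector


print(to_feature_vector(["C->T", "T->G", "C->T"]))  # prints [0, 0, 2, 0, 0, 1]
```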

step 2b) combining the feature vectors in the feature vector set $S_{vec}$ with the corresponding sample tags in Tag to form samples to be classified, obtaining a set $S_{sam}$ of samples to be classified, and dividing $S_{sam}$ into a training sample set $S_{train}$ and a test sample set $S_{test}$, where $p_{n'}$ is the n'-th training sample of the training subset of the c-th cancer and $q_{n''}$ is the n''-th test sample of the test subset of the c-th cancer; in this example, 80% of the samples to be classified in $S_{sam}$ are selected as the training sample set $S_{train}$;

forming each sample to be classified from a feature vector in $S_{vec}$ and its sample tag in Tag avoids the time lost to looking up the sample tag when judging whether the detection result of a sample is correct;
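The 80% split of this example can be sketched as below. Splitting within each cancer class, so that every cancer appears in both sets, is an assumption suggested by the per-cancer training and test subsets; the function name, the fixed seed, and the toy data are hypothetical.

```python
import random


def split_by_class(samples, train_fraction=0.8, seed=0):
    """Randomly split (feature_vector, tag) pairs per cancer class."""
    rng = random.Random(seed)
    by_class = {}
    for vector, tag in samples:
        by_class.setdefault(tag, []).append((vector, tag))
    train, test = [], []
    for group in by_class.values():
        rng.shuffle(group)
        cut = int(len(group) * train_fraction)
        train.extend(group[:cut])   # training samples of this cancer
        test.extend(group[cut:])    # remaining samples become test samples
    return train, test


toy = [([i], "BLCA") for i in range(10)] + [([i], "LUAD") for i in range(10)]
train, test = split_by_class(toy)
print(len(train), len(test))  # prints 16 4
```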

step 3), constructing a distance calculation function Dist (X, Y) of the classifier G:

the distance calculation function Dist(X, Y) aggregates the distances between the feature vectors of multiple feature groups and ensures that every group of feature vectors contributes equally to the detection result;
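The formula of Dist(X, Y) is not reproduced in this text, so the following is only one possible reading, stated as an assumption: a per-group absolute difference divided by the dimension of that group, summed over all groups, so that a 384-dimensional group does not outweigh a 6-dimensional one. The function name `dist` and the normalisation are illustrative choices, not the patented formula.

```python
def dist(x_groups, y_groups):
    """Sum over feature groups of the dimension-normalised L1 distance."""
    if len(x_groups) != len(y_groups):
        raise ValueError("X and Y must contain the same number of groups")
    total = 0.0
    for x_i, y_i in zip(x_groups, y_groups):
        if len(x_i) != len(y_i):
            raise ValueError("paired feature vectors must share a dimension")
        # Dividing by the group dimension equalises each group's contribution.
        total += sum(abs(a - b) for a, b in zip(x_i, y_i)) / len(x_i)
    return total


X = [[1, 0], [2, 2, 0, 0]]
Y = [[0, 1], [2, 0, 0, 2]]
print(dist(X, Y))  # prints 2.0
```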

step 4), performing iterative training on the classifier G:

step 4a) initializing the iteration count as $r$ and the maximum number of iterations as $R$, $R\ge 200$; the hyperparameter of the classifier G as $\theta$, with initial value $\theta_0$ and update step $w$; the maximum accuracy as $T_m$ and the hyperparameter corresponding to $T_m$ as $\theta_m$; and letting $T_m=0$, $\theta_m=\theta_0$, $r=0$; in this example, $R=500$;

step 4b) taking the training sample set $S_{train}$ as the input of the classifier G, computing the distance between every two training samples in $S_{train}$ with the distance calculation function Dist(X, Y) to obtain a set of training sample distances, and classifying each training sample $p_{n'}$ according to these distances to obtain its detection category $t_{n'}$;

each training sample $p_{n'}$ is classified as follows: for the training sample $p_{n'}$, obtain from the training sample distance set the set of distances between $p_{n'}$ and the other training samples; select the training samples corresponding to the $\theta_m$ smallest distances and count their sample tags; and take the cancer class that occurs most often as the cancer detection category $t_{n'}$ of $p_{n'}$.
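The classification rule just described behaves like a nearest-neighbour majority vote. The sketch below assumes a precomputed square distance matrix and reads the hyperparameter as the number of neighbours `k`; the function name, the toy matrix, and the tie-breaking (first class encountered wins) are assumptions for illustration.

```python
from collections import Counter


def classify(index, distances, tags, k):
    """Vote among the k training samples closest to sample `index`."""
    others = [(distances[index][j], tags[j])
              for j in range(len(tags)) if j != index]
    others.sort(key=lambda pair: pair[0])          # smallest distances first
    votes = Counter(tag for _, tag in others[:k])  # count the sample tags
    return votes.most_common(1)[0][0]              # most frequent cancer class


tags = ["BLCA", "BLCA", "LUAD", "LUAD"]
distances = [
    [0, 1, 5, 6],
    [1, 0, 4, 5],
    [5, 4, 0, 1],
    [6, 5, 1, 0],
]
print(classify(0, distances, tags, k=1))  # prints BLCA
print(classify(2, distances, tags, k=1))  # prints LUAD
```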

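step 4c) judging whether the tag of each training sample $p_{n'}$ is consistent with its cancer detection category $t_{n'}$; if so, regarding the detection result of the training sample as correct, otherwise regarding it as wrong; obtaining the detection accuracy of each cancer from the number of its correctly classified training samples; and calculating the average accuracy $T_r$ of the r-th iteration;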

step 4d) judging whether $T_m<T_r$ holds; if so, letting $T_m=T_r$ and $\theta_m=\theta$, and executing step 4e); otherwise, executing step 4e) directly;

judging whether $T_m<T_r$ holds keeps the hyperparameter value with the highest accuracy, which ensures that the trained classifier G' is the best-performing classifier among the R iterations;

step 4e) judging whether $r<R$ holds; if so, letting $r=r+1$ and $\theta=\theta+w$, and executing step 4b); otherwise, obtaining the trained classifier G';
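Steps 4a) to 4e) form a simple search over the hyperparameter. The sketch below is an illustration under stated assumptions: the hyperparameter is taken to be the neighbour count of a nearest-neighbour vote, the per-cancer accuracy is taken as correct samples divided by samples of that cancer, and $T_r$ as the plain mean over cancers. All names and the toy data are hypothetical, and `max_iter` is kept tiny here although the source requires $R\ge 200$.

```python
from collections import Counter


def predict(i, distances, tags, k):
    """Majority vote among the k samples closest to sample i."""
    others = sorted((distances[i][j], tags[j])
                    for j in range(len(tags)) if j != i)
    return Counter(t for _, t in others[:k]).most_common(1)[0][0]


def average_accuracy(predicted, tags):
    """Mean over cancers of (correctly classified / samples of that cancer)."""
    per_class = []
    for c in sorted(set(tags)):
        members = [i for i, t in enumerate(tags) if t == c]
        per_class.append(sum(predicted[i] == c for i in members) / len(members))
    return sum(per_class) / len(per_class)


def tune(distances, tags, theta0=1, w=1, max_iter=2):
    theta, r = theta0, 0
    best_acc, best_theta = 0.0, theta0              # T_m = 0, theta_m = theta_0
    while True:
        predicted = [predict(i, distances, tags, theta) for i in range(len(tags))]
        acc = average_accuracy(predicted, tags)     # T_r
        if best_acc < acc:                          # step 4d)
            best_acc, best_theta = acc, theta
        if r < max_iter:                            # step 4e)
            r, theta = r + 1, theta + w
        else:
            return best_theta, best_acc


tags = ["BLCA", "BLCA", "LUAD", "LUAD"]
distances = [[0, 1, 5, 6], [1, 0, 4, 5], [5, 4, 0, 1], [6, 5, 1, 0]]
print(tune(distances, tags))  # prints (1, 1.0)
```

Because the comparison in step 4d) is strict, the first hyperparameter value that reaches the best accuracy is the one kept.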

step 5) obtaining the detection result of the cancer:

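taking the test sample set $S_{test}$ as the input of the trained classifier G', computing the distance between every two test samples in $S_{test}$ with the distance calculation function Dist(X, Y), and classifying each test sample $q_{n''}$ according to these distances to obtain its detection category $t'_{n''}$.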

The technical effects of the invention are further explained by combining simulation experiments as follows:

1. simulation conditions are as follows:

The hardware platform of the simulation experiment is an Intel(R) Core(TM) i7-8500 CPU with a main frequency of 2.20 GHz and 16 GB of memory. The software platform is macOS 10.15 with R version 3.6.

The data set used in the simulation was collected from the TCGA database and contains 12 cancers, including bladder urothelial carcinoma (BLCA), head and neck squamous cell carcinoma (HNSC), renal papillary cell carcinoma (KIRP), acute myeloid leukemia (LAML), hepatocellular carcinoma (LIHC), lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC), pancreatic cancer (PAAD), prostate cancer (PRAD), rectal adenocarcinoma (READ) and endometrial cancer (UCEC), with 2761 samples in total. The detection results are verified against the known tags: a detection result consistent with the known tag is regarded as correct, otherwise it is regarded as wrong.

2. Simulation content and result analysis:

The detection accuracy and the applicable range of the invention are simulated, and the results are compared with the prior-art cancer detection method based on multiplex protein analysis. The results are shown in Table 1.

TABLE 1

Method	Accuracy	Number of detectable cancers
Prior art	88%	5
The invention	97.43%	12

As Table 1 shows, the method of the invention reaches a detection accuracy of 97.43% and covers 12 cancers. Both figures exceed those of the prior-art method, which shows that the invention achieves better cancer detection results and a wider cancer detection range.

The above simulation experiment shows that, when detecting cancer, the invention first obtains the multi-dimensional features of the SNV sites, then obtains the training sample set and the test sample set, then constructs the distance calculation function Dist(X, Y) of the classifier G, then iteratively trains the classifier G, and finally obtains the cancer detection result.

Claims

1. A cancer detection method based on multi-dimensional single nucleotide variation characteristics is characterized by comprising the following steps:

where $a_{cnm}$ denotes the m-th SNV site of a cancer sequencing sample, with m running from 1 to the number of SNV sites of that sample, and $\Sigma$ denotes summation;

(1c) initializing the number of sampling rounds as $I$, $I\ge 3$, and the feature dimension as $d$, and letting $d=1$;

(2) acquiring a training sample set and a testing sample set:

where $S_i$ is the i-th group of feature vectors, covering the N samples, and its l-th element is the l-th feature vector in the i-th group;

and taking the remaining samples to be classified in $S_{sam}$ as a test sample set containing C cancers, where $p_{n'}$ is the n'-th training sample of the training subset of the c-th cancer and $q_{n''}$ is the n''-th test sample of the test subset of the c-th cancer;

(3) distance calculation function Dist (X, Y) of the construction classifier G:

(4) performing iterative training on the classifier G:

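and classifying each training sample $p_{n'}$ according to the training sample distances to obtain the detection categories of the C cancers, where $t_{n'}$ is the detection category corresponding to $p_{n'}$;

(4c) judging whether the tag of each training sample $p_{n'}$ is consistent with its cancer detection category $t_{n'}$, obtaining the detection accuracy of each cancer from the number of its correctly classified training samples, and calculating the average accuracy $T_r$ of the r-th iteration;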

(5) obtaining the detection result of the cancer:

and classifying each test sample $q_{n''}$ according to the test sample distances to obtain the detection categories of the C cancers, where $t'_{n''}$ is the detection category corresponding to $q_{n''}$.

2. The cancer detection method based on multi-dimensional single nucleotide variation characteristics according to claim 1, wherein the sequence editing of each SNV site $a_{cnm}$ in step (1b) is implemented as follows:

for an SNV site $a_{cnm}$ containing the sequence $Seq=b_1b_2b_3b_4b_5$ and the minor allele $b_3'$, the single nucleotide variation of the original allele $b_3$ is represented as $Q=b_3\texttt{->}b_3'$, and $b_1$, $b_2$, $Q$, $b_4$ and $b_5$ are concatenated as strings to obtain the SNV sequence corresponding to $a_{cnm}$, $a_{cnm}'=b_1b_2Qb_4b_5=b_1b_2b_3\texttt{->}b_3'b_4b_5$, where $b_1$, $b_2$, $b_4$ and $b_5$ are the bases adjacent to $b_3$, and "->" denotes the single nucleotide variation.

3. The cancer detection method based on multi-dimensional single nucleotide variation characteristics according to claim 1, wherein counting the number of feature types of each $f_i^{(l)}$ in $S_{di}$ in step (2a) is implemented as follows:

for a sample feature $f_i^{(l)}$ with feature dimension $d_v$, establishing a feature vector whose dimension equals the number of feature types at dimension $d_v$, initializing all elements of the feature vector to 0, and counting the number of occurrences of the feature type corresponding to each element to obtain the feature vector corresponding to $f_i^{(l)}$.

4. The cancer detection method based on multi-dimensional single nucleotide variation characteristics according to claim 1, wherein classifying each training sample $p_{n'}$ in step (4b) is implemented as follows:

for the training sample $p_{n'}$, obtaining from the training sample distance set the set of distances between $p_{n'}$ and the other training samples;

Select the smallest theta_mAn