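WO2023005196A1 - Multi-granularity breast cancer gene classification method based on dual adaptive neighborhood radii - Google Patents
Multi-granularity breast cancer gene classification method based on dual adaptive neighborhood radii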
- Publication number
- WO2023005196A1 (PCT/CN2022/077251)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- breast cancer
- attribute
- cancer gene
- gene
- granularity
- Prior art date
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/20—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
Definitions
- The invention relates to the technical field of intelligent medical information processing, and in particular to a multi-granularity breast cancer gene classification method based on dual adaptive neighborhood radii.
- Cancer is one of the most common genetic diseases. Medical research has shown that lung cancer, skin cancer and breast cancer are closely related to genes, and the emergence of cancer can often be explained by gene mutations. If damaged genetic material is not repaired, cancer cells absorb the nutrients of normal cells and divide without limit, which leads to the decline of bodily functions. The cure rate for early-stage cancer is high, whereas the cure rate after metastasis is low, so early detection and early treatment are currently the best approach. Genetic testing is a non-destructive testing method.
- The analysis of genetic data helps doctors determine whether a patient is at high risk of breast cancer.
- A new method is urgently needed that greatly reduces the redundant genetic data in breast cancer gene classification, shortens the analysis time for breast cancer data, and improves analysis efficiency and precision; effective early screening of breast cancer is significant for clinical treatment.
- Detection is mainly used for disease diagnosis.
- Genetic diagnosis not only greatly improves sensitivity but also delivers results in a short time, so that the correct treatment can be identified, drugs can be chosen correctly, and adverse reactions caused by indiscriminate drug use can be avoided.
- The results of genetic testing can help patients formulate the right treatment.
- The purpose of the present invention is to provide a multi-granularity breast cancer gene classification method based on dual adaptive neighborhood radii. It addresses the problem that breast cancer-related gene data, currently the effective basis for judging breast cancer status, is too high-dimensional to inspect directly.
- The method examines the influence of gene mutation on the early identification of breast cancer by relating breast cancer gene data to the dual adaptive neighborhood radii, solves the difficulty of choosing the neighborhood radius in neighborhood rough sets, and then uses multi-granularity neighborhood rough set attribute reduction to remove noise and redundant data effectively.
- The present invention adopts the following technical scheme: a multi-granularity breast cancer gene classification method based on dual adaptive neighborhood radii, comprising the following steps:
- S2 Normalize the non-label data in the breast cancer gene dataset.
- The formula for data normalization is as follows:
- $x' = \dfrac{x - \min(x)}{\max(x) - \min(x)}$
- x is the value of an attribute in the original sample
- x' is the value of that attribute after normalization
- max(x) is the maximum value of that attribute over all samples
- min(x) is the minimum value of that attribute over all samples
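The normalization step can be illustrated with a small sketch. It assumes the standard min-max form implied by the max(x) and min(x) definitions; the function name `min_max_normalize`, the toy data, and the handling of constant columns are illustrative choices, not taken from the patent.

```python
def min_max_normalize(rows):
    """Scale every column of `rows` (samples x attributes) into [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in rows:
        scaled = []
        for value, low, high in zip(row, lows, highs):
            span = high - low
            # Assumption: a constant column carries no information, so map it to 0.0.
            scaled.append((value - low) / span if span else 0.0)
        normalized.append(scaled)
    return normalized


# Toy data: three samples, three attributes (the last attribute is constant).
samples = [[2.0, 10.0, 5.0], [4.0, 30.0, 5.0], [6.0, 20.0, 5.0]]
normalized = min_max_normalize(samples)
```

Only the attribute columns are passed in; the class label column would be kept aside, since the step applies to non-label data.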
- S4 Information granulation: randomly select k breast cancer gene samples as cluster centers and use the Euclidean distance to assign each sample point to its nearest cluster center. For each cluster, compute the mean of the sample points in the cluster and use it as the new cluster center. When the cluster centers no longer change, k information granules are obtained;
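A minimal sketch of this granulation loop, written as plain k-means in the Python standard library. The function name `kmeans_granulate`, the fixed random seed, the iteration cap, and the rule of keeping an empty cluster's previous center are assumptions added for illustration.

```python
import math
import random


def kmeans_granulate(samples, k, seed=0, max_iter=100):
    """Return (centers, granules); each granule is a list of sample indices."""
    rng = random.Random(seed)
    # Randomly pick k samples as the initial cluster centers.
    centers = [list(s) for s in rng.sample(samples, k)]
    for _ in range(max_iter):
        granules = [[] for _ in range(k)]
        for index, sample in enumerate(samples):
            # Assign the sample to the center with the smallest Euclidean distance.
            nearest = min(range(k), key=lambda c: math.dist(sample, centers[c]))
            granules[nearest].append(index)
        new_centers = []
        for c, members in enumerate(granules):
            if members:
                dims = zip(*(samples[i] for i in members))
                new_centers.append([sum(d) / len(members) for d in dims])
            else:
                # Assumption: an empty cluster keeps its previous center.
                new_centers.append(centers[c])
        if new_centers == centers:  # centers stopped moving: converged
            break
        centers = new_centers
    return centers, granules


points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
centers, granules = kmeans_granulate(points, k=2)
```

Each returned granule is one information granule in the sense of the step above; every sample index appears in exactly one granule.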
- S5 Divide the breast cancer gene attributes into multiple granularities and, at each granularity, perform neighborhood rough set attribute reduction with a neighborhood radius adapted to the cluster center distance: gene attributes inside the dense similar area are temporarily retained, the large number of gene attributes outside the dense similar area are screened through multi-layer neighborhoods to remove irrelevant gene attributes, and heuristic search is then used to iterate toward the positive domain. This process removes redundant gene attributes in the dense similar area and yields the important breast cancer gene attributes;
- S6 Obtain the reduced breast cancer gene attributes for each granularity and fuse the granularities, using multi-granularity neighborhood attribute reduction based on attribute inclusion degree to remove gene attributes that are similarly redundant across granularities during fusion. Introduce the concept of attribute inclusion degree, obtain the optimal multi-granularity neighborhood radius for the breast cancer gene data by refining the learning curve of the attribute inclusion degree, use heuristic search based on this multi-granularity neighborhood radius to remove the gene attributes that are redundant across granularities, and finally obtain the reduced attribute set.
- S7 Fit the reduced attribute set with an SVM (support vector machine). Introduce the two indicators of accuracy and recall, take the stability of the model into account, and, with the SVM as the classifier of the model, introduce a penalty term so that the classification model achieves good accuracy and recall at the same time. That is, classification based on breast cancer gene data under this model has a high accuracy rate and a low risk of predicting a cancer patient as healthy.
- S8 Input large-scale breast cancer gene data, use the reduced set to select the corresponding attributes, and use the classifier to obtain the final prediction result.
- In the multi-granularity breast cancer gene classification method based on dual adaptive neighborhood radii provided by the present invention, step S3 comprises the following specific steps:
- Step S3.1 Use the silhouette coefficient to evaluate the clustering algorithm. Let a_i be the average dissimilarity between the i-th breast cancer gene attribute and the other breast cancer gene attributes in its own cluster, and b_i its average dissimilarity to the breast cancer gene attributes outside the cluster; the silhouette coefficient of the i-th breast cancer gene attribute is then defined as:
- $s_i = \dfrac{b_i - a_i}{\max(a_i, b_i)}$
- The value range of s_i is [-1, 1]. The closer the silhouette coefficient is to 1, the better the clustering; a negative silhouette coefficient indicates poor clustering;
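The silhouette coefficient can be computed directly from pairwise distances. The sketch below assumes the standard definition (a_i is the mean distance to the rest of the own cluster, b_i the smallest mean distance to any other cluster) and Euclidean distance; `silhouette_scores` and the toy data are illustrative names and values.

```python
import math


def silhouette_scores(points, labels):
    """Return the silhouette coefficient s_i of every point."""
    scores = []
    for i, p in enumerate(points):
        distances = {}  # cluster label -> distances from p to that cluster's points
        for j, q in enumerate(points):
            if j != i:
                distances.setdefault(labels[j], []).append(math.dist(p, q))
        own = distances.pop(labels[i], [])
        if not own or not distances:
            # Convention (assumption): singleton clusters or a single cluster score 0.
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for d in distances.values())
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return scores


points = [[0.0], [1.0], [10.0], [11.0]]
good = silhouette_scores(points, [0, 0, 1, 1])  # natural grouping
bad = silhouette_scores(points, [0, 1, 0, 1])   # deliberately mixed grouping
```

Averaging the scores for several candidate k values gives the rough interval of k that step S3.4 refers to; the mixed grouping above produces negative scores, matching the statement that negative values indicate poor clustering.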
- Step S3.2 Use the PCA dimensionality reduction algorithm to reduce the dimensionality of the breast cancer gene data and visualize the result, and combine it with the clustering algorithm to check the actual clustering effect.
- The specific design is as follows:
- $\eta_i = \dfrac{y_i}{\sum_{n=1}^{N} y_n}, \qquad \Theta_r = \dfrac{\sum_{i=1}^{r} y_i}{\sum_{n=1}^{N} y_n}$
- N is the total number of gene attributes
- y_i is the eigenvalue of column i
- y_n is the eigenvalue of column n
- η_i represents the contribution rate of the i-th column in the covariance matrix
- Θ_r represents the cumulative contribution rate of the first r columns in the covariance matrix
- Step S3.3 Take the first r dimensions of the covariance matrix as the projection matrix S_{n×r}, multiply the matrix Y_{m×n} to be reduced by the projection matrix S_{n×r}, and obtain the reduced matrix T_{m×r}, that is:
- $T_{m \times r} = Y_{m \times n}\, S_{n \times r}$
- m represents the number of samples of breast cancer gene data
- n represents the number of original gene attributes of breast cancer gene data
- r represents the number of gene attributes of breast cancer gene data obtained after dimensionality reduction.
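As a rough illustration of the PCA quantities defined above, the sketch below computes contribution rates from a given list of eigenvalues, picks the smallest r whose cumulative contribution rate reaches a threshold, and projects a matrix. The eigenvalues, the 0.75 threshold, and the function names are hypothetical; the eigen-decomposition itself is assumed to have been done elsewhere.

```python
def contribution_rates(eigenvalues):
    """Contribution rate of each eigenvalue: y_i divided by the sum of all y_n."""
    total = sum(eigenvalues)
    return [value / total for value in eigenvalues]


def components_for(eigenvalues, threshold):
    """Smallest r whose cumulative contribution rate reaches `threshold`."""
    ordered = sorted(eigenvalues, reverse=True)
    cumulative = 0.0
    for r, rate in enumerate(contribution_rates(ordered), start=1):
        cumulative += rate
        if cumulative >= threshold:
            return r, cumulative
    return len(ordered), cumulative


def project(Y, S):
    """T (m x r) = Y (m x n) times S (n x r), using plain nested lists."""
    return [[sum(y * s for y, s in zip(row, col)) for col in zip(*S)] for row in Y]


# Hypothetical eigenvalues and threshold, for illustration only.
r, cumulative = components_for([5.0, 3.0, 1.5, 0.5], threshold=0.75)

Y = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]      # m = 2 samples, n = 3 attributes
S = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]    # n = 3 by r = 2 projection matrix
T = project(Y, S)
```

For visualization, r would normally be fixed at 2 or 3 regardless of the threshold, so that the clustered samples can be plotted.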
- Step S3.4 Determine a rough interval for the value of k from the silhouette coefficient, then refine the interval with the PCA dimensionality-reduction visualization to select the best k value, which gives the number of information granules.
- In the multi-granularity breast cancer gene classification method based on dual adaptive neighborhood radii provided by the present invention, step S5 comprises the following specific steps:
- Step S5.1 At a single information granularity, calculate the neighborhood relation of each breast cancer gene sample x_i on B under a single gene attribute:
- $n_B(x_i) = \{\, x \in U \mid \Delta_B(x, x_i) \le \delta \,\}$
- Δ_B is the distance function
- δ is the neighborhood radius
- Step S5.2 At a single information granularity, calculate the positive domain of the breast cancer gene decision attribute D with respect to B under a single gene attribute:
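The neighborhood and positive domain computations can be sketched as follows. This assumes the usual neighborhood rough set definitions: a sample's neighborhood holds every sample within the radius on the chosen attributes, the positive domain holds the samples whose whole neighborhood shares their decision class, and the dependency is the positive domain's share of all samples. The function names, Euclidean distance, and toy data are illustrative assumptions.

```python
import math


def neighborhood(samples, i, attrs, delta):
    """Indices of samples within distance `delta` of sample i on attributes `attrs`."""
    xi = [samples[i][a] for a in attrs]
    return [j for j, row in enumerate(samples)
            if math.dist(xi, [row[a] for a in attrs]) <= delta]


def positive_region(samples, labels, attrs, delta):
    """Samples whose entire neighborhood carries the same decision label."""
    return [i for i in range(len(samples))
            if all(labels[j] == labels[i]
                   for j in neighborhood(samples, i, attrs, delta))]


def dependency(samples, labels, attrs, delta):
    """Fraction of samples that fall in the positive region."""
    return len(positive_region(samples, labels, attrs, delta)) / len(samples)


# Toy normalized data: attribute 0 separates the classes, attribute 1 does not.
samples = [[0.10, 0.50], [0.15, 0.90], [0.80, 0.55], [0.85, 0.10]]
labels = [0, 0, 1, 1]
dep_attr0 = dependency(samples, labels, [0], delta=0.2)
dep_attr1 = dependency(samples, labels, [1], delta=0.2)
```

A fixed radius of 0.2 is used here only to keep the example short; in the method described, the radius is adapted per granularity.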
- Step S5.5 Obtain the positive domain of the breast cancer gene decision attribute D with respect to a_m
- Step S5.6 At a single granularity, sort the attributes by dependency in descending order in the list "list", and obtain the positive domain NPOS_P(D) of the breast cancer gene decision attribute D with respect to the gene attributes at granularity P:
- Step S5.7 Calculate the dependency of decision D on condition attribute P, and carry out the initialization
- Step S5.9 If r(R_0, D) ≠ r(P, D), move the attribute with the highest dependency in the list "list" into R_0, and jump to step S5.8.
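The heuristic loop in these steps can be approximated by a forward greedy search. The sketch below is one plausible reading, not the patent's exact procedure: attributes are ranked by their individual dependency, and the top-ranked attribute is moved into the reduct until the reduct's dependency matches that of the full attribute set. The dependency definition, the fixed radius, and all names are assumptions.

```python
import math


def dependency(samples, labels, attrs, delta):
    """Share of samples whose delta-neighborhood on `attrs` is pure in label."""
    pure = 0
    for i, row_i in enumerate(samples):
        xi = [row_i[a] for a in attrs]
        neighbors = [j for j, row in enumerate(samples)
                     if math.dist(xi, [row[a] for a in attrs]) <= delta]
        if all(labels[j] == labels[i] for j in neighbors):
            pure += 1
    return pure / len(samples)


def heuristic_reduct(samples, labels, delta):
    """Greedy reduct: add top-ranked attributes until full dependency is reached."""
    all_attrs = list(range(len(samples[0])))
    target = dependency(samples, labels, all_attrs, delta)
    # Rank attributes by single-attribute dependency, highest first.
    ranked = sorted(all_attrs,
                    key=lambda a: dependency(samples, labels, [a], delta),
                    reverse=True)
    reduct, current = [], 0.0
    # Dependency never decreases as attributes are added, so "not yet equal"
    # is the same as "still lower" here.
    while ranked and current < target:
        reduct.append(ranked.pop(0))
        current = dependency(samples, labels, reduct, delta)
    return reduct


samples = [[0.10, 0.50, 0.3], [0.15, 0.90, 0.3],
           [0.80, 0.55, 0.3], [0.85, 0.10, 0.3]]
labels = [0, 0, 1, 1]
reduct = heuristic_reduct(samples, labels, delta=0.2)
```

In this toy case the first attribute alone reproduces the dependency of all three attributes, so the other two are dropped as redundant.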
- In the multi-granularity breast cancer gene classification method based on dual adaptive neighborhood radii provided by the present invention, step S6 comprises the following specific steps:
- Step S6.3 Obtain the positive domain of the decision attribute D with respect to P_t, and calculate the dependency of decision D on the breast cancer gene condition attribute P_t
- Step S6.4 Sort the breast cancer gene attributes by dependency in descending order in the list All_list, and obtain the optimistic multi-granularity positive domain of the decision attribute D with respect to C as follows:
- Step S6.5 Calculate the dependency of decision D on condition attribute C, and carry out the initialization
- Step S6.7 If r(Red_0, D) ≠ r(C, D), move the attribute with the highest dependency in the list All_list into Red_0, and jump to step S6.6;
- The parallel classifier of the present invention, with high accuracy and high recall, makes effective use of the breast cancer reduced set based on dual adaptive neighborhood radii and gives the person tested a highly accurate result in a short time. The high-recall model also minimizes the costly risk of predicting a cancer patient as healthy.
- The present invention can analyze data with a small number of samples and extract the more important gene attributes through attribute reduction, which reduces the interference of noisy data with model prediction.
- The adaptive neighborhood radius enables the classifier to better self-learn and fit the model, thereby further improving the detection accuracy.
- Through the multi-granularity breast cancer gene classification method based on dual adaptive neighborhood radii, the present invention removes a large amount of redundant and noisy gene data, reducing the 24,481 gene attributes originally measured to 2,734 in the example.
- Verification with ten-fold cross-validation addresses the small sample size and long running time. The approach greatly reduces the complexity of the model and the time complexity of the algorithm, so results for genetic data submitted by a user are available within a few minutes, giving the person tested a better testing experience.
- Recall is often ignored when samples are tested, yet the cost of predicting a cancer patient as healthy is extremely high, because the person tested is likely to miss the best time for treatment.
- The multi-granularity breast cancer gene classification method based on dual adaptive neighborhood radii used by the present invention weighs both detection accuracy and detection recall: it adjusts the model and sets a penalty term so that recall is taken into account while a high accuracy rate is maintained, thereby greatly reducing this risk.
- Fig. 1 is a flow chart of breast cancer gene detection in the present invention.
- Fig. 2 is a flow chart of the double adaptive neighborhood radius multi-granularity attribute reduction based on breast cancer gene data in the present invention.
- Fig. 3 is a flow chart of classification and detection of breast cancer gene data in the present invention.
- Fig. 4 is a flow chart of single-granularity adaptive neighborhood radius attribute reduction for breast cancer gene data in the present invention.
- Fig. 5 is a flow chart of multi-granularity adaptive neighborhood radius attribute reduction for breast cancer gene data in the present invention.
- The technical scheme provided by the present invention is a multi-granularity breast cancer gene classification method based on dual adaptive neighborhood radii, comprising the following steps:
- The above model was tested on a breast cancer gene dataset with 97 samples and 24,481 gene attributes in total.
- The decision attribute takes two classes: diagnosed breast cancer patients and healthy people.
- Step 2 Normalize the non-label data in the breast cancer gene dataset.
- The formula for data normalization is as follows:
- $x' = \dfrac{x - \min(x)}{\max(x) - \min(x)}$
- x is the value of an attribute in the original sample
- x' is the value of that attribute after normalization
- max(x) is the maximum value of that attribute over all samples
- min(x) is the minimum value of that attribute over all samples.
- Step 4 Information granulation: randomly select k breast cancer gene samples as cluster centers and use the Euclidean distance to assign each sample point to its nearest cluster center. For each cluster, compute the mean of the sample points in the cluster and use it as the new cluster center. When the cluster centers no longer change, k information granules are obtained;
- Step 5 Divide the breast cancer gene attributes into multiple granularities and, at each granularity, perform neighborhood rough set attribute reduction with a neighborhood radius adapted to the cluster center distance: gene attributes inside the dense similar area are temporarily retained, the large number of gene attributes outside the dense similar area are screened through multi-layer neighborhoods to remove irrelevant gene attributes, and heuristic search is then used to iterate toward the positive domain. This process removes redundant gene attributes in the dense similar area and yields the important breast cancer gene attributes;
- Step 6 Obtain the reduced breast cancer gene attributes for each granularity and fuse the granularities, using multi-granularity neighborhood attribute reduction based on attribute inclusion degree to remove gene attributes that are similarly redundant across granularities during fusion. Introduce the concept of attribute inclusion degree, obtain the optimal multi-granularity neighborhood radius for the breast cancer gene data by refining the learning curve of the attribute inclusion degree, use heuristic search based on this multi-granularity neighborhood radius to remove the redundant gene attributes, and finally obtain the reduced attribute set;
- Among the neighborhood radii under all granularities, the largest, 0.2, is selected as the initial multi-granularity neighborhood radius, so the multi-granularity neighborhood radius ranges over [0, 0.2]. With a step size of 0.01, the attribute inclusion degree is calculated under each candidate multi-granularity neighborhood radius, and the radius with the largest attribute inclusion degree, 0.13, is selected as the multi-granularity neighborhood radius.
- The multi-granularity neighborhood attribute reduction algorithm was used to fuse 90 granularities, giving a final reduced set of 2,734 gene attributes.
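The radius selection described above is a one-dimensional grid search. The sketch below enumerates the candidate radii in [0, 0.2] with step 0.01 and keeps the one with the highest score. The text does not give the formula for the attribute inclusion degree, so the scoring function is passed in as a parameter, and `toy_score` is a purely hypothetical stand-in that has nothing to do with the reported value of 0.13.

```python
def candidate_radii(max_radius, step):
    """Evenly spaced radii from 0 to max_radius inclusive."""
    count = round(max_radius / step)
    # Rounding removes floating-point drift such as 0.13000000000000003.
    return [round(i * step, 10) for i in range(count + 1)]


def best_radius(max_radius, step, score):
    """Radius in the grid that maximizes `score` (first one wins on ties)."""
    return max(candidate_radii(max_radius, step), key=score)


def toy_score(radius):
    # Hypothetical stand-in for the attribute inclusion degree; peaks at 0.05.
    return -(radius - 0.05) ** 2


radii = candidate_radii(0.2, 0.01)
chosen = best_radius(0.2, 0.01, toy_score)
```

In practice `score` would evaluate the attribute inclusion degree on the fused granularities for each radius, which is far more expensive than this toy function, so caching the per-radius results is worth considering.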
- Step 7 Fit the reduced attribute set with an SVM (support vector machine). Introduce the two indicators of accuracy and recall, take the stability of the model into account, and, with the SVM as the classifier of the model, introduce a penalty term so that the classification model achieves good accuracy and recall at the same time. That is, classification based on breast cancer gene data under this model has a high accuracy rate and a low risk of predicting a cancer patient as healthy.
- Ten-fold cross-validation is used to split the samples: each time, 90% of the samples are selected as the training set and the remaining 10% as the test set, and the SVM classification algorithm is fitted to the samples.
- Training was run 10 times in total; in 7 of the runs the accuracy exceeded 90%, and the average accuracy was about 85.7%.
- A penalty term was then introduced to improve the model while taking recall into account; the final average prediction accuracy was about 91.2%, and the recall was about 82%.
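The evaluation protocol can be sketched without any classifier attached. The code below builds ten train/test splits over 97 sample indices and computes accuracy and recall from label lists. The shuffling seed, the function names, and the convention that label 1 marks a cancer patient are assumptions; the SVM fit itself is left out.

```python
import random


def ten_fold_indices(n_samples, seed=0):
    """Yield (train, test) index lists for ten-fold cross-validation."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    folds = [order[i::10] for i in range(10)]  # ten near-equal folds
    for test in folds:
        held_out = set(test)
        train = [i for i in order if i not in held_out]
        yield train, test


def accuracy_and_recall(y_true, y_pred, positive=1):
    """Accuracy over all samples and recall for the `positive` class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    recall = tp / (tp + fn) if tp + fn else 0.0
    return correct / len(y_true), recall


splits = list(ten_fold_indices(97))
accuracy, recall = accuracy_and_recall([1, 1, 0, 0, 1], [1, 0, 0, 0, 1])
```

Recall here counts how many true patients were caught, so a missed patient (predicted healthy) lowers recall even when accuracy stays high, which is the risk the penalty term is meant to control.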
- Step 8 Input large-scale breast cancer gene data, use the reduced set to select appropriate attributes, and use the classifier to obtain the final prediction result.
- In the multi-granularity breast cancer gene classification method based on dual adaptive neighborhood radii provided by the present invention, step 3 comprises the following specific steps:
- Step 3.1 Use the silhouette coefficient to evaluate the clustering algorithm. Let a_i be the average dissimilarity between the i-th breast cancer gene attribute and the other breast cancer gene attributes in its own cluster, and b_i its average dissimilarity to the breast cancer gene attributes outside the cluster; the silhouette coefficient of the i-th breast cancer gene attribute is then defined as:
- $s_i = \dfrac{b_i - a_i}{\max(a_i, b_i)}$
- The value range of s_i is [-1, 1]. The closer the silhouette coefficient is to 1, the better the clustering; a negative silhouette coefficient indicates poor clustering;
- a multi-granularity breast cancer gene classification method based on double-adaptive neighborhood radius is obtained through the silhouette coefficient
- Step 3.2 Use the PCA dimensionality reduction algorithm to reduce the dimensionality of the breast cancer gene data and visualize the result, and combine it with the clustering algorithm to check the actual clustering effect.
- The specific design is as follows:
- $\eta_i = \dfrac{y_i}{\sum_{n=1}^{N} y_n}, \qquad \Theta_r = \dfrac{\sum_{i=1}^{r} y_i}{\sum_{n=1}^{N} y_n}$
- N is the total number of gene attributes
- y_i is the eigenvalue of column i
- y_n is the eigenvalue of column n
- η_i represents the contribution rate of the i-th column in the covariance matrix
- Θ_r represents the cumulative contribution rate of the first r columns in the covariance matrix
- Step 3.3 Take the first r dimensions of the covariance matrix as the projection matrix S_{n×r}, multiply the matrix Y_{m×n} to be reduced by the projection matrix S_{n×r}, and obtain the reduced matrix T_{m×r}, that is:
- $T_{m \times r} = Y_{m \times n}\, S_{n \times r}$
- m represents the number of samples of breast cancer gene data
- n represents the number of original gene attributes of breast cancer gene data
- r represents the number of gene attributes of breast cancer gene data obtained after dimensionality reduction.
- Step 3.4 Determine a rough interval for the value of k from the silhouette coefficient, then refine the interval with the PCA dimensionality-reduction visualization to select the best k value, which gives the number of information granules.
- In the multi-granularity breast cancer gene classification method based on dual adaptive neighborhood radii provided by the present invention, step 5 comprises the following specific steps:
- Step 5.1 At a single information granularity, calculate the neighborhood relation of each breast cancer gene sample x_i on B under a single gene attribute:
- $n_B(x_i) = \{\, x \in U \mid \Delta_B(x, x_i) \le \delta \,\}$
- Step 5.2 At a single information granularity, calculate the positive domain of the breast cancer gene decision attribute D with respect to the single gene attribute B:
- Z is the shortest cluster center distance
- h is the difference between the vertical coordinates of the granularity cluster center and the nearest granularity cluster center.
- Step 5.5 Obtain the positive domain of the breast cancer gene decision attribute D with respect to a_m
- Step 5.6 At a single granularity, sort the attributes by dependency in descending order in the list "list", and obtain the positive domain NPOS_P(D) of the breast cancer gene decision attribute D with respect to the gene attributes at granularity P:
- Step 5.7 Calculate the dependency of decision D on condition attribute P, and carry out the initialization
- Step 5.9 If r(R_0, D) ≠ r(P, D), move the attribute with the highest dependency in the list "list" into R_0, and jump to step 5.8.
- In the multi-granularity breast cancer gene classification method based on dual adaptive neighborhood radii provided by the present invention, step 6 comprises the following specific steps:
- Step 6.3 Obtain the positive domain of the decision attribute D with respect to P_t, and calculate the dependency of decision D on the breast cancer gene condition attribute P_t
- Step 6.4 Sort the breast cancer gene attributes by dependency in descending order in the list All_list, and obtain the optimistic multi-granularity positive domain of the decision attribute D with respect to C as follows:
- Step 6.5 Calculate the dependency of decision D on condition attribute C, and carry out the initialization
- Step 6.7 If r(Red_0, D) ≠ r(C, D), move the attribute with the highest dependency in the list All_list into Red_0, and jump to step 6.6;
- Current genetic testing mainly extracts the user's genetic data and makes predictions by comparing it against a company's hundreds of millions of records.
- The dataset provides only a small number of samples, and it is difficult to achieve a high accuracy rate with high-dimensional gene attributes.
- The present invention can analyze a small number of samples and extract the more important gene attributes to improve detection accuracy; as the above example shows, gene prediction can be performed effectively.
- Through the multi-granularity breast cancer gene classification method based on dual adaptive neighborhood radii, the present invention removes a large amount of redundant and noisy gene data. In the above example, the 24,481 gene attributes originally measured are reduced to 2,734 gene attributes, which greatly reduces the time complexity of the algorithm.
- Results for genetic data submitted by a user are available within a few minutes, giving the person tested an excellent testing experience.
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Medical Informatics (AREA)
- Public Health (AREA)
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Biomedical Technology (AREA)
- Epidemiology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Pathology (AREA)
- Biophysics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Biotechnology (AREA)
- Primary Health Care (AREA)
- Bioethics (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Chemical & Material Sciences (AREA)
- Analytical Chemistry (AREA)
- Genetics & Genomics (AREA)
- Molecular Biology (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Investigating Or Analysing Biological Materials (AREA)
Abstract
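A multi-granularity breast cancer gene classification method based on double adaptive neighborhood radius. Large-scale gene locus data are read, normalized, and analyzed. The silhouette coefficient is combined with PCA dimensionality-reduction visualization to select the best value of K and tune the information granulation model. A heuristic reduction algorithm then performs two reductions: multi-granularity attribute reduction with a neighborhood radius adapted to the cluster-center distance, and multi-granularity attribute reduction with a neighborhood radius based on attribute inclusion degree. A support vector machine (SVM) classifier is used to classify and predict the breast cancer gene data. Adjusting the penalty term gives the model high accuracy and recall in breast cancer gene classification; removing redundant attributes from large-scale data improves computational efficiency; and exploiting the support information between samples improves the efficiency and precision of breast cancer data classification.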
Description
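The present invention relates to the technical field of intelligent processing of medical information, and in particular to a multi-granularity breast cancer gene classification method based on double adaptive neighborhood radius.
Cancer is one of the most common genetic diseases. Medical research shows that lung cancer, skin cancer, and breast cancer are closely related to genes. The onset of cancer can usually be explained by gene mutation: when damaged genetic material is not repaired, cancer cells absorb the nutrients of normal cells and divide without limit, which causes bodily functions to decline. Early-stage cancer has a relatively high cure rate, whereas the cure rate is low once cancer cells have metastasized, so early detection and early treatment are currently the best approach. Genetic testing is a non-invasive detection method: next-generation sequencing examines thousands of gene loci at once, and data analysis and prediction over those loci in a big-data setting are of far-reaching significance for clinical treatment. The breast cancer gene data are analyzed and reduced from two perspectives, feature engineering and granular computing, and machine learning classification algorithms are then used to classify and predict them.
In recent years, the NCCN Guidelines for breast cancer have recommended multi-gene testing by high-throughput sequencing for people at high risk of breast cancer with a familial hereditary tendency, in order to screen for hereditary susceptibility genes and thereby prevent the disease or guide treatment. This shows that individualized treatment and prevention based on genetic testing is a new direction for breast cancer. For the same high-risk group, the NCCN Guidelines also recommend breast self-examination, enhanced imaging, examination of the relevant serum tumor markers, and drug-based prevention.
Analysis of gene data helps doctors determine effectively whether a patient is at high risk of breast cancer. Because the volume of gene data is very large, however, a new method is urgently needed that substantially reduces the redundant gene data in breast cancer gene classification information, shortens the analysis time, and improves analysis efficiency and precision. Effective early screening for breast cancer is of value for clinical treatment.
Testing is used mainly for disease diagnosis. Gene-based diagnosis is much more sensitive and returns results in a short time, which makes it possible to identify the right treatment, choose drugs correctly, and avoid the adverse reactions caused by indiscriminate medication. The results of breast cancer genetic testing can help patients settle on the right treatment.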
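Summary of the Invention
The object of the present invention is to provide a multi-granularity breast cancer gene classification method based on double adaptive neighborhood radius. It addresses the problem that breast cancer gene data, the effective basis for judging breast cancer lesions, are too high-dimensional for the influence of gene mutations on early breast cancer identification to be observed. The double adaptive neighborhood radius, derived from the relationships among the breast cancer gene data, resolves the difficulty of choosing a neighborhood radius in neighborhood rough sets, and multi-granularity neighborhood rough set attribute reduction then removes noise and redundant data effectively.
To achieve this object, the present invention adopts the following technical solution: a multi-granularity breast cancer gene classification method based on double adaptive neighborhood radius, comprising the following steps: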
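S1: Read the breast cancer gene data set and convert it into a neighborhood decision information system $S = (U, AT, V, f, \delta)$, where $U = \{x_1, x_2, x_3, \ldots, x_m\}$ is the set of tested patients in the breast cancer gene data set and $m$ is the number of patients; $C = \{a_1, a_2, \ldots, a_n\}$ is the non-empty finite set of breast cancer gene features and $n$ is the number of features; $D = \{D_1, D_2\}$ is the non-empty finite set of patient class labels, where $D_1$ means the patient has breast cancer and $D_2$ means the patient does not; $AT = C \cup D$ is the set of all gene attributes and decision attributes; $V = \bigcup_{a \in C \cup D} V_a$, where $V_a$ is the set of possible values of gene feature $a$; $f: U \times (C \cup D) \to V$ is an information function that assigns a value to every gene feature of every patient, that is, $f(x, a) \in V_a$ for every $x \in U$ and $a \in C \cup D$; and $\delta$ is the neighborhood threshold;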
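S2: Normalize the non-label data in the breast cancer gene data set. The normalization formula is:
$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$
where $x$ is the original value of an attribute in a sample, $x'$ is the normalized value, and $\max(x)$ and $\min(x)$ are the maximum and minimum of that attribute over all samples;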
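S3: Use the K-means clustering algorithm to granulate the breast cancer gene data. Combine the silhouette coefficient with PCA dimensionality reduction to obtain the optimal number of information granules $k$, which yields the granularities $C = \{P_1, P_2, \ldots, P_k\}$;
S4: Information granulation: randomly select $k$ breast cancer gene samples as cluster centers; using Euclidean distance, assign each sample point to its nearest cluster center; for each cluster, take the mean of its sample points as the new cluster center; when the cluster centers no longer move, the $k$ information granules are obtained;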
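S5: With the breast cancer gene attributes divided among multiple granularities, perform neighborhood rough set attribute reduction with a radius adapted to the cluster-center distance at each granularity: temporarily retain the gene attributes inside the dense similar region, apply multi-layer neighborhood screening to the many gene attributes outside it to remove irrelevant ones, and then use a heuristic search that iterates until the positive region is reached to remove the redundant gene attributes inside the dense similar region, leaving the important breast cancer gene attributes;
S6: Each granularity now has its reduced set of breast cancer gene attributes. Fuse the granularities and, during fusion, use multi-granularity neighborhood attribute reduction based on attribute inclusion degree to remove gene attributes that are similar and redundant across granularities: introduce the concept of attribute inclusion degree, refine its learning curve to obtain the optimal multi-granularity neighborhood radius for the breast cancer gene data, and use a heuristic search based on that radius to remove the redundant gene attributes across granularities, giving the final attribute reduction set;
S7: Fit a support vector machine (SVM) to the attribute reduction set. Accuracy and recall are both used as metrics, and model stability is taken into account. A penalty term is added to the SVM classifier so that the model achieves good accuracy and good recall together; that is, classification on the breast cancer gene data is highly accurate while the risk of predicting a cancer patient as healthy stays low;
S8: Input large-scale breast cancer gene data, select the attributes in the reduction set, and use the classifier to obtain the final prediction.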
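The min-max normalization of step S2 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `min_max_normalize` and the handling of constant columns (mapped to 0.0) are assumptions made for the example.

```python
def min_max_normalize(rows):
    """Scale every column of `rows` (a list of equal-length lists) to [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in rows:
        scaled = []
        for value, low, high in zip(row, lows, highs):
            span = high - low
            # Assumption: a constant column has no spread, so map it to 0.0
            # instead of dividing by zero.
            scaled.append((value - low) / span if span else 0.0)
        normalized.append(scaled)
    return normalized


samples = [[1.0, 5.0], [3.0, 5.0], [2.0, 5.0]]
print(min_max_normalize(samples))
```

Each column is scaled independently, which matches the per-attribute definition of $\max(x)$ and $\min(x)$ in the step.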
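In the multi-granularity breast cancer gene classification method based on double adaptive neighborhood radius provided by the present invention, step S3 comprises the following specific steps:
Step S3.1: Evaluate the clustering with the silhouette coefficient. Let $a_i$ be the mean dissimilarity (average distance) between the $i$-th breast cancer gene attribute and the other attributes in its own cluster, and let $b_i$ be its mean dissimilarity to the attributes outside the cluster. The silhouette coefficient of the $i$-th breast cancer gene attribute is then defined as:
$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$$
where $s_i \in [-1, 1]$. The closer the silhouette coefficient is to 1, the better the clustering; a negative silhouette coefficient indicates poor clustering;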
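As a hedged illustration of step S3.1, the sketch below computes the silhouette coefficient of one point with the standard definition, using mean Euclidean distances for $a_i$ and $b_i$ and taking $b_i$ from the nearest other cluster. The function name `silhouette` and the convention of returning 0.0 for degenerate cases are assumptions of this example.

```python
import math


def silhouette(points, labels, i):
    """Silhouette s_i of points[i], with a_i and b_i as mean Euclidean distances."""
    own = labels[i]
    dist_by_cluster = {}
    for j, (point, label) in enumerate(zip(points, labels)):
        if j != i:
            dist_by_cluster.setdefault(label, []).append(math.dist(points[i], point))
    if own not in dist_by_cluster:
        return 0.0  # assumption: a singleton cluster gets s_i = 0
    a = sum(dist_by_cluster[own]) / len(dist_by_cluster[own])
    others = [sum(d) / len(d) for label, d in dist_by_cluster.items() if label != own]
    if not others:
        return 0.0  # only one cluster present, silhouette is undefined
    b = min(others)  # nearest other cluster
    denom = max(a, b)
    return (b - a) / denom if denom else 0.0


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
labels = [0, 0, 1, 1]
print(silhouette(points, labels, 0))
```

Averaging this value over all points for several candidate values of $k$ gives the coarse range for $k$ that step S3.4 refers to.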
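Step S3.2: Use principal component analysis (PCA) to reduce the dimensionality of the breast cancer gene data so that it can be visualized, and use the visualization together with the clustering algorithm to check the actual clustering result. The design is as follows:
For $m$ breast cancer gene samples of dimension $n$, the relationships between the variables are described by the covariance matrix:
$$\begin{pmatrix} \operatorname{cov}(c_1, c_1) & \cdots & \operatorname{cov}(c_1, c_n) \\ \vdots & \ddots & \vdots \\ \operatorname{cov}(c_n, c_1) & \cdots & \operatorname{cov}(c_n, c_n) \end{pmatrix}$$
where $\operatorname{cov}(c_i, c_j)$ is the covariance between the $i$-th attribute and the $j$-th attribute;
Then compute the contribution rate $\theta$ and the cumulative contribution rate $\Theta$ of the covariance matrix from the eigenvalues:
$$\theta_i = \frac{y_i}{\sum_{j=1}^{N} y_j}, \qquad \Theta_r = \frac{\sum_{i=1}^{r} y_i}{\sum_{j=1}^{N} y_j}$$
where $N$ is the total number of gene attributes, $y_i$ is the eigenvalue of the $i$-th column, $\theta_i$ is the contribution rate of the $i$-th column of the covariance matrix, and $\Theta_r$ is the cumulative contribution rate of the first $r$ columns.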
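Step S3.3: Take the first $r$ dimensions of the covariance matrix (the eigenvectors of its $r$ largest eigenvalues) as the projection matrix $S_{n \times r}$, and multiply the matrix to be reduced, $Y_{m \times n}$, by $S_{n \times r}$ to obtain the reduced matrix $T_{m \times r}$:
$$Y_{m \times n} \times S_{n \times r} = T_{m \times r} \tag{17}$$
where $m$ is the number of samples in the breast cancer gene data, $n$ is the original number of gene attributes, and $r$ is the number of gene attributes after dimensionality reduction.
Step S3.4: Use the silhouette coefficient to determine a rough range for $k$, then use the PCA visualization to narrow the range and select the best $k$, which is the number of information granules.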
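The contribution rate $\theta_i$ and cumulative contribution rate $\Theta_r$ used in the PCA steps above can be illustrated with a short sketch that starts from already-computed eigenvalues. The function names and the `threshold` parameter for choosing $r$ are assumptions of this example; the source does not state how $r$ is chosen.

```python
def contribution_rates(eigenvalues):
    """Return (theta, cumulative) for eigenvalues sorted in descending order."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    theta = [y / total for y in ordered]
    cumulative, running = [], 0.0
    for rate in theta:
        running += rate
        cumulative.append(running)
    return theta, cumulative


def components_needed(eigenvalues, threshold=0.8):
    """Smallest r whose cumulative contribution reaches `threshold` (hypothetical rule)."""
    _, cumulative = contribution_rates(eigenvalues)
    for r, value in enumerate(cumulative, start=1):
        if value >= threshold:
            return r
    return len(cumulative)


theta, cumulative = contribution_rates([4.0, 2.0, 1.0, 1.0])
print(theta, cumulative, components_needed([4.0, 2.0, 1.0, 1.0]))
```

For visualization, $r$ is normally fixed at 2 or 3 rather than derived from a threshold; the cumulative rate then indicates how much variance the plot retains.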
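In the multi-granularity breast cancer gene classification method based on double adaptive neighborhood radius provided by the present invention, step S5 comprises the following specific steps:
Step S5.1: At a single information granularity, compute the neighborhood of each breast cancer gene sample $x_i$ on a single gene attribute $B$:
$$n_B(x_i) = \{x \in U \mid \Delta_B(x_i, x) \le \delta\} \tag{18}$$
where $\Delta_B$ is the distance function and $\delta$ is the neighborhood radius, $\delta > 0$.
Step S5.2: At a single information granularity, compute the positive region of the breast cancer gene decision attribute $D$ with respect to the single gene attribute $B$:
$$NPOS_B(D) = \bigcup_{j=1}^{2} \{x_i \in U \mid n_B(x_i) \subseteq D_j\}$$
The dependency of the decision attribute $D$ on $B$ is then defined as:
$$r(B, D) = \frac{|NPOS_B(D)|}{|U|}$$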
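Step S5.3: A single granularity contains $z$ gene attributes $P = \{a_1, a_2, \ldots, a_z\}$. Denote the cluster-center coordinates of this information granule by $(b_1, b_2, \ldots, b_n)$ and the cluster-center coordinates of the nearest other information granule by $(d_1, d_2, \ldots, d_n)$. $i$ and $j$ are sample traversal indices, initialized to 0, with $0 \le i, j \le m$;
Step S5.4: At a single granularity, for any breast cancer gene attribute $a_t$, let $S_t$ be the distance from $a_t$ to the cluster center of this information granule. If $S_t$ satisfies the dense-region condition (the inequality is not reproduced in the source text), the attribute is by default treated as a breast cancer gene attribute inside the dense similar region. First initialize the set $set_i = \emptyset$, which is used to find the lower approximation set for $x_i$. Starting from $x_i$, compute the distance under this attribute from $x_i$ to every other point $x_j$ and denote it by $W$. If $W$ does not exceed the neighborhood radius (whose expression is likewise not reproduced in the source text), let $set_i = set_i \cup \{x_i\} \cup \{x_j\}$. After every point has been traversed, $set_i$ is obtained. With the decision attribute $D = \{D_1, D_2\}$, if $set_i \subseteq D_1$ or $set_i \subseteq D_2$, then $set_i$ is called the lower approximation set of $x_i$ in $D_1$ or $D_2$ with respect to $a_t$; otherwise let $set_i = \emptyset$;
Step S5.6: At a single granularity, store the attribute dependencies in descending order in the list `list`, and obtain the positive region $NPOS_P(D)$ of the breast cancer gene decision attribute $D$ with respect to the gene attributes at granularity $P$;
Step S5.8: If $r(R_0, D) = r(P, D)$, the algorithm terminates, and the final large-scale breast cancer gene reduction set is $R = R_0$;
Step S5.9: If $r(R_0, D) \ne r(P, D)$, move the attribute with the largest dependency in `list` into $R_0$ and return to step S5.8.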
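The single-granularity reduction loop above (neighborhood, positive region, dependency, then greedy addition of attributes until the dependency of the reduct equals that of the full attribute set) can be sketched as follows. This is a simplified illustration under stated assumptions: it uses one fixed radius `delta` instead of the adaptive radius, omits the dense-region screening of step S5.4, and all function names are hypothetical.

```python
import math


def neighborhood(data, i, attrs, delta):
    """Indices of samples within Euclidean distance delta of sample i on attrs."""
    xi = [data[i][a] for a in attrs]
    return {j for j, row in enumerate(data)
            if math.dist(xi, [row[a] for a in attrs]) <= delta}


def dependency(data, labels, attrs, delta):
    """r(B, D) = |NPOS_B(D)| / |U|: share of samples with a label-pure neighborhood."""
    if not attrs:
        return 0.0  # assumption: the empty attribute set has dependency 0
    positive = sum(
        1 for i in range(len(data))
        if all(labels[j] == labels[i] for j in neighborhood(data, i, attrs, delta))
    )
    return positive / len(data)


def greedy_reduct(data, labels, attrs, delta):
    """Add attributes in descending dependency order until r(R0, D) == r(P, D)."""
    target = dependency(data, labels, attrs, delta)
    ranked = sorted(attrs, key=lambda a: dependency(data, labels, [a], delta),
                    reverse=True)
    reduct = []
    for a in ranked:
        if dependency(data, labels, reduct, delta) == target:
            break
        reduct.append(a)
    return reduct


data = [[0.0, 0.1], [0.05, 0.9], [0.9, 0.15], [0.95, 0.85]]
labels = [0, 0, 1, 1]
print(greedy_reduct(data, labels, [0, 1], 0.2))
```

In the toy data, attribute 0 alone separates the two classes, so the loop stops after one attribute and attribute 1 is dropped as redundant.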
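In the multi-granularity breast cancer gene classification method based on double adaptive neighborhood radius provided by the present invention, step S6 comprises the following specific steps:
Step S6.1: Across the granularities, obtain the decision table $S = (U, C \cup D, V, f)$, where $C = \{P_1, P_2, \ldots, P_k\}$, $U = \{x_1, x_2, \ldots, x_m\}$, $D = \{D_1, D_2\}$, $k$ is the number of information granules, and $m$ is the number of breast cancer gene samples. Select the optimal neighborhood radius based on attribute inclusion degree. $i$ and $j$ are sample traversal indices, initialized to 0, with $0 \le i, j \le m$;
Step S6.2: For any information granule $P_t$, first initialize the set $set_i = \emptyset$, which is used to find the lower approximation set for $x_i$. Starting from $x_i$, compute the Euclidean distance under this information granule from $x_i$ to every other point $x_j$. If the Euclidean distance from $x_i$ to $x_j$ is smaller than the neighborhood radius, let $set_i = set_i \cup \{x_i\} \cup \{x_j\}$. After every point has been traversed, $set_i$ is obtained. With the decision attribute $D = \{D_1, D_2\}$, if $set_i \subseteq D_1$ or $set_i \subseteq D_2$, then $set_i$ is called the lower approximation set of $x_i$ in $D_1$ or $D_2$ with respect to $P_t$; otherwise let $set_i = \emptyset$;
Step S6.6: If $r(Red_0, D) = r(C, D)$, the algorithm terminates, and the final breast cancer gene reduction set is $Red = Red_0$;
Step S6.7: If $r(Red_0, D) \ne r(C, D)$, move the attribute with the largest dependency in the list `All_list` into $Red_0$ and return to step S6.6;
Step S6.9: If $r(R_0, D) \ne r(C, D)$, move the breast cancer gene attribute with the largest dependency in $P_{t+1}$, from $Red = \{P_i, \ldots, P_j\}$, into $R_0$ and return to step S6.8.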
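Compared with the prior art, the present invention has the following beneficial effects:
(1) The classifier of the present invention, which combines high accuracy with high recall, makes effective use of the breast cancer reduction set based on the double adaptive neighborhood radius and gives the tested person a highly accurate result in a short time. Compared with other classification methods, the high-recall model also minimizes the costly risk of predicting a cancer patient as healthy. Finally, data analysis on big data, attribute reduction, and machine learning classification, combined with the doctor's clinical experience, help reduce the difficulty of early breast cancer diagnosis, and early cancer screening lets patients receive treatment at the best time.
(2) The present invention can analyze a small number of samples and use attribute reduction to extract the more important gene attributes, which reduces the interference of noisy data with model prediction. Compared with a manually set neighborhood radius, the double adaptive neighborhood radius lets the classifier fit the model better, which further improves detection accuracy. The described embodiment shows that gene prediction can be performed effectively.
(3) The present invention removes a large amount of redundant and noisy gene data through the multi-granularity breast cancer gene classification method based on double adaptive neighborhood radius, reducing the original 24,481 gene attributes to 2,734 in the described embodiment. Ten-fold cross-validation also addresses problems such as small sample size and long running time. Together these greatly reduce model complexity and the time complexity of the algorithm, so that a user who submits gene data can obtain results within a few minutes, which gives the tested person a better testing experience.
(4) Recall is often neglected when samples are compared. Predicting a cancer patient as healthy carries a very high cost, because the tested person may miss the best time for treatment. The multi-granularity breast cancer gene classification method based on double adaptive neighborhood radius takes the risks associated with both detection accuracy and detection recall fully into account: the model is adjusted by setting a penalty term so that, while accuracy stays high, the effect of recall on the model is fully considered, which greatly reduces this risk.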
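Figure 1 is a flowchart of breast cancer gene detection according to the present invention.
Figure 2 is a flowchart of multi-granularity attribute reduction with double adaptive neighborhood radius on breast cancer gene data according to the present invention.
Figure 3 is a flowchart of classification and detection of breast cancer gene data according to the present invention.
Figure 4 is a flowchart of single-granularity attribute reduction with adaptive neighborhood radius on breast cancer gene data according to the present invention.
Figure 5 is a flowchart of multi-granularity attribute reduction with adaptive neighborhood radius on breast cancer gene data according to the present invention.
To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to an embodiment. The specific embodiment described here serves only to explain the invention and does not limit it.
Embodiment 1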
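Referring to Figures 1 to 5, the technical solution provided by the present invention is a multi-granularity breast cancer gene classification method based on double adaptive neighborhood radius, comprising the following steps:
Step 1: Read the breast cancer gene data set and convert it into a neighborhood decision information system $S = (U, AT, V, f, \delta)$, where $U = \{x_1, x_2, x_3, \ldots, x_m\}$ is the set of tested patients in the breast cancer gene data set and $m$ is the number of patients; $C = \{a_1, a_2, \ldots, a_n\}$ is the non-empty finite set of breast cancer gene features and $n$ is the number of features; $D = \{D_1, D_2\}$ is the non-empty finite set of patient class labels, where $D_1$ means the patient has breast cancer and $D_2$ means the patient does not; $AT = C \cup D$ is the set of all gene attributes and decision attributes; $V = \bigcup_{a \in C \cup D} V_a$, where $V_a$ is the set of possible values of gene feature $a$; $f: U \times (C \cup D) \to V$ is an information function that assigns a value to every gene feature of every patient, that is, $f(x, a) \in V_a$ for every $x \in U$ and $a \in C \cup D$; and $\delta$ is the neighborhood threshold;
The model was tested on a breast cancer gene data set with 97 samples and 24,481 gene attributes. The decision attribute has two classes: confirmed breast cancer patients and healthy people.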
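Step 2: Normalize the non-label data in the breast cancer gene data set. The normalization formula is:
$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$
where $x$ is the original value of an attribute in a sample, $x'$ is the normalized value, and $\max(x)$ and $\min(x)$ are the maximum and minimum of that attribute over all samples.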
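Step 3: Use the K-means clustering algorithm to granulate the breast cancer gene data. Combine the silhouette coefficient with PCA dimensionality reduction to obtain the optimal number of information granules $k$, which yields the granularities $C = \{P_1, P_2, \ldots, P_k\}$.
Step 4: Information granulation: randomly select $k$ breast cancer gene samples as cluster centers; using Euclidean distance, assign each sample point to its nearest cluster center; for each cluster, take the mean of its sample points as the new cluster center; when the cluster centers no longer move, the $k$ information granules are obtained;
The silhouette coefficient places the best number of granularities, that is, the value of $k$, in an interval around $k = 90$; PCA dimensionality-reduction visualization then confirms that dividing the data into 90 granularities ($k = 90$) is the most reasonable choice.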
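The granulation loop of step 4 is plain K-means, which can be sketched in a few lines. This is an illustrative version only: the function name `k_means`, the seeded random generator, the `max_iter` cap, and the rule of keeping the old center for an empty cluster are assumptions, and the toy call uses $k = 2$ rather than the embodiment's $k = 90$.

```python
import math
import random


def k_means(points, k, seed=0, max_iter=100):
    """Cluster `points` (tuples) into k groups; returns (centers, clusters)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k random samples as initial cluster centers
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        new_centers = [
            tuple(sum(col) / len(cluster) for col in zip(*cluster))
            if cluster else centers[c]  # assumption: keep the center of an empty cluster
            for c, cluster in enumerate(clusters)
        ]
        if new_centers == centers:  # centers no longer move: converged
            break
        centers = new_centers
    return centers, clusters


centers, clusters = k_means([(0.0,), (1.0,), (10.0,), (11.0,)], 2)
print(sorted(centers))
```

Because the initial centers are random, real data can converge to different local optima for different seeds, which is one reason the method checks the result with the silhouette coefficient and a PCA plot.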
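Step 5: With the breast cancer gene attributes divided among multiple granularities, perform neighborhood rough set attribute reduction with a radius adapted to the cluster-center distance at each granularity: temporarily retain the gene attributes inside the dense similar region, apply multi-layer neighborhood screening to the many gene attributes outside it to remove irrelevant ones, and then use a heuristic search that iterates until the positive region is reached to remove the redundant gene attributes inside the dense similar region, leaving the important breast cancer gene attributes;
For a given granularity, compute the distances between its cluster center and the cluster centers of the other 89 granularities and take the nearest cluster center. This yields the adaptive neighborhood radius (its expression is not reproduced in the source text), where $Z$ is the shortest cluster-center distance and $h$ is the difference between the vertical coordinates of this granularity's cluster center and the nearest cluster center. The single-granularity neighborhood attribute reduction algorithm then gives the reduction set at this granularity, and the reduction sets of the remaining 89 granularities are obtained in the same way.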
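Step 6: Each granularity now has its reduced set of breast cancer gene attributes. Fuse the granularities and, during fusion, use multi-granularity neighborhood attribute reduction based on attribute inclusion degree to remove gene attributes that are similar and redundant across granularities: introduce the concept of attribute inclusion degree, refine its learning curve to obtain the optimal multi-granularity neighborhood radius for the breast cancer gene data, and use a heuristic search based on that radius to remove the redundant gene attributes across granularities, giving the final attribute reduction set;
Collect the neighborhood radii of all granularities and take the largest, 0.2, as the initial multi-granularity neighborhood radius, so that the multi-granularity neighborhood radius ranges over [0, 0.2]. Compute the attribute inclusion degree at each radius in steps of 0.01, and select the radius with the largest attribute inclusion degree, 0.13, as the multi-granularity neighborhood radius. Finally, the multi-granularity neighborhood attribute reduction algorithm fuses the 90 granularities into a final reduction set of 2,734 gene attributes.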
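The radius search described above is a one-dimensional grid search over [0, 0.2] in steps of 0.01. The sketch below shows only that search; the attribute inclusion degree is not defined in the text, so the `score` callable is a hypothetical stand-in supplied by the caller, and the function names are assumptions of this example.

```python
def candidate_radii(upper=0.2, step=0.01):
    """Radii 0, step, 2*step, ..., upper, built from integers to limit float drift."""
    count = round(upper / step)
    return [round(i * step, 10) for i in range(count + 1)]


def best_radius(score, upper=0.2, step=0.01):
    """Radius in [0, upper] that maximizes score(radius).

    `score` stands in for the attribute inclusion degree, which the source
    does not define; any callable mapping a radius to a number works here.
    """
    return max(candidate_radii(upper, step), key=score)


# Toy score peaking at 0.13, chosen only to mirror the value in the embodiment.
print(best_radius(lambda delta: -(delta - 0.13) ** 2))
```

Plotting `score` over `candidate_radii()` gives the learning curve mentioned in step 6.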
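Step 7: Fit a support vector machine (SVM) to the attribute reduction set. Accuracy and recall are both used as metrics, and model stability is taken into account. A penalty term is added to the SVM classifier so that the model achieves good accuracy and good recall together; that is, classification on the breast cancer gene data is highly accurate while the risk of predicting a cancer patient as healthy stays low.
Ten-fold cross-validation is used: each time, 90% of the samples are randomly selected as the training set and 10% as the test set, and an SVM classifier is fitted to the samples. Of the 10 training runs, 7 reach an accuracy above 90%, and the average accuracy is about 85.7%. After a penalty term is introduced to improve the model with recall taken into account, the average prediction accuracy is about 91.2% and the recall is about 82%.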
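The evaluation protocol above can be sketched without the classifier itself: one helper produces ten train/test index splits, and another computes accuracy and recall from predictions. The function names, the fixed seed, and the choice of label 1 for "has breast cancer" are assumptions of this example, and the SVM fitting step is deliberately left out.

```python
import random


def ten_fold_indices(n_samples, folds=10, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    for f in range(folds):
        test = order[f::folds]
        held_out = set(test)
        train = [i for i in order if i not in held_out]
        yield train, test


def accuracy_and_recall(y_true, y_pred, positive=1):
    """Accuracy over all samples and recall for the `positive` class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    recall = true_pos / actual_pos if actual_pos else 0.0
    return correct / len(y_true), recall


for train, test in ten_fold_indices(97):
    print(len(train), len(test))
print(accuracy_and_recall([1, 1, 0, 0], [1, 0, 0, 0]))
```

Recall is the metric tied to the risk the text emphasizes: a false negative is a cancer patient predicted as healthy, and each one lowers recall.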
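Step 8: Input large-scale breast cancer gene data, select the attributes in the reduction set, and use the classifier to obtain the final prediction.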
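In the multi-granularity breast cancer gene classification method based on double adaptive neighborhood radius provided by the present invention, step 3 comprises the following specific steps:
Step 3.1: Evaluate the clustering with the silhouette coefficient. Let $a_i$ be the mean dissimilarity (average distance) between the $i$-th breast cancer gene attribute and the other attributes in its own cluster, and let $b_i$ be its mean dissimilarity to the attributes outside the cluster. The silhouette coefficient of the $i$-th breast cancer gene attribute is then defined as:
$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$$
where $s_i \in [-1, 1]$. The closer the silhouette coefficient is to 1, the better the clustering; a negative silhouette coefficient indicates poor clustering;
The silhouette coefficient is applied in this way within the multi-granularity breast cancer gene classification method based on double adaptive neighborhood radius;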
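Step 3.2: Use principal component analysis (PCA) to reduce the dimensionality of the breast cancer gene data so that it can be visualized, and use the visualization together with the clustering algorithm to check the actual clustering result. The design is as follows:
For $m$ breast cancer gene samples of dimension $n$, the relationships between the variables are described by the covariance matrix:
$$\begin{pmatrix} \operatorname{cov}(c_1, c_1) & \cdots & \operatorname{cov}(c_1, c_n) \\ \vdots & \ddots & \vdots \\ \operatorname{cov}(c_n, c_1) & \cdots & \operatorname{cov}(c_n, c_n) \end{pmatrix}$$
where $\operatorname{cov}(c_i, c_j)$ is the covariance between the $i$-th attribute and the $j$-th attribute;
Then compute the contribution rate $\theta$ and the cumulative contribution rate $\Theta$ of the covariance matrix from the eigenvalues:
$$\theta_i = \frac{y_i}{\sum_{j=1}^{N} y_j}, \qquad \Theta_r = \frac{\sum_{i=1}^{r} y_i}{\sum_{j=1}^{N} y_j}$$
where $N$ is the total number of gene attributes, $y_i$ is the eigenvalue of the $i$-th column, $\theta_i$ is the contribution rate of the $i$-th column of the covariance matrix, and $\Theta_r$ is the cumulative contribution rate of the first $r$ columns.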
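Step 3.3: Take the first $r$ dimensions of the covariance matrix (the eigenvectors of its $r$ largest eigenvalues) as the projection matrix $S_{n \times r}$, and multiply the matrix to be reduced, $Y_{m \times n}$, by $S_{n \times r}$ to obtain the reduced matrix $T_{m \times r}$:
$$Y_{m \times n} \times S_{n \times r} = T_{m \times r} \tag{28}$$
where $m$ is the number of samples in the breast cancer gene data, $n$ is the original number of gene attributes, and $r$ is the number of gene attributes after dimensionality reduction.
Step 3.4: Use the silhouette coefficient to determine a rough range for $k$, then use the PCA visualization to narrow the range and select the best $k$, which is the number of information granules.
PCA dimensionality-reduction visualization finally confirms that dividing the data into 90 granularities, that is, $k = 90$, is the most reasonable choice;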
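In the multi-granularity breast cancer gene classification method based on double adaptive neighborhood radius provided by the present invention, step 5 comprises the following specific steps:
Step 5.1: At a single information granularity, compute the neighborhood of each breast cancer gene sample $x_i$ on a single gene attribute $B$:
$$n_B(x_i) = \{x \in U \mid \Delta_B(x_i, x) \le \delta\} \tag{29}$$
where $\Delta_B$ is the distance function and $\delta$ is the neighborhood radius, $\delta > 0$.
Step 5.2: At a single information granularity, compute the positive region of the breast cancer gene decision attribute $D$ with respect to the single gene attribute $B$:
$$NPOS_B(D) = \bigcup_{j=1}^{2} \{x_i \in U \mid n_B(x_i) \subseteq D_j\}$$
The dependency of the decision attribute $D$ on $B$ is then defined as:
$$r(B, D) = \frac{|NPOS_B(D)|}{|U|}$$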
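Step 5.3: A single granularity contains $z$ gene attributes $P = \{a_1, a_2, \ldots, a_z\}$. Denote the cluster-center coordinates of this information granule by $(b_1, b_2, \ldots, b_n)$ and the cluster-center coordinates of the nearest other information granule by $(d_1, d_2, \ldots, d_n)$. $i$ and $j$ are sample traversal indices, initialized to 0, with $0 \le i, j \le m$;
Step 5.4: At a single granularity, for any breast cancer gene attribute $a_t$, let $S_t$ be the distance from $a_t$ to the cluster center of this information granule. If $S_t$ satisfies the dense-region condition (the inequality is not reproduced in the source text), the attribute is by default treated as a breast cancer gene attribute inside the dense similar region. First initialize the set $set_i = \emptyset$, which is used to find the lower approximation set for $x_i$. Starting from $x_i$, compute the distance under this attribute from $x_i$ to every other point $x_j$ and denote it by $W$. If $W$ does not exceed the neighborhood radius (whose expression is likewise not reproduced in the source text), let $set_i = set_i \cup \{x_i\} \cup \{x_j\}$. After every point has been traversed, $set_i$ is obtained. With the decision attribute $D = \{D_1, D_2\}$, if $set_i \subseteq D_1$ or $set_i \subseteq D_2$, then $set_i$ is called the lower approximation set of $x_i$ in $D_1$ or $D_2$ with respect to $a_t$; otherwise let $set_i = \emptyset$;
Step 5.6: At a single granularity, store the attribute dependencies in descending order in the list `list`, and obtain the positive region $NPOS_P(D)$ of the breast cancer gene decision attribute $D$ with respect to the gene attributes at granularity $P$;
Step 5.8: If $r(R_0, D) = r(P, D)$, the algorithm terminates, and the final large-scale breast cancer gene reduction set is $R = R_0$;
Step 5.9: If $r(R_0, D) \ne r(P, D)$, move the attribute with the largest dependency in `list` into $R_0$ and return to step 5.8.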
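In the multi-granularity breast cancer gene classification method based on double adaptive neighborhood radius provided by the present invention, step 6 comprises the following specific steps:
Step 6.1: Across the granularities, obtain the decision table $S = (U, C \cup D, V, f)$, where $C = \{P_1, P_2, \ldots, P_k\}$, $U = \{x_1, x_2, \ldots, x_m\}$, $D = \{D_1, D_2\}$, $k$ is the number of information granules, and $m$ is the number of breast cancer gene samples. Select the optimal neighborhood radius based on attribute inclusion degree. $i$ and $j$ are sample traversal indices, initialized to 0, with $0 \le i, j \le m$;
For this data set, $k = 90$ and $m = 97$;
Step 6.2: For any information granule $P_t$, first initialize the set $set_i = \emptyset$, which is used to find the lower approximation set for $x_i$. Starting from $x_i$, compute the Euclidean distance under this information granule from $x_i$ to every other point $x_j$. If the Euclidean distance from $x_i$ to $x_j$ is smaller than the neighborhood radius, let $set_i = set_i \cup \{x_i\} \cup \{x_j\}$. After every point has been traversed, $set_i$ is obtained. With the decision attribute $D = \{D_1, D_2\}$, if $set_i \subseteq D_1$ or $set_i \subseteq D_2$, then $set_i$ is called the lower approximation set of $x_i$ in $D_1$ or $D_2$ with respect to $P_t$; otherwise let $set_i = \emptyset$;
Take the largest neighborhood radius, 0.2, as the initial multi-granularity neighborhood radius, so that the multi-granularity neighborhood radius ranges over [0, 0.2]. Compute the attribute inclusion degree at each radius in steps of 0.01, and select the radius with the largest attribute inclusion degree, 0.13, as the multi-granularity neighborhood radius;
Step 6.6: If $r(Red_0, D) = r(C, D)$, the algorithm terminates, and the final breast cancer gene reduction set is $Red = Red_0$;
Step 6.7: If $r(Red_0, D) \ne r(C, D)$, move the attribute with the largest dependency in the list `All_list` into $Red_0$ and return to step 6.6;
Step 6.9: If $r(R_0, D) \ne r(C, D)$, move the breast cancer gene attribute with the largest dependency in $P_{t+1}$, from $Red = \{P_i, \ldots, P_j\}$, into $R_0$ and return to step 6.8.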
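As can be seen, current genetic testing mainly extracts a user's gene data and predicts by comparing them against a company's hundreds of millions of records. These data are not public, so such testing methods are hard to popularize because of the data-source problem. Many public data sets also provide only a small number of samples, which makes a high accuracy rate difficult to reach for high-dimensional gene attributes. The present invention can analyze a small number of samples and extract the more important gene attributes to improve detection accuracy, and the example above shows that gene prediction can be performed effectively.
Moreover, many companies must compare a user's gene data against hundreds of millions of samples in a database, which incurs a considerable time cost: the time complexity of evaluating all gene attributes of a system grows exponentially with the number of gene combinations, and users wait hours or even days for the final result. The present invention removes a large amount of redundant and noisy gene data through the multi-granularity breast cancer gene classification method based on double adaptive neighborhood radius. In the example above, the original 24,481 gene attributes are reduced to 2,734, which greatly reduces the time complexity of the algorithm; a user who submits gene data can obtain results within a few minutes, which gives the tested person an excellent testing experience.
In addition, many companies neglect recall when comparing samples. Predicting a cancer patient as healthy carries a very high cost, because the tested person may miss the best time for treatment. The multi-granularity breast cancer gene classification method based on double adaptive neighborhood radius takes the risks associated with both detection accuracy and detection recall fully into account and adjusts the model accordingly, which greatly reduces this risk.
The above is only a preferred embodiment of the present invention and does not limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.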
Claims (4)
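- A multi-granularity breast cancer gene classification method based on double adaptive neighborhood radius, characterized by comprising the following steps: S1: read the breast cancer gene data set and convert it into a neighborhood decision information system $S = (U, AT, V, f, \delta)$, where $U = \{x_1, x_2, x_3, \ldots, x_m\}$ is the set of tested patients in the breast cancer gene data set and $m$ is the number of patients; $C = \{a_1, a_2, \ldots, a_n\}$ is the non-empty finite set of breast cancer gene features and $n$ is the number of features; $D = \{D_1, D_2\}$ is the non-empty finite set of patient class labels, where $D_1$ means the patient has breast cancer and $D_2$ means the patient does not; $AT = C \cup D$ is the set of all gene attributes and decision attributes; $V = \bigcup_{a \in C \cup D} V_a$, where $V_a$ is the set of possible values of gene feature $a$; $f: U \times (C \cup D) \to V$ is an information function that assigns a value to every gene feature of every patient; and $\delta$ is the neighborhood threshold; S2: normalize the non-label data in the breast cancer gene data set with the formula $x' = \frac{x - \min(x)}{\max(x) - \min(x)}$, where $x$ is the original value of an attribute in a sample, $x'$ is the normalized value, and $\max(x)$ and $\min(x)$ are the maximum and minimum of that attribute over all samples; S3: use the K-means clustering algorithm to granulate the breast cancer gene data, and combine the silhouette coefficient with PCA dimensionality reduction to obtain the optimal number of information granules $k$, yielding the granularities $C = \{P_1, P_2, \ldots, P_k\}$; S4: information granulation: randomly select $k$ breast cancer gene samples as cluster centers; using Euclidean distance, assign each sample point to its nearest cluster center; for each cluster, take the mean of its sample points as the new cluster center; when the cluster centers no longer move, the $k$ information granules are obtained; S5: with the breast cancer gene attributes divided among multiple granularities, perform neighborhood rough set attribute reduction with a radius adapted to the cluster-center distance at each granularity: temporarily retain the gene attributes inside the dense similar region, apply multi-layer neighborhood screening to the many gene attributes outside it to remove irrelevant ones, and then use a heuristic search that iterates until the positive region is reached to remove the redundant gene attributes inside the dense similar region, leaving the important breast cancer gene attributes; S6: with the reduced breast cancer gene attributes obtained at each granularity, fuse the granularities and, during fusion, use multi-granularity neighborhood attribute reduction based on attribute inclusion degree to remove gene attributes that are similar and redundant across granularities: introduce the concept of attribute inclusion degree, refine its learning curve to obtain the optimal multi-granularity neighborhood radius for the breast cancer gene data, and use a heuristic search based on that radius to remove the redundant gene attributes across granularities, giving the final attribute reduction set; S7: fit a support vector machine (SVM) to the attribute reduction set, using accuracy and recall as metrics and taking model stability into account, and add a penalty term to the SVM classifier so that the model achieves good accuracy and good recall together; S8: input large-scale breast cancer gene data, select the attributes in the reduction set, and use the classifier to obtain the final prediction.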
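- The multi-granularity breast cancer gene classification method based on double adaptive neighborhood radius according to claim 1, characterized in that step S3 comprises the following steps: Step S3.1: evaluate the clustering with the silhouette coefficient; let $a_i$ be the mean dissimilarity (average distance) between the $i$-th breast cancer gene attribute and the other attributes in its own cluster and $b_i$ its mean dissimilarity to the attributes outside the cluster; the silhouette coefficient of the $i$-th breast cancer gene attribute is defined as $s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$, where $s_i \in [-1, 1]$; the closer the silhouette coefficient is to 1, the better the clustering, and a negative silhouette coefficient indicates poor clustering; Step S3.2: use principal component analysis (PCA) to reduce the dimensionality of the breast cancer gene data so that it can be visualized, and use the visualization together with the clustering algorithm to check the actual clustering result, as follows: for $m$ breast cancer gene samples of dimension $n$, the relationships between the variables are described by the $n \times n$ covariance matrix $\big(\operatorname{cov}(c_i, c_j)\big)$, where $\operatorname{cov}(c_i, c_j)$ is the covariance between the $i$-th attribute and the $j$-th attribute; then compute the contribution rate $\theta$ and the cumulative contribution rate $\Theta$ of the covariance matrix from the eigenvalues: $\theta_i = y_i / \sum_{j=1}^{N} y_j$ and $\Theta_r = \sum_{i=1}^{r} y_i / \sum_{j=1}^{N} y_j$, where $N$ is the total number of gene attributes, $y_i$ is the eigenvalue of the $i$-th column, $\theta_i$ is the contribution rate of the $i$-th column of the covariance matrix, and $\Theta_r$ is the cumulative contribution rate of the first $r$ columns; Step S3.3: take the first $r$ dimensions of the covariance matrix as the projection matrix $S_{n \times r}$, and multiply the matrix to be reduced, $Y_{m \times n}$, by $S_{n \times r}$ to obtain the reduced matrix $T_{m \times r}$: $Y_{m \times n} \times S_{n \times r} = T_{m \times r}$ (6), where $m$ is the number of samples in the breast cancer gene data, $n$ is the original number of gene attributes, and $r$ is the number of gene attributes after dimensionality reduction; Step S3.4: use the silhouette coefficient to determine a rough range for $k$, then use the PCA visualization to narrow the range and select the best $k$, which is the number of information granules.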
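- The multi-granularity breast cancer gene classification method based on double adaptive neighborhood radius according to claim 1, characterized in that step S5 comprises the following specific steps: Step S5.1: at a single information granularity, compute the neighborhood of each breast cancer gene sample $x_i$ on a single gene attribute $B$: $n_B(x_i) = \{x \in U \mid \Delta_B(x_i, x) \le \delta\}$ (7), where $\Delta_B$ is the distance function and $\delta$ is the neighborhood radius, $\delta > 0$; Step S5.2: at a single information granularity, compute the positive region of the breast cancer gene decision attribute $D$ with respect to the single gene attribute $B$, $NPOS_B(D) = \bigcup_{j=1}^{2} \{x_i \in U \mid n_B(x_i) \subseteq D_j\}$; the dependency of the decision attribute $D$ on $B$ is then defined as $r(B, D) = |NPOS_B(D)| / |U|$; Step S5.3: a single granularity contains $z$ gene attributes $P = \{a_1, a_2, \ldots, a_z\}$; denote the cluster-center coordinates of this information granule by $(b_1, b_2, \ldots, b_n)$ and those of the nearest other information granule by $(d_1, d_2, \ldots, d_n)$; $i$ and $j$ are sample traversal indices, initialized to 0, with $0 \le i, j \le m$; Step S5.4: at a single granularity, for any breast cancer gene attribute $a_t$, let $S_t$ be the distance from $a_t$ to the cluster center of this information granule; if $S_t$ satisfies the dense-region condition (the inequality is not reproduced in the source text), the attribute is by default treated as a breast cancer gene attribute inside the dense similar region; first initialize the set $set_i = \emptyset$, which is used to find the lower approximation set for $x_i$; starting from $x_i$, compute the distance under this attribute from $x_i$ to every other point $x_j$ and denote it by $W$; if $W$ does not exceed the neighborhood radius (whose expression is likewise not reproduced in the source text), let $set_i = set_i \cup \{x_i\} \cup \{x_j\}$; after every point has been traversed, $set_i$ is obtained; with the decision attribute $D = \{D_1, D_2\}$, if $set_i \subseteq D_1$ or $set_i \subseteq D_2$, then $set_i$ is called the lower approximation set of $x_i$ in $D_1$ or $D_2$ with respect to $a_t$; otherwise let $set_i = \emptyset$; Step S5.6: at a single granularity, store the attribute dependencies in descending order in the list `list`, and obtain the positive region $NPOS_P(D)$ of the breast cancer gene decision attribute $D$ with respect to the gene attributes at granularity $P$; Step S5.8: if $r(R_0, D) = r(P, D)$, the algorithm terminates, and the final large-scale breast cancer gene reduction set is $R = R_0$; Step S5.9: if $r(R_0, D) \ne r(P, D)$, move the attribute with the largest dependency in `list` into $R_0$ and return to step S5.8.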
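- The multi-granularity breast cancer gene classification method based on double adaptive neighborhood radius according to claim 1, characterized in that step S6 comprises the following specific steps: Step S6.1: across the granularities, obtain the decision table $S = (U, C \cup D, V, f)$, where $C = \{P_1, P_2, \ldots, P_k\}$, $U = \{x_1, x_2, \ldots, x_m\}$, $D = \{D_1, D_2\}$, $k$ is the number of information granules, and $m$ is the number of breast cancer gene samples; select the optimal neighborhood radius based on attribute inclusion degree; $i$ and $j$ are sample traversal indices, initialized to 0, with $0 \le i, j \le m$; Step S6.2: for any information granule $P_t$, first initialize the set $set_i = \emptyset$, which is used to find the lower approximation set for $x_i$; starting from $x_i$, compute the Euclidean distance under this information granule from $x_i$ to every other point $x_j$; if the Euclidean distance from $x_i$ to $x_j$ is smaller than the neighborhood radius, let $set_i = set_i \cup \{x_i\} \cup \{x_j\}$; after every point has been traversed, $set_i$ is obtained; with the decision attribute $D = \{D_1, D_2\}$, if $set_i \subseteq D_1$ or $set_i \subseteq D_2$, then $set_i$ is called the lower approximation set of $x_i$ in $D_1$ or $D_2$ with respect to $P_t$; otherwise let $set_i = \emptyset$; Step S6.6: if $r(Red_0, D) = r(C, D)$, the algorithm terminates, and the final breast cancer gene reduction set is $Red = Red_0$; Step S6.7: if $r(Red_0, D) \ne r(C, D)$, move the attribute with the largest dependency in the list `All_list` into $Red_0$ and return to step S6.6; Step S6.9: if $r(R_0, D) \ne r(C, D)$, move the breast cancer gene attribute with the largest dependency in $P_{t+1}$, from $Red = \{P_i, \ldots, P_j\}$, into $R_0$ and return to step S6.8.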
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/798,352 US11837329B2 (en) | 2021-07-26 | 2022-02-22 | Method for classifying multi-granularity breast cancer genes based on double self-adaptive neighborhood radius |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110845531.0 | 2021-07-26 | ||
CN202110845531.0A CN113838532B (zh) | 2021-07-26 | 2021-07-26 | Multi-granularity breast cancer gene classification method based on double adaptive neighborhood radius |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023005196A1 true WO2023005196A1 (zh) | 2023-02-02 |
Family
ID=78962844
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2022/077251 WO2023005196A1 (zh) | Multi-granularity breast cancer gene classification method based on double adaptive neighborhood radius | 2021-07-26 | 2022-02-22 |
Country Status (3)
Country | Link |
---|---|
US (1) | US11837329B2 (zh) |
CN (1) | CN113838532B (zh) |
WO (1) | WO2023005196A1 (zh) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
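CN113838532B (zh) * | 2021-07-26 | 2022-11-18 | 南通大学 | Multi-granularity breast cancer gene classification method based on double adaptive neighborhood radius |
CN114675818B (zh) * | 2022-03-29 | 2024-04-19 | 江苏科技大学 | Implementation method of a measurement visualization tool based on rough set theory |
CN115186769B (zh) * | 2022-09-07 | 2022-11-25 | 山东未来网络研究院(紫金山实验室工业互联网创新应用基地) | NLP-based mutant gene classification method |
CN117912712B (zh) * | 2024-03-20 | 2024-05-28 | 徕兄健康科技(威海)有限责任公司 | Big-data-based intelligent management method and system for thyroid disease data |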
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
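CN110023759A (zh) * | 2016-09-19 | 2019-07-16 | 血液学有限公司 | Systems, methods, and articles of manufacture for detecting abnormal cells using multidimensional analysis |
CN110211638A (zh) * | 2019-05-28 | 2019-09-06 | 河南师范大学 | Gene selection method and device considering gene correlation |
CN113838532A (zh) * | 2021-07-26 | 2021-12-24 | 南通大学 | Multi-granularity breast cancer gene classification method based on double adaptive neighborhood radius |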
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040076984A1 (en) * | 2000-12-07 | 2004-04-22 | Roland Eils | Expert system for classification and prediction of generic diseases, and for association of molecular genetic parameters with clinical parameters |
CA2618939A1 (en) * | 2004-08-13 | 2006-04-27 | Jaguar Bioscience Inc. | Systems and methods for identifying diagnostic indicators |
US8165973B2 (en) * | 2007-06-18 | 2012-04-24 | International Business Machines Corporation | Method of identifying robust clustering |
US10252145B2 (en) * | 2016-05-02 | 2019-04-09 | Bao Tran | Smart device |
US10381105B1 (en) * | 2017-01-24 | 2019-08-13 | Bao | Personalized beauty system |
WO2020089835A1 (en) * | 2018-10-31 | 2020-05-07 | Ancestry.Com Dna, Llc | Estimation of phenotypes using dna, pedigree, and historical data |
WO2021092071A1 (en) * | 2019-11-07 | 2021-05-14 | Oncxerna Therapeutics, Inc. | Classification of tumor microenvironments |
CN112163133B (zh) * | 2020-09-25 | 2021-10-08 | 南通大学 | Breast cancer data classification method based on multi-granularity evidence neighborhood rough set |
AU2020103782A4 (en) * | 2020-11-30 | 2021-02-11 | Ningxia Medical University | Pet/ct high-dimensional feature level selection method based on genetic algorithm and variable precision rough set |
-
2021
- 2021-07-26 CN CN202110845531.0A patent/CN113838532B/zh active Active
-
2022
- 2022-02-22 WO PCT/CN2022/077251 patent/WO2023005196A1/zh unknown
- 2022-02-22 US US17/798,352 patent/US11837329B2/en active Active
Non-Patent Citations (3)
Title |
---|
CHENG YI, LIU YONG: "Knowledge Discovery Model Based on Neighborhood Multi-granularity Rough Sets", COMPUTER SCIENCE, KEXUE JISHU WENXIAN CHUBANSHE CHONGQING FENSHE, CN, vol. 46, no. 6, 15 June 2019 (2019-06-15), CN , pages 224 - 230, XP093028935, ISSN: 1002-137X, DOI: 10.118967j.issn.1002-137X.2019.06.034 * |
DING WEIPING; GUAN ZHIJIN; WANG JIEHUA; TIAN DI: "A Layered Co-evolution Based Rough Feature Selection Using Adaptive Neighborhood Radius Hierarchy and Its Application in 3D-MRI", CHINESE JOURNAL OF ELECTRONICS, TECHNOLOGY EXCHANGE LTD., HONG KONG,, HK, vol. 26, no. 6, 1 November 2017 (2017-11-01), HK , pages 1168 - 1176, XP006072400, ISSN: 1022-4653, DOI: 10.1049/cje.2017.01.004 * |
SUN LIN; WANG LANYING; DING WEIPING; QIAN YUHUA; XU JIUCHENG: "Neighborhood multi-granulation rough sets-based attribute reduction using Lebesgue and entropy measures in incomplete neighborhood decision systems", KNOWLEDGE-BASED SYSTEMS, ELSEVIER, AMSTERDAM, NL, vol. 192, 13 December 2019 (2019-12-13), AMSTERDAM, NL , XP086063825, ISSN: 0950-7051, DOI: 10.1016/j.knosys.2019.105373 * |
Also Published As
Publication number | Publication date |
---|---|
US11837329B2 (en) | 2023-12-05 |
CN113838532A (zh) | 2021-12-24 |
CN113838532B (zh) | 2022-11-18 |
US20230197203A1 (en) | 2023-06-22 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 22847816 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |