WO2023005196A1 - Multi-granularity breast cancer gene classification method based on double adaptive neighborhood radius - Google Patents

Multi-granularity breast cancer gene classification method based on double adaptive neighborhood radius

Info

Publication number
WO2023005196A1
WO2023005196A1 PCT/CN2022/077251 CN2022077251W WO2023005196A1 WO 2023005196 A1 WO2023005196 A1 WO 2023005196A1 CN 2022077251 W CN2022077251 W CN 2022077251W WO 2023005196 A1 WO2023005196 A1 WO 2023005196A1
Authority
WO
WIPO (PCT)
Prior art keywords
breast cancer
attribute
cancer gene
gene
granularity
Prior art date
Application number
PCT/CN2022/077251
Other languages
English (en)
French (fr)
Inventor
丁卫平
耿宇
鞠恒荣
黄嘉爽
程纯
孙颖
张毅
李铭
秦廷桢
沈鑫杰
王海鹏
Original Assignee
南通大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 南通大学
Priority to US17/798,352 (patent US11837329B2)
Publication of WO2023005196A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Definitions

  • The invention relates to the technical field of intelligent processing of medical information, and in particular to a multi-granularity breast cancer gene classification method based on double adaptive neighborhood radius.
  • Cancer is one of the most common genetic diseases. Medical research has shown that lung cancer, skin cancer and breast cancer are closely related to genes, and the emergence of cancer can often be explained by gene mutations. If damaged genetic material is not repaired, cancer cells absorb the nutrients of normal cells and divide without limit, which leads to a decline in bodily functions. The cure rate is high for early-stage cancer and low once cancer cells have metastasized, so early detection and early treatment are currently the best approach. Genetic testing is a non-destructive testing method.
  • Analysis of genetic data helps doctors determine whether a patient is at high risk of breast cancer.
  • However, the volume of genetic data is too large. A new method is urgently needed that substantially reduces the redundant genetic data in breast cancer gene classification, shortens the analysis time of breast cancer data, and improves analysis efficiency and precision. Effective early screening of breast cancer is significant for clinical treatment.
  • Detection is mainly used for disease diagnosis.
  • Genetic diagnosis not only greatly improves sensitivity but also yields results in a short time, which helps identify the correct treatment, choose drugs correctly, and avoid adverse reactions caused by indiscriminate drug use.
  • The results of breast cancer genetic testing can help patients decide on the right treatment.
  • The purpose of the present invention is to provide a multi-granularity breast cancer gene classification method based on double adaptive neighborhood radius. It addresses the problem that breast cancer-related gene data are too high-dimensional for the influence of gene mutations on the early discrimination of breast cancer to be observed.
  • By combining the relationships among breast cancer gene data with a double adaptive neighborhood radius, the method solves the difficulty of selecting the neighborhood radius in neighborhood rough sets. Multi-granularity neighborhood rough set attribute reduction is then used to remove noise and redundant data effectively.
  • To achieve this purpose, the present invention adopts the following technical scheme: a multi-granularity breast cancer gene classification method based on double adaptive neighborhood radius, comprising the following steps:
  • S2: Normalize the non-label data in the breast cancer gene dataset.
  • The data normalization formula is as follows: $x' = \frac{x - \min(x)}{\max(x) - \min(x)}$
  • x is the value of an attribute in the original sample
  • x' is the value of that attribute after normalization
  • max(x) is the maximum value of that attribute over all samples
  • min(x) is the minimum value of that attribute over all samples
  • S4: Information granulation procedure: randomly select k breast cancer gene samples as cluster centers and, using Euclidean distance, assign each sample point to its nearest cluster center. For each cluster, compute the mean of the sample points in the cluster as the new cluster center. When the cluster centers no longer change, k information granules are obtained;
  • S5: The breast cancer gene attributes are divided among multiple granularities. At each granularity, perform neighborhood rough set attribute reduction with a radius adapted to cluster-center distance: temporarily retain the gene attributes inside the dense similar region, apply multi-layer neighborhood screening to the large number of gene attributes outside it to remove irrelevant gene attributes, and then use heuristic search, iterating until the positive region is reached, to remove the redundant gene attributes inside the dense similar region. This yields the important breast cancer gene attributes;
  • S6: Each granularity now has its reduced breast cancer gene attributes. Fuse the granularities and, during fusion, use multi-granularity neighborhood attribute reduction based on attribute inclusion degree to remove gene attributes that are similar and redundant across granularities.
  • Introduce the concept of attribute inclusion degree, obtain the optimal multi-granularity neighborhood radius for the breast cancer gene data by refining the learning curve of the attribute inclusion degree, and use heuristic search based on that radius to remove the redundant gene attributes across granularities, finally obtaining the reduced attribute set.
  • S7: Fit an SVM (support vector machine) to the reduced attribute set. Introduce accuracy and recall as the two main metrics and consider the stability of the model as a whole. On top of using the SVM as the model's classifier, introduce a penalty term
  • so that the classification model achieves good accuracy and good recall together; that is, classification predictions from breast cancer gene data have high accuracy while the risk of predicting a cancer patient as a healthy person stays low.
  • S8: Input large-scale breast cancer gene data, select the appropriate attributes using the reduced set, and use the classifier to obtain the final prediction.
  • In the multi-granularity breast cancer gene classification method based on double adaptive neighborhood radius provided by the present invention, the specific steps of step S3 are as follows:
  • Step S3.1: Evaluate the clustering algorithm with the silhouette coefficient. Let $a_i$ be the similarity between the i-th breast cancer gene attribute and the other gene attributes in its cluster, and $b_i$ its similarity with the gene attributes outside the cluster. The silhouette coefficient of the i-th breast cancer gene attribute is then defined as $s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$.
  • The value range of $s_i$ is [-1, 1]. The closer the silhouette coefficient is to 1, the better the clustering; a negative silhouette coefficient indicates poor clustering;
  • Step S3.2: Use the principal component analysis (PCA) dimensionality reduction algorithm to simplify the breast cancer gene data, enable dimension-reduced visualization, and combine it with the clustering algorithm to check the actual clustering effect.
  • The specific design is as follows:
  • N is the total number of gene attributes
  • $y_i$ is the eigenvalue of column i
  • $y_n$ is the eigenvalue of column n
  • $\theta_i$ is the contribution rate of the i-th column of the covariance matrix
  • $\Theta_r$ is the cumulative contribution rate of the first r columns of the covariance matrix
  • Step S3.3: Take the first r dimensions of the covariance matrix as the projection matrix $S_{n \times r}$, and multiply the matrix to be reduced, $Y_{m \times n}$, by the projection matrix to obtain the reduced matrix $T_{m \times r}$: $Y_{m \times n} \times S_{n \times r} = T_{m \times r}$
  • m is the number of samples of breast cancer gene data
  • n is the number of original gene attributes of the breast cancer gene data
  • r is the number of gene attributes of the breast cancer gene data after dimensionality reduction.
  • Step S3.4: Use the silhouette coefficient to determine a rough interval for k, then refine the interval through PCA dimension-reduced visualization to select the best k, which gives the number of information granules.
  • In the multi-granularity breast cancer gene classification method based on double adaptive neighborhood radius provided by the present invention, the specific steps of step S5 are as follows:
  • Step S5.1: At a single information granularity, calculate the neighborhood relation of each breast cancer gene sample $x_i$ on a single gene attribute B: $n_B(x_i) = \{x \in U \mid \Delta_B(x_i, x) \le \delta\}$
  • $\Delta_B$ is the distance function
  • $\delta$ is the neighborhood radius, $\delta > 0$
  • Step S5.2: At a single information granularity, calculate the positive region of the breast cancer gene decision attribute D with respect to a single gene attribute B:
  • Step S5.5: Obtain the positive region of the breast cancer gene decision attribute D with respect to $a_m$
  • Step S5.6: At a single granularity, store the attribute dependencies in descending order in the list named list, and obtain the positive region $NPOS_P(D)$ of the breast cancer gene decision attribute D with respect to the gene attributes at granularity P:
  • Step S5.7: Calculate the dependency of the decision D on the condition attribute P, and initialize $R_0$
  • Step S5.9: If $r(R_0, D) \ne r(P, D)$, put the attribute with the largest dependency in the list named list into $R_0$, and jump to step S5.8.
  • In the multi-granularity breast cancer gene classification method based on double adaptive neighborhood radius provided by the present invention, the specific steps of step S6 are as follows:
  • Step S6.3: Obtain the positive region of the decision attribute D with respect to $P_t$, and calculate the dependency of the decision D on the breast cancer gene condition attribute $P_t$
  • Step S6.4: Store the dependencies of the breast cancer gene attributes in descending order in the list All_list, and obtain the optimistic multi-granularity positive region of the decision attribute D with respect to C as follows:
  • Step S6.5: Calculate the dependency of the decision D on the condition attribute C, and initialize $Red_0$
  • Step S6.7: If $r(Red_0, D) \ne r(C, D)$, put the attribute with the largest dependency in the list All_list into $Red_0$, and jump to step S6.6;
  • The classifier of the present invention, which combines high accuracy with high recall, makes effective use of the breast cancer reduced set based on double adaptive neighborhood radius and gives the person tested a high-accuracy result in a short time.
  • The high-recall model also minimizes the costly risk of predicting a cancer patient as a healthy person.
  • The present invention can analyze data with a small number of samples, extracting the more important gene attributes through attribute reduction to reduce the interference of noisy data on model prediction.
  • The adaptive neighborhood radius lets the classifier learn and fit the model better, further improving detection accuracy.
  • The present invention removes a large amount of redundant and noisy gene data through the multi-granularity breast cancer gene classification method based on double adaptive neighborhood radius. In the example described, the 24,481 gene attributes originally detected are reduced to 2,734.
  • Verification by ten-fold cross-validation addresses the problems of small sample size and long running time. The complexity of the model and the time complexity of the algorithm are greatly reduced, so results for the genetic data a user submits are available within a few minutes, giving the person tested a better experience.
  • Recall is often ignored in testing, yet the cost of predicting a cancer patient as a healthy person is extremely high, because the person tested is likely to miss the best window for treatment.
  • The multi-granularity breast cancer gene classification method based on double adaptive neighborhood radius fully considers the risks tied to both detection accuracy and detection recall. It adjusts the model and sets a penalty term so that recall is taken into account while high model accuracy is maintained, thereby greatly reducing this risk.
  • Fig. 1 is a flow chart of breast cancer gene detection in the present invention.
  • Fig. 2 is a flow chart of the double adaptive neighborhood radius multi-granularity attribute reduction based on breast cancer gene data in the present invention.
  • Fig. 3 is a flow chart of classification and detection of breast cancer gene data in the present invention.
  • Fig. 4 is a flow chart of single-granularity adaptive neighborhood radius attribute reduction for breast cancer gene data in the present invention.
  • Fig. 5 is a flow chart of multi-granularity adaptive neighborhood radius attribute reduction for breast cancer gene data in the present invention.
  • The technical scheme provided by the present invention is a multi-granularity breast cancer gene classification method based on double adaptive neighborhood radius, comprising the following steps:
  • The model was tested on a breast cancer gene dataset with 97 samples and 24,481 gene attributes in total.
  • The decision attribute has two classes: diagnosed breast cancer patients and healthy people.
  • Step 2: Normalize the non-label data in the breast cancer gene dataset.
  • The data normalization formula is as follows: $x' = \frac{x - \min(x)}{\max(x) - \min(x)}$
  • x is the value of an attribute in the original sample
  • x' is the value of that attribute after normalization
  • max(x) is the maximum value of that attribute over all samples
  • min(x) is the minimum value of that attribute over all samples.
  • Step 4: Information granulation procedure: randomly select k breast cancer gene samples as cluster centers and, using Euclidean distance, assign each sample point to its nearest cluster center. For each cluster, compute the mean of the sample points in the cluster as the new cluster center. When the cluster centers no longer change, k information granules are obtained;
  • Step 5: The breast cancer gene attributes are divided among multiple granularities. At each granularity, perform neighborhood rough set attribute reduction with a radius adapted to cluster-center distance: temporarily retain the gene attributes inside the dense similar region, apply multi-layer neighborhood screening to the large number of gene attributes outside it to remove irrelevant gene attributes, and then use heuristic search, iterating until the positive region is reached, to remove the redundant gene attributes inside the dense similar region. This yields the important breast cancer gene attributes;
  • Step 6: Each granularity now has its reduced breast cancer gene attributes. Fuse the granularities and, during fusion, use multi-granularity neighborhood attribute reduction based on attribute inclusion degree to remove gene attributes that are similar and redundant across granularities.
  • Introduce the concept of attribute inclusion degree, obtain the optimal multi-granularity neighborhood radius for the breast cancer gene data by refining the learning curve of the attribute inclusion degree, and use heuristic search based on that radius to remove the redundant gene attributes, finally obtaining the reduced attribute set;
  • Take the neighborhood radii under all granularities and select the largest, 0.2, as the initial multi-granularity neighborhood radius, so that the multi-granularity neighborhood radius ranges over [0, 0.2]. With a step size of 0.01, calculate the attribute inclusion degree at each candidate radius and select the radius with the largest attribute inclusion degree, 0.13, as the multi-granularity neighborhood radius.
  • The multi-granularity neighborhood attribute reduction algorithm was used to fuse 90 granularities, giving a final reduced set of 2,734 gene attributes.
  • Step 7: Fit an SVM (support vector machine) to the reduced attribute set. Introduce accuracy and recall as the two main metrics and consider the stability of the model as a whole. On top of using the SVM as the model's classifier, introduce a penalty term
  • so that the classification model achieves good accuracy and good recall together; that is, classification predictions from breast cancer gene data have high accuracy while the risk of predicting a cancer patient as a healthy person stays low.
  • Ten-fold cross-validation is used: each time, 90% of the samples are selected as the training set and the remaining 10% as the test set, and the SVM classification algorithm is fitted to the samples.
  • Across the 10 training runs, accuracy exceeded 90% in 7 runs, and the average accuracy was about 85.7%.
  • The penalty term was then introduced to improve the model with recall taken into account. The final average prediction accuracy was about 91.2%, and the recall was about 82%.
  • Step 8: Input large-scale breast cancer gene data, select the appropriate attributes using the reduced set, and use the classifier to obtain the final prediction.
  • In the multi-granularity breast cancer gene classification method based on double adaptive neighborhood radius provided by the present invention, the specific steps of step 3 are as follows:
  • Step 3.1: Evaluate the clustering algorithm with the silhouette coefficient. Let $a_i$ be the similarity between the i-th breast cancer gene attribute and the other gene attributes in its cluster, and $b_i$ its similarity with the gene attributes outside the cluster.
  • The silhouette coefficient of the i-th breast cancer gene attribute is then defined as $s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$.
  • The value range of $s_i$ is [-1, 1]. The closer the silhouette coefficient is to 1, the better the clustering; a negative silhouette coefficient indicates poor clustering;
  • a multi-granularity breast cancer gene classification method based on double-adaptive neighborhood radius is obtained through the silhouette coefficient
  • Step 3.2: Use the principal component analysis (PCA) dimensionality reduction algorithm to simplify the breast cancer gene data, enable dimension-reduced visualization, and combine it with the clustering algorithm to check the actual clustering effect.
  • The specific design is as follows:
  • N is the total number of gene attributes
  • $y_i$ is the eigenvalue of column i
  • $y_n$ is the eigenvalue of column n
  • $\theta_i$ is the contribution rate of the i-th column of the covariance matrix
  • $\Theta_r$ is the cumulative contribution rate of the first r columns of the covariance matrix
  • Step 3.3: Take the first r dimensions of the covariance matrix as the projection matrix $S_{n \times r}$, and multiply the matrix to be reduced, $Y_{m \times n}$, by the projection matrix to obtain the reduced matrix $T_{m \times r}$: $Y_{m \times n} \times S_{n \times r} = T_{m \times r}$
  • m is the number of samples of breast cancer gene data
  • n is the number of original gene attributes of the breast cancer gene data
  • r is the number of gene attributes of the breast cancer gene data after dimensionality reduction.
  • Step 3.4: Use the silhouette coefficient to determine a rough interval for k, then refine the interval through PCA dimension-reduced visualization to select the best k, which gives the number of information granules.
  • In the multi-granularity breast cancer gene classification method based on double adaptive neighborhood radius provided by the present invention, the specific steps of step 5 are as follows:
  • Step 5.1: At a single information granularity, calculate the neighborhood relation of each breast cancer gene sample $x_i$ on a single gene attribute B:
  • $n_B(x_i) = \{x \in U \mid \Delta_B(x_i, x) \le \delta\}$
  • Step 5.2: At a single information granularity, calculate the positive region of the breast cancer gene decision attribute D with respect to a single gene attribute B:
  • Z is the shortest cluster center distance
  • h is the difference between the vertical coordinates of the granularity cluster center and the nearest granularity cluster center.
  • Step 5.5: Obtain the positive region of the breast cancer gene decision attribute D with respect to $a_m$
  • Step 5.6: At a single granularity, store the attribute dependencies in descending order in the list named list, and obtain the positive region $NPOS_P(D)$ of the breast cancer gene decision attribute D with respect to the gene attributes at granularity P:
  • Step 5.7: Calculate the dependency of the decision D on the condition attribute P, and initialize $R_0$
  • Step 5.9: If $r(R_0, D) \ne r(P, D)$, put the attribute with the largest dependency in the list named list into $R_0$, and jump to step 5.8.
  • In the multi-granularity breast cancer gene classification method based on double adaptive neighborhood radius provided by the present invention, the specific steps of step 6 are as follows:
  • Step 6.3: Obtain the positive region of the decision attribute D with respect to $P_t$, and calculate the dependency of the decision D on the breast cancer gene condition attribute $P_t$
  • Step 6.4: Store the dependencies of the breast cancer gene attributes in descending order in the list All_list, and obtain the optimistic multi-granularity positive region of the decision attribute D with respect to C as follows:
  • Step 6.5: Calculate the dependency of the decision D on the condition attribute C, and initialize $Red_0$
  • Step 6.7: If $r(Red_0, D) \ne r(C, D)$, put the attribute with the largest dependency in the list All_list into $Red_0$, and jump to step 6.6;
  • Current genetic testing mainly works by extracting a user's genetic data and predicting by comparison against the hundreds of millions of records a company holds.
  • The dataset here provides only a small number of samples, and high accuracy is hard to achieve with high-dimensional gene attributes.
  • The present invention can analyze a small number of samples and extract the more important gene attributes to improve detection accuracy. The above example shows that it can perform gene-based prediction effectively.
  • The present invention removes a large amount of redundant and noisy gene data through the multi-granularity breast cancer gene classification method based on double adaptive neighborhood radius. In the above example,
  • the 24,481 gene attributes originally detected are reduced to 2,734, which greatly reduces the time complexity of the algorithm.
  • Results for the genetic data a user submits are available within a few minutes, giving the person tested an excellent testing experience.
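
The ten-fold evaluation described in the excerpts (90% of the samples for training and 10% for testing in each of 10 runs, on 97 samples) can be sketched as an index split. This is a minimal sketch only: the function name `ten_fold_splits`, the `seed` parameter and the shuffling strategy are assumptions for illustration and are not taken from the patent.

```python
import random


def ten_fold_splits(n_samples, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumed: shuffle before splitting
    # Deal the shuffled indices into n_folds nearly equal folds.
    folds = [indices[f::n_folds] for f in range(n_folds)]
    for f in range(n_folds):
        train = [i for g, fold in enumerate(folds) if g != f for i in fold]
        yield train, folds[f]


# 97 samples, as in the dataset described above.
splits = list(ten_fold_splits(97))
```

A classifier would be fitted on each `train` list and scored on the matching test list, and the per-fold scores would then be averaged.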

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Epidemiology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Pathology (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biotechnology (AREA)
  • Primary Health Care (AREA)
  • Bioethics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

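A multi-granularity breast cancer gene classification method based on double adaptive neighborhood radius. Large-scale gene locus data are read, normalized and analyzed. The silhouette coefficient is combined with PCA dimension-reduced visualization to select the best K value and tune the information granulation model. A heuristic reduction algorithm then implements multi-granularity attribute reduction with a neighborhood radius adapted to cluster-center distance and multi-granularity attribute reduction with a neighborhood radius based on attribute inclusion degree, and an SVM (support vector machine) classification algorithm classifies and predicts on the breast cancer gene data. Adjusting the penalty term gives the model high accuracy and recall in breast cancer gene classification. Removing redundant attributes from large-scale data improves computational efficiency, and using the support information between samples improves the efficiency and precision of breast cancer data classification.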

Description

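Multi-granularity breast cancer gene classification method based on double adaptive neighborhood radius
Technical Field
The present invention relates to the technical field of intelligent processing of medical information, and in particular to a multi-granularity breast cancer gene classification method based on double adaptive neighborhood radius.
Background
Cancer is one of the most common genetic diseases. Medical research has shown that lung cancer, skin cancer and breast cancer are closely related to genes, and the emergence of cancer can often be explained by gene mutations: when damaged genetic material is not repaired, cancer cells absorb the nutrients of normal cells and divide without limit, causing bodily functions to decline. The cure rate is high for early-stage cancer and low after cancer cells have metastasized, so early detection and early treatment are currently the best approach. Genetic testing is a non-destructive testing method. Next-generation sequencing detects thousands of gene loci at once, and data analysis and prediction over these loci at big-data scale is of far-reaching significance for clinical treatment. The breast cancer gene data are analyzed and reduced from the two perspectives of feature engineering and granular computing, and then classified and predicted with machine learning classification algorithms.
In recent years, the NCCN Guidelines for breast cancer have recommended multi-gene testing by high-throughput sequencing for people at high risk of breast cancer with a hereditary family predisposition, to screen for genetic susceptibility genes and thereby prevent disease or guide treatment. This shows that individualized treatment and prevention based on genetic testing is a new direction for breast cancer. For the same high-risk group, the NCCN Guidelines also recommend breast self-examination, enhanced imaging, examination of the corresponding serum tumor markers, and drug-based prevention.
Analysis of genetic data helps doctors determine whether a patient is at high risk of breast cancer. However, the volume of genetic data is too large, and a new method is urgently needed that substantially reduces the redundant genetic data in breast cancer gene classification, shortens the analysis time of breast cancer data, and improves analysis efficiency and precision. Effective early screening of breast cancer is significant for clinical treatment.
Testing is mainly used for disease diagnosis. Genetic diagnosis not only greatly improves sensitivity but also yields results in a short time, which helps identify the correct treatment, choose drugs correctly, and avoid adverse reactions caused by indiscriminate drug use. The results of breast cancer genetic testing can help patients decide on the right treatment.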
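Summary of the Invention
The purpose of the present invention is to provide a multi-granularity breast cancer gene classification method based on double adaptive neighborhood radius. It addresses the problem that breast cancer-related gene data are too high-dimensional for the influence of gene mutations on the early discrimination of breast cancer to be observed. By combining the relationships among breast cancer gene data with a double adaptive neighborhood radius, it solves the difficulty of selecting the neighborhood radius in neighborhood rough sets; multi-granularity neighborhood rough set attribute reduction is then used to remove noise and redundant data effectively.
To achieve this purpose, the present invention adopts the following technical scheme: a multi-granularity breast cancer gene classification method based on double adaptive neighborhood radius, comprising the following steps:
S1: Read the breast cancer gene dataset and convert the data into a neighborhood decision information system $S = (U, AT, V, f, \delta)$, expressed as follows:
$S = (U, AT, V, f, \delta)$, where $U = \{x_1, x_2, x_3, \ldots, x_m\}$ is the set of tested patients in the breast cancer gene dataset and m is the number of patients tested; $C = \{a_1, a_2, \ldots, a_n\}$ is the non-empty finite set of breast cancer gene features and n is the number of gene features; $D = \{D_1, D_2\}$ is the non-empty finite set of class labels of the tested patients; $AT = C \cup D$ denotes all gene attributes together with the decision attribute; $D_1$ means the patient has breast cancer and $D_2$ means the patient does not; and
Figure PCTCN2022077251-appb-000001
$V = \bigcup_{a \in C \cup D} V_a$, where $V_a$ is the set of possible values of gene feature a; $f: U \times (C \cup D) \to V$ is an information function that assigns an information value to each gene feature of each tested patient, that is,
Figure PCTCN2022077251-appb-000002
$x \in U$, $f(x, a) \in V_a$; $\delta$ is the neighborhood threshold;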
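S2: Normalize the non-label data in the breast cancer gene dataset. The data normalization formula is as follows:
$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$
where x is the value of an attribute in the original sample, x' is the value of that attribute after normalization, max(x) is the maximum value of that attribute over all samples, and min(x) is the minimum value of that attribute over all samples;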
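
The normalization in S2 can be illustrated with a short sketch that rescales every attribute (column) to [0, 1]. The function name `min_max_normalize` and the handling of a constant attribute are assumptions added for illustration; the patent does not say how an attribute with max(x) = min(x) is treated.

```python
def min_max_normalize(samples):
    """Scale each attribute (column) of `samples` to the range [0, 1]."""
    columns = list(zip(*samples))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in samples:
        new_row = []
        for value, low, high in zip(row, lows, highs):
            span = high - low
            # Assumption: a constant attribute carries no information, so map it to 0.0.
            new_row.append((value - low) / span if span else 0.0)
        normalized.append(new_row)
    return normalized


# Three samples with three attributes; the last attribute is constant.
expression = [[1.0, 10.0, 5.0], [2.0, 20.0, 5.0], [3.0, 30.0, 5.0]]
scaled = min_max_normalize(expression)
```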
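S3: Use the K-means clustering algorithm to granulate the breast cancer gene data. Combine the silhouette coefficient with PCA dimensionality reduction to obtain the best number of information granules k, finally yielding multiple granularities, $C = \{P_1, P_2, \ldots, P_k\}$;
S4: Information granulation procedure: randomly select k breast cancer gene samples as cluster centers and, using Euclidean distance, assign each sample point to its nearest cluster center. For each cluster, compute the mean of the sample points in the cluster as the new cluster center. When the cluster centers no longer change, k information granules are obtained;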
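
The granulation loop in S4 is the usual K-means iteration, and a minimal sketch follows. The function name `k_means`, the `seed` and `max_iter` parameters, and the rule that an empty cluster keeps its previous center are assumptions for illustration; the toy points are not patent data.

```python
import math
import random


def k_means(samples, k, seed=0, max_iter=100):
    """Return (labels, centers) after iterating assignment and mean update."""
    rng = random.Random(seed)
    # Randomly pick k samples as the initial cluster centers.
    centers = [list(p) for p in rng.sample(samples, k)]
    labels = [0] * len(samples)
    for _ in range(max_iter):
        # Assign every sample to its nearest center by Euclidean distance.
        labels = [
            min(range(k), key=lambda c: math.dist(point, centers[c]))
            for point in samples
        ]
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(samples, labels) if lab == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                # Assumption: an empty cluster keeps its previous center.
                new_centers.append(centers[c])
        if new_centers == centers:  # centers stopped moving, so stop
            break
        centers = new_centers
    return labels, centers


points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (100.0, 0.0), (101.0, 0.0), (102.0, 0.0)]
labels, centers = k_means(points, k=2)
```

Each distinct label corresponds to one information granule. The `max_iter` cap is only a safeguard, since the stopping rule in S4 is that the centers no longer change.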
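S5: The breast cancer gene attributes are now divided among multiple granularities. At each granularity, perform neighborhood rough set attribute reduction with a radius adapted to cluster-center distance: temporarily retain the gene attributes inside the dense similar region, apply multi-layer neighborhood screening to the large number of gene attributes outside it to remove irrelevant gene attributes, and then use heuristic search, iterating until the positive region is reached, to remove the redundant gene attributes inside the dense similar region, yielding the important breast cancer gene attributes;
S6: Each granularity now has its reduced breast cancer gene attributes. Fuse the granularities and, during fusion, use multi-granularity neighborhood attribute reduction based on attribute inclusion degree to remove gene attributes that are similar and redundant across granularities: introduce the concept of attribute inclusion degree, obtain the optimal multi-granularity neighborhood radius for the breast cancer gene data by refining the learning curve of the attribute inclusion degree, and use heuristic search based on that radius to remove the redundant gene attributes across granularities, finally obtaining the reduced attribute set.
S7: Fit an SVM (support vector machine) to the reduced attribute set. Introduce accuracy and recall as the two main metrics and consider the stability of the model as a whole. On top of using the SVM as the model's classifier, introduce a penalty term so that the classification model achieves good accuracy and good recall together; that is, classification predictions from breast cancer gene data have high accuracy while the risk of predicting a cancer patient as a healthy person stays low.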
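
The two metrics named in S7 can be computed as in the sketch below, which treats the breast cancer class as the positive class. The function name `accuracy_and_recall`, the label encoding (1 for cancer, 0 for healthy) and the toy predictions are assumptions for illustration. The SVM and its penalty term are not reproduced here because the text does not specify them.

```python
def accuracy_and_recall(y_true, y_pred, positive=1):
    """Accuracy over all samples and recall on the `positive` (cancer) class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    accuracy = correct / len(y_true)
    # Recall: share of actual cancer patients that the model also flags as cancer.
    recall = true_pos / actual_pos if actual_pos else 0.0
    return accuracy, recall


# Toy case: 4 patients and 6 healthy people; the model misses 2 of the 4 patients.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
acc, rec = accuracy_and_recall(y_true, y_pred)
```

In this toy case accuracy is 0.8 while recall is only 0.5, which is the situation S7 guards against: a model can look accurate while still predicting many cancer patients as healthy.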
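S8: Input large-scale breast cancer gene data, select the appropriate attributes using the reduced set, and use the classifier to obtain the final prediction.
In the multi-granularity breast cancer gene classification method based on double adaptive neighborhood radius provided by the present invention, the specific steps of step S3 are as follows:
Step S3.1: Evaluate the clustering algorithm with the silhouette coefficient. Let $a_i$ be the similarity between the i-th breast cancer gene attribute and the other gene attributes in its cluster, and $b_i$ its similarity with the gene attributes outside the cluster. The silhouette coefficient of the i-th breast cancer gene attribute is then defined as:
$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$$
where $s_i$ ranges over [-1, 1]. The closer the silhouette coefficient is to 1, the better the clustering; a negative silhouette coefficient indicates poor clustering;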
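
A minimal sketch of the silhouette coefficient in step S3.1 follows. It assumes the common reading in which $a_i$ is the mean Euclidean distance to the other members of the same cluster and $b_i$ is the smallest mean distance to any other cluster; the text calls both quantities "similarity", so this reading is an assumption. The function name and the convention of scoring a singleton cluster as 0 are also assumptions, and the sketch expects at least two clusters.

```python
import math


def silhouette_coefficients(samples, labels):
    """Return s_i = (b_i - a_i) / max(a_i, b_i) for every sample."""
    scores = []
    for i, point in enumerate(samples):
        own = [math.dist(point, q) for j, q in enumerate(samples)
               if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # assumed convention for a singleton cluster
            continue
        a_i = sum(own) / len(own)
        # b_i: smallest mean distance from this sample to any other cluster.
        b_i = min(
            sum(math.dist(point, q) for q, lab in zip(samples, labels) if lab == other)
            / labels.count(other)
            for other in set(labels) if other != labels[i]
        )
        scores.append((b_i - a_i) / max(a_i, b_i))
    return scores


points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (100.0, 0.0), (101.0, 0.0), (102.0, 0.0)]
scores = silhouette_coefficients(points, [0, 0, 0, 1, 1, 1])
mean_score = sum(scores) / len(scores)
```

Computing `mean_score` for several candidate values of k and keeping the values with the highest scores gives the rough interval for k mentioned in step S3.4.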
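Step S3.2: Use the principal component analysis (PCA) dimensionality reduction algorithm to simplify the breast cancer gene data, enable dimension-reduced visualization, and combine it with the clustering algorithm to check the actual clustering effect. The specific design is as follows:
For m breast cancer gene data points of dimension n, the covariance matrix describing the relationships among the variables is:
$$\begin{pmatrix} \operatorname{cov}(c_1, c_1) & \cdots & \operatorname{cov}(c_1, c_n) \\ \vdots & \ddots & \vdots \\ \operatorname{cov}(c_n, c_1) & \cdots & \operatorname{cov}(c_n, c_n) \end{pmatrix}$$
where $\operatorname{cov}(c_i, c_j)$ is the covariance between the i-th attribute and the j-th attribute;
Then compute the contribution rate $\theta$ and the cumulative contribution rate $\Theta$ of the covariance matrix from the magnitudes of its eigenvalues:
$$\theta_i = \frac{y_i}{\sum_{n=1}^{N} y_n}$$
where N is the total number of gene attributes, $y_i$ is the eigenvalue of column i, and $y_n$ is the eigenvalue of column n;
$$\Theta_r = \sum_{i=1}^{r} \theta_i$$
where $\theta_i$ is the contribution rate of column i of the covariance matrix, and $\Theta_r$ is the cumulative contribution rate of its first r columns.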
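
The contribution rate and cumulative contribution rate can be computed from a list of eigenvalues as sketched below. The sketch assumes the eigenvalues are already available and sorts them in descending order; the function names and the 0.85 target in `components_needed` are assumptions for illustration, not values from the patent.

```python
def contribution_rates(eigenvalues):
    """theta_i = y_i / sum(y), for eigenvalues sorted in descending order."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    return [y / total for y in ordered]


def cumulative_contribution(thetas):
    """Theta_r for r = 1..N, computed as a running sum of theta_i."""
    running, result = 0.0, []
    for theta in thetas:
        running += theta
        result.append(running)
    return result


def components_needed(eigenvalues, target=0.85):
    """Smallest r whose cumulative contribution reaches `target` (assumed threshold)."""
    cumulative = cumulative_contribution(contribution_rates(eigenvalues))
    for r, big_theta in enumerate(cumulative, start=1):
        if big_theta >= target:
            return r
    return len(eigenvalues)


thetas = contribution_rates([1.0, 4.0, 2.0, 3.0])
r = components_needed([1.0, 4.0, 2.0, 3.0])
```

For visualization, r is typically fixed at 2 or 3 rather than chosen by a threshold; the threshold variant is shown only to make the role of $\Theta_r$ concrete.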
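Step S3.3: Take the first r dimensions of the covariance matrix as the projection matrix $S_{n \times r}$, and multiply the matrix to be reduced, $Y_{m \times n}$, by the projection matrix $S_{n \times r}$ to obtain the reduced matrix $T_{m \times r}$:
$$Y_{m \times n} \times S_{n \times r} = T_{m \times r} \qquad (17)$$
where m is the number of samples of breast cancer gene data, n is the number of original gene attributes of the breast cancer gene data, and r is the number of gene attributes of the breast cancer gene data after dimensionality reduction.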
步骤S3.4:通过轮廓系数确定一个k值粗略的取值区间,再通过PCA降维可视化方式细化区间选取最佳k值,得到信息粒的个数。
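The silhouette evaluation of step S3.1 can be sketched as follows. The formula image is not reproduced in the text, so this sketch assumes the standard definition s_i = (b_i - a_i) / max(a_i, b_i), with a_i the mean distance to the other members of the same cluster and b_i the smallest mean distance to any other cluster; it is consistent with the stated range [-1, 1] but is an assumption about the patent's exact formula. The function names are illustrative.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points given as equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_scores(points, labels):
    """Per-point silhouette coefficients; requires at least two distinct labels."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # common convention for a singleton cluster
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores
```

Averaging these scores for each candidate k and keeping the values of k with the highest averages gives the rough interval that step S3.4 then refines visually.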
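In the multi-granularity breast cancer gene classification method based on double self-adaptive neighborhood radius provided by the present invention, step S5 specifically comprises the following steps:
Step S5.1: Within a single information granularity, compute the neighborhood relation of each breast cancer gene sample x_i under a single gene attribute B:
n_B(x_i)={x∈U | Δ_B(x_i,x)≤δ}  (18)
where Δ_B is a distance function and δ is the neighborhood radius, δ>0.
Step S5.2: Within a single information granularity, compute the positive region of the breast cancer gene decision attribute D with respect to a single gene attribute B:
Figure PCTCN2022077251-appb-000008
The dependency of the decision attribute D on B is then defined as:
Figure PCTCN2022077251-appb-000009
Step S5.3: Consider a single granularity containing z gene attributes P={a_1,a_2,...,a_z}. Denote the cluster center of this information granule by (b_1,b_2,...,b_n), and compute the cluster center of the nearest other information granule, denoted (d_1,d_2,...,d_n); i and j are sample traversal indices, initialized to 0, with 0≤i,j≤m;
Step S5.4: Within a single granularity, for any breast cancer gene attribute a_t, let S_t denote the distance from a_t to the cluster center of this information granule; if
Figure PCTCN2022077251-appb-000010
the attribute is regarded as a breast cancer gene attribute inside the dense similar region. First initialize the set
Figure PCTCN2022077251-appb-000011
which is used to find the lower approximation set associated with x_i. Starting from x_i, compute the distance from x_i to every other point x_j under this attribute and denote it W; if
Figure PCTCN2022077251-appb-000012
(the neighborhood radius), let set_i = set_i ∪ {x_i} ∪ {x_j}. After every point has been traversed, set_i is obtained. For the decision attribute D={D_1,D_2}, if
Figure PCTCN2022077251-appb-000013
Figure PCTCN2022077251-appb-000014
then set_i is called the lower approximation set of x_i in D_1 or D_2 with respect to a_t; otherwise let
Figure PCTCN2022077251-appb-000015
Step S5.5: Obtain the positive region of the breast cancer gene decision attribute D with respect to a_t
Figure PCTCN2022077251-appb-000016
and compute the dependency of the decision attribute D on the conditional attribute a_t as follows:
Figure PCTCN2022077251-appb-000017
Step S5.6: Within a single granularity, store the attribute dependencies in a list named list in descending order, and obtain the positive region NPOS_P(D) of the decision attribute D with respect to the gene attributes of granularity P:
Figure PCTCN2022077251-appb-000018
Step S5.7: Compute the dependency of the decision D on the conditional attributes P
Figure PCTCN2022077251-appb-000019
and initialize
Figure PCTCN2022077251-appb-000020
Step S5.8: If r(R_0,D)=r(P,D), the algorithm terminates, and the final large-scale breast cancer gene reduction set is R=R_0;
Step S5.9: If r(R_0,D)≠r(P,D), put the attribute with the largest dependency remaining in list into R_0 and return to step S5.8.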
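The core of steps S5.1, S5.2 and S5.6 to S5.9 can be sketched in Python under some stated assumptions. The sketch takes the positive region to be the set of samples whose δ-neighborhood carries a single decision label and the dependency to be its share of all samples, which is the usual neighborhood rough set definition; the patent's own formulas are in images that are not reproduced here. It uses a fixed δ supplied by the caller and omits the dense-similar-region screening and the adaptive radius of steps S5.3 and S5.4. All function names are illustrative.

```python
def neighborhood(data, i, attrs, delta):
    """Indices of samples within Euclidean distance delta of sample i on attrs (S5.1)."""
    def dist(p, q):
        return sum((p[a] - q[a]) ** 2 for a in attrs) ** 0.5
    return [j for j in range(len(data)) if dist(data[i], data[j]) <= delta]


def positive_region(data, labels, attrs, delta):
    """Samples whose whole neighborhood shares their decision label (S5.2)."""
    return [
        i for i in range(len(data))
        if all(labels[j] == labels[i] for j in neighborhood(data, i, attrs, delta))
    ]


def dependency(data, labels, attrs, delta):
    """Share of samples that fall in the positive region."""
    return len(positive_region(data, labels, attrs, delta)) / len(data)


def greedy_reduct(data, labels, attrs, delta):
    """Add attributes in descending single-attribute dependency until r(R0, D) = r(P, D)."""
    target = dependency(data, labels, attrs, delta)
    ranked = sorted(
        attrs, key=lambda a: dependency(data, labels, [a], delta), reverse=True
    )
    reduct = []
    for a in ranked:
        if dependency(data, labels, reduct, delta) == target:
            break  # S5.8: dependencies match, stop
        reduct.append(a)  # S5.9: take the best remaining attribute
    return reduct
```

For example, with four samples whose first attribute alone separates the two classes, greedy_reduct returns just that attribute and discards the noisy and constant ones.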
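In the multi-granularity breast cancer gene classification method based on double self-adaptive neighborhood radius provided by the present invention, step S6 specifically comprises the following steps:
Step S6.1: From the multiple granularities, obtain the decision table S=(U,C∪D,V,f), where C={P_1,P_2,...,P_k}, U={x_1,x_2,...,x_m}, D={D_1,D_2}, k is the number of information granules and m is the number of breast cancer gene data samples; select the best neighborhood radius based on the attribute inclusion degree; i and j are sample traversal indices, initialized to 0, with 0≤i,j≤m;
Step S6.2: For any information granule P_t, first initialize the set
Figure PCTCN2022077251-appb-000021
which is used to find the lower approximation set associated with x_i. Starting from x_i, compute the Euclidean distance from x_i to every other point x_j under this information granule; if the Euclidean distance from x_i to x_j is smaller than the neighborhood radius, let set_i = set_i ∪ {x_i} ∪ {x_j}. After every point has been traversed, set_i is obtained. For the decision attribute D={D_1,D_2}, if
Figure PCTCN2022077251-appb-000022
Figure PCTCN2022077251-appb-000023
then set_i is called the lower approximation set of x_i in D_1 or D_2 with respect to P_t; otherwise let
Figure PCTCN2022077251-appb-000024
Step S6.3: Obtain the positive region of the decision attribute D with respect to P_t
Figure PCTCN2022077251-appb-000025
and compute the dependency of the decision D on the breast cancer gene conditional attributes P_t
Figure PCTCN2022077251-appb-000026
Step S6.4: Store the breast cancer gene attribute dependencies in the list All_list in descending order, and obtain the optimistic multi-granularity positive region of the decision attribute D with respect to C,
Figure PCTCN2022077251-appb-000027
as follows:
Figure PCTCN2022077251-appb-000028
Step S6.5: Compute the dependency of the decision D on the conditional attributes C
Figure PCTCN2022077251-appb-000029
and initialize
Figure PCTCN2022077251-appb-000030
Step S6.6: If r(Red_0,D)=r(C,D), the algorithm terminates, and the final breast cancer gene reduction set is Red=Red_0;
Step S6.7: If r(Red_0,D)≠r(C,D), put the attribute with the largest dependency remaining in All_list into Red_0 and return to step S6.6;
Step S6.8: From Red={P_i,...,P_j}, select in turn the attribute in P_t with the largest neighborhood dependency
Figure PCTCN2022077251-appb-000031
Figure PCTCN2022077251-appb-000032
the algorithm terminates, giving R=R_0;
Step S6.9: If r(R_0,D)≠r(C,D), put the breast cancer gene attribute with the largest dependency in P_{t+1} of Red={P_i,...,P_j} into R_0 and return to step S6.8.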
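The optimistic multi-granularity positive region of step S6.4 can be sketched as follows. The defining formula is in an image that is not reproduced, so the sketch assumes the usual optimistic rule from multi-granulation rough set theory: a sample belongs to the positive region if, under at least one granule, its neighborhood carries a single decision label. The names and the single shared radius delta are illustrative assumptions.

```python
def neighborhood(data, i, attrs, delta):
    """Indices of samples within Euclidean distance delta of sample i on attrs."""
    def dist(p, q):
        return sum((p[a] - q[a]) ** 2 for a in attrs) ** 0.5
    return [j for j in range(len(data)) if dist(data[i], data[j]) <= delta]


def optimistic_positive_region(data, labels, granules, delta):
    """Samples that are label-pure under at least one granule (optimistic rule)."""
    positive = []
    for i in range(len(data)):
        for attrs in granules:
            neighbors = neighborhood(data, i, attrs, delta)
            if all(labels[j] == labels[i] for j in neighbors):
                positive.append(i)
                break  # one granule is enough under the optimistic rule
    return positive


def optimistic_dependency(data, labels, granules, delta):
    """Share of samples in the optimistic multi-granularity positive region."""
    return len(optimistic_positive_region(data, labels, granules, delta)) / len(data)
```

In this reading, granules that each resolve a different subset of the samples complement one another: the fused positive region is the union of the per-granule positive regions.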
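Compared with the prior art, the present invention has the following beneficial effects:
(1) The classifier of the present invention, which pursues high accuracy and high recall together, makes effective use of the breast cancer reduction set obtained with the double self-adaptive neighborhood radius and gives the tested person a highly accurate result in a short time. Compared with other classification methods, the high-recall model also minimizes the costly risk of predicting a cancer patient as a healthy person. Finally, data analysis, attribute reduction and machine-learning classification on large data, combined with the doctor's clinical experience, can effectively help doctors reduce the difficulty of early breast cancer diagnosis, and early screening allows patients to be treated at the best time.
(2) The present invention can analyze a small number of samples and, through attribute reduction, extract the more important gene attributes, reducing the interference of noisy data with model prediction. Compared with a manually set neighborhood radius, the double self-adaptive neighborhood radius lets the classifier fit the model better, further improving detection accuracy; as the embodiment shows, gene prediction can be performed effectively.
(3) The present invention removes a large amount of redundant and noisy gene data through the multi-granularity breast cancer gene classification method based on double self-adaptive neighborhood radius; in the embodiment, the original 24481 gene attributes were reduced to 2734. Ten-fold cross-validation also helps with the small sample size and the long running time. Together these greatly reduce the complexity of the model and the time complexity of the algorithm: a user who submits gene data can obtain a result within a few minutes, which gives the tested person a better experience.
(4) Recall is often neglected when samples are compared. Predicting a cancer patient as a healthy person carries a very high cost, since the tested person may miss the best window for treatment. The multi-granularity breast cancer gene classification method based on double self-adaptive neighborhood radius fully considers the risks associated with both accuracy and recall and adjusts the model accordingly: by setting a penalty term, it takes the effect of recall into account while keeping accuracy high, which greatly reduces this risk.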
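Brief Description of the Drawings
Fig. 1 is a flowchart of breast cancer gene detection according to the present invention.
Fig. 2 is a flowchart of multi-granularity attribute reduction with double self-adaptive neighborhood radius on breast cancer gene data according to the present invention.
Fig. 3 is a flowchart of classification and detection of breast cancer gene data according to the present invention.
Fig. 4 is a flowchart of single-granularity attribute reduction with self-adaptive neighborhood radius on breast cancer gene data according to the present invention.
Fig. 5 is a flowchart of multi-granularity attribute reduction with self-adaptive neighborhood radius on breast cancer gene data according to the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to an embodiment. The specific embodiment described here is only intended to explain the present invention and does not limit it.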
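Example 1
Referring to Figs. 1 to 5, the present invention provides the following technical solution: a multi-granularity breast cancer gene classification method based on double self-adaptive neighborhood radius, comprising the following steps:
Step 1: Read the breast cancer gene data set and convert the data into a five-tuple neighborhood decision information system S=(U,AT,V,f,δ), expressed as follows:
S=(U,AT,V,f,δ), where U={x_1,x_2,x_3,...,x_m} is the set of tested patient objects in the breast cancer gene data set and m is the number of tested patients; C={a_1,a_2,...,a_n} is the non-empty finite set of breast cancer gene features and n is the number of gene features; D={D_1,D_2} is the non-empty finite set of class labels of the tested patients; AT=C∪D denotes all gene attributes together with the decision attribute; D_1 indicates that the patient has breast cancer and D_2 indicates that the patient does not have breast cancer, and
Figure PCTCN2022077251-appb-000033
V=∪_{a∈C∪D} V_a, where V_a is the set of possible values of gene feature a; f: U×(C∪D)→V is an information function that assigns an information value to each gene feature of each tested patient, that is,
Figure PCTCN2022077251-appb-000034
x∈U, f(x,a)∈V_a; δ is the neighborhood threshold;
The model above was tested on a breast cancer gene data set with 97 samples and 24481 gene attributes in total; the decision attribute has two classes, confirmed breast cancer patients and healthy people.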
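Step 2: Normalize the non-label data in the breast cancer gene data set using the following formula:
Figure PCTCN2022077251-appb-000035
where x is the original value of a given attribute for a sample, x' is the normalized value of that attribute, max(x) is the maximum of that attribute over all samples, and min(x) is the minimum of that attribute over all samples.
Step 3: Use the K-means clustering algorithm to granulate the breast cancer gene data, and determine the optimal number k of information granules by combining the silhouette coefficient with PCA dimensionality reduction, finally obtaining multiple granularities C={P_1,P_2,...,P_k}.
Step 4: Information granulation is implemented as follows: randomly select k breast cancer gene samples as cluster centers; using the Euclidean distance, assign each sample point to its nearest cluster center; for each cluster, take the mean of the sample points in the cluster as the new cluster center; when the cluster centers no longer change, k information granules are obtained;
The silhouette coefficient places the best number of granularities in an interval around k=90, and PCA visualization then confirms that dividing the data into 90 granularities (k=90) is the most reasonable choice.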
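Step 5: The breast cancer gene attributes are now divided among multiple granularities. Within each granularity, perform neighborhood rough set attribute reduction with a radius that adapts to the cluster-center distance: temporarily retain the gene attributes inside the dense similar region, apply multi-layer neighborhood screening to the large number of gene attributes outside the dense similar region to remove irrelevant gene attributes, then use a heuristic search that iterates until the positive region is reached to remove the redundant gene attributes inside the dense similar region, obtaining the important breast cancer gene attributes;
For a given granularity, compute the distances from its cluster center to the cluster centers of the other 89 granularities and select the nearest one; the self-adaptive neighborhood radius is then
Figure PCTCN2022077251-appb-000036
where Z is the shortest cluster-center distance and h is the difference between the vertical coordinates of this granularity's cluster center and the nearest cluster center. The single-granularity neighborhood attribute reduction algorithm is then applied to obtain the reduction set of this granularity, and the reduction sets of the remaining 89 granularities are obtained in the same way.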
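Step 6: Each granularity now has its reduced breast cancer gene attributes. Fuse the granularities and, during fusion, apply multi-granularity neighborhood attribute reduction based on the attribute inclusion degree to remove similar, redundant gene attributes across granularities: introduce the concept of attribute inclusion degree, obtain the optimal multi-granularity neighborhood radius for the breast cancer gene data by refining the learning curve of the attribute inclusion degree, and, based on that radius, use a heuristic search to remove the redundant gene attributes across granularities, finally obtaining the attribute reduction set;
Take the neighborhood radii of all granularities and use the largest one, 0.2, as the initial multi-granularity neighborhood radius, so that the multi-granularity neighborhood radius ranges over [0,0.2]. With a step of 0.01, compute the attribute inclusion degree at each candidate radius and select the radius with the largest attribute inclusion degree, 0.13, as the multi-granularity neighborhood radius. Finally, the multi-granularity neighborhood attribute reduction algorithm fuses the 90 granularities into a final reduction set of 2734 gene attributes.
Step 7: Fit an SVM (support vector machine) to the attribute reduction set. Introduce accuracy and recall as two indicators and take the stability of the model into account; on top of the SVM classifier, introduce a penalty term so that the classification model achieves both good accuracy and good recall, that is, predictions on breast cancer gene data are highly accurate while the risk of predicting a cancer patient as a healthy person stays low.
Ten-fold cross-validation is used: each time, 90% of the samples are taken as the training set and 10% as the test set, and the SVM classifier is fitted to the samples, for 10 training runs in total. In 7 of the runs the accuracy exceeded 90%, and the average accuracy was about 85.7%. After the model was improved with a penalty term that also takes recall into account, the final model reached an average accuracy of about 91.2% and a recall of about 82%.
Step 8: Input large-scale breast cancer gene data, select the appropriate attributes using the reduction set, and obtain the final prediction from the classifier.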
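The evaluation protocol of step 7 can be sketched without any machine-learning library. The sketch below only covers the ten-fold partition of the 97 samples and the two reported indicators; it does not reproduce the SVM or the penalty term, whose exact form the text does not give (a class-weighted misclassification cost is one common way to trade accuracy against recall, but whether the patent uses it is not stated). Function names, the seed and the choice of label 1 for "has breast cancer" are illustrative assumptions.

```python
import random


def ten_fold_indices(n_samples, n_folds=10, seed=0):
    """Shuffle sample indices and deal them into n_folds disjoint test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[f::n_folds] for f in range(n_folds)]


def accuracy_and_recall(y_true, y_pred, positive=1):
    """Accuracy over all samples and recall on the positive (patient) class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    false_neg = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    actual_pos = true_pos + false_neg
    recall = true_pos / actual_pos if actual_pos else 0.0
    return correct / len(y_true), recall


folds = ten_fold_indices(97)
for test_fold in folds:
    held_out = set(test_fold)
    train = [i for i in range(97) if i not in held_out]
    # A classifier would be fitted on `train` and scored on `test_fold` here.
```

With 97 samples the folds hold 9 or 10 samples each, so a single misclassified patient shifts a fold's recall noticeably; averaging over the ten folds, as the embodiment does, smooths this out.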
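In the multi-granularity breast cancer gene classification method based on double self-adaptive neighborhood radius provided by the present invention, step 3 specifically comprises the following steps:
Step 3.1: Evaluate the clustering with the silhouette coefficient. Let a_i be the dissimilarity (mean distance) between the i-th breast cancer gene attribute and the other gene attributes in its own cluster, and b_i its dissimilarity to the gene attributes outside its cluster; the silhouette coefficient of the i-th gene attribute is then defined as:
Figure PCTCN2022077251-appb-000037
where s_i takes values in [-1,1]; the closer the silhouette coefficient is to 1, the better the clustering, and a negative silhouette coefficient indicates poor clustering;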
In this embodiment, the silhouette coefficient places the best number of granularities in an interval around k=90;
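Step 3.2: Use principal component analysis (PCA) to reduce the dimensionality of the breast cancer gene data so that it can be visualized, and combine it with the clustering algorithm to check the actual clustering result, as follows:
For m breast cancer gene samples of dimension n, the covariance matrix describing the relationships among the variables is:
Figure PCTCN2022077251-appb-000038
where cov(c_i,c_j) is the covariance between the i-th attribute and the j-th attribute;
Then compute the contribution rate θ and the cumulative contribution rate Θ from the magnitudes of the eigenvalues of the covariance matrix:
Figure PCTCN2022077251-appb-000039
where N is the total number of gene attributes, y_i is the eigenvalue of the i-th column, and y_n is the eigenvalue of the n-th column;
Figure PCTCN2022077251-appb-000040
where θ_i is the contribution rate of the i-th column of the covariance matrix and Θ_r is the cumulative contribution rate of the first r columns.
Step 3.3: Take the eigenvectors of the covariance matrix that correspond to the r largest eigenvalues as the projection matrix S_{n×r}, and multiply the matrix to be reduced, Y_{m×n}, by the projection matrix S_{n×r} to obtain the reduced matrix T_{m×r}:
Y_{m×n} × S_{n×r} = T_{m×r}  (28)
where m is the number of samples in the breast cancer gene data, n is the original number of gene attributes, and r is the number of gene attributes after dimensionality reduction.
Step 3.4: Use the silhouette coefficient to determine a rough interval for k, then refine the interval by PCA visualization to select the best k, which gives the number of information granules.
PCA visualization finally confirms that dividing the data into 90 granularities (k=90) is the most reasonable choice;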
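The contribution rate and cumulative contribution rate of step 3.2 can be illustrated with a short sketch. It assumes the usual PCA definitions, θ_i = y_i / Σ y and Θ_r = θ_1 + ... + θ_r over eigenvalues sorted in descending order, since the formula images are not reproduced; the threshold used to pick r is a hypothetical parameter that the text does not specify.

```python
def contribution_rates(eigenvalues):
    """Contribution rate of each eigenvalue, largest first."""
    total = sum(eigenvalues)
    return [value / total for value in sorted(eigenvalues, reverse=True)]


def components_needed(eigenvalues, threshold):
    """Smallest r whose cumulative contribution rate reaches the threshold."""
    cumulative = 0.0
    for r, theta in enumerate(contribution_rates(eigenvalues), start=1):
        cumulative += theta
        if cumulative >= threshold:
            return r
    return len(eigenvalues)
```

For eigenvalues 4, 2, 1 and 1 the rates are 0.5, 0.25, 0.125 and 0.125, so two components already carry 75% of the variance. For plotting, r is typically fixed at 2 or 3 regardless of the threshold.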
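In the multi-granularity breast cancer gene classification method based on double self-adaptive neighborhood radius provided by the present invention, step 5 specifically comprises the following steps:
Step 5.1: Within a single information granularity, compute the neighborhood relation of each breast cancer gene sample x_i under a single gene attribute B:
n_B(x_i)={x∈U | Δ_B(x_i,x)≤δ}  (29)
where Δ_B is a distance function and δ is the neighborhood radius, δ>0.
Step 5.2: Within a single information granularity, compute the positive region of the breast cancer gene decision attribute D with respect to a single gene attribute B:
Figure PCTCN2022077251-appb-000041
The dependency of the decision attribute D on B is then defined as:
Figure PCTCN2022077251-appb-000042
Step 5.3: Consider a single granularity containing z gene attributes P={a_1,a_2,...,a_z}. Denote the cluster center of this information granule by (b_1,b_2,...,b_n), and compute the cluster center of the nearest other information granule, denoted (d_1,d_2,...,d_n); i and j are sample traversal indices, initialized to 0, with 0≤i,j≤m;
Step 5.4: Within a single granularity, for any breast cancer gene attribute a_t, let S_t denote the distance from a_t to the cluster center of this information granule; if
Figure PCTCN2022077251-appb-000043
the attribute is regarded as a breast cancer gene attribute inside the dense similar region. First initialize the set
Figure PCTCN2022077251-appb-000044
which is used to find the lower approximation set associated with x_i. Starting from x_i, compute the distance from x_i to every other point x_j under this attribute and denote it W; if
Figure PCTCN2022077251-appb-000045
(the neighborhood radius), let set_i = set_i ∪ {x_i} ∪ {x_j}. After every point has been traversed, set_i is obtained. For the decision attribute D={D_1,D_2}, if
Figure PCTCN2022077251-appb-000046
Figure PCTCN2022077251-appb-000047
then set_i is called the lower approximation set of x_i in D_1 or D_2 with respect to a_t; otherwise let
Figure PCTCN2022077251-appb-000048
The neighborhood radius is obtained as
Figure PCTCN2022077251-appb-000049
where Z is the shortest cluster-center distance and h is the difference between the vertical coordinates of this granularity's cluster center and the nearest cluster center.
Step 5.5: Obtain the positive region of the breast cancer gene decision attribute D with respect to a_t
Figure PCTCN2022077251-appb-000050
and compute the dependency of the decision attribute D on the conditional attribute a_t as follows:
Figure PCTCN2022077251-appb-000051
Step 5.6: Within a single granularity, store the attribute dependencies in a list named list in descending order, and obtain the positive region NPOS_P(D) of the decision attribute D with respect to the gene attributes of granularity P:
Figure PCTCN2022077251-appb-000052
Step 5.7: Compute the dependency of the decision D on the conditional attributes P
Figure PCTCN2022077251-appb-000053
and initialize
Figure PCTCN2022077251-appb-000054
Step 5.8: If r(R_0,D)=r(P,D), the algorithm terminates, and the final large-scale breast cancer gene reduction set is R=R_0;
Step 5.9: If r(R_0,D)≠r(P,D), put the attribute with the largest dependency remaining in list into R_0 and return to step 5.8.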
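In the multi-granularity breast cancer gene classification method based on double self-adaptive neighborhood radius provided by the present invention, step 6 specifically comprises the following steps:
Step 6.1: From the multiple granularities, obtain the decision table S=(U,C∪D,V,f), where C={P_1,P_2,...,P_k}, U={x_1,x_2,...,x_m}, D={D_1,D_2}, k is the number of information granules and m is the number of breast cancer gene data samples; select the best neighborhood radius based on the attribute inclusion degree; i and j are sample traversal indices, initialized to 0, with 0≤i,j≤m;
For this data set, k=90 and m=97;
Step 6.2: For any information granule P_t, first initialize the set
Figure PCTCN2022077251-appb-000055
which is used to find the lower approximation set associated with x_i. Starting from x_i, compute the Euclidean distance from x_i to every other point x_j under this information granule; if the Euclidean distance from x_i to x_j is smaller than the neighborhood radius, let set_i = set_i ∪ {x_i} ∪ {x_j}. After every point has been traversed, set_i is obtained. For the decision attribute D={D_1,D_2}, if
Figure PCTCN2022077251-appb-000056
Figure PCTCN2022077251-appb-000057
then set_i is called the lower approximation set of x_i in D_1 or D_2 with respect to P_t; otherwise let
Figure PCTCN2022077251-appb-000058
The largest neighborhood radius, 0.2, is used as the initial multi-granularity neighborhood radius, so that the multi-granularity neighborhood radius ranges over [0,0.2]. With a step of 0.01, the attribute inclusion degree is computed at each candidate radius, and the radius with the largest attribute inclusion degree, 0.13, is selected as the multi-granularity neighborhood radius;
Step 6.3: Obtain the positive region of the decision attribute D with respect to P_t
Figure PCTCN2022077251-appb-000059
and compute the dependency of the decision D on the breast cancer gene conditional attributes P_t
Figure PCTCN2022077251-appb-000060
Step 6.4: Store the breast cancer gene attribute dependencies in the list All_list in descending order, and obtain the optimistic multi-granularity positive region of the decision attribute D with respect to C,
Figure PCTCN2022077251-appb-000061
as follows:
Figure PCTCN2022077251-appb-000062
Step 6.5: Compute the dependency of the decision D on the conditional attributes C
Figure PCTCN2022077251-appb-000063
and initialize
Figure PCTCN2022077251-appb-000064
Step 6.6: If r(Red_0,D)=r(C,D), the algorithm terminates, and the final breast cancer gene reduction set is Red=Red_0;
Step 6.7: If r(Red_0,D)≠r(C,D), put the attribute with the largest dependency remaining in All_list into Red_0 and return to step 6.6;
Step 6.8: From Red={P_i,...,P_j}, select in turn the attribute in P_t with the largest neighborhood dependency
Figure PCTCN2022077251-appb-000065
Figure PCTCN2022077251-appb-000066
the algorithm terminates, giving R=R_0;
Step 6.9: If r(R_0,D)≠r(C,D), put the breast cancer gene attribute with the largest dependency in P_{t+1} of Red={P_i,...,P_j} into R_0 and return to step 6.8.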
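The radius selection described in this embodiment (candidates in [0, 0.2] at a step of 0.01, keeping the radius with the largest attribute inclusion degree) amounts to a one-dimensional grid search, sketched below. The text does not define how the attribute inclusion degree is computed, so the sketch takes it as a caller-supplied function named score; that parameter, like the function name sweep_radius, is a hypothetical placeholder.

```python
def sweep_radius(score, upper=0.2, step=0.01):
    """Return the candidate radius in [0, upper] that maximizes score(radius)."""
    n_steps = round(upper / step)
    # Rounding avoids candidates such as 0.13000000000000003 from float arithmetic.
    candidates = [round(i * step, 10) for i in range(n_steps + 1)]
    return max(candidates, key=score)


# Usage with a stand-in score that happens to peak at 0.13; a real run would
# pass the attribute inclusion degree computed on the fused granularities.
best = sweep_radius(lambda radius: -(radius - 0.13) ** 2)
```

With the default arguments the sweep evaluates 21 candidate radii, which stays cheap even when each evaluation recomputes neighborhoods over all 97 samples.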
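As can be seen, current gene testing mainly works by extracting a user's gene data and predicting by comparison against a company's hundreds of millions of records. These data are not public, so gene testing methods are hard to popularize because of the data source problem, and many public data sets provide only a small number of samples, which makes high accuracy difficult to reach with high-dimensional gene attributes. The present invention can analyze a small number of samples and extract the more important gene attributes to improve detection accuracy; the example above shows that gene prediction can be performed effectively.
Moreover, because many companies need to compare a user's gene data against hundreds of millions of samples in a database, the time cost is considerable: the time complexity of computing over all gene attributes grows exponentially with the combinations of genes, and users wait hours or even days for the final result. The multi-granularity breast cancer gene classification method based on double self-adaptive neighborhood radius removes a large amount of redundant and noisy gene data; in the example above, the original 24481 gene attributes were reduced to 2734, which greatly reduces the time complexity of the algorithm. A user who submits gene data can obtain a result within a few minutes, which gives the tested person a much better experience.
In addition, many companies neglect recall when comparing samples. Predicting a cancer patient as a healthy person carries a very high cost, since the tested person may miss the best window for treatment. The multi-granularity breast cancer gene classification method based on double self-adaptive neighborhood radius fully considers the risks associated with both accuracy and recall and adjusts the model accordingly, which greatly reduces this risk.
The above is only a preferred embodiment of the present invention and is not intended to limit the present invention; any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.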

Claims (4)

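  1. A multi-granularity breast cancer gene classification method based on double self-adaptive neighborhood radius, characterized by comprising the following steps:
    S1: Read the breast cancer gene data set and convert the data into a five-tuple neighborhood decision information system S=(U,AT,V,f,δ), expressed as follows:
    S=(U,AT,V,f,δ), where U={x_1,x_2,x_3,...,x_m} is the set of tested patient objects in the breast cancer gene data set and m is the number of tested patients; C={a_1,a_2,...,a_n} is the non-empty finite set of breast cancer gene features and n is the number of gene features; D={D_1,D_2} is the non-empty finite set of class labels of the tested patients; AT=C∪D denotes all gene attributes together with the decision attribute; D_1 indicates that the patient has breast cancer and D_2 indicates that the patient does not have breast cancer, and
    Figure PCTCN2022077251-appb-100001
    V=∪_{a∈C∪D} V_a, where V_a is the set of possible values of gene feature a; f: U×(C∪D)→V is an information function that assigns an information value to each gene feature of each tested patient, that is,
    Figure PCTCN2022077251-appb-100002
    δ is the neighborhood threshold;
    S2: Normalize the non-label data in the breast cancer gene data set using the following formula:
    Figure PCTCN2022077251-appb-100003
    where x is the original value of a given attribute for a sample, x' is the normalized value of that attribute, max(x) is the maximum of that attribute over all samples, and min(x) is the minimum of that attribute over all samples;
    S3: Use the K-means clustering algorithm to granulate the breast cancer gene data, and determine the optimal number k of information granules by combining the silhouette coefficient with PCA dimensionality reduction, obtaining multiple granularities C={P_1,P_2,...,P_k};
    S4: Information granulation is implemented as follows: randomly select k breast cancer gene samples as cluster centers; using the Euclidean distance, assign each sample point to its nearest cluster center; for each cluster, take the mean of the sample points in the cluster as the new cluster center; when the cluster centers no longer change, k information granules are obtained;
    S5: The breast cancer gene attributes are divided among multiple granularities; within each granularity, perform neighborhood rough set attribute reduction with a radius that adapts to the cluster-center distance: temporarily retain the gene attributes inside the dense similar region, apply multi-layer neighborhood screening to the large number of gene attributes outside the dense similar region to remove irrelevant gene attributes, then use a heuristic search that iterates until the positive region is reached to remove the redundant gene attributes inside the dense similar region, obtaining the important breast cancer gene attributes;
    S6: Each granularity has its reduced breast cancer gene attributes; fuse the granularities and, during fusion, apply multi-granularity neighborhood attribute reduction based on the attribute inclusion degree to remove similar, redundant gene attributes across granularities: introduce the concept of attribute inclusion degree, obtain the optimal multi-granularity neighborhood radius for the breast cancer gene data by refining the learning curve of the attribute inclusion degree, and, based on that radius, use a heuristic search to remove the redundant gene attributes across granularities, finally obtaining the attribute reduction set;
    S7: Fit an SVM (support vector machine) to the attribute reduction set; introduce accuracy and recall as two indicators and take the stability of the model into account; on top of the SVM classifier, introduce a penalty term so that the classification model achieves both good accuracy and good recall;
    S8: Input large-scale breast cancer gene data, select the appropriate attributes using the reduction set, and obtain the final prediction from the classifier.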
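  2. The multi-granularity breast cancer gene classification method based on double self-adaptive neighborhood radius according to claim 1, characterized in that step S3 specifically comprises the following steps:
    Step S3.1: Evaluate the clustering with the silhouette coefficient; let a_i be the dissimilarity (mean distance) between the i-th breast cancer gene attribute and the other gene attributes in its own cluster, and b_i its dissimilarity to the gene attributes outside its cluster; the silhouette coefficient of the i-th gene attribute is then defined as:
    Figure PCTCN2022077251-appb-100004
    where s_i takes values in [-1,1]; the closer the silhouette coefficient is to 1, the better the clustering, and a negative silhouette coefficient indicates poor clustering;
    Step S3.2: Use principal component analysis (PCA) to reduce the dimensionality of the breast cancer gene data so that it can be visualized, and combine it with the clustering algorithm to check the actual clustering result, as follows:
    For m breast cancer gene samples of dimension n, the covariance matrix describing the relationships among the variables is:
    Figure PCTCN2022077251-appb-100005
    where cov(c_i,c_j) is the covariance between the i-th attribute and the j-th attribute;
    Then compute the contribution rate θ and the cumulative contribution rate Θ from the magnitudes of the eigenvalues of the covariance matrix:
    Figure PCTCN2022077251-appb-100006
    where N is the total number of gene attributes, y_i is the eigenvalue of the i-th column, and y_n is the eigenvalue of the n-th column;
    Figure PCTCN2022077251-appb-100007
    where θ_i is the contribution rate of the i-th column of the covariance matrix and Θ_r is the cumulative contribution rate of the first r columns;
    Step S3.3: Take the eigenvectors of the covariance matrix that correspond to the r largest eigenvalues as the projection matrix S_{n×r}, and multiply the matrix to be reduced, Y_{m×n}, by the projection matrix S_{n×r} to obtain the reduced matrix T_{m×r}:
    Y_{m×n} × S_{n×r} = T_{m×r}       (6)
    where m is the number of samples in the breast cancer gene data, n is the original number of gene attributes, and r is the number of gene attributes after dimensionality reduction;
    Step S3.4: Use the silhouette coefficient to determine a rough interval for k, then refine the interval by PCA visualization to select the best k, which gives the number of information granules.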
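  3. The multi-granularity breast cancer gene classification method based on double self-adaptive neighborhood radius according to claim 1, characterized in that step S5 specifically comprises the following steps:
    Step S5.1: Within a single information granularity, compute the neighborhood relation of each breast cancer gene sample x_i under a single gene attribute B:
    n_B(x_i)={x∈U | Δ_B(x_i,x)≤δ}       (7)
    where Δ_B is a distance function and δ is the neighborhood radius, δ>0;
    Step S5.2: Within a single information granularity, compute the positive region of the breast cancer gene decision attribute D with respect to a single gene attribute B:
    Figure PCTCN2022077251-appb-100008
    The dependency of the decision attribute D on B is then defined as:
    Figure PCTCN2022077251-appb-100009
    Step S5.3: Consider a single granularity containing z gene attributes P={a_1,a_2,...,a_z}; denote the cluster center of this information granule by (b_1,b_2,...,b_n), and compute the cluster center of the nearest other information granule, denoted (d_1,d_2,...,d_n); i and j are sample traversal indices, initialized to 0, with 0≤i,j≤m;
    Step S5.4: Within a single granularity, for any breast cancer gene attribute a_t, let S_t denote the distance from a_t to the cluster center of this information granule; if
    Figure PCTCN2022077251-appb-100010
    the attribute is regarded as a breast cancer gene attribute inside the dense similar region; first initialize the set
    Figure PCTCN2022077251-appb-100011
    which is used to find the lower approximation set associated with x_i; starting from x_i, compute the distance from x_i to every other point x_j under this attribute and denote it W; if
    Figure PCTCN2022077251-appb-100012
    (the neighborhood radius), let set_i = set_i ∪ {x_i} ∪ {x_j}; after every point has been traversed, set_i is obtained; for the decision attribute D={D_1,D_2}, if
    Figure PCTCN2022077251-appb-100013
    Figure PCTCN2022077251-appb-100014
    then set_i is called the lower approximation set of x_i in D_1 or D_2 with respect to a_t; otherwise let
    Figure PCTCN2022077251-appb-100015
    Step S5.5: Obtain the positive region of the breast cancer gene decision attribute D with respect to a_t
    Figure PCTCN2022077251-appb-100016
    and compute the dependency of the decision attribute D on the conditional attribute a_t as follows:
    Figure PCTCN2022077251-appb-100017
    Step S5.6: Within a single granularity, store the attribute dependencies in a list named list in descending order, and obtain the positive region NPOS_P(D) of the decision attribute D with respect to the gene attributes of granularity P:
    Figure PCTCN2022077251-appb-100018
    Step S5.7: Compute the dependency of the decision D on the conditional attributes P
    Figure PCTCN2022077251-appb-100019
    and initialize
    Figure PCTCN2022077251-appb-100020
    Step S5.8: If r(R_0,D)=r(P,D), the algorithm terminates, and the final large-scale breast cancer gene reduction set is R=R_0;
    Step S5.9: If r(R_0,D)≠r(P,D), put the attribute with the largest dependency remaining in list into R_0 and return to step S5.8.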
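  4. The multi-granularity breast cancer gene classification method based on double self-adaptive neighborhood radius according to claim 1, characterized in that step S6 specifically comprises the following steps:
    Step S6.1: From the multiple granularities, obtain the decision table S=(U,C∪D,V,f), where C={P_1,P_2,...,P_k}, U={x_1,x_2,...,x_m}, D={D_1,D_2}, k is the number of information granules and m is the number of breast cancer gene data samples; select the best neighborhood radius based on the attribute inclusion degree; i and j are sample traversal indices, initialized to 0, with 0≤i,j≤m;
    Step S6.2: For any information granule P_t, first initialize the set
    Figure PCTCN2022077251-appb-100021
    which is used to find the lower approximation set associated with x_i; starting from x_i, compute the Euclidean distance from x_i to every other point x_j under this information granule; if the Euclidean distance from x_i to x_j is smaller than the neighborhood radius, let set_i = set_i ∪ {x_i} ∪ {x_j}; after every point has been traversed, set_i is obtained; for the decision attribute D={D_1,D_2}, if
    Figure PCTCN2022077251-appb-100022
    Figure PCTCN2022077251-appb-100023
    then set_i is called the lower approximation set of x_i in D_1 or D_2 with respect to P_t; otherwise let
    Figure PCTCN2022077251-appb-100024
    Step S6.3: Obtain the positive region of the decision attribute D with respect to P_t
    Figure PCTCN2022077251-appb-100025
    and compute the dependency of the decision D on the breast cancer gene conditional attributes P_t
    Figure PCTCN2022077251-appb-100026
    Step S6.4: Store the breast cancer gene attribute dependencies in the list All_list in descending order, and obtain the optimistic multi-granularity positive region of the decision attribute D with respect to C,
    Figure PCTCN2022077251-appb-100027
    as follows:
    Figure PCTCN2022077251-appb-100028
    Step S6.5: Compute the dependency of the decision D on the conditional attributes C
    Figure PCTCN2022077251-appb-100029
    and initialize
    Figure PCTCN2022077251-appb-100030
    Step S6.6: If r(Red_0,D)=r(C,D), the algorithm terminates, and the final breast cancer gene reduction set is Red=Red_0;
    Step S6.7: If r(Red_0,D)≠r(C,D), put the attribute with the largest dependency remaining in All_list into Red_0 and return to step S6.6;
    Step S6.8: From Red={P_i,...,P_j}, select in turn the attribute in P_t with the largest neighborhood dependency
    Figure PCTCN2022077251-appb-100031
    Figure PCTCN2022077251-appb-100032
    the algorithm terminates, giving R=R_0;
    Step S6.9: If r(R_0,D)≠r(C,D), put the breast cancer gene attribute with the largest dependency in P_{t+1} of Red={P_i,...,P_j} into R_0 and return to step S6.8.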
PCT/CN2022/077251 2021-07-26 2022-02-22 Multi-granularity breast cancer gene classification method based on double self-adaptive neighborhood radius WO2023005196A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/798,352 US11837329B2 (en) 2021-07-26 2022-02-22 Method for classifying multi-granularity breast cancer genes based on double self-adaptive neighborhood radius

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110845531.0 2021-07-26
CN202110845531.0A CN113838532B (zh) 2021-07-26 2021-07-26 Multi-granularity breast cancer gene classification method based on double self-adaptive neighborhood radius

Publications (1)

Publication Number Publication Date
WO2023005196A1 true WO2023005196A1 (zh) 2023-02-02

Family

ID=78962844

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/077251 WO2023005196A1 (zh) 2021-07-26 2022-02-22 Multi-granularity breast cancer gene classification method based on double self-adaptive neighborhood radius

Country Status (3)

Country Link
US (1) US11837329B2 (zh)
CN (1) CN113838532B (zh)
WO (1) WO2023005196A1 (zh)

Also Published As

Publication number Publication date
US11837329B2 (en) 2023-12-05
CN113838532A (zh) 2021-12-24
CN113838532B (zh) 2022-11-18
US20230197203A1 (en) 2023-06-22

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22847816

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE