CN113345588A - Rapid attribute reduction method for incomplete data set - Google Patents
Rapid attribute reduction method for incomplete data set
- Publication number
- CN113345588A CN202110722842.8A
- Authority
- CN
- China
- Prior art keywords
- attribute
- breast cancer
- red
- incomplete
- data set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/30—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/10—Pre-processing; Data cleansing
Abstract
The invention relates to a rapid attribute reduction method for incomplete data sets. The method uses existing data on breast cancer patients to analyze people who have not undergone medical examination and thereby identify those at potentially high risk of breast cancer. By adopting the IFSPA and IFSPA-IVPR algorithms, the method completes attribute reduction on incomplete data sets more efficiently while preserving the discernibility of the original feature attributes. It outperforms existing algorithms in time complexity and stability, and the improvement is particularly pronounced on large-scale data sets.
Description
This application is a divisional application of the patent application entitled 'A rapid attribute reduction method for incomplete data sets', filed on December 21, 2018, with application number 201811574927.0.
Technical Field
The invention relates to the technical field of medicine, in particular to a rapid attribute reduction method for an incomplete data set.
Background
Breast cancer is a disease in which mammary epithelial cells proliferate uncontrollably under the action of various carcinogenic factors. Together with cervical cancer, it is known as one of the two major "hidden killers" of women. According to 2018 survey data from the International Agency for Research on Cancer (IARC), breast cancer accounts for 24.2% of cancers in women worldwide, the highest share among female cancers, and 52.9% of cases occur in developing countries. Most breast cancer patients show no obvious symptoms at an early stage, so the disease is easily overlooked and medical attention is delayed. Moreover, diagnosing breast cancer requires professional medical examinations such as imaging, tissue biopsy, and tumor marker tests. At present there is no good way to discover and predict potential breast cancer patients in time.
Therefore, how to design a rapid attribute reduction method for incomplete data sets that uses existing data on breast cancer patients to perform big data analysis on people who have not undergone medical examination, and thereby determine which of them are at potentially high risk of breast cancer, has become a technical problem to be solved in this field.
Feature selection, also called attribute reduction, is an important research topic shared by pattern recognition, data mining, machine learning, and related fields. In recent years, both the number of objects and the dimensionality of data sets have grown significantly. For example, databases in many real-world applications store hundreds or even thousands of condition attributes. It is well known that condition attributes irrelevant to the recognition or classification task can significantly degrade the performance of the algorithms involved. In other words, storing and processing all condition attributes, relevant and irrelevant alike, incurs significant storage and computation costs. To solve this problem, some scholars have proposed deleting condition attributes that do not affect recognition or classification accuracy. Deleting some condition attributes is therefore not only permissible but even necessary to reduce computational time complexity.
Attribute reduction methods follow two main strategies: wrappers and filters. A wrapper uses a learning algorithm to evaluate and select condition attribute subsets, whereas a filter selects condition attributes according to an importance measure such as information gain, consistency, distance, or dependency. These measures fall into two main categories: distance-based and consistency-based. Attribute reduction in rough set theory provides a systematic theoretical framework for consistency-based methods; rather than maximizing class separability, it tries to ensure that the selected subset of condition attributes has the same discernibility as the full set of condition attributes.
In general, two types of data are encountered: numerical and symbolic. Numerical data can be handled in two ways: by using fuzzy rough set theory, or by discretizing the numerical condition attribute values. Many methods have also been proposed for handling hybrid condition attribute values. In classical rough set theory, attribute reduction algorithms treat all attribute values as symbolic data. After pre-processing the raw data, classical rough set theory can be used to select the subset of condition attributes best suited to the recognition or classification task.
Attribute reduction based on rough set theory starts from a data table, also called an information system. It contains all the data about the objects of interest, each described by a finite set of condition attributes. Depending on whether missing or null values are present, information systems are classified as complete or incomplete: an incomplete information system is one that contains missing or null data. An incomplete information system in which condition attributes and decision attributes are distinguished is called an incomplete decision system or incomplete decision table. Attribute reduction on incomplete data typically begins with an incomplete decision table.
Over the last two decades, many new attribute reduction methods based on rough set theory have emerged. Among them, Skowron proposed a discernibility matrix method that obtains all attribute reducts of a data set. However, this method is extremely time-consuming on large-scale data. To make attribute reduction more efficient, many scholars have proposed heuristic attribute reduction algorithms based on rough set theory, each of which preserves some specific property of a given information system. For incomplete decision tables, Kryszkiewicz extended Skowron's discernibility matrix method to a generalized discernibility matrix method that obtains all attribute reducts of an incomplete decision table. Yang and Shu proposed a heuristic positive-region attribute reduction algorithm for incomplete decision tables, which guarantees that the positive region of the target decision table remains unchanged after reduction. Yan et al. defined a new information entropy to measure the uncertainty of incomplete information systems and removed redundant condition attributes by applying the corresponding conditional information entropy. Just as attribute reducts in the classical rough set model can be found by introducing Shannon's information entropy, this method computes attribute reducts of incomplete decision tables by introducing an extended conditional information entropy.
However, all of the above methods are slow to some degree and cannot cope with the huge time cost of processing large-scale incomplete data.
Disclosure of Invention
The invention aims to provide a rapid attribute reduction method for incomplete data sets that enables potential breast cancer cases to be discovered and predicted early from big data, and that completes attribute reduction more efficiently while preserving the discernibility of the original feature attributes of the incomplete data set.
In order to achieve the purpose, the invention provides the following scheme:
A rapid attribute reduction method for incomplete data sets, based on a positive approximation set, comprises the following steps:
Step one: input the data set $S = (U, C \cup D)$ of breast cancer samples, where $U$ is the set of breast cancer samples, $C$ is the full set of breast cancer condition attributes, and $D$ is the breast cancer decision attribute;
Step two: initialize red to the empty set, that is, $red \leftarrow \emptyset$, where red is the set of selected breast cancer condition attributes;
Step three: compute $Sig^{inner}(a_k, C, D, U)$ for $k \le |C|$, where $a_k$ is the k-th breast cancer condition attribute and $Sig^{inner}(\cdot)$ is the attribute significance of $a_k$ within the full condition attribute set $C$;
Step four: add $a_k$ to red whenever $Sig^{inner}(a_k, C, D, U) > 0$;
Step five: let $i \leftarrow 1$, $R_1 = red$, $P_1 = \{R_1\}$, $U_1 \leftarrow U$;
Step six: judge whether the two conditions [formulas missing in source] hold, where [symbol missing in source] is the evaluation function on the target object subset $U_1$;
if both hold, repeatedly search the attributes outside red for the breast cancer condition attribute with the greatest significance and add it to red, until the condition [formula missing in source] is satisfied;
if either does not hold, go to step seven;
Step seven: $R_i \leftarrow R_i \cup \{a_0\}$, $P_i \leftarrow \{R_1, R_2, \ldots, R_i\}$;
Step eight: return the reduction result red of the breast cancer condition attributes and end.
Optionally, step six specifically comprises:
sub-step three: $i \leftarrow i + 1$;
sub-step four: $red \leftarrow red \cup \{a_0\}$, where $Sig^{outer}(a_0, red, D, U_i) = \max\{Sig^{outer}(a_k, red, D, U_i)\}$, $a_k \in C - red$;
if both conditions hold, return to sub-step one; otherwise exit the loop of step six and proceed to step seven.
Optionally, in step one, the time complexity of the fast allowable class acquisition algorithm for the incomplete decision table is [formula missing in source], where [symbol missing in source] denotes the number of objects with a missing value on breast cancer condition attribute $a_k$, and [symbol missing in source] denotes the number of objects with a non-missing value on $a_k$, i.e., an attribute value other than *.
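The steps above can be sketched in Python as a greedy forward search on a shrinking universe. This is a minimal illustration under stated assumptions, not the patented procedure: the evaluation function is taken to be the size of the positive region, allowable classes are computed by brute force rather than by the fast acquisition algorithm, `None` is a hypothetical marker for a missing value, and all function names are invented for the example.

```python
from typing import List, Optional, Sequence, Set

Row = Sequence[Optional[str]]  # None is a hypothetical marker for a missing value


def tolerant(x: Row, y: Row, attrs: Sequence[int]) -> bool:
    """True if x and y agree on every attribute in attrs; a missing value matches anything."""
    return all(x[a] is None or y[a] is None or x[a] == y[a] for a in attrs)


def positive_region(rows: Sequence[Row], labels: Sequence[int],
                    attrs: Sequence[int], universe: Set[int]) -> Set[int]:
    """Objects of `universe` whose allowable class inside `universe` has one decision value."""
    pos = set()
    for u in universe:
        if all(labels[v] == labels[u]
               for v in universe if tolerant(rows[u], rows[v], attrs)):
            pos.add(u)
    return pos


def forward_reduct(rows: Sequence[Row], labels: Sequence[int]) -> List[int]:
    all_attrs = list(range(len(rows[0])))
    universe = set(range(len(rows)))
    full = len(positive_region(rows, labels, all_attrs, universe))
    # Steps two to four: keep attributes whose removal shrinks the positive region.
    red = [a for a in all_attrs
           if len(positive_region(rows, labels,
                                  [b for b in all_attrs if b != a], universe)) < full]
    # Steps five to seven: grow red on a shrinking universe until it matches C.
    while True:
        pos_red = positive_region(rows, labels, red, universe)
        pos_all = positive_region(rows, labels, all_attrs, universe)
        if len(pos_red) == len(pos_all):
            return sorted(red)  # step eight
        universe = universe - pos_red  # objects already decided are dropped
        best = max((a for a in all_attrs if a not in red),
                   key=lambda a: len(positive_region(rows, labels, red + [a], universe)))
        red.append(best)


rows = [("a", "x", "p"), ("a", "y", "p"), ("b", "x", None), ("b", "y", "q")]
labels = [0, 1, 0, 1]
print(forward_reduct(rows, labels))  # [1]
```

In this toy table only the second attribute determines the decision, so it is the only attribute kept. Dropping the objects already in the positive region is what makes later iterations cheaper, since each significance evaluation then runs over fewer objects.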
The invention also provides a rapid attribute reduction method for incomplete data sets, based on a variable-precision positive approximation set, comprising the following steps:
Step one: input the data set $S = (U, C \cup D)$ of breast cancer samples and a threshold $\beta \le 0.5$, where $U$ is the set of breast cancer samples, $C$ is the full set of breast cancer condition attributes, and $D$ is the breast cancer decision attribute;
Step two: initialize red to the empty set, that is, $red \leftarrow \emptyset$, where red is the set of selected breast cancer condition attributes;
Step three: compute [formula missing in source] for $k \le |C|$, where $a_k$ is the k-th breast cancer condition attribute and $Sig_3^{inner}(\cdot)$ is the attribute significance of $a_k$ within the full condition attribute set $C$;
Step five: let $i \leftarrow 1$, $R_1 = red$, $P_1 = \{R_1\}$, $U_1 \leftarrow U$;
if all conditions hold, repeatedly search the attributes outside red for the breast cancer condition attribute with the greatest significance and add it to red, until the condition [formula missing in source] is satisfied;
if either does not hold, go to step seven;
Step seven: $R_i \leftarrow R_i \cup \{a_0\}$, $P_i \leftarrow \{R_1, R_2, \ldots, R_i\}$;
Step eight: return the reduction result red of the breast cancer condition attributes and end.
Optionally, step six specifically comprises:
sub-step three: let $i \leftarrow i + 1$;
if the conditions of both sub-step one and sub-step two hold, return to sub-step one and continue the loop; otherwise exit the loop of step six and proceed to step seven.
Optionally, in step one, the time complexity of the fast allowable class acquisition algorithm for the incomplete decision table is [formula missing in source], where [symbol missing in source] denotes the number of objects with a missing value on breast cancer condition attribute $a_k$, and [symbol missing in source] denotes the number of objects with a non-missing value on $a_k$, i.e., an attribute value other than *.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
The invention uses existing data on breast cancer patients to analyze people who have not undergone medical examination and thereby identify those at potentially high risk of breast cancer. The QAAC algorithm is used to estimate time complexity, and computation is carried out under both the incomplete rough set model and the variable-precision incomplete rough set model, so that attribute reduction on incomplete data sets is completed more efficiently under both models while preserving the discernibility of the original feature attributes. The method outperforms existing algorithms in time complexity and stability, the improvement is especially pronounced on large-scale data sets, and the method achieves a lower mean and standard deviation of computation time, that is, better robustness.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of the IFSPA algorithm of the present invention;
FIG. 2 is a flow chart of the IFSPA-IVPR algorithm of the present invention;
FIG. 3a is a statistical plot of sample number versus data set;
FIG. 3b is a statistical chart of the number of conditional attributes and data sets;
FIG. 3c is a statistical chart of the number of missing attribute values and the data set;
FIG. 3d is a statistical chart of the number of decision classes and data sets;
FIG. 3e is a statistical plot of incomplete rate versus data set;
FIG. 4 is a graph of IPR and IFSPA-IPR algorithm computation time versus data size using Breast-cancer-wisconsin data set;
FIG. 5 is a graph of the calculation time versus data size for the ILCE and IFSPA-ILCE algorithms when a Breast-cancer-wisconsin data set is used;
FIG. 6a is a graph of the calculation time of the IVPR and IFSPA-IVPR algorithms versus the data size when β is 0;
FIG. 6b is a graph of the calculation time of the IVPR and IFSPA-IVPR algorithms versus the data size when β is 0.1;
FIG. 6c is a graph of the calculation time of the IVPR and IFSPA-IVPR algorithms versus the data size when β is 0.2;
FIG. 7a is a graph of the calculation time of the IVPR and IFSPA-IVPR algorithms versus the data size when β is 0;
FIG. 7b is a graph of the calculation time of the IVPR and IFSPA-IVPR algorithms versus the data size when β is 0.1;
FIG. 7c is a graph of the calculation time of the IVPR and IFSPA-IVPR algorithms versus the data size when β is 0.2;
FIG. 8a is a graph of the calculation time of the IVPR and IFSPA-IVPR algorithms versus the data size when β is 0;
FIG. 8b is a graph of the calculation time of the IVPR and IFSPA-IVPR algorithms versus the data size when β is 0.1;
FIG. 8c is a graph of the calculation time of the IVPR and IFSPA-IVPR algorithms versus the data size when β is 0.2;
FIG. 9a is a graph of the calculation time of the IVPR and IFSPA-IVPR algorithms versus the data size when β is 0;
FIG. 9b is a graph of the calculation time of the IVPR and IFSPA-IVPR algorithms versus the data size when β is 0.1;
FIG. 9c is a graph of the calculation time of the IVPR and IFSPA-IVPR algorithms versus the data size when β is 0.2;
FIG. 10 is a comparison of the computation times of the IPR, IFSPA-IPR, ILCE and IFSPA-ILCE algorithms;
FIG. 11 is a comparison of the computation time of the IVPR and IFSPA-IVPR algorithms at different thresholds.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a rapid attribute reduction method for incomplete data sets that uses existing data on breast cancer patients to analyze people who have not undergone medical examination and thereby identify those at potentially high risk of breast cancer. By adopting the IFSPA and IFSPA-IVPR algorithms, the method completes attribute reduction on incomplete data sets more efficiently while preserving the discernibility of the original feature attributes. It outperforms existing algorithms in time complexity and stability, and the improvement is particularly pronounced on large-scale data sets.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Example 1:
Referring to FIG. 1, the present invention provides a rapid attribute reduction method for incomplete data sets, based on a positive approximation set, comprising the following steps:
Step one: input the data set $S = (U, C \cup D)$ of breast cancer samples, where $U$ is the set of breast cancer samples, $C$ is the full set of breast cancer condition attributes, and $D$ is the breast cancer decision attribute;
Step two: initialize red to the empty set, that is, $red \leftarrow \emptyset$, where red is the set of selected breast cancer condition attributes;
Step three: compute $Sig^{inner}(a_k, C, D, U)$ for $k \le |C|$, where $a_k$ is the k-th breast cancer condition attribute and $Sig^{inner}(\cdot)$ is the attribute significance of $a_k$ within the full condition attribute set $C$;
Step four: add $a_k$ to red whenever $Sig^{inner}(a_k, C, D, U) > 0$;
Step five: let $i \leftarrow 1$, $R_1 = red$, $P_1 = \{R_1\}$, $U_1 \leftarrow U$;
Step six: judge whether the two conditions [formulas missing in source] hold, where [symbol missing in source] is the evaluation function on the target object subset $U_1$;
if both hold, repeatedly search the attributes outside red for the breast cancer condition attribute with the greatest significance and add it to red, until the condition [formula missing in source] is satisfied;
if either does not hold, go to step seven;
Step seven: $R_i \leftarrow R_i \cup \{a_0\}$, $P_i \leftarrow \{R_1, R_2, \ldots, R_i\}$;
Step eight: return the reduction result red of the breast cancer condition attributes and end.
Specifically, step six comprises:
sub-step three: $i \leftarrow i + 1$;
sub-step four: $red \leftarrow red \cup \{a_0\}$, where $Sig^{outer}(a_0, red, D, U_i) = \max\{Sig^{outer}(a_k, red, D, U_i)\}$, $a_k \in C - red$;
if both conditions hold, return to sub-step one; otherwise exit the loop of step six and proceed to step seven.
In step one, the time complexity of the fast allowable class acquisition algorithm for the incomplete decision table is [formula missing in source], where [symbol missing in source] denotes the number of objects with a missing value on breast cancer condition attribute $a_k$, and [symbol missing in source] denotes the number of objects with a non-missing value on $a_k$, i.e., an attribute value other than *.
Referring to FIG. 2, the present invention further provides a rapid attribute reduction method for incomplete data sets, based on a variable-precision positive approximation set, comprising the following steps:
Step one: input the data set $S = (U, C \cup D)$ of breast cancer samples and a threshold $\beta \le 0.5$, where $U$ is the set of breast cancer samples, $C$ is the full set of breast cancer condition attributes, and $D$ is the breast cancer decision attribute;
Step two: initialize red to the empty set, that is, $red \leftarrow \emptyset$, where red is the set of selected breast cancer condition attributes;
Step three: compute [formula missing in source] for $k \le |C|$, where $a_k$ is the k-th breast cancer condition attribute and $Sig_3^{inner}(\cdot)$ is the attribute significance of $a_k$ within the full condition attribute set $C$;
Step five: let $i \leftarrow 1$, $R_1 = red$, $P_1 = \{R_1\}$, $U_1 \leftarrow U$;
if all conditions hold, repeatedly search the attributes outside red for the breast cancer condition attribute with the greatest significance and add it to red, until the condition [formula missing in source] is satisfied;
if either does not hold, go to step seven;
Step seven: $R_i \leftarrow R_i \cup \{a_0\}$, $P_i \leftarrow \{R_1, R_2, \ldots, R_i\}$;
Step eight: return the reduction result red of the breast cancer condition attributes and end.
Specifically, step six comprises:
sub-step three: let $i \leftarrow i + 1$;
if the conditions of both sub-step one and sub-step two hold, return to sub-step one and continue the loop; otherwise exit the loop of step six and proceed to step seven.
In step one, the time complexity of the fast allowable class acquisition algorithm for the incomplete decision table is [formula missing in source], where [symbol missing in source] denotes the number of objects with a missing value on breast cancer condition attribute $a_k$, and [symbol missing in source] denotes the number of objects with a non-missing value on $a_k$, i.e., an attribute value other than *.
With this reduction method, existing data on breast cancer patients can be used to analyze people who have not undergone medical examination and to identify those at potentially high risk of breast cancer. In addition, massive amounts of fuzzy, incomplete, and imprecise information and data can be processed more quickly and efficiently, the useful knowledge they contain can be extracted, and a sound basis is provided for intelligent decision making.
To achieve efficient conditional attribute selection, researchers in the related art have proposed many heuristic attribute selection algorithms for incomplete data sets. For the sake of simplicity, we focus primarily on two representative conditional attribute selection algorithms for incomplete datasets.
Given an incomplete decision table $IDT = (U, C \cup D)$, the classification of all objects with respect to the full condition attribute set $C$ is $U/SIM(C) = \{S_C(u_1), S_C(u_2), \ldots, S_C(u_{|U|})\}$, and the partition of the universe $U$ with respect to the decision attribute $D$ is $U/D = \{X_1, X_2, \ldots, X_r\}$. In practice, the partition $U/D$ is expressed in terms of the allowable class of each object in $U$, i.e., $U/SIM(D) = \{S_D(u_1), S_D(u_2), \ldots, S_D(u_{|U|})\}$. Without loss of generality, let [formula missing in source], where $|X_j| = s_j$ and [formula missing in source]. The relationship between $U/D$ and $U/SIM(D)$ is expressed as follows.
From this relationship, the positive region of an incomplete decision table can be equivalently redefined in the following form.
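The formulas themselves are not reproduced in this text. For orientation, the definitions below are the ones commonly used for incomplete decision tables in the rough set literature; they are given here as an assumption about what the missing formulas express, with * denoting a missing value and $a(u)$ the value of object $u$ on attribute $a$.

```latex
\begin{align*}
SIM(B) &= \{(u, v) \in U \times U \mid \forall a \in B:\ a(u) = a(v)\ \text{or}\ a(u) = *\ \text{or}\ a(v) = *\},\\
S_B(u) &= \{v \in U \mid (u, v) \in SIM(B)\},\\
POS_B(D) &= \{u \in U \mid S_B(u) \subseteq S_D(u)\}.
\end{align*}
```

Under these definitions an object belongs to the positive region exactly when every object it cannot be distinguished from on $B$ carries the same decision value.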
Building on the above, we next focus on two types of significance measures for condition attributes.
The prior art proposes a heuristic condition attribute reduction algorithm called positive-region reduction (PR), which guarantees that the positive region of the target decision attribute remains unchanged during reduction. Applying the same idea to incomplete decision tables, the prior art proposes another heuristic condition attribute reduction algorithm, called IPR, which likewise keeps the positive region of the target decision attribute unchanged. In this algorithm, the significance of a condition attribute is defined as follows.
Definition 1. Let $IDT = (U, C \cup D)$ be an incomplete decision table and $B \subseteq C$ a condition attribute subset. For any $a \in B$, the significance of the condition attribute $a$ contained in the subset $B$ is defined as follows.
In the formula, $\gamma_B(D) = |POS_B(D)| / |U|$.
Definition 2. Let $IDT = (U, C \cup D)$ be an incomplete decision table and $B \subseteq C$ a condition attribute subset. For any $a \in C - B$, the significance of the condition attribute $a$ outside the subset $B$ is defined as follows.
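The defining formulas of Definitions 1 and 2 are not reproduced in this text. The sketch below assumes the usual dependency-based form, in which the inner significance is $\gamma_B(D) - \gamma_{B - \{a\}}(D)$ and the outer significance is $\gamma_{B \cup \{a\}}(D) - \gamma_B(D)$. The function names and the use of `None` for a missing value are hypothetical.

```python
from fractions import Fraction
from typing import Optional, Sequence

Row = Sequence[Optional[str]]  # None is a hypothetical marker for a missing value


def tolerant(x: Row, y: Row, attrs: Sequence[int]) -> bool:
    """True if x and y agree on every attribute in attrs; a missing value matches anything."""
    return all(x[a] is None or y[a] is None or x[a] == y[a] for a in attrs)


def gamma(rows: Sequence[Row], labels: Sequence[int], attrs: Sequence[int]) -> Fraction:
    """Dependency degree |POS_B(D)| / |U| for the attribute subset `attrs`."""
    n = len(rows)
    pos = sum(
        all(labels[v] == labels[u] for v in range(n) if tolerant(rows[u], rows[v], attrs))
        for u in range(n)
    )
    return Fraction(pos, n)


def sig_inner(a: int, B: Sequence[int], rows, labels) -> Fraction:
    """Assumed form: loss of dependency when a is removed from B."""
    return gamma(rows, labels, B) - gamma(rows, labels, [b for b in B if b != a])


def sig_outer(a: int, B: Sequence[int], rows, labels) -> Fraction:
    """Assumed form: gain of dependency when a is added to B."""
    return gamma(rows, labels, list(B) + [a]) - gamma(rows, labels, B)


rows = [("a", "x", "p"), ("a", "y", "p"), ("b", "x", None), ("b", "y", "q")]
labels = [0, 1, 0, 1]
print(sig_inner(1, [0, 1, 2], rows, labels))  # 1
print(sig_inner(0, [0, 1, 2], rows, labels))  # 0
```

An attribute with positive inner significance cannot be dropped without shrinking the positive region, which is why such attributes are added to red first in the algorithm.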
An information entropy has also been defined to measure the uncertainty of an incomplete information system and to remove redundant condition attributes. The corresponding attribute reduction algorithm is denoted ILCE, and its conditional information entropy is defined as follows.
In the formula, $S_C(u_i) \in U/SIM(C)$ and $S_D(u_i) \in U/SIM(D)$. The corresponding significance definitions are listed below.
Definition 3. Let $IDT = (U, C \cup D)$ be an incomplete decision table and $B \subseteq C$ a condition attribute subset. For any $a \in B$, the significance of the condition attribute $a$ contained in the subset $B$ is defined as follows.
Definition 4. Let $IDT = (U, C \cup D)$ be an incomplete decision table and $B \subseteq C$ a condition attribute subset. For any $a \in C - B$, the significance of the condition attribute $a$ outside the subset $B$ is defined as follows.
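Because the entropy formula is missing from this text, the following sketch assumes one form found in the rough set literature for incomplete tables: $E(D \mid B) = \sum_{i=1}^{|U|} \left(|S_B(u_i)| - |S_B(u_i) \cap S_D(u_i)|\right) / |U|^2$. Whether the patent uses exactly this form cannot be confirmed from the source, and the function names are hypothetical.

```python
from fractions import Fraction
from typing import Optional, Sequence

Row = Sequence[Optional[str]]  # None is a hypothetical marker for a missing value


def tolerant(x: Row, y: Row, attrs: Sequence[int]) -> bool:
    """True if x and y agree on every attribute in attrs; a missing value matches anything."""
    return all(x[a] is None or y[a] is None or x[a] == y[a] for a in attrs)


def conditional_entropy(rows: Sequence[Row], labels: Sequence[int],
                        attrs: Sequence[int]) -> Fraction:
    """Assumed form: sum over u of |S_B(u)| - |S_B(u) & S_D(u)|, divided by |U|^2."""
    n = len(rows)
    total = 0
    for u in range(n):
        # Objects indiscernible from u on attrs but carrying a different decision.
        total += sum(1 for v in range(n)
                     if tolerant(rows[u], rows[v], attrs) and labels[v] != labels[u])
    return Fraction(total, n * n)


def entropy_sig_outer(a: int, B: Sequence[int], rows, labels) -> Fraction:
    """Assumed form: drop in conditional entropy when a is added to B."""
    return (conditional_entropy(rows, labels, B)
            - conditional_entropy(rows, labels, list(B) + [a]))


rows = [("a", "x", "p"), ("a", "y", "p"), ("b", "x", None), ("b", "y", "q")]
labels = [0, 1, 0, 1]
print(conditional_entropy(rows, labels, [0]))  # 1/4
print(conditional_entropy(rows, labels, [1]))  # 0
```

The entropy is zero exactly when every allowable class is pure, so in this assumed form an entropy-preserving reduct also preserves a fully consistent positive region.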
In the incomplete variable-precision rough set model, an algorithm that keeps the β-positive region of the target decision attribute unchanged is designed using the corresponding significance measures, so as to search for the required condition attribute reduct.
Definition 5. Let $IDT = (U, C \cup D)$ be an incomplete decision table and $B \subseteq C$ a condition attribute subset. For any $a \in B$, the significance of the condition attribute $a$ contained in the subset $B$ is defined as follows.
Definition 6. Let $IDT = (U, C \cup D)$ be an incomplete decision table and $B \subseteq C$ a condition attribute subset. For any $a \in C - B$, the significance of the condition attribute $a$ outside the subset $B$ is defined as follows.
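To make the β-positive region concrete, the sketch below assumes the common inclusion-degree definition from variable-precision rough sets: an object belongs to the β-positive region when at least a fraction $1 - \beta$ of its allowable class shares one decision value. The source does not reproduce its own definition, so this form, the function names, and the `None` missing-value marker are assumptions.

```python
from collections import Counter
from typing import Optional, Sequence, Set

Row = Sequence[Optional[str]]  # None is a hypothetical marker for a missing value


def tolerant(x: Row, y: Row, attrs: Sequence[int]) -> bool:
    """True if x and y agree on every attribute in attrs; a missing value matches anything."""
    return all(x[a] is None or y[a] is None or x[a] == y[a] for a in attrs)


def beta_positive_region(rows: Sequence[Row], labels: Sequence[int],
                         attrs: Sequence[int], beta: float) -> Set[int]:
    """Objects whose allowable class is at least (1 - beta) pure in one decision value."""
    n = len(rows)
    pos = set()
    for u in range(n):
        cls = [v for v in range(n) if tolerant(rows[u], rows[v], attrs)]
        # Size of the largest decision class inside the allowable class of u.
        majority = Counter(labels[v] for v in cls).most_common(1)[0][1]
        if majority >= (1 - beta) * len(cls):
            pos.add(u)
    return pos


rows = [("x",), ("x",), ("x",), ("x",)]
labels = [0, 0, 0, 1]
print(beta_positive_region(rows, labels, [0], 0.0))   # set()
print(beta_positive_region(rows, labels, [0], 0.25))  # {0, 1, 2, 3}
```

With β = 0 the test reduces to the ordinary positive region, while larger thresholds tolerate a bounded share of conflicting objects in each allowable class.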
All of the above definitions can be used in heuristic attribute selection algorithms for incomplete data sets to select the condition attributes needed.
An attribute reduction algorithm based on rough set theory must compute the allowable classes (tolerance classes) induced by the condition attributes of an incomplete decision table, and this step largely determines the overall computation time. To design an efficient condition attribute reduction algorithm, a fast allowable class acquisition algorithm for incomplete decision tables is therefore provided first. The algorithm is based mainly on the idea of radix sort, and its time complexity is [formula missing in source], where [symbol missing in source] denotes the number of objects with a missing value on condition attribute $a_k$, and [symbol missing in source] denotes the number of objects with a non-missing value on $a_k$, i.e., an attribute value other than *. A major application of rough set theory is knowledge discovery from symbolic data, where the number of distinct values of each condition attribute is small enough to be regarded as a constant. Thus, the time complexity of the algorithm is not affected by [symbol missing in source]. In addition, the number of objects with a missing value on each condition attribute is usually small. In the worst case, all values of a condition attribute are missing, i.e., the number of objects with missing values on that attribute reaches its maximum $|U|$, which means that the attribute provides no useful classification information. Therefore, the time complexity of the attribute reduction algorithm can be further reduced to the following:
Therefore, when computing the allowable classes of large-scale incomplete data, the algorithm has the advantage that the dimensionality of the data affects computation time less than the number of objects does. The specific flow of the algorithm and its working principle are not discussed here.
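Since the flow of the fast acquisition algorithm is not given, the following is only a simplified bucket-based sketch in the spirit of radix sort, not the patented algorithm, and it makes no claim to the stated time complexity. For each attribute, objects are distributed into buckets by value in a single pass, objects with a missing value are treated as matching every bucket, and the per-attribute classes are intersected. All names, and `None` as the missing-value marker, are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Set

Row = Sequence[Optional[str]]  # None is a hypothetical marker for a missing value


def single_attribute_classes(column: Sequence[Optional[str]]) -> List[Set[int]]:
    """Allowable class of every object with respect to one attribute."""
    buckets: Dict[str, Set[int]] = defaultdict(set)
    missing: Set[int] = set()
    for i, value in enumerate(column):  # one pass distributes objects into buckets
        if value is None:
            missing.add(i)
        else:
            buckets[value].add(i)
    everyone = set(range(len(column)))
    # A missing value matches every object; otherwise: own bucket plus missing objects.
    return [everyone if value is None else buckets[value] | missing for value in column]


def allowable_classes(rows: Sequence[Row], attrs: Sequence[int]) -> List[Set[int]]:
    """Intersect the single-attribute classes over all attributes in attrs."""
    classes = [set(range(len(rows))) for _ in rows]
    for a in attrs:
        per_attr = single_attribute_classes([row[a] for row in rows])
        classes = [c & p for c, p in zip(classes, per_attr)]
    return classes


rows = [("a", "x"), ("a", None), ("b", "x")]
print(allowable_classes(rows, [0, 1]))  # [{0, 1}, {0, 1}, {2}]
```

The intersection step is valid because two objects are tolerant on a set of attributes exactly when they are tolerant on each attribute individually.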
From the above discussion, an improved forward search algorithm based on a positive approximation set and a fast algorithm to obtain the allowable classes are obtained. Under this algorithmic framework, the evaluation function or the end condition can be expressed as EFU(B,D)=EFU(C, D). For example, when the conditional entropy method is adopted, the corresponding evaluation function is ENU(B,D)=ENU(C, D). That is, when EFU(B,D)=EFUAnd when the (C and D) are met, B is a conditional attribute reduction result.
The fast algorithm provided by the invention is well suited to attribute selection on large-scale incomplete data sets and can markedly reduce the time complexity; moreover, the dimensionality of an incomplete data set influences the computation time less than its number of objects does. We can therefore conclude that the generalized incomplete attribute selection algorithm based on positive approximation sets, i.e., IFSPA, has the potential to significantly reduce the overall computation time of attribute selection for incomplete decision tables. To verify these conclusions, the time complexity of each step of the original algorithm and of the IFSPA algorithm is listed in Table 1.
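The way a forward search can save work by shrinking the set of objects it examines can be sketched as follows. This is a simplified illustration, not the IFSPA procedure of the invention: it skips the initial core computation, uses the positive region as the (assumed) evaluation function, and still compares each object against the whole table, so only the set of objects being tested shrinks. The shortcut relies on the fact that an object in the positive region of an attribute set stays there when attributes are added. All names and the example table are hypothetical.

```python
MISSING = "*"  # assumed marker for a missing attribute value


def tolerant(x, y, attrs):
    """True if x and y agree on attrs, treating missing values as wildcards."""
    return all(x[a] == y[a] or MISSING in (x[a], y[a]) for a in attrs)


def newly_positive(table, attrs, decision, candidates):
    """Candidates whose tolerance class on attrs is consistent in decision."""
    return {
        i for i in candidates
        if all(y[decision] == table[i][decision]
               for y in table if tolerant(table[i], y, attrs))
    }


def greedy_reduct(table, conditions, decision):
    """Greedy forward selection that only re-tests unresolved objects."""
    unresolved = set(range(len(table)))
    target = newly_positive(table, conditions, decision, unresolved)
    red, pos = [], set()
    while pos != target:
        best, best_gain = None, set()
        for a in conditions:
            if a in red:
                continue
            gain = newly_positive(table, red + [a], decision, unresolved)
            if best is None or len(gain) > len(best_gain):
                best, best_gain = a, gain
        red.append(best)
        pos |= best_gain          # resolved objects stay resolved
        unresolved -= best_gain   # later rounds test fewer objects
    return red


# Hypothetical table: attribute "c" is constant and carries no information.
example = [
    {"a": 0, "b": 0, "c": 0, "d": 0},
    {"a": 0, "b": "*", "c": 0, "d": 0},
    {"a": 1, "b": 1, "c": 0, "d": 1},
    {"a": "*", "b": 1, "c": 0, "d": 1},
]
selected = greedy_reduct(example, ["a", "b", "c"], "d")
```

On this table the search selects `a` and `b` and never adds the uninformative attribute `c`; in the second round only three of the four objects are re-tested.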
TABLE 1 comparison of time complexity of IFSPA algorithm and classical algorithm
The breast cancer data used in the experiments of the present invention all come from the UCI Machine Learning Repository of the University of California, Irvine. These incomplete data sets therefore originate from real environments and are reliable, which also ensures that the experimental results are repeatable. The experimental data are summarized in Table 2, and their statistics are shown in Figs. 3a to 3e.
TABLE 2 basic information of simulation experiment data set
To facilitate calculation and comparison, the QAATC algorithm is used to estimate the time complexity. Throughout the experiments, the time complexity of the original method and that of the new method are compared under both the incomplete rough set model and the variable-precision incomplete rough set model, as shown in Figs. 4 to 11. In Figs. 4 to 9c, the x-axis represents the size of the incomplete data subset used: each incomplete data set is divided into 20 equal segments, and the number of segments used increases from 1 to 20. The y-axis represents the computation time required by the algorithm. In Figs. 6a to 11, to ensure the accuracy of the simulation experiment while showing the performance gain of the improved algorithm, the threshold β is set to 0, 0.1, and 0.2, respectively. In addition, the stability of the selected conditional attribute subsets is evaluated by ten-fold cross-validation, as shown in Tables 3 to 5. The simulation results show that the new method provided by the invention markedly reduces the computation time of existing incomplete attribute reduction algorithms and improves their computational efficiency. Moreover, compared with the existing algorithms, the method achieves a lower average computation time and a lower standard deviation while maintaining the same stability, i.e., it is more robust.
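The subset-growth protocol described for the x-axis can be sketched as a small timing harness. This is only an illustration of the protocol, assuming the rows are held in a Python list; `growing_subsets`, `time_reducer`, and the `reducer` callable are hypothetical names, and any attribute reduction routine could be passed in as `reducer`.

```python
import time


def growing_subsets(rows, parts=20):
    """Yield the first 1/parts, 2/parts, ..., parts/parts of the rows."""
    n = len(rows)
    for k in range(1, parts + 1):
        yield rows[: (n * k) // parts]


def time_reducer(reducer, rows, parts=20):
    """Run reducer on each growing subset; return (subset size, seconds)."""
    results = []
    for subset in growing_subsets(rows, parts):
        start = time.perf_counter()
        reducer(subset)
        results.append((len(subset), time.perf_counter() - start))
    return results


# Hypothetical usage with a stand-in reducer that only counts rows.
sizes = [len(s) for s in growing_subsets(list(range(100)))]
timings = time_reducer(len, list(range(40)))
```

When the data set has fewer rows than `parts`, the earliest subsets are empty, so the sketch is meant for data sets with at least 20 objects.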
TABLE 3 comparison of the calculation time and stability of the IPR Algorithm and the IFSPA-IPR Algorithm
TABLE 4 comparison of computation time and stability of ILCE Algorithm and IFSPA-ILCE Algorithm
TABLE 5 comparison of the computation time and stability of the IVPR algorithm and the IFSPA-IVPR algorithm
In modern society, with the rapid development of information technologies such as networks and sensors, the information and data that people acquire from various fields are expanding rapidly. Because of the limitations of the information and data themselves and of human involvement, the uncertainty they contain has increased significantly, and the relationships among them have become more complex. The reduction method provided by the invention can process massive fuzzy, incomplete, and inaccurate information and data more quickly and efficiently and extract the useful knowledge they imply, providing a sound basis for intelligent decision making. In addition, the method and system can use existing data on breast cancer patients to analyze data on other people who have not undergone medical examination, and thereby identify people at potentially high risk of breast cancer.
The embodiments in this description are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar, the embodiments may be referred to one another.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.
Claims (6)
1. A rapid attribute reduction method for incomplete data sets, characterized in that the method is based on a positive approximation set and comprises the following steps:
step one, inputting a data set $S = (U, C \cup D)$ of breast cancer samples, wherein S is the data set of the breast cancer samples, U is the set of breast cancer samples, C is the complete set of breast cancer condition attributes, and D is the breast cancer decision attribute;
step two, initializing red to the empty set, i.e., $red \leftarrow \emptyset$, wherein red is the selected breast cancer condition attribute set;
step three, calculating $Sig^{inner}(a_k, C, D, U)$, wherein $k \le |C|$, $a_k$ is the k-th breast cancer condition attribute, and $Sig^{inner}(\cdot)$ is the attribute importance of the k-th breast cancer condition attribute $a_k$ contained in the complete set C of breast cancer condition attributes;
step four, adding $a_k$ to red, where $Sig^{inner}(a_k, C, D, U) > 0$;
step five, setting $i \leftarrow 1$, $R_1 = red$, $P_1 = \{R_1\}$, $U_1 \leftarrow U$;
step six, judging whether [formula missing in source] and [formula missing in source] are both satisfied, wherein [symbol missing in source] is the evaluation function on the target object subset $U_1$;
if both are satisfied, iteratively searching the attributes outside red for the breast cancer condition attribute with the maximum attribute importance with respect to red and adding it to red, until the condition attributes satisfy [formula missing in source];
if either is not satisfied, proceeding to step seven;
step seven, $R_i \leftarrow R_i \cup \{a_0\}$, $P_i \leftarrow \{R_1, R_2, \ldots, R_i\}$;
step eight, returning the reduction result red of the breast cancer condition attributes and ending.
2. The rapid attribute reduction method for incomplete data sets according to claim 1, wherein step six is specifically carried out as follows:
the third step, $i \leftarrow i + 1$;
the fourth step, $red \leftarrow red \cup \{a_0\}$, where $Sig^{outer}(a_0, red, D, U_i) = \max\{Sig^{outer}(a_k, red, D, U_i)\}$, $a_k \in C - red$;
if both hold, returning to the first step; otherwise, exiting the loop of step six and proceeding to step seven.
3. The rapid attribute reduction method for incomplete data sets according to claim 1, wherein in step one, the complexity of the fast allowable class acquisition algorithm for the incomplete decision table is [formula missing in source], wherein [symbol missing in source] represents the number of objects whose value on the breast cancer condition attribute $a_k$ is missing, and [symbol missing in source] represents the number of objects whose value on $a_k$ is not missing, i.e., whose attribute value is not x.
4. A rapid attribute reduction method for incomplete data sets, characterized in that the method is based on a variable-precision positive approximation set and comprises the following steps:
step one: inputting a data set $S = (U, C \cup D)$ of breast cancer samples and a threshold $\beta \le 0.5$, wherein S is the data set of the breast cancer samples, U is the set of breast cancer samples, C is the complete set of breast cancer condition attributes, and D is the breast cancer decision attribute;
step two: initializing red to the empty set, i.e., $red \leftarrow \emptyset$, wherein red is the selected breast cancer condition attribute set;
step three: computing [formula missing in source], wherein $k \le |C|$, $a_k$ is the k-th breast cancer condition attribute, and $Sig_{\beta}^{inner}(\cdot)$ is the attribute importance of the k-th breast cancer condition attribute $a_k$ contained in the complete set C of breast cancer condition attributes;
step four: [missing in source];
step five: $i \leftarrow 1$, $R_1 = red$, $P_1 = \{R_1\}$, $U_1 \leftarrow U$;
step six: [missing in source];
if both are satisfied, iteratively searching the attributes outside red for the breast cancer condition attribute with the maximum attribute importance with respect to red and adding it to red, until the condition attributes satisfy [formula missing in source];
if either is not satisfied, proceeding to step seven;
step seven: $R_i \leftarrow R_i \cup \{a_0\}$, $P_i \leftarrow \{R_1, R_2, \ldots, R_i\}$;
step eight: returning the reduction result red of the breast cancer condition attributes and ending.
5. The rapid attribute reduction method for incomplete data sets according to claim 4, wherein step six is specifically carried out as follows:
the third step, setting $i \leftarrow i + 1$;
if both the first step and the second step hold, returning to the first step to continue the loop; otherwise, exiting the loop of step six and proceeding to step seven.
6. The rapid attribute reduction method for incomplete data sets according to claim 4, wherein in step one, the complexity of the fast allowable class acquisition algorithm for the incomplete decision table is [formula missing in source], wherein [symbol missing in source] represents the number of objects whose value on the breast cancer condition attribute $a_k$ is missing, and [symbol missing in source] represents the number of objects whose value on $a_k$ is not missing, i.e., whose attribute value is not x.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110722842.8A CN113345588A (en) | 2018-12-21 | 2018-12-21 | Rapid attribute reduction method for incomplete data set |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811574927.0A CN109828996A (en) | 2018-12-21 | 2018-12-21 | A kind of Incomplete data set rapid attribute reduction |
CN202110722842.8A CN113345588A (en) | 2018-12-21 | 2018-12-21 | Rapid attribute reduction method for incomplete data set |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811574927.0A Division CN109828996A (en) | 2018-12-21 | 2018-12-21 | A kind of Incomplete data set rapid attribute reduction |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113345588A (en) | 2021-09-03 |
Family
ID=66859919
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811574927.0A Pending CN109828996A (en) | 2018-12-21 | 2018-12-21 | A kind of Incomplete data set rapid attribute reduction |
CN202110722842.8A Pending CN113345588A (en) | 2018-12-21 | 2018-12-21 | Rapid attribute reduction method for incomplete data set |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811574927.0A Pending CN109828996A (en) | 2018-12-21 | 2018-12-21 | A kind of Incomplete data set rapid attribute reduction |
Country Status (1)
Country | Link |
---|---|
CN (2) | CN109828996A (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113221674B (en) * | 2021-04-25 | 2023-01-24 | 广东电网有限责任公司东莞供电局 | Video stream key frame extraction system and method based on rough set reduction and SIFT |
CN115392582B (en) * | 2022-09-01 | 2023-11-14 | 广东工业大学 | Crop yield prediction method based on increment fuzzy rough set attribute reduction |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101763529A (en) * | 2010-01-14 | 2010-06-30 | 中山大学 | Rough set attribute reduction method based on genetic algorithm |
CN103336791A (en) * | 2013-06-06 | 2013-10-02 | 湖州师范学院 | Hadoop-based fast rough set attribute reduction method |
CN103336790A (en) * | 2013-06-06 | 2013-10-02 | 湖州师范学院 | Hadoop-based fast neighborhood rough set attribute reduction method |
Non-Patent Citations (1)
Title |
---|
YUHUA QIAN et al.: "An efficient accelerator for attribute reduction from incomplete data in rough set framework", PATTERN RECOGNITION, pages 2 - 5 * |
Also Published As
Publication number | Publication date |
---|---|
CN109828996A (en) | 2019-05-31 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||