CN113345588A - Rapid attribute reduction method for incomplete data set - Google Patents

Info

Publication number
CN113345588A
CN113345588A CN202110722842.8A CN202110722842A CN113345588A CN 113345588 A CN113345588 A CN 113345588A CN 202110722842 A CN202110722842 A CN 202110722842A CN 113345588 A CN113345588 A CN 113345588A
Authority
CN
China
Prior art keywords
attribute
breast cancer
red
incomplete
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110722842.8A
Other languages
Chinese (zh)
Inventor
闫涛
韩崇昭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University
Priority to CN202110722842.8A
Publication of CN113345588A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/10 Pre-processing; Data cleansing

Abstract

The invention relates to a rapid attribute reduction method for incomplete data sets. The method uses existing data on breast cancer patients to analyze people who have not undergone medical examination, and thereby identifies those potentially at high risk of breast cancer. The invention adopts the IFSPA algorithm and the IFSPA-IVPR algorithm, so that attribute reduction on incomplete data sets is completed more efficiently while the discernibility of the original attributes is preserved. The method outperforms existing algorithms in time complexity and stability, and the improvement is especially marked on large-scale data sets.

Description

Rapid attribute reduction method for incomplete data set
This application is a divisional application of the patent application entitled "A rapid attribute reduction method for incomplete data sets", filed on December 21, 2018, with application number 201811574927.0.
Technical Field
The invention relates to the technical field of medicine, in particular to a rapid attribute reduction method for an incomplete data set.
Background
Breast cancer arises when mammary epithelial cells proliferate uncontrollably under the action of various carcinogenic factors. Together with cervical cancer, it is known as one of the two major "hidden killers" of women. Data released in 2018 by the International Agency for Research on Cancer (IARC) show that breast cancer accounts for 24.2% of female cancers worldwide, ranking first among female cancers, with 52.9% of cases occurring in developing countries. Most breast cancer patients have no obvious early symptoms, so the disease is easily overlooked and patients do not seek medical attention in time. Moreover, diagnosing breast cancer requires professional medical examinations such as imaging, tissue biopsy, and tumor marker tests. No good method yet exists for discovering and predicting potential breast cancer patients in time.
Therefore, how to design a rapid attribute reduction method for incomplete data sets, one that uses existing data on breast cancer patients to perform big data analysis on people who have not undergone medical examination and so determines who is potentially at high risk of breast cancer, has become a technical problem to be solved in this field.
Feature selection, also called attribute reduction, is an important research topic shared by pattern recognition, data mining, machine learning, and related fields. In recent years, both the number of objects and the number of dimensions in data sets have grown significantly. For example, databases in many real-world applications store hundreds or even thousands of condition attributes. It is well known that condition attributes irrelevant to the recognition or classification task can significantly degrade the performance of the algorithms involved. In other words, storing and processing all condition attributes, relevant and irrelevant alike, incurs considerable storage and computation costs. To solve this problem, some scholars have proposed deleting condition attributes that do not affect recognition or classification accuracy. Deleting some condition attributes is therefore not only permissible but even necessary to reduce computational time complexity.
Attribute reduction methods follow two main strategies, namely wrapper and filter. The former uses a learning algorithm to evaluate and select condition attribute subsets, while the latter selects condition attributes according to some significance measure, such as information gain, consistency, distance, or dependency. These measures fall into two main categories, namely distance-based measures and consistency-based measures. Attribute reduction in rough set theory provides a systematic theoretical framework for consistency-based attribute reduction: it does not seek to maximize class separability, but rather to ensure that the selected subset of condition attributes has the same discernibility as the full set of condition attributes.
In general, two types of data are encountered, namely numerical data and symbolic data. Numerical data can be handled in two ways: one is to use fuzzy rough set theory, and the other is to discretize the numerical condition attribute values. Scholars have also proposed many methods for handling hybrid condition attribute values. In classical rough set theory, attribute reduction algorithms treat all attribute values as symbolic data. After the raw data has been preprocessed, classical rough set theory can be used to select the subset of condition attributes best suited to the recognition or classification task.
Attribute reduction based on rough set theory starts from a data table, also called an information system. It contains all the data about the objects of interest, which are described by a finite set of condition attributes. Information systems are classified as complete or incomplete according to whether they contain missing or null data: an incomplete information system is one that contains missing or null values. If the condition attributes and the decision attribute of an incomplete information system are distinguished from each other, the system is called an incomplete decision system or incomplete decision table. Attribute reduction on incomplete data typically starts from an incomplete decision table.
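The notion of an incomplete decision table can be made concrete with a small sketch. It assumes that missing values are marked with "*" (the marker is an assumption, not taken from the source) and uses the tolerance relation that is usual for incomplete information systems: two objects are indistinguishable on a set of attributes when, on every attribute, their values agree or at least one of them is missing. The function names and the toy table are hypothetical.

```python
from __future__ import annotations

from typing import Hashable, Sequence

MISSING = "*"  # assumed marker for a missing value


def tolerant(x: Sequence[Hashable], y: Sequence[Hashable], attrs: Sequence[int]) -> bool:
    """True if x and y agree on every attribute in attrs, treating missing values as wildcards."""
    return all(x[a] == y[a] or x[a] == MISSING or y[a] == MISSING for a in attrs)


def tolerance_classes(table: Sequence[Sequence[Hashable]], attrs: Sequence[int]) -> list[set[int]]:
    """For each object (row index), collect the objects it cannot be told apart from."""
    n = len(table)
    return [{j for j in range(n) if tolerant(table[i], table[j], attrs)} for i in range(n)]


# Toy incomplete table: rows are objects, columns are condition attributes.
toy = [
    ("high", "yes"),
    ("high", "*"),
    ("low", "no"),
    ("*", "no"),
]
classes = tolerance_classes(toy, attrs=[0, 1])
```

This pairwise version costs time quadratic in the number of objects, which is the kind of cost a faster class acquisition algorithm is meant to avoid.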
Over the last two decades, many new attribute reduction methods based on rough set theory have emerged. Among them, Skowron proposed the discernibility matrix method, which obtains all attribute reductions of a data set. However, this method is extremely time-consuming on large-scale data. To make attribute reduction more efficient, many scholars have proposed heuristic attribute reduction algorithms based on rough set theory, each of which preserves some specific property of a given information system. To handle incomplete decision tables, Kryszkiewicz extended Skowron's discernibility matrix method to a generalized discernibility matrix method that obtains all attribute reductions of an incomplete decision table. Yang and Shu proposed a heuristic algorithm for incomplete decision tables based on the idea of positive-region reduction, which guarantees that the positive region of the target decision table remains unchanged after reduction. Yan et al. defined a new information entropy to measure the uncertainty of incomplete information systems and removed redundant condition attributes using the corresponding conditional entropy. Just as attribute reductions in the classical rough set model can be found by introducing Shannon's information entropy, this method computes attribute reductions of incomplete decision tables by introducing an extended conditional entropy.
However, all of the above methods are slow to some extent and cannot cope with the huge time consumption incurred when processing large-scale incomplete data.
Disclosure of Invention
The invention aims to provide a rapid attribute reduction method for incomplete data sets that allows potential breast cancer cases to be discovered and predicted in advance from big data, and that completes attribute reduction more efficiently while preserving the discernibility of the original attributes of the incomplete data set.
In order to achieve the purpose, the invention provides the following scheme:
A rapid attribute reduction method for incomplete data sets, based on a positive approximation set, comprises the following steps:
Step one: input a data set S = (U, C ∪ D) of breast cancer samples, where S is the data set of breast cancer samples, U is the set of breast cancer samples, C is the full set of breast cancer condition attributes, and D is the breast cancer decision attribute;
Step two: initialize red to the empty set, i.e., red ← ∅, where red is the set of selected breast cancer condition attributes;
Step three: compute Sig^inner(a_k, C, D, U), where k ≤ |C|, a_k is the k-th breast cancer condition attribute, and Sig^inner(·) is the attribute significance of the k-th breast cancer condition attribute a_k contained in the full set C of breast cancer condition attributes;
Step four: add a_k to red if Sig^inner(a_k, C, D, U) > 0;
Step five: let i ← 1, R_1 = red, P_1 = {R_1}, U_1 ← U;
Step six: judge whether
Figure BDA0003137070130000033
and
Figure BDA0003137070130000034
hold, where
Figure BDA0003137070130000035
is the evaluation function of the target object subset U_1;
if both hold, repeatedly search the attributes outside red for the breast cancer condition attribute with the greatest attribute significance and add it to red, until
Figure BDA0003137070130000036
is satisfied; if either does not hold, go to step seven;
Step seven: R_i ← R_i ∪ {a_0}, P_i ← {R_1, R_2, ..., R_i};
Step eight: return the reduction result red of the breast cancer condition attributes and end.
Optionally, step six specifically comprises:
first, compute the positive region in the positive approximation set
Figure BDA0003137070130000041
second,
Figure BDA0003137070130000042
third, let i ← i + 1;
fourth, let red ← red ∪ {a_0}, where Sig^outer(a_0, red, D, U_i) = max{Sig^outer(a_k, red, D, U_i)}, a_k ∈ C − red;
fifth, judge whether
Figure BDA0003137070130000043
and
Figure BDA0003137070130000044
hold:
if both hold, return to the first sub-step; otherwise, exit the loop of step six and proceed to step seven.
Optionally, in step one, the complexity of the fast allowable class acquisition algorithm for the incomplete decision table is
Figure BDA0003137070130000045
where
Figure BDA0003137070130000046
denotes the number of objects whose value of the breast cancer condition attribute a_k is missing, and
Figure BDA0003137070130000047
denotes the number of objects whose value of the breast cancer condition attribute a_k is not missing, i.e., whose attribute value is not x.
The invention also provides a rapid attribute reduction method for incomplete data sets, based on a variable-precision positive approximation set, comprising the following steps:
Step one: input a data set S = (U, C ∪ D) of breast cancer samples and a threshold β ≤ 0.5, where S is the data set of breast cancer samples, U is the set of breast cancer samples, C is the full set of breast cancer condition attributes, and D is the breast cancer decision attribute;
Step two: initialize red to the empty set, i.e., red ← ∅, where red is the set of selected breast cancer condition attributes;
Step three: compute
Figure BDA00031370701300000410
where k ≤ |C|, a_k is the k-th breast cancer condition attribute, and Sig_3^inner(·) is the attribute significance of the k-th breast cancer condition attribute a_k contained in the full set C of breast cancer condition attributes;
Step four: add a_k to red, where
Figure BDA00031370701300000411
Step five: let i ← 1, R_1 = red, P_1 = {R_1}, U_1 ← U;
Step six: judge whether
Figure BDA00031370701300000412
and
Figure BDA00031370701300000413
hold:
if both hold, repeatedly search the attributes outside red for the breast cancer condition attribute with the greatest attribute significance and add it to red, until
Figure BDA00031370701300000414
is satisfied; if either does not hold, go to step seven;
Step seven: R_i ← R_i ∪ {a_0}, P_i ← {R_1, R_2, ..., R_i};
Step eight: return the reduction result red of the breast cancer condition attributes and end.
Optionally, step six specifically comprises:
first, compute the positive region in the positive approximation set
Figure BDA0003137070130000051
second, compute
Figure BDA0003137070130000052
third, let i ← i + 1;
fourth, let red ← red ∪ {a_0}, where
Figure BDA0003137070130000053
a_k ∈ C − red;
fifth, judge whether
Figure BDA0003137070130000054
and
Figure BDA0003137070130000055
hold:
if both hold, return to the first sub-step and continue the loop; otherwise, exit the loop of step six and proceed to step seven.
Optionally, in step one, the complexity of the fast allowable class acquisition algorithm for the incomplete decision table is
Figure BDA0003137070130000056
where
Figure BDA0003137070130000057
denotes the number of objects whose value of the breast cancer condition attribute a_k is missing, and
Figure BDA0003137070130000058
denotes the number of objects whose value of the breast cancer condition attribute a_k is not missing, i.e., whose attribute value is not x.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
The invention uses existing data on breast cancer patients to analyze people who have not undergone medical examination, so as to identify those potentially at high risk of breast cancer. Meanwhile, the QAAC algorithm is used to estimate the time complexity for the data, and computation is carried out under both the incomplete rough set model and the variable-precision incomplete rough set model, so that the method completes attribute reduction for incomplete data sets and variable-precision incomplete rough sets more efficiently while preserving the discernibility of the original attributes. The method outperforms existing algorithms in time complexity and stability, and the improvement is especially marked on large-scale data sets: computational efficiency is higher, and both the average computation time and its standard deviation are lower, indicating better robustness.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed for the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of the IFSPA algorithm of the present invention;
FIG. 2 is a flow chart of the IFSPA-IVPR algorithm of the present invention;
FIG. 3a is a statistical plot of sample number versus data set;
FIG. 3b is a statistical chart of the number of conditional attributes and data sets;
FIG. 3c is a statistical chart of the number of missing attribute values and the data set;
FIG. 3d is a statistical chart of the number of decision classes and data sets;
FIG. 3e is a statistical plot of incomplete rate versus data set;
FIG. 4 is a graph of IPR and IFSPA-IPR algorithm computation time versus data size using Breast-cancer-wisconsin data set;
FIG. 5 is a graph of the calculation time versus data size for the ILCE and IFSPA-ILCE algorithms when a Breast-cancer-wisconsin data set is used;
FIG. 6a is a graph of the calculation time of the IVPR and IFSPA-IVPR algorithms versus the data size when β is 0;
FIG. 6b is a graph of the calculation time of the IVPR and IFSPA-IVPR algorithms versus the data size when β is 0.1;
FIG. 6c is a graph of the calculation time of the IVPR and IFSPA-IVPR algorithms versus the data size when β is 0.2;
FIG. 7a is a graph of the calculation time of the IVPR and IFSPA-IVPR algorithms versus the data size when β is 0;
FIG. 7b is a graph of the calculation time of the IVPR and IFSPA-IVPR algorithms versus the data size when β is 0.1;
FIG. 7c is a graph of the calculation time of the IVPR and IFSPA-IVPR algorithms versus the data size when β is 0.2;
FIG. 8a is a graph of the calculation time of the IVPR and IFSPA-IVPR algorithms versus the data size when β is 0;
FIG. 8b is a graph of the calculation time of the IVPR and IFSPA-IVPR algorithms versus the data size when β is 0.1;
FIG. 8c is a graph of the calculation time of the IVPR and IFSPA-IVPR algorithms versus the data size when β is 0.2;
FIG. 9a is a graph of the calculation time of the IVPR and IFSPA-IVPR algorithms versus the data size when β is 0;
FIG. 9b is a graph of the calculation time of the IVPR and IFSPA-IVPR algorithms versus the data size when β is 0.1;
FIG. 9c is a graph of the calculation time of the IVPR and IFSPA-IVPR algorithms versus the data size when β is 0.2;
FIG. 10 is a comparison of the computation times of the IPR, IFSPA-IPR, ILCE and IFSPA-ILCE algorithms;
FIG. 11 is a comparison of the computation time of the IVPR and IFSPA-IVPR algorithms at different thresholds.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art from these embodiments without creative effort fall within the protection scope of the present invention.
The invention aims to provide a rapid attribute reduction method for incomplete data sets that uses existing data on breast cancer patients to analyze people who have not undergone medical examination, and thereby identifies those potentially at high risk of breast cancer. The invention adopts the IFSPA algorithm and the IFSPA-IVPR algorithm, so that attribute reduction on incomplete data sets is completed more efficiently while the discernibility of the original attributes is preserved. The method outperforms existing algorithms in time complexity and stability, and the improvement is especially marked on large-scale data sets.
To make the above objects, features, and advantages of the present invention easier to understand, the invention is described in further detail below with reference to the drawings and specific embodiments.
Example 1:
Referring to FIG. 1, the present invention provides a rapid attribute reduction method for incomplete data sets, based on a positive approximation set, comprising the following steps:
Step one: input a data set S = (U, C ∪ D) of breast cancer samples, where S is the data set of breast cancer samples, U is the set of breast cancer samples, C is the full set of breast cancer condition attributes, and D is the breast cancer decision attribute;
Step two: initialize red to the empty set, i.e., red ← ∅, where red is the set of selected breast cancer condition attributes;
Step three: compute Sig^inner(a_k, C, D, U), where k ≤ |C|, a_k is the k-th breast cancer condition attribute, and Sig^inner(·) is the attribute significance of the k-th breast cancer condition attribute a_k contained in the full set C of breast cancer condition attributes;
Step four: add a_k to red if Sig^inner(a_k, C, D, U) > 0;
Step five: let i ← 1, R_1 = red, P_1 = {R_1}, U_1 ← U;
Step six: judge whether
Figure BDA0003137070130000081
and
Figure BDA0003137070130000082
hold, where
Figure BDA0003137070130000083
is the evaluation function of the target object subset U_1;
if both hold, repeatedly search the attributes outside red for the breast cancer condition attribute with the greatest attribute significance and add it to red, until
Figure BDA0003137070130000084
is satisfied; if either does not hold, go to step seven;
Step seven: R_i ← R_i ∪ {a_0}, P_i ← {R_1, R_2, ..., R_i};
Step eight: return the reduction result red of the breast cancer condition attributes and end.
Specifically, step six comprises:
first, compute the positive region in the positive approximation set
Figure BDA0003137070130000085
second,
Figure BDA0003137070130000086
third, let i ← i + 1;
fourth, let red ← red ∪ {a_0}, where Sig^outer(a_0, red, D, U_i) = max{Sig^outer(a_k, red, D, U_i)}, a_k ∈ C − red;
fifth, judge whether
Figure BDA0003137070130000087
and
Figure BDA0003137070130000088
hold:
if both hold, return to the first sub-step; otherwise, exit the loop of step six and proceed to step seven.
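The overall flow of the steps above can be sketched in Python. This is a simplified illustration under stated assumptions, not the patented procedure itself: the formula images are not available, so the stopping test is assumed to be "the positive region of red equals that of the full attribute set", attribute significance is assumed to be the change in positive-region size, missing values are assumed to be marked "*", and the decision attribute is assumed to have no missing values. The bookkeeping of R_i and P_i is omitted, and all function names are hypothetical.

```python
from __future__ import annotations

MISSING = "*"  # assumed marker for a missing value


def tolerant(x, y, attrs):
    """x and y are indistinguishable on attrs (missing values match anything)."""
    return all(x[a] == y[a] or MISSING in (x[a], y[a]) for a in attrs)


def positive_region(rows, labels, universe, attrs):
    """Objects of `universe` whose tolerance class inside `universe` carries a single decision."""
    return {
        u for u in universe
        if all(labels[v] == labels[u] for v in universe if tolerant(rows[u], rows[v], attrs))
    }


def reduce_attributes(rows, labels):
    all_attrs = list(range(len(rows[0])))
    full = set(range(len(rows)))
    target = positive_region(rows, labels, full, all_attrs)

    # Steps three and four: keep attributes whose removal shrinks the positive region.
    red = [
        a for a in all_attrs
        if len(positive_region(rows, labels, full, [b for b in all_attrs if b != a])) < len(target)
    ]

    # Steps five to seven: add the most significant outside attribute, each time
    # discarding the objects already in the positive region (the shrinking universe U_i).
    covered = positive_region(rows, labels, full, red)
    universe = full - covered
    while len(covered) < len(target):
        candidates = [a for a in all_attrs if a not in red]
        best = max(candidates, key=lambda a: len(positive_region(rows, labels, universe, red + [a])))
        red.append(best)
        newly = positive_region(rows, labels, universe, red)
        covered |= newly
        universe -= newly

    # Step eight: return the reduct.
    return sorted(red)


toy_rows = [("a", "a", "p"), ("a", "a", "*"), ("b", "b", "p"), ("b", "b", "q")]
toy_labels = [0, 0, 1, 1]
reduct = reduce_attributes(toy_rows, toy_labels)
```

Dropping covered objects is safe in this sketch because an object outside the positive region always conflicts with another object that is also outside it, so no conflict is lost when the universe shrinks. The saving comes from evaluating each candidate attribute on fewer and fewer objects.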
In step one, the complexity of the fast allowable class acquisition algorithm for the incomplete decision table is
Figure BDA0003137070130000089
where
Figure BDA00031370701300000810
denotes the number of objects whose value of the breast cancer condition attribute a_k is missing, and
Figure BDA00031370701300000811
denotes the number of objects whose value of the breast cancer condition attribute a_k is not missing, i.e., whose attribute value is not x.
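The complexity statement above separates, for each attribute, the objects with a missing value from those with a known value. One way to exploit that split is to bucket objects by value, one attribute at a time, as in the sketch below. It is an illustrative stand-in rather than the patent's own class acquisition algorithm, whose details and complexity formula are not reproduced here; the "*" marker and all names are assumptions.

```python
from __future__ import annotations

from collections import defaultdict

MISSING = "*"  # assumed marker for a missing value


def single_attribute_classes(column):
    """Allowable class of every object with respect to one attribute, via value buckets."""
    buckets = defaultdict(set)   # known value -> objects carrying it
    missing = set()              # objects whose value is missing
    for i, value in enumerate(column):
        if value == MISSING:
            missing.add(i)
        else:
            buckets[value].add(i)
    everyone = set(range(len(column)))
    # A missing value is compatible with everything; a known value is compatible
    # with its own bucket plus all objects whose value is missing.
    return [everyone if value == MISSING else buckets[value] | missing for value in column]


def allowable_classes(table, attrs):
    """Intersect the single-attribute classes over all attributes in attrs."""
    n = len(table)
    classes = [set(range(n)) for _ in range(n)]
    for a in attrs:
        per_attr = single_attribute_classes([row[a] for row in table])
        for i in range(n):
            classes[i] &= per_attr[i]
    return classes


toy = [("high", "yes"), ("high", "*"), ("low", "no"), ("*", "no")]
classes = allowable_classes(toy, attrs=[0, 1])
```

The bucketing pass touches each object once per attribute; the set intersections still dominate the cost when the classes are large, so this only hints at how sorting or bucketing by value can replace pairwise comparison.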
Referring to FIG. 2, the present invention further provides a rapid attribute reduction method for incomplete data sets, based on a variable-precision positive approximation set, comprising the following steps:
Step one: input a data set S = (U, C ∪ D) of breast cancer samples and a threshold β ≤ 0.5, where S is the data set of breast cancer samples, U is the set of breast cancer samples, C is the full set of breast cancer condition attributes, and D is the breast cancer decision attribute;
Step two: initialize red to the empty set, i.e., red ← ∅, where red is the set of selected breast cancer condition attributes;
Step three: compute
Figure BDA00031370701300000814
where k ≤ |C|, a_k is the k-th breast cancer condition attribute, and Sig_3^inner(·) is the attribute significance of the k-th breast cancer condition attribute a_k contained in the full set C of breast cancer condition attributes;
Step four: add a_k to red, where
Figure BDA0003137070130000091
Step five: let i ← 1, R_1 = red, P_1 = {R_1}, U_1 ← U;
Step six: judge whether
Figure BDA0003137070130000092
and
Figure BDA0003137070130000093
hold:
if both hold, repeatedly search the attributes outside red for the breast cancer condition attribute with the greatest attribute significance and add it to red, until
Figure BDA0003137070130000094
is satisfied; if either does not hold, go to step seven;
Step seven: R_i ← R_i ∪ {a_0}, P_i ← {R_1, R_2, ..., R_i};
Step eight: return the reduction result red of the breast cancer condition attributes and end.
Specifically, step six comprises:
first, compute the positive region in the positive approximation set
Figure BDA0003137070130000095
second, compute
Figure BDA0003137070130000096
third, let i ← i + 1;
fourth, let red ← red ∪ {a_0}, where
Figure BDA0003137070130000097
a_k ∈ C − red;
fifth, judge whether
Figure BDA0003137070130000098
and
Figure BDA0003137070130000099
hold:
if both hold, return to the first sub-step and continue the loop; otherwise, exit the loop of step six and proceed to step seven.
In step one, the complexity of the fast allowable class acquisition algorithm for the incomplete decision table is
Figure BDA00031370701300000910
where
Figure BDA00031370701300000911
denotes the number of objects whose value of the breast cancer condition attribute a_k is missing, and
Figure BDA00031370701300000912
denotes the number of objects whose value of the breast cancer condition attribute a_k is not missing, i.e., whose attribute value is not x.
With this reduction method, existing data on breast cancer patients can be used to analyze people who have not undergone medical examination and to identify those potentially at high risk of breast cancer. At the same time, massive amounts of fuzzy, incomplete, and inaccurate information and data can be processed more quickly and efficiently, and the useful knowledge they contain can be extracted, providing a sound basis for intelligent decision making.
To achieve efficient conditional attribute selection, researchers in the related art have proposed many heuristic attribute selection algorithms for incomplete data sets. For the sake of simplicity, we focus primarily on two representative conditional attribute selection algorithms for incomplete datasets.
Given an incomplete decision table IDT = (U, C ∪ D), we obtain the classification of all objects with respect to the full set C of condition attributes, U/SIM(C) = {S_C(u_1), S_C(u_2), ..., S_C(u_|U|)}, and the partition of the universe U with respect to the decision attribute D, U/D = {X_1, X_2, ..., X_r}. In practice, the partition U/D is expressed through the allowable class of each object in the universe U, i.e., U/SIM(D) = {S_D(u_1), S_D(u_2), ..., S_D(u_|U|)}. Without loss of generality, let
Figure BDA0003137070130000101
where |X_j| = s_j,
Figure BDA0003137070130000102
The relationship between U/D and U/SIM(D) is as follows.
Figure BDA0003137070130000103
Figure BDA0003137070130000104
From this relationship, the positive region of an incomplete decision table can be equivalently redefined in the following form.
Figure BDA0003137070130000105
In light of the above, we next focus on the significance measures of two types of condition attributes.
The prior art proposes a heuristic condition attribute reduction algorithm called positive-region reduction (PR). It guarantees that the positive region of the target decision attribute remains unchanged during attribute reduction. Applying the same idea, the prior art proposes another heuristic condition attribute reduction algorithm for incomplete decision tables, called IPR, which likewise keeps the positive region of the target decision attribute unchanged. In this algorithm, the significance of a condition attribute is defined as follows.
Definition 1. Let IDT = (U, C ∪ D) be an incomplete decision table with condition attribute subset
Figure BDA0003137070130000106
For
Figure BDA0003137070130000107
the significance of a condition attribute a contained in the subset B is defined as follows.
Figure BDA0003137070130000108
Here γ_B(D) = |POS_B(D)|/|U|.
Definition 2. Let IDT = (U, C ∪ D) be an incomplete decision table with condition attribute subset
Figure BDA0003137070130000109
For
Figure BDA00031370701300001010
the significance of a condition attribute a outside the subset B is defined as follows.
Figure BDA0003137070130000111
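The formula images for Definitions 1 and 2 are not reproduced in this text. In the positive-region literature, such measures are commonly written as Sig^inner(a, B, D, U) = γ_B(D) − γ_{B−{a}}(D) for a ∈ B and Sig^outer(a, B, D, U) = γ_{B∪{a}}(D) − γ_B(D) for a ∉ B. The sketch below computes both under that assumption, with γ_B(D) = |POS_B(D)|/|U| as stated above, missing values assumed to be marked "*", and a decision attribute without missing values. Function names are hypothetical.

```python
from __future__ import annotations

from fractions import Fraction

MISSING = "*"  # assumed marker for a missing value


def tolerant(x, y, attrs):
    """x and y are indistinguishable on attrs (missing values match anything)."""
    return all(x[a] == y[a] or MISSING in (x[a], y[a]) for a in attrs)


def gamma(rows, labels, attrs) -> Fraction:
    """Dependency degree |POS_B(D)| / |U|, kept exact with Fraction."""
    n = len(rows)
    pos = sum(
        all(labels[v] == labels[u] for v in range(n) if tolerant(rows[u], rows[v], attrs))
        for u in range(n)
    )
    return Fraction(pos, n)


def sig_inner(a, B, rows, labels) -> Fraction:
    """Assumed form: loss in dependency degree when a is removed from B."""
    return gamma(rows, labels, list(B)) - gamma(rows, labels, [b for b in B if b != a])


def sig_outer(a, B, rows, labels) -> Fraction:
    """Assumed form: gain in dependency degree when a is added to B."""
    return gamma(rows, labels, list(B) + [a]) - gamma(rows, labels, list(B))


rows = [("a", "p"), ("a", "*"), ("b", "p"), ("b", "q")]
labels = [0, 0, 1, 1]
inner_0 = sig_inner(0, [0, 1], rows, labels)   # attribute 0 is indispensable here
inner_1 = sig_inner(1, [0, 1], rows, labels)   # attribute 1 is redundant here
outer_0 = sig_outer(0, [1], rows, labels)
```

An inner significance greater than zero marks an attribute that cannot be dropped from B without shrinking the positive region, which matches the selection rule in step four of the method.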
An information entropy is defined by which uncertainty in an incomplete information system is measured and at the same time conditional attributes of redundancy are pruned using the information entropy. This attribute reduction algorithm is denoted ILCE, and the conditional information entropy in the algorithm is defined as follows.
Figure BDA0003137070130000112
where S_C(u_i) ∈ U/SIM(C) and S_D(u_i) ∈ U/SIM(D). The corresponding significance definitions are listed below in sequence.
Definition 3. Let IDT = (U, C ∪ D) be an incomplete decision table and B ⊆ C a subset of condition attributes. For every a ∈ B, the significance of the condition attribute a contained in the subset B is defined as follows.
Figure BDA0003137070130000115
Definition 4. Let IDT = (U, C ∪ D) be an incomplete decision table and B ⊆ C a subset of condition attributes. For every a ∈ C − B, the significance of the condition attribute a lying outside the subset B is defined as follows.
Figure BDA0003137070130000118
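The entropy-based measures of Definitions 3 and 4 can be sketched in the same way. The conditional entropy formula is not reproduced in this text, so the code below assumes one form found in the rough set literature, E(D|B) = Σ_i (|S_B(u_i)| − |S_B(u_i) ∩ S_D(u_i)|) / |U|², and assumes that significance is the resulting change in entropy. The patent's own formula may differ. The decision attribute is taken to have no missing values, and all names and data are hypothetical.

```python
MISSING = "*"  # assumed marker for a missing attribute value


def allowable_class(rows, i, attrs):
    """S_B(u_i): objects indistinguishable from u_i on attrs."""
    x = rows[i]
    return {j for j, y in enumerate(rows)
            if all(x[a] == y[a] or MISSING in (x[a], y[a]) for a in attrs)}


def conditional_entropy(rows, dec, attrs):
    """Assumed form: sum of (|S_B| - |S_B & S_D|) over objects, over |U|^2."""
    n = len(rows)
    total = 0
    for i in range(n):
        s_b = allowable_class(rows, i, attrs)
        s_d = {j for j in range(n) if dec[j] == dec[i]}
        total += len(s_b) - len(s_b & s_d)
    return total / (n * n)


def sig_inner_entropy(a, B, rows, dec):
    """Assumed form: entropy increase when a is removed from B."""
    without_a = [b for b in B if b != a]
    return conditional_entropy(rows, dec, without_a) - conditional_entropy(rows, dec, B)


def sig_outer_entropy(a, B, rows, dec):
    """Assumed form: entropy decrease when a is added to B."""
    return conditional_entropy(rows, dec, B) - conditional_entropy(rows, dec, list(B) + [a])


rows = [(1, 0), (1, 1), ("*", 1), (0, 0)]
dec = [0, 1, 1, 0]

print(conditional_entropy(rows, dec, [0, 1]))  # 0.0
print(conditional_entropy(rows, dec, [0]))     # 0.375
print(sig_outer_entropy(1, [0], rows, dec))    # 0.375
```

Under this assumed form the entropy is zero exactly when every allowable class is consistent on the decision, which matches the stopping condition used with the positive-region measure.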
In the incomplete variable-precision rough set model, the corresponding significance measures are used to design an algorithm that keeps the β positive region of the target decision attribute unchanged while searching for the required condition attribute reduction result.
Definition 5. Let IDT = (U, C ∪ D) be an incomplete decision table and B ⊆ C a subset of condition attributes. For every a ∈ B, the significance of the condition attribute a contained in the subset B is defined as follows.
Figure BDA00031370701300001111
where
Figure BDA00031370701300001112
Definition 6. Let IDT = (U, C ∪ D) be an incomplete decision table and B ⊆ C a subset of condition attributes. For every a ∈ C − B, the significance of the condition attribute a lying outside the subset B is defined as follows.
Figure BDA00031370701300001115
All of the above definitions can be used in heuristic attribute selection algorithms for incomplete data sets to select the condition attributes that are needed.
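A generic heuristic of this kind can be sketched as a forward selection that takes the evaluation function as a parameter. The version below is only an illustration: it starts from the attributes whose removal from the full set changes the evaluation (inner significance above zero), then repeatedly adds the outside attribute that gives the best evaluation until the reduct scores the same as the full attribute set. The function names, the "*" marker, the toy data, and the choice of positive-region size as the evaluation function are all assumptions.

```python
from typing import Callable, List, Sequence

MISSING = "*"  # assumed marker for a missing attribute value


def pos_count(rows, dec, attrs) -> int:
    """Size of the positive region of the decision with respect to attrs."""
    def tolerant(x, y):
        return all(x[a] == y[a] or MISSING in (x[a], y[a]) for a in attrs)
    return sum(
        all(dec[j] == dec[i] for j, y in enumerate(rows) if tolerant(x, y))
        for i, x in enumerate(rows)
    )


def forward_select(n_attrs: int, ef: Callable[[Sequence[int]], float]) -> List[int]:
    """Greedy forward search; ef must not depend on attribute order."""
    all_attrs = list(range(n_attrs))
    target = ef(all_attrs)
    # Attributes that cannot be dropped from the full set without a change.
    red = [a for a in all_attrs
           if ef([b for b in all_attrs if b != a]) != target]
    while ef(red) != target:
        rest = [a for a in all_attrs if a not in red]
        # Larger ef is better here; ties go to the lowest attribute index.
        red.append(max(rest, key=lambda a: ef(red + [a])))
    return red


# Toy table: attributes 1 and 2 are interchangeable, attribute 0 is essential.
rows = [(0, 0, 0), (0, 1, 1), (1, 0, 0), (1, 1, 1), (2, "*", 0)]
dec = [0, 1, 1, 0, 0]

print(forward_select(3, lambda attrs: pos_count(rows, dec, attrs)))  # [0, 1]
```

An entropy-based evaluation would plug in the same way after negating it, since the search above treats larger values as better.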
In an attribute reduction algorithm based on rough set theory, the allowable classes (tolerance classes) generated by the condition attributes of an incomplete decision table must be computed, and this step largely determines the overall computation time of the attribute reduction algorithm. To design an efficient condition attribute reduction algorithm, a fast allowable-class acquisition algorithm for incomplete decision tables is provided first. The algorithm is based mainly on the idea of radix sort, and its time complexity is
Figure BDA0003137070130000121
where
Figure BDA0003137070130000122
denotes the number of objects whose value under the condition attribute a_k is missing, and
Figure BDA0003137070130000123
denotes the number of objects whose value under the condition attribute a_k is not missing, i.e., whose attribute value is not the missing-value marker. A major application of rough set theory is knowledge discovery from symbolic data, where the number of distinct values of each condition attribute is small enough to be regarded as a constant. Thus, the following quantity does not affect the time complexity of the algorithm:
Figure BDA0003137070130000124
In addition, the number of objects with a missing value under each condition attribute is also typically small. In the worst case, all values under a condition attribute are missing, i.e., the number of objects with missing values on that attribute reaches its maximum |U|, which means that the condition attribute provides no useful classification information. Therefore, the time complexity of the attribute reduction algorithm can be further reduced to the following:
Figure BDA0003137070130000125
Therefore, when computing the allowable classes of large-scale incomplete data, the algorithm has the advantage that the dimensionality of the data affects the computation time less than the number of objects does. The specific flow of the algorithm and its working principle are not discussed here.
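Since the flow of the fast allowable-class algorithm is not given here, the following is only a bucketing sketch in the spirit of radix sort, not the patented procedure. For each attribute it makes one pass that groups objects by known value and collects the objects with a missing value, then narrows each object's allowable class to its own bucket plus the missing group. The "*" marker, the function name, and the toy data are assumptions, and no claim is made that this sketch attains the complexity stated above.

```python
from collections import defaultdict

MISSING = "*"  # assumed marker for a missing attribute value


def allowable_classes(rows, attrs):
    """Return S_B(u) for every object u, as a list of index sets."""
    n = len(rows)
    classes = [set(range(n)) for _ in range(n)]
    for a in attrs:
        buckets = defaultdict(set)  # known value -> objects holding it
        missing = set()             # objects whose value under a is missing
        for i, row in enumerate(rows):
            if row[a] == MISSING:
                missing.add(i)
            else:
                buckets[row[a]].add(i)
        # An object with a known value is compatible with its own bucket
        # plus every object whose value is missing.
        compatible = {v: members | missing for v, members in buckets.items()}
        for i, row in enumerate(rows):
            if row[a] != MISSING:
                classes[i] &= compatible[row[a]]
            # A missing value is compatible with everything: no restriction.
    return classes


rows = [(1, 0), (1, 1), ("*", 1), (0, 0)]
print(allowable_classes(rows, [0, 1]))  # [{0}, {1, 2}, {1, 2}, {3}]
```

The grouping pass touches each object once per attribute, which is why the number of objects, rather than the number of distinct values, dominates the work in this sketch.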
From the above discussion, an improved forward search algorithm is obtained that is based on a positive approximation set and the fast allowable-class acquisition algorithm. Under this algorithmic framework, the evaluation function, or stopping condition, can be expressed as EF_U(B, D) = EF_U(C, D). For example, when the conditional entropy method is adopted, the corresponding condition is EN_U(B, D) = EN_U(C, D). That is, when EF_U(B, D) = EF_U(C, D) holds, B is a condition attribute reduction result.
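A minimal sketch of such a forward search with a positive approximation is given below, using the size of the positive region as the evaluation function. After each round, the objects already in the positive region of the current reduct are removed from the working universe, so later significance computations run on fewer objects. This follows the general idea described above rather than the exact claimed steps. The function names, the "*" marker, and the toy data are assumptions.

```python
MISSING = "*"  # assumed marker for a missing attribute value


def pos(rows, dec, attrs, universe):
    """Positive region of the decision w.r.t. attrs, inside universe only."""
    def tolerant(x, y):
        return all(x[a] == y[a] or MISSING in (x[a], y[a]) for a in attrs)
    return {i for i in universe
            if all(dec[j] == dec[i]
                   for j in universe if tolerant(rows[i], rows[j]))}


def ifspa_reduce(rows, dec, n_attrs):
    all_attrs = list(range(n_attrs))
    universe = set(range(len(rows)))
    full = len(pos(rows, dec, all_attrs, universe))
    # Start from attributes with positive inner significance.
    red = [a for a in all_attrs
           if len(pos(rows, dec, [b for b in all_attrs if b != a], universe)) < full]
    while True:
        pos_red = pos(rows, dec, red, universe)
        # Stopping condition: EF(red, D) = EF(C, D) on the current universe.
        if len(pos_red) == len(pos(rows, dec, all_attrs, universe)):
            return red
        universe -= pos_red  # shrink: settled objects are no longer scanned
        rest = [a for a in all_attrs if a not in red]
        best = max(rest, key=lambda a: len(pos(rows, dec, red + [a], universe)))
        red.append(best)


rows = [(0, 0, 0), (0, 1, 1), (1, 0, 0), (1, 1, 1), (2, "*", 0)]
dec = [0, 1, 1, 0, 0]
print(ifspa_reduce(rows, dec, 3))  # [0, 1]
```

On this toy table the first round settles the last object, so the second round evaluates candidate attributes on four objects instead of five.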
The fast algorithm provided by the invention is better suited to attribute selection on large-scale incomplete data sets and can markedly reduce the time complexity. Moreover, the dimensionality of an incomplete data set affects the computation time less than its number of objects does. It can therefore be concluded that the generalized incomplete attribute selection algorithm based on positive approximation sets, i.e., IFSPA, has the potential to significantly reduce the overall computation time of attribute selection for incomplete decision tables. To verify these conclusions, the time complexity of each step of the original algorithms and of the IFSPA algorithm is listed in Table 1.
TABLE 1 comparison of time complexity of IFSPA algorithm and classical algorithm
Figure BDA0003137070130000131
All breast cancer data used in the experiments of the invention come from the UCI Machine Learning Repository of the University of California, Irvine. These incomplete data sets are therefore drawn from real settings and are reliable, which also makes the experimental results repeatable. The experimental data are summarized in Table 2, and their statistics are shown in Figs. 3a to 3e.
TABLE 2 basic information of simulation experiment data set
Figure BDA0003137070130000132
To facilitate calculation and comparison, the QAATC algorithm is used to estimate the time complexity. Throughout the experiments, the time complexity of the original methods and of the new method is compared under the incomplete rough set model and the variable-precision incomplete rough set model, as shown in Figs. 4 to 11. In Figs. 4 to 9c, the x-axis represents the size of the incomplete data subset used: each incomplete data set is divided into 20 equal segments, and the number of segments used increases from 1 to 20. The y-axis represents the computation time required by the algorithm. In Figs. 6a to 11, to ensure the accuracy of the simulation experiments while showing the performance improvement of the improved algorithm, the threshold β is set to 0, 0.1, and 0.2, respectively. In addition, the stability of the selected condition attribute subsets is evaluated by ten-fold cross-validation, as shown in Tables 3 to 5. The simulation results show that the new method provided by the invention markedly reduces the time complexity of existing incomplete attribute reduction algorithms and improves their computational efficiency. Compared with the other existing algorithms, the method also achieves a lower average computation time and a lower standard deviation while keeping the same stability, i.e., it is more robust.
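The measurement protocol described above (20 equal segments, used cumulatively from 1 to 20) can be sketched as a small timing harness. How the experiments were actually implemented is not stated, so the function name, the use of time.perf_counter, and the prefix-based splitting are assumptions made for illustration.

```python
import time


def time_on_segments(algorithm, rows, decisions, n_segments=20):
    """Time algorithm(rows, decisions) on growing prefixes of the data.

    Returns a list of (subset_size, seconds) pairs, one per prefix made
    of 1, 2, ..., n_segments equal segments.
    """
    timings = []
    for k in range(1, n_segments + 1):
        end = len(rows) * k // n_segments  # last prefix covers all rows
        start = time.perf_counter()
        algorithm(rows[:end], decisions[:end])
        timings.append((end, time.perf_counter() - start))
    return timings


# Usage with a stand-in algorithm; a real run would pass a reduction routine.
for size, seconds in time_on_segments(lambda r, d: sorted(r), list(range(40)), [0] * 40):
    print(size, f"{seconds:.6f}")
```

Plotting the subset sizes against the measured times gives curves of the kind described for the x-axis and y-axis above.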
TABLE 3 comparison of the calculation time and stability of the IPR Algorithm and the IFSPA-IPR Algorithm
Figure BDA0003137070130000141
TABLE 4 comparison of computation time and stability of ILCE Algorithm and IFSPA-ILCE Algorithm
Figure BDA0003137070130000142
TABLE 5 comparison of the computation time and stability of the IVPR algorithm and the IFSPA-IVPR algorithm
Figure BDA0003137070130000143
In modern society, with the rapid development of information technologies such as networks and sensors, the information and data that people acquire from various fields are expanding rapidly. Because of the limitations of the data and the involvement of people, the uncertainty contained in the data has increased significantly, and the relationships among the information and data have become complicated. With the reduction method provided by the invention, massive amounts of fuzzy, incomplete, and inaccurate information and data can be processed more quickly and efficiently, and the useful knowledge implied in them can be extracted, which provides a good basis for intelligent decision making. In addition, the method can use existing data on breast cancer patients to analyze data from other people who have not undergone medical examination, so that people at potentially high risk of breast cancer can be identified.
The embodiments in this description are described progressively; each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar, the embodiments may be referred to one another.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (6)

1. A rapid attribute reduction method for incomplete data sets, characterized in that the method is based on a positive approximation set and comprises the following steps:
step one, inputting a data set S = (U, C ∪ D) of breast cancer samples, wherein S is the data set of breast cancer samples, U is the set of breast cancer samples, C is the full set of breast cancer condition attributes, and D is the breast cancer decision attribute;
step two, initializing red to the empty set, i.e., red ← ∅, wherein red is the selected set of breast cancer condition attributes;
step three, calculating Sig_inner(a_k, C, D, U), wherein k ≤ |C|, a_k is the k-th breast cancer condition attribute, and Sig_inner() is the attribute significance of the k-th breast cancer condition attribute a_k contained in the full set C of breast cancer condition attributes;
step four, adding a_k to red, where Sig_inner(a_k, C, D, U) > 0;
step five, setting i ← 1, R_1 = red, P_1 = {R_1}, U_1 ← U;
step six, judging whether
Figure FDA0003137070120000013
and
Figure FDA0003137070120000014
both hold, wherein
Figure FDA0003137070120000015
is the evaluation function on the target object subset U_1;
if both hold, repeatedly searching the attributes outside red for the breast cancer condition attribute with the greatest attribute significance with respect to red and adding it to red, until the condition
Figure FDA0003137070120000016
is satisfied; if either does not hold, going to step seven;
step seven, R_i ← R_i ∪ {a_0}, P_i ← {R_1, R_2, ..., R_i};
step eight, returning the reduction result red of the breast cancer condition attributes and ending.
2. The rapid attribute reduction method for incomplete data sets according to claim 1, wherein step six specifically comprises:
first sub-step, calculating the positive region in the positive approximation set
Figure FDA0003137070120000017
second sub-step,
Figure FDA0003137070120000018
third sub-step, i ← i + 1;
fourth sub-step, red ← red ∪ {a_0}, wherein Sig_outer(a_0, red, D, U_i) = max{Sig_outer(a_k, red, D, U_i)}, a_k ∈ C − red;
fifth sub-step, judging whether
Figure FDA0003137070120000019
and
Figure FDA00031370701200000110
both hold:
if both hold, returning to the first sub-step; otherwise, exiting the loop of step six and performing step seven.
3. The rapid attribute reduction method for incomplete data sets according to claim 1, wherein in step one, the complexity of the fast allowable-class acquisition algorithm for the incomplete decision table is
Figure FDA0003137070120000021
wherein
Figure FDA0003137070120000022
denotes the number of objects whose value under the breast cancer condition attribute a_k is missing, and
Figure FDA0003137070120000023
denotes the number of objects whose value under the breast cancer condition attribute a_k is not missing, i.e., whose attribute value is not the missing-value marker.
4. A rapid attribute reduction method for incomplete data sets, characterized in that the method is based on a variable-precision positive approximation set and comprises the following steps:
step one: inputting a data set S = (U, C ∪ D) of breast cancer samples and a threshold β ≤ 0.5, wherein S is the data set of breast cancer samples, U is the set of breast cancer samples, C is the full set of breast cancer condition attributes, and D is the breast cancer decision attribute;
step two: initializing red to the empty set, i.e., red ← ∅, wherein red is the selected set of breast cancer condition attributes;
step three: calculating
Figure FDA0003137070120000026
wherein k ≤ |C|, a_k is the k-th breast cancer condition attribute, and Sig3 inner() is the attribute significance of the k-th breast cancer condition attribute a_k contained in the full set C of breast cancer condition attributes;
step four: adding a_k to red, where
Figure FDA0003137070120000027
step five: setting i ← 1, R_1 = red, P_1 = {R_1}, U_1 ← U;
step six: judging whether
Figure FDA0003137070120000028
and
Figure FDA0003137070120000029
both hold:
if both hold, repeatedly searching the attributes outside red for the breast cancer condition attribute with the greatest attribute significance with respect to red and adding it to red, until the condition
Figure FDA00031370701200000210
is satisfied; if either does not hold, going to step seven;
step seven: R_i ← R_i ∪ {a_0}, P_i ← {R_1, R_2, ..., R_i};
step eight: returning the reduction result red of the breast cancer condition attributes and ending.
5. The rapid attribute reduction method for incomplete data sets according to claim 4, wherein step six specifically comprises:
first sub-step, calculating the positive region in the positive approximation set
Figure FDA00031370701200000211
second sub-step, calculating
Figure FDA00031370701200000212
third sub-step, setting i ← i + 1;
fourth sub-step, setting red ← red ∪ {a_0}, wherein
Figure FDA00031370701200000213
a_k ∈ C − red;
fifth sub-step, judging whether
Figure FDA00031370701200000214
and
Figure FDA00031370701200000215
both hold:
if both hold, returning to the first sub-step to continue the loop; otherwise, exiting the loop of step six and proceeding to step seven.
6. The rapid attribute reduction method for incomplete data sets according to claim 4, wherein in step one, the complexity of the fast allowable-class acquisition algorithm for the incomplete decision table is
Figure FDA0003137070120000031
wherein
Figure FDA0003137070120000032
denotes the number of objects whose value under the breast cancer condition attribute a_k is missing, and
Figure FDA0003137070120000033
denotes the number of objects whose value under the breast cancer condition attribute a_k is not missing, i.e., whose attribute value is not the missing-value marker.
CN202110722842.8A 2018-12-21 2018-12-21 Rapid attribute reduction method for incomplete data set Pending CN113345588A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110722842.8A CN113345588A (en) 2018-12-21 2018-12-21 Rapid attribute reduction method for incomplete data set

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811574927.0A CN109828996A (en) 2018-12-21 2018-12-21 A kind of Incomplete data set rapid attribute reduction
CN202110722842.8A CN113345588A (en) 2018-12-21 2018-12-21 Rapid attribute reduction method for incomplete data set

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201811574927.0A Division CN109828996A (en) 2018-12-21 2018-12-21 A kind of Incomplete data set rapid attribute reduction

Publications (1)

Publication Number Publication Date
CN113345588A true CN113345588A (en) 2021-09-03

Family

ID=66859919

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201811574927.0A Pending CN109828996A (en) 2018-12-21 2018-12-21 A kind of Incomplete data set rapid attribute reduction
CN202110722842.8A Pending CN113345588A (en) 2018-12-21 2018-12-21 Rapid attribute reduction method for incomplete data set

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN201811574927.0A Pending CN109828996A (en) 2018-12-21 2018-12-21 A kind of Incomplete data set rapid attribute reduction

Country Status (1)

Country Link
CN (2) CN109828996A (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113221674B (en) * 2021-04-25 2023-01-24 广东电网有限责任公司东莞供电局 Video stream key frame extraction system and method based on rough set reduction and SIFT
CN115392582B (en) * 2022-09-01 2023-11-14 广东工业大学 Crop yield prediction method based on increment fuzzy rough set attribute reduction

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101763529A (en) * 2010-01-14 2010-06-30 中山大学 Rough set attribute reduction method based on genetic algorithm
CN103336791A (en) * 2013-06-06 2013-10-02 湖州师范学院 Hadoop-based fast rough set attribute reduction method
CN103336790A (en) * 2013-06-06 2013-10-02 湖州师范学院 Hadoop-based fast neighborhood rough set attribute reduction method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YUHUA QIAN et al.: "An efficient accelerator for attribute reduction from incomplete data in rough set framework", PATTERN RECOGNITION, pages 2 - 5 *

Also Published As

Publication number Publication date
CN109828996A (en) 2019-05-31

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination