CN110222023B

CN110222023B - Multi-objective parallel attribute reduction method based on Spark and ant colony optimization

Info

Publication number: CN110222023B
Application number: CN201910492176.6A
Authority: CN
Inventors: 危前进; 魏继鹏
Original assignee: Guilin University of Electronic Technology
Current assignee: Guilin University of Electronic Technology
Priority date: 2019-06-06
Filing date: 2019-06-06
Publication date: 2022-09-16
Anticipated expiration: 2039-06-06
Also published as: CN110222023A

Abstract

The invention discloses a multi-target parallel attribute reduction method based on Spark and ant colony optimization, which introduces the idea of combining a cloud computing Spark parallel technology with an intelligent ant colony algorithm into rough set theoretical attribute reduction, and on the basis, utilizes an information gain rate as heuristic information to carry out an innovative strategy of redundancy detection on selected attributes and each generation of optimal solution, so that the algorithm can be rapidly converged to the global optimal solution, the possibility of adding redundant attributes into a reduction set can be effectively avoided, and the redundancy caused by random selection of initial attributes is eliminated. In addition, a multi-target parallel solving strategy is adopted in the calculation of the heuristic information, the heuristic information of a plurality of attributes relative to the current attribute can be solved simultaneously, and the time complexity is O (| n) ² I) is reduced to O (| n |).

Description

Multi-objective parallel attribute reduction method based on Spark and ant colony optimization

Technical Field

The invention relates to the technical field of big data cloud, in particular to a multi-objective parallel attribute reduction method based on Spark and ant colony optimization.

Background

The attribute reduction is one of important research contents of rough set theory and is a key step of knowledge acquisition. The attribute reduction means that unnecessary knowledge in the information system is deleted under the condition that the classification capability of the information system is kept unchanged. By deleting the redundant attribute information, the potential definition of the information system can be improved, high-quality cleaning data can be obtained, and effective information with theoretical analysis and application values can be mined.

The development of information technology and the continuous increase of data size make the traditional data mining method, including attribute reduction solving algorithm, face the challenges of data size and computing power. General attribute reduction cannot handle large data because the data set is too large to fit into computer memory at one time and the computational space of a single computer node is limited. Therefore, the minimum attribute reduction solution under big data is particularly necessary, and is an important research target in the field today.

Reduction is an NP-hard problem due to finding the minimum attribute. Traditional solution methods such as the blind add-drop method or the greedy strategy-based heuristic method cannot effectively solve the minimum attribute reduction. The intelligent algorithm has strong global optimization capability and can be used for solving the problem of combinatorial optimization in attribute reduction in a rough set theory, but the existing attribute reduction method based on the intelligent algorithm faces three challenges: 1. the convergence rate is slow, and usually iteration is needed for many times to converge to obtain a stable solution; 2. the optimal solution cannot be obtained, namely the minimum attribute reduction is often converged to a local optimal solution, which is a phenomenon commonly existing in an intelligent algorithm; 3. no matter the attribute reduction research of the intelligent algorithm is combined or the method is based on the traditional heuristic method or the add-delete method, when the scale of the data set is increased, the effect of the existing algorithm in the process of processing the large data attribute reduction is not ideal.

Disclosure of Invention

Aiming at the problem that the traditional algorithm can not effectively process big data attribute reduction and the problem of combined optimization in solving the attribute reduction, the invention provides a multi-target parallel attribute reduction method based on Spark and ant colony optimization, which is used for solving the attribute reduction problem of big data, effectively solving the minimum attribute reduction while processing the big data, and simultaneously calculating the time complexity in calculating the attribute importance from O (n) ² ) Down to O (| n |).

In order to solve the problems, the invention is realized by the following technical scheme:

the multi-objective parallel attribute reduction method based on Spark and ant colony optimization specifically comprises the following steps:

step 1, uploading a decision information system to a distributed file system (HDFS), wherein the HDFS automatically divides data in the decision information system into a plurality of data fragments which are distributed to each computer node, and each data fragment is independent and not overlapped; one of the computer nodes is a master node, and the rest are slave nodes;

step 2, each computer node comprises a main node and a slave node, the attribute extraction is carried out on the data fragments obtained by the main node and the slave nodes, and the data fragments are stored in a key value pair (key, value) mode; at this time, the key value in the key value pair < key, value > is the value of the condition attribute i and the value thereof, the value of the condition attribute j and the value thereof, and the value of the decision attribute; the value in the resulting key-value pair < key, value > is 1;

step 3, each slave node uploads the stored key value pairs < key, value > to the master node, the master node performs equivalent summation operation on all the key value pairs < key, value > stored by the master node and all the key value pairs < key, value > uploaded by each slave node, namely, the key value pairs < key, value > perform numerical value accumulation on the value values corresponding to the same key value; at this time, the key value in the resulting key-value pair < key, value > remains unchanged; the value in the key-value pair < key, value > is the value resulting from the summation;

step 4, the main node redefines the key value pairs < key, value > obtained in the step 3, namely, the key values in the key value pairs < key, value > contain decision attribute values, and the decision attribute values are removed; at this time, the key value in the key value pair < key, value > is the condition attribute i and its value, and the condition attribute j and its value; the value in the resulting key-value pair < key, value > remains unchanged;

step 5, the main node performs equivalent merging operation on the key value pairs < key, value > obtained in the step 4 again, namely merging the value values corresponding to the same key value in the key value pairs < key, value > in an array form; at this point, the key value in the resulting key-value pair < key, value > remains unchanged; the value in the key value pair < key, value > is an array obtained by combination;

step 6, the master node broadcasts the key value pair < key, value > obtained in the step 5 to each slave node; each computer node comprises a main node and a slave node, and single-attribute local conditional entropy of each attribute relative to a decision attribute and attribute-pair local conditional entropy of each relative decision attribute between every two (2) attributes are calculated according to the obtained key value pair < key, value >;

step 7, the slave node uploads all the calculated single-attribute local conditional entropies and the attribute pair local conditional entropies to the master node at the same time; the main node adds all the single-attribute local conditional entropies of all the attributes to obtain a single-attribute global conditional entropy; meanwhile, the main node adds the local conditional entropies of the attribute pairs to obtain the global conditional entropies of the attribute pairs; finally, the main node calculates the global conditional entropy according to the single-attribute global conditional entropy and the attribute pair global conditional entropy to obtain heuristic information among every 2 attributes;

step 8, the master node reduces the conditional attribute set in the decision information system by using an ant colony algorithm based on the heuristic information between every 2 attributes calculated in the step 7;

the above i, j belongs to C, C represents a condition attribute set, and j ≠ i.

In the step 3, the key values are the same, namely the condition attribute i, the value of the condition attribute j and the value of the decision attribute are all the same; in step 5, the key values are identical, which means that the condition attribute i, the value of the condition attribute j, and the value of the condition attribute j are all identical.

The above heuristic information η _ij Comprises the following steps:

where H (D | i) represents the conditional entropy of the conditional attribute i with respect to the decision attribute D, H (D | j) represents the conditional entropy of the conditional attribute j with respect to the decision attribute D, and H (D | i ≧ j) represents the conditional entropy of the union of the conditional attribute i and the conditional attribute j with respect to the decision attribute D.

The concrete process of reducing the condition attribute set in the decision information system by using the ant colony algorithm is as follows:

step 1) let initial global reduction set R _min C, the initial iteration parameter t is 0; given an initial pheromone concentration τ _ij (0) Giving the total iteration times maxGeneration, wherein i is more than or equal to 1 and less than or equal to | C |, and j is more than or equal to 1 and less than or equal to | C |; wherein C represents a condition attribute set, and | C | represents the number of attributes in the condition attribute set;

step 2) adding 1 to the iteration parameter t to represent the t-th iteration;

step 3), the initial Ant k is 1, which represents the kth Ant, and Ant ants in each generation are solved by the minimum reduction set;

step 4) for the current ant k, randomly selecting 1 condition attribute from the condition attribute set C as the currently selected condition attribute, and putting the currently selected condition attribute into the local reduction set R of the current ant k _k Performing the following steps;

step 5) utilizing the current pheromone concentration tau _ij (t) and heuristic information η _ij Calculating the probability between the currently selected condition attribute of the current ant k and other unselected condition attributes in the condition attribute set C, and selecting other condition attributes corresponding to the maximum probability as the condition attributes to be selected of the current ant k;

step 6) judging the local reduction set R of the current ant k _k After adding the condition attribute to be selected, whether the mutual information with the decision attribute D changes or not is judged:

if the mutual information value changes, the condition attribute to be selected is taken as the currently selected condition attribute and is put into the local reduction set R of the current ant k _k And go to step 7);

if the mutual information value is not changed, the condition attribute redundancy to be selected is represented, and the step 7) is directly carried out;

step 7) judging a reduction set R of the current ant k _k Whether the number of contained attributes is more than or equal to the current global reduction set R _min Number of attributes contained:

if so, the current global reduction set R _min The change is not changed;

otherwise, further judging the current reduction set R of the current ant k _k Whether the mutual information with the decision attribute D is equal to the mutual information of the condition attribute set C and the decision attribute D:

if so, then the current reduction set R _min To update R _k ；

Otherwise, the current global reduction set R _min The change is not changed;

step 8) adding 1 to Ant k, and judging whether k is equal to the total number Ant: when k equals Ant, go directly to step 9); otherwise, returning to the step 4);

step 9) for the current global reduction set R _min Performing redundancy check on each attributeMeasuring, i.e. determining, the global reduction set R _min After deleting the attribute, whether the mutual information with the decision attribute changes or not, when the mutual information value does not change, the attribute redundancy is indicated, and the attribute is reduced from a global reduction set R _min Deleting; otherwise, indicating that the attribute is not redundant, and continuously keeping the attribute in the global reduction set R _min The preparation method comprises the following steps of (1) performing;

step 10) calculating the pheromone concentration tau of the next generation _ij (t+1)；

Step 11) judging whether t is equal to the total iteration times maxgernation: when t is maxGenerntion, outputting a current global reduction set; otherwise, returning to the step 3).

In the above step 5), at the t generation, the probability between the currently selected conditional attribute i of the ant k and the other conditional attributes j in the conditional attribute set C

Comprises the following steps:

wherein, tau _ij (t) is the pheromone concentration between the t-th generation conditional attribute i and the conditional attribute j, η _ij Allowed as heuristic information between conditional attributes i and j _k The condition attribute set C is a set formed by unselected condition attributes in the condition attribute set C, alpha is the importance of given pheromone concentration, alpha is more than or equal to 0 and less than or equal to 1, beta is the importance of given heuristic information, and beta is more than or equal to 0 and less than or equal to 1.

In the above step 10), the pheromone concentration τ of the t +1 th generation _ij (t +1) is:

τ _ij (t+1)＝ρτ _ij (t)+Δτ _ij (t)

where ρ is the volatilization rate of a given pheromone concentration, τ _ij (t) is the pheromone concentration between the t-th generation conditional attribute i and the conditional attribute j, Δ τ _ij (t) is the increase in pheromone concentration between the t-th generation conditional attribute i and the conditional attribute j.

The t-th generation condition attribute i and the condition attribute jDelta tau of pheromone concentration between _ij (t) is:

wherein, | R _min And | is the number of attributes in the global reduction set, and q is a given constant.

Compared with the prior art, the method introduces the idea of combining the cloud computing Spark parallel technology and the intelligent ant colony algorithm into the rough set theoretical attribute reduction, and on the basis, provides an innovative strategy of performing redundancy detection on the selected attributes and each generation of optimal solution by taking an improved information gain rate as heuristic information, so that the algorithm can be rapidly converged to the overall optimal solution, the possibility of adding the redundant attributes into the reduction set can be effectively avoided, and the redundancy caused by random selection of the initial attributes is eliminated. Most importantly, when the heuristic information is calculated, a multi-target parallel solving strategy is adopted, the heuristic information of a plurality of attributes relative to the current attribute can be solved simultaneously, and the time complexity is O (| n) ² I) is reduced to O (| n |). The invention fully utilizes the advantages of Spark parallel technology and intelligent ant colony algorithm for complementation, can effectively obtain the minimum attribute reduction while processing mass data, enriches the method and the application range for solving the attribute reduction by a rough set theory, and researches, innovates and solves the attribute reduction problem and the solving efficiency under large-scale data.

Drawings

FIG. 1 is a flow chart of a multi-objective parallel attribute reduction method based on Spark and ant colony optimization.

Fig. 2 is a flowchart illustrating the ant colony algorithm in fig. 1 reducing a conditional attribute set in the decision information system.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings in conjunction with specific examples.

The method is used for solving the minimum attribute reduction under big data based on the ant colony optimization algorithm and Spark parallel processing technology. The method comprises the steps of solving a minimum attribute reduction set by utilizing good global optimization capability of an ant colony optimization algorithm, using parallelism of 'equivalence class' calculation in a rough set theory for parallel calculation under big data, taking improved information gain rate as heuristic information, and providing a new multi-target parallel solving strategy on the basis of Spark distributed parallel calculation in the process of calculating the heuristic information, wherein the importance of a plurality of attributes relative to the current attribute can be calculated at the same time, so that the algorithm solving efficiency is greatly improved, and the time complexity is reduced.

Referring to fig. 1, a multi-objective parallel attribute reduction method based on Spark and ant colony optimization specifically includes the following steps:

step 1, building a Hadoop distributed platform in a computer cluster, deploying a Spark operating environment, and uploading a decision information system S ═ (U, C ^ D, V, f) to a distributed file system HDFS.

The HDFS can automatically segment data in the decision information system into a plurality of data fragments which are distributed to all computer nodes, and all the data fragments are independent and do not overlap. One of the computer nodes is a master node, and the rest of the computer nodes are slave nodes, and the master node is responsible for data distribution, scheduling communication and the like of the rest of the slave nodes; wherein U represents a sample set, C represents a condition attribute set, D represents a decision attribute, V is a value range of the attribute, and f is an information function.

And (3) building a Hadoop Distributed platform, realizing a Distributed File System (HDFS), and deploying a Spark operating environment. The HDFS is a specific node structure based on a master-slave structure system, the nodes comprise a NameNode and a plurality of DataNodes, the NameNode is responsible for managing a file system name space and controlling the access of an external client, and the DataNodes are responsible for storing and accessing data.

Given a decision information system S ═ (U, C ═ D, V, f), the decision information system, i.e., a data set, is added to the HDFS, which will automatically segment the data in the decision information system into a plurality of data segments for storage in a computer cluster.

In the present embodiment, the decision information system is given as shown in table 1:

TABLE 1 given information System S

Suppose i ═ C ₁ Is the attribute of the ant currently located, j ═ C ₂ ，C ₃ Represents the selected attribute, and the condition attribute C needs to be calculated according to the formula (3) respectively ₁ And C ₂ ，C ₃ Conditional entropy of D with decision attribute, i.e. H (D | C) ₁ ∪C ₂ )，H(D|C ₁ ∪C ₃ )。

Distributing data to two computing nodes, each including the first three x ₁ -x ₃ And the last three x ₄ -x ₆ Data, as shown in table 2:

TABLE 2 data sharding distributed across different computer nodes

The HDFS can decompose large-scale data into a plurality of data fragments, the data fragments are independent and do not overlap, and the uploaded data fragments are obtained from the HDFS by calling a sparkContext () method through Spark.

Step 2, the distributed file system HDFS adopts a distributed method (the main node and the slave node are mutually matched) to calculate heuristic information eta _ij 。

Aiming at the problem that when the conditional entropy of the condition attribute relative to the decision attribute is calculated by the traditional method, the condition attribute i and the other condition attributes j, j is not equal to i, i belongs to C, and j belongs to C are required to be calculated in sequence, the invention obtains the uploaded data fragments from the distributed file system, and designs a proper data fragment according to the Spark parallel principle under a Spark distributed platform<key，value>Key value pairs in a form to simultaneously obtain conditional entropies of the condition attribute i and the rest condition attributes j relative to the decision attribute D, and further calculate heuristic information eta in parallel _ij 。

Spark abstracts data Distributed on each node of the cluster into one data fragment through an elastic Distributed data set (RDD), and performs a series of parallel operations. The RDD is created by reading a file (data) of the HDFS, and may be converted from one RDD. The invention provides a novel multi-target parallel solving method according to the principle of parallel computing, which can simultaneously calculate the conditional entropy of other selected attributes and the current attribute relative to decision data through the execution combination of a plurality of operators of RDD in Spark, thereby obtaining heuristic information.

Step 2.1, each computer node comprises a main node and a slave node, the attribute extraction is carried out on the data fragments obtained by the nodes, and the data fragments are subjected to key value pair<key，value>Storing the form of (1); at this point, the resulting key-value pair<key，value>The key value in the key is a condition attribute i and a value thereof, a condition attribute j and a value thereof, a decision attribute value, and the condition attribute i is C ₁ Representing the currently selected attribute when computing heuristic information, j ═ { C ₂ ,C ₃ Representing other unselected condition attributes; the resulting key-value pair<key，value>The value in (1).

In this embodiment, the specific form is as follows:

step 2.2, each slave node uploads the key value pairs < key, value > stored after the definition to the master node, the master node performs equivalent summation operation on all the key value pairs < key, value > stored after the definition of the master node and all the key value pairs < key, value > uploaded by each slave node, namely, the key value pairs < key, value > perform numerical value accumulation summation on the value values corresponding to the same key value; at this time, the key value in the key value pair < key, value > remains unchanged, and is still the condition attribute i and its value, the condition attribute j and its value, and the decision attribute and its value; the value in the resulting key-value pair < key, value > is the value resulting from the summation.

In this implementation, the results are as follows:

step 2.3, the main node redefines the key value pairs < key, value > obtained in the step 2.2, namely, the decision attribute values contained in the key values in the key value pairs < key, value > are removed; at this time, the key value in the key value pair < key, value > is the condition attribute i and its value, and the condition attribute j and its value; the value in the resulting key-value pair < key, value > remains unchanged.

In this implementation, the results are as follows:

step 2.4, the key value pair obtained in the step 2.3 is paired by the main node<key，value>Performing equivalence class merging operation, i.e. combining the key-value pairs<key，value>The value values corresponding to the same key value in the key values are merged in an array form; at this point, the resulting key-value pair<key，value>The key value in (1) remains unchanged; the resulting key-value pair<key，value>The value in (1) is an array obtained by merging and returns to the form of<key，[value ₁ ，value ₂ ，…，value _n ]>。

In this implementation, the results are as follows:

step 2.5, the master node broadcasts the key value pair < key, value > obtained in the step 2.4 to each slave node; each computer node comprises a main node and a slave node, and according to the obtained key value pair < key, value >, the single-attribute local conditional entropy of each attribute relative to the decision attribute in the data fragment obtained from the computer node and the attribute-to-local conditional entropy of the relative decision attribute between 2 attributes are respectively calculated.

And (3) calculating the conditional entropy of all selected condition attributes j and the current condition attribute i for the decision attribute according to the information entropy of the formula (1) and the conditional entropy of the formula (2). Respectively taking all selected condition attributes j as keys, taking the calculated condition entropy as value, and defining and storing by using mapToPair;

wherein, h (D) represents the information entropy of the decision attribute D, defined as:

h (D | C) represents the conditional entropy of the conditional attribute C relative to the decision attribute D, and is defined as;

wherein, p (X) _i )＝|X _i |/|U|,p(Y _j |X _i )＝p|X _i ∩Y _j |/|X _i L, n and m represent the value ranges of the condition attribute C and the decision attribute D, | X _i I is X _i The group (2) of (a).

In the present embodiment, the condition attribute C is calculated from the formulas (1) and (2) ₂ ，C ₃ And condition attribute C ₁ Conditional entropy with respect to decision attributes. Respectively with conditional attribute C ₂ ，C ₃ As a key, the resulting conditional entropy is calculated as a value. The maptapair was used for definition and storage, and the results were as follows:

step 2.6, the slave node uploads all the calculated single-attribute local conditional entropies and attribute combined local conditional entropies to the master node at the same time; the main node adds all the single-attribute local conditional entropies of all the attributes to obtain a single-attribute global conditional entropy; meanwhile, the main node adds the local conditional entropies of the attribute combinations to obtain an attribute combination global conditional entropy; finally, the master node is global according to the single attributeCalculating the conditional entropy and the attribute-to-global conditional entropy to obtain heuristic information eta between each 2 attributes _ij 。

Heuristic information η _ij The method takes the improved information gain rate as the heuristic information, not only considers the mutual information increment between the added selected attribute of the current attribute and the decision attribute, but also considers the conditional entropy contained by the selected attribute, and the heuristic information calculation is finally equivalent to the conditional entropy calculation. Invention heuristic information eta _ij The calculation formula is as follows:

where H (D | i) represents the conditional entropy of the condition attribute i with respect to the decision attribute D, H (D | j) represents the conditional entropy of the condition attribute j with respect to the decision attribute D, and H (D | i ≦ j) represents the union of the condition attribute i and the condition attribute j with respect to the conditional entropy representing the decision attribute D.

In this implementation, the results are as follows:

the conditional entropy of all selected attributes relative to the current number attribute is thus calculated: h (D | C) ₁ ∪C ₂ )＝0.24，H(D|C ₁ ∪C ₃ ) 0.14. And further calculates heuristic information using equation (3).

The improved information gain rate is used as heuristic information, and an innovative strategy for carrying out redundancy detection on the selected attribute and each generation of optimal solution enables the algorithm to be rapidly converged to the global optimal solution, so that the possibility of adding the redundant attribute to the reduction set can be effectively avoided, and the redundancy caused by random selection of the initial attribute is eliminated.

Step 3, the master node calculates heuristic information eta between every 2 attributes based on the step 2 _ij The ant colony algorithm is used to perform attribute reduction on the condition attributes in the decision information system, see fig. 2.

Pheromone concentration and heuristic information influence the walking path of ants, namely a reduction set of solution. Each ant leaves a certain pheromone on a path in the walking process, namely the solution process, and the pheromone gradually volatilizes and reduces in concentration along with the lapse of time. Therefore, the pheromone concentration of the places where the ants walk through the paths is large, and the ants tend to advance towards the places where the pheromone concentration is large during path selection, so that a positive feedback effect is formed, and the optimal path among the attributes, namely the minimum attribute reduction set is achieved. By performing redundancy detection on each selected attribute and each generation of optimal solution, the minimum reduction set can be converged quickly. And updating the pheromone concentration value on the path of the optimal solution of each generation, and judging whether a termination condition is reached through multi-generation evolution to finally obtain a stable global optimal solution.

Step 3.1, initializing various parameters: let the initial global reduction set, i.e. the minimum reduction set R _min Let the length L of the initial global reduction set, i.e. the minimum reduction set _min Making initial iteration parameter t equal to 0; the total iteration times maxGeneration is given in advance, the total Ant number Ant of each generation is given in advance, the initial pheromone concentration is given in advance, the initial 0 th generation two attributes i and the inter-j pheromone concentration tau are given in advance _ij (0) The initial value is generally set according to experience, for example, 0.5, i and j represent the number of the condition attribute, and the condition attribute i, j belongs to C; i is more than or equal to 1, and | C is more than or equal to j; | C | represents the total number of condition attributes in the condition attribute set.

Step 3.2, performing algebraic iteration with an iteration parameter t + +, wherein each generation has Ant ants, and each Ant performs solving of a minimum reduction set;

and 3.3, the current ant is solved, each generation of initial k is 1, and the initial reduction set solved by each ant is R _k ＝{a _k }，R _k Represents a reduced set, a, of ant k _k Representing that one attribute is randomly selected from the conditional attribute set C as an initial attribute, namely the attribute where the ant is currently located, L _k 1 indicates the number of attributes included in the initial reduced set.

Step 3.4. according to pheromone concentration tau _ij And heuristic information η _ij Formed probability formula

Selecting the next selected attribute, the probability formula is defined as follows:

representing the probability of the kth ant in the current attribute i when selecting the next selected attribute j, wherein tau _ij (t) the intensity of pheromone concentration from t generation attribute i to attribute j, τ _il (t) represents the concentration of pheromones from the attribute i to the attribute l of the tth generation, and j belongs to allowed _k ＝{C-R _k }，allowed _k For the unselected condition attribute set, alpha is more than or equal to 0 and less than or equal to 1 and beta is more than or equal to 0 and less than or equal to 1 represents the concentration of pheromone on the path and the importance of heuristic information, selecting the attribute j with the maximum next selected probability according to a probability formula, and making b _k J, go to the next step;

step 3.5, for the kth ant, reducing the set R by calculating _k Adding selected attributes b _k Whether the mutual information of the post and decision attributes D has changed, i.e. I (R) _k (ii) a D) Whether or not it is equal to I (R) _k ∪b _k (ii) a D) And the redundancy of the selected attribute is judged, so that the redundant attribute can be effectively prevented from being added into the reduction set as the selected attribute. If the mutual information value is not changed, the selected attribute redundancy is indicated, the next step is directly carried out, otherwise b _k Added to the reduced set R _k Middle and at the same time L _k ++；

Step 3.6, when the following conditions are met, stopping searching by the current ant k, entering the next step, otherwise, returning to the step 3.4, and selecting the next selected attribute;

(1) reduced set of ants k R _k The number of contained attributes is greater than the current global reduction set R _min The number of attributes contained. The first condition for stopping searching indicates that the number of attributes contained in the current reduction set is alreadyWhen the number of the attributes contained in the global reduction set is larger than that of the attributes contained in the global reduction set, the search of the next attribute is not needed to be carried out any more, and the result is not optimal;

(2) current reduction set R of ant k _k The mutual information of decision attribute D is equal to the mutual information of conditional attribute set C and decision attribute D, I (C; D) is I (R) _k (ii) a D) Wherein R is _k A local solution, i.e., a reduction set, constructed for the kth ant. The second condition for stopping searching represents the mutual information of the current reduction set and the decision attribute, and is equivalent to the mutual information of all the condition attributes and the decision attribute. At this time, the minimum reduction set R is required _min Updating, i.e. updating the minimal reduction set R _min ＝R _k And L _min ＝L _k ；

Step 3.7.k + +, the next Ant continues to search, when k equals to the total number Ant, it indicates that all ants have searched, then step 3.8 is entered, otherwise, step 3.3 is returned;

step 3.8, solving the current generation t to obtain a minimum attribute reduction set R _min Wherein each attribute a ∈ R _min Performing redundancy detection to avoid redundancy attribute in the reduction set caused by initial random selection, and judging the reduction set R _min If the mutual information with the decision attribute is not changed after the attribute a is deleted, the redundant attribute still exists in the current reduction set, and R is deleted _min ＝R _min -a，L _min -, updating R after all attributes have been checked _min And L _min Entering step 3.9;

step 3.9 after all ants in the tth generation complete the search, the pheromone concentration between attributes is updated according to the following formula:

τ _ij (t+1)＝ρτ _ij (t)+Δτ _ij (t) (5)

the parameter rho (0. ltoreq. rho. ltoreq.1) is a constant and represents the volatility of the pheromone concentration, Δ τ _ij (t) represents the increment of pheromone concentration from the attribute i to the attribute j of the tth generation, and the calculation formula is as follows:

where q is a given constant parameter, | r (t) | denotes the base of the t-th generation minimum attribute reduction set;

step 3.10, judging whether the iteration time t is equal to the maximum iteration time maxGeneration, if so, entering the next step, otherwise, returning to the step 3.2, and performing the next round of iterative calculation;

step 3.11, outputting the minimum reduction set R _min And its length L _min 。

The satisfaction condition I (C; D) of the reduction set is calculated, representing the mutual information of all the conditional attributes relative to the decision attribute, and is defined as follows:

I(C；D)＝H(D)-H(D|C) (7)

where H (D) represents the information entropy of the decision attribute D, and H (D | C) represents the conditional entropy of the condition attribute C with respect to the decision attribute D.

The method is based on the rough set theory, relies on an intelligent algorithm to well process the combined optimization force and a Spark parallel distributed processing technology, solves the problem that the traditional attribute reduction algorithm cannot process mass data and has an unsatisfactory effect of solving the minimum reduction attribute, and enriches the application range of rough set theory knowledge reduction.

It should be noted that, although the above-mentioned embodiments of the present invention are illustrative, the present invention is not limited thereto, and thus the present invention is not limited to the above-mentioned embodiments. Other embodiments, which can be devised by those skilled in the art in light of the teachings of the present invention, are considered to be within the scope of the present invention without departing from its principles.

Claims

1. The multi-objective parallel attribute reduction method based on Spark and ant colony optimization is characterized by comprising the following steps:

step 4, the main node redefines the key value pairs < key, value > obtained in the step 3, namely, the key values in the key value pairs < key, value > contain decision attribute values which are removed; at this time, the key value in the key value pair < key, value > is the condition attribute i and its value, and the condition attribute j and its value; the value in the resulting key-value pair < key, value > remains unchanged;

step 5, the main node performs equivalent merging operation on the key value pairs < key, value > obtained in the step 4 again, namely, the value values corresponding to the same key value in the key value pairs < key, value > are merged in an array form; at this point, the key value in the resulting key-value pair < key, value > remains unchanged; the value in the key value pair < key, value > is an array obtained by combination;

2. The multi-objective parallel attribute reduction method based on Spark and ant colony optimization according to claim 1, wherein,

in step 3, the key values are the same, namely the condition attribute i, the value of the condition attribute j and the value of the decision attribute are all the same;

in step 5, the key values are identical, which means that the condition attribute i, the value of the condition attribute j, and the value of the condition attribute j are all identical.

3. The multi-objective parallel attribute reduction method based on Spark and ant colony optimization as claimed in claim 1, wherein the heuristic information η _ij Comprises the following steps:

4. The multi-objective parallel attribute reduction method based on Spark and ant colony optimization as claimed in claim 1, wherein the specific process of reducing the conditional attribute set in the decision information system by using the ant colony algorithm is as follows:

step 1) let initial global reduction set R _min C, the initial iteration parameter t is 0; given an initial pheromone concentration τ _ij (0) Giving total iteration times maxGeneration, wherein i is more than or equal to 1 and less than or equal to | C |, and j is more than or equal to 1 and less than or equal to | C |; wherein C represents a condition attribute set, and | C | represents the number of attributes in the condition attribute set;

step 2) adding 1 to the iteration parameter t to represent the t-th iteration;

if so, the current global reduction set R _min The change is not changed;

if so, then the current reduction set R _min To update R _k ；

Otherwise, the current global reduction set R _min The change is not changed;

step 9) for the current global reduction set R _min Each attribute of the global reduction set R is subjected to redundancy detection, namely, the global reduction set R is judged _min After deleting the attribute, whether the mutual information with the decision attribute changes or not, when the mutual information value does not change, the attribute redundancy is indicated, and the attribute is reduced from a global reduction set R _min Deleting; otherwise, indicating that the attribute is not redundant, and continuously keeping the attribute in the global reduction set R _min Performing the following steps;

5. The multi-objective parallel attribute reduction method based on Spark and ant colony optimization as claimed in claim 4, wherein in step 5), in the t-th generation, the probability between the currently selected conditional attribute i of ant k and other conditional attributes j in the conditional attribute set C

Comprises the following steps:

6. The multi-objective parallel attribute reduction method based on Spark and ant colony optimization as claimed in claim 4, wherein in step 10), the pheromone concentration τ of t +1 th generation _ij (t +1) is:

τ _ij (t+1)＝ρτ _ij (t)+Δτ _ij (t)

7. The method for multi-objective parallel attribute reduction based on Spark and ant colony optimization as claimed in claim 6, wherein the pheromone concentration increment between the t-th generation condition attribute i and the condition attribute j is Δ τ _ij (t) is: