CN110222023B - Multi-objective parallel attribute reduction method based on Spark and ant colony optimization - Google Patents

Multi-objective parallel attribute reduction method based on Spark and ant colony optimization

Info

Publication number
CN110222023B
CN110222023B (application number CN201910492176.6A)
Authority
CN
China
Prior art keywords
attribute
value
key
condition
conditional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910492176.6A
Other languages
Chinese (zh)
Other versions
CN110222023A (en
Inventor
危前进
魏继鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Electronic Technology filed Critical Guilin University of Electronic Technology
Priority to CN201910492176.6A priority Critical patent/CN110222023B/en
Publication of CN110222023A publication Critical patent/CN110222023A/en
Application granted granted Critical
Publication of CN110222023B publication Critical patent/CN110222023B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/17: Details of further file system functions
    • G06F16/174: Redundancy elimination performed by the file system
    • G06F16/18: File system types
    • G06F16/182: Distributed file systems
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/004: Artificial life, i.e. computing arrangements simulating life
    • G06N3/006: Artificial life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a multi-objective parallel attribute reduction method based on Spark and ant colony optimization, which introduces the idea of combining the cloud-computing Spark parallel technology with the intelligent ant colony algorithm into rough set attribute reduction. On this basis, it uses the information gain rate as heuristic information and applies an innovative strategy of redundancy detection to both the selected attributes and each generation's optimal solution, so that the algorithm converges rapidly to the global optimal solution, the possibility of adding redundant attributes to the reduction set is effectively avoided, and the redundancy caused by random selection of the initial attribute is eliminated. In addition, a multi-objective parallel solving strategy is adopted in the calculation of the heuristic information: the heuristic information of multiple attributes relative to the current attribute can be solved simultaneously, and the time complexity is reduced from O(|n|²) to O(|n|).

Description

Multi-objective parallel attribute reduction method based on Spark and ant colony optimization
Technical Field
The invention relates to the technical field of cloud computing for big data, and in particular to a multi-objective parallel attribute reduction method based on Spark and ant colony optimization.
Background
Attribute reduction is one of the important research topics of rough set theory and a key step of knowledge acquisition. Attribute reduction means deleting unnecessary knowledge from an information system while keeping the classification capability of the information system unchanged. By deleting redundant attribute information, the potential value of the information system can be improved, high-quality cleaned data can be obtained, and effective information with theoretical-analysis and application value can be mined.
The development of information technology and the continuous growth of data scale confront traditional data mining methods, including attribute reduction algorithms, with challenges to data capacity and computing power. General attribute reduction methods cannot handle big data, because the data set is too large to fit into computer memory at one time and the computing capacity of a single computer node is limited. Therefore, solving the minimum attribute reduction under big data is particularly necessary, and it is an important research target in the field today.
Finding the minimum attribute reduction is an NP-hard problem. Traditional solution methods, such as the blind add-delete method or heuristic methods based on a greedy strategy, cannot effectively solve the minimum attribute reduction. Intelligent algorithms have strong global optimization capability and can be used to solve the combinatorial optimization problem of attribute reduction in rough set theory, but existing attribute reduction methods based on intelligent algorithms face three challenges: 1. convergence is slow, and many iterations are usually needed to converge to a stable solution; 2. the optimal solution cannot be obtained, i.e., the search for the minimum attribute reduction often converges to a local optimum, a phenomenon common to intelligent algorithms; 3. whether the attribute reduction research combines an intelligent algorithm or is based on the traditional heuristic or add-delete method, the existing algorithms perform poorly at big-data attribute reduction as the data set grows.
Disclosure of Invention
Aiming at the problems that traditional algorithms cannot effectively handle big-data attribute reduction and that solving the attribute reduction is a combinatorial optimization problem, the invention provides a multi-objective parallel attribute reduction method based on Spark and ant colony optimization, which is used to solve the attribute reduction problem of big data, effectively solves the minimum attribute reduction while processing big data, and at the same time reduces the time complexity of calculating attribute importance from O(|n|²) down to O(|n|).
In order to solve the problems, the invention is realized by the following technical scheme:
the multi-objective parallel attribute reduction method based on Spark and ant colony optimization specifically comprises the following steps:
step 1, uploading a decision information system to a distributed file system (HDFS), wherein the HDFS automatically divides data in the decision information system into a plurality of data fragments which are distributed to each computer node, and each data fragment is independent and not overlapped; one of the computer nodes is a master node, and the rest are slave nodes;
step 2, each computer node, including the master node and the slave nodes, performs attribute extraction on the data fragments it has obtained and stores them in key-value pair <key, value> form; at this time, the key value in the key-value pair <key, value> is the condition attribute i and its value, the condition attribute j and its value, and the decision attribute value; the value in the resulting key-value pair <key, value> is 1;
step 3, each slave node uploads the stored key value pairs < key, value > to the master node, the master node performs equivalent summation operation on all the key value pairs < key, value > stored by the master node and all the key value pairs < key, value > uploaded by each slave node, namely, the key value pairs < key, value > perform numerical value accumulation on the value values corresponding to the same key value; at this time, the key value in the resulting key-value pair < key, value > remains unchanged; the value in the key-value pair < key, value > is the value resulting from the summation;
step 4, the main node redefines the key value pairs < key, value > obtained in the step 3, namely, the key values in the key value pairs < key, value > contain decision attribute values, and the decision attribute values are removed; at this time, the key value in the key value pair < key, value > is the condition attribute i and its value, and the condition attribute j and its value; the value in the resulting key-value pair < key, value > remains unchanged;
step 5, the main node performs equivalent merging operation on the key value pairs < key, value > obtained in the step 4 again, namely merging the value values corresponding to the same key value in the key value pairs < key, value > in an array form; at this point, the key value in the resulting key-value pair < key, value > remains unchanged; the value in the key value pair < key, value > is an array obtained by combination;
step 6, the master node broadcasts the key-value pairs <key, value> obtained in step 5 to each slave node; each computer node, including the master node and the slave nodes, calculates from the obtained key-value pairs <key, value> the single-attribute local conditional entropy of each attribute relative to the decision attribute and the attribute-pair local conditional entropy of every 2 attributes relative to the decision attribute;
step 7, the slave node uploads all the calculated single-attribute local conditional entropies and the attribute pair local conditional entropies to the master node at the same time; the main node adds all the single-attribute local conditional entropies of all the attributes to obtain a single-attribute global conditional entropy; meanwhile, the main node adds the local conditional entropies of the attribute pairs to obtain the global conditional entropies of the attribute pairs; finally, the main node calculates the global conditional entropy according to the single-attribute global conditional entropy and the attribute pair global conditional entropy to obtain heuristic information among every 2 attributes;
step 8, the master node reduces the conditional attribute set in the decision information system by using an ant colony algorithm based on the heuristic information between every 2 attributes calculated in the step 7;
the above i, j belongs to C, C represents a condition attribute set, and j ≠ i.
In step 3, identical key values mean that the value of the condition attribute i, the value of the condition attribute j, and the decision attribute value are all the same; in step 5, identical key values mean that the value of the condition attribute i and the value of the condition attribute j are all the same.
The above heuristic information η_ij is:

η_ij = (H(D|i) − H(D|i∪j)) / H(D|j)

where H(D|i) represents the conditional entropy of the condition attribute i relative to the decision attribute D, H(D|j) represents the conditional entropy of the condition attribute j relative to the decision attribute D, and H(D|i∪j) represents the conditional entropy of the union of the condition attribute i and the condition attribute j relative to the decision attribute D.
The concrete process of reducing the condition attribute set in the decision information system by using the ant colony algorithm is as follows:
step 1) let the initial global reduction set R_min = C and the initial iteration parameter t = 0; give the initial pheromone concentration τ_ij(0) and the total iteration number maxGeneration, where 1 ≤ i ≤ |C| and 1 ≤ j ≤ |C|; C represents the condition attribute set and |C| represents the number of attributes in the condition attribute set;
step 2) add 1 to the iteration parameter t, representing the t-th iteration;
step 3) set the initial ant index k = 1, representing the k-th ant; each generation has Ant ants, each of which solves for a minimum reduction set;
step 4) for the current ant k, randomly select 1 condition attribute from the condition attribute set C as the currently selected condition attribute and put it into the local reduction set R_k of the current ant k;
step 5) using the current pheromone concentration τ_ij(t) and the heuristic information η_ij, calculate the probability between the currently selected condition attribute of the current ant k and every other unselected condition attribute in the condition attribute set C, and select the condition attribute corresponding to the maximum probability as the candidate condition attribute of the current ant k;
step 6) judge whether the mutual information of the local reduction set R_k of the current ant k with the decision attribute D changes after the candidate condition attribute is added:
if the mutual information value changes, take the candidate condition attribute as the currently selected condition attribute, put it into the local reduction set R_k of the current ant k, and go to step 7);
if the mutual information value does not change, the candidate condition attribute is redundant; go directly to step 7);
step 7) judge whether the number of attributes contained in the reduction set R_k of the current ant k is greater than or equal to the number of attributes contained in the current global reduction set R_min:
if so, the current global reduction set R_min remains unchanged;
otherwise, further judge whether the mutual information of the current reduction set R_k of the current ant k with the decision attribute D equals the mutual information of the condition attribute set C with the decision attribute D:
if so, update the global reduction set: R_min = R_k;
otherwise, the current global reduction set R_min remains unchanged;
step 8) add 1 to the ant index k and judge whether k equals the total number Ant: when k equals Ant, go directly to step 9); otherwise, return to step 4);
step 9) perform redundancy detection on each attribute of the current global reduction set R_min, i.e., judge whether the mutual information with the decision attribute changes after the attribute is deleted from R_min; when the mutual information value does not change, the attribute is redundant and is deleted from the global reduction set R_min; otherwise the attribute is not redundant and remains in the global reduction set R_min;
step 10) calculate the pheromone concentration τ_ij(t+1) of the next generation;
step 11) judge whether t equals the total iteration number maxGeneration: when t = maxGeneration, output the current global reduction set; otherwise, return to step 3). A minimal code sketch of this procedure follows.
In the above step 5), at the t-th generation, the probability p_ij^k(t) between the currently selected condition attribute i of ant k and another unselected condition attribute j in the condition attribute set C is:

p_ij^k(t) = [τ_ij(t)]^α · [η_ij]^β / Σ_{l∈allowed_k} [τ_il(t)]^α · [η_il]^β,  j ∈ allowed_k
where τ_ij(t) is the pheromone concentration between the condition attribute i and the condition attribute j at the t-th generation, η_ij is the heuristic information between the condition attributes i and j, allowed_k is the set formed by the condition attributes in C not yet selected by ant k, α is the given importance of the pheromone concentration with 0 ≤ α ≤ 1, and β is the given importance of the heuristic information with 0 ≤ β ≤ 1.
In the above step 10), the pheromone concentration τ_ij(t+1) of the (t+1)-th generation is:

τ_ij(t+1) = ρ·τ_ij(t) + Δτ_ij(t)

where ρ is the volatilization rate of the given pheromone concentration, τ_ij(t) is the pheromone concentration between the condition attribute i and the condition attribute j at the t-th generation, and Δτ_ij(t) is the pheromone concentration increment between the condition attribute i and the condition attribute j at the t-th generation.
The pheromone concentration increment Δτ_ij(t) between the t-th generation condition attribute i and condition attribute j is:

Δτ_ij(t) = q / |R_min|, if the condition attributes i and j both belong to the t-th generation minimum reduction set; Δτ_ij(t) = 0, otherwise

where |R_min| is the number of attributes in the global reduction set and q is a given constant.
Compared with the prior art, the method introduces the idea of combining the cloud-computing Spark parallel technology and the intelligent ant colony algorithm into rough set attribute reduction. On this basis, it proposes an innovative strategy of taking an improved information gain rate as heuristic information and performing redundancy detection on the selected attributes and on each generation's optimal solution, so that the algorithm converges rapidly to the global optimal solution, the possibility of adding redundant attributes to the reduction set is effectively avoided, and the redundancy caused by random selection of the initial attribute is eliminated. Most importantly, a multi-objective parallel solving strategy is adopted when calculating the heuristic information: the heuristic information of multiple attributes relative to the current attribute can be solved simultaneously, and the time complexity is reduced from O(|n|²) to O(|n|). The invention fully exploits the complementary advantages of the Spark parallel technology and the intelligent ant colony algorithm, can effectively obtain the minimum attribute reduction while processing massive data, enriches the methods and application range of rough-set attribute reduction, and advances both the solution of the attribute reduction problem and its solving efficiency under large-scale data.
Drawings
FIG. 1 is a flow chart of a multi-objective parallel attribute reduction method based on Spark and ant colony optimization.
Fig. 2 is a flowchart illustrating the ant colony algorithm in fig. 1 reducing a conditional attribute set in the decision information system.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings in conjunction with specific examples.
The method solves the minimum attribute reduction under big data based on the ant colony optimization algorithm and the Spark parallel processing technology. It uses the good global optimization capability of the ant colony optimization algorithm to solve for a minimum attribute reduction set, exploits the parallelism of "equivalence class" computation in rough set theory for parallel calculation under big data, and takes an improved information gain rate as heuristic information. On the basis of Spark distributed parallel computing, a new multi-objective parallel solving strategy is proposed for the heuristic information computation, in which the importance of multiple attributes relative to the current attribute can be calculated at the same time, greatly improving solving efficiency and reducing time complexity.
Referring to fig. 1, a multi-objective parallel attribute reduction method based on Spark and ant colony optimization specifically includes the following steps:
step 1, build a Hadoop distributed platform on a computer cluster, deploy the Spark operating environment, and upload the decision information system S = (U, C∪D, V, f) to the distributed file system HDFS.
The HDFS can automatically segment data in the decision information system into a plurality of data fragments which are distributed to all computer nodes, and all the data fragments are independent and do not overlap. One of the computer nodes is a master node, and the rest of the computer nodes are slave nodes, and the master node is responsible for data distribution, scheduling communication and the like of the rest of the slave nodes; wherein U represents a sample set, C represents a condition attribute set, D represents a decision attribute, V is a value range of the attribute, and f is an information function.
And (3) building a Hadoop Distributed platform, realizing a Distributed File System (HDFS), and deploying a Spark operating environment. The HDFS is a specific node structure based on a master-slave structure system, the nodes comprise a NameNode and a plurality of DataNodes, the NameNode is responsible for managing a file system name space and controlling the access of an external client, and the DataNodes are responsible for storing and accessing data.
Given the decision information system S = (U, C∪D, V, f), the decision information system, i.e., a data set, is added to the HDFS, which automatically segments the data in the decision information system into a plurality of data fragments for storage in the computer cluster.
In the present embodiment, the decision information system is given as shown in table 1:
TABLE 1 given information System S
(Table 1 is rendered as an image in the original; it lists six samples x₁–x₆ over the condition attributes C₁, C₂, C₃ and the decision attribute D.)
Suppose i = C₁ is the attribute where the ant is currently located and j = {C₂, C₃} represents the candidate attributes; then the conditional entropies of the condition attribute C₁ combined with C₂ and with C₃ relative to the decision attribute D, i.e., H(D|C₁∪C₂) and H(D|C₁∪C₃), need to be calculated for formula (3).
The data is distributed to two computing nodes, one holding the first three samples x₁–x₃ and the other the last three samples x₄–x₆, as shown in Table 2:
TABLE 2 data sharding distributed across different computer nodes
(Table 2 is rendered as an image in the original; it shows the two data fragments, x₁–x₃ on one node and x₄–x₆ on the other.)
The HDFS can decompose large-scale data into a plurality of data fragments, the data fragments are independent and do not overlap, and the uploaded data fragments are obtained from the HDFS by calling a sparkContext () method through Spark.
Step 2, the distributed file system HDFS adopts a distributed method (the main node and the slave node are mutually matched) to calculate heuristic information eta ij
When the traditional method calculates the conditional entropy of condition attributes relative to the decision attribute, the condition attribute i must be processed against every other condition attribute j (j ≠ i, i ∈ C, j ∈ C) in sequence. Aiming at this problem, the invention obtains the uploaded data fragments from the distributed file system and, under the Spark distributed platform, designs suitable key-value pairs in <key, value> form according to the Spark parallel principle, so that the conditional entropies of the condition attribute i together with each of the remaining condition attributes j relative to the decision attribute D are obtained simultaneously, and the heuristic information η_ij is then calculated in parallel.
Spark abstracts the data distributed on the cluster nodes as a Resilient Distributed Dataset (RDD) and performs a series of parallel operations on it. An RDD is created by reading a file (data) from HDFS, and may also be converted from another RDD. According to the principle of parallel computing, the invention provides a new multi-objective parallel solving method: through the combined execution of several RDD operators in Spark, the conditional entropies of all selected attributes together with the current attribute relative to the decision attribute can be calculated simultaneously, thereby obtaining the heuristic information.
Step 2.1, each computer node comprises a main node and a slave node, the attribute extraction is carried out on the data fragments obtained by the nodes, and the data fragments are subjected to key value pair<key,value>Storing the form of (1); at this point, the resulting key-value pair<key,value>The key value in the key is a condition attribute i and a value thereof, a condition attribute j and a value thereof, a decision attribute value, and the condition attribute i is C 1 Representing the currently selected attribute when computing heuristic information, j ═ { C 2 ,C 3 Representing other unselected condition attributes; the resulting key-value pair<key,value>The value in (1).
In this embodiment, the specific form is as follows:
(The resulting key-value pairs are shown as an image in the original.)
step 2.2, each slave node uploads the key value pairs < key, value > stored after the definition to the master node, the master node performs equivalent summation operation on all the key value pairs < key, value > stored after the definition of the master node and all the key value pairs < key, value > uploaded by each slave node, namely, the key value pairs < key, value > perform numerical value accumulation summation on the value values corresponding to the same key value; at this time, the key value in the key value pair < key, value > remains unchanged, and is still the condition attribute i and its value, the condition attribute j and its value, and the decision attribute and its value; the value in the resulting key-value pair < key, value > is the value resulting from the summation.
In this implementation, the results are as follows:
(The summed key-value pairs are shown as an image in the original.)
step 2.3, the main node redefines the key value pairs < key, value > obtained in the step 2.2, namely, the decision attribute values contained in the key values in the key value pairs < key, value > are removed; at this time, the key value in the key value pair < key, value > is the condition attribute i and its value, and the condition attribute j and its value; the value in the resulting key-value pair < key, value > remains unchanged.
In this implementation, the results are as follows:
(The key-value pairs with the decision values removed are shown as an image in the original.)
Step 2.4, the master node performs the equivalence-class merging operation on the key-value pairs <key, value> obtained in step 2.3, i.e., the value values corresponding to the same key value in the key-value pairs <key, value> are merged in array form; at this point, the key value in the resulting key-value pair <key, value> remains unchanged; the value in the resulting key-value pair <key, value> is the merged array, returned in the form <key, [value₁, value₂, …, value_n]>.
In this implementation, the results are as follows:
(The merged key-value pairs are shown as an image in the original.)
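The chain of operations in steps 2.1–2.4 amounts to a map → reduceByKey → map → groupByKey pipeline. Below is a minimal PySpark sketch of that chain; the HDFS path, the CSV layout, and the attribute column indices are illustrative assumptions (the embodiment itself uses the Java/Scala pair-RDD API, e.g. mapToPair).

```python
from pyspark import SparkContext

sc = SparkContext(appName="ParallelConditionalEntropy")

i, js, d = 0, [1, 2], 3   # assumed columns: C1 current, C2/C3 unselected, D decision

rows = (sc.textFile("hdfs:///data/decision_table.csv")  # hypothetical path
          .map(lambda ln: ln.split(",")))

# Step 2.1: key = (attr i, its value, attr j, its value, decision value), value = 1
pairs = rows.flatMap(lambda r: [((i, r[i], j, r[j], r[d]), 1) for j in js])

# Step 2.2: equivalent summation -- accumulate the counts of identical keys
counted = pairs.reduceByKey(lambda a, b: a + b)

# Step 2.3: redefine the key by dropping the decision value
no_dec = counted.map(lambda kv: (kv[0][:4], kv[1]))

# Step 2.4: equivalence-class merge into <key, [value1, value2, ..., valuen]>
merged = no_dec.groupByKey().mapValues(list)
```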
Step 2.5, the master node broadcasts the key-value pairs <key, value> obtained in step 2.4 to each slave node; each computer node, including the master node and the slave nodes, calculates from the obtained key-value pairs <key, value>, over the data fragment held by that node, the single-attribute local conditional entropy of each attribute relative to the decision attribute and the attribute-pair local conditional entropy of every 2 attributes relative to the decision attribute.
The conditional entropy, relative to the decision attribute, of each selected condition attribute j combined with the current condition attribute i is calculated according to the information entropy of formula (1) and the conditional entropy of formula (2). Each selected condition attribute j is taken as the key and the calculated conditional entropy as the value, defined and stored using mapToPair;
where H(D) represents the information entropy of the decision attribute D, defined as:

H(D) = −Σ_{j=1}^{m} p(Y_j) · log₂ p(Y_j)     (1)
H(D|C) represents the conditional entropy of the condition attribute C relative to the decision attribute D, defined as:

H(D|C) = −Σ_{i=1}^{n} p(X_i) · Σ_{j=1}^{m} p(Y_j|X_i) · log₂ p(Y_j|X_i)     (2)
where p(X_i) = |X_i| / |U| and p(Y_j|X_i) = |X_i ∩ Y_j| / |X_i|; n and m represent the value ranges of the condition attribute C and the decision attribute D respectively, and |X_i| is the cardinality of X_i.
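As a concrete reading of formulas (1) and (2), the helpers below compute H(D) and H(D|C) from equivalence-class counts. The dictionary layout (equivalence class → decision-class counts, matching the merged arrays of step 2.4) is an assumption for illustration.

```python
import math

def entropy(class_counts):
    """H(D) per formula (1): entropy of the decision attribute, computed
    from the sample counts of each decision class Y_j."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total)
                for c in class_counts if c > 0)

def conditional_entropy(blocks):
    """H(D|C) per formula (2). `blocks` maps each equivalence class X_i of
    the condition attribute(s) to the list of decision-class counts inside
    it -- exactly the <key, [value1, ..., valuen]> arrays of step 2.4."""
    universe = sum(sum(counts) for counts in blocks.values())
    h = 0.0
    for counts in blocks.values():
        size = sum(counts)                        # |X_i|
        p_x = size / universe                     # p(X_i) = |X_i| / |U|
        h -= p_x * sum((c / size) * math.log2(c / size)
                       for c in counts if c > 0)  # sum of p(Y_j|X_i) log2 terms
    return h
```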
In the present embodiment, the conditional entropies of the condition attributes C₂ and C₃, each combined with the condition attribute C₁, relative to the decision attribute are calculated from formulas (1) and (2). With the condition attributes C₂ and C₃ respectively as the key and the calculated conditional entropy as the value, mapToPair is used for definition and storage, and the results are as follows:
<C₂; 0>  <C₃; 0>
<C₂; 0.24>  <C₃; 0.14>
<C₃; 0>
Step 2.6, the slave nodes upload all the calculated single-attribute local conditional entropies and attribute-pair local conditional entropies to the master node at the same time; the master node adds up the single-attribute local conditional entropies of each attribute to obtain the single-attribute global conditional entropy; meanwhile, the master node adds up the attribute-pair local conditional entropies to obtain the attribute-pair global conditional entropy; finally, the master node calculates the heuristic information η_ij between every 2 attributes from the single-attribute global conditional entropies and the attribute-pair global conditional entropies.
The heuristic information η_ij takes the improved information gain rate as heuristic information: it considers both the mutual information increment with the decision attribute obtained when the selected attribute is added to the current attribute, and the conditional entropy contained in the selected attribute; the heuristic information calculation is therefore ultimately equivalent to conditional entropy calculations. The heuristic information η_ij of the invention is calculated as:

η_ij = (H(D|i) − H(D|i∪j)) / H(D|j)     (3)
where H(D|i) represents the conditional entropy of the condition attribute i relative to the decision attribute D, H(D|j) represents the conditional entropy of the condition attribute j relative to the decision attribute D, and H(D|i∪j) represents the conditional entropy of the union of the condition attribute i and the condition attribute j relative to the decision attribute D.
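With the three global conditional entropies in hand, evaluating the heuristic is a single arithmetic step. A sketch follows, assuming the reconstructed form of formula (3) above (the original renders the formula as an image):

```python
def heuristic(h_d_i, h_d_j, h_d_ij, eps=1e-12):
    """eta_ij = (H(D|i) - H(D|i u j)) / H(D|j); eps is an illustrative guard
    against a zero denominator and is not part of the patent's formula."""
    return (h_d_i - h_d_ij) / (h_d_j + eps)
```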
In this implementation, the results are as follows:
<C₂; 0.24>  <C₃; 0.14>
The conditional entropies of all selected attributes relative to the current attribute are thus obtained: H(D|C₁∪C₂) = 0.24, H(D|C₁∪C₃) = 0.14. The heuristic information is then calculated using formula (3).
Using the improved information gain rate as heuristic information, together with the innovative strategy of performing redundancy detection on the selected attributes and on each generation's optimal solution, enables the algorithm to converge rapidly to the global optimal solution, effectively avoids adding redundant attributes to the reduction set, and eliminates the redundancy caused by random selection of the initial attribute.
Step 3, based on the heuristic information η_ij between every 2 attributes calculated in step 2, the master node performs attribute reduction on the condition attributes in the decision information system using the ant colony algorithm, see fig. 2.
The pheromone concentration and the heuristic information influence the path an ant walks, i.e., the reduction set it constructs as a solution. Each ant leaves a certain amount of pheromone on its path during walking, i.e., during solving, and the pheromone gradually volatilizes and its concentration decreases over time. The pheromone concentration is therefore high on paths many ants have walked, and ants tend to advance toward places with high pheromone concentration when choosing a path, forming a positive feedback effect that leads to the optimal path among the attributes, i.e., the minimum attribute reduction set. By performing redundancy detection on every selected attribute and on each generation's optimal solution, the minimum reduction set converges quickly. The pheromone concentration values on the path of each generation's optimal solution are updated, and through multi-generation evolution it is judged whether the termination condition is reached, finally yielding a stable global optimal solution.
Step 3.1, initialize the parameters: let the initial global reduction set, i.e., the minimum reduction set, be R_min = C; let the length of the initial global reduction set, i.e., of the minimum reduction set, be L_min = |C|; and set the initial iteration parameter t = 0. The total iteration number maxGeneration, the total number Ant of ants per generation, and the initial pheromone concentration τ_ij(0) between any two attributes i and j of the initial 0-th generation are given in advance; the initial value is generally set empirically, for example to 0.5. Here i and j index the condition attributes, with i, j ∈ C, 1 ≤ i ≤ |C|, and 1 ≤ j ≤ |C|; |C| represents the total number of condition attributes in the condition attribute set.
Step 3.2, perform the generation iteration with iteration parameter t++; each generation has Ant ants, and each ant performs the solving of a minimum reduction set;
Step 3.3, the current ant performs its solving; at the start of each generation k = 1, and the initial reduction set solved by each ant is R_k = {a_k}, where R_k represents the reduction set of ant k, a_k represents one attribute randomly selected from the condition attribute set C as the initial attribute, i.e., the attribute where the ant is currently located, and L_k = 1 indicates the number of attributes contained in the initial reduction set.
Step 3.4, select the next attribute according to the probability formula formed from the pheromone concentration τ_ij and the heuristic information η_ij, defined as follows:

p_ij^k(t) = [τ_ij(t)]^α · [η_ij]^β / Σ_{l∈allowed_k} [τ_il(t)]^α · [η_il]^β,  j ∈ allowed_k     (4)
where p_ij^k(t) represents the probability that the k-th ant at the current attribute i selects j as the next attribute, τ_ij(t) represents the pheromone concentration strength from attribute i to attribute j at generation t, τ_il(t) represents the pheromone concentration from attribute i to attribute l at generation t, j ∈ allowed_k = {C − R_k}, where allowed_k is the set of unselected condition attributes, and 0 ≤ α ≤ 1 and 0 ≤ β ≤ 1 represent the importance of the pheromone concentration on the path and of the heuristic information, respectively. The attribute j with the maximum selection probability is chosen according to the probability formula, b_k = j is set, and the procedure goes to the next step;
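As a small sketch of formula (4) and the argmax choice of step 3.4 (the dictionary-based representation of allowed_k is an assumption):

```python
def transition_probabilities(i, allowed, tau, eta, alpha, beta):
    """Formula (4): selection probabilities of ant k at attribute i over
    the unselected attributes allowed_k."""
    weights = {j: (tau[i][j] ** alpha) * (eta[i][j] ** beta) for j in allowed}
    z = sum(weights.values())
    return {j: w / z for j, w in weights.items()}

def next_attribute(i, allowed, tau, eta, alpha, beta):
    """Step 3.4: b_k is the attribute with the maximum selection probability."""
    p = transition_probabilities(i, allowed, tau, eta, alpha, beta)
    return max(p, key=p.get)
```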
Step 3.5, for the k-th ant, judge whether the mutual information with the decision attribute D changes after the selected attribute b_k is added to the reduction set R_k, i.e., whether I(R_k; D) equals I(R_k∪b_k; D), to determine whether the selected attribute is redundant; this effectively prevents redundant attributes from being added to the reduction set as selected attributes. If the mutual information value does not change, the selected attribute is redundant and the procedure goes directly to the next step; otherwise b_k is added to the reduction set R_k and at the same time L_k++;
Step 3.6, when the following conditions are met, stopping searching by the current ant k, entering the next step, otherwise, returning to the step 3.4, and selecting the next selected attribute;
(1) the reduction set R_k of ant k contains more attributes than the current global reduction set R_min. This first stopping condition indicates that the number of attributes contained in the current reduction set already exceeds the number contained in the global reduction set, so there is no need to search for the next attribute: the result cannot be optimal;
(2) the mutual information of the current reduction set R_k of ant k with the decision attribute D equals the mutual information of the condition attribute set C with the decision attribute D, i.e., I(C; D) = I(R_k; D), where R_k is the local solution, i.e., the reduction set, constructed by the k-th ant. This second stopping condition means that the mutual information of the current reduction set with the decision attribute is equivalent to the mutual information of all condition attributes with the decision attribute. At this time the minimum reduction set R_min must be updated, i.e., R_min = R_k and L_min = L_k.
Step 3.7, k++; the next ant continues searching; when k equals the total number Ant, all ants have finished searching and the procedure enters step 3.8; otherwise, it returns to step 3.3;
Step 3.8, redundancy detection is performed on each attribute a ∈ R_min of the minimum attribute reduction set R_min obtained by the current generation t, to avoid redundant attributes remaining in the reduction set due to the initial random selection: judge whether the mutual information of the reduction set R_min with the decision attribute changes after the attribute a is deleted; if it does not change, a redundant attribute still exists in the current reduction set, and R_min = R_min − a, L_min−−; after all attributes have been checked, R_min and L_min are updated and the procedure enters step 3.9;
step 3.9 after all ants in the tth generation complete the search, the pheromone concentration between attributes is updated according to the following formula:
τ_ij(t+1) = ρ·τ_ij(t) + Δτ_ij(t)     (5)
where the parameter ρ (0 ≤ ρ ≤ 1) is a constant representing the volatility of the pheromone concentration, and Δτ_ij(t) represents the pheromone concentration increment from attribute i to attribute j at generation t, calculated as follows:

Δτ_ij(t) = q / |R(t)|, if attributes i and j both belong to the t-th generation minimum attribute reduction set; Δτ_ij(t) = 0, otherwise     (6)

where q is a given constant parameter and |R(t)| denotes the cardinality of the t-th generation minimum attribute reduction set;
step 3.10, judging whether the iteration time t is equal to the maximum iteration time maxGeneration, if so, entering the next step, otherwise, returning to the step 3.2, and performing the next round of iterative calculation;
Step 3.11, output the minimum reduction set R_min and its length L_min.
The satisfaction condition of the reduction set, I(C; D), representing the mutual information of all the condition attributes relative to the decision attribute, is defined as follows:
I(C; D) = H(D) − H(D|C)     (7)
where H (D) represents the information entropy of the decision attribute D, and H (D | C) represents the conditional entropy of the condition attribute C with respect to the decision attribute D.
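Formula (7) also yields the stop test of step 3.6 directly: a candidate set R is accepted when I(R; D) = I(C; D), i.e., when H(D|R) = H(D|C). A one-line sketch, assuming a cond_entropy(S) helper such as the one above:

```python
import math

def is_reduct(R, C, h_d, cond_entropy):
    """Formula (7): I(S; D) = H(D) - H(D|S); R satisfies the reduction
    condition when I(R; D) equals I(C; D)."""
    return math.isclose(h_d - cond_entropy(R), h_d - cond_entropy(C))
```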
The method is grounded in rough set theory and relies on the strong combinatorial optimization capability of intelligent algorithms together with the Spark parallel distributed processing technology; it solves the problems that traditional attribute reduction algorithms cannot process massive data and perform poorly at solving the minimum attribute reduction, and it enriches the application range of rough-set knowledge reduction.
It should be noted that, although the above-mentioned embodiments of the present invention are illustrative, the present invention is not limited thereto, and thus the present invention is not limited to the above-mentioned embodiments. Other embodiments, which can be devised by those skilled in the art in light of the teachings of the present invention, are considered to be within the scope of the present invention without departing from its principles.

Claims (7)

1. The multi-objective parallel attribute reduction method based on Spark and ant colony optimization is characterized by comprising the following steps:
step 1, uploading a decision information system to a distributed file system (HDFS), wherein the HDFS automatically divides data in the decision information system into a plurality of data fragments which are distributed to each computer node, and each data fragment is independent and not overlapped; one of the computer nodes is a master node, and the rest are slave nodes;
step 2, each computer node, including the master node and the slave nodes, performs attribute extraction on the data fragments it has obtained and stores them in key-value pair <key, value> form; at this time, the key value in the key-value pair <key, value> is the condition attribute i and its value, the condition attribute j and its value, and the decision attribute value; the value in the resulting key-value pair <key, value> is 1;
step 3, each slave node uploads the stored key value pairs < key, value > to the master node, the master node performs equivalent summation operation on all the key value pairs < key, value > stored by the master node and all the key value pairs < key, value > uploaded by each slave node, namely, the key value pairs < key, value > perform numerical value accumulation on the value values corresponding to the same key value; at this time, the key value in the resulting key-value pair < key, value > remains unchanged; the value in the key-value pair < key, value > is the value resulting from the summation;
step 4, the main node redefines the key value pairs < key, value > obtained in the step 3, namely, the key values in the key value pairs < key, value > contain decision attribute values which are removed; at this time, the key value in the key value pair < key, value > is the condition attribute i and its value, and the condition attribute j and its value; the value in the resulting key-value pair < key, value > remains unchanged;
step 5, the main node performs equivalent merging operation on the key value pairs < key, value > obtained in the step 4 again, namely, the value values corresponding to the same key value in the key value pairs < key, value > are merged in an array form; at this point, the key value in the resulting key-value pair < key, value > remains unchanged; the value in the key value pair < key, value > is an array obtained by combination;
step 6, the master node broadcasts the key-value pairs <key, value> obtained in step 5 to each slave node; each computer node, including the master node and the slave nodes, calculates from the obtained key-value pairs <key, value> the single-attribute local conditional entropy of each attribute relative to the decision attribute and the attribute-pair local conditional entropy of every 2 attributes relative to the decision attribute;
step 7, the slave node uploads all the calculated single-attribute local conditional entropies and the attribute pair local conditional entropies to the master node at the same time; the main node adds all the single-attribute local conditional entropies of all the attributes to obtain a single-attribute global conditional entropy; meanwhile, the main node adds the local conditional entropies of the attribute pairs to obtain the global conditional entropies of the attribute pairs; finally, the main node calculates the global conditional entropy according to the single-attribute global conditional entropy and the attribute pair global conditional entropy to obtain heuristic information among every 2 attributes;
step 8, the master node reduces the conditional attribute set in the decision information system by using an ant colony algorithm based on the heuristic information between every 2 attributes calculated in the step 7;
in the above, i, j ∈ C, where C represents the condition attribute set and j ≠ i.
2. The multi-objective parallel attribute reduction method based on Spark and ant colony optimization according to claim 1, wherein,
in step 3, identical key values mean that the value of the condition attribute i, the value of the condition attribute j, and the decision attribute value are all the same;
in step 5, identical key values mean that the value of the condition attribute i and the value of the condition attribute j are all the same.
3. The multi-objective parallel attribute reduction method based on Spark and ant colony optimization according to claim 1, wherein the heuristic information η_ij is:

η_ij = (H(D|i) − H(D|i∪j)) / H(D|j)

where H(D|i) represents the conditional entropy of the condition attribute i relative to the decision attribute D, H(D|j) represents the conditional entropy of the condition attribute j relative to the decision attribute D, and H(D|i∪j) represents the conditional entropy of the union of the condition attribute i and the condition attribute j relative to the decision attribute D.
4. The multi-objective parallel attribute reduction method based on Spark and ant colony optimization as claimed in claim 1, wherein the specific process of reducing the conditional attribute set in the decision information system by using the ant colony algorithm is as follows:
step 1) letting the initial global reduction set R_min = C and the initial iteration parameter t = 0; giving an initial pheromone concentration τ_ij(0) and a total iteration number maxGeneration, wherein 1 ≤ i ≤ |C| and 1 ≤ j ≤ |C|; wherein C represents the condition attribute set and |C| represents the number of attributes in the condition attribute set;
step 2) adding 1 to the iteration parameter t to represent the t-th iteration;
step 3) setting the initial ant index k = 1, representing the k-th ant, wherein each generation has Ant ants and each ant solves for a minimum reduction set;
step 4) for the current ant k, randomly selecting 1 condition attribute from the condition attribute set C as the currently selected condition attribute, and putting the currently selected condition attribute into the local reduction set R_k of the current ant k;
step 5) using the current pheromone concentration τ_ij(t) and the heuristic information η_ij, calculating the probability between the currently selected condition attribute of the current ant k and the other unselected condition attributes in the condition attribute set C, and selecting the condition attribute corresponding to the maximum probability as the candidate condition attribute of the current ant k;
step 6) judging whether the mutual information of the local reduction set R_k of the current ant k with the decision attribute D changes after the candidate condition attribute is added:
if the mutual information value changes, taking the candidate condition attribute as the currently selected condition attribute, putting it into the local reduction set R_k of the current ant k, and going to step 7);
if the mutual information value does not change, the candidate condition attribute is redundant, and step 7) is entered directly;
step 7) judging whether the number of attributes contained in the reduction set R_k of the current ant k is greater than or equal to the number of attributes contained in the current global reduction set R_min:
if so, the current global reduction set R_min remains unchanged;
otherwise, further judging whether the mutual information of the current reduction set R_k of the current ant k with the decision attribute D equals the mutual information of the condition attribute set C with the decision attribute D:
if so, updating the global reduction set: R_min = R_k;
otherwise, the current global reduction set R_min remains unchanged;
step 8) adding 1 to the ant index k, and judging whether k equals the total number Ant: when k equals Ant, going directly to step 9); otherwise, returning to step 4);
step 9) performing redundancy detection on each attribute of the current global reduction set R_min, i.e., judging whether the mutual information with the decision attribute changes after the attribute is deleted from R_min; when the mutual information value does not change, the attribute is redundant and is deleted from the global reduction set R_min; otherwise the attribute is not redundant and remains in the global reduction set R_min;
step 10) calculating the pheromone concentration τ_ij(t+1) of the next generation;
step 11) judging whether t equals the total iteration number maxGeneration: when t = maxGeneration, outputting the current global reduction set; otherwise, returning to step 3).
5. The multi-objective parallel attribute reduction method based on Spark and ant colony optimization according to claim 4, wherein in step 5), at the t-th generation, the probability p_ij^k(t) between the currently selected condition attribute i of ant k and another unselected condition attribute j in the condition attribute set C is:

p_ij^k(t) = [τ_ij(t)]^α · [η_ij]^β / Σ_{l∈allowed_k} [τ_il(t)]^α · [η_il]^β,  j ∈ allowed_k

where τ_ij(t) is the pheromone concentration between the condition attribute i and the condition attribute j at the t-th generation, η_ij is the heuristic information between the condition attributes i and j, allowed_k is the set formed by the condition attributes in C not yet selected by ant k, α is the given importance of the pheromone concentration with 0 ≤ α ≤ 1, and β is the given importance of the heuristic information with 0 ≤ β ≤ 1.
6. The multi-objective parallel attribute reduction method based on Spark and ant colony optimization according to claim 4, wherein in step 10), the pheromone concentration τ_ij(t+1) of the (t+1)-th generation is:

τ_ij(t+1) = ρ·τ_ij(t) + Δτ_ij(t)

where ρ is the volatilization rate of the given pheromone concentration, τ_ij(t) is the pheromone concentration between the condition attribute i and the condition attribute j at the t-th generation, and Δτ_ij(t) is the pheromone concentration increment between the condition attribute i and the condition attribute j at the t-th generation.
7. The multi-objective parallel attribute reduction method based on Spark and ant colony optimization according to claim 6, wherein the pheromone concentration increment Δτ_ij(t) between the t-th generation condition attribute i and condition attribute j is:

Δτ_ij(t) = q / |R_min|, if the condition attributes i and j both belong to the t-th generation minimum reduction set; Δτ_ij(t) = 0, otherwise

where |R_min| is the number of attributes in the global reduction set and q is a given constant.
CN201910492176.6A 2019-06-06 2019-06-06 Multi-objective parallel attribute reduction method based on Spark and ant colony optimization Active CN110222023B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910492176.6A CN110222023B (en) 2019-06-06 2019-06-06 Multi-objective parallel attribute reduction method based on Spark and ant colony optimization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910492176.6A CN110222023B (en) 2019-06-06 2019-06-06 Multi-objective parallel attribute reduction method based on Spark and ant colony optimization

Publications (2)

Publication Number Publication Date
CN110222023A CN110222023A (en) 2019-09-10
CN110222023B true CN110222023B (en) 2022-09-16

Family

ID=67815949

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910492176.6A Active CN110222023B (en) 2019-06-06 2019-06-06 Multi-objective parallel attribute reduction method based on Spark and ant colony optimization

Country Status (1)

Country Link
CN (1) CN110222023B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111816270B (en) * 2020-06-18 2022-12-09 南通大学 Attribute parallel reduction Spark method for large-scale liver electronic medical record lesion classification
CN111856954B (en) * 2020-07-20 2022-08-02 桂林电子科技大学 Smart home data completion method based on combination of rough set theory and rules

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102163300A (en) * 2011-04-20 2011-08-24 南京航空航天大学 Method for optimizing fault diagnosis rules based on ant colony optimization algorithm
CN102184449A (en) * 2011-04-15 2011-09-14 厦门理工学院 Intelligent decision making system reduction method based on ant colony
CN106650936A (en) * 2016-11-25 2017-05-10 天津津航计算技术研究所 Rough set attribute reduction method
AU2016281776A1 (en) * 2015-06-24 2018-02-08 Oxford BioDynamics PLC Detection processes using sites of chromosome interaction

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102184449A (en) * 2011-04-15 2011-09-14 厦门理工学院 Intelligent decision making system reduction method based on ant colony
CN102163300A (en) * 2011-04-20 2011-08-24 南京航空航天大学 Method for optimizing fault diagnosis rules based on ant colony optimization algorithm
AU2016281776A1 (en) * 2015-06-24 2018-02-08 Oxford BioDynamics PLC Detection processes using sites of chromosome interaction
CN106650936A (en) * 2016-11-25 2017-05-10 天津津航计算技术研究所 Rough set attribute reduction method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"A New Approach of Attribute Reduction Based on ant Colony Optimization";Huanglin Zeng等;《2009 fifth international conference on natural computation》;20090816;全文 *
"基于等价划分与蚁群优化的并行属性约简算法";王慧等;《北京邮电大学学报》;20111228;全文 *

Also Published As

Publication number Publication date
CN110222023A (en) 2019-09-10


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant