CN106598743A - Attribute reduction method for information system based on MPI parallel solving - Google Patents
- Publication number: CN106598743A (application CN201611259383.XA)
- Authority: CN (China)
- Legal status: Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5061—Partitioning or combining of resources
Abstract
The invention discloses an attribute reduction method for an information system based on MPI parallel solving. The method comprises the following steps: first, the data of the information system are read, the numerical values are preprocessed, and the data are discretized; next, the information system is horizontally partitioned into p sample data subsets, which are assigned to n nodes through communication; the equivalence classes of the data subsets are computed in parallel, and the per-node results are merged to obtain m equivalence classes of the whole information system, each corresponding to a sub-information system; the m sub-information systems are then assigned to the n nodes, which compute attribute cores in parallel until all sub-information systems have been processed, after which the per-node results are merged to obtain the attribute core of the whole information system; finally, the attribute reduction is solved in parallel, and the per-node reduction results are merged to obtain the attribute reduction of the whole information system. By combining rough-set attribute reduction with MPI parallel computing, the discrimination-matrix-based attribute reduction computation can be solved in parallel, improving algorithmic efficiency.
Description
Technical field
The invention belongs to the fields of data mining, rough sets, and parallel computation, and in particular relates to a method for obtaining an attribute reduction in parallel based on MPI using a discrimination matrix.
Background technology
With the explosive growth of data in recent years, parallel techniques have become increasingly important. The main purpose of parallel computation is to reduce the processing time of large, complex problems and massive data; by pooling "cheap" computing resources into a parallel computing platform, the performance and memory limits of a single machine can be overcome.
Parallel computation means splitting a large computing task into multiple subtasks on a parallel computer or parallel computing platform, assigning them to individual processors, and letting the processors cooperate to complete the subtasks, thereby improving solution efficiency or making large-scale tasks feasible. The key to an optimal parallel solution is that the problem to be processed exhibits concurrency. Parallel computation is divided into temporal parallelism and spatial parallelism: temporal parallelism in effect refers to pipelining, while spatial parallelism means multiple processors participating in a computation simultaneously and is the main research topic of parallel computation. Parallel computation can further be divided into data parallelism and task parallelism, both of which let multiple processors participate in a computation to improve efficiency and performance.
The Message Passing Interface (MPI) has been the de facto standard for parallel program development in high-performance computing since the 1990s, and most current high-performance computing platforms provide an MPI parallel environment. MPI is currently the most important parallel programming tool; it has good portability, powerful functionality, high efficiency, and several free, highly efficient implementations. Almost all parallel computer vendors provide support for it, which no other parallel programming environment can match.

MPI appeared in 1994. Although it arrived relatively late, it absorbed the strengths of various earlier parallel environments while balancing performance, functionality, and portability, and within a few short years it rapidly became the standard for the message-passing parallel programming model. This in itself illustrates the vitality and superiority of MPI. MPI is in fact a library with over a hundred function-call interfaces that can be invoked directly from C. Although MPI offers many calls, only six are commonly used, and almost all communication functionality can be accomplished with just these six functions.
Characteristics of MPI: (1) Ease of use and good portability. Almost all parallel computers support the MPI framework; any parallel computer that supports inter-process communication supports MPI programming. (2) A complete asynchronous communication mechanism. Each concurrent process has its own independent memory space, which guarantees that processes can communicate without interfering with other parallel processes, solving the data synchronization problem and achieving true asynchronous communication. (3) Explicit data exchange. The user must explicitly send and receive messages to realize messaging and data exchange between concurrent processes. (4) Coarse parallel granularity. Programming in the message-passing model requires a good task decomposition; it suits compute-intensive applications and, to reduce communication overhead, is suitable for large-grained, large-scale, scalable parallel algorithms.
In practice, the attributes of an information system are not only diverse and high-dimensional but also contain noise, redundancy, and irrelevant attributes. To address the computational complexity and accuracy problems of the data, to eliminate the influence of noise on the computation process and the final result, and to shorten the computing time of rule-extraction algorithms so that the distribution of the essential features of the data can be seen clearly, attribute reduction is indispensable. In recent years, rough set theory has become an effective mathematical tool for processing uncertain information.
Rough sets: the theory was proposed by the Polish scholar Professor Pawlak in 1982 and is a mathematical theory that can effectively process imprecise, uncertain, and fuzzy information. Rough sets have been successfully applied in fields such as machine learning, data mining, intelligent data analysis, and control algorithm acquisition. The main idea of rough set theory is to use a known knowledge base to characterize (approximate) imprecise or uncertain knowledge with the knowledge already in that base. Rough sets do not depend on prior knowledge; they perform knowledge discovery according to the decisions and distribution of the data.
Attribute reduction: attribute reduction is one of the important research topics of rough set theory, a main application direction, and a long-standing research hotspot. Attribute reduction is a process applied to the full data set: on the basis of preserving the original classification ability of the information system, redundant or irrelevant attributes are deleted, so it is generally also regarded as dimensionality reduction of the data set. Common attribute reduction methods include the discrimination-matrix method, the positive-region method, and the information-entropy method. The discrimination-matrix method has the advantages of being easy to understand and convenient to implement, and is favored by many scholars.
Discrimination matrix: classical attribute reduction algorithms based on the discrimination matrix construct, for each given information system, a corresponding discrimination matrix and find, for each element, the set of attributes on which it differs from the other elements, representing the concrete knowledge in the information system. The advantage of this method is that the information contained in the knowledge representation system is visualized by the discrimination matrix, and the attributes that distinguish each pair of objects can be seen at a glance.
The content of the invention
To address the defects of the prior art, namely that the attributes of an information system are diverse and high-dimensional, contain noise and redundancy, and the data volume is large, the present invention proposes a method for solving information system attribute reduction in parallel based on MPI, which uses a discrimination matrix to obtain the attribute reduction in parallel, in order to address the computational complexity and accuracy problems of the data and to improve computing performance and efficiency. The technical scheme is as follows:
A method for solving information system attribute reduction in parallel based on MPI, comprising the following steps:
Step 1), in the data preprocessing stage, read the data of the information system and preprocess the numerical values, i.e. perform discretization. Depending on the characteristics of the data, a simple equal-width or equal-frequency interval method, a discretization method based on attribute importance, or a clustering-based discretization method can be used to discretize continuous data;
Step 2), horizontally partition the information system evenly, in units of samples, into p sample data subsets, and assign the p subsets to n nodes. Each node computes the equivalence classes of its data subsets in parallel according to the condition attributes; the results of the nodes are then merged, yielding m equivalence classes of the whole information system, each equivalence class corresponding to one sub-information system;
Step 3), distribute the m sub-information systems to the n nodes. Each node computes the attribute core of its assigned sub-information systems in parallel until all sub-information systems have been processed; the results of the nodes are then merged to obtain the attribute core of the original information system;
Step 4), finally, send the attribute core of the original information system to the nodes, which compute attribute reductions in parallel; the per-node attribute reduction results are then merged to obtain the attribute reduction of the whole information system.
Further, in the data preprocessing stage of step 1), reading the information system specifically comprises: the information system, i.e. the decision table, is a four-tuple IS = (U, A, V, f), where U represents the set of all objects in the problem domain, called the universe; A = C ∪ D is the attribute set, with subsets C and D the condition attribute set and the decision attribute set respectively; V_a is the value domain of attribute a; and f : U × A → V is an information function that assigns an information value to each attribute of each object, i.e. for every x ∈ U and a ∈ A, f(x, a) ∈ V_a.
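As a concrete illustration, the four-tuple IS = (U, A, V, f) can be held in plain Python structures. This is a minimal sketch; the object and attribute names are hypothetical, not taken from the patent.

```python
# A minimal sketch of the decision table IS = (U, A, V, f) as plain Python
# data. Object and attribute names are illustrative only.
U = ["x1", "x2", "x3"]                 # universe of objects
C = ["a1", "a2"]                       # condition attributes
D = ["d"]                              # decision attribute
A = C + D                              # A = C ∪ D

# f : U × A → V, stored as a nested dict; f[x][a] is the information value
f = {
    "x1": {"a1": 0, "a2": 1, "d": "no"},
    "x2": {"a1": 0, "a2": 1, "d": "yes"},
    "x3": {"a1": 1, "a2": 0, "d": "yes"},
}

# value domain V_a of each attribute, derived from f
V = {a: {f[x][a] for x in U} for a in A}
```

For every x ∈ U and a ∈ A, `f[x][a]` is an element of `V[a]`, matching the definition of the information function.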
Further, when discretizing the continuous data of the information system, depending on the characteristics of the data, a simple equal-width or equal-frequency interval method, a method based on attribute importance, or a clustering-based discretization method can be used.
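Of the discretization methods mentioned, the simple equal-width interval method is the easiest to sketch. The function below is an illustrative implementation under the assumption of numeric input; the function name and bin-indexing convention are ours, not the patent's.

```python
def equal_width_discretize(values, bins):
    """Equal-width discretization sketch: map each continuous value to a
    bin index in [0, bins-1]. Real preprocessing might instead use
    equal-frequency, attribute-importance, or clustering-based methods."""
    lo, hi = min(values), max(values)
    if hi == lo:                        # constant column: single bin
        return [0] * len(values)
    width = (hi - lo) / bins
    out = []
    for v in values:
        k = int((v - lo) / width)
        out.append(min(k, bins - 1))    # clamp the maximum onto the last bin
    return out
```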
Further, the equivalence class partition of the information system in step 2) classifies the universe by the condition attributes using an equivalence relation; the condition attribute set of the data set has the form {condition attribute 1, condition attribute 2, ..., condition attribute p}. An equivalence class may contain consistent and inconsistent objects: objects that agree on both the condition attributes and the decision attribute are consistent, while objects that agree on the condition attributes but differ on the decision attribute are inconsistent.
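The consistency test just described can be sketched directly: two objects are inconsistent when they agree on every condition attribute but disagree on the decision attribute. The representation (a dict of attribute dicts) and the names are assumptions for illustration.

```python
def inconsistent_objects(objs, cond, dec):
    """Return the set of inconsistent objects: pairs that agree on every
    condition attribute in `cond` but differ on the decision attribute
    `dec`. `objs` maps an object name to its attribute dict."""
    bad = set()
    names = list(objs)
    for i, x in enumerate(names):
        for y in names[i + 1:]:
            same_cond = all(objs[x][a] == objs[y][a] for a in cond)
            if same_cond and objs[x][dec] != objs[y][dec]:
                bad.update((x, y))
    return bad

# illustrative data: x1 and x2 share condition values but differ on d
sample = {"x1": {"a": 1, "d": "no"},
          "x2": {"a": 1, "d": "yes"},
          "x3": {"a": 2, "d": "no"}}
```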
Further, step 3) distributes the m sub-information systems to the n nodes. A master-slave mode is adopted when distributing tasks: one node is selected as the master node and the remaining nodes are slave nodes; the master node is responsible for allocating tasks to the slave nodes and receiving their results. Tasks are distributed dynamically, using either random or in-order assignment, so faster nodes receive more tasks; each time, a sub-information system is distributed to an idle node, until all sub-information systems have been processed.
Further, the parallel attribute core computation of step 3) creates a sub decision discrimination matrix on each node. If the sub-information system is a set of inconsistent objects determined by the decision attribute D, its attribute core is the empty set ∅; otherwise, the single attributes that determine the decision in the sub-information system are found, and the union of these single condition attributes is the attribute core of the sub-information system.
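The per-subsystem core and the merging step can be sketched as follows, under the assumption that each sub decision discrimination matrix is available as a list of attribute sets: singleton entries yield the core, and the per-node cores are merged by union.

```python
def sub_core(sub_matrix):
    """Attribute core of one sub-information system from its sub decision
    discrimination matrix (a list of attribute sets): the union of all
    singleton entries, i.e. attributes that alone distinguish some pair
    of objects. If no singleton entry exists, the core is empty."""
    core = set()
    for entry in sub_matrix:
        if len(entry) == 1:
            core |= entry
    return core

def merge_cores(cores):
    """Merging step: by definition, the core of the whole system is the
    union of the per-subsystem cores computed on each node."""
    out = set()
    for c in cores:
        out |= c
    return out
```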
Further, solving the attribute reduction of a sub-information system in parallel comprises: in the sub decision discrimination matrix, set the value of every element that contains a core attribute to the empty set, obtaining a new matrix; establish the corresponding disjunctive logical expression for each remaining non-empty element; take the conjunction of all these disjunctive expressions to obtain a conjunctive normal form; convert the conjunctive normal form into disjunctive normal form; and finally add all core attributes to each conjunct of the disjunctive normal form, obtaining the attribute reduction result of the sub-information system.
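This CNF-to-DNF step can be sketched as follows, under the assumption that the matrix is a list of attribute sets. Expanding the conjunctive normal form by picking one attribute per clause and keeping only the minimal conjuncts (absorption) yields the disjunctive normal form, and the core is then added to every conjunct.

```python
from itertools import product

def reducts(matrix, core):
    """Sketch of the reduction step: drop entries containing a core
    attribute, treat the remaining non-empty entries as a conjunction of
    disjunctions, expand it to disjunctive normal form, absorb non-minimal
    conjuncts, and add the core to every conjunct. Returns frozensets."""
    clauses = [e for e in matrix if e and not (e & core)]
    if not clauses:
        return {frozenset(core)}
    # expand CNF to DNF: choose one attribute from each clause
    conjuncts = {frozenset(choice) for choice in product(*clauses)}
    # absorption law: keep only minimal conjuncts
    minimal = {c for c in conjuncts if not any(o < c for o in conjuncts)}
    return {frozenset(c | core) for c in minimal}
```

For example, the matrix [{a, b}, {b, c}] encodes (a ∨ b) ∧ (b ∨ c), which simplifies to b ∨ (a ∧ c), i.e. the reducts {b} and {a, c}.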
Further, creating the sub decision discrimination matrix specifically comprises: for each partition block, finding for each element in the block the attributes on which it differs from the other elements.
Advantages and beneficial effects of the present invention:

The present invention provides a method for solving information system attribute reduction in parallel based on MPI. The problems to be solved include: the attributes of an information system are diverse and high-dimensional, contain noise and redundancy, and the data volume is large; traditional attribute reduction methods are limited by computing time and cannot quickly and effectively perform attribute reduction on large information systems. By applying MPI parallel techniques to the discrimination-matrix method for attribute reduction, the computational complexity and accuracy problems of the data can be addressed, and computing performance and efficiency improved. The method can process large-scale data sets that serial algorithms cannot handle, and substantially reduces the time needed to obtain the attribute reduction, solving problems such as excessive computing time, memory overflow, and machine crashes that arise when traditional serial attribute reduction algorithms reduce large information systems.
Description of the drawings
Fig. 1 is the flow block diagram of the preferred embodiment of the method for solving information system attribute reduction in parallel based on MPI;

Fig. 2 is the master-slave mode design model;

Fig. 3 is the peer-to-peer mode design model;

Fig. 4 is the node task distribution diagram;

Fig. 5 is the attribute core computation flow chart;

Fig. 6 is the flow chart of solving the attribute reduction from the attribute core;

Fig. 7 is the MPI communication mode flow chart.
Specific embodiment
The technical solutions in the embodiments of the present invention are described clearly and in detail below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
As its technical solution to the above technical problems, the present invention proposes a method for obtaining an attribute reduction in parallel based on MPI using a discrimination matrix, comprising the following steps:

First, in the data preprocessing stage, the data of the information system are read and discretized. Second, in the data partitioning stage, the data of the information system are horizontally partitioned into sample data subsets and distributed to the different nodes of the MPI cluster, so that the equivalence classes are computed in parallel; the computation results of the nodes are then collected to obtain the equivalence classes of the information system, which serve as the basis for partitioning the information system, each equivalence class corresponding to one sub-information system. Then, in the parallel attribute core computation stage, tasks are distributed in master-slave mode: the sub-information systems are distributed to different nodes to compute the attribute core in parallel, and the results are merged through communication to obtain the attribute core of the information system. Finally, according to the task distribution of the previous stage, the attribute reduction is solved in parallel, and the results of the nodes are merged to obtain the attribute reduction of the whole information system.
Specifically, the information system is read first in the data preprocessing stage. A four-tuple IS = (U, A, V, f) is an information system (also a decision table), where U represents the set of all objects in the problem domain, called the universe; A = C ∪ D is the attribute set, with subsets C and D the condition attribute set and the decision attribute set respectively; V_a is the value domain of attribute a; and f : U × A → V is an information function that assigns an information value to each attribute of each object, i.e. for every x ∈ U and a ∈ A, f(x, a) ∈ V_a.
The data are then discretized: the continuous variables are transformed scientifically and rationally into discrete quantities that fit the actual distribution of the data.
In the data partitioning stage, the information system is first horizontally partitioned into p data subsets and distributed to the n nodes. The number of subsets must be chosen appropriately: too many subsets increase the communication overhead, while too few make the parallel granularity too coarse, so that node processing times differ too much and the total time overhead increases.
Parallel programs based on MPI can be classified, according to the relationship between nodes, into the peer-to-peer programming model and the master-slave programming model. In the peer-to-peer model, the nodes cooperate to complete the task together and do not depend on one another. In the master-slave model, the nodes are divided into a master node and slave nodes: the master node is responsible for distributing computing tasks, coordinating the progress of the slave nodes, and collecting the computation results, while the slave nodes receive and compute their tasks and cooperate to complete the overall job. In practical parallel program design the two models are often combined to improve parallel efficiency.
The different nodes then compute the equivalence classes of their data subsets in parallel. In an information system IS = (U, A, V, f), each attribute subset P ⊆ A determines an indiscernibility relation (i.e. an equivalence relation) IND(P):

IND(P) = {(x, y) ∈ U × U : f(x, a) = f(y, a) for every a ∈ P}.

The relation IND(P) constitutes a partition of U, denoted U/IND(P); the intersection of equivalence relations is again an equivalence relation, i.e. [x]_IND(P) = ∩_{a∈P} [x]_IND({a}), and each block of the partition is called an equivalence class.

The partition induced by P on U is written U/P. For X ⊆ U, the sets P̲(X) = {x ∈ U : [x]_P ⊆ X} and P̄(X) = {x ∈ U : [x]_P ∩ X ≠ ∅} are called the lower and upper approximation sets of X; POS_P(X) = P̲(X) is called the P-positive region of X, and POS_C(D) = ∪_{X∈U/D} C̲(X) is the positive region of C with respect to D. In an information system IS, if there exist x, y ∈ U with f(x, C) = f(y, C) and f(x, D) ≠ f(y, D), then IS is called an inconsistent information system and x and y are called inconsistent objects; otherwise IS is called a consistent information system.
The computation results of the different nodes are then collected and identical equivalence classes are merged, yielding the equivalence class partition of the whole information system. Each equivalence class corresponds to one sub-information system; if there are m equivalence classes, the information system is accordingly partitioned into m sub-information systems.
In the parallel attribute core computation stage, tasks are distributed first. The m sub-information systems are distributed to the n nodes in master-slave mode: one node is selected as the master node and the remaining nodes are slave nodes; the master node allocates tasks to the slave nodes and receives their results. Tasks are distributed dynamically; since the order of the sub-information systems does not affect the result, either random or in-order assignment can be used, faster nodes receive more tasks, and each time a sub-information system is distributed to an idle node, until all sub-information systems have been processed.
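The dynamic, fastest-node-gets-more policy can be illustrated without MPI at all. The simulation below is our own pure-Python simplification of the master-slave scheme (no real MPI calls): each sub-information system is handed to whichever worker becomes idle first, so workers whose tasks are cheaper end up processing more of them.

```python
from collections import deque

def dynamic_assign(num_subsystems, num_workers, cost):
    """Simulate dynamic master-slave scheduling of sub-information systems.
    `cost[k]` is the simulated processing time of subsystem k. Returns the
    list of subsystem indices handled by each worker."""
    tasks = deque(range(num_subsystems))
    # each worker: [time at which it becomes idle, worker id, tasks done]
    workers = [[0.0, w, []] for w in range(num_workers)]
    while tasks:
        workers.sort(key=lambda t: (t[0], t[1]))  # next idle worker
        idle = workers[0]
        k = tasks.popleft()                       # hand one subsystem to it
        idle[0] += cost[k]
        idle[2].append(k)
    workers.sort(key=lambda t: t[1])
    return [w[2] for w in workers]
```

With two workers and one expensive subsystem, the cheap tasks accumulate on the other worker, mirroring the "faster nodes receive more tasks" behavior described above.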
Each node then computes attribute cores in parallel. The different nodes first build a decision discrimination matrix for each sub-information system in parallel. For an information system IS = (U, A, V, f) with U = {x_1, ..., x_n}, the decision discrimination matrix is defined as DM = {m_ij}, where m_ij satisfies:

m_ij = {a ∈ C : f(x_i, a) ≠ f(x_j, a)} if f(x_i, D) ≠ f(x_j, D), and m_ij = ∅ otherwise.

In an information system IS, let U/C = {U_1, U_2, ..., U_m}; the information system can then be horizontally partitioned into m sub-information systems, the k-th sub-information system being IS_k = (U_k, A, V, f) (1 ≤ k ≤ m). The sub decision discrimination matrix of IS_k = (U_k, A, V, f) is defined as DM_k = {m_ij^(k)}, where m_ij^(k) is defined over the objects of U_k in the same way as m_ij.
The attribute core of each sub-information system is then computed. The core attribute set of sub-information system IS_k = (U_k, A, V, f) is defined as DCORE_k(C) and satisfies:

DCORE_k(C) = {a ∈ C : there exist i, j with m_ij^(k) = {a}},

i.e. the attributes that appear as singleton entries of the sub decision discrimination matrix; the attribute core of the decision discrimination matrix of IS = (U, A, V, f) is defined correspondingly. The nodes finally communicate to merge the attribute cores: according to the definition of the attribute core, the merge operation takes the union of the attribute cores computed on the nodes.
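The construction of the decision discrimination matrix defined above can be sketched as follows; the object representation (a dict of attribute dicts) and the names are assumptions for illustration, and only pairs with differing decision values contribute entries.

```python
def decision_discrimination_matrix(objs, cond, dec):
    """Sketch of the decision discrimination matrix DM = {m_ij}: for each
    pair of objects with different decision values, record the set of
    condition attributes on which they differ. Pairs with equal decision
    values would contribute the empty set and are omitted here."""
    names = list(objs)
    dm = []
    for i, x in enumerate(names):
        for y in names[i + 1:]:
            if objs[x][dec] != objs[y][dec]:
                m = {a for a in cond if objs[x][a] != objs[y][a]}
                dm.append(m)
    return dm

# illustrative three-object decision table
sample_is = {"x1": {"a": 0, "b": 0, "d": 0},
             "x2": {"a": 0, "b": 1, "d": 1},
             "x3": {"a": 1, "b": 1, "d": 0}}
```

On `sample_is`, the pairs (x1, x2) and (x2, x3) differ in decision value, giving the entries {b} and {a}; both are singletons, so both attributes would belong to the core.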
In the parallel attribute reduction stage, tasks are first divided as in the attribute core computation stage: each node processes its corresponding sub-information systems and solves their attribute reductions.

Given an information system IS = (U, A, V, f) with A = C ∪ D, where subsets C and D are the condition attribute set and the decision attribute set, a subset P ⊆ C is called a reduction of IS if γ_P(D) = γ_C(D) and, for every proper subset B ⊂ P, γ_B(D) ≠ γ_C(D).
According to the definition of attribute reduction, the master node broadcasts the attribute core to each slave node by collective communication. In the sub decision discrimination matrix, the value of every element containing a core attribute is set to ∅, yielding a new matrix. For every element m_ij of the matrix whose value is a non-empty set, the corresponding disjunctive logical expression L_ij = ∨_{a ∈ m_ij} a is established; the conjunction of all these expressions, L = ∧_{m_ij ≠ ∅} L_ij, gives a conjunctive normal form, which is then converted into disjunctive normal form. Finally, all core attributes are added to each conjunct of the disjunctive normal form, yielding the attribute reduction result.

The attribute reduction results are finally merged through communication: each slave node submits its result to the master node, which takes the conjunction of all reduction results and converts it to disjunctive normal form, thereby obtaining the attribute reduction of the whole information system as the final result.
MPI communication mechanism: MPI communication refers to the exchange of messages and data between the concurrent processes of a program. According to the target of the message transfer, communication falls into two classes: point-to-point communication and collective communication.

MPI provides two broad types of point-to-point communication functions. The first type is blocking; the second is non-blocking. A blocking function waits until the specified operation has actually completed, or at least until the data involved have been safely backed up by the MPI system, before returning. A non-blocking function call always returns immediately, and the actual operation is carried out by the MPI system in the background. For point-to-point message sending, MPI provides four send modes: standard mode, buffered mode, synchronous mode, and ready mode.
MPI standard-mode communication counts in units of the transmitted/received data type, and the receive buffer must not be smaller than the data to be received, otherwise an error is raised. If the amount of arriving data is less than the buffer size, only the region at the start of the buffer with the length of the actually received data is overwritten by the received data. To receive data of unknown length, MPI_Probe can be used. The receiving process may specify wildcard receive envelopes, i.e. MPI_ANY_SOURCE and MPI_ANY_TAG, to receive messages with any tag from any source process. It can be seen that the send and receive operations are asymmetric: the sender must give a specific destination address, while the receiver can receive information from any source. The source and destination can also be specified as the same process, but blocking communication should then be avoided because it easily causes deadlock.
In standard mode, the MPI system decides whether to copy the message into a buffer and return immediately (the transmission then being carried out by the MPI system in the background), or to wait until the data have been sent out before returning. Most MPI systems reserve a buffer of a certain size; when the message to be sent is shorter than the buffer, it is copied into the buffer and the call returns immediately, otherwise the call returns only after part or all of the message has been sent. The standard-mode send operation is non-local, because its completion requires contact with the receiver. The standard-mode blocking send function is MPI_Send.
MPI collective communication uses a self-created set of nodes as a communication subset, allowing messages and data to be transmitted only within that subset. Unlike point-to-point communication, collective communication is blocking, so all parallel processes in the set must take part, and the next operation can proceed only after the collective operation has completed; otherwise the processes fall into unbounded waiting. Compared with point-to-point communication, collective communication can better exploit parallel efficiency.
By direction, collective communication can be divided into three modes: one-to-many, many-to-one, and many-to-many. Synchronization functions coordinate the progress of processes, in effect setting up a synchronization point: execution continues only after all processes have reached it. Computation functions process the data received by a process.
Collective communication is based on point-to-point communication, but it is not a simple wrapper around it; it is specifically optimized according to its own characteristics. Collective communication greatly eases the programmer's burden: it not only makes parallel programs concise but also improves their performance and efficiency. For example, to send a message to all processes in a communication domain, the user can directly call the collective communication function MPI_Bcast, which requires only a single statement.
Fig. 1 is the flow block diagram of the present invention, comprising the following steps:
(1) Data preprocessing stage.
This stage mainly reads the information system and discretizes its data, as follows:

The data sets are downloaded from the UCI experimental data platform (http://archive.ics.uci.edu/ml/). The format of a data set is {condition attribute 1, condition attribute 2, ..., condition attribute n, decision attribute}, where the condition attribute set is {condition attribute 1, condition attribute 2, ..., condition attribute n} and the decision attribute set is {decision attribute 1, decision attribute 2, ..., decision attribute p}.
A four-tuple IS = (U, A, V, f) is an information system (also a decision table), where U represents the set of all objects in the problem domain, called the universe, and A = C ∪ D is the attribute set, with subsets C and D the condition attribute set and the decision attribute set.

The master node reads the data of the information system and, according to the distribution of the data, converts the continuous data into discrete quantities that meet the actual needs.
(2) Data partitioning stage.
This stage is divided into three steps: task distribution, parallel equivalence class computation, and equivalence class merging:

1. Task distribution.

According to the definition of the information system, the information system is horizontally partitioned into p data subsets, which are distributed to different nodes for the next computation step. Tasks are distributed in master-slave mode: the master node distributes the data subsets to the slave nodes.
2. Parallel equivalence class computation.

According to the definition of equivalence classes, each node computes the equivalence classes in its data subsets in parallel; the computation uses the peer-to-peer mode, i.e. the master node also acts as a computing node.

In an information system IS = (U, A, V, f), each attribute subset P ⊆ A determines an indiscernibility relation (i.e. an equivalence relation) IND(P):

IND(P) = {(x, y) ∈ U × U : f(x, a) = f(y, a) for every a ∈ P}.

The relation IND(P) constitutes a partition of U, denoted U/IND(P); the intersection of equivalence relations is again an equivalence relation, i.e. [x]_IND(P) = ∩_{a∈P} [x]_IND({a}), and each block of the partition is called an equivalence class.
A simple example below illustrates how equivalence classes are formed.
Table 1: Influenza data set example
Object number | Headache | Myalgia | Body temperature | Influenza |
e1 | Yes | Yes | Normal | No |
e2 | Yes | Yes | High | Yes |
e3 | Yes | Yes | Very high | Yes |
e4 | No | Yes | Normal | No |
e5 | No | No | High | No |
e6 | No | Yes | Very high | Yes |
Classifying by myalgia:
U/{myalgia} = {{e1, e2, e3, e4, e6}, {e5}}.
Classifying jointly by the two attributes headache and influenza:
U/{headache, influenza} = {{e1}, {e2, e3}, {e4, e5}, {e6}}.
The universe can thus be classified by different criteria, yielding different concepts and abstractions.
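The partition U/IND(P) is computed by grouping objects that agree on every attribute of P. A minimal single-process Python sketch on the Table 1 influenza data (attribute names abbreviated):

```python
from collections import defaultdict

def partition(universe, table, attrs):
    """Group objects whose values agree on every attribute in `attrs`
    (the equivalence classes of IND(attrs))."""
    classes = defaultdict(list)
    for x in universe:
        key = tuple(table[x][a] for a in attrs)
        classes[key].append(x)
    return sorted(classes.values())

# Table 1, the influenza data set
flu = {
    "e1": {"headache": "yes", "myalgia": "yes", "temp": "normal",    "flu": "no"},
    "e2": {"headache": "yes", "myalgia": "yes", "temp": "high",      "flu": "yes"},
    "e3": {"headache": "yes", "myalgia": "yes", "temp": "very high", "flu": "yes"},
    "e4": {"headache": "no",  "myalgia": "yes", "temp": "normal",    "flu": "no"},
    "e5": {"headache": "no",  "myalgia": "no",  "temp": "high",      "flu": "no"},
    "e6": {"headache": "no",  "myalgia": "yes", "temp": "very high", "flu": "yes"},
}
U = sorted(flu)
print(partition(U, flu, ["myalgia"]))
# [['e1', 'e2', 'e3', 'e4', 'e6'], ['e5']]
print(partition(U, flu, ["headache", "flu"]))
# [['e1'], ['e2', 'e3'], ['e4', 'e5'], ['e6']]
```

Grouping on value tuples is also how each node processes its horizontal slice in the parallel stage; only the key-building step depends on the chosen attribute subset.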
3. Merging equivalence classes.
Using the master-slave mode, the master node collects the computation results of the slave nodes and merges identical equivalence classes, obtaining the
equivalence-class partition of the whole information system. This partition is the basis for dividing the information system: each equivalence class corresponds to one
sub-information system, so if there are m equivalence classes there are m sub-information systems, completing the division of the data.
(3) Parallel attribute core stage.
This stage comprises three steps: task distribution, parallel computation of the sub-information systems' attribute cores, and merging of attribute cores.
1. Task distribution.
The MPI parallel programming model combines the master-slave mode with the peer mode, as shown in Fig. 2 and Fig. 3; the nodes cooperate
to complete the task jointly, improving parallel efficiency.
Because the order of the sub-information systems does not affect the result, they can be assigned in order or at random;
whichever node processes quickly can be assigned more sub-information systems. Each time, a sub-information system is assigned to an idle node,
and the results of the nodes arrive in no particular order. Node task distribution is shown in Fig. 4.
For example, 7 sub-information systems are processed by 3 nodes: three of them are first given to node 1, node 2, and node 3.
If node 2 finishes first while four sub-information systems remain unprocessed, the next task is assigned to node 2;
if node 3 is idle, a task is assigned to node 3, and so on, until all sub-information systems have been processed.
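The first-idle-first-served policy above can be illustrated with a small scheduling simulation in plain Python (this sketches the dispatch policy only, not the MPI implementation; the per-task processing times are hypothetical):

```python
import heapq

def dynamic_dispatch(num_tasks, speeds):
    """Simulate first-idle-first-served assignment: each task goes to
    whichever node becomes free earliest, so fast nodes receive more work.
    `speeds[i]` is the (hypothetical) time node i needs per sub-system."""
    # Heap of (time the node becomes idle, node id)
    idle = [(0.0, node) for node in range(len(speeds))]
    heapq.heapify(idle)
    assignment = []
    for task in range(num_tasks):
        t, node = heapq.heappop(idle)          # earliest-idle node
        assignment.append((task, node))
        heapq.heappush(idle, (t + speeds[node], node))
    return assignment

# 7 sub-information systems on 3 nodes; node 1 (index 1) is twice as fast
plan = dynamic_dispatch(7, [2.0, 1.0, 2.0])
counts = [sum(1 for _, n in plan if n == node) for node in range(3)]
print(counts)  # [2, 3, 2] -- the fast node ends up with more sub-systems
```

In the MPI version the same policy emerges naturally: the master posts a receive for "done" messages and sends the next sub-information system to whichever worker replied first.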
2. Parallel computation of attribute cores.
Each node computes the attribute core of its assigned sub-information systems in parallel. For each sub-information system,
a sub decision discernibility matrix is built first.
According to the definition of the decision discernibility matrix, each node builds the matrix for its assigned sub-information systems in parallel.
For a sub-information system ISk = (Uk, A, V, f), the sub decision discernibility matrix DMk records, for each pair of objects with different decision values, the attributes that discern them: the condition attributes on which the two objects differ, together with the decision attribute D when the objects are inconsistent (equal condition values but different decision values).
Then the attribute core of each sub-information system is computed.
According to the definition of the attribute core, each node computes the cores of its sub-information systems in parallel; the flowchart is shown in
Fig. 5. The core attributes of a sub-information system ISk = (Uk, A, V, f) are denoted DCOREk(C): the condition attributes that appear as singleton entries of the sub decision discernibility matrix, i.e. that by themselves discern some pair of objects.
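To make the two definitions concrete, here is a minimal single-process sketch on the influenza data of Table 1 (this illustrates the standard rough-set discernibility matrix and singleton-entry core, not the patented sub-matrix layout):

```python
from itertools import combinations

# Table 1, the influenza data set; index 3 is the decision attribute (flu)
flu = {
    "e1": ("yes", "yes", "normal",    "no"),
    "e2": ("yes", "yes", "high",      "yes"),
    "e3": ("yes", "yes", "very high", "yes"),
    "e4": ("no",  "yes", "normal",    "no"),
    "e5": ("no",  "no",  "high",      "no"),
    "e6": ("no",  "yes", "very high", "yes"),
}
COND = ("headache", "myalgia", "temp")

def discernibility_matrix(table):
    """For each pair of objects with different decision values, record the
    condition attributes on which they differ."""
    m = {}
    for x, y in combinations(sorted(table), 2):
        if table[x][3] != table[y][3]:
            m[(x, y)] = {COND[i] for i in range(3) if table[x][i] != table[y][i]}
    return m

def attribute_core(matrix):
    """Core = union of all singleton entries: attributes that alone
    discern some pair of objects and hence cannot be dropped."""
    core = set()
    for entry in matrix.values():
        if len(entry) == 1:
            core |= entry
    return core

dm = discernibility_matrix(flu)
print(attribute_core(dm))  # {'temp'}
```

For this table only body temperature discerns, e.g., e1 from e2 on its own, so the core is {temp}, in line with the classical analysis of this data set.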
3. Merging attribute cores.
The attribute core of the whole information system IS = (U, A, V, f) is likewise defined from its decision discernibility matrix and is denoted DCORE(C).
According to the relation between the attribute cores of the information system and of its sub-information systems, the nodes communicate and merge
their results: the attribute cores computed at each node are combined by union.
(4) Parallel attribute reduction stage.
This stage comprises two steps: computing the sub-information systems' attribute reductions in parallel and merging the attribute reductions.
1. Computing sub-information system attribute reductions in parallel.
Given an information system IS = (U, A, V, f) with attribute set A = C ∪ D, where the subsets C and D are the condition attribute
set and the decision attribute set, a subset P ⊆ C is called a reduction of IS
if γ(P, D) = γ(C, D) and, for every proper subset B ⊂ P, γ(B, D) ≠ γ(C, D).
According to the definition of attribute reduction, the master node broadcasts the attribute core to each slave node by collective communication.
In each sub decision discernibility matrix, the value of every element containing a core attribute is set to the empty set, yielding a new matrix.
For every element cij of the discernibility matrix whose value is a non-empty set, a disjunctive logical expression Lij (the disjunction of the attributes in cij) is built.
All the disjunctive expressions Lij are combined by conjunction, giving a conjunctive normal form.
The conjunctive normal form is then converted into disjunctive normal form.
Finally, all core attributes are added to each conjunct of the disjunctive normal form, giving the attribute
reduction results of all sub-information systems.
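The conversion from the conjunctive discernibility function to disjunctive normal form can be sketched as follows; the clauses below are the distinct discernibility entries of the Table 1 influenza data, and the resulting conjuncts are its reducts (a sketch of the general CNF-to-DNF technique with absorption, not the patented sub-matrix variant):

```python
from itertools import product

def reducts_from_clauses(clauses):
    """Turn the discernibility function (a conjunction of attribute-set
    clauses) into its prime implicants, i.e. the attribute reducts."""
    # Absorption law A ∧ (A ∨ B) = A: drop clauses that contain another clause.
    minimal = [c for c in clauses if not any(o < c for o in clauses)]
    # Expand CNF into DNF by choosing one attribute from every clause ...
    candidates = {frozenset(choice) for choice in product(*minimal)}
    # ... then keep only the minimal conjuncts.
    return sorted(sorted(r) for r in candidates
                  if not any(o < r for o in candidates))

# The distinct discernibility entries of the influenza data in Table 1
clauses = [frozenset(c) for c in (
    {"temp"}, {"headache", "temp"}, {"headache", "myalgia"},
    {"headache", "myalgia", "temp"}, {"myalgia", "temp"},
)]
print(reducts_from_clauses(clauses))
# [['headache', 'temp'], ['myalgia', 'temp']]
```

Absorption first shrinks the CNF to {temp} ∧ {headache, myalgia}, whose expansion gives the two reducts (temp ∧ headache) and (temp ∧ myalgia); note that the core attribute temp appears in every reduct, which is why adding the core back to each conjunct, as the method does, is sound.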
2. Merging attribute reductions; the flowchart is shown in Fig. 6.
Each slave node communicates with the master node to merge the attribute reduction results. The inter-process communication mode is the same as in the
parallel attribute core stage, using the standard mode. Merging the attribute reduction results means combining all reduction results by conjunction and then
converting to disjunctive normal form, which gives the final result, i.e. the attribute reduction of the whole information system.
Node communication uses the standard communication mode and the collective communication mode. The standard mode is blocking communication:
the sender's send call can only complete in cooperation with the receiver's recv call; the flowchart is shown in Fig. 7. In blocking
standard-mode communication, the MPI environment itself decides whether to buffer the outgoing message: if MPI has buffered the data to be sent,
the send can return immediately even if the receiving side has not yet started to receive. For performance and resource optimization, the MPI
environment provides a limited amount of buffer space; once it is exhausted, the send must block until a matching receive operation has collected
the data. That is, in blocking communication, whether the sender completes depends not only on the state of the local process but also on the state
of the remote receiving process.
The collective communication mode treats a set of nodes created by the program as a communication subset; messages and data are transmitted only
within this subset. Unlike point-to-point communication, collective communication is always blocking, so all parallel processes in the set
must take part, and the next operation can proceed only after the collective operation has completed; otherwise the processes fall into unbounded waiting. Compared with point-to-point
communication, collective communication exploits parallelism more effectively.
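For illustration, the two modes map onto the MPI API roughly as follows. This is a non-runnable pseudocode sketch in mpi4py style (it would need an MPI launcher such as `mpirun -n <p>`), with hypothetical payloads:

```python
# Pseudocode sketch (mpi4py style); run under an MPI launcher, e.g. mpirun
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Standard (blocking) point-to-point mode: send may return once MPI has
# buffered the message; otherwise it blocks until the matching recv runs.
if rank == 0:
    comm.send({"subsystem": "U1"}, dest=1, tag=0)   # master hands out a task
elif rank == 1:
    task = comm.recv(source=0, tag=0)               # blocks until data arrives

# Collective mode: every process in the communicator must call bcast,
# e.g. broadcasting the attribute core from the master to all nodes.
core = {"a", "d"} if rank == 0 else None            # example core from Step (3)
core = comm.bcast(core, root=0)
```

The point-to-point pair implements the master-slave task distribution, while the broadcast matches the collective distribution of the attribute core before the reduction stage.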
The implementation of the present invention is further described below. This example is carried out on the premise of the technical solution of the
present invention and gives a detailed embodiment and a specific operating process, but the protection scope of the present invention is not limited to the
following example.
Table 2: Information system IS
Step (1): data preprocessing. The data of the information system are read in and discretized. For example, condition attribute
a is split at 0.5: values greater than 0.5 are discretized to 1 and values less than 0.5 to 0, so the a-values of the data
objects become 1, 0, 1, 0, 1, 1, 0. Likewise, condition attribute c is discretized to 0 and 1, where 0
represents a negative number (less than 0) and 1 a positive number (greater than 0), so the c-values of the data objects become
1, 1, 1, 1, 1, 0, 0. The result of discretizing the whole information system IS is shown in Table 3.
Table 3: Discretized information system IS0
Step (2): data division. The information system is divided into m equivalence classes; that is, according to the definitions of the information system and the
equivalence class, the information system is divided into m sub-information systems.
1. Task distribution.
The information system is horizontally partitioned into p data subsets, and the master-slave mode is used to assign the p subsets to
different nodes. Suppose IS0 is divided into one data subset per three data objects, giving 3 data subsets IS1, IS2, IS3
assigned to 3 nodes. During computation the peer mode is also used, i.e. the master node computes as well. Thus node 1 is the master node; IS1 is first assigned
to node 2, IS2 to node 3, and IS3 to node 1, completing the task distribution.
Table 3.1: Data subset IS1
Table 3.2: Data subset IS2
Table 3.3: Data subset IS3
2. Parallel computation of equivalence classes.
According to the definition of the equivalence class, each node computes in parallel, per its assigned task, the equivalence classes based on the condition
attributes in its data subset; the peer mode is used, i.e. the master node also serves as a compute node.
Within each node, data objects with identical condition attribute values are grouped together. From the allocation result of the previous step, this
yields IS1', IS2', IS3', as follows:
Table 4.1: Data subset IS1'
Table 4.2: Data subset IS2'
Table 4.3: Data subset IS3'
In data subset IS1', x1 and x3 are placed in the same equivalence class, while x2 forms another equivalence class. In data
subset IS2', the condition attribute values of x4, x5 and x6 are all different, so no objects merge into a common equivalence class; the same holds for IS3'.
3. Merging equivalence classes.
Using the master-slave mode, the master node collects the computation results of the slave nodes and merges identical equivalence classes, obtaining the
equivalence-class partition of the information system; each equivalence class corresponds to one sub-information system, realizing the division of the data.
The condition attributes of x2 in IS1' and x4 in IS2' agree, so they belong to the same equivalence class and can be merged. This finally yields 5
equivalence classes, U1 = {x1, x3}, U2 = {x2, x4}, U3 = {x5}, U4 = {x6}, U5 = {x7}, corresponding to 5 sub-information systems, as
shown below:
Table 5.1: Sub-information system U1
Table 5.2: Sub-information system U2
Table 5.3: Sub-information system U3
Table 5.4: Sub-information system U4
Table 5.5: Sub-information system U5
Step (3): parallel computation of the attribute core. The sub-information systems are distributed to different nodes; each node computes the attribute cores of its
sub-information systems in parallel, and the attribute cores are then merged.
1. Task distribution.
The master node distributes the sub-information systems to the nodes; the master-slave mode is used when distributing tasks and the peer
mode during computation, which improves parallel efficiency. The order of the sub-information systems does not affect the result, so a node
that processes quickly can be assigned more sub-information systems; each time, a sub-information system is assigned to an idle node. Here 3
nodes process 5 sub-information systems: U1 is assigned to node 2, U2 to node 3, and U3 to node 1 (since the master node must also
distribute tasks and collect results, its computing speed is reduced). Node 2 finishes its task first, so U4 is assigned to
node 2; node 1 then finishes, so U5 is assigned to node 1. Thus node 1 processes U3 and U5, node 2 processes U1 and U4, and node 3 processes
U2.
2. Parallel computation of attribute cores. The decision discernibility matrices are built in parallel first; then, by the definition of the attribute core, the cores of the sub-information
systems are computed.
According to the assigned tasks and the definition of the decision discernibility matrix, each node builds the decision discernibility matrix of its sub-information systems U1,
U2, U3, U4, U5. The resulting matrices of the sub-information systems are as follows:
Table 6.1: Sub-information system decision discernibility matrix DM1
Table 6.2: Sub-information system decision discernibility matrix DM2
Table 6.3: Sub-information system decision discernibility matrix DM3
Table 6.4: Sub-information system decision discernibility matrix DM4
Table 6.5: Sub-information system decision discernibility matrix DM5
Next, according to the node task distribution and the definition of the attribute core, the attribute cores of the sub-information systems U1, U2,
U3, U4, U5 are computed in parallel from the sub discernibility matrices. In DM1, an entry D ∈ DM1 is found, i.e. sub-information system U1 is an inconsistent object set, so DM1
yields no core attributes, CORE1(C) = ∅. Likewise, DM2 yields no core attributes, CORE2(C) = ∅. For
DM3, there is no entry D ∈ DM3, but aD ∈ DM3 and dD ∈ DM3, so sub-information system U3 has core attributes CORE3(C) = {a, d};
in DM4, there is no D ∈ DM4 but aD ∈ DM4, so sub-information system U4 has core attributes CORE4(C) = {a}; in DM5,
there is no D ∈ DM5 but aD ∈ DM5, so sub-information system U5 has core attributes CORE5(C) = {a}.
3. Merging attribute cores.
According to the relation between the attribute cores of the information system and its sub-information systems, the master node collects, via the master-slave mode, the
results computed by the other nodes and merges them: the attribute cores computed at each node are combined by union.
By the definition of the attribute core and step 2, CORE1(C) = CORE2(C) = ∅, CORE3(C) = {a, d},
CORE4(C) = {a}, CORE5(C) = {a}.
From the above steps, the attribute core of information system IS is DCORE(C) = {a, d}, which agrees with the
core attributes obtained by the positive-region method.
Step (4): parallel computation of the attribute reduction. With the task distribution of step (3), each node computes the attribute reductions of its sub-information
systems in parallel without interference, and finally the master node merges the results.
1. Computing sub-information system attribute reductions in parallel. According to the definition of attribute reduction, the reductions are obtained from the
results of the previous step.
In the sub decision discernibility matrices, the entries that contain only the decision attribute and those that do not contain the decision attribute are set to ∅, and the
entries of elements containing core attributes are modified in parallel to ∅, yielding new matrices as follows:
Table 7.1: Modified decision discernibility matrix DM1
Table 7.2: Modified decision discernibility matrix DM2
Table 7.3: Modified decision discernibility matrix DM3
Table 7.4: Modified decision discernibility matrix DM4
Table 7.5: Modified decision discernibility matrix DM5
Then the corresponding disjunctive logical expressions are built for each sub-information system; e.g. for DM4 the expression b ∨
c is built. All the disjunctive expressions are combined by conjunction to obtain a conjunctive normal form, which is converted to disjunctive normal
form and simplified to b ∨ c. Finally all core attributes are added to each conjunct of the disjunctive normal form, giving the attribute
reduction result of each sub-information system; each conjunct is one attribute reduction. DM1 gives the result (a ∧ b ∧ d) ∨ (a ∧ c
∧ d), i.e. the attribute reduction of DM1 is a ∧ b ∧ d or a ∧ c ∧ d; DM4 gives (a ∧ b ∧ d) ∨ (a ∧ c ∧ d), i.e.
the attribute reduction of DM4 is a ∧ b ∧ d or a ∧ c ∧ d; the rest are ∅.
2. Merging attribute reductions. Using the master-slave mode, the master node collects the computation results of the other nodes and combines
them.
When merging the attribute reduction results, all non-empty reduction results are combined by conjunction, i.e. ((a ∧ b ∧ d) ∨ (a
∧ c ∧ d)) ∧ ((a ∧ b ∧ d) ∨ (a ∧ c ∧ d)), which simplifies to the disjunctive normal form (a ∧ b ∧ d) ∨ (a ∧ c ∧ d). This is the final
result, i.e. the attribute reduction of the whole information system: a ∧ b ∧ d or a ∧ c ∧ d.
The above embodiment should be understood as merely illustrating the present invention rather than limiting its scope.
After reading the disclosure of the present invention, those skilled in the art can make various changes or modifications to it, and such equivalent
changes and modifications likewise fall within the scope of the claims of the present invention.
Claims (8)
1. A method for computing an information system attribute reduction in parallel based on MPI, characterized by comprising the following steps:
Step 1): in the data preprocessing phase, read the data of the information system and preprocess the values, i.e. perform discretization,
turning continuous data discrete according to the characteristics of the data;
Step 2): horizontally partition the information system uniformly, in units of samples, into p sample data subsets, and assign the p sample
data subsets to n nodes; each node computes in parallel the equivalence classes of its data subset according to the condition attributes; then the
results of the nodes are integrated to obtain the m-equivalence-class partition of the whole information system, each equivalence class corresponding to one sub-information
system;
Step 3): distribute the m sub-information systems to the n nodes; each node computes in parallel the attribute core of its assigned
sub-information systems until all sub-information systems have been processed; then merge the results of the nodes to obtain the attribute
core of the original information system;
Step 4): finally, send the attribute core of the original information system to each node to compute the attribute reductions in parallel, then merge and
integrate the attribute reduction results of the nodes to obtain the attribute reduction result of the whole information system.
2. The method for computing an information system attribute reduction in parallel based on MPI according to claim 1, characterized in that
reading the information system in the data preprocessing phase of step 1) specifically comprises: the information system, namely a decision table, is a four-tuple
IS = (U, A, V, f), where U represents the set of all objects in the problem domain, called the universe; A = C ∪ D is the attribute set, whose
subsets C and D represent the condition attribute set and the decision attribute set, respectively; Va is the value domain of attribute a; and f: U × A → V is an
information function that assigns an information value to each attribute of each object, i.e. for every a ∈ A and x ∈ U, f(x, a) ∈ Va.
3. The method for computing an information system attribute reduction in parallel based on MPI according to claim 2, characterized in that,
when discretizing the continuous data of the information system, a discretization method is chosen according to the characteristics of the data, including
equal-width interval, equal-frequency interval, attribute-importance-based, and clustering-based discretization methods.
4. The method for computing an information system attribute reduction in parallel based on MPI according to any one of claims 1-3, characterized
in that the equivalence-class partition of the information system in step 2) classifies the universe by the condition attributes using the equivalence relation; the
condition attribute set of the data set has the form {condition attribute 1, condition attribute 2, ..., condition attribute p}; an equivalence class contains consistent
objects and inconsistent objects: if both the condition attributes and the decision attribute agree, the objects are consistent; if the condition attributes agree but the decision
attributes differ, they are inconsistent objects.
5. The method for computing an information system attribute reduction in parallel based on MPI according to claim 4, characterized in that
step 3) distributes the m sub-information systems to the n nodes; the master-slave mode is adopted when distributing tasks: one node is selected as the master
node and the remaining nodes are slave nodes; the master node is responsible for allocating tasks to the slave nodes and receiving the task execution results
from the slave nodes; task distribution uses a dynamic allocation scheme, assigning either randomly or in order, so that nodes that process quickly are assigned
more tasks; each time, a sub-information system is assigned to an idle node, until all sub-information systems have been processed.
6. The method for computing an information system attribute reduction in parallel based on MPI according to claim 5, characterized in that
the parallel computation of attribute cores in step 3) creates a sub decision discernibility matrix on each node; if the sub decision discernibility matrix of a sub-information system
contains the decision attribute D, i.e. the sub-information system is an inconsistent object set, the attribute core of that sub-information system is ∅; otherwise, the
single condition attributes that determine the decision in the sub-information system are found, and the union of these single attributes is the attribute
core of the sub-information system.
7. The method for computing an information system attribute reduction in parallel based on MPI according to claim 2, characterized in that
computing the attribute reductions of the sub-information systems in parallel comprises: in each sub decision discernibility matrix, setting the value of every element containing a core attribute to the empty
set, thereby obtaining a new matrix; building the corresponding disjunctive logical expressions; combining all disjunctive logical expressions by
conjunction to obtain a conjunctive normal form; converting the conjunctive normal form into disjunctive normal form; and finally adding all core attributes to each
conjunct of the disjunctive normal form, obtaining the attribute reduction result of the sub-information system.
8. The method for computing an information system attribute reduction in parallel based on MPI according to claim 6 or 7, characterized in that
creating the sub decision discernibility matrix specifically comprises: for each partition class, finding, for each element in the class, the attributes that
distinguish it from the other elements.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611259383.XA CN106598743B (en) | 2016-12-30 | 2016-12-30 | MPI-based method for parallel attribute reduction of information system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106598743A true CN106598743A (en) | 2017-04-26 |
CN106598743B CN106598743B (en) | 2020-06-16 |
Family
ID=58581486
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611259383.XA Active CN106598743B (en) | 2016-12-30 | 2016-12-30 | MPI-based method for parallel attribute reduction of information system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106598743B (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107544945A (en) * | 2017-08-31 | 2018-01-05 | 北京语言大学 | The distribution of decision table and change precision part reduction method |
CN107590566A (en) * | 2017-09-16 | 2018-01-16 | 西安科技大学 | The air duct air outlet regulation regulation obtaining method of the optimal dust field of fully mechanized workface |
CN107958266A (en) * | 2017-11-21 | 2018-04-24 | 重庆邮电大学 | It is a kind of based on MPI and be about to connection attribute carry out discretization method |
CN108345999A (en) * | 2018-02-09 | 2018-07-31 | 重庆科技学院 | A kind of manufacture system production process information reduction method based on Dynamic Programming |
CN108599151A (en) * | 2018-04-28 | 2018-09-28 | 国网山东省电力公司电力科学研究院 | A kind of Model in Reliability Evaluation of Power Systems dynamic parallel computational methods |
CN112749012A (en) * | 2021-01-15 | 2021-05-04 | 北京智芯微电子科技有限公司 | Data processing method, device and system of terminal equipment and storage medium |
CN112988871A (en) * | 2021-03-23 | 2021-06-18 | 重庆飞唐网景科技有限公司 | Information compression transmission method for MPI data interface in big data |
CN117111585A (en) * | 2023-09-08 | 2023-11-24 | 广东工业大学 | Numerical control machine tool health state prediction method based on tolerance sub-relation rough set |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020101421A1 (en) * | 2001-01-31 | 2002-08-01 | Kim Pallister | Reducing detail in animated three-dimensional models |
CN102539132A (en) * | 2011-12-16 | 2012-07-04 | 西安交通大学 | Method used for evaluating double-shaft linkage performance of numerical control machine and based on rough set |
CN103646118A (en) * | 2013-12-27 | 2014-03-19 | 重庆绿色智能技术研究院 | Confidence dominance-based rough set analysis model and attribute reduction methods |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020101421A1 (en) * | 2001-01-31 | 2002-08-01 | Kim Pallister | Reducing detail in animated three-dimensional models |
CN102539132A (en) * | 2011-12-16 | 2012-07-04 | 西安交通大学 | Method used for evaluating double-shaft linkage performance of numerical control machine and based on rough set |
CN103646118A (en) * | 2013-12-27 | 2014-03-19 | 重庆绿色智能技术研究院 | Confidence dominance-based rough set analysis model and attribute reduction methods |
Non-Patent Citations (2)
Title |
---|
Zhu Wei et al., "MPI runtime parameter optimization based on attribute reduction", Computer and Modernization * |
Yang Chuanjian et al., "Attribute reduction algorithm for horizontally partitioned decision tables", Computer Engineering and Applications * |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107544945A (en) * | 2017-08-31 | 2018-01-05 | 北京语言大学 | The distribution of decision table and change precision part reduction method |
CN107590566A (en) * | 2017-09-16 | 2018-01-16 | 西安科技大学 | The air duct air outlet regulation regulation obtaining method of the optimal dust field of fully mechanized workface |
CN107958266A (en) * | 2017-11-21 | 2018-04-24 | 重庆邮电大学 | It is a kind of based on MPI and be about to connection attribute carry out discretization method |
CN108345999A (en) * | 2018-02-09 | 2018-07-31 | 重庆科技学院 | A kind of manufacture system production process information reduction method based on Dynamic Programming |
CN108345999B (en) * | 2018-02-09 | 2021-09-28 | 重庆科技学院 | Manufacturing system production process information reduction method based on dynamic programming |
CN108599151A (en) * | 2018-04-28 | 2018-09-28 | 国网山东省电力公司电力科学研究院 | A kind of Model in Reliability Evaluation of Power Systems dynamic parallel computational methods |
CN112749012A (en) * | 2021-01-15 | 2021-05-04 | 北京智芯微电子科技有限公司 | Data processing method, device and system of terminal equipment and storage medium |
CN112749012B (en) * | 2021-01-15 | 2024-05-28 | 北京智芯微电子科技有限公司 | Data processing method, device and system of terminal equipment and storage medium |
CN112988871A (en) * | 2021-03-23 | 2021-06-18 | 重庆飞唐网景科技有限公司 | Information compression transmission method for MPI data interface in big data |
CN117111585A (en) * | 2023-09-08 | 2023-11-24 | 广东工业大学 | Numerical control machine tool health state prediction method based on tolerance sub-relation rough set |
CN117111585B (en) * | 2023-09-08 | 2024-02-09 | 广东工业大学 | Numerical control machine tool health state prediction method based on tolerance sub-relation rough set |
Also Published As
Publication number | Publication date |
---|---|
CN106598743B (en) | 2020-06-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106598743A (en) | Attribute reduction method for information system based on MPI parallel solving | |
US8381230B2 (en) | Message passing with queues and channels | |
CN104298713B (en) | A kind of picture retrieval method based on fuzzy clustering | |
WO2018166270A2 (en) | Index and direction vector combination-based multi-objective optimisation method and system | |
CN107329828A (en) | A kind of data flow programmed method and system towards CPU/GPU isomeric groups | |
CN108021449A (en) | One kind association journey implementation method, terminal device and storage medium | |
Reddy et al. | A review on density-based clustering algorithms for big data analysis | |
CN111738341A (en) | Distributed large-scale face clustering method and device | |
Abbasi et al. | Enhancing the performance of decision tree-based packet classification algorithms using CPU cluster | |
CN107958266A (en) | It is a kind of based on MPI and be about to connection attribute carry out discretization method | |
Duan et al. | Distributed in-memory vocabulary tree for real-time retrieval of big data images | |
US8543722B2 (en) | Message passing with queues and channels | |
Zhang et al. | Egraph: efficient concurrent GPU-based dynamic graph processing | |
CN108776814A (en) | A kind of Electric Power Communication Data resource parallelization clustering method | |
Volk et al. | Clustering uncertain data with possible worlds | |
CN108880871A (en) | A kind of wireless sensor network topology resource distribution method and device | |
CN104063230B (en) | The parallel reduction method of rough set based on MapReduce, apparatus and system | |
Ding et al. | Efficient probabilistic skyline query processing in mapreduce | |
Song et al. | Towards modeling large-scale data flows in a multidatacenter computing system with petri net | |
Atrushi et al. | Distributed Graph Processing in Cloud Computing: A Review of Large-Scale Graph Analytics | |
Schreiber et al. | SFC-based communication metadata encoding for adaptive mesh refinement | |
Lin et al. | A parallel Cop-Kmeans clustering algorithm based on MapReduce framework | |
CN103942235A (en) | Distributed computation system and method for large-scale data set cross comparison | |
Sakouhi et al. | Hammer lightweight graph partitioner based on graph data volumes | |
CN108875786B (en) | Optimization method of consistency problem of food data parallel computing based on Storm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||