CN104537383A

CN104537383A - Massive organizational structure data classification method and system based on particle swarm

Info

Publication number: CN104537383A
Application number: CN201510027069.8A
Authority: CN
Inventors: 孙镇; 孙泰; 赵捷; 袁辉; 金江; 李晟飞; 钱晓东; 宫政
Original assignee: NATIONAL ADMINISTRATION FOR CODE ALLOCATION TO ORGANIZATIONS
Current assignee: NATIONAL ADMINISTRATION FOR CODE ALLOCATION TO ORGANIZATIONS
Priority date: 2015-01-20
Filing date: 2015-01-20
Publication date: 2015-04-22

Abstract

The invention discloses a massive organizational structure data classification method based on a particle swarm. The method comprises the following steps: firstly constructing a data classification rule by adopting the particle swarm; establishing classification rules of different industries; then, acquiring pre-selected data in the massive organizational structure data as a training set and a testing set, and carrying out data set rule covering process and testing evaluation according to the constructed classification rule to obtain a final classifier; and finally classifying the massive organizational structure data by using the final classifier to obtain a classification result. According to the method provided by the invention, the relationships in industrial data of the massive organizational structure data are fully considered, and the characteristic that data is processed by using a particle swarm algorithm is fully utilized, so that the industrial data of the massive organizational structure data is quickly and accurately classified. Therefore, the massive organizational structure data classification method based on the particle swarm has certain reliability and accuracy in industrial classification of organizational structures and massive data processing.

Description

A massive organizational structure data classification method and system based on particle swarm

Technical field

The present invention relates to the field of intelligent computation on massive data, and in particular to a massive organizational structure data classification method based on particle swarm.

Background technology

Massive organizational structure data contains information such as organization addresses, organization codes, and administrative division codes for organizations of different industries, fields, and levels. Its data structures vary and its content is complex. Classifying this data helps improve the efficiency of data extraction and data retrieval, and also makes it possible to mine further information within each industry category, such as an organization's scope of business and the types of products it deals in.

Existing classification rule mining methods each have their own shortcomings. For example, the prior probabilities required by statistical methods are hard to justify theoretically; machine learning methods have poor fault tolerance in noisy environments; rough set methods cannot determine the degree of membership of a member; and neural network methods have so many nodes and connection weights that the results are hard to understand and verify. Across different application fields and data types, each method therefore has its strengths and weaknesses, and no single classification algorithm is better than all others for every application and data type.

A method for accurately classifying massive data is therefore urgently needed.

Summary of the invention

(1) Technical problem to be solved

The object of the present invention is to provide a massive organizational structure data classification method based on particle swarm. Taking into account the internal coordination among organizations, the method classifies organizations by industry using a particle-swarm-based classification algorithm.

(2) Technical solution

The first object of the present invention is to propose a massive organizational structure data classification method based on particle swarm; the second object is to propose a massive organizational structure data classification system based on particle swarm.

The first object of the present invention is achieved through the following technical solution:

The massive organizational structure data classification method based on particle swarm provided by the invention comprises the following steps:

Step 1: construct data classification rules using a particle swarm; establish classification rules for different industries, and encode the different classification rules using Michigan encoding;

Step 2: obtain pre-selected data from the massive organizational structure data as a training set, carry out the data set rule covering process according to the constructed classification rules, and generate a classification rule set;

Step 3: obtain another pre-selected part of the massive organizational structure data as a test set, test and evaluate the classification rule set on the test set, and retain the classification rules that meet the evaluation requirements as the final classifier;

Step 4: classify the massive organizational structure data with the final classifier to obtain the classification result.

Further, the data classification rules in step 1 are constructed as follows:

Each classification rule comprises a condition part and a conclusion part. The condition part is a set of logical tests joined by logical connectives; the conclusion part is the class of the examples covered by the condition part. Each particle represents one record in a table of the organizational structure data entity tables.

Further, the data set rule covering process in step 2 is carried out according to the following steps:

S21: initialize the rules. Initialize the swarm by randomly initializing the upper and lower bounds of each class within the valid interval of the classification variables; the initialized particles form the initial candidate solutions of the rule extraction algorithm, and an initial industry classification is set for the different organizational structure data;

S22: determine the best particle in the swarm, calculating the particle fitness value according to the following formula:

$$f(x) = \frac{tp}{pos} \times \frac{tn}{neg} \qquad (1)$$

where:

$f(x)$ is the particle fitness value;

$tp$ is the number of correctly classified examples, i.e. examples covered by the rule and correctly classified;

$tn$ is the number of correctly rejected examples, i.e. examples not covered by the rule whose class also differs from the training target;

$pos$ is the total number of positive samples in the training data set;

$neg$ is the total number of negative samples in the training data set;
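As a rough illustration of formula (1), the following Python sketch computes the fitness of a single rule on a labelled training set. The function name `rule_fitness`, the `covers` predicate, and the choice to return 0.0 when the training set has no positive or no negative samples are assumptions made for this example; the text does not specify them.

```python
from typing import Callable, Hashable, Sequence


def rule_fitness(
    covers: Callable[[Sequence[float]], bool],
    examples: Sequence[Sequence[float]],
    labels: Sequence[Hashable],
    target_class: Hashable,
) -> float:
    """Return (tp / pos) * (tn / neg) for one rule, as in formula (1)."""
    tp = tn = pos = neg = 0
    for x, y in zip(examples, labels):
        covered = bool(covers(x))
        if y == target_class:
            pos += 1            # positive sample
            if covered:
                tp += 1         # covered by the rule and correctly classified
        else:
            neg += 1            # negative sample
            if not covered:
                tn += 1         # correctly rejected by the rule
    if pos == 0 or neg == 0:
        return 0.0              # assumption: degenerate data gets zero fitness
    return (tp / pos) * (tn / neg)
```

Because the two ratios are multiplied, a rule scores well only if it both covers most positive samples and rejects most negative ones.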

S23: update the particles to adapt the rule set according to the following formulas:

$$v_{ij}(t+1) = v_{ij}(t) + c_1 r_{1j}\bigl(p_{ij}(t) - x_{ij}(t)\bigr) + c_2 r_{2j}\bigl(p_{gj}(t) - x_{ij}(t)\bigr) \qquad (2)$$

$$x_{ij}(t+1) = x_{ij}(t) + v_{ij}(t+1) \qquad (3)$$

where

formula (2) is the velocity update equation for the $j$-th dimension of particle $i$;

formula (3) is the position update equation for the $j$-th dimension of particle $i$;

$t$ denotes the $t$-th generation; $c_1$ and $c_2$ are acceleration constants with values between 0 and 2; $r_{1j} \sim U(0,1)$ and $r_{2j} \sim U(0,1)$ are two mutually independent random numbers; $c_1$ adjusts the step size with which a particle flies toward its own best position, and $c_2$ adjusts the step size with which it flies toward the global best position;

$X_i = (x_{i1}, x_{i2}, \ldots, x_{in})$ is the current position of particle $i$;

$V_i = (v_{i1}, v_{i2}, \ldots, v_{in})$ is the current velocity of particle $i$;

during evolution, the best position found so far by each particle, $P_i = (p_{i1}, p_{i2}, \ldots, p_{in})$, and the global best position over all particles, $P_g = (p_{g1}, p_{g2}, \ldots, p_{gn})$, are recorded;
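Formulas (2) and (3) can be sketched in Python as a single update step for one particle. The function name `pso_step`, the list-based representation, and the default values c1 = c2 = 2.0 (the upper end of the stated 0 to 2 range) are illustrative assumptions; like formula (2), the sketch uses no inertia weight and no velocity clamp.

```python
import random
from typing import List, Sequence, Tuple


def pso_step(
    x: Sequence[float],
    v: Sequence[float],
    p_best: Sequence[float],
    g_best: Sequence[float],
    c1: float = 2.0,
    c2: float = 2.0,
    rng: random.Random = random.Random(),
) -> Tuple[List[float], List[float]]:
    """Apply formulas (2) and (3) to every dimension j of one particle."""
    new_x: List[float] = []
    new_v: List[float] = []
    for j in range(len(x)):
        r1 = rng.random()  # r_1j ~ U(0, 1)
        r2 = rng.random()  # r_2j ~ U(0, 1), drawn independently of r_1j
        vj = v[j] + c1 * r1 * (p_best[j] - x[j]) + c2 * r2 * (g_best[j] - x[j])
        new_v.append(vj)           # formula (2)
        new_x.append(x[j] + vj)    # formula (3)
    return new_x, new_v
```

When a particle already sits at both its own best position and the global best position, the two attraction terms vanish and the particle simply continues with its current velocity.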

S24: prune the rule set according to the following step:

by comparing the particle fitness values on the training data set, determine the current best position and the global best position;

S25: judge whether the number of iterations has reached the maximum number of generations or all the data has been correctly classified; if so, go to step S26, otherwise go to step S23;

S26: put the generated rule for the $i$-th class into the rule set, then remove the examples covered by that rule from the data set, and check whether the number of remaining examples is less than the set value; if so, rule extraction for this class is complete, otherwise go to step S21.
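Steps S21 to S26 amount to a covering loop: learn one rule, remove the examples it covers, and repeat until few examples remain. The following Python sketch shows that outer loop only, under stated assumptions. `learn_rule` stands in for the particle swarm search of S21 to S25 and is hypothetical, as are `toy_learner`, the `min_remaining` threshold name, and the `max_rules` safety cap.

```python
from typing import Callable, Hashable, List, Sequence, Tuple

Example = Tuple[object, Hashable]          # (attributes, class label)
Rule = Callable[[object], bool]            # returns True if the rule covers x


def sequential_covering(
    examples: Sequence[object],
    labels: Sequence[Hashable],
    target_class: Hashable,
    learn_rule: Callable[[List[Example], Hashable], Rule],
    min_remaining: int = 1,
    max_rules: int = 50,
) -> Tuple[List[Rule], List[Example]]:
    """Collect rules for one class, removing covered examples after each rule."""
    remaining: List[Example] = list(zip(examples, labels))
    rules: List[Rule] = []
    # S26: stop once fewer than `min_remaining` examples are left.
    while len(remaining) >= min_remaining and len(rules) < max_rules:
        covers = learn_rule(remaining, target_class)   # placeholder for S21-S25
        kept = [(x, y) for x, y in remaining if not covers(x)]
        if len(kept) == len(remaining):
            break  # assumption: a rule that covers nothing ends the loop
        rules.append(covers)
        remaining = kept
    return rules, remaining


def toy_learner(remaining: List[Example], target_class: Hashable) -> Rule:
    """Hypothetical stand-in learner: cover the first example of the class."""
    for x, y in remaining:
        if y == target_class:
            return lambda z, x=x: z == x
    return lambda z: False
```

Passing the learner in as a function keeps the covering loop independent of how each rule is found, so the swarm search can be swapped in without changing the loop.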

Further, the test evaluation in step 3 is carried out as follows:

The holdout method is used for comprehensive evaluation. First, the given data set is randomly divided into two independent sets, a training set and a test set, with two thirds of the data as the training set and one third as the test set. The training set is used to derive the classifier, whose accuracy is then assessed on the test set by computing the fitness value, on the test set, of the classification rules obtained in the first step;

Then random subsampling is used to repeat the holdout estimate of prediction accuracy K times;

Finally, the K prediction accuracies are averaged to obtain the final prediction accuracy. The closer a classification rule's fitness values on the training set and on the test set, the higher the classification precision.
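The repeated holdout procedure can be sketched in Python as follows. The function name `repeated_holdout`, the `train_and_score` callback (which is assumed to train on the first argument and return an accuracy measured on the second), and the fixed `seed` are assumptions made for this example.

```python
import random
from typing import Callable, List, Sequence


def repeated_holdout(
    data: Sequence[object],
    train_and_score: Callable[[List[object], List[object]], float],
    k: int = 10,
    train_fraction: float = 2 / 3,
    seed: int = 0,
) -> float:
    """Average the holdout accuracy over k random 2/3 vs 1/3 splits."""
    rng = random.Random(seed)
    scores: List[float] = []
    for _ in range(k):
        shuffled = list(data)
        rng.shuffle(shuffled)                       # new random split each round
        cut = round(len(shuffled) * train_fraction)
        train, test = shuffled[:cut], shuffled[cut:]
        scores.append(train_and_score(train, test))
    return sum(scores) / k                          # final prediction accuracy
```

Averaging over several random splits reduces the dependence of the estimate on any one particular division of the data.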

Further, the test evaluation in step 4 is carried out as follows:

Cross-validation is used for comprehensive evaluation. First, the initial data is divided into K mutually disjoint subsets $S_1, S_2, \ldots, S_K$ of equal size;

Then training and testing are carried out K times: in the $i$-th iteration, $S_i$ is used as the test set and the remaining subsets are all used to train the classification model. That is, the classifier of the first iteration is trained on subsets $S_2, \ldots, S_K$ and tested on $S_1$; the classifier of the second iteration is trained on subsets $S_1, S_3, \ldots, S_K$ and tested on $S_2$; and so on;

Finally, the accuracy estimate is the number of correct classifications over the K iterations divided by the total number of samples in the initial data.
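A minimal Python sketch of this K-fold estimate is given below. The names `k_fold_accuracy`, `train`, and `predict` are hypothetical, the data is assumed to be a list of (attributes, label) pairs that has already been shuffled, and the folds are formed by simple striding, so their sizes differ by at most one.

```python
from typing import Callable, Hashable, List, Sequence, Tuple

Sample = Tuple[object, Hashable]   # (attributes, class label)


def k_fold_accuracy(
    data: Sequence[Sample],
    train: Callable[[List[Sample]], object],
    predict: Callable[[object, object], Hashable],
    k: int = 10,
) -> float:
    """Total correct predictions over k folds divided by the sample count."""
    folds = [list(data[i::k]) for i in range(k)]    # k disjoint subsets S_1..S_k
    correct = 0
    for i in range(k):
        test_fold = folds[i]                         # S_i is the test set
        train_set = [s for j, fold in enumerate(folds) if j != i for s in fold]
        model = train(train_set)                     # train on the other k-1 folds
        correct += sum(1 for x, y in test_fold if predict(model, x) == y)
    return correct / len(data)
```

Every sample is used for testing exactly once, which is why the correct count is divided by the size of the whole data set.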

The second object of the present invention is achieved through the following technical solution:

The massive organizational structure data classification system based on particle swarm provided by the invention comprises a data classification rule construction module, a classification rule set generation module, a classification rule test evaluation module, and a data classification module;

The data classification rule construction module constructs data classification rules using a particle swarm, establishes classification rules for different industries, and encodes the different classification rules using Michigan encoding;

The classification rule set generation module obtains pre-selected data from the massive organizational structure data as a training set, carries out the data set rule covering process according to the constructed classification rules, and generates a classification rule set;

The classification rule test evaluation module obtains another pre-selected part of the massive organizational structure data as a test set, tests and evaluates the classification rule set on the test set, and retains the classification rules that meet the evaluation requirements as the final classifier;

The data classification module classifies the massive organizational structure data with the final classifier to obtain the classification result.

Further, the data classification rules in the data classification rule construction module are constructed as follows:

Each classification rule comprises a condition part and a conclusion part;

The condition part is a set of logical tests joined by logical connectives; the conclusion part is the class of the examples covered by the condition part;

Each particle represents one record in a table of the organizational structure data entity tables.

Further, the data set rule covering process in the classification rule set generation module is carried out according to the following steps:

S21: initialize the rules. Initialize the swarm by randomly initializing the upper and lower bounds of each class within the valid interval of the classification variables; the initialized particles form the initial candidate solutions of the rule extraction algorithm, and an initial industry classification is set for the different organizational structures;

$$f(x) = \frac{tp}{pos} \times \frac{tn}{neg} \qquad (1)$$

where:

$f(x)$ is the particle fitness value;

$pos$ is the total number of positive samples in the training data set;

$neg$ is the total number of negative samples in the training data set;

$$v_{ij}(t+1) = v_{ij}(t) + c_1 r_{1j}\bigl(p_{ij}(t) - x_{ij}(t)\bigr) + c_2 r_{2j}\bigl(p_{gj}(t) - x_{ij}(t)\bigr) \qquad (2)$$

$$x_{ij}(t+1) = x_{ij}(t) + v_{ij}(t+1) \qquad (3)$$

where

$X_i = (x_{i1}, x_{i2}, \ldots, x_{in})$ is the current position of particle $i$;

$V_i = (v_{i1}, v_{i2}, \ldots, v_{in})$ is the current velocity of particle $i$;

S24: prune the rule set according to the following step:

Further, the test evaluation in the classification rule test evaluation module is carried out as follows:

(3) Beneficial effects

Compared with the prior art and existing products, the present invention has the following advantages:

The present invention fully considers the interconnections among the industry data in massive organizational structure data and makes full use of the way the particle swarm algorithm processes data, namely its emphasis on coordination and cooperation among the individuals of the swarm. In this way the industry data of massive organizational structure data is classified quickly and accurately: the different industry categories are divided correctly, a certain classification evaluation precision is reached, and the cooperative relationships between different industries are taken into account in the classification principle. The particle-swarm-based classification method therefore offers a certain reliability and accuracy in the industry classification of organizations and in massive data processing.

Brief description of the drawings

Fig. 1 is a nesting schematic diagram of the classification algorithm of the present invention;

Fig. 2 is a flow chart of classification based on the particle swarm algorithm of the present invention;

Fig. 3 shows the classification covering process of the present invention;

Fig. 4 is a flow chart of the massive organizational structure data classification method based on particle swarm of the present invention.

Embodiment

To help those of ordinary skill in the art understand and implement the present invention, it is described in further detail below with reference to the drawings and specific embodiments.

As shown in the figures, this embodiment provides a massive organizational structure data classification method based on particle swarm, comprising the following steps:

This embodiment takes the Industrial Classification for National Economic Activities (GB/T4754-2011) as the overall classification rule. It divides economic industries into 20 main categories and 96 major classes; see the detailed rules of the standard for specifics. Each of the 96 industry types serves as one classification rule. For forestry, for example, the rule is: any record whose attributes fall within the attribute data set below belongs to the forestry class. Table 1 gives the industry classification rule for forestry, which takes the form x = {x1, x2, x3, x4, x5, ..., xn}, where n is the total number of attributes for forestry; Table 1 below has 10 attributes. For each attribute, its key information is extracted as the attribute information expression. For example, the forest genetics attribute includes keywords such as forest, cultivation, and seed; a record whose name yields these keywords can be assigned to this attribute information set. The expression is as follows:

If attribute1_min ≤ x1 ≤ attribute1_max and

attribute2_min ≤ x2 ≤ attribute2_max and

......

attributen_min ≤ xn ≤ attributen_max

then classx
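The rule expression above can be read as one interval test per attribute, with the class assigned only when every test passes. A minimal Python sketch follows; the class name `IntervalRule`, its fields, and the convention of returning `None` for an uncovered record are assumptions made for this example, and the sketch presumes each attribute has already been mapped to a number.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class IntervalRule:
    """One rule: lower[j] <= x[j] <= upper[j] for every attribute j, then label."""

    lower: Sequence[float]   # attribute1_min ... attributen_min
    upper: Sequence[float]   # attribute1_max ... attributen_max
    label: str               # conclusion part, e.g. an industry class

    def covers(self, x: Sequence[float]) -> bool:
        # Condition part: all interval tests joined by "and".
        return all(lo <= xi <= hi for lo, xi, hi in zip(self.lower, x, self.upper))

    def classify(self, x: Sequence[float]) -> Optional[str]:
        # Assumption: a record the rule does not cover gets no class from it.
        return self.label if self.covers(x) else None
```

Under this representation a particle that encodes one rule is simply the vector of lower and upper bounds, which is what the velocity and position updates would adjust.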

Table 1

The data classification rules in step 1 are constructed as follows:

Each classification rule comprises a condition part and a conclusion part;

The data set rule covering process in step 2 is carried out according to the following steps:

$$f(x) = \frac{tp}{pos} \times \frac{tn}{neg} \qquad (1)$$

where:

$f(x)$ is the particle fitness value;

$pos$ is the total number of positive samples in the training data set;

$neg$ is the total number of negative samples in the training data set;

$$x_{ij}(t+1) = x_{ij}(t) + v_{ij}(t+1) \qquad (3)$$

where

$X_i = (x_{i1}, x_{i2}, \ldots, x_{in})$ is the current position of particle $i$;

$V_i = (v_{i1}, v_{i2}, \ldots, v_{in})$ is the current velocity of particle $i$;

S24: prune the rule set according to the following step:

Each dimension of the particle is compared with the training examples in the data set and the particle's fitness value is calculated; the fitness value is compared with the best positions $P_i$ and $P_g$ experienced so far, and if it is better, it is taken as the current best position and the global best position (here the classified data is mainly compared with the actual industry classification; the higher the classification precision, the better the result is judged to be);

The test evaluation in step 3 is carried out as follows:

Two classification test evaluation methods, the holdout method and cross-validation, are used for comprehensive evaluation;

The final classifier is obtained. The specific requirements for test evaluation are as follows:

(1) Prediction accuracy: prediction accuracy expresses how accurately a classification model classifies. The classification error of a classifier is usually affected by the following factors:

(a) the number of records in the training set;

(b) the number of attributes;

(c) whether the information in the attributes is relevant to the class information;

(d) whether the records to be predicted follow the same distribution as the records in the training set;

(2) Computational complexity: the computational complexity depends on the specific implementation details and the hardware environment;

(3) Conciseness of the model description: for descriptive classification tasks, the more concise the model description, the better.

This embodiment uses two classification test evaluation methods, the holdout method and cross-validation, for comprehensive evaluation;

Holdout method: the given data set is randomly divided into two independent sets, a training set and a test set, usually with two thirds of the data as the training set and the remaining one third as the test set. The training set is used to derive the classifier, whose accuracy is then assessed on the test set; the assessment computes the fitness value, on the test set, of the classification rules obtained in the first step. Random subsampling is a variation of the holdout method: the holdout estimate of prediction accuracy is repeated K times, and the K prediction accuracies are averaged to obtain the final prediction accuracy. The closer a classification rule's fitness values on the training set and on the test set, the higher the classification precision.

Cross-validation: first, the initial data is divided into K mutually disjoint subsets $S_1, S_2, \ldots, S_K$ of roughly equal size. Training and testing are carried out K times. In the $i$-th iteration, $S_i$ is used as the test set and the remaining subsets are all used to train the classification model. That is, the classifier of the first iteration is trained on subsets $S_2, \ldots, S_K$ and tested on $S_1$; the classifier of the second iteration is trained on subsets $S_1, S_3, \ldots, S_K$ and tested on $S_2$; and so on. The accuracy estimate is the number of correct classifications over the K iterations divided by the total number of samples in the initial data. In stratified cross-validation the subsets are stratified, so that the class distribution of the samples in each fold is roughly the same as in the initial data. K defaults to 10 in general.
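The stratified variant mentioned above differs from plain cross-validation only in how the folds are formed. One simple way to build such folds is sketched below in Python; the function name `stratified_folds` and the round-robin assignment are assumptions made for this example, not a procedure given in the text.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def stratified_folds(labels: Sequence[Hashable], k: int = 10) -> List[List[int]]:
    """Return k lists of sample indices with roughly equal class proportions."""
    by_class: Dict[Hashable, List[int]] = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)          # group sample indices by class
    folds: List[List[int]] = [[] for _ in range(k)]
    pos = 0                              # running counter keeps fold sizes balanced
    for indices in by_class.values():
        for idx in indices:
            folds[pos % k].append(idx)   # deal each class out across the folds
            pos += 1
    return folds
```

Dealing out one class at a time spreads each class as evenly as possible, so every fold's class distribution stays close to that of the full data set.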

This embodiment also provides a massive organizational structure data classification system based on particle swarm, comprising a data classification rule construction module, a classification rule set generation module, a classification rule test evaluation module, and a data classification module;

The data classification rules in the data classification rule construction module are constructed as follows:

Each classification rule comprises a condition part and a conclusion part;

The data set rule covering process in the classification rule set generation module is carried out according to the following steps:

$$f(x) = \frac{tp}{pos} \times \frac{tn}{neg} \qquad (1)$$

where:

$f(x)$ is the particle fitness value;

$pos$ is the total number of positive samples in the training data set;

$neg$ is the total number of negative samples in the training data set;

$$x_{ij}(t+1) = x_{ij}(t) + v_{ij}(t+1) \qquad (3)$$

where

$X_i = (x_{i1}, x_{i2}, \ldots, x_{in})$ is the current position of particle $i$;

$V_i = (v_{i1}, v_{i2}, \ldots, v_{in})$ is the current velocity of particle $i$;

S24: prune the rule set according to the following step:

Each dimension of the particle is compared with the training examples in the data set and the particle's fitness value is calculated; the fitness value is compared with the best positions $P_i$ and $P_g$ experienced so far, and if it is better, it is taken as the current best position and the global best position;

By comparing the particle fitness values on the training data set, each fitness value is compared with that of the particle's history best position and that of the global best position over all particles; if the fitness value is higher, the position is taken as the current best position and the global best position;

The test evaluation in the classification rule test evaluation module is carried out as follows:

The final classifier is obtained. The specific requirements for test evaluation are as follows:

(a) the number of records in the training set;

(b) the number of attributes;

The above embodiment is only one embodiment of the present invention. Its description is relatively specific and detailed, but it should not therefore be construed as limiting the scope of the patent. Its specific structure and dimensions can be adjusted according to actual needs. It should be pointed out that those of ordinary skill in the art can make several variations and improvements without departing from the inventive concept, and these all fall within the protection scope of the present invention.

Claims

1. A massive organizational structure data classification method based on particle swarm, characterized by comprising the following steps:

2. The massive organizational structure data classification method based on particle swarm according to claim 1, characterized in that the data classification rules in step 1 are constructed as follows:

3. The massive organizational structure data classification method based on particle swarm according to claim 1, characterized in that the data set rule covering process in step 2 is carried out according to the following steps:

$$f(x) = \frac{tp}{pos} \times \frac{tn}{neg} \qquad (1)$$

where:

$f(x)$ is the particle fitness value;

$pos$ is the total number of positive samples in the training data set;

$neg$ is the total number of negative samples in the training data set;

$$x_{ij}(t+1) = x_{ij}(t) + v_{ij}(t+1) \qquad (3)$$

where

$X_i = (x_{i1}, x_{i2}, \ldots, x_{in})$ is the current position of particle $i$;

$V_i = (v_{i1}, v_{i2}, \ldots, v_{in})$ is the current velocity of particle $i$;

S24: prune the rule set according to the following step:

4. The massive organizational structure data classification method based on particle swarm according to claim 1, characterized in that the test evaluation in step 3 is carried out as follows:

5. The massive organizational structure data classification method based on particle swarm according to claim 1, characterized in that the test evaluation in step 4 is carried out as follows:

6. A massive organizational structure data classification system based on particle swarm, characterized by comprising a data classification rule construction module, a classification rule set generation module, a classification rule test evaluation module, and a data classification module;

7. The massive organizational structure data classification system based on particle swarm according to claim 6, characterized in that the data classification rules in the data classification rule construction module are constructed as follows:

Each classification rule comprises a condition part and a conclusion part;

8. The massive organizational structure data classification system based on particle swarm according to claim 6, characterized in that the data set rule covering process in the classification rule set generation module is carried out according to the following steps:

$$f(x) = \frac{tp}{pos} \times \frac{tn}{neg} \qquad (1)$$

where:

$f(x)$ is the particle fitness value;

$pos$ is the total number of positive samples in the training data set;

$neg$ is the total number of negative samples in the training data set;

$$x_{ij}(t+1) = x_{ij}(t) + v_{ij}(t+1) \qquad (3)$$

where

$X_i = (x_{i1}, x_{i2}, \ldots, x_{in})$ is the current position of particle $i$;

$V_i = (v_{i1}, v_{i2}, \ldots, v_{in})$ is the current velocity of particle $i$;

S24: prune the rule set according to the following step:

9. The massive organizational structure data classification system based on particle swarm according to claim 6, characterized in that the test evaluation in the classification rule test evaluation module is carried out as follows:

10. The massive organizational structure data classification system based on particle swarm according to claim 9, characterized in that the test evaluation in the classification rule test evaluation module is carried out as follows:

Finally, the accuracy estimate is the number of correct classifications over the K iterations divided by the total number of samples in the initial data.