CN104537383A - Massive organizational structure data classification method and system based on particle swarm - Google Patents


Info

Publication number
CN104537383A
Authority
CN
China
Prior art keywords
data
classification
rule
particulate
classifying rules
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510027069.8A
Other languages
Chinese (zh)
Inventor
孙镇
孙泰
赵捷
袁辉
金江
李晟飞
钱晓东
宫政
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NATIONAL ADMINISTRATION FOR CODE ALLOCATION TO ORGANIZATIONS
Original Assignee
NATIONAL ADMINISTRATION FOR CODE ALLOCATION TO ORGANIZATIONS
Application filed by NATIONAL ADMINISTRATION FOR CODE ALLOCATION TO ORGANIZATIONS
Priority to CN201510027069.8A
Publication of CN104537383A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a massive organizational structure data classification method based on a particle swarm. The method comprises the following steps: firstly constructing a data classification rule by adopting the particle swarm; establishing classification rules of different industries; then, acquiring pre-selected data in the massive organizational structure data as a training set and a testing set, and carrying out data set rule covering process and testing evaluation according to the constructed classification rule to obtain a final classifier; and finally classifying the massive organizational structure data by using the final classifier to obtain a classification result. According to the method provided by the invention, the relationships in industrial data of the massive organizational structure data are fully considered, and the characteristic that data is processed by using a particle swarm algorithm is fully utilized, so that the industrial data of the massive organizational structure data is quickly and accurately classified. Therefore, the massive organizational structure data classification method based on the particle swarm has certain reliability and accuracy in industrial classification of organizational structures and massive data processing.

Description

Massive organizational structure data classification method and system based on particle swarm
Technical Field
The present invention relates to the field of intelligent computation on massive data, and in particular to a particle swarm based method for classifying massive organizational structure data.
Background
Massive organizational structure data contain information such as organization addresses, codes, and administrative division codes for organizations of different industries, fields, and levels; the data structures vary and the content is complex. Classifying massive organizational structure data helps improve the efficiency of data extraction and data retrieval, and also makes it possible to mine other information within each industry category, such as an organization's scope of business and the types of products it deals in.
Existing classification rule mining methods each have their own shortcomings. For example, the prior probabilities required by statistical methods are hard to justify convincingly in theory; machine learning methods have poor fault tolerance in noisy environments; rough set methods cannot determine the degree of membership of members; and neural network methods have so many nodes and connection weights that the results are hard to understand and verify. Thus, across application fields and data types, every method has its strengths and weaknesses, and no single classification algorithm outperforms the others for all application fields and data types.
A method for accurately classifying massive data is therefore urgently needed.
Summary of the Invention
(1) Technical problem to be solved
The object of the invention is to provide a particle swarm based method for classifying massive organizational structure data which, taking into account the internal cooperation among organizations, classifies the industries of organizations with a particle swarm based classification algorithm.
(2) Technical solution
A first object of the invention is to propose a particle swarm based method for classifying massive organizational structure data; a second object is to propose a particle swarm based system for classifying massive organizational structure data.
The first object of the invention is achieved by the following technical solution:
The particle swarm based method for classifying massive organizational structure data provided by the invention comprises the following steps:
Step 1: build data classification rules using a particle swarm; establish the classification rules for the different industries, and encode the different classification rules using Michigan encoding;
Step 2: obtain preselected data from the massive organizational structure data as a training set, and perform the data set rule covering process according to the classification rules built, generating a classification rule set;
Step 3: obtain another preselected portion of the massive organizational structure data as a test set, test and evaluate the classification rule set on the test set, and retain the classification rules that meet the test evaluation requirements as the final classifier;
Step 4: classify the massive organizational structure data with the final classifier to obtain the classification result.
Further, the data classification rules in step 1 are built as follows:
Each classification rule comprises a condition part and a conclusion part. The condition part is a set of logical tests joined by logical connectives; the conclusion part is the class of the instances covered by the condition part. Each particle represents one record of a table among the organizational structure data entity tables.
Further, the data set rule covering process in step 2 is carried out in the following steps:
S21: Initialize the rules. Initialize the particle swarm by randomly initializing the upper and lower bounds of each class within the valid interval of the classification variables; the initialized swarm forms the initial candidate solutions of the rule extraction algorithm, and an initial industry classification is assigned to the different organizations;
S22: Determine the best particle in the swarm, computing each particle's fitness value with the following formula:
$$f(x) = \frac{tp}{pos} \cdot \frac{tn}{neg} \tag{1}$$
where:
f(x) is the fitness value of the particle;
tp is the number of correctly classified instances, i.e. instances that are covered by the rule and correctly classified;
tn is the number of correctly rejected instances, i.e. instances that are not covered by the rule and whose class also differs from the training target;
pos is the total number of positive samples in the training data set;
neg is the total number of negative samples in the training data set;
S23: Update the particles that fit the rule set according to the following formulas:
$$v_{ij}(t+1) = v_{ij}(t) + c_1 r_{1j}\,\bigl(p_{ij}(t) - x_{ij}(t)\bigr) + c_2 r_{2j}\,\bigl(p_{gj}(t) - x_{ij}(t)\bigr) \tag{2}$$
$$x_{ij}(t+1) = x_{ij}(t) + v_{ij}(t+1) \tag{3}$$
where
formula (2) is the velocity update equation for the j-th dimension of particle i;
formula (3) is the position update equation for the j-th dimension of particle i;
t denotes the t-th generation; $c_1$ and $c_2$ are acceleration constants with values from 0 to 2; $r_{1j} \sim U(0,1)$ and $r_{2j} \sim U(0,1)$ are two mutually independent random numbers; $c_1$ adjusts the step size with which a particle flies toward its own best position, and $c_2$ adjusts the step size with which it flies toward the global best position;
$X_i = (x_{i1}, x_{i2}, \ldots, x_{in})$ is the current position of particle i;
$V_i = (v_{i1}, v_{i2}, \ldots, v_{in})$ is the current velocity of particle i;
During evolution, the best position each particle has found so far, $P_i = (p_{i1}, p_{i2}, \ldots, p_{in})$, and the global best position found by all particles, $P_g = (p_{g1}, p_{g2}, \ldots, p_{gn})$, are recorded;
S24: Prune the rule set as follows:
By comparing the particle fitness values against the training data set, determine the current best position and the global best position;
S25: Check whether the iteration count has reached the maximum number of generations or all data have been correctly classified; if so, go to step S26; otherwise return to step S23;
S26: Put the generated rule for class i into the rule set, then remove from the data set the instances covered by that rule, and check whether the number of remaining instances is below the preset value; if so, rule extraction for that class is complete; otherwise return to step S21.
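As a minimal sketch of formula (1), the Python function below counts tp, tn, pos and neg over a labelled training set and returns the product of the two ratios. The names `rule_fitness`, `covers` and `Instance` are illustrative and do not come from the source; returning 0 when the training set has no positive or no negative samples is an assumption, since the source does not define that case.

```python
from typing import Callable, Sequence, Tuple

# Assumed data layout: (attribute vector, class label).
Instance = Tuple[Sequence[float], str]


def rule_fitness(
    covers: Callable[[Sequence[float]], bool],
    target_class: str,
    training_set: Sequence[Instance],
) -> float:
    """Fitness f(x) = (tp / pos) * (tn / neg), as in formula (1)."""
    tp = tn = pos = neg = 0
    for attributes, label in training_set:
        covered = covers(attributes)
        if label == target_class:
            pos += 1
            if covered:
                tp += 1  # covered by the rule and correctly classified
        else:
            neg += 1
            if not covered:
                tn += 1  # not covered, and class differs from the target
    if pos == 0 or neg == 0:
        return 0.0  # assumption: undefined in the source, treated as worst case
    return (tp / pos) * (tn / neg)


if __name__ == "__main__":
    data = [((1.0,), "forestry"), ((2.0,), "forestry"),
            ((8.0,), "mining"), ((9.0,), "mining")]
    # Hypothetical rule: x1 <= 1.5 implies forestry.
    print(rule_fitness(lambda x: x[0] <= 1.5, "forestry", data))  # prints 0.5
```

The product form rewards rules that both cover many positives and reject many negatives; a rule that covers everything scores 0 because tn is 0.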
Further, the test evaluation in step 3 is carried out as follows:
The holdout method is used for comprehensive evaluation. First, the given data set is randomly divided into two independent sets, a training set and a test set, with two thirds of the data used as the training set and one third as the test set. The training set is used to derive the classifier, whose accuracy is then assessed on the test set by computing, for the classification rules obtained in the first step, their fitness value on the test set;
Then, by random subsampling, the holdout method is repeated K times to estimate the prediction accuracy;
Finally, the K prediction accuracies obtained are averaged to give the final prediction accuracy. The closer a classification rule's fitness values on the training set and on the test set, the higher the classification precision.
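The repeated holdout procedure described above can be sketched in Python as follows. This is an illustration under stated assumptions: `holdout_split`, `random_subsampling_accuracy` and the `train_and_score` callback (which trains on the first argument and returns an accuracy on the second) are hypothetical names, and the fixed seed is only there to make the sketch reproducible.

```python
import random
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def holdout_split(data: Sequence[T], rng: random.Random) -> Tuple[List[T], List[T]]:
    """Randomly split data into 2/3 training and 1/3 test."""
    shuffled = list(data)
    rng.shuffle(shuffled)
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]


def random_subsampling_accuracy(
    data: Sequence[T],
    train_and_score: Callable[[List[T], List[T]], float],
    k: int,
    seed: int = 0,
) -> float:
    """Repeat the holdout split k times and average the accuracies."""
    rng = random.Random(seed)
    scores = [train_and_score(*holdout_split(data, rng)) for _ in range(k)]
    return sum(scores) / k
```

Each repetition draws a fresh random split, so the averaged estimate is less sensitive to one unlucky partition than a single holdout run.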
Further, the test evaluation in step 3 is carried out as follows:
Cross-validation is used for comprehensive evaluation. First, the initial data are divided into K mutually disjoint subsets S1, S2, ..., Sk of equal size;
Then training and testing are carried out K times: in the i-th iteration, Si is used as the test set and all remaining subsets are used to train the classification model. That is, the classifier of the first iteration is trained on subsets S2, ..., Sk and tested on S1; the classifier of the second iteration is trained on subsets S1, S3, ..., Sk and tested on S2; and so on;
Finally, the accuracy estimate is the total number of correct classifications over the K iterations divided by the total number of samples in the initial data.
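A minimal Python sketch of this K-fold estimate is given below. It assumes the data have already been shuffled, and the `count_correct` callback (train on the first argument, return the number of correctly classified samples in the second) is a hypothetical stand-in for the classifier; neither name comes from the source.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def k_fold_accuracy(
    data: Sequence[T],
    k: int,
    count_correct: Callable[[List[T], List[T]], int],
) -> float:
    """Total correct classifications over k iterations / total samples."""
    # Disjoint folds S1..Sk; sizes differ by at most one element.
    folds = [list(data[i::k]) for i in range(k)]
    total_correct = 0
    for i in range(k):
        test_fold = folds[i]
        # All remaining folds are used for training.
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        total_correct += count_correct(train, test_fold)
    return total_correct / len(data)
```

Because every sample is tested exactly once, dividing by the size of the initial data gives a single pooled accuracy rather than an average of per-fold accuracies.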
The second object of the invention is achieved by the following technical solution:
The particle swarm based system for classifying massive organizational structure data provided by the invention comprises a data classification rule building module, a classification rule set generation module, a classification rule test evaluation module, and a data classification module;
The data classification rule building module builds data classification rules using a particle swarm; it establishes the classification rules for the different industries and encodes the different classification rules using Michigan encoding;
The classification rule set generation module obtains preselected data from the massive organizational structure data as a training set and performs the data set rule covering process according to the classification rules built, generating a classification rule set;
The classification rule test evaluation module obtains another preselected portion of the massive organizational structure data as a test set, tests and evaluates the classification rule set on the test set, and retains the classification rules that meet the test evaluation requirements as the final classifier;
The data classification module classifies the massive organizational structure data with the final classifier to obtain the classification result.
Further, the data classification rules in the data classification rule building module are built as follows:
Each classification rule comprises a condition part and a conclusion part;
The condition part is a set of logical tests joined by logical connectives; the conclusion part is the class of the instances covered by the condition part;
Each particle represents one record of a table among the organizational structure data entity tables.
Further, the data set rule covering process in the classification rule set generation module is carried out in the following steps:
S21: Initialize the rules. Initialize the particle swarm by randomly initializing the upper and lower bounds of each class within the valid interval of the classification variables; the initialized swarm forms the initial candidate solutions of the rule extraction algorithm, and an initial industry classification is assigned to the different organizations;
S22: Determine the best particle in the swarm, computing each particle's fitness value with the following formula:
$$f(x) = \frac{tp}{pos} \cdot \frac{tn}{neg} \tag{1}$$
where:
f(x) is the fitness value of the particle;
tp is the number of correctly classified instances, i.e. instances that are covered by the rule and correctly classified;
tn is the number of correctly rejected instances, i.e. instances that are not covered by the rule and whose class also differs from the training target;
pos is the total number of positive samples in the training data set;
neg is the total number of negative samples in the training data set;
S23: Update the particles that fit the rule set according to the following formulas:
$$v_{ij}(t+1) = v_{ij}(t) + c_1 r_{1j}\,\bigl(p_{ij}(t) - x_{ij}(t)\bigr) + c_2 r_{2j}\,\bigl(p_{gj}(t) - x_{ij}(t)\bigr) \tag{2}$$
$$x_{ij}(t+1) = x_{ij}(t) + v_{ij}(t+1) \tag{3}$$
where
formula (2) is the velocity update equation for the j-th dimension of particle i;
formula (3) is the position update equation for the j-th dimension of particle i;
t denotes the t-th generation; $c_1$ and $c_2$ are acceleration constants with values from 0 to 2; $r_{1j} \sim U(0,1)$ and $r_{2j} \sim U(0,1)$ are two mutually independent random numbers; $c_1$ adjusts the step size with which a particle flies toward its own best position, and $c_2$ adjusts the step size with which it flies toward the global best position;
$X_i = (x_{i1}, x_{i2}, \ldots, x_{in})$ is the current position of particle i;
$V_i = (v_{i1}, v_{i2}, \ldots, v_{in})$ is the current velocity of particle i;
During evolution, the best position each particle has found so far, $P_i = (p_{i1}, p_{i2}, \ldots, p_{in})$, and the global best position found by all particles, $P_g = (p_{g1}, p_{g2}, \ldots, p_{gn})$, are recorded;
S24: Prune the rule set as follows:
By comparing the particle fitness values against the training data set, determine the current best position and the global best position;
S25: Check whether the iteration count has reached the maximum number of generations or all data have been correctly classified; if so, go to step S26; otherwise return to step S23;
S26: Put the generated rule for class i into the rule set, then remove from the data set the instances covered by that rule, and check whether the number of remaining instances is below the preset value; if so, rule extraction for that class is complete; otherwise return to step S21.
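The update of formulas (2) and (3) for a single particle can be sketched in Python as below. The function name `update_particle`, the in-place list representation, and the default values c1 = c2 = 2.0 (the upper end of the stated 0 to 2 range) are assumptions made for illustration only.

```python
import random
from typing import List, Optional


def update_particle(
    x: List[float],
    v: List[float],
    p_best: List[float],
    g_best: List[float],
    c1: float = 2.0,
    c2: float = 2.0,
    rng: Optional[random.Random] = None,
) -> None:
    """Apply formulas (2) and (3) in place to one particle."""
    rng = rng or random.Random()
    for j in range(len(x)):
        r1 = rng.random()  # r1 drawn uniformly from [0, 1)
        r2 = rng.random()  # r2 drawn independently of r1
        # Formula (2): pull toward the particle's own best and the global best.
        v[j] = v[j] + c1 * r1 * (p_best[j] - x[j]) + c2 * r2 * (g_best[j] - x[j])
        # Formula (3): move by the new velocity.
        x[j] = x[j] + v[j]
```

Note that both attraction terms use the same dimension index j; the global best enters as the j-th component of P_g, not a component indexed by the particle.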
Further, the test evaluation in the classification rule test evaluation module is carried out as follows:
The holdout method is used for comprehensive evaluation. First, the given data set is randomly divided into two independent sets, a training set and a test set, with two thirds of the data used as the training set and one third as the test set. The training set is used to derive the classifier, whose accuracy is then assessed on the test set by computing, for the classification rules obtained in the first step, their fitness value on the test set;
Then, by random subsampling, the holdout method is repeated K times to estimate the prediction accuracy;
Finally, the K prediction accuracies obtained are averaged to give the final prediction accuracy. The closer a classification rule's fitness values on the training set and on the test set, the higher the classification precision.
Further, the test evaluation in the classification rule test evaluation module is carried out as follows:
Cross-validation is used for comprehensive evaluation. First, the initial data are divided into K mutually disjoint subsets S1, S2, ..., Sk of equal size;
Then training and testing are carried out K times: in the i-th iteration, Si is used as the test set and all remaining subsets are used to train the classification model. That is, the classifier of the first iteration is trained on subsets S2, ..., Sk and tested on S1; the classifier of the second iteration is trained on subsets S1, S3, ..., Sk and tested on S2; and so on;
Finally, the accuracy estimate is the total number of correct classifications over the K iterations divided by the total number of samples in the initial data.
(3) Beneficial effects
Compared with the prior art and existing products, the invention has the following advantages:
The invention takes full account of the interrelations that exist among the industry data in massive organizational structure data, and makes full use of the way the particle swarm algorithm processes data, namely its emphasis on coordination and cooperation among the individuals of the swarm. In this way the industry data of massive organizational structure data are classified quickly and accurately: the different industry categories are divided correctly and a certain level of classification precision is reached, while the cooperative relations among different industries are also considered in the classification principle. The particle swarm based method for classifying massive organizational structure data therefore offers a degree of reliability and accuracy in the industry classification of organizations and in massive data processing.
Brief Description of the Drawings
Fig. 1 is a schematic diagram of the nesting of the classification algorithm of the invention;
Fig. 2 is a flowchart of classification based on the particle swarm algorithm of the invention;
Fig. 3 shows the classification covering process of the invention;
Fig. 4 is a flowchart of the particle swarm based method for classifying massive organizational structure data of the invention.
Detailed Description
To help those of ordinary skill in the art understand and implement the invention, it is described in further detail below with reference to the drawings and specific embodiments.
As shown in the figures, this embodiment provides a particle swarm based method for classifying massive organizational structure data, comprising the following steps:
Step 1: build data classification rules using a particle swarm; establish the classification rules for the different industries, and encode the different classification rules using Michigan encoding;
In this embodiment, the Industrial Classification for National Economic Activities (GB/T 4754-2011) is used as the overall classification standard. It divides economic sectors into 20 main categories and 96 major classes; see the detailed rules of the standard for specifics. Each of the 96 industry types is treated as one classification rule. For forestry, for example, the rule is: any record whose attributes fall within the attribute set below belongs to the forestry class. Table 1 shows the industry classification rule for forestry, which takes the form x = {x1, x2, x3, x4, x5, ..., xn}, where n is the total number of attribute information items for forestry; Table 1 has 10 attribute information items. For each attribute, its key information is extracted to form the attribute information expression. For example, the forest genetics attribute includes keywords such as forest, cultivation, and seed; a name from which such a keyword is extracted can be assigned to this attribute information set. The expression is as follows:
if attribute1_min ≤ x1 ≤ attribute1_max and
   attribute2_min ≤ x2 ≤ attribute2_max and
   ......
   attributen_min ≤ xn ≤ attributen_max
then class x
Table 1
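The rule expression above, a conjunction of interval tests with a class as conclusion, can be sketched as a small Python data structure. This is an illustration only: `IntervalRule`, its methods, and the flattened layout [min1, max1, min2, max2, ...] used to map a rule to a particle position are assumptions, as is the numeric encoding of attributes (the source describes keyword-based attributes and does not give Table 1's contents). In Michigan-style encoding, each individual encodes a single rule, which is what the one-rule-per-vector layout reflects.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class IntervalRule:
    bounds: List[Tuple[float, float]]  # (attribute_j_min, attribute_j_max) per attribute
    label: str                         # conclusion part: the predicted class

    def covers(self, x: Sequence[float]) -> bool:
        """Condition part: every attribute must lie within its interval."""
        return all(lo <= xj <= hi for (lo, hi), xj in zip(self.bounds, x))

    def to_particle(self) -> List[float]:
        """Flatten the bounds into one position vector (assumed layout)."""
        return [value for pair in self.bounds for value in pair]

    @classmethod
    def from_particle(cls, position: Sequence[float], label: str) -> "IntervalRule":
        """Rebuild a rule, swapping bounds that a position update has crossed."""
        pairs = [(min(a, b), max(a, b)) for a, b in zip(position[0::2], position[1::2])]
        return cls(pairs, label)
```

Swapping crossed bounds in `from_particle` is one simple way to keep every decoded rule valid after a position update; the source does not say how such cases are handled.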
Step 2: obtain preselected data from the massive organizational structure data as a training set, and perform the data set rule covering process according to the classification rules built, generating a classification rule set;
Step 3: obtain another preselected portion of the massive organizational structure data as a test set, test and evaluate the classification rule set on the test set, and retain the classification rules that meet the test evaluation requirements as the final classifier;
Step 4: classify the massive organizational structure data with the final classifier to obtain the classification result.
The data classification rules in step 1 are built as follows:
Each classification rule comprises a condition part and a conclusion part;
The condition part is a set of logical tests joined by logical connectives; the conclusion part is the class of the instances covered by the condition part;
Each particle represents one record of a table among the organizational structure data entity tables.
The data set rule covering process in step 2 is carried out in the following steps:
S21: Initialize the rules. Initialize the particle swarm by randomly initializing the upper and lower bounds of each class within the valid interval of the classification variables; the initialized swarm forms the initial candidate solutions of the rule extraction algorithm, and an initial industry classification is assigned to the different organizations;
S22: Determine the best particle in the swarm, computing each particle's fitness value with the following formula:
$$f(x) = \frac{tp}{pos} \cdot \frac{tn}{neg} \tag{1}$$
where:
f(x) is the fitness value of the particle;
tp is the number of correctly classified instances, i.e. instances that are covered by the rule and correctly classified;
tn is the number of correctly rejected instances, i.e. instances that are not covered by the rule and whose class also differs from the training target;
pos is the total number of positive samples in the training data set;
neg is the total number of negative samples in the training data set;
S23: Update the particles that fit the rule set according to the following formulas:
$$v_{ij}(t+1) = v_{ij}(t) + c_1 r_{1j}\,\bigl(p_{ij}(t) - x_{ij}(t)\bigr) + c_2 r_{2j}\,\bigl(p_{gj}(t) - x_{ij}(t)\bigr) \tag{2}$$
$$x_{ij}(t+1) = x_{ij}(t) + v_{ij}(t+1) \tag{3}$$
where
formula (2) is the velocity update equation for the j-th dimension of particle i;
formula (3) is the position update equation for the j-th dimension of particle i;
t denotes the t-th generation; $c_1$ and $c_2$ are acceleration constants with values from 0 to 2; $r_{1j} \sim U(0,1)$ and $r_{2j} \sim U(0,1)$ are two mutually independent random numbers; $c_1$ adjusts the step size with which a particle flies toward its own best position, and $c_2$ adjusts the step size with which it flies toward the global best position;
$X_i = (x_{i1}, x_{i2}, \ldots, x_{in})$ is the current position of particle i;
$V_i = (v_{i1}, v_{i2}, \ldots, v_{in})$ is the current velocity of particle i;
During evolution, the best position each particle has found so far, $P_i = (p_{i1}, p_{i2}, \ldots, p_{in})$, and the global best position found by all particles, $P_g = (p_{g1}, p_{g2}, \ldots, p_{gn})$, are recorded;
S24: Prune the rule set as follows:
Each dimension of a particle is compared with the training instances in the data set to compute the particle's fitness value. This fitness value is compared with those of the best positions experienced so far, $P_i$ and $P_g$; if it is better, the particle's position is taken as the current best position and the global best position. (Here, the classified data are mainly compared with the actual industry classification; the higher the classification accuracy, the better the result is judged to be.);
S25: Check whether the iteration count has reached the maximum number of generations or all data have been correctly classified; if so, go to step S26; otherwise return to step S23;
S26: Put the generated rule for class i into the rule set, then remove from the data set the instances covered by that rule, and check whether the number of remaining instances is below the preset value; if so, rule extraction for that class is complete; otherwise return to step S21.
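The comparison in step S24 can be sketched in Python as follows. The sketch assumes that a higher fitness value is better and that the personal and global bests are updated independently (a particle can improve on its own best without beating the global best); the source text does not spell out either point. The name `update_bests` and the parallel-list layout are hypothetical.

```python
from typing import Callable, List, Tuple

Position = List[float]


def update_bests(
    positions: List[Position],
    p_best: List[Position],
    p_best_fit: List[float],
    g_best: Position,
    g_best_fit: float,
    fitness: Callable[[Position], float],
) -> Tuple[Position, float]:
    """Update each particle's best in place; return the new global best."""
    for i, x in enumerate(positions):
        f = fitness(x)
        if f > p_best_fit[i]:
            # Better than anything this particle has experienced so far.
            p_best[i] = list(x)
            p_best_fit[i] = f
        if f > g_best_fit:
            # Better than anything any particle has experienced so far.
            g_best, g_best_fit = list(x), f
    return g_best, g_best_fit
```

Positions are copied with `list(x)` so that later in-place position updates do not silently change the stored bests.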
The test evaluation in step 3 is carried out as follows:
Two classification test evaluation methods, the holdout method and cross-validation, are used for comprehensive evaluation, yielding the final classifier. The specific requirements for the test evaluation are as follows:
(1) Prediction accuracy: the accuracy of prediction expresses how accurately a classification model classifies. The following factors usually affect a classifier's classification error:
(a) the number of records in the training set;
(b) the number of attributes;
(c) whether the information in the attributes is relevant to the class information;
(d) whether the records to be predicted follow the same distribution as the records in the training set;
(2) Computational complexity: the computational complexity depends on the concrete implementation details and the hardware environment;
(3) Conciseness of the model description: for descriptive classification tasks, the more concise the model description, the better.
This embodiment uses the holdout method and cross-validation together for comprehensive evaluation;
Holdout method: the given data set is randomly divided into two independent sets, a training set and a test set; usually two thirds of the data are used as the training set and the remaining one third as the test set. The training set is used to derive the classifier, whose accuracy is then assessed on the test set; the assessment computes, for the classification rules obtained in the first step, their fitness value on the test set. Random subsampling is a variation of the holdout method in which the holdout method is repeated K times to estimate the prediction accuracy; the K prediction accuracies obtained are then averaged to give the final prediction accuracy. The closer a classification rule's fitness values on the training set and on the test set, the higher the classification precision.
Cross-validation: first the initial data are divided into K mutually disjoint subsets S1, S2, ..., Sk of roughly equal size. Training and testing are carried out K times. In the i-th iteration, Si is used as the test set and all remaining subsets are used to train the classification model. That is, the classifier of the first iteration is trained on subsets S2, ..., Sk and tested on S1; the classifier of the second iteration is trained on subsets S1, S3, ..., Sk and tested on S2; and so on. The accuracy estimate is the total number of correct classifications over the K iterations divided by the total number of samples in the initial data. In stratified cross-validation, the subsets are stratified so that the class distribution of the samples in each subset is roughly the same as in the initial data. K generally defaults to 10.
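One simple way to build the stratified subsets mentioned above is to deal the indices of each class round-robin across the K folds, as in the Python sketch below. This is an assumption about how stratification might be implemented, not the source's procedure; `stratified_folds` is a hypothetical name, and the sketch assumes the samples were shuffled beforehand.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def stratified_folds(labels: Sequence[Hashable], k: int = 10) -> List[List[int]]:
    """Return k lists of sample indices with similar class proportions."""
    by_class: Dict[Hashable, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds: List[List[int]] = [[] for _ in range(k)]
    next_fold = 0
    for indices in by_class.values():
        for index in indices:
            folds[next_fold].append(index)
            # Continue the rotation across classes so fold sizes stay balanced.
            next_fold = (next_fold + 1) % k
    return folds
```

With six samples of one class and three of another and k = 3, each fold receives two of the first class and one of the second, matching the 2:1 ratio of the full data.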
This embodiment also provides a particle swarm based system for classifying massive organizational structure data, comprising a data classification rule building module, a classification rule set generation module, a classification rule test evaluation module, and a data classification module;
The data classification rule building module builds data classification rules using a particle swarm; it establishes the classification rules for the different industries and encodes the different classification rules using Michigan encoding;
The classification rule set generation module obtains preselected data from the massive organizational structure data as a training set and performs the data set rule covering process according to the classification rules built, generating a classification rule set;
The classification rule test evaluation module obtains another preselected portion of the massive organizational structure data as a test set, tests and evaluates the classification rule set on the test set, and retains the classification rules that meet the test evaluation requirements as the final classifier;
The data classification module classifies the massive organizational structure data with the final classifier to obtain the classification result.
The data classification rules in the data classification rule building module are built as follows:
Each classification rule comprises a condition part and a conclusion part;
The condition part is a set of logical tests joined by logical connectives; the conclusion part is the class of the instances covered by the condition part;
Each particle represents one record of a table among the organizational structure data entity tables.
Data set rule coverage process in described classifying rules set generation module realizes according to following steps:
S21: initialization rule; Initialization population, the bound of random initializtion classification in the valid interval of classified variable, the initial candidate solution of initialized Particle Swarm composition rule extraction algorithm, carries out initial trade classification to different institutional frameworks and arranges;
S22: determine the best particulate in Particle Swarm, calculate particulate adaptive value according to following formula:
f(x)=tp/pos*tn/neg(1);
Wherein:
F (x) represents best particulate adaptive value;
Tp represents correct classified instance number, namely by rule coverage and the instance number of correct classification;
Tn represents correct rejection instance number, namely not by rule coverage, and the instance number that classification is also different with training objective;
Pos represents that training data concentrates positive total sample number;
Neg represents that training data concentrates negative sample sum;
S23: Update the particles of the candidate rule set according to the following formulas:
$$v_{ij}(t+1) = v_{ij}(t) + c_1 r_{1j}\,\bigl(p_{ij}(t) - x_{ij}(t)\bigr) + c_2 r_{2j}\,\bigl(p_{gj}(t) - x_{ij}(t)\bigr) \tag{2}$$
$$x_{ij}(t+1) = x_{ij}(t) + v_{ij}(t+1) \tag{3}$$
where
formula (2) is the velocity update equation for the j-th dimension of particle i;
formula (3) is the position update equation for the j-th dimension of particle i;
t denotes the t-th generation; c1 and c2 are acceleration constants with values from 0 to 2; r1 ~ U(0,1) and r2 ~ U(0,1) are two mutually independent random numbers; c1 adjusts the step size with which a particle flies toward its own best position, and c2 adjusts the step size with which it flies toward the global best position;
Xi = (xi1, xi2, ..., xin) is the current position of particle i;
Vi = (vi1, vi2, ..., vin) is the current velocity of particle i;
During evolution, the best position found so far by each particle, Pi = (pi1, pi2, ..., pin), and the global best position over all particles, Pg = (pg1, pg2, ..., pgn), are recorded;
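Formulas (2) and (3) can be sketched in plain Python as follows. This is an illustrative single-particle update, not the patent's implementation: the function name `pso_step`, the default values of `c1` and `c2`, and the injectable `rng` argument are assumptions. Fresh random numbers are drawn for each dimension, matching the subscript j on r1j and r2j.

```python
import random


def pso_step(x, v, p_best, g_best, c1=2.0, c2=2.0, rng=random):
    """Apply one velocity and position update to a single particle.

    x, v    : current position and velocity (lists of floats)
    p_best  : best position this particle has visited so far (P_i)
    g_best  : best position found by the whole swarm (P_g)
    """
    new_x, new_v = [], []
    for j in range(len(x)):
        r1, r2 = rng.random(), rng.random()  # r1j, r2j ~ U(0, 1)
        vj = v[j] + c1 * r1 * (p_best[j] - x[j]) + c2 * r2 * (g_best[j] - x[j])
        new_v.append(vj)           # formula (2)
        new_x.append(x[j] + vj)    # formula (3)
    return new_x, new_v
```

When a particle already sits at both its personal best and the global best, the two attraction terms vanish and the particle simply continues with its previous velocity.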
S24: Screen the rule set according to the following steps:
Compare each dimension of the particle with the training instances in the data set and calculate the particle's fitness value; compare this fitness value with those of the best positions visited so far, Pi and Pg; if it is better, take the current position as the current best position and the global best position;
That is, by evaluating the particle's fitness value on the training data set and comparing it with the particle's historical best position and with the global best position of all particles: if the fitness value is higher than that of the historical best position and of the global best position, the current position is taken as the current best position and the global best position;
S25: Determine whether the number of iterations has reached the maximum number of generations or all data have been correctly classified; if so, go to step S26; otherwise, return to step S23;
S26: Add the generated rule for the i-th class to the rule set, then remove the instances covered by the rule from the data set, and check whether the number of remaining instances is less than the set value; if so, rule extraction for this class is complete; otherwise, return to step S21.
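The outer loop formed by steps S21 to S26 is a sequential covering procedure, which can be sketched as below. The particle swarm search of steps S21 to S25 is abstracted into a caller-supplied function `learn_rule`; that function, the parameter `min_remaining`, the toy learner, and the early exit when a rule covers nothing are all assumptions added for illustration.

```python
def sequential_covering(records, labels, target, learn_rule, min_remaining=1):
    """Learn rules for one class, removing covered instances after each rule."""
    remaining = list(zip(records, labels))
    rules = []
    while len(remaining) >= min_remaining:
        covers = learn_rule(remaining, target)  # stands in for steps S21-S25
        kept = [(r, y) for r, y in remaining if not covers(r)]
        if len(kept) == len(remaining):
            break  # assumption: stop when the new rule covers no instance
        rules.append(covers)   # S26: put the rule into the rule set
        remaining = kept       # S26: drop the instances it covers
    return rules, remaining


def toy_learn_rule(remaining, target):
    """Hypothetical learner: cover exactly the first remaining target record."""
    first = next(r for r, y in remaining if y == target)
    return lambda record, first=first: record == first


rules, leftover = sequential_covering(
    [[1], [2], [3]], ["a", "a", "a"], "a", toy_learn_rule
)
```

With the toy learner each rule covers a single record, so three rules are produced and no instances remain.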
The classification rule test and evaluation module performs the test evaluation as follows:
Two classification evaluation methods, the holdout method and the cross-validation method, are used together for a comprehensive evaluation, from which the final classifier is obtained. The specific requirements for the test evaluation are as follows:
(1) Prediction accuracy: the prediction accuracy expresses how accurately a classification model classifies. The following factors commonly affect a classifier's classification error:
(a) the number of records in the training set;
(b) the number of attributes;
(c) whether the information in the attributes is relevant to the class information;
(d) whether the records to be predicted follow the same distribution as the records in the training set;
(2) Computational complexity: the computational complexity depends on the specific implementation details and the hardware environment;
(3) Conciseness of the model description: for descriptive classification tasks, the more concise the model description, the better.
Holdout method: the given data set is randomly divided into two independent sets, a training set and a test set; typically two-thirds of the data serve as the training set and the remaining one-third as the test set. The training set is used to derive the classifier, whose accuracy is then assessed on the test set; the assessment calculates, on the test set, the fitness value of the classification rules obtained in the first step. Random subsampling is a variation of the holdout method: the holdout method is repeated K times to estimate the prediction accuracy, and the K estimates are averaged to obtain the final prediction accuracy. The closer a classification rule's fitness values on the training set and the test set are, the higher the classification precision is taken to be.
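Random subsampling as described above can be sketched in Python as follows. The callback `train_and_score`, which is assumed to train on the given index list and return an accuracy on the test indices, is hypothetical, as are the parameter names and the fixed seed.

```python
import random


def repeated_holdout(n_samples, train_and_score, k=10, train_fraction=2 / 3, seed=0):
    """Average the holdout accuracy over k random train/test splits."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    cut = int(round(n_samples * train_fraction))  # size of the training set
    scores = []
    for _ in range(k):
        rng.shuffle(indices)  # a new random split on every repetition
        train_idx, test_idx = indices[:cut], indices[cut:]
        scores.append(train_and_score(train_idx, test_idx))
    return sum(scores) / len(scores)
```

Because each repetition draws its split independently, a record may land in the test set several times or not at all; cross-validation avoids this by testing every record exactly once.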
Cross-validation method: the initial data are first divided into K mutually disjoint subsets S1, S2, ..., Sk of roughly equal size. Training and testing are carried out K times. In the i-th iteration, Si serves as the test set and the remaining subsets are all used to train the classification model. That is, the classifier of the first iteration is trained on subsets S2, ..., Sk and tested on S1; the classifier of the second iteration is trained on subsets S1, S3, ..., Sk and tested on S2; and so on. The accuracy estimate is the total number of correct classifications over the K iterations divided by the total number of samples in the initial data. In stratified cross-validation, the subsets are stratified so that the class distribution of the samples in each fold is roughly the same as in the initial data. K is generally 10 by default.
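The K-fold accuracy estimate can be sketched as below. The callback `count_correct`, assumed to train on the training indices and return the number of correctly classified test samples, is hypothetical. The folds are built by interleaving indices, which assumes the data have already been shuffled; stratification is not shown.

```python
def k_fold_accuracy(n_samples, k, count_correct):
    """Estimate accuracy as total correct over k folds divided by n_samples."""
    # Fold i holds indices i, i + k, i + 2k, ...; sizes differ by at most one.
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    correct = 0
    for i, test_idx in enumerate(folds):
        # Train on every fold except the i-th one.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        correct += count_correct(train_idx, test_idx)
    return correct / n_samples
```

Every sample is used for testing exactly once, so the denominator is the size of the initial data set rather than the number of folds.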
The above is only one embodiment of the present invention; although it is described in specific detail, it should not be construed as limiting the scope of the patent. Its specific structure and dimensions may be adjusted as actually required. It should be noted that a person of ordinary skill in the art may make various modifications and improvements without departing from the inventive concept, and all of these fall within the scope of protection of the present invention.

Claims (10)

1. A particle swarm-based method for classifying massive organizational structure data, characterized by comprising the following steps:
Step 1: build data classification rules using a particle swarm; establish classification rules for different industries, and encode the different classification rules using Michigan encoding;
Step 2: obtain a preselected portion of the massive organizational structure data as a training set, perform data set rule covering according to the built classification rules, and generate a classification rule set;
Step 3: obtain another preselected portion of the massive organizational structure data as a test set, test and evaluate the classification rule set on the test set, and retain the classification rules that meet the test and evaluation requirements as the final classifier;
Step 4: classify the massive organizational structure data with the final classifier to obtain the classification results.
2. The particle swarm-based method for classifying massive organizational structure data according to claim 1, characterized in that the data classification rules in step 1 are built as follows:
Each classification rule comprises a condition part and a conclusion part; the condition part is a set of logical tests joined by logical connectives; the conclusion part is the class of the instances covered by (that is, satisfying) the condition part; each particle represents one record of a table in the organizational structure data entity tables.
3. The particle swarm-based method for classifying massive organizational structure data according to claim 1, characterized in that the data set rule covering in step 2 is performed according to the following steps:
S21: Initialize the rules. Initialize the population by randomly initializing the upper and lower bounds of each class within the valid interval of the classification variables; the initialized particle swarm forms the initial candidate solutions of the rule extraction algorithm, and an initial industry classification is set for the different organizational structure data;
S22: Determine the best particle in the swarm by calculating the particle fitness value with the following formula:
$$f(x) = \frac{tp}{pos} \cdot \frac{tn}{neg} \tag{1}$$
where:
f(x) is the fitness value of a particle, by which the best particle is determined;
tp is the number of correctly classified instances, that is, instances covered by the rule and correctly classified;
tn is the number of correctly rejected instances, that is, instances not covered by the rule whose class also differs from the training target;
pos is the total number of positive samples in the training data set;
neg is the total number of negative samples in the training data set;
S23: Update the particles of the candidate rule set according to the following formulas:
$$v_{ij}(t+1) = v_{ij}(t) + c_1 r_{1j}\,\bigl(p_{ij}(t) - x_{ij}(t)\bigr) + c_2 r_{2j}\,\bigl(p_{gj}(t) - x_{ij}(t)\bigr) \tag{2}$$
$$x_{ij}(t+1) = x_{ij}(t) + v_{ij}(t+1) \tag{3}$$
where
formula (2) is the velocity update equation for the j-th dimension of particle i;
formula (3) is the position update equation for the j-th dimension of particle i;
t denotes the t-th generation; c1 and c2 are acceleration constants with values from 0 to 2; r1 ~ U(0,1) and r2 ~ U(0,1) are two mutually independent random numbers; c1 adjusts the step size with which a particle flies toward its own best position, and c2 adjusts the step size with which it flies toward the global best position;
Xi = (xi1, xi2, ..., xin) is the current position of particle i;
Vi = (vi1, vi2, ..., vin) is the current velocity of particle i;
During evolution, the best position found so far by each particle, Pi = (pi1, pi2, ..., pin), and the global best position over all particles, Pg = (pg1, pg2, ..., pgn), are recorded;
S24: Screen the rule set according to the following step:
Determine the current best position and the global best position by evaluating the particle fitness value on the training data set;
S25: Determine whether the number of iterations has reached the maximum number of generations or all data have been correctly classified; if so, go to step S26; otherwise, return to step S23;
S26: Add the generated rule for the i-th class to the rule set, then remove the instances covered by the rule from the data set, and check whether the number of remaining instances is less than the set value; if so, rule extraction for this class is complete; otherwise, return to step S21.
4. The particle swarm-based method for classifying massive organizational structure data according to claim 1, characterized in that the test evaluation in step 3 is performed as follows:
The holdout method is used for comprehensive evaluation: first, the given data set is randomly divided into two independent sets, a training set and a test set, with two-thirds of the data as the training set and one-third as the test set; the training set is used to derive the classifier, its accuracy is then assessed on the test set, and the fitness value on the test set of the classification rules obtained in the first step is calculated;
Then, with random subsampling, the holdout method is repeated K times to estimate the prediction accuracy;
Finally, the K prediction accuracies are averaged to obtain the final prediction accuracy; the closer a classification rule's fitness values on the training set and the test set are, the higher the classification precision.
5. The particle swarm-based method for classifying massive organizational structure data according to claim 1, characterized in that the test evaluation in step 4 is performed as follows:
The cross-validation method is used for comprehensive evaluation: first, the initial data are divided into K mutually disjoint subsets S1, S2, ..., Sk of equal size;
Then, training and testing are carried out K times: in the i-th iteration, Si serves as the test set and the remaining subsets are all used to train the classification model; that is, the classifier of the first iteration is trained on subsets S2, ..., Sk and tested on S1; the classifier of the second iteration is trained on subsets S1, S3, ..., Sk and tested on S2; and so on;
Finally, the accuracy estimate is the total number of correct classifications over the K iterations divided by the total number of samples in the initial data.
6. A particle swarm-based system for classifying massive organizational structure data, characterized by comprising a data classification rule building module, a classification rule set generation module, a classification rule test and evaluation module, and a data classification module;
The data classification rule building module builds data classification rules using a particle swarm; it establishes classification rules for different industries and encodes the different classification rules using Michigan encoding;
The classification rule set generation module obtains a preselected portion of the massive organizational structure data as a training set, performs data set rule covering according to the built classification rules, and generates a classification rule set;
The classification rule test and evaluation module obtains another preselected portion of the massive organizational structure data as a test set, tests and evaluates the classification rule set on the test set, and retains the classification rules that meet the test and evaluation requirements as the final classifier;
The data classification module uses the final classifier to classify the massive organizational structure data and obtain the classification results.
7. The particle swarm-based system for classifying massive organizational structure data according to claim 6, characterized in that the data classification rule building module builds the data classification rules as follows:
Each classification rule comprises a condition part and a conclusion part;
The condition part is a set of logical tests joined by logical connectives; the conclusion part is the class of the instances covered by (that is, satisfying) the condition part;
Each particle represents one record of a table in the organizational structure data entity tables.
8. The particle swarm-based system for classifying massive organizational structure data according to claim 6, characterized in that the classification rule set generation module performs data set rule covering according to the following steps:
S21: Initialize the rules. Initialize the population by randomly initializing the upper and lower bounds of each class within the valid interval of the classification variables; the initialized particle swarm forms the initial candidate solutions of the rule extraction algorithm, and an initial industry classification is set for the different organizational structures;
S22: Determine the best particle in the swarm by calculating the particle fitness value with the following formula:
$$f(x) = \frac{tp}{pos} \cdot \frac{tn}{neg} \tag{1}$$
where:
f(x) is the fitness value of a particle, by which the best particle is determined;
tp is the number of correctly classified instances, that is, instances covered by the rule and correctly classified;
tn is the number of correctly rejected instances, that is, instances not covered by the rule whose class also differs from the training target;
pos is the total number of positive samples in the training data set;
neg is the total number of negative samples in the training data set;
S23: Update the particles of the candidate rule set according to the following formulas:
$$v_{ij}(t+1) = v_{ij}(t) + c_1 r_{1j}\,\bigl(p_{ij}(t) - x_{ij}(t)\bigr) + c_2 r_{2j}\,\bigl(p_{gj}(t) - x_{ij}(t)\bigr) \tag{2}$$
$$x_{ij}(t+1) = x_{ij}(t) + v_{ij}(t+1) \tag{3}$$
where
formula (2) is the velocity update equation for the j-th dimension of particle i;
formula (3) is the position update equation for the j-th dimension of particle i;
t denotes the t-th generation; c1 and c2 are acceleration constants with values from 0 to 2; r1 ~ U(0,1) and r2 ~ U(0,1) are two mutually independent random numbers; c1 adjusts the step size with which a particle flies toward its own best position, and c2 adjusts the step size with which it flies toward the global best position;
Xi = (xi1, xi2, ..., xin) is the current position of particle i;
Vi = (vi1, vi2, ..., vin) is the current velocity of particle i;
During evolution, the best position found so far by each particle, Pi = (pi1, pi2, ..., pin), and the global best position over all particles, Pg = (pg1, pg2, ..., pgn), are recorded;
S24: Screen the rule set according to the following step:
Determine the current best position and the global best position by evaluating the particle fitness value on the training data set;
S25: Determine whether the number of iterations has reached the maximum number of generations or all data have been correctly classified; if so, go to step S26; otherwise, return to step S23;
S26: Add the generated rule for the i-th class to the rule set, then remove the instances covered by the rule from the data set, and check whether the number of remaining instances is less than the set value; if so, rule extraction for this class is complete; otherwise, return to step S21.
9. The particle swarm-based system for classifying massive organizational structure data according to claim 6, characterized in that the classification rule test and evaluation module performs the test evaluation as follows:
The holdout method is used for comprehensive evaluation: first, the given data set is randomly divided into two independent sets, a training set and a test set, with two-thirds of the data as the training set and one-third as the test set; the training set is used to derive the classifier, its accuracy is then assessed on the test set, and the fitness value on the test set of the classification rules obtained in the first step is calculated;
Then, with random subsampling, the holdout method is repeated K times to estimate the prediction accuracy;
Finally, the K prediction accuracies are averaged to obtain the final prediction accuracy; the closer a classification rule's fitness values on the training set and the test set are, the higher the classification precision.
10. The particle swarm-based system for classifying massive organizational structure data according to claim 9, characterized in that the classification rule test and evaluation module performs the test evaluation as follows:
The cross-validation method is used for comprehensive evaluation: first, the initial data are divided into K mutually disjoint subsets S1, S2, ..., Sk of equal size;
Then, training and testing are carried out K times: in the i-th iteration, Si serves as the test set and the remaining subsets are all used to train the classification model; that is, the classifier of the first iteration is trained on subsets S2, ..., Sk and tested on S1; the classifier of the second iteration is trained on subsets S1, S3, ..., Sk and tested on S2; and so on;
Finally, the accuracy estimate is the total number of correct classifications over the K iterations divided by the total number of samples in the initial data.
CN201510027069.8A 2015-01-20 2015-01-20 Massive organizational structure data classification method and system based on particle swarm Pending CN104537383A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510027069.8A CN104537383A (en) 2015-01-20 2015-01-20 Massive organizational structure data classification method and system based on particle swarm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510027069.8A CN104537383A (en) 2015-01-20 2015-01-20 Massive organizational structure data classification method and system based on particle swarm

Publications (1)

Publication Number Publication Date
CN104537383A true CN104537383A (en) 2015-04-22

Family

ID=52852903

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510027069.8A Pending CN104537383A (en) 2015-01-20 2015-01-20 Massive organizational structure data classification method and system based on particle swarm

Country Status (1)

Country Link
CN (1) CN104537383A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105069470A (en) * 2015-07-29 2015-11-18 腾讯科技(深圳)有限公司 Classification model training method and device
CN105427129A (en) * 2015-11-12 2016-03-23 腾讯科技(深圳)有限公司 Information delivery method and system
CN113361661A (en) * 2021-07-20 2021-09-07 红云红河烟草(集团)有限责任公司 Modeling method and device for data cooperation capability evaluation
CN113361661B (en) * 2021-07-20 2023-04-07 红云红河烟草(集团)有限责任公司 Modeling method and device for evaluating data cooperation capability
CN114581058A (en) * 2022-03-10 2022-06-03 杭州电子科技大学 Personnel organization structure optimization method based on business process
CN114581058B (en) * 2022-03-10 2023-08-18 杭州电子科技大学 Personnel organization structure optimization method based on business process
CN115269939A (en) * 2022-09-28 2022-11-01 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Regular expression generation method and device, intelligent terminal and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20150422
