CN104765839A - Data classifying method based on correlation coefficients between attributes - Google Patents


Info

Publication number
CN104765839A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510180290.7A
Other languages
Chinese (zh)
Inventor
裴廷睿
赵津锋
郭勋
朱更明
李哲涛
田淑娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiangtan University
Original Assignee
Xiangtan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2015-04-16
Filing date
2015-04-16
Publication date
2015-07-08
Application filed by Xiangtan University


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a data classification method based on correlation coefficients between attributes. The method comprises: inputting a sample set and a data set to be classified, and computing the information gain Gain of every attribute of the training sample set; sorting the gains in descending order and taking the attribute with the largest Gain as the test attribute of the root node W of a decision tree; computing the absolute value P' of the correlation coefficient between the root node attribute (or, at deeper levels, the attribute of the node one layer above) and each attribute in the remaining attribute set; building the nodes of each layer according to P' and the values of the different attributes, and updating the remaining attribute set R; and, once all attributes have been traversed, generating the decision tree and classifying the data to be classified with it. The method is stated to improve both the efficiency and the classification accuracy of a conventional decision tree.

Description

A data classification method based on correlation coefficients between attributes
Technical Field
The invention belongs to the field of data mining and relates to data classification, specifically to a data classification method based on correlation coefficients between attributes.
Background
Data mining extracts latent patterns from the data in a database and then derives corresponding rules from those patterns. Using computers, data mining techniques process and analyze large volumes of database data quickly and effectively, extract useful information from it, and express that information in a transformed, easily understood form to support decision-making. Data mining has important practical value and broad application prospects in business decision-making, knowledge bases, and scientific and medical research.
Current research on data mining concentrates on association rule mining, clustering, classification, sequential pattern discovery, and anomaly and trend discovery. Because classification mining is widely used in business and other fields, it has become one of the most active research directions in data mining. The goal of classification is to build a classification function or classification model (a classifier) that maps each data item in a database to one of a given set of classes.
Because classification techniques provide good decision support across industries, classification algorithms based on many different methods have emerged, such as decision tree methods, neural network methods, Bayesian methods, and rough set methods. Among these, decision tree methods are the easiest to understand and are especially widely applied. Decision tree learning is a method for approximating discrete-valued target functions in which the function learned from a set of training data is represented as a decision tree. It is commonly used for predictive modeling: by classifying large amounts of data purposefully, it uncovers valuable latent information. Although decision trees are simple to generate, they have the following problems: 1) errors in individual training samples may degrade the accuracy of the tree; 2) the relationships between attributes receive insufficient attention, which easily leads to repeated subtrees or to an attribute being tested several times along one path of the tree.
Summary of the Invention
The object of the invention is to overcome the above shortcomings of the prior art by providing a data classification method based on correlation coefficients between attributes. The method reduces sensitivity to errors in individual training samples, and the use of correlation coefficients prevents an attribute from being tested repeatedly along a path.
The method proceeds as follows. First, a sample set and a data set to be classified are input, and the information gain Gain of every attribute of the training sample set is computed. Second, the gains are sorted in descending order and the attribute with the largest Gain is chosen as the test attribute of the root node W of the decision tree. Then the absolute value P' of the correlation coefficient between the root node attribute (or the attribute of the node one layer above) and each attribute in the remaining attribute set is computed. Next, the nodes of each layer are built according to the P' values and the values of the different attributes, and the remaining attribute set R is updated. Finally, once all attributes have been traversed, the decision tree is generated and the data to be classified is classified according to it. The concrete steps are:
Step 1: input the sample set and the data set to be classified, and compute the information gain Gain of every attribute of the training sample set;
Step 2: sort the information gains Gain in descending order, and choose the attribute with the largest Gain as the test attribute of the root node W of the decision tree;
Step 3: compute the absolute value P' of the correlation coefficient between the root node attribute (or the attribute of the node one layer above) and each attribute in the remaining attribute set;
Step 4: build the nodes of each layer according to the P' values and the values of the different attributes, and update the remaining attribute set R;
Step 5: if the remaining attribute set R is not empty, that is, not all attributes have been traversed, repeat Steps 3 and 4 until all attributes have been traversed, then generate the decision tree;
Step 6: classify the data set to be classified according to the decision tree.
The invention has the following advantages:
1. Every step of building the decision tree uses all current training examples, which reduces sensitivity to errors in individual training samples and improves classification accuracy.
2. By computing the correlation coefficients between attributes, the invention emphasizes the correlations between attributes and solves the problem of an attribute being tested repeatedly along one path of the decision tree.
Brief Description of the Drawings
Fig. 1 is a flowchart of the invention;
Fig. 2 is a flowchart of the child-node construction process of the decision tree in the invention;
Fig. 3 is a schematic diagram of decision tree construction in one embodiment of the invention (the purchasing power of a company's clients).
Detailed Description
To illustrate the invention more clearly, it is described using an embodiment based on sample data about a company's clients. The attributes are sales frequency (two values, given as numbers of sales), annual gross output value (two values, given in units of ten thousand yuan), credit standing (bad, good), client type (private, state-run, private), product sector (industry, agriculture), and province or city (Hunan, Jiangxi, Shanghai). The method builds a decision tree from the classified sample data; given a client's information, the tree then outputs the class of that client's purchasing power (high, average, or low).
With reference to the drawings and the embodiment, the concrete steps of the invention are as follows:
Step 1: input the sample set and the data set to be classified, and compute the information gain Gain of every attribute of the training sample data. The sub-steps are:
1) The data set to be classified is given. The training sample set $S$ contains $s$ samples, has the attribute set $A$, and can be divided into $m$ distinct classes $C_i$ ($i = 1, 2, \dots, m$), where $s_i$ denotes the number of samples in class $C_i$. The attribute set contains $n$ attributes, that is, $A = \{A_1, A_2, \dots, A_n\}$, and each attribute $A_k$ has $v$ distinct values $\{a_1, a_2, \dots, a_v\}$. These values divide the sample set $S$ into subsets $\{S_1, S_2, \dots, S_v\}$, where $|S_j|$ denotes the number of samples whose value of attribute $A_k$ is $a_j$. In this example (see Fig. 3), $m = 3$, $n = 6$, and $A$ = {sales frequency, annual gross output value, credit standing, client type, product sector, province or city}; class $C_1$ means the client's purchasing power is high, class $C_2$ means it is average, and class $C_3$ means it is low;
2) Compute the expected information needed to classify a sample, that is, the total information entropy $I(s_1, s_2, \dots, s_m) = -\sum_{i=1}^{m} p_i \log_2 p_i$, where $p_i = s_i / s$ and $i = 1, 2, \dots, m$;
3) Compute the expected information $E(A_k)$ of each attribute of the samples, that is, $E(A_k) = \sum_{j=1}^{v} \frac{s_{1j} + s_{2j} + \dots + s_{mj}}{s} \, I(s_{1j}, s_{2j}, \dots, s_{mj})$ with $I(s_{1j}, s_{2j}, \dots, s_{mj}) = -\sum_{i=1}^{m} p_{ij} \log_2 p_{ij}$, where $p_{ij} = s_{ij} / |S_j|$, $s_{ij}$ denotes the number of samples of class $C_i$ whose value of attribute $A_k$ is $a_j$, $s$ is the total number of training samples, $i = 1, 2, \dots, m$, $j = 1, 2, \dots, v$, and $k = 1, 2, \dots, n$. In this example (see Fig. 3), $E(\text{sales frequency})$ denotes the expected information of the sales frequency attribute, $s_{1j}$ denotes the number of clients whose sales frequency takes the value $a_j$ and whose purchasing power is high, and $s$ is the number of all clients;
4) Compute the information gain of each attribute of the samples, that is, $\mathrm{Gain}(A_k) = I(s_1, s_2, \dots, s_m) - E(A_k)$. In this example (see Fig. 3), $\mathrm{Gain}(\text{sales frequency})$ denotes the information gain of sales frequency and $\mathrm{Gain}(\text{province or city})$ denotes the information gain of province or city.
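The entropy, expected-information, and gain computations of Step 1 can be sketched in a few lines. This is only an illustrative sketch under stated assumptions: the function names, the dictionary-based row format, and the toy data are hypothetical and do not come from the patent, and the formulas are the standard ID3 definitions that the sub-steps describe.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def expected_info(rows, labels, attr):
    """Weighted entropy of the class labels after partitioning on `attr`."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)
    total = len(labels)
    return sum(len(part) / total * entropy(part) for part in partitions.values())


def info_gain(rows, labels, attr):
    """Information gain of `attr`: total entropy minus expected information."""
    return entropy(labels) - expected_info(rows, labels, attr)


# Hypothetical toy data (not the patent's sample set).
rows = [
    {"sales_frequency": "often", "credit": "good"},
    {"sales_frequency": "often", "credit": "bad"},
    {"sales_frequency": "rarely", "credit": "good"},
    {"sales_frequency": "rarely", "credit": "bad"},
]
labels = ["high", "high", "low", "low"]

gains = {attr: info_gain(rows, labels, attr) for attr in rows[0]}
print(gains)
```

In this toy set, sales frequency separates the two classes perfectly while credit standing carries no information about the class, so the former receives the larger gain.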
Step 2: sort the information gains Gain in descending order, and choose the attribute with the largest Gain as the test attribute of the root node W of the decision tree. That is:
Sort the attribute set $A$ in descending order of information gain, and choose the attribute with the largest Gain as the test attribute $A_W$ of the root node W, so that $A_W = \arg\max_{A_k \in A} \mathrm{Gain}(A_k)$. In this example (see Fig. 3), sales frequency is chosen as the test attribute of the root node W.
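Root selection then reduces to a descending sort. The sketch below assumes the gains are already available in a dictionary; the function name `choose_root` and the numeric gain values are invented for illustration and are not taken from the patent.

```python
def choose_root(gains):
    """Rank attributes by information gain, largest first, and split off the root."""
    ranked = sorted(gains, key=gains.get, reverse=True)
    return ranked[0], ranked[1:]


# Hypothetical gain values, invented purely for illustration.
gains = {
    "annual_output": 0.27,
    "sales_frequency": 0.41,
    "client_type": 0.08,
    "credit": 0.19,
}

root, remaining = choose_root(gains)
print(root, remaining)
```

Because Python's sort is stable, attributes with equal gain keep their original dictionary order; the patent does not say how ties should be broken.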
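Step 3: compute the absolute value P' of the correlation coefficient between the root node attribute (or the attribute of the node one layer above) and each attribute in the remaining attribute set. The sub-steps are:
1) Compute the variance of each attribute and the covariance between the root node attribute (or the attribute of the node one layer above) and each remaining attribute, that is, the variances $\operatorname{Var}(A_W) = \mathbb{E}[(A_W - \mathbb{E}[A_W])^2]$ and $\operatorname{Var}(A_k) = \mathbb{E}[(A_k - \mathbb{E}[A_k])^2]$, and the covariance $\operatorname{Cov}(A_W, A_k) = \mathbb{E}[(A_W - \mathbb{E}[A_W])(A_k - \mathbb{E}[A_k])]$, where $\mathbb{E}[\cdot]$ is the mean over the training samples, $A_k$ belongs to the remaining attribute set, and $A_k \neq A_W$. In this example (see Fig. 3), $\operatorname{Var}(\text{sales frequency})$ denotes the variance of sales frequency and $\operatorname{Cov}(\text{sales frequency}, \text{annual gross output value})$ denotes the covariance of sales frequency and annual gross output value;
2) Compute the absolute value P' of the correlation coefficient between the root node attribute (or the attribute of the node one layer above) and each remaining attribute, that is, $P'(A_W, A_k) = \left| \operatorname{Cov}(A_W, A_k) / \sqrt{\operatorname{Var}(A_W)\,\operatorname{Var}(A_k)} \right|$, where $0 \le P' \le 1$. In this example (see Fig. 3), five P' values are computed for the root node sales frequency, and three P' values for the nodes credit standing and annual gross output value.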
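A minimal sketch of the P' computation follows. Two things here are assumptions rather than part of the patent: categorical values are mapped to integer codes in order of first appearance (the patent does not say how attribute values are made numeric), and a constant attribute is given P' = 0 by convention. All names and data are hypothetical.

```python
from math import sqrt


def encode(values):
    """Map categorical values to integer codes in order of first appearance.

    This encoding is an assumption made for the sketch.
    """
    codes = {}
    return [codes.setdefault(v, len(codes)) for v in values]


def abs_correlation(xs, ys):
    """Absolute value of the Pearson correlation coefficient of two numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    if var_x == 0 or var_y == 0:
        return 0.0  # constant attribute: treated as uncorrelated (a convention chosen here)
    return abs(cov / sqrt(var_x * var_y))


# Hypothetical columns of a toy sample set.
columns = {
    "sales_frequency": ["often", "often", "rarely", "rarely"],
    "annual_output": ["large", "large", "small", "small"],
    "credit": ["good", "bad", "good", "bad"],
}

root = "sales_frequency"
root_codes = encode(columns[root])
p_prime = {
    name: abs_correlation(root_codes, encode(values))
    for name, values in columns.items()
    if name != root
}
print(p_prime)
```

Here annual output moves in lockstep with sales frequency, so its P' is 1, while credit standing is uncorrelated with the root and gets 0.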
Step 4: build the nodes of each layer according to the P' values and the values of the different attributes, and update the remaining attribute set R. The sub-steps are:
1) Initialize the remaining attribute set, that is, $R = A - \{A_W\}$, where $A_W$ denotes the test attribute of the root node W. In this example (see Fig. 3), $A_W$ is sales frequency, that is, sales frequency is the test attribute of the root node;
2) Sort the P' values in descending order, that is, $P'_1 \ge P'_2 \ge \dots \ge P'_{|R|}$, take the first $v$ values (where $v$ is the number of distinct values of the root node attribute), use the corresponding attributes as the test attributes of the child nodes of the root node, and remove those attributes from the remaining attribute set $R$. In this example (see Fig. 3), the root node sales frequency has five P' values; because this attribute has only two values, the first two P' values are taken, which gives annual gross output value and credit standing as the child-node test attributes;
3) Because the attribute $A_W$ has $v$ distinct values $\{a_1, a_2, \dots, a_v\}$, the sample set $S$ can be divided into $\{S_1, S_2, \dots, S_v\}$. Compute the information content of each value of the root node attribute (or the attribute of the node one layer above), $I(s_{1j}, s_{2j}, \dots, s_{mj}) = -\sum_{i=1}^{m} p_{ij} \log_2 p_{ij}$, where $p_{ij} = s_{ij} / |S_j|$ and $j = 1, 2, \dots, v$. In this example (see Fig. 3), the information content is computed for each of the two values of the root node sales frequency;
4) For each value $a_j$ of the root node attribute (or the attribute of the node one layer above), compute the expected information of each child-node attribute, that is, $E(A_c \mid a_j) = \sum_{l=1}^{u} \frac{s_{1jl} + s_{2jl} + \dots + s_{mjl}}{|S_j|} \, I(s_{1jl}, s_{2jl}, \dots, s_{mjl})$ with $I(s_{1jl}, s_{2jl}, \dots, s_{mjl}) = -\sum_{i=1}^{m} p_{ijl} \log_2 p_{ijl}$, where $p_{ijl} = s_{ijl} / (s_{1jl} + s_{2jl} + \dots + s_{mjl})$, $A_c$ denotes a remaining attribute other than the root node attribute, $A_W$ is the parent attribute of $A_c$, $A_c$ has $u$ distinct values $\{b_1, b_2, \dots, b_u\}$, $s_{ijl}$ denotes the number of samples of class $C_i$ for which $A_W$ takes the value $a_j$ and the child-node attribute $A_c$ takes the value $b_l$, $i = 1, 2, \dots, m$, $j = 1, 2, \dots, v$, and $l = 1, 2, \dots, u$. In this example (see Fig. 3), $E(A_c \mid a_j)$ denotes the expected information of annual gross output value and of credit standing under each of the two values of sales frequency, and $s_{1jl}$ denotes the number of clients with a given sales frequency, a given annual gross output value (in ten thousand yuan), and high purchasing power;
5) For each value $a_j$ of the root node attribute (or the attribute of the node one layer above), compute the information gain of each child-node attribute, that is, $\mathrm{Gain}(A_c \mid a_j) = I(s_{1j}, s_{2j}, \dots, s_{mj}) - E(A_c \mid a_j)$, sort the gains under each value $a_j$ in descending order, and choose the attribute with the largest gain as the child-node test attribute for that value of the root node attribute (or the attribute of the node one layer above). This completes the construction of the child nodes of the root node (or of the node one layer above). In this example (see Fig. 3), under the first value of sales frequency the information gain of annual gross output value is larger than that of credit standing, so annual gross output value becomes the child-node test attribute under that value; the remaining attribute set is now $R$ = {client type, product sector, province or city};
6) Compute the absolute value P' of the correlation coefficient between each attribute in the current remaining attribute set and each attribute of the layer above. For the same remaining attribute, compare its P' values with the different upper-layer attributes, make it the test attribute of a child node of the upper-layer attribute node with the larger P', and update the remaining attribute set $R$. In this example (see Fig. 3), the P' value of client type and annual gross output value is larger than the P' value of client type and credit standing, so client type is selected as the test attribute of the child node under the parent node annual gross output value; the remaining attribute set is now empty;
7) Build the child nodes of the remaining layers by repeating sub-steps 2), 3), 4), and 5) of the child-node construction process of the root node (or of the node one layer above).
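The per-value child selection of sub-steps 3) to 5) can be sketched as follows: for each value of the parent attribute, the samples are restricted to that value and the candidate attribute with the largest gain inside the subset is chosen. This is an illustrative sketch only; the function names, the row format, and the toy data are hypothetical, and the candidate list is assumed to have been produced already by the P' ranking of sub-step 2).

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def info_gain(rows, labels, attr):
    """Entropy of `labels` minus the weighted entropy after splitting on `attr`."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)
    total = len(labels)
    expected = sum(len(p) / total * entropy(p) for p in partitions.values())
    return entropy(labels) - expected


def best_child_per_value(rows, labels, parent, candidates):
    """For each value of `parent`, pick the candidate with the largest conditional gain."""
    result = {}
    for value in sorted({row[parent] for row in rows}):
        sub_rows = [r for r in rows if r[parent] == value]
        sub_labels = [l for r, l in zip(rows, labels) if r[parent] == value]
        result[value] = max(
            candidates, key=lambda a: info_gain(sub_rows, sub_labels, a)
        )
    return result


# Hypothetical toy data: (sales_frequency, annual_output, credit, class label).
data = [
    ("often", "large", "good", "high"),
    ("often", "small", "good", "low"),
    ("often", "large", "bad", "high"),
    ("often", "small", "bad", "low"),
    ("rarely", "large", "good", "high"),
    ("rarely", "large", "bad", "low"),
    ("rarely", "small", "good", "high"),
    ("rarely", "small", "bad", "low"),
]
rows = [
    {"sales_frequency": f, "annual_output": o, "credit": c} for f, o, c, _ in data
]
labels = [label for *_, label in data]

children = best_child_per_value(
    rows, labels, "sales_frequency", ["annual_output", "credit"]
)
print(children)
```

In this toy set, annual output decides the class when sales are frequent and credit standing decides it when sales are rare, so each branch of the root receives a different child test attribute.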
Step 5: if the remaining attribute set R is not empty, that is, not all attributes have been traversed, repeat Steps 3 and 4 until all attributes have been traversed, then generate the decision tree.
Step 6: classify the data set to be classified according to the decision tree.
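Classification in Step 6 amounts to walking the finished tree from the root to a leaf. The sketch below assumes a nested-dictionary tree representation, which is a choice made here for illustration and not something the patent specifies; the tree contents, attribute values, and the `default` fallback for unseen values are likewise hypothetical.

```python
def classify(tree, sample, default=None):
    """Follow the branches matching `sample` until a leaf (a class label) is reached."""
    node = tree
    while isinstance(node, dict):
        value = sample.get(node["attribute"])
        if value not in node["branches"]:
            return default  # attribute value not seen while building the tree
        node = node["branches"][value]
    return node


# Hypothetical tree: internal nodes are dicts, leaves are class labels.
tree = {
    "attribute": "sales_frequency",
    "branches": {
        "often": {
            "attribute": "annual_output",
            "branches": {"large": "high", "small": "average"},
        },
        "rarely": {
            "attribute": "credit",
            "branches": {"good": "average", "bad": "low"},
        },
    },
}

to_classify = [
    {"sales_frequency": "often", "annual_output": "large", "credit": "bad"},
    {"sales_frequency": "rarely", "annual_output": "large", "credit": "bad"},
    {"sales_frequency": "never", "annual_output": "small", "credit": "good"},
]
predictions = [classify(tree, sample, default="unknown") for sample in to_classify]
print(predictions)
```

Returning a default for unseen values is one possible design choice; another would be to fall back to the majority class of the node where the walk stops.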

Claims (4)

1. A data classification method based on correlation coefficients between attributes, characterized in that, in the data classification process, a sample set and a data set to be classified are first input, and the information gain Gain of every attribute of the samples is computed and sorted; the attribute of the root node of the decision tree is then determined from the information gains Gain of all attributes; next, the attributes of the remaining nodes are determined from the absolute value P' of the correlation coefficient between attributes and from the values of the different attributes; finally, once all attributes have been traversed, the decision tree is generated and the data set to be classified is classified according to it; the method comprising at least the following steps:
Step 1: input the sample set and the data set to be classified, and compute the information gain Gain of every attribute of the training sample set;
Step 2: sort the information gains Gain in descending order, and choose the attribute with the largest Gain as the test attribute of the root node W of the decision tree;
Step 3: compute the absolute value P' of the correlation coefficient between the root node attribute (or the attribute of the node one layer above) and each attribute in the remaining attribute set;
Step 4: build the nodes of each layer according to the P' values and the values of the different attributes, and update the remaining attribute set R;
Step 5: if the remaining attribute set R is not empty, that is, not all attributes have been traversed, repeat Steps 3 and 4 until all attributes have been traversed, then generate the decision tree;
Step 6: classify the data set to be classified according to the decision tree.
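2. The decision tree data classification method based on correlation coefficients between attributes according to claim 1, characterized in that the selection of the test attribute of the root node W further comprises at least the following steps:
1) The data set to be classified is given. The training sample set $S$ contains $s$ samples, has the attribute set $A$, and can be divided into $m$ distinct classes $C_i$ ($i = 1, 2, \dots, m$), where $s_i$ denotes the number of samples in class $C_i$. The attribute set contains $n$ attributes, that is, $A = \{A_1, A_2, \dots, A_n\}$, and each attribute $A_k$ has $v$ distinct values $\{a_1, a_2, \dots, a_v\}$. These values divide the sample set $S$ into subsets $\{S_1, S_2, \dots, S_v\}$, where $|S_j|$ denotes the number of samples whose value of attribute $A_k$ is $a_j$;
2) Compute the expected information needed to classify a sample, that is, the total information entropy $I(s_1, s_2, \dots, s_m) = -\sum_{i=1}^{m} p_i \log_2 p_i$, where $p_i = s_i / s$ and $i = 1, 2, \dots, m$;
3) Compute the expected information $E(A_k)$ of each attribute of the samples, that is, $E(A_k) = \sum_{j=1}^{v} \frac{s_{1j} + s_{2j} + \dots + s_{mj}}{s} \, I(s_{1j}, s_{2j}, \dots, s_{mj})$ with $I(s_{1j}, s_{2j}, \dots, s_{mj}) = -\sum_{i=1}^{m} p_{ij} \log_2 p_{ij}$, where $p_{ij} = s_{ij} / |S_j|$, $s_{ij}$ denotes the number of samples of class $C_i$ whose value of attribute $A_k$ is $a_j$, $s$ is the total number of training samples, $i = 1, 2, \dots, m$, $j = 1, 2, \dots, v$, and $k = 1, 2, \dots, n$;
4) Compute the information gain of each attribute of the samples, that is, $\mathrm{Gain}(A_k) = I(s_1, s_2, \dots, s_m) - E(A_k)$;
5) Sort the attribute set $A$ in descending order of information gain, and choose the attribute with the largest Gain as the test attribute $A_W$ of the root node W of the decision tree, so that $A_W = \arg\max_{A_k \in A} \mathrm{Gain}(A_k)$.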
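3. The decision tree data classification method based on correlation coefficients between attributes according to claim 1, characterized in that the computation of the absolute value P' of the correlation coefficient between the root node attribute (or the attribute of the node one layer above) and each attribute in the remaining attribute set further comprises at least the following steps:
1) Compute the variance of each attribute and the covariance between the root node attribute (or the attribute of the node one layer above) and each remaining attribute, that is, the variances $\operatorname{Var}(A_W) = \mathbb{E}[(A_W - \mathbb{E}[A_W])^2]$ and $\operatorname{Var}(A_k) = \mathbb{E}[(A_k - \mathbb{E}[A_k])^2]$, and the covariance $\operatorname{Cov}(A_W, A_k) = \mathbb{E}[(A_W - \mathbb{E}[A_W])(A_k - \mathbb{E}[A_k])]$, where $\mathbb{E}[\cdot]$ is the mean over the training samples, $A_k$ belongs to the remaining attribute set, and $A_k \neq A_W$;
2) Compute the absolute value P' of the correlation coefficient between the root node attribute (or the attribute of the node one layer above) and each remaining attribute, that is, $P'(A_W, A_k) = \left| \operatorname{Cov}(A_W, A_k) / \sqrt{\operatorname{Var}(A_W)\,\operatorname{Var}(A_k)} \right|$, where $0 \le P' \le 1$.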
4. The decision tree data classification method based on correlation coefficients between attributes according to claim 1, characterized in that building the nodes of each layer according to the P' values and the values of the different attributes further comprises at least the following steps:
1) Initialize the remaining attribute set, that is, $R = A - \{A_W\}$, where $A_W$ denotes the attribute of the root node;
2) Sort the P' values in descending order, that is, $P'_1 \ge P'_2 \ge \dots \ge P'_{|R|}$, take the first $v$ values (where $v$ is the number of distinct values of the root node attribute), use the corresponding attributes as the test attributes of the child nodes of the root node, and remove those attributes from the remaining attribute set $R$;
3) Because the attribute $A_W$ has $v$ distinct values $\{a_1, a_2, \dots, a_v\}$, the sample set $S$ can be divided into $\{S_1, S_2, \dots, S_v\}$. Compute the information content of each value of the root node attribute (or the attribute of the node one layer above), $I(s_{1j}, s_{2j}, \dots, s_{mj}) = -\sum_{i=1}^{m} p_{ij} \log_2 p_{ij}$, where $p_{ij} = s_{ij} / |S_j|$ and $j = 1, 2, \dots, v$;
4) For each value $a_j$ of the root node attribute (or the attribute of the node one layer above), compute the expected information of each child-node attribute, that is, $E(A_c \mid a_j) = \sum_{l=1}^{u} \frac{s_{1jl} + s_{2jl} + \dots + s_{mjl}}{|S_j|} \, I(s_{1jl}, s_{2jl}, \dots, s_{mjl})$ with $I(s_{1jl}, s_{2jl}, \dots, s_{mjl}) = -\sum_{i=1}^{m} p_{ijl} \log_2 p_{ijl}$, where $p_{ijl} = s_{ijl} / (s_{1jl} + s_{2jl} + \dots + s_{mjl})$, $A_c$ denotes a remaining attribute other than the root node attribute, $A_W$ is the parent attribute of $A_c$, $A_c$ has $u$ distinct values $\{b_1, b_2, \dots, b_u\}$, $s_{ijl}$ denotes the number of samples of class $C_i$ for which $A_W$ takes the value $a_j$ and the child-node attribute $A_c$ takes the value $b_l$, $i = 1, 2, \dots, m$, $j = 1, 2, \dots, v$, and $l = 1, 2, \dots, u$;
5) For each value $a_j$ of the root node attribute (or the attribute of the node one layer above), compute the information gain of each child-node attribute, that is, $\mathrm{Gain}(A_c \mid a_j) = I(s_{1j}, s_{2j}, \dots, s_{mj}) - E(A_c \mid a_j)$, sort the gains under each value $a_j$ in descending order, and choose the attribute with the largest gain as the child-node test attribute for that value of the root node attribute (or the attribute of the node one layer above), which completes the construction of the child nodes of the root node (or of the node one layer above);
6) Compute the absolute value P' of the correlation coefficient between each attribute in the current remaining attribute set and each attribute of the layer above; for the same remaining attribute, compare its P' values with the different upper-layer attributes, make it the test attribute of a child node of the upper-layer attribute node with the larger P', and update the remaining attribute set $R$;
7) Build the child nodes of every remaining layer by repeating steps 2), 3), 4), 5), and 6) of the child-node construction process of the root node (or of the node one layer above).
CN201510180290.7A 2015-04-16 2015-04-16 Data classifying method based on correlation coefficients between attributes Pending CN104765839A (en)

Publications (1)

Publication Number Publication Date
CN104765839A true CN104765839A (en) 2015-07-08

Family

ID=53647667


Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106886519A (en) * 2015-12-15 2017-06-23 中国移动通信集团公司 A kind of attribute processing methods and server
CN106650807A (en) * 2016-12-20 2017-05-10 东南大学 Method for predicting and evaluating concrete strength deterioration under ocean environment
CN106650807B (en) * 2016-12-20 2019-10-11 东南大学 A kind of concrete in marine environment strength deterioration prediction and evaluation method
CN108509962A (en) * 2017-02-28 2018-09-07 优信互联(北京)信息技术有限公司 A kind of method and its device of identification information of vehicles
WO2019019375A1 (en) * 2017-07-26 2019-01-31 平安科技(深圳)有限公司 Method and apparatus for creating underwriting decision tree, and computer device and storage medium
CN107610771A (en) * 2017-08-23 2018-01-19 上海电力学院 A kind of medical science Testing index screening technique based on decision tree
CN107894827A (en) * 2017-10-31 2018-04-10 广东欧珀移动通信有限公司 Using method for cleaning, device, storage medium and electronic equipment
US10831733B2 (en) 2017-12-22 2020-11-10 International Business Machines Corporation Interactive adjustment of decision rules
CN108665309A (en) * 2018-05-08 2018-10-16 多盟睿达科技(中国)有限公司 A kind of advertisement matrix crowd localization method and system based on big data
CN108665309B (en) * 2018-05-08 2021-11-19 多盟睿达科技(中国)有限公司 Advertisement matrix crowd positioning method and system based on big data
CN108960294A (en) * 2018-06-12 2018-12-07 中国科学技术大学 A kind of mobile robot classification of landform method of view-based access control model
CN109784362A (en) * 2018-12-05 2019-05-21 国网辽宁省电力有限公司信息通信分公司 A kind of DGA shortage of data value interpolating method based on iteration KNN and interpolation priority
CN110377605A (en) * 2019-07-24 2019-10-25 贵州大学 A kind of Sensitive Attributes identification of structural data and classification stage division
CN110377605B (en) * 2019-07-24 2023-04-25 贵州大学 Sensitive attribute identification and classification method for structured data
CN113362089A (en) * 2020-03-02 2021-09-07 北京沃东天骏信息技术有限公司 Attribute feature extraction method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by sipo to initiate substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150708
