CN104765839A

CN104765839A - Data classifying method based on correlation coefficients between attributes

Info

Publication number: CN104765839A
Application number: CN201510180290.7A
Authority: CN
Inventors: 裴廷睿; 赵津锋; 郭勋; 朱更明; 李哲涛; 田淑娟
Original assignee: Xiangtan University
Current assignee: Xiangtan University
Priority date: 2015-04-16
Filing date: 2015-04-16
Publication date: 2015-07-08

Abstract

The invention provides a data classification method based on correlation coefficients between attributes. The method comprises: inputting a training sample set and a data set to be classified, and calculating the information gain value Gain of every attribute of the training sample set; sorting the Gain values in descending order and taking the attribute with the largest Gain as the test attribute of the root node W of a decision tree; calculating the absolute value P' of the correlation coefficient between the root node attribute (or, at deeper levels, the attribute of the node on the layer above) and each attribute in the remaining attribute set; establishing the nodes of each layer according to P' and the values of the different attributes, and updating the remaining attribute set R; and finally, once all attributes have been traversed, generating the decision tree and classifying the data to be classified according to it. The method greatly improves the effectiveness of the traditional decision tree and raises its classification accuracy.

Description

A data classification method based on correlation coefficients between attributes

Technical field

The invention belongs to the field of data mining and relates to data classification, specifically to a data classification method based on correlation coefficients between attributes.

Background art

Data mining uncovers latent patterns among the data in a database and then derives corresponding rules from those patterns. Data mining technology uses computers to process and analyze large volumes of database data quickly and effectively, extracts useful information from them, and expresses that information in a converted, easily understood form to support decision-making. Data mining has important application value and broad prospects in business decision-making, knowledge bases, and scientific and medical research.

Current research on data mining concentrates mainly on association rule mining, clustering, classification, sequential pattern discovery, and anomaly and trend discovery. Because classification mining is widely used in business and other fields, it has become the most active research direction in data mining. The goal of classification is to produce a classification function or classification model (a classifier) that maps each data item in a database to one of a set of given classes.

Because classification techniques provide good decision support across industries, many classification algorithms have emerged from different fields, such as decision tree methods, neural network methods, Bayesian methods, and rough set methods. Among these, the decision tree method is the easiest to understand and is especially widely applied. Decision tree learning is a method for approximating discrete-valued target functions: the function learned from a set of training data is represented as a decision tree. It is commonly used for predictive models and finds valuable, latent information by purposefully classifying large amounts of data. Although the decision tree model is simple to generate, it has the following problems: 1) errors in individual training samples may degrade the accuracy of the decision tree; 2) the relationships between attributes receive insufficient attention, which easily causes subtrees to be repeated in the decision tree or some attributes to be tested repeatedly along a single path of the tree.

Summary of the invention

The object of the invention is to overcome the above shortcomings of the prior art by proposing a data classification method based on correlation coefficients between attributes. The method reduces sensitivity to errors in individual training samples, and the introduction of the correlation coefficient prevents an attribute from being tested repeatedly along a path.

The steps of the invention are as follows. First, input the sample set and the data set to be classified, and calculate the information gain value Gain of every attribute of the training sample set. Second, sort the Gain values in descending order and select the attribute with the largest Gain as the test attribute of the root node W of the decision tree. Then calculate the absolute value P' of the correlation coefficient between the root node attribute (the attribute of the node on the layer above) and the remaining attribute set. Next, establish the nodes of each layer according to the P' values and the attribute values of the different attributes, and update the remaining attribute set R. Finally, once all attributes have been traversed, generate the decision tree and classify the data to be classified according to it. The concrete steps are as follows:

Step 1: input the sample set and the data set to be classified, and calculate the information gain value Gain of every attribute of the training sample set;

Step 2: sort the information gain values Gain in descending order and select the attribute with the largest Gain as the test attribute of the root node W of the decision tree;

Step 3: calculate the absolute value P' of the correlation coefficient between the root node attribute (the attribute of the node on the layer above) and the remaining attribute set;

Step 4: establish the nodes of each layer according to the P' values and the attribute values of the different attributes, and update the remaining attribute set R;

Step 5: if the remaining attribute set R is not empty, i.e. not all attributes have been traversed, repeat steps 3 and 4 until all attributes have been traversed, then generate the decision tree;

Step 6: classify the data set to be classified according to the decision tree.
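For reference, the quantities named in these steps can be written out with the standard ID3 and Pearson definitions. The notation below is assumed for illustration and is not taken from the source: s is the number of training samples, s_i the number of samples in class i of m classes, s_ij the number of samples of class i that take the j-th of the v values of attribute A, and D and Cov denote the variance and covariance of numerically coded attributes X and Y.

```latex
\begin{align}
I(s_1,\dots,s_m) &= -\sum_{i=1}^{m} p_i \log_2 p_i, \qquad p_i = \frac{s_i}{s} \\
E(A) &= \sum_{j=1}^{v} \frac{s_{1j}+\dots+s_{mj}}{s}\, I(s_{1j},\dots,s_{mj}) \\
\mathrm{Gain}(A) &= I(s_1,\dots,s_m) - E(A) \\
P'(X,Y) &= \frac{\lvert \mathrm{Cov}(X,Y) \rvert}{\sqrt{D(X)\,D(Y)}}
\end{align}
```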

The invention has the following advantages:

1. The invention uses all current training samples at each step of building the decision tree, which reduces sensitivity to errors in individual training samples and improves classification accuracy;

2. By calculating the correlation coefficients between attributes, the invention emphasizes the correlations between attributes and solves the problem of some attributes being tested repeatedly along a single path of the decision tree.

Brief description of the drawings

Fig. 1 is a flowchart of the invention;

Fig. 2 is a flowchart of the process of establishing the child nodes of the decision tree in the invention;

Fig. 3 is a schematic diagram of establishing the decision tree for one embodiment of the invention (the purchasing power of a company's customers).

Embodiment

To illustrate the invention more clearly, it is described in detail using sample data on the customers of a certain company as an embodiment. The attributes are sales frequency (two attribute values, expressed as a number of sales), annual total output value (two attribute values, expressed in units of ten thousand yuan), credit rating (attribute values: bad, good), customer type (attribute values: private, state-run, private), product industry (attribute values: industry, agriculture), and province or city (attribute values: Hunan, Jiangxi, Shanghai). The method builds a decision tree from the classified sample data; with this decision tree, customer information can be input and the class of the customer's purchasing power (classes: high, general, low) is output.

With reference to the drawings and the embodiment, the concrete steps of the invention are as follows:

Step 1: input the sample set and the data set to be classified, and calculate the information gain value Gain of every attribute of the training sample data. The concrete steps are as follows:

1) The data set to be classified and the training sample set are known. The training sample set has an attribute set and can be divided into several distinct classes, each containing a known number of samples. The attribute set contains several attributes, and each attribute has several distinct attribute values. The values of an attribute partition the sample set into subsets, and the number of samples having each attribute value is recorded. Referring to Fig. 3, this example has three classes and six attributes: the attribute set is {sales frequency, annual total output value, credit rating, customer type, product industry, province or city}, and the three classes represent high, general, and low customer purchasing power respectively;

2) Calculate the expected information needed to classify the samples, i.e. the total information entropy, from the proportion of training samples that belong to each class;

3) Calculate the average information expectation of each attribute of the samples. It is computed from the number of samples that have a given value of the attribute and belong to a given class, together with the total number of training samples. Referring to Fig. 3, in this example the quantities involved are the average information expectation of the attribute sales frequency, the number of company customers with a given sales frequency and high purchasing power, and the total number of customers;

4) Calculate the information gain value Gain of each attribute of the samples, i.e. the total information entropy minus the average information expectation of that attribute. Referring to Fig. 3, in this example a Gain value is obtained for each attribute, from sales frequency through province or city.
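The four sub-steps above can be sketched in Python as follows. This is a minimal illustration using the standard ID3 definitions of entropy and information gain; the function names, attribute names, and toy rows are hypothetical and are not taken from the patent's embodiment data.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Total information entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def expected_information(rows, labels, attribute):
    """Average information expectation after partitioning on one attribute."""
    total = len(rows)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    return sum(len(part) / total * entropy(part) for part in partitions.values())


def information_gain(rows, labels, attribute):
    """Gain = total entropy minus the attribute's average information expectation."""
    return entropy(labels) - expected_information(rows, labels, attribute)


# Hypothetical toy samples (not the patent's data).
rows = [
    {"sales_frequency": "few", "credit": "good"},
    {"sales_frequency": "few", "credit": "bad"},
    {"sales_frequency": "many", "credit": "good"},
    {"sales_frequency": "many", "credit": "bad"},
]
labels = ["low", "low", "high", "high"]

gains = {attr: information_gain(rows, labels, attr) for attr in rows[0]}
```

In this toy set the class follows sales frequency exactly, so that attribute receives the full entropy as its gain while credit receives none.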

Step 2: sort the information gain values Gain in descending order and select the attribute with the largest Gain as the test attribute of the root node W of the decision tree, that is:

Sort the attribute set in descending order of Gain and select the attribute corresponding to the largest Gain as the test attribute of the root node W of the decision tree. Referring to Fig. 3, in this example sales frequency is selected as the test attribute of the root node of the decision tree.
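As a small illustration of this selection, assuming the Gain values are already held in a dictionary (the numbers below are hypothetical, chosen only so that sales frequency ranks first as in the example):

```python
# Hypothetical Gain values; the real values depend on the training samples.
gains = {"annual_output": 0.31, "sales_frequency": 0.42, "credit": 0.17}

# Sort attributes by Gain in descending order and take the first as the root.
ranked = sorted(gains, key=gains.get, reverse=True)
root = ranked[0]
```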

Step 3: calculate the absolute value P' of the correlation coefficient between the root node attribute (the attribute of the node on the layer above) and the remaining attribute set. The concrete steps are as follows:

1) Calculate the variance of each attribute and the covariance between the root node attribute (the attribute of the node on the layer above) and each attribute in the remaining attribute set. Referring to Fig. 3, in this example these include the variance of sales frequency and the covariance of sales frequency and annual total output value;

2) Calculate the absolute value P' of the correlation coefficient between the root node attribute (the attribute of the node on the layer above) and each attribute in the remaining attribute set from the covariance and the two variances. Referring to Fig. 3, in this example the root node sales frequency has 5 P' values, and the nodes credit rating and annual total output value each have 3 P' values.
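A minimal sketch of this calculation is given below, using the standard Pearson correlation coefficient. It assumes the attribute values have been coded as numbers, which the text does not describe; the integer codes, the column names, and the handling of constant attributes are all illustrative assumptions.

```python
from math import sqrt


def mean(xs):
    return sum(xs) / len(xs)


def abs_correlation(xs, ys):
    """Absolute Pearson correlation |Cov(X, Y)| / sqrt(D(X) * D(Y))."""
    mx, my = mean(xs), mean(ys)
    cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    var_x = mean([(x - mx) ** 2 for x in xs])
    var_y = mean([(y - my) ** 2 for y in ys])
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant attribute is treated as uncorrelated
    return abs(cov) / sqrt(var_x * var_y)


# Hypothetical integer-coded attribute columns (one entry per sample).
sales_frequency = [0, 0, 1, 1, 1, 0]
annual_output = [0, 0, 1, 1, 0, 0]
credit = [1, 0, 1, 0, 1, 0]

# P' between the root attribute and each remaining attribute.
p_prime = {
    "annual_output": abs_correlation(sales_frequency, annual_output),
    "credit": abs_correlation(sales_frequency, credit),
}
```

Taking the absolute value means that strongly negatively correlated attributes rank as high as strongly positively correlated ones.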

Step 4: establish the nodes of each layer according to the P' values and the attribute values of the different attributes, and update the remaining attribute set R. The concrete steps are as follows:

1) Initialize the remaining attribute set R as the attribute set with the root node test attribute removed. Referring to Fig. 3, in this example sales frequency is the root node test attribute;

2) Sort the P' values in descending order, select the first several P' values, one for each attribute value of the root node attribute, take the corresponding attributes as the test attributes of the child nodes of the root node, and remove them from the remaining attribute set R. Referring to Fig. 3, in this example the root node sales frequency has 5 P' values; because this attribute has only 2 attribute values, the first 2 P' values are taken, which gives annual total output value and credit rating as the child node test attributes;

3) The distinct attribute values of the root node attribute divide the sample set into subsets. Calculate the information quantity of the root node attribute (the attribute of the node on the layer above) under each of its attribute values. Referring to Fig. 3, in this example the information quantity is calculated for each of the two values of the root node sales frequency;

4) For each attribute value of the root node attribute (the attribute of the node on the layer above), calculate the average information expectation of each child node attribute. Here the child node attributes are remaining attributes other than the root node attribute, the root node attribute is their parent attribute, and the calculation uses the number of samples that have a given value of the parent attribute, a given value of the child node attribute, and a given class. Referring to Fig. 3, in this example this gives the average information expectation of annual total output value and of credit rating under each of the two values of sales frequency, using counts such as the number of customers with a given sales frequency, a given annual total output value (in ten thousand yuan), and high purchasing power;

5) For each attribute value of the root node attribute (the attribute of the node on the layer above), calculate the information gain value of each child node attribute, sort these values in descending order separately under each attribute value, and select the attribute with the larger value as the child node test attribute for that attribute value. This completes the establishment of the child nodes of the root node (the node on the layer above). Referring to Fig. 3, in this example, under one value of sales frequency the information gain of annual total output value is larger than that of credit rating, so annual total output value is taken as the child node test attribute under that value of sales frequency. The remaining attribute set is now R = {customer type, product industry, province or city};

6) Calculate the absolute value P' of the correlation coefficient between each attribute in the current remaining attribute set and each attribute on the layer above. For the same remaining attribute, compare its P' values with the different attributes on the layer above, attach it as the test attribute of a child node to the upper-layer attribute node for which P' is larger, and update the remaining attribute set R. Referring to Fig. 3, in this example the P' value of customer type and annual total output value is larger than the P' value of customer type and credit rating, so customer type is selected as the test attribute of the child node under the parent node annual total output value. The remaining attribute set R is now the empty set;

7) Following sub-steps 2), 3), 4), and 5) of the process for establishing the child nodes of the root node (the node on the layer above), complete the establishment of the child nodes of the remaining layers.
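Sub-steps 2) to 5) can be sketched as below. This is one possible reading of the procedure, not a definitive implementation: the helper names, the dictionary-based sample format, and the toy data are hypothetical, and the P' values are supplied directly rather than computed.

```python
from collections import Counter
from math import log2


def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain(rows, labels, attribute):
    """Information gain of `attribute` on the given subset of samples."""
    total = len(rows)
    parts = {}
    for row, label in zip(rows, labels):
        parts.setdefault(row[attribute], []).append(label)
    expected = sum(len(p) / total * entropy(p) for p in parts.values())
    return entropy(labels) - expected


def build_children(rows, labels, root, p_prime):
    """Choose one child test attribute for each value of the root attribute."""
    root_values = sorted({row[root] for row in rows})
    # Sub-step 2: keep the attributes with the largest P', one per root value,
    # and remove them from the remaining attribute set.
    ranked = sorted(p_prime, key=p_prime.get, reverse=True)
    candidates = ranked[: len(root_values)]
    remaining = [a for a in ranked if a not in candidates]
    children = {}
    for value in root_values:
        # Sub-steps 3 to 5: restrict to the samples with this root value and
        # pick the candidate with the largest information gain on that subset.
        subset = [(r, l) for r, l in zip(rows, labels) if r[root] == value]
        sub_rows = [r for r, _ in subset]
        sub_labels = [l for _, l in subset]
        children[value] = max(candidates, key=lambda a: gain(sub_rows, sub_labels, a))
    return children, remaining


# Hypothetical toy samples and P' values (not the patent's data).
rows = [
    {"freq": "few", "output": "small", "credit": "good", "industry": "agri"},
    {"freq": "few", "output": "small", "credit": "bad", "industry": "ind"},
    {"freq": "few", "output": "large", "credit": "good", "industry": "agri"},
    {"freq": "few", "output": "large", "credit": "bad", "industry": "ind"},
    {"freq": "many", "output": "small", "credit": "good", "industry": "agri"},
    {"freq": "many", "output": "large", "credit": "good", "industry": "ind"},
    {"freq": "many", "output": "small", "credit": "bad", "industry": "agri"},
    {"freq": "many", "output": "large", "credit": "bad", "industry": "ind"},
]
labels = ["low", "low", "general", "general", "high", "high", "general", "general"]
p_prime = {"output": 0.7, "credit": 0.5, "industry": 0.1}

children, remaining = build_children(rows, labels, "freq", p_prime)
```

Here the P' ranking narrows the candidates to two attributes before any information gain is computed, and the gain on each subset then decides which candidate is tested under each branch of the root.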

Step 5: if the remaining attribute set R is not empty, i.e. not all attributes have been traversed, repeat steps 3 and 4 until all attributes have been traversed, then generate the decision tree.

Step 6: classify the data set to be classified according to the decision tree.
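Classification with a finished tree amounts to following one branch per tested attribute until a class label is reached. The sketch below assumes a nested-dictionary tree representation and a fallback for unseen attribute values; both are illustrative choices, and the tree shown is hypothetical rather than the one in Fig. 3.

```python
def classify(tree, sample, default=None):
    """Walk a nested-dict decision tree until a class label (leaf) is reached."""
    node = tree
    while isinstance(node, dict):
        value = sample.get(node["attribute"])
        if value not in node["branches"]:
            return default  # assumption: unseen values fall back to a default class
        node = node["branches"][value]
    return node


# Hypothetical tree: inner nodes test an attribute, leaves are class labels.
tree = {
    "attribute": "sales_frequency",
    "branches": {
        "few": {
            "attribute": "annual_output",
            "branches": {"small": "low", "large": "general"},
        },
        "many": {
            "attribute": "credit",
            "branches": {"good": "high", "bad": "general"},
        },
    },
}

result = classify(tree, {"sales_frequency": "many", "credit": "good"})
```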

Claims

1. A data classification method based on correlation coefficients between attributes, characterized in that, in the data classification process, a sample set and a data set to be classified are first input, and the information gain values Gain of all attributes of the samples are calculated and sorted; the attribute of the root node of the decision tree is then determined according to the information gain values Gain of all attributes; next, the attributes of the remaining nodes are determined according to the absolute value P' of the correlation coefficient between attributes and the attribute values of the different attributes; finally, once all attributes have been traversed, the decision tree is generated and the data set to be classified is classified according to the decision tree; the method comprising at least the following steps:

Step 6: classify the data set to be classified according to the decision tree.

2. The decision tree data classification method based on correlation coefficients between attributes according to claim 1, characterized in that the selection of the root node test attribute further comprises at least the following steps:

1) The data set to be classified and the training sample set are known. The training sample set has an attribute set and can be divided into several distinct classes, each containing a known number of samples. The attribute set contains several attributes, and each attribute has several distinct attribute values. The values of an attribute partition the sample set into subsets, and the number of samples having each attribute value is recorded;

3) Calculate the average information expectation of each attribute of the samples from the number of samples that have a given value of the attribute and belong to a given class, together with the total number of training samples;

4) Calculate the information gain value Gain of each attribute of the samples;

5) Sort the attribute set in descending order of Gain and select the attribute corresponding to the largest Gain as the test attribute of the root node W of the decision tree.

3. The decision tree data classification method based on correlation coefficients between attributes according to claim 1, characterized in that the calculation of the absolute value P' of the correlation coefficient between the root node attribute (the attribute of the node on the layer above) and the remaining attribute set further comprises at least the following steps:

1) Calculate the variance of each attribute and the covariance between the root node attribute (the attribute of the node on the layer above) and each attribute in the remaining attribute set;

2) Calculate the absolute value P' of the correlation coefficient between the root node attribute (the attribute of the node on the layer above) and each attribute in the remaining attribute set.

4. The decision tree data classification method based on correlation coefficients between attributes according to claim 1, characterized in that establishing the nodes of each layer according to the P' values and the attribute values of the different attributes further comprises at least the following steps:

1) Initialize the remaining attribute set R as the attribute set with the attribute of the root node removed;

2) Sort the P' values in descending order, select the first several P' values, take the corresponding attributes as the test attributes of the child nodes of the root node, and remove them from the remaining attribute set R;

3) The distinct attribute values of the root node attribute divide the sample set into subsets; calculate the information quantity of the root node attribute (the attribute of the node on the layer above) under each of its attribute values;

4) For each attribute value of the root node attribute (the attribute of the node on the layer above), calculate the average information expectation of each child node attribute, where the child node attributes are remaining attributes other than the root node attribute, the root node attribute is their parent attribute, and the calculation uses the number of samples that have a given value of the parent attribute, a given value of the child node attribute, and a given class;

5) For each attribute value of the root node attribute (the attribute of the node on the layer above), calculate the information gain value of each child node attribute, sort these values in descending order separately under each attribute value, and select the attribute with the larger value as the child node test attribute for that attribute value, thereby completing the establishment of the child nodes of the root node (the node on the layer above);

6) Calculate the absolute value P' of the correlation coefficient between each attribute in the current remaining attribute set and each attribute on the layer above; for the same remaining attribute, compare its P' values with the different attributes on the layer above, attach it as the test attribute of a child node to the upper-layer attribute node for which P' is larger, and update the remaining attribute set R;

7) Following sub-steps 2), 3), 4), and 5) and sub-step 6) of the process for establishing the child nodes of the root node (the node on the layer above), complete the establishment of the child nodes of every remaining layer.