CN104765839A - Data classifying method based on correlation coefficients between attributes - Google Patents
- Publication number
- CN104765839A (application number CN201510180290.7A)
- Authority
- CN
- China
- Prior art keywords
- attribute
- value
- node
- root node
- namely
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a data classification method based on correlation coefficients between attributes. The method comprises: inputting a training sample set and a data set to be classified, and calculating the information gain value Gain of every attribute of the training sample set; sorting the Gain values in descending order and taking the attribute with the largest Gain as the test attribute of the root node W of a decision tree; calculating the absolute value P' of the correlation coefficient between the root-node attribute (in general, the attribute of the node one layer above) and the remaining attribute set; building the nodes of each layer according to P' and the values of the different attributes, and updating the remaining attribute set R; and finally, once all attributes have been traversed, generating the decision tree and classifying the data to be classified according to it. The method is stated to improve the effectiveness of a traditional decision tree and its classification accuracy.
Description
Technical field
The invention belongs to the field of data mining and relates to data classification, specifically to a data classification method based on correlation coefficients between attributes.
Background art
Data mining uncovers latent patterns in the data held in a database and derives corresponding rules from those patterns. Data mining technology uses computers to process and analyze the large volumes of data in a database quickly and effectively, extracts useful information from them, and expresses that information in a form that is easy to understand, so as to support decision-making. Data mining has important application value and broad prospects in commercial decision-making, knowledge bases, and scientific and medical research.
Current research on data mining concentrates mainly on association rule mining, clustering, classification, sequential pattern discovery, and anomaly and trend discovery. Because classification mining is widely used in business and other fields, it has become the most active research direction in data mining. The goal of classification is to produce a classification function or classification model (a classifier) that maps data items in a database to one of a given set of classes.
Because classification techniques provide good decision support across industries, many classification algorithms have emerged from different fields, such as decision tree methods, neural network methods, Bayesian methods and rough set methods. Among these, the decision tree method is the easiest to understand and is especially widely applied. Decision tree learning is a method for approximating a discrete-valued target function, in which the function learned from a set of training data is represented as a decision tree. It is commonly used for predictive modeling: by classifying large amounts of data purposefully, it finds valuable, latent information. Although decision trees are simple to generate, they have the following problems: 1) errors in individual training samples may degrade the accuracy of the decision tree; 2) the relationships between attributes receive insufficient attention, which easily leads to repeated subtrees in the decision tree or to some attribute being tested more than once along a single path of the tree.
Summary of the invention
The object of the invention is to overcome the above shortcomings of the prior art by proposing a data classification method based on correlation coefficients between attributes, which reduces sensitivity to errors in individual training samples; introducing the correlation coefficient avoids an attribute being tested repeatedly along a path.
The method proceeds as follows. First, the sample set and the data set to be classified are input, and the information gain value Gain of every attribute of the training sample set is calculated. Second, the Gain values are sorted in descending order and the attribute with the largest Gain is selected as the test attribute of the root node W of the decision tree. Then the absolute value P' of the correlation coefficient between the root-node attribute (the attribute of the node one layer above) and the remaining attribute set is calculated. Next, the nodes of each layer are built according to the P' values and the attribute values of the different attributes, and the remaining attribute set R is updated. Finally, once all attributes have been traversed, the decision tree is generated and the data to be classified are classified according to it. The specific steps are as follows:
Step one: input the sample set and the data set to be classified, and calculate the information gain value Gain of every attribute of the training sample set;
Step two: sort the information gain values Gain in descending order and select the attribute with the largest Gain as the test attribute of the root node W of the decision tree;
Step three: calculate the absolute value P' of the correlation coefficient between the root-node attribute (the attribute of the node one layer above) and the remaining attribute set;
Step four: build the nodes of each layer according to the P' values and the attribute values of the different attributes, and update the remaining attribute set R;
Step five: if the remaining attribute set R is not empty, that is, not all attributes have been traversed, repeat steps three and four until all attributes have been traversed, then generate the decision tree;
Step six: classify the data set to be classified according to the decision tree.
The invention has the following advantages:
1. The invention uses all current training samples at every step of building the decision tree, which reduces sensitivity to errors in individual training samples and improves classification accuracy;
2. By calculating the correlation coefficients between attributes, the invention emphasizes the correlations between attributes and solves the problem of some attribute being tested repeatedly along a single path of the decision tree.
Brief description of the drawings
Fig. 1 is a flowchart of the invention;
Fig. 2 is a flowchart of the child-node building process of the decision tree in the invention;
Fig. 3 is a schematic diagram of building the decision tree in one embodiment of the invention (the purchasing power of a company's customers).
Detailed description of the embodiment
To illustrate the invention more clearly, it is described in detail using an embodiment based on sample data about the customers of a company. The attributes are: sales frequency (attribute values not reproduced in this text), annual gross output value (attribute values not reproduced in this text), credit rating (attribute values: bad, good), customer type (attribute values: private, state-run, private), product industry (attribute values: industry and agriculture), and province or city (attribute values: Hunan, Jiangxi, Shanghai). The method builds a decision tree from the already classified sample data; with this tree, customer information can be input and the customer's purchasing power class (high, average or low) is output.
With reference to the drawings and the embodiment, the specific steps of the invention are as follows:
Step one: input the sample set and the data set to be classified, and calculate the information gain value Gain of every attribute of the training sample data. The specific steps are as follows:
1) The data set to be classified is given. The training sample set has an attribute set and can be divided into a number of distinct classes, and the number of samples in each class is recorded. The attribute set contains a number of attributes, and each attribute has a number of distinct attribute values, so the attribute values partition the sample set; for each attribute, the number of samples taking each attribute value is recorded. (The symbols and formulas for these quantities are not reproduced in this text.) Referring to Fig. 3, in this example the attribute set is {sales frequency, annual gross output value, credit rating, customer type, product industry, province or city}, and the three classes denote high, average and low customer purchasing power respectively;
2) Calculate the expected information needed to classify a sample, that is, the total information entropy [formula not reproduced];
3) Calculate the average expected information of each attribute of the samples [formula not reproduced]. The formula uses the number of samples of each class that take a given attribute value under a given attribute, together with the total number of training samples. Referring to Fig. 3, in this example one such quantity is the average expected information of the attribute sales frequency, which is computed from, among others, the number of company customers with a given sales frequency and high purchasing power and the total number of customers;
4) Calculate the information gain value Gain of each attribute of the samples [formula not reproduced]. Referring to Fig. 3, in this example this yields, for instance, the information gain value of sales frequency and the information gain value of province or city.
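The formulas for step one are not reproduced in this text. As a rough illustration only, the following Python sketch computes an information gain in the standard ID3 manner (total entropy minus the expected entropy after splitting on an attribute), which is an assumption about what the omitted formulas express. The function names, the `purchasing_power` key and the four toy samples are hypothetical and are not taken from the patent.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Total entropy minus the expected entropy after splitting on `attribute`."""
    total_entropy = entropy([row[target] for row in rows])
    # Group the class labels by the value each sample takes for `attribute`.
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    # Weight each group's entropy by its share of the samples.
    expected = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return total_entropy - expected


# Hypothetical toy samples; the values are illustrative, not from the patent.
samples = [
    {"credit": "good", "industry": "industry", "purchasing_power": "high"},
    {"credit": "good", "industry": "agriculture", "purchasing_power": "high"},
    {"credit": "bad", "industry": "industry", "purchasing_power": "low"},
    {"credit": "bad", "industry": "agriculture", "purchasing_power": "low"},
]
gains = {a: information_gain(samples, a, "purchasing_power")
         for a in ("credit", "industry")}
```

In this toy data `credit` separates the two classes completely and therefore receives the larger gain, while `industry` carries no information about the class.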
Step two: sort the information gain values Gain in descending order and select the attribute with the largest Gain as the test attribute of the root node W of the decision tree. That is:
Sort the attribute set in descending order of information gain value Gain, and select the attribute corresponding to the largest Gain as the test attribute of the root node W of the decision tree [formula not reproduced]. Referring to Fig. 3, in this example sales frequency is selected as the test attribute of the root node W.
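Step two amounts to a descending sort followed by taking the first element. A minimal sketch, assuming the gains are already available as a mapping from attribute name to Gain value; the function name `choose_root` and the numeric values are hypothetical and serve only as an illustration:

```python
def choose_root(gains):
    """Return the attribute with the largest gain and the full descending ranking."""
    ranked = sorted(gains, key=gains.get, reverse=True)
    return ranked[0], ranked


# Hypothetical gain values, for illustration only.
example_gains = {"annual_output": 0.31, "sales_frequency": 0.42, "credit": 0.18}
root, ranking = choose_root(example_gains)
```

Keeping the full ranking alongside the root is a convenience for inspection; the method as described only needs the first entry.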
Step three: calculate the absolute value P' of the correlation coefficient between the root-node attribute (the attribute of the node one layer above) and the remaining attribute set. The specific steps are as follows:
1) Calculate the variance of each attribute and the covariance between the root-node attribute (the attribute of the node one layer above) and each attribute in the remaining attribute set [formulas not reproduced]. Referring to Fig. 3, in this example these include the variance of sales frequency and the covariance of sales frequency and annual gross output value;
2) Calculate the absolute value P' of the correlation coefficient between the root-node attribute (the attribute of the node one layer above) and the remaining attribute set [formula not reproduced]. Referring to Fig. 3, in this example there are 5 P' values for the root node sales frequency, and 3 P' values for the nodes credit rating and annual gross output value.
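The variance, covariance and correlation formulas of step three are not reproduced in this text. The sketch below assumes that P' is the absolute value of the ordinary Pearson correlation coefficient and that attribute values have already been encoded as numbers; both points are assumptions, as are the function name `abs_correlation` and the sample values.

```python
from math import sqrt


def abs_correlation(x, y):
    """Absolute Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Population variance and covariance; the 1/n factors cancel in the ratio.
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    if var_x == 0 or var_y == 0:
        # A constant attribute has no defined correlation; treat it as 0 here.
        return 0.0
    return abs(cov / sqrt(var_x * var_y))


# Hypothetical numeric encodings of two attributes over four samples.
parent_attribute = [1, 2, 3, 4]
other_attribute = [8, 6, 4, 2]
p_prime = abs_correlation(parent_attribute, other_attribute)
```

Taking the absolute value means that a strongly negative relationship, as in this toy data, ranks as highly as a strongly positive one.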
Step four: build the nodes of each layer according to the P' values and the attribute values of the different attributes, and update the remaining attribute set R. The specific steps are as follows:
1) Initialize the remaining attribute set R by removing the test attribute of the root node W from the attribute set [formula not reproduced]. Referring to Fig. 3, in this example sales frequency is the test attribute of the root node W;
2) Sort the P' values in descending order, then select the leading P' values (the number selected is given by a symbol not reproduced in this text), take the corresponding attributes as the test attributes of the child nodes of the root node, and update the remaining attribute set R [formulas not reproduced]. Referring to Fig. 3, in this example the root node sales frequency has 5 P' values; because this attribute has only 2 attribute values, the first 2 P' values are taken, which gives annual gross output value and credit rating as the child-node test attributes;
3) The distinct attribute values of the root-node attribute (the attribute of the node one layer above) partition the sample set into subsets; calculate the information content of each attribute value of that attribute [formula not reproduced]. Referring to Fig. 3, in this example the information content is calculated for each of the two sales-frequency values;
4) For each attribute value of the root-node attribute (the attribute of the node one layer above), calculate the average expected information of each child-node attribute [formula not reproduced]. The child-node attributes are the remaining attributes other than the root-node attribute, and the root-node attribute is their parent attribute; the formula uses the number of samples that take a given value of the parent attribute, take a given value of the child-node attribute, and belong to a given class. Referring to Fig. 3, in this example this gives the average expected information of annual gross output value and of credit rating under each of the two sales-frequency values, computed from counts such as the number of customers with a given sales frequency, a given annual gross output value (in units of ten thousand yuan) and high purchasing power;
5) For each attribute value of the root-node attribute (the attribute of the node one layer above), calculate the information gain value of each child-node attribute [formula not reproduced], sort these gain values in descending order separately for each attribute value, and select the attribute with the largest gain as the child-node test attribute for that attribute value. This completes the building of the child nodes of the root node (the node one layer above). Referring to Fig. 3, in this example, under one of the sales-frequency values the information gain of annual gross output value is larger than that of credit rating, so annual gross output value is taken as the child-node test attribute under that value; the remaining attribute set is now R = {customer type, product industry, province or city};
6) Calculate the absolute value P' of the correlation coefficient between the current remaining attribute set and the attributes of the layer above; for the same remaining attribute, compare its P' values with the different attributes of the layer above, make it the test attribute of a child node of the upper-layer attribute node with the larger P', and update the remaining attribute set R. Referring to Fig. 3, in this example the P' value of customer type with annual gross output value is larger than the P' value of customer type with credit rating, so customer type is selected as the test attribute of a child node under the parent node annual gross output value; the remaining attribute set R is now the empty set;
7) Following sub-steps 2), 3), 4) and 5) of the child-node building process for the root node (the node one layer above), complete the building of the child nodes of the remaining layers.
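Sub-steps 2) and 5) of step four can be sketched as follows. This is one possible reading rather than the patent's own procedure: it assumes that the number of candidates taken equals the number of attribute values of the parent attribute (as in the example), and that the gain within each branch is the standard ID3 information gain computed on the samples of that branch. All function names, attribute names and data values are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain(rows, attribute, target):
    """Standard ID3 information gain of `attribute` on `rows`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    expected = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - expected


def select_candidates(abs_corr, remaining, k):
    """Sub-step 2): take the k attributes with the largest P' and update R."""
    ranked = sorted(remaining, key=lambda a: abs_corr[a], reverse=True)
    chosen = ranked[:k]
    return chosen, [a for a in remaining if a not in chosen]


def child_per_branch(rows, parent, candidates, target):
    """Sub-step 5): for each parent value, pick the candidate with the largest gain."""
    children = {}
    for value in sorted({row[parent] for row in rows}):
        subset = [row for row in rows if row[parent] == value]
        children[value] = max(candidates, key=lambda a: gain(subset, a, target))
    return children


# Hypothetical P' values and samples, for illustration only.
abs_corr = {"output": 0.9, "credit": 0.7, "province": 0.2}
rows = [
    {"freq": "low", "output": "small", "credit": "good", "power": "low"},
    {"freq": "low", "output": "large", "credit": "good", "power": "high"},
    {"freq": "high", "output": "small", "credit": "bad", "power": "low"},
    {"freq": "high", "output": "small", "credit": "good", "power": "high"},
]
k = len({row["freq"] for row in rows})  # number of parent attribute values
chosen, remaining = select_candidates(abs_corr, ["output", "credit", "province"], k)
children = child_per_branch(rows, "freq", chosen, "power")
```

With this toy data the two branches of `freq` receive different child attributes, which mirrors the way the example assigns a child-node test attribute separately to each attribute value of the parent.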
Step five: if the remaining attribute set R is not empty, that is, not all attributes have been traversed, repeat steps three and four until all attributes have been traversed, then generate the decision tree.
Step six: classify the data set to be classified according to the decision tree.
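Step six is a walk from the root to a leaf. The sketch below assumes a hypothetical tree representation that the patent does not specify: an internal node is a dictionary with an `attribute` key and a `children` mapping from attribute value to subtree, and a leaf is simply a class label. The example tree and records are illustrative only.

```python
def classify(tree, record, default=None):
    """Follow the test attributes from the root until a leaf label is reached."""
    node = tree
    while isinstance(node, dict):
        value = record.get(node["attribute"])
        if value not in node["children"]:
            # Attribute value never seen while building the tree.
            return default
        node = node["children"][value]
    return node


# Hypothetical tree; leaves are purchasing-power classes.
tree = {
    "attribute": "credit",
    "children": {
        "bad": "low",
        "good": {
            "attribute": "industry",
            "children": {"industry": "high", "agriculture": "average"},
        },
    },
}
label = classify(tree, {"credit": "good", "industry": "agriculture"})
unknown = classify(tree, {"credit": "excellent"}, default="unclassified")
```

Returning a default for unseen attribute values is a design choice of this sketch; the source does not say how such records should be handled.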
Claims (4)
1. A data classification method based on correlation coefficients between attributes, characterized in that, in the data classification process, the sample set and the data set to be classified are first input, and the information gain values Gain of all attributes of the samples are calculated and sorted; the attribute of the root node of the decision tree is then determined from the information gain values Gain of all attributes; next, the attributes of the remaining nodes are determined from the absolute value P' of the correlation coefficient between attributes and the attribute values of the different attributes; finally, once all attributes have been traversed, the decision tree is generated and the data set to be classified is classified according to the decision tree; the method comprises at least the following steps:
Step one: input the sample set and the data set to be classified, and calculate the information gain value Gain of every attribute of the training sample set;
Step two: sort the information gain values Gain in descending order and select the attribute with the largest Gain as the test attribute of the root node W of the decision tree;
Step three: calculate the absolute value P' of the correlation coefficient between the root-node attribute (the attribute of the node one layer above) and the remaining attribute set;
Step four: build the nodes of each layer according to the P' values and the attribute values of the different attributes, and update the remaining attribute set R;
Step five: if the remaining attribute set R is not empty, that is, not all attributes have been traversed, repeat steps three and four until all attributes have been traversed, then generate the decision tree;
Step six: classify the data set to be classified according to the decision tree.
2. The decision tree data classification method based on correlation coefficients between attributes according to claim 1, characterized in that the selection of the test attribute of the root node W further comprises at least the following steps:
1) The data set to be classified is given. The training sample set has an attribute set and can be divided into a number of distinct classes, and the number of samples in each class is recorded. The attribute set contains a number of attributes, each with a number of distinct attribute values, so the attribute values partition the sample set; for each attribute, the number of samples taking each attribute value is recorded (the symbols and formulas for these quantities are not reproduced in this text);
2) Calculate the expected information needed to classify a sample, that is, the total information entropy [formula not reproduced];
3) Calculate the average expected information of each attribute of the samples [formula not reproduced], using the number of samples of each class that take a given attribute value under a given attribute, together with the total number of training samples;
4) Calculate the information gain value Gain of each attribute of the samples [formula not reproduced];
5) Sort the attribute set in descending order of information gain value Gain, and select the attribute corresponding to the largest Gain as the test attribute of the root node W of the decision tree [formula not reproduced].
3. The decision tree data classification method based on correlation coefficients between attributes according to claim 1, characterized in that the calculation of the absolute value P' of the correlation coefficient between the root-node attribute (the attribute of the node one layer above) and the remaining attribute set further comprises at least the following steps:
1) Calculate the variance of each attribute and the covariance between the root-node attribute (the attribute of the node one layer above) and each attribute in the remaining attribute set [formulas not reproduced];
2) Calculate the absolute value P' of the correlation coefficient between the root-node attribute (the attribute of the node one layer above) and the remaining attribute set [formula not reproduced].
4. The decision tree data classification method based on correlation coefficients between attributes according to claim 1, characterized in that building the nodes of each layer according to the P' values and the attribute values of the different attributes further comprises at least the following steps:
1) Initialize the remaining attribute set R by removing the attribute of the root node from the attribute set [formula not reproduced];
2) Sort the P' values in descending order, then select the leading P' values (the number selected is given by a symbol not reproduced in this text), take the corresponding attributes as the test attributes of the child nodes of the root node, and update the remaining attribute set R [formulas not reproduced];
3) The distinct attribute values of the root-node attribute (the attribute of the node one layer above) partition the sample set into subsets; calculate the information content of each attribute value of that attribute [formula not reproduced];
4) For each attribute value of the root-node attribute (the attribute of the node one layer above), calculate the average expected information of each child-node attribute [formula not reproduced], where the child-node attributes are the remaining attributes other than the root-node attribute, the root-node attribute is their parent attribute, and the formula uses the number of samples that take a given value of the parent attribute, take a given value of the child-node attribute, and belong to a given class;
5) For each attribute value of the root-node attribute (the attribute of the node one layer above), calculate the information gain value of each child-node attribute [formula not reproduced], sort these gain values in descending order separately for each attribute value, and select the attribute with the largest gain as the child-node test attribute for that attribute value, thereby completing the building of the child nodes of the root node (the node one layer above);
6) Calculate the absolute value P' of the correlation coefficient between the current remaining attribute set and the attributes of the layer above; for the same remaining attribute, compare its P' values with the different attributes of the layer above, make it the test attribute of a child node of the upper-layer attribute node with the larger P', and update the remaining attribute set R;
7) Following sub-steps 2), 3), 4), 5) and 6) of the child-node building process for the root node (the node one layer above), complete the building of the child nodes of each remaining layer.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510180290.7A CN104765839A (en) | 2015-04-16 | 2015-04-16 | Data classifying method based on correlation coefficients between attributes |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510180290.7A CN104765839A (en) | 2015-04-16 | 2015-04-16 | Data classifying method based on correlation coefficients between attributes |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104765839A true CN104765839A (en) | 2015-07-08 |
Family
ID=53647667
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510180290.7A Pending CN104765839A (en) | 2015-04-16 | 2015-04-16 | Data classifying method based on correlation coefficients between attributes |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104765839A (en) |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106650807A (en) * | 2016-12-20 | 2017-05-10 | 东南大学 | Method for predicting and evaluating concrete strength deterioration under ocean environment |
CN106886519A (en) * | 2015-12-15 | 2017-06-23 | 中国移动通信集团公司 | A kind of attribute processing methods and server |
CN107610771A (en) * | 2017-08-23 | 2018-01-19 | 上海电力学院 | A kind of medical science Testing index screening technique based on decision tree |
CN107894827A (en) * | 2017-10-31 | 2018-04-10 | 广东欧珀移动通信有限公司 | Using method for cleaning, device, storage medium and electronic equipment |
CN108509962A (en) * | 2017-02-28 | 2018-09-07 | 优信互联(北京)信息技术有限公司 | A kind of method and its device of identification information of vehicles |
CN108665309A (en) * | 2018-05-08 | 2018-10-16 | 多盟睿达科技(中国)有限公司 | A kind of advertisement matrix crowd localization method and system based on big data |
CN108960294A (en) * | 2018-06-12 | 2018-12-07 | 中国科学技术大学 | A kind of mobile robot classification of landform method of view-based access control model |
WO2019019375A1 (en) * | 2017-07-26 | 2019-01-31 | 平安科技(深圳)有限公司 | Method and apparatus for creating underwriting decision tree, and computer device and storage medium |
CN109784362A (en) * | 2018-12-05 | 2019-05-21 | 国网辽宁省电力有限公司信息通信分公司 | A kind of DGA shortage of data value interpolating method based on iteration KNN and interpolation priority |
CN110377605A (en) * | 2019-07-24 | 2019-10-25 | 贵州大学 | A kind of Sensitive Attributes identification of structural data and classification stage division |
US10831733B2 (en) | 2017-12-22 | 2020-11-10 | International Business Machines Corporation | Interactive adjustment of decision rules |
CN113362089A (en) * | 2020-03-02 | 2021-09-07 | 北京沃东天骏信息技术有限公司 | Attribute feature extraction method and device |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106886519A (en) * | 2015-12-15 | 2017-06-23 | 中国移动通信集团公司 | A kind of attribute processing methods and server |
CN106650807A (en) * | 2016-12-20 | 2017-05-10 | 东南大学 | Method for predicting and evaluating concrete strength deterioration under ocean environment |
CN106650807B (en) * | 2016-12-20 | 2019-10-11 | 东南大学 | A kind of concrete in marine environment strength deterioration prediction and evaluation method |
CN108509962A (en) * | 2017-02-28 | 2018-09-07 | 优信互联(北京)信息技术有限公司 | A kind of method and its device of identification information of vehicles |
WO2019019375A1 (en) * | 2017-07-26 | 2019-01-31 | 平安科技(深圳)有限公司 | Method and apparatus for creating underwriting decision tree, and computer device and storage medium |
CN107610771A (en) * | 2017-08-23 | 2018-01-19 | 上海电力学院 | A kind of medical science Testing index screening technique based on decision tree |
CN107894827A (en) * | 2017-10-31 | 2018-04-10 | 广东欧珀移动通信有限公司 | Using method for cleaning, device, storage medium and electronic equipment |
US10831733B2 (en) | 2017-12-22 | 2020-11-10 | International Business Machines Corporation | Interactive adjustment of decision rules |
CN108665309A (en) * | 2018-05-08 | 2018-10-16 | 多盟睿达科技(中国)有限公司 | A kind of advertisement matrix crowd localization method and system based on big data |
CN108665309B (en) * | 2018-05-08 | 2021-11-19 | 多盟睿达科技(中国)有限公司 | Advertisement matrix crowd positioning method and system based on big data |
CN108960294A (en) * | 2018-06-12 | 2018-12-07 | 中国科学技术大学 | A kind of mobile robot classification of landform method of view-based access control model |
CN109784362A (en) * | 2018-12-05 | 2019-05-21 | 国网辽宁省电力有限公司信息通信分公司 | A kind of DGA shortage of data value interpolating method based on iteration KNN and interpolation priority |
CN110377605A (en) * | 2019-07-24 | 2019-10-25 | 贵州大学 | A kind of Sensitive Attributes identification of structural data and classification stage division |
CN110377605B (en) * | 2019-07-24 | 2023-04-25 | 贵州大学 | Sensitive attribute identification and classification method for structured data |
CN113362089A (en) * | 2020-03-02 | 2021-09-07 | 北京沃东天骏信息技术有限公司 | Attribute feature extraction method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104765839A (en) | Data classifying method based on correlation coefficients between attributes | |
Lim et al. | Evolutionary cluster-based synthetic oversampling ensemble (eco-ensemble) for imbalance learning | |
Huang et al. | Revealing density-based clustering structure from the core-connected tree of a network | |
CN100557626C (en) | Image partition method based on immune spectrum clustering | |
CN106845717B (en) | Energy efficiency evaluation method based on multi-model fusion strategy | |
CN108154430A (en) | A kind of credit scoring construction method based on machine learning and big data technology | |
CN103942571B (en) | Graphic image sorting method based on genetic programming algorithm | |
CN103106279A (en) | Clustering method simultaneously based on node attribute and structural relationship similarity | |
Pandey et al. | An analysis of machine learning techniques (J48 & AdaBoost)-for classification | |
CN109582782A (en) | A kind of Text Clustering Method based on Weakly supervised deep learning | |
CN102750286A (en) | Novel decision tree classifier method for processing missing data | |
Zanghi et al. | Strategies for online inference of model-based clustering in large and growing networks | |
CN109978050A (en) | Decision Rules Extraction and reduction method based on SVM-RF | |
CN104217015A (en) | Hierarchical clustering method based on mutual shared nearest neighbors | |
CN111126865A (en) | Technology maturity judging method and system based on scientific and technological big data | |
Laassem et al. | Label propagation algorithm for community detection based on Coulomb’s law | |
CN106570537A (en) | Random forest model selection method based on confusion matrix | |
CN103310027B (en) | Rules extraction method for map template coupling | |
CN107451617A (en) | One kind figure transduction semisupervised classification method | |
Suard et al. | Kernel on Bag of Paths For Measuring Similarity of Shapes. | |
CN109164794A (en) | Multivariable industrial process Fault Classification based on inclined F value SELM | |
CN111428821A (en) | Asset classification method based on decision tree | |
CN103020864B (en) | Corn fine breed breeding method | |
Wang et al. | Adaptive population structure learning in evolutionary multi-objective optimization | |
Berton et al. | The Impact of Network Sampling on Relational Classification. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
EXSB | Decision made by sipo to initiate substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20150708 |