CN105654106A - Decision tree generation method and system thereof - Google Patents

Decision tree generation method and system thereof

Info

Publication number
CN105654106A
CN105654106A
Authority
CN
China
Prior art keywords
attribute
sample
decision tree
discrimination
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510419436.9A
Other languages
Chinese (zh)
Inventor
童志明
刘爽
何公道
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Antiy Technology Co Ltd
Original Assignee
Harbin Antiy Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Antiy Technology Co Ltd filed Critical Harbin Antiy Technology Co Ltd
Priority to CN201510419436.9A priority Critical patent/CN105654106A/en
Publication of CN105654106A publication Critical patent/CN105654106A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a decision tree generation method and a corresponding system. The method comprises the following steps: acquiring a training sample set and its sample attributes; counting, for each sample attribute, the number of distinct values it takes in the training sample set; calculating a discrimination degree for each sample attribute from that count; selecting the sample attribute with the highest discrimination degree as the basis attribute for dividing the decision tree branches; and recursively repeating the above division on each divided training subset until a branch node reaches a preset threshold, at which point the decision tree is complete. The invention also provides the corresponding system. With the method of the invention, the attribute used for each decision tree division can be selected reasonably, so that the division result is accurate and of practical value.

Description

Decision tree generation method and system
Technical field
The present invention relates to the field of computer technology, and in particular to a decision tree generation method and system.
Background technology
When dividing a branch, a traditional decision tree selects the split based on the magnitude of the information entropy. Divisions based on information entropy usually discriminate poorly between attributes, which makes it hard to judge which attribute is the main discriminative one and leaves the decision tree's division result of little reference value.
Summary of the invention
To solve the above problems, the present invention proposes a decision tree generation method and system. By calculating the discrimination degree of each sample attribute and using it to determine the partitioning criterion, the branches of the decision tree can be divided more reasonably and the resulting decision tree has better reference value.
A decision tree generation method, comprising:
obtaining a training sample set and its sample attributes, the number of sample attributes being m;
counting, for each sample attribute, the number of distinct values it takes in the training sample set, denoted c_i (1 ≤ i ≤ m);
calculating the discrimination degree q_i of each sample attribute:
q_i is the logarithm of 2^x to base c_i, i.e. q_i = log_{c_i}(2^x), where c_i should be much smaller than 2^x; therefore, when c_i is greater than 1, 1 < q_i ≤ x, and when c_i equals 1, q_i equals 0;
selecting the sample attribute with the largest discrimination degree as the basis attribute for dividing the decision tree branches, where the samples in the training sample set that share the same value of the basis attribute form one decision tree branch, and the samples it contains form a subset;
recursively applying the above steps to the subset of each decision tree branch to continue the branch division, and stopping when a branch node reaches a preset threshold, at which point the decision tree is generated.
In the method, when the discrimination degree of each sample attribute is calculated, if two or more sample attributes have the same discrimination degree, the secondary discrimination degree t_i of those attributes is further calculated. Suppose the distinct values of such an attribute occur K_1, K_2, ..., K_{c_i} times respectively in the training set, so that the total number of samples is n = K_1 + K_2 + ... + K_{c_i}; then:
t_i = (K_1 × K_2 × ... × K_{c_i})^(1/n);
the attribute with the largest secondary discrimination degree is chosen as the basis attribute for dividing the decision tree branches.
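For concreteness, the following is a minimal sketch (not part of the patent) of this attribute-selection step in Python. It assumes samples are represented as dictionaries mapping attribute names to values and that the parameter x is chosen so that 2^x is much larger than every c_i; the function names are illustrative only.

```python
import math
from collections import Counter

def discrimination(values, x):
    """q_i = log base c_i of 2**x, where c_i is the number of distinct
    values the attribute takes in the training set; defined as 0 when
    c_i == 1, since a single-valued attribute cannot split the set."""
    c = len(set(values))
    if c == 1:
        return 0.0
    return math.log(2 ** x, c)  # equivalently x / math.log2(c)

def secondary_discrimination(values):
    """t_i = (K_1 * K_2 * ... * K_{c_i}) ** (1/n), the tie-breaker that
    favours attributes whose values are evenly spread over the n samples."""
    counts = list(Counter(values).values())
    n = sum(counts)
    return math.prod(counts) ** (1.0 / n)

def choose_split_attribute(samples, attributes, x):
    """Pick the attribute with the highest discrimination degree,
    breaking ties with the secondary discrimination degree."""
    def column(attr):
        return [s[attr] for s in samples]
    # round q so that attributes with equal discrimination actually tie,
    # letting the tuple comparison fall through to the tie-breaker
    return max(attributes,
               key=lambda a: (round(discrimination(column(a), x), 6),
                              secondary_discrimination(column(a))))
```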
A decision tree generation system, comprising:
a sample acquisition module, configured to obtain a training sample set and its sample attributes, the number of sample attributes being m;
a statistics module, configured to count, for each sample attribute, the number of distinct values it takes in the training sample set, denoted c_i (1 ≤ i ≤ m);
a discrimination calculation module, configured to calculate the discrimination degree q_i of each sample attribute:
q_i is the logarithm of 2^x to base c_i, i.e. q_i = log_{c_i}(2^x), where c_i is much smaller than 2^x; therefore, when c_i is greater than 1, 1 < q_i ≤ x, and when c_i equals 1, q_i equals 0;
a decision tree generation module, configured to select the sample attribute with the largest discrimination degree as the basis attribute for dividing the decision tree branches, where the samples in the training sample set that share the same value of the basis attribute form one decision tree branch, and the samples it contains form a subset;
and to recursively apply the above steps to the subset of each decision tree branch to continue the branch division, stopping when a branch node reaches a preset threshold, at which point the decision tree is generated.
In the system, when the discrimination degree of each sample attribute is calculated, if two or more sample attributes have the same discrimination degree, the secondary discrimination degree t_i of those attributes is further calculated. Suppose the distinct values of such an attribute occur K_1, K_2, ..., K_{c_i} times respectively in the training set, so that the total number of samples is n = K_1 + K_2 + ... + K_{c_i}; then:
t_i = (K_1 × K_2 × ... × K_{c_i})^(1/n);
the attribute with the largest secondary discrimination degree is chosen as the basis attribute for dividing the decision tree branches.
The advantage is that the present invention can reasonably and effectively select the attribute used to divide the decision tree branches, so that the division result has better reference value.
The present invention proposes a decision tree generation method and system. The method includes: obtaining a training sample set and its sample attributes; counting, for each sample attribute, the number of distinct values it takes in the training sample set; calculating the discrimination degree of each sample attribute from that count; and selecting the sample attribute with the highest discrimination degree as the basis attribute for dividing the decision tree branches. The divided training sample subsets continue to be divided recursively in the same way until a branch node reaches a preset threshold, at which point the decision tree is generated. The invention also provides a corresponding system. With the method of the invention, the attribute used for each decision tree division can be selected reasonably, so that the division result is more accurate and of practical value.
Brief description of the drawings
To illustrate the technical solutions of the present invention or of the prior art more clearly, the drawings required in the embodiments or in the description of the prior art are briefly introduced below. Obviously, the drawings described below are only some of the embodiments recorded in the present invention, and those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow chart of an embodiment of the decision tree generation method of the present invention;
Fig. 2 is a structural diagram of an embodiment of the decision tree generation system of the present invention.
Detailed description of the invention
To enable those skilled in the art to better understand the technical solutions in the embodiments of the present invention, and to make the above objects, features and advantages of the present invention clearer and easier to understand, the technical solutions of the present invention are described in further detail below with reference to the accompanying drawings.
To solve the above problems, the present invention proposes a decision tree generation method and system. By calculating the discrimination degree of each sample attribute and using it to determine the partitioning criterion, the branches of the decision tree can be divided more reasonably and the resulting decision tree has better reference value.
A decision tree generation method, as shown in Fig. 1, comprising:
S101: obtaining a training sample set and its sample attributes, the number of sample attributes being m;
S102: counting, for each sample attribute, the number of distinct values it takes in the training sample set, denoted c_i (1 ≤ i ≤ m);
S103: calculating the discrimination degree q_i of each sample attribute:
q_i is the logarithm of 2^x to base c_i, i.e. q_i = log_{c_i}(2^x), where c_i should be much smaller than 2^x; therefore, when c_i is greater than 1, 1 < q_i ≤ x, and when c_i equals 1, q_i equals 0;
S104: selecting the sample attribute with the largest discrimination degree as the basis attribute for dividing the decision tree branches, where the samples in the training sample set that share the same value of the basis attribute form one decision tree branch, and the samples it contains form a subset;
S105: recursively applying the above steps to the subset of each decision tree branch to continue the branch division, and stopping when a branch node reaches a preset threshold, at which point the decision tree is generated.
In the method, when the discrimination degree of each sample attribute is calculated, if two or more sample attributes have the same discrimination degree, the secondary discrimination degree t_i of those attributes is further calculated. Suppose the distinct values of such an attribute occur K_1, K_2, ..., K_{c_i} times respectively in the training set, so that the total number of samples is n = K_1 + K_2 + ... + K_{c_i}; then:
t_i = (K_1 × K_2 × ... × K_{c_i})^(1/n);
the attribute with the largest secondary discrimination degree is chosen as the basis attribute for dividing the decision tree branches. By an elementary inequality (the arithmetic-geometric mean inequality), the more uniformly the samples are distributed over the attribute's values, the larger t_i becomes, and a more uniform distribution is considered to give a better secondary discrimination degree.
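The "elementary inequality" invoked here is the arithmetic-geometric mean inequality; a short derivation of the claim (our reading, not spelled out in the patent text) is:

```latex
\text{For fixed } n = K_1 + \cdots + K_{c_i}, \text{ AM-GM gives }
\Bigl(\prod_{j=1}^{c_i} K_j\Bigr)^{1/c_i} \le \frac{K_1 + \cdots + K_{c_i}}{c_i} = \frac{n}{c_i},
\text{ with equality iff } K_1 = \cdots = K_{c_i}.
\text{ Raising both sides to the power } c_i/n \text{ yields }
t_i = \Bigl(\prod_{j=1}^{c_i} K_j\Bigr)^{1/n} \le \Bigl(\frac{n}{c_i}\Bigr)^{c_i/n},
\text{ so } t_i \text{ is largest exactly when the attribute's values are evenly distributed.}
```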
For example, suppose the training sample set contains ten samples with 4 sample attributes, i.e. m = 4; the number of distinct values c_1 of attribute 1 is 2, c_2 of attribute 2 is 3, c_3 of attribute 3 is 10, and c_4 of attribute 4 is 2; and suppose x in the formula is 5.
Then q_1 = log_2(2^5) = 5;
q_2 = log_3(2^5) ≈ 3.15;
q_3 = log_10(2^5) ≈ 1.51;
q_4 = log_2(2^5) = 5;
q_1 and q_4 are the highest and are equal, so the secondary discrimination degrees are calculated.
Suppose the two values of attribute 1 occur 3 and 7 times in the training set, and the two values of attribute 4 occur 4 and 6 times; therefore:
t_1 = 21^(1/10) ≈ 1.36; t_4 = 24^(1/10) ≈ 1.37;
attribute 4 is therefore selected as the basis attribute and the branch division is performed; subsequent calculations and branch divisions proceed in the same way.
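The numbers in this embodiment can be checked with the short sketch below (assuming the same x = 5 and value counts as above; the attribute identifiers are illustrative only):

```python
import math

x = 5                                   # exponent used in the embodiment: 2**5 = 32
c = {1: 2, 2: 3, 3: 10, 4: 2}           # distinct-value counts c_i of the four attributes
q = {i: math.log(2 ** x, ci) for i, ci in c.items()}
# q[1] = 5.0, q[2] ≈ 3.15, q[3] ≈ 1.51, q[4] = 5.0 -> attributes 1 and 4 tie

t1 = (3 * 7) ** (1 / 10)                # ≈ 1.36, value counts 3 and 7
t4 = (4 * 6) ** (1 / 10)                # ≈ 1.37, value counts 4 and 6
best = 4 if t4 > t1 else 1              # attribute 4 wins, as in the embodiment
print(q, t1, t4, best)
```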
A decision tree generation system, as shown in Fig. 2, comprising:
a sample acquisition module 201, configured to obtain a training sample set and its sample attributes, the number of sample attributes being m;
a statistics module 202, configured to count, for each sample attribute, the number of distinct values it takes in the training sample set, denoted c_i (1 ≤ i ≤ m);
a discrimination calculation module 203, configured to calculate the discrimination degree q_i of each sample attribute:
q_i is the logarithm of 2^x to base c_i, i.e. q_i = log_{c_i}(2^x), where c_i is much smaller than 2^x; therefore, when c_i is greater than 1, 1 < q_i ≤ x, and when c_i equals 1, q_i equals 0;
a decision tree generation module 204, configured to select the sample attribute with the largest discrimination degree as the basis attribute for dividing the decision tree branches, where the samples in the training sample set that share the same value of the basis attribute form one decision tree branch, and the samples it contains form a subset;
and to recursively apply the above steps to the subset of each decision tree branch to continue the branch division, stopping when a branch node reaches a preset threshold, at which point the decision tree is generated.
In the system, when the discrimination degree of each sample attribute is calculated, if two or more sample attributes have the same discrimination degree, the secondary discrimination degree t_i of those attributes is further calculated. Suppose the distinct values of such an attribute occur K_1, K_2, ..., K_{c_i} times respectively in the training set, so that the total number of samples is n = K_1 + K_2 + ... + K_{c_i}; then:
t_i = (K_1 × K_2 × ... × K_{c_i})^(1/n);
the attribute with the largest secondary discrimination degree is chosen as the basis attribute for dividing the decision tree branches.
The advantage is that the present invention can reasonably and effectively select the attribute used to divide the decision tree branches, so that the division result has better reference value. Building a decision tree with this technique is a loop combined with recursion: for a given attribute data set, a quantitative discrimination calculation is performed on each attribute, and the attribute with the best discrimination is chosen for each division.
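As a sketch of that loop-plus-recursion process (not the patent's reference implementation), the recursive builder below reuses the choose_split_attribute helper from the earlier sketch; interpreting the preset threshold as a minimum node size is an assumption.

```python
def build_tree(samples, attributes, x, min_node_size=1):
    """Recursively split the sample set: score every attribute, split on
    the best one, and recurse into each branch until the stopping
    threshold is reached."""
    # Stop when the node is at or below the preset threshold, or when
    # no attributes remain to split on.
    if not attributes or len(samples) <= min_node_size:
        return {"leaf": samples}

    best = choose_split_attribute(samples, attributes, x)
    branches = {}
    for value in {s[best] for s in samples}:
        subset = [s for s in samples if s[best] == value]
        remaining = [a for a in attributes if a != best]
        branches[value] = build_tree(subset, remaining, x, min_node_size)
    return {"split_on": best, "branches": branches}
```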
The present invention proposes a decision tree generation method and system. The method includes: obtaining a training sample set and its sample attributes; counting, for each sample attribute, the number of distinct values it takes in the training sample set; calculating the discrimination degree of each sample attribute from that count; and selecting the sample attribute with the highest discrimination degree as the basis attribute for dividing the decision tree branches. The divided training sample subsets continue to be divided recursively in the same way until a branch node reaches a preset threshold, at which point the decision tree is generated. The invention also provides a corresponding system. With the method of the invention, the attribute used for each decision tree division can be selected reasonably, so that the division result is more accurate and of practical value.
Although the present invention has been described by way of embodiments, those skilled in the art will appreciate that the present invention admits many variations and changes that do not depart from its spirit, and it is intended that the appended claims cover such variations and changes without departing from the spirit of the present invention.

Claims (4)

1. A decision tree generation method, characterized in that it comprises:
obtaining a training sample set and its sample attributes, the number of sample attributes being m;
counting, for each sample attribute, the number of distinct values it takes in the training sample set, denoted c_i (1 ≤ i ≤ m);
calculating the discrimination degree q_i of each sample attribute:
q_i being the logarithm of 2^x to base c_i, i.e. q_i = log_{c_i}(2^x), where c_i should be much smaller than 2^x; therefore, when c_i is greater than 1, 1 < q_i ≤ x, and when c_i equals 1, q_i equals 0;
selecting the sample attribute with the largest discrimination degree as the basis attribute for dividing the decision tree branches, where the samples in the training sample set that share the same value of the basis attribute form one decision tree branch, and the samples it contains form a subset;
recursively applying the above steps to the subset of each decision tree branch to continue the branch division, and stopping when a branch node reaches a preset threshold, at which point the decision tree is generated.
2. The method of claim 1, characterized in that, when the discrimination degree of each sample attribute is calculated, if two or more sample attributes have the same discrimination degree, the secondary discrimination degree t_i of those attributes is further calculated; supposing the distinct values of such an attribute occur K_1, K_2, ..., K_{c_i} times respectively in the training set, so that the total number of samples is n = K_1 + K_2 + ... + K_{c_i}, then:
t_i = (K_1 × K_2 × ... × K_{c_i})^(1/n);
and the attribute with the largest secondary discrimination degree is chosen as the basis attribute for dividing the decision tree branches.
3. A decision tree generation system, characterized in that it comprises:
a sample acquisition module, configured to obtain a training sample set and its sample attributes, the number of sample attributes being m;
a statistics module, configured to count, for each sample attribute, the number of distinct values it takes in the training sample set, denoted c_i (1 ≤ i ≤ m);
a discrimination calculation module, configured to calculate the discrimination degree q_i of each sample attribute:
q_i being the logarithm of 2^x to base c_i, i.e. q_i = log_{c_i}(2^x), where c_i is much smaller than 2^x; therefore, when c_i is greater than 1, 1 < q_i ≤ x, and when c_i equals 1, q_i equals 0;
a decision tree generation module, configured to select the sample attribute with the largest discrimination degree as the basis attribute for dividing the decision tree branches, where the samples in the training sample set that share the same value of the basis attribute form one decision tree branch, and the samples it contains form a subset;
and to recursively apply the above steps to the subset of each decision tree branch to continue the branch division, stopping when a branch node reaches a preset threshold, at which point the decision tree is generated.
4. The system of claim 3, characterized in that, when the discrimination degree of each sample attribute is calculated, if two or more sample attributes have the same discrimination degree, the secondary discrimination degree t_i of those attributes is further calculated; supposing the distinct values of such an attribute occur K_1, K_2, ..., K_{c_i} times respectively in the training set, so that the total number of samples is n = K_1 + K_2 + ... + K_{c_i}, then:
t_i = (K_1 × K_2 × ... × K_{c_i})^(1/n);
and the attribute with the largest secondary discrimination degree is chosen as the basis attribute for dividing the decision tree branches.
CN201510419436.9A 2015-07-17 2015-07-17 Decision tree generation method and system thereof Pending CN105654106A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510419436.9A CN105654106A (en) 2015-07-17 2015-07-17 Decision tree generation method and system thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510419436.9A CN105654106A (en) 2015-07-17 2015-07-17 Decision tree generation method and system thereof

Publications (1)

Publication Number Publication Date
CN105654106A true CN105654106A (en) 2016-06-08

Family

ID=56481638

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510419436.9A Pending CN105654106A (en) 2015-07-17 2015-07-17 Decision tree generation method and system thereof

Country Status (1)

Country Link
CN (1) CN105654106A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106548257A (en) * 2016-12-09 2017-03-29 中国南方电网有限责任公司超高压输电公司昆明局 A kind of standby redundancy quota formulating method based on decision-tree model
CN106548257B (en) * 2016-12-09 2021-04-09 中国南方电网有限责任公司超高压输电公司昆明局 Method for making quota of spare parts based on decision tree model
CN108989075A (en) * 2017-06-05 2018-12-11 中国移动通信集团广东有限公司 A kind of network failure locating method and system
CN107894827A (en) * 2017-10-31 2018-04-10 广东欧珀移动通信有限公司 Using method for cleaning, device, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN107529644B (en) Linear approximation method for static voltage stability domain boundary of power system
CN104573000A (en) Sequential learning based automatic questions and answers device and method
CN102855185A (en) Pair-wise test method based on priority
CN105654106A (en) Decision tree generation method and system thereof
CN105005029A (en) Multi-mode radar signal sorting method based on data field hierarchical clustering
CN104199945A (en) Data storing method and device
CN109033322A (en) A kind of test method and device of multidimensional data
CN103678513B (en) A kind of interactively retrieval type generates method and system
CN105825288B (en) optimization analysis method for eliminating regression data collinearity problem in complex system
CN104572474A (en) Dynamic slicing based lightweight error locating implementation method
CN103902798A (en) Data preprocessing method
CN103324888A (en) Method and system for automatically extracting virus characteristics based on family samples
CN106326005A (en) Automatic parameter tuning method for iterative MapReduce operation
CN105069574A (en) New method for analyzing business flow behavior similarity
CN108646688B (en) A kind of process parameter optimizing analysis method based on recurrence learning
CN103425579A (en) Safety evaluation method for mobile terminal system based on potential function
CN106780080A (en) The computational methods of planting progress, computing device and server
CN107957944B (en) User data coverage rate oriented test case automatic generation method
CN105654498A (en) Image segmentation method based on dynamic local search and immune clone automatic clustering
CN105373473A (en) Original signalling decoding-based CDR accuracy test method and system
CN105205627B (en) Grid power Management plan determines method and system
CN105404736B (en) Severity computational methods based on multi-source confidence fuzzy message
CN108089136B (en) Automatic slicing method for fuel cell stack polarization curve test data
CN110096448B (en) Fuzzy test search method considering depth and breadth
CN101571814B (en) Communication behavior information extraction method based on message passing interface device and system thereof

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20160608)