CN103886030A - Cost-sensitive decision-making tree based physical information fusion system data classification method - Google Patents

Cost-sensitive decision-making tree based physical information fusion system data classification method

Info

Publication number
CN103886030A
CN103886030A CN201410077291.4A CN201410077291A CN103886030A CN 103886030 A CN103886030 A CN 103886030A CN 201410077291 A CN201410077291 A CN 201410077291A CN 103886030 A CN103886030 A CN 103886030A
Authority
CN
China
Prior art keywords
decision tree
classification
data
decision
list
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410077291.4A
Other languages
Chinese (zh)
Other versions
CN103886030B (en)
Inventor
陈志
朱超
岳文静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Hengtong Mdt InfoTech Ltd
Original Assignee
Nanjing Post and Telecommunication University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Post and Telecommunication University filed Critical Nanjing Post and Telecommunication University
Priority to CN201410077291.4A priority Critical patent/CN103886030B/en
Publication of CN103886030A publication Critical patent/CN103886030A/en
Application granted granted Critical
Publication of CN103886030B publication Critical patent/CN103886030B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/285Clustering or classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed is a data classification method, based on a cost-sensitive decision tree, for a physical information fusion system (cyber-physical fusion system). The method first uses physical units to acquire data, then transmits the acquired data to an information unit, and finally performs further analysis and processing in the information unit. The information unit operates in the following steps: 1) from the sample data stored for each physical unit, the information unit builds a decision tree corresponding to that unit; 2) each established decision tree is tested, the ratio of the number of misclassified samples to the total number of test samples is recorded, and decision trees whose ratio is too large are deleted; 3) data are classified on the basis of the remaining decision trees in combination with the previously recorded ratios. The method can conveniently and rapidly build a tree structure that classifies data efficiently and is easy for a user to understand, without requiring extensive background knowledge.

Description

Data classification method for a cyber-physical fusion system based on a cost-sensitive decision tree
Technical field
The present invention relates to a decision tree method based on cost sensitivity. It is mainly used to classify data efficiently in a cyber-physical fusion system while minimizing the total loss or cost produced by misclassification, and it belongs to the cross-disciplinary application of cyber-physical fusion systems and data mining.
Background technology
Cyber-physical fusion systems are regarded as the third wave of world information technology, after the computer and the Internet. A cyber-physical fusion system can be understood as an efficient, networked, intelligent information system built on embedded devices; it has a high degree of autonomous perception, autonomous judgment, self-regulation, and self-governance, and it is a next-generation intelligent system in which the virtual world and the real physical world are interconnected and cooperate. Functionally, a cyber-physical fusion system focuses on performance optimization and integrates the 3C technologies of Computation, Communication, and Control. Cyber-physical fusion system technology has attracted great attention from international industry and business circles and from many large international corporations; it is developing very rapidly and has already been applied in important fields such as transportation, medical care, and energy, with broad application prospects.
Data mining is an iterative process that searches large quantities of data for valuable, novel information; it is a joint effort of people and computers. It seeks a balance between the problem-describing knowledge and goals of human experts and the search capability of computers, in order to obtain the best result. Data mining is one of the fastest-growing fields in the computer industry: once a sub-topic of computer science and statistics, it has rapidly developed into an independent field. One of its strongest advantages is that many methods and techniques can be applied to a large variety of problem sets. Because data mining is carried out automatically on large data sets, its largest target market is the entire data warehouse, data mart, and decision support industry, covering sectors such as retail, manufacturing, telecommunications, medical care, insurance, and transportation.
Classification is an important form of data analysis: it extracts models that portray significant data classes. Such a model is called a classifier, and it predicts class labels. Classification generally proceeds in two steps: first, a classification model is built from the given data; second, the accuracy of the model is assessed and, if it is acceptable, the model is used to classify new data. Most classification algorithms are memory-resident and usually assume a small data volume. With the development of modern technology, data mining research has built on this foundation and developed scalable classification and prediction techniques capable of processing large, disk-resident data. Classification has wide application, including fraud detection, target marketing, performance prediction, manufacturing, and medical diagnosis.
A decision tree is a flowchart-like tree structure and a typical classification technique. It first processes the data, uses an induction algorithm to generate readable rules and build the tree, and then uses the resulting decisions to analyze new data. In essence, a decision tree classifies data through a series of rules. In the late 1970s and early 1980s, the machine learning researcher J. Ross Quinlan developed the decision tree algorithm called the Iterative Dichotomiser (ID3). Quinlan later proposed C4.5, the successor of ID3, which became a performance benchmark for new supervised learning algorithms. In 1984, several statisticians published "Classification and Regression Trees" (CART), which introduced the generation of binary decision trees. Traditional decision tree construction mostly adopts a greedy strategy and builds the tree top-down by recursive divide and conquer.
The prototype of the cost-sensitive learning (Cost-Sensitive Learning, CSL) classification problem is medical diagnosis. In this problem, a doctor must weigh, for every possible medical test, the probability of disease, the cost of the test, and the expected benefit of the test result. Cost-sensitive learning mainly considers how to train a classifier when different classification errors incur different penalties. For example, in medicine the cost of misdiagnosing a patient as healthy differs from the cost of misdiagnosing a healthy person as a patient; in credit card fraud detection, the cost of mistaking fraudulent use for normal use differs from the cost of mistaking normal use for fraud. It is easy to see that, even when the probability of a misjudgment is small, an uncorrected misjudgment can cause a series of losses, so taking the final loss as the optimization target is more meaningful in practice.
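To make this asymmetry concrete, the following minimal Python sketch (not part of the patent; the cost values, class names, and probabilities are invented for illustration) picks the class with the smallest expected cost rather than the most probable class:

```python
# Hypothetical cost matrix: cost[true_class][predicted_class].
# Misdiagnosing a sick patient as healthy is assumed far more costly
# than the reverse; both numbers are illustrative.
cost = {
    "sick":    {"sick": 0.0,  "healthy": 100.0},  # missed diagnosis
    "healthy": {"sick": 10.0, "healthy": 0.0},    # false alarm
}

def min_cost_class(class_probs):
    """Return the prediction with the smallest expected cost.

    class_probs maps each possible true class to its estimated probability.
    """
    def expected_cost(predicted):
        return sum(p * cost[true][predicted] for true, p in class_probs.items())
    return min(cost, key=expected_cost)

# "healthy" is more probable, but predicting "sick" is cheaper in expectation:
# 0.2*0 + 0.8*10 = 8  versus  0.2*100 + 0.8*0 = 20.
print(min_cost_class({"sick": 0.2, "healthy": 0.8}))  # -> sick
```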
Since cost-sensitive learning was proposed, it has attracted the attention of many experts, and many novel methods have been put forward. Broadly, there are two basic ideas. One is to leave the classical classification technique unchanged and instead process the original data set in some way, or wrap an additional layer of algorithm around the classical classification algorithm, so that the predetermined cost-sensitive objective is reached. The other is to add further factors on top of the classical classification algorithm itself in order to achieve the goal of cost-sensitive learning.
Summary of the invention
Technical problem: the object of the present invention is to provide a data classification method for a cyber-physical fusion system based on a cost-sensitive decision tree. The method adds cost-sensitive considerations to the process of building decision trees, so as to classify data in the cyber-physical fusion system efficiently while reducing the loss or total cost produced by classification to a minimum.
Technical scheme: in the data classification method of the present invention, the user first builds multiple decision trees in the cyber-physical fusion system and then combines these decision trees into a combined decision tree. Under cost-sensitive considerations, data are classified by computing the error rate of the combined decision tree.
The cyber-physical fusion system of the present invention is composed of several physical units and one information unit, wherein the physical units collect data and the information unit receives and stores the data sent by the physical units and analyzes and processes them.
The data classification method for a cyber-physical fusion system based on a cost-sensitive decision tree comprises the following steps:
Step 1) The user divides the component units of the cyber-physical fusion system into multiple physical units and one information unit; the physical units collect data, and the information unit analyzes and processes the data;
Step 2) The user places training samples and test samples in the system in advance; each sample in the training samples and test samples comprises a serial number, attribute names, the corresponding attribute values, and a class;
Step 3) The user starts each physical unit to collect training sample data;
Step 4) Each physical unit sends the collected training sample data to the information unit, and the information unit stores the training sample data sent by different physical units separately;
Step 5) The information unit builds, from the sample data it stores for each physical unit, a decision tree corresponding to that physical unit; the decision tree is a tree-shaped classifier used in machine learning classification algorithms, and a classifier is a computer program whose function is to automatically assign data to known classes;
Step 6) The information unit places the decision trees obtained into a list, forming a decision tree list; the list is a data structure in which data items are arranged in linear order, and the basic operations that can be performed on it include searching for, inserting, and deleting elements;
Step 7) The information unit tests each decision tree. The concrete steps are as follows:
Step 7.1) A group of prepared test samples is input into each established decision tree;
Step 7.2) Each test sample passes through the decisions of the decision tree and obtains a corresponding classification result;
Step 7.3) The classification result obtained is compared with the class of the test sample itself; if they are identical the classification is correct, otherwise it is a classification error; the number of misclassified samples is recorded;
Step 7.4) If the ratio of the number of misclassified test samples to the total number of test samples is greater than 10%, the decision tree is deleted from the decision tree list; otherwise it is retained in the decision tree list; the ratio of misclassified test samples to total test samples is denoted w_i and is recorded in the information unit, where i denotes the i-th decision tree in the decision tree list;
Step 8) After the above steps, if the decision tree list is empty, the decision tree with the smallest w_i is chosen as the final decision tree;
Step 9) If the decision tree list contains only one decision tree, that decision tree is the final decision tree;
Step 10) If the decision tree list contains two or more decision trees, the decision trees in the list are taken as a combined decision tree; the combined decision tree is a classifier composed of multiple decision trees, each of which has a corresponding weight, namely the ratio of the number of misclassified test samples to the total number of test samples;
Step 11) The misclassification rate H of the combined decision tree is computed, and the class corresponding to the minimum value of H is returned as the classification result of the data; H is computed by
H = argmin_j ( (1/N) · Σ_i w_i · C_{i,j} ),
where i denotes the i-th decision tree; j denotes a class; C_{i,j} indicates whether the classification result obtained from the i-th decision tree is class j, with C_{i,j} = 1 if it is and C_{i,j} = 0 otherwise; w_i is the ratio of misclassified test samples to total test samples recorded for the i-th decision tree during the earlier testing; N is the number of decision trees in the combined decision tree whose classification result is class j; and argmin selects the minimum of these values over j as the final value of H.
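A minimal Python sketch of step 11 follows; it is an illustrative reading of the formula, and the function and variable names (classify_with_combined_tree, predict, trees_with_weights) are assumptions rather than names prescribed by the method:

```python
# Sketch of the combined decision tree of steps 10-11 (assumed interfaces).
# Each member tree is represented as a pair (predict, w_i), where predict maps
# a sample to a class and w_i is the error ratio recorded in step 7.4.
def classify_with_combined_tree(trees_with_weights, sample):
    votes = {}  # class j -> (sum of w_i over trees predicting j, count N of such trees)
    for predict, w_i in trees_with_weights:
        j = predict(sample)                  # C_{i,j} = 1 exactly for this class
        total, count = votes.get(j, (0.0, 0))
        votes[j] = (total + w_i, count + 1)
    # H_j = (1/N) * sum_i w_i * C_{i,j}; the class with the smallest H_j is returned.
    return min(votes, key=lambda j: votes[j][0] / votes[j][1])
```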
Beneficial effects: in classifying cyber-physical fusion system data, the present invention uses the decision tree algorithm from data mining classification techniques and introduces cost-sensitive processing, so as to classify data in the cyber-physical fusion system efficiently while reducing the loss or total cost produced by classification to a minimum. Specifically, the data classification method based on a cost-sensitive decision tree of the present invention has the following beneficial effects:
(1) A decision tree is easy to understand and implement; in the learning process the user does not need much background knowledge and can grasp the meaning expressed by the decision tree after a brief explanation.
(2) A decision tree can directly reflect the characteristics of the data, and the tree is easy for a user to observe and understand.
(3) For decision trees, data preparation is usually simple; discrete and continuous data can be processed at the same time, and feasible, well-performing results can be produced for large data sources within a relatively short time.
(4) A decision tree model is easy to evaluate through testing, so the reliability of the model can be measured. Given observed data, the corresponding classification rules are easy to derive from the produced decision tree.
(5) The introduced cost-sensitive learning trains, on the training data set, a learning system with minimal misclassification cost; its final aim is to minimize the cost of misclassification and reduce the possibility of misclassification. It is therefore no longer classical classification and prediction, but an extension of them.
Brief description of the drawings
Fig. 1 is a schematic diagram of the decision tree building process.
Fig. 2 is a flowchart of the data classification method for a cyber-physical fusion system based on a cost-sensitive decision tree.
Embodiment
The present invention uses a decision tree algorithm combined with cost sensitivity to optimize the classification of cyber-physical fusion system data and reduce the loss caused by misclassification. The present invention is described in more detail below with reference to the drawings and embodiments.
The present invention is described concretely using weather sample data. A weather sample comprises a serial number, attribute names and the corresponding attribute values, and a class. The attributes include outlook (with attribute values sunny, overcast, rainy), temperature (hot, mild, cool), humidity (high, normal), and windy (true, false); the class is whether to go out to play (yes, no).
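One possible in-memory representation of such weather samples is sketched below; the attribute names and value sets come from the description above, while the concrete records and field names are invented for illustration:

```python
# Hypothetical weather training samples: serial number, attribute values, class.
weather_samples = [
    {"id": 1, "outlook": "sunny",    "temperature": "hot",  "humidity": "high",   "windy": "false", "play": "no"},
    {"id": 2, "outlook": "overcast", "temperature": "mild", "humidity": "normal", "windy": "false", "play": "yes"},
    {"id": 3, "outlook": "rainy",    "temperature": "cool", "humidity": "normal", "windy": "true",  "play": "yes"},
    {"id": 4, "outlook": "rainy",    "temperature": "mild", "humidity": "high",   "windy": "true",  "play": "no"},
]
```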
1. The user divides the component units of the cyber-physical fusion system into multiple physical units and one information unit. The physical units collect weather data; the information unit analyzes and processes the data.
2. The user places weather training samples and weather test samples in the system in advance. Each sample in the weather training samples and weather test samples comprises a serial number, attribute names, the corresponding attribute values, and a class.
3. The user starts each physical unit to collect weather training sample data.
4. Each physical unit sends the collected weather training sample data to the information unit. The information unit stores the weather training sample data sent by different physical units separately.
5. The information unit builds, from the weather sample data it stores for each physical unit, a decision tree corresponding to that physical unit. The decision tree is a tree-shaped classifier used in machine learning classification algorithms, in which the root node contains all input sample data, the leaf nodes are the classification results, and the remaining nodes represent tests on an attribute. A classifier is a computer program whose function is to automatically assign data to known classes. Each decision tree is built as follows:
5.1. Create a node as the root node of the decision tree.
5.2. Judge whether the training set contains only one class; if so, mark the node as a leaf node and end the construction of the decision tree; if not, judge whether the samples still have attributes.
5.3. If the samples have no attributes left, mark the node as a leaf node labeled with the most frequent class in the training set, and end the construction; if not, start looking for the best splitting rule.
5.4. Compute the information gain of each attribute, choose the attribute with the maximum gain as the split attribute to create branches, and delete this attribute from the sample data (a code sketch of this computation is given after step 5.5). Information gain is a criterion that measures how much information an attribute contributes to the classification, that is, how much it affects the final classification result; the more information it brings (the larger its effect), the more important the attribute. The information gain is based on the entropy
Info(D) = - Σ_i p_i · log2(p_i),
where D is the training set and p_i is the nonzero probability that a sample in D belongs to class i; the gain of an attribute is the reduction in this entropy after splitting on that attribute. The split attribute is the attribute chosen as the branching condition of the decision tree: if its values are discrete, a branch is generally created for every possible attribute value; if the attribute is numeric, a binary split is usually provided by judging whether the value is greater than, or less than or equal to, a predefined constant.
5.5. Perform the operation of step 5.4 on each branch, and carry out the judgments of steps 5.2 and 5.3.
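The split-attribute selection of step 5.4 can be sketched in Python as follows; the function names, the class_key parameter, and the sample format (the weather_samples records above) are assumptions made for illustration:

```python
import math
from collections import Counter

def info(samples, class_key="play"):
    """Entropy Info(D) = -sum_i p_i * log2(p_i) over the classes present in D."""
    counts = Counter(s[class_key] for s in samples)
    total = len(samples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(samples, attribute, class_key="play"):
    """Reduction in entropy obtained by splitting on one discrete attribute."""
    total = len(samples)
    gain = info(samples, class_key)
    for value in set(s[attribute] for s in samples):
        subset = [s for s in samples if s[attribute] == value]
        gain -= (len(subset) / total) * info(subset, class_key)
    return gain

def best_split_attribute(samples, attributes, class_key="play"):
    """Attribute with the maximum information gain, as chosen in step 5.4."""
    return max(attributes, key=lambda a: information_gain(samples, a, class_key))
```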
6. The information unit places the decision trees obtained into a list, forming a decision tree list. The list is a data structure in which data items are arranged in linear order; the basic operations that can be performed on it include searching for, inserting, and deleting elements.
7. The information unit tests each decision tree. The concrete steps are as follows:
7.1. A group of prepared weather test samples is input into each established decision tree;
7.2. Each weather test sample passes through the decisions of the decision tree and obtains a corresponding classification result;
7.3. The classification result obtained is compared with the class of the weather test sample itself; if they are identical the classification is correct, otherwise it is a classification error; the number of misclassified samples is recorded;
7.4. If the ratio of the number of misclassified weather test samples to the total number of weather test samples is greater than 10%, the decision tree is deleted from the decision tree list; otherwise it is retained in the decision tree list. The ratio of misclassified test samples to total test samples is denoted w_i and recorded in the information unit, where i denotes the i-th decision tree in the decision tree list.
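A compact Python sketch of step 7 follows; the interfaces (a list of predict functions, the 10% threshold as a parameter, the class_key field) are assumptions made for illustration:

```python
# Test each decision tree on the weather test samples, record its error
# ratio w_i, and keep only the trees whose ratio does not exceed 10%.
def test_and_filter(trees, test_samples, class_key="play", threshold=0.10):
    kept = []  # (predict, w_i) pairs, in decision-tree-list order
    for predict in trees:
        errors = sum(1 for s in test_samples if predict(s) != s[class_key])
        w_i = errors / len(test_samples)
        if w_i <= threshold:
            kept.append((predict, w_i))
    return kept
```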
8. Judge whether the decision tree list is empty; if it is, the decision tree with the smallest w_i is chosen as the final decision tree.
9. If the decision tree list is not empty, continue by judging whether it contains only one decision tree; if so, that decision tree is the final decision tree. If not, the decision trees in the list are taken as a combined decision tree. The combined decision tree is a classifier composed of multiple decision trees; each decision tree has a corresponding weight (here, the ratio of misclassified test samples to total test samples), and the final classification of a data item is obtained by combining the classification results of all the decision trees.
10. Compute the misclassification rate H of the combined decision tree and return the class corresponding to the minimum value of H as the classification result of the data. H is computed by
H = argmin_j ( (1/N) · Σ_i w_i · C_{i,j} ),
where i denotes the i-th decision tree; j denotes a class; C_{i,j} indicates whether the classification result obtained from the i-th decision tree is class j, with C_{i,j} = 1 if it is and C_{i,j} = 0 otherwise; w_i is the ratio of misclassified test samples to total test samples recorded for the i-th decision tree during testing; N is the number of decision trees in the combined decision tree whose classification result is class j; and argmin selects the minimum of these values over j as the final value of H.

Claims (1)

1. A data classification method, based on a cost-sensitive decision tree, for a cyber-physical fusion system, characterized in that the method comprises the following steps:
Step 1) The user divides the component units of the cyber-physical fusion system into multiple physical units and one information unit; the physical units collect data, and the information unit analyzes and processes the data;
Step 2) The user places training samples and test samples in the system in advance; each sample in the training samples and test samples comprises a serial number, attribute names, the corresponding attribute values, and a class;
Step 3) The user starts each physical unit to collect training sample data;
Step 4) Each physical unit sends the collected training sample data to the information unit, and the information unit stores the training sample data sent by different physical units separately;
Step 5) The information unit builds, from the sample data it stores for each physical unit, a decision tree corresponding to that physical unit; the decision tree is a tree-shaped classifier used in machine learning classification algorithms, and a classifier is a computer program whose function is to automatically assign data to known classes;
Step 6) The information unit places the decision trees obtained into a list, forming a decision tree list; the list is a data structure in which data items are arranged in linear order, and the basic operations that can be performed on it include searching for, inserting, and deleting elements;
Step 7) The information unit tests each decision tree. The concrete steps are as follows:
Step 7.1) A group of prepared test samples is input into each established decision tree;
Step 7.2) Each test sample passes through the decisions of the decision tree and obtains a corresponding classification result;
Step 7.3) The classification result obtained is compared with the class of the test sample itself; if they are identical the classification is correct, otherwise it is a classification error; the number of misclassified samples is recorded;
Step 7.4) If the ratio of the number of misclassified test samples to the total number of test samples is greater than 10%, the decision tree is deleted from the decision tree list; otherwise it is retained in the decision tree list; the ratio of misclassified test samples to total test samples is denoted w_i and is recorded in the information unit, where i denotes the i-th decision tree in the decision tree list;
Step 8) After the above steps, if the decision tree list is empty, the decision tree with the smallest w_i is chosen as the final decision tree;
Step 9) If the decision tree list contains only one decision tree, that decision tree is the final decision tree;
Step 10) If the decision tree list contains two or more decision trees, the decision trees in the list are taken as a combined decision tree; the combined decision tree is a classifier composed of multiple decision trees, each of which has a corresponding weight, namely the ratio of the number of misclassified test samples to the total number of test samples;
Step 11) The misclassification rate H of the combined decision tree is computed, and the class corresponding to the minimum value of H is returned as the classification result of the data; H is computed by
H = argmin_j ( (1/N) · Σ_i w_i · C_{i,j} ),
where i denotes the i-th decision tree; j denotes a class; C_{i,j} indicates whether the classification result obtained from the i-th decision tree is class j, with C_{i,j} = 1 if it is and C_{i,j} = 0 otherwise; w_i is the ratio of misclassified test samples to total test samples recorded for the i-th decision tree during testing; N is the number of decision trees in the combined decision tree whose classification result is class j; and argmin selects the minimum of these values over j as the final value of H.
CN201410077291.4A 2014-03-05 2014-03-05 Cost-sensitive decision-making tree based physical information fusion system data classification method Active CN103886030B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410077291.4A CN103886030B (en) 2014-03-05 2014-03-05 Cost-sensitive decision-making tree based physical information fusion system data classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410077291.4A CN103886030B (en) 2014-03-05 2014-03-05 Cost-sensitive decision-making tree based physical information fusion system data classification method

Publications (2)

Publication Number Publication Date
CN103886030A true CN103886030A (en) 2014-06-25
CN103886030B CN103886030B (en) 2017-01-25

Family

ID=50954922

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410077291.4A Active CN103886030B (en) 2014-03-05 2014-03-05 Cost-sensitive decision-making tree based physical information fusion system data classification method

Country Status (1)

Country Link
CN (1) CN103886030B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104657610A (en) * 2015-02-13 2015-05-27 南京邮电大学 Temporal logic robustness assessment method for information physical fusion system
CN104778250A (en) * 2015-04-14 2015-07-15 南京邮电大学 Information cyber-physic system data classification method on basis of genetic program decision-making trees
CN104915608A (en) * 2015-05-08 2015-09-16 南京邮电大学 Privacy protection type data classification method for information physical fusion system
CN106611181A (en) * 2016-05-30 2017-05-03 四川用联信息技术有限公司 Method for constructing cost-sensitive two-dimensional decision tree
CN108763221A (en) * 2018-06-20 2018-11-06 科大讯飞股份有限公司 A kind of attribute-name characterizing method and device
CN109358608A (en) * 2018-11-28 2019-02-19 广东电网有限责任公司 A kind of transformer state methods of risk assessment and device based on integrated study
CN110389952A (en) * 2019-06-06 2019-10-29 口碑(上海)信息技术有限公司 A kind of processing method and processing device of vegetable data

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030076323A1 (en) * 2000-05-12 2003-04-24 Hiroshi Nomiyama Information processing system, information processing method, computer program, storage medium, and program transmission apparatus
JP2012108889A (en) * 2010-10-22 2012-06-07 Intec Inc Distributed database system and distributed data processing system
CN102999582A (en) * 2012-11-15 2013-03-27 南京邮电大学 Lightweight rule-based WoT (Web of Things) monitoring system
CN103092918A (en) * 2012-12-21 2013-05-08 东莞中国科学院云计算产业技术创新与育成中心 Method for construction and query of knowledge base of emergency aid decision-making system based on location information

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030076323A1 (en) * 2000-05-12 2003-04-24 Hiroshi Nomiyama Information processing system, information processing method, computer program, storage medium, and program transmission apparatus
JP2012108889A (en) * 2010-10-22 2012-06-07 Intec Inc Distributed database system and distributed data processing system
CN102999582A (en) * 2012-11-15 2013-03-27 南京邮电大学 Lightweight rule-based WoT (Web of Things) monitoring system
CN103092918A (en) * 2012-12-21 2013-05-08 东莞中国科学院云计算产业技术创新与育成中心 Method for construction and query of knowledge base of emergency aid decision-making system based on location information

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104657610A (en) * 2015-02-13 2015-05-27 南京邮电大学 Temporal logic robustness assessment method for information physical fusion system
CN104657610B (en) * 2015-02-13 2017-11-17 南京邮电大学 A kind of information physical emerging system sequential logic robustness appraisal procedure
CN104778250A (en) * 2015-04-14 2015-07-15 南京邮电大学 Information cyber-physic system data classification method on basis of genetic program decision-making trees
CN104778250B (en) * 2015-04-14 2017-09-29 南京邮电大学 Information physical emerging system data classification method based on genetic planning decision tree
CN104915608A (en) * 2015-05-08 2015-09-16 南京邮电大学 Privacy protection type data classification method for information physical fusion system
CN104915608B (en) * 2015-05-08 2018-07-27 南京邮电大学 A kind of information physical emerging system secret protection type data classification method
CN106611181A (en) * 2016-05-30 2017-05-03 四川用联信息技术有限公司 Method for constructing cost-sensitive two-dimensional decision tree
CN108763221A (en) * 2018-06-20 2018-11-06 科大讯飞股份有限公司 A kind of attribute-name characterizing method and device
CN108763221B (en) * 2018-06-20 2022-05-17 科大讯飞股份有限公司 Attribute name representation method and device
CN109358608A (en) * 2018-11-28 2019-02-19 广东电网有限责任公司 A kind of transformer state methods of risk assessment and device based on integrated study
CN110389952A (en) * 2019-06-06 2019-10-29 口碑(上海)信息技术有限公司 A kind of processing method and processing device of vegetable data

Also Published As

Publication number Publication date
CN103886030B (en) 2017-01-25

Similar Documents

Publication Publication Date Title
CN103886030A (en) Cost-sensitive decision-making tree based physical information fusion system data classification method
WO2019179403A1 (en) Fraud transaction detection method based on sequence width depth learning
CN111754345B (en) Bit currency address classification method based on improved random forest
CN108875816A (en) Merge the Active Learning samples selection strategy of Reliability Code and diversity criterion
CN110188209A (en) Cross-module state Hash model building method, searching method and device based on level label
CN104573669A (en) Image object detection method
CN111292195A (en) Risk account identification method and device
CN106294344A (en) Video retrieval method and device
CN111950622B (en) Behavior prediction method, device, terminal and storage medium based on artificial intelligence
CN105654196A (en) Adaptive load prediction selection method based on electric power big data
CN106407482B (en) A kind of network academic report category method based on multi-feature fusion
CN104881735A (en) System and method of smart power grid big data mining for supporting smart city operation management
CN113326377A (en) Name disambiguation method and system based on enterprise incidence relation
CN107918657A (en) The matching process and device of a kind of data source
CN107977670A (en) Accident classification stage division, the apparatus and system of decision tree and bayesian algorithm
Parashar et al. An efficient classification approach for data mining
CN102426585A (en) Webpage automatic classification method based on Bayesian network
CN111586728B (en) Small sample characteristic-oriented heterogeneous wireless network fault detection and diagnosis method
CN114897085A (en) Clustering method based on closed subgraph link prediction and computer equipment
CN104850862B (en) A kind of sorting technique based on unit cost income sensitivity decision tree
Xu Mt-resnet: a multi-task deep network for facial attractiveness prediction
CN102193928B (en) Method for matching lightweight ontologies based on multilayer text categorizer
CN113159441A (en) Prediction method and device for implementation condition of banking business project
CN111708865B (en) Technology forecasting and patent early warning analysis method based on improved XGboost algorithm
CN117372144A (en) Wind control strategy intelligent method and system applied to small sample scene

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20180705

Address after: 210003 A, block 1502, Nanjing University of Finances and Economics science and Technology Park, 128 North Railway Street, Nanjing, Jiangsu.

Patentee after: Nanjing Hengtong Mdt InfoTech Ltd

Address before: 210023 9 Wen Yuan Road, Ya Dong new town, Nanjing, Jiangsu.

Patentee before: Nanjing Post & Telecommunication Univ.

TR01 Transfer of patent right