Disclosure of Invention
The invention provides a power business collaborative classification method and system based on an ID3 decision tree algorithm to solve the technical problems.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
according to a first aspect of the embodiments of the present invention, there is provided a method for classifying outsourced items of an electric power system based on an ID3 decision tree algorithm, including the following steps:
step 101, acquiring a power business cooperation related database, and extracting a sample set S from the power business cooperation related database;
102, extracting an index set A, wherein the index set A contains indexes used for evaluating business cooperative data;
103, calculating the information entropy and the information gain of each index on the sample set S based on an ID3 algorithm to select a proper root node and a proper middle node;
104, constructing a decision tree according to the selected root node;
and 105, evaluating each business cooperation scheme based on the decision tree, and selecting according to requirements.
Preferably, the step 103 specifically includes:
step 1031, calculating the information entropy and the information gain of each index for the sample set S based on the ID3 algorithm;
step 1032, calculating other data except the training data set S by using the information entropy and information gain test obtained in step 1031;
step 1033, select the appropriate root node and intermediate node after the comparison.
Preferably, the process of calculating the information entropy and the information gain of each index for the sample set S based on the ID3 algorithm is as follows:
selecting an index C in the index set A, wherein the index C has m possible values C = {C1, C2, ..., Cm}, and the frequency of Ci occurring in the training set S is pi, wherein 1 ≤ i ≤ m and m and i are integers; the information entropy of the training set S is then:
Entropy(S) = -∑(i=1..m) pi·log2(pi) (1)
another index B is selected as a root node, and the sample set S is divided into sample subsets Sj (j = 1, 2, ..., k) by the index B; the information gain of dividing S by the index B is:
Gain(S,B)=Entropy(S)-EntropyB(S) (2)
the information entropy of the sample subsets after dividing S according to the index B is:
EntropyB(S) = ∑(j=1..k) (|Sj|/|S|)·Entropy(Sj) (3)
wherein |Sj| is the number of samples contained in the sample subset Sj, |S| is the number of samples contained in the sample set S, 1 ≤ j ≤ k, and k and j are integers.
Preferably, the following steps are further included between step 101 and step 102:
step 111, judging whether all sample data in the sample set S are of the same type, if so, turning to step 112, otherwise, executing step 102;
step 112, selecting the class to which all the sample data belong as a root node, and going to step 104.
Preferably, the following steps are further included between step 102 and step 103:
step 121, determining whether the sample set S and the index set a are empty, if yes, going to step 123, otherwise, executing step 103;
and step 123, selecting the class with the highest proportion in the sample set S as a root node, and jumping to the step 104.
Preferably, the following steps are further included between step 121 and step 103:
step 122, determining whether the values of all the indexes in the index set a are unique, if yes, going to step 123, otherwise, executing step 103.
Preferably, the following steps are further included between step 1032 and step 1033:
step 10321, determining whether an erroneous classification exists, if yes, returning to step 1031, otherwise, going to step 1033.
Preferably, between the step 103 and the step 104, the following steps may be further included:
step 131, determining whether all indexes have been traversed, if not, going to step 132, if yes, executing step 104;
and 132, eliminating the traversed indexes, generating a sample subset S without the traversed indexes, and jumping to the step 101.
According to a second aspect of the embodiments of the present invention, there is provided an ID3 decision tree algorithm-based power system service collaborative classification system, including:
the system comprises a sample set extraction module, a data processing module and a data processing module, wherein the sample set extraction module is used for acquiring a power business cooperation related database and extracting a sample set S from the power business cooperation related database;
the index set extraction module is used for extracting an index set A, and the index set A contains indexes used for evaluating business collaborative data;
the information entropy and information gain calculation module is used for calculating information entropy and information gain of each index of the sample set S based on an ID3 algorithm so as to select proper root nodes and middle nodes;
the decision tree construction module is used for constructing a decision tree according to the selected root node;
and the scheme selection module is used for evaluating each business cooperation scheme based on the decision tree and selecting according to the requirement.
Preferably, the system further comprises a user interaction module, which is used for visualization display of the data after the decision tree is constructed and classified and configuration of the interface and the application program.
Compared with the prior art, the method uses information entropy and information gain for calculation, so the amount of calculation is relatively small and the classification accuracy is high. Applied to the calculation and analysis of collaborative data for services such as power outsourcing, the method generates a decision tree by selecting the optimal division characteristics as nodes and classifies the data accordingly; the classification is fast and effective, and collaborative management of services such as power outsourcing is effectively realized.
Detailed Description
The present invention will be described in detail below with reference to specific embodiments shown in the drawings. These embodiments are not intended to limit the present invention, and structural, methodological, or functional changes made by those skilled in the art according to these embodiments are included in the scope of the present invention.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
As shown in fig. 1, a power service collaborative classification method based on an ID3 decision tree algorithm includes the following specific steps.
Step 101, obtaining a power business cooperation related database, and extracting a sample set S from the power business cooperation related database.
The sample set S can be randomly selected from the power service collaborative correlation database as a training data set, which ensures that the data has no specificity and avoids a data set that is too large to converge easily. For example, in a risk policy system, the categories of the data in the sample set S may be marketing, production management (PMS), bidding, finance, and the like, as determined by the values of all samples. If all samples in the sample set S have multiple values under a certain category, that category can be set to multiple types.
Step 102, extracting an index set A, wherein the index set A contains indexes used for evaluating business collaboration data.
Here, the index set A = {A1, A2, ..., An}, wherein n is an integer; n indexes such as marketing, production management, bidding, and finance can be preset, and each index may have m values in the sample set S, wherein m is an integer greater than or equal to zero. For example, the marketing index may take values such as high sales, low sales, high advertisement popularity, low advertisement popularity, high market share, low market share, good customer satisfaction, and poor customer satisfaction; the production management index may take values such as good production quality and poor production quality; the bidding index may take values such as high bid amount; and the financial index may take values such as good financial status and poor financial status. Finally, the data can be classified and summarized to obtain whether the power business cooperation degree is good or poor.
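As an illustrative sketch only (not part of the claimed method; the field names and values below are hypothetical, merely mirroring the indexes named above), the sample set S and the index set A could be represented as follows:

```python
# Hypothetical representation of the sample set S: each sample is a record
# mapping index names to values, plus the class label "cooperation".
samples = [
    {"marketing": "high sales", "production": "good quality",
     "bidding": "high bid amount", "finance": "good status",
     "cooperation": "good"},
    {"marketing": "low market share", "production": "poor quality",
     "bidding": "low bid amount", "finance": "poor status",
     "cooperation": "poor"},
]

# Index set A: the n preset evaluation indexes (every field except the label).
index_set = [key for key in samples[0] if key != "cooperation"]
print(index_set)  # ['marketing', 'production', 'bidding', 'finance']
```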
And 103, calculating the information entropy and the information gain of each index on the sample set S based on the ID3 algorithm to select proper root nodes and middle nodes.
For each index, the information entropy reflects the degree of disorder of the index distribution; the information gain of the other indexes is then calculated from the calculated information entropy, and the index with the maximum information gain is selected to divide the child nodes. The ID3 algorithm selects indexes by information gain, so the amount of calculation is relatively small, the accuracy is high, and the implementation is simple; pruning optimization can also be performed during tree construction. When choosing the segmentation attribute, the index attribute with the maximum information gain is selected, so that the optimal segmentation index generates the important nodes such as the root node.
And 104, constructing a decision tree according to the selected root node. After the selected root node and corresponding branches such as child nodes are obtained, a complete decision tree can be constructed, and real-time display can be performed through interactive operation.
And 105, evaluating each business cooperation scheme based on the decision tree, and selecting according to requirements.
Based on the decision tree constructed above, the most appropriate business cooperation scheme can be selected from the data of related outsourcing cooperation services according to multiple indexes such as marketing, production management (PMS), bidding, and finance. Through intelligent recognition of the basic data (such as graph, pattern, and voice recognition) and further deep-learning techniques (such as quantification, machine learning, knowledge graphs, and big-data topology), a risk assessment system is established and risk factors are analyzed, so that data classification is effectively achieved, auxiliary decisions are provided, and outsourcing business risks are effectively avoided.
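As a minimal sketch of how the evaluation in step 105 might look in practice (assuming, purely for illustration, that the decision tree is stored as nested dictionaries whose internal nodes are keyed by index name and whose leaves hold the cooperation class; the tree and scheme below are hypothetical):

```python
# Hypothetical decision tree: each internal node maps an index name to a dict
# of {index value -> subtree}; leaves are cooperation-class strings.
tree = {"marketing": {
    "high sales": "good",
    "low market share": {"finance": {
        "good status": "good",
        "poor status": "poor",
    }},
}}

def evaluate(tree, scheme):
    """Walk from the root node down to a leaf and return its class."""
    while isinstance(tree, dict):
        index = next(iter(tree))            # index tested at this node
        tree = tree[index][scheme[index]]   # follow the branch for this value
    return tree

scheme = {"marketing": "low market share", "finance": "good status"}
print(evaluate(tree, scheme))  # good
```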
As shown in fig. 2, the following steps may also be included between step 101 and step 102:
step 111, judging whether all sample data in the sample set S belong to the same class, if so, turning to step 112, otherwise, executing step 102;
step 112, selecting the class to which all the sample data belong as a root node, and going to step 104.
When all sample data in the sample set S belong to the same class, that class can be directly set as the root node without further calculating information entropy and information gain.
The following steps can be further included between step 102 and step 103:
step 121, determining whether the sample set S and the index set a are empty, if yes, going to step 123, otherwise, executing step 122;
step 122, judging whether the values of all indexes in the index set A are unique, if so, turning to step 123, otherwise, executing step 103;
and step 123, selecting the class with the highest proportion in the sample set S as a root node, and jumping to the step 104.
When the sample set S and the index set A are empty, the class to which all sample data belong can be directly set as the root node without further calculating information entropy and information gain, for example, an "unaffected data class"; when the values of all indexes in the index set A are unique, the created decision tree has no branches, and the root node and child nodes can be set to any one index. The above steps 121 and 122 may exist separately, or may be performed sequentially as the two steps described above; if step 121 or step 122 exists separately, the process proceeds directly to step 103 if the determination is no, and to step 123 if yes.
Thus, the root node selected in step 104 may be determined according to step 103, or may be determined according to step 111, step 121, or step 122, and then intermediate nodes and leaf nodes are determined based on this, so as to construct a decision tree.
Between the step 103 and the step 104, the following steps may be further included:
step 131, determining whether all indexes have been traversed, if not, going to step 132, if yes, executing step 104;
and 132, eliminating the traversed indexes, generating a sample subset S without the traversed indexes, and jumping to the step 101.
The above operations are repeated until all indexes have been traversed once and the index set A is empty.
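The control flow of steps 111 through 132 can be sketched as a recursive skeleton (a simplified, hypothetical ID3 implementation, assuming each sample is a dict of index values plus a "cooperation" class label; tie-breaking and the test pass of step 1032 are omitted):

```python
from collections import Counter
from math import log2

LABEL = "cooperation"  # hypothetical class-label field

def entropy(samples):
    """Information entropy of the class distribution, per formula (1)."""
    n = len(samples)
    counts = Counter(s[LABEL] for s in samples)
    return -sum(c / n * log2(c / n) for c in counts.values())

def build_tree(samples, indexes):
    classes = [s[LABEL] for s in samples]
    # Steps 111/112: all samples share one class -> that class is the node.
    if len(set(classes)) == 1:
        return classes[0]
    # Steps 121/123: no indexes left -> class with the highest proportion in S.
    if not indexes:
        return Counter(classes).most_common(1)[0][0]
    # Step 103: pick the index with the maximum information gain, formula (2).
    def gain(idx):
        parts = {}
        for s in samples:
            parts.setdefault(s[idx], []).append(s)
        return entropy(samples) - sum(
            len(p) / len(samples) * entropy(p) for p in parts.values())
    best = max(indexes, key=gain)
    # Steps 131/132: eliminate the traversed index and recurse on each subset.
    node = {}
    for value in {s[best] for s in samples}:
        subset = [s for s in samples if s[best] == value]
        node[value] = build_tree(subset, [i for i in indexes if i != best])
    return {best: node}
```

For example, three samples split by a single "marketing" index yield the one-level tree `{"marketing": {"high": "good", "low": "poor"}}`.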
Step 103 is described in further detail below, and as shown in fig. 3, specifically includes:
step 1031, calculating the information entropy and the information gain of each index for the sample set S based on the ID3 algorithm;
step 1032, testing data other than the training data set S by using the information entropy and information gain obtained in step 1031;
step 1033, select the appropriate root node and intermediate node after the comparison.
In step 1031, the process of calculating the information entropy and the information gain of each index for the sample set S based on the ID3 algorithm is as follows:
selecting an index C in the index set A, wherein the index C has m possible values C = {C1, C2, ..., Cm}, and the frequency of Ci occurring in the training set S is pi, wherein 1 ≤ i ≤ m and m and i are integers; the information entropy of the training set S is then:
Entropy(S) = -∑(i=1..m) pi·log2(pi) (1)
another index B is selected as a root node, and the sample set S is divided into sample subsets Sj (j = 1, 2, ..., k) by the index B; the information gain of dividing S by the index B is:
Gain(S,B)=Entropy(S)-EntropyB(S) (2)
the information entropy of the sample subsets after dividing S according to the index B is:
EntropyB(S) = ∑(j=1..k) (|Sj|/|S|)·Entropy(Sj) (3)
wherein |Sj| is the number of samples contained in the sample subset Sj, |S| is the number of samples contained in the sample set S, 1 ≤ j ≤ k, and k and j are integers.
That is, the information gain Gain(S, B) of dividing S by the index B is obtained by subtracting the weighted entropy of the sample subsets Sj divided by the index B from the entropy of the sample set S.
For example, the index power business cooperation degree is selected as the final classification. The sample set S has 15 samples, of which 8 samples have good power business cooperation and 7 samples have poor power business cooperation (with the convention that a term pi·log2(pi) is taken as 0 when pi = 0), so the information entropy is:
Entropy(S) = -(8/15)·log2(8/15) - (7/15)·log2(7/15) ≈ 0.997
one index B in the index set A is marketing, wherein the index 'marketing' takes values as follows: { high sales, high advertisement popularity, and small market share }. If the index is used to divide the sample set S, 3 sample subsets can be obtained, and the division is carried outRespectively recording as: s1(marketing ═ sales are high), S2(marketing is highly famous for advertisement), S3(marketing ═ market share is small).
Suppose S1 contains 6 samples, wherein the proportion with good power business cooperation is 4/6 and the proportion with poor power business cooperation is 2/6; S2 contains 4 samples, wherein the proportion with good power business cooperation is 3/4 and the proportion with poor power business cooperation is 1/4; and S3 contains 5 samples, wherein the proportion with good power business cooperation is 1/5 and the proportion with poor power business cooperation is 4/5.
The information entropy of the three branch nodes is:
Entropy(S1) = -(4/6)·log2(4/6) - (2/6)·log2(2/6) ≈ 0.918
Entropy(S2) = -(3/4)·log2(3/4) - (1/4)·log2(1/4) ≈ 0.811
Entropy(S3) = -(1/5)·log2(1/5) - (4/5)·log2(4/5) ≈ 0.722
The information entropy of the sample subsets after dividing S by the index B is:
EntropyB(S) = (6/15)×0.918 + (4/15)×0.811 + (5/15)×0.722 ≈ 0.824
the information gain for dividing S by the index B is:
Gain(S,B)=Entropy(S)-EntropyB(S)=0.997-0.824=0.173
the information gain is the difference between the impurity (entropy) of the sample set before division and that of the sample subsets after division; the larger the information gain, the purer the sample subsets obtained by dividing with the index B, which is more favorable for classification. The information entropy and information gain under the other indexes are obtained in the same way.
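The worked example can be checked numerically with a short script; the per-subset (good, poor) counts below are illustrative values chosen to be consistent with the figures stated above (0.997, 0.824, 0.173):

```python
from math import log2

def entropy(counts):
    """Information entropy of a class distribution, per formula (1)."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

# Sample set S: 8 good / 7 poor samples.
e_s = entropy([8, 7])                       # ~0.997

# Hypothetical (good, poor) counts for the subsets S1, S2, S3.
subsets = [(4, 2), (3, 1), (1, 4)]
e_b = sum((g + p) / 15 * entropy([g, p])    # formula (3), weighted by |Sj|/|S|
          for g, p in subsets)              # ~0.824

gain = e_s - e_b                            # formula (2), ~0.173
```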
In step 1032, data other than the training data set S is tested using the information entropy and information gain obtained in step 1031.
In addition, the following steps may also be included between step 1032 and step 1033:
step 10321, determining whether an erroneous classification exists, if yes, returning to step 1031, otherwise, executing step 1033. Here, the basis for the judgment may be determined by the classification result of the training data set S.
Corresponding to the foregoing embodiment of the power system service collaborative classification method based on the ID3 decision tree algorithm, the present invention also provides an embodiment of a power system service collaborative classification system based on the ID3 decision tree algorithm.
Referring to fig. 4, a block diagram of an embodiment of the power system service collaborative classification system based on ID3 decision tree algorithm according to the present invention is shown, the system includes:
the sample set extraction module 201 is configured to obtain a power service collaborative correlation database, and extract a sample set S therefrom;
an index set extraction module 202, configured to extract an index set a, where the index set a contains an index for evaluating business collaboration data;
an information entropy and information gain calculation module 203, configured to calculate information entropy and information gain for each index of the sample set S based on an ID3 algorithm, so as to select a suitable root node and an appropriate intermediate node;
a decision tree construction module 204, configured to construct a decision tree according to the selected root node;
and the scheme selection module 205 is configured to evaluate each business cooperation scheme based on the decision tree and select the business cooperation scheme according to requirements.
Further, the sample set extraction module 201 is further configured to determine whether all sample data in the sample set S is of the same class, and if so, select the class to which all sample data belongs as a root node; the index set extraction module 202 is further configured to determine whether the sample set S and the index set a are empty, and if yes, select the class with the highest proportion in the sample set S as a root node; the index set extraction module 202 is further configured to determine whether values of all indexes in the index set a are unique, and if so, select the class with the highest proportion in the sample set S as the root node.
The information entropy and information gain calculating module 203 may specifically include:
the information entropy and information gain calculation submodule calculates the information entropy and information gain of each index on the sample set S based on the ID3 algorithm;
the test calculation submodule tests data other than the training data set S by using the calculated information entropy and information gain;
and selecting a submodule by the node, and selecting a proper root node and a proper middle node after comparison.
The test calculation submodule is further used for judging whether an erroneous classification exists, and if so, feeding the result back to the information entropy and information gain calculation submodule.
With regard to the system in the above embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
In particular, the power system business collaborative classification system based on the ID3 decision tree algorithm may further include a user interaction module 206, which is used for visualization display of data and configuration of interfaces and applications after decision tree construction and classification.
For the system embodiment, since it basically corresponds to the method embodiment, reference may be made to the partial description of the method embodiment for relevant points. The above-described system embodiments are merely illustrative, and some or all of the modules may be selected according to actual needs to achieve the purpose of the disclosed solution. One of ordinary skill in the art can understand and implement it without inventive effort.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.