CN112395273A - Data processing method and device and storage medium - Google Patents


Info

Publication number
CN112395273A
Authority
CN
China
Prior art keywords
feature
data
feature set
decision trees
initial
Prior art date
Legal status: Pending
Application number
CN201910754310.5A
Other languages
Chinese (zh)
Inventor
黄刚
Current Assignee
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Application filed by China Mobile Communications Group Co Ltd and China Mobile Suzhou Software Technology Co Ltd
Priority application: CN201910754310.5A
Publication: CN112395273A (pending)


Classifications

    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G06F18/24323 Tree-organised classifiers


Abstract

The embodiment of the invention discloses a data processing method, a data processing device and a storage medium, wherein the method comprises the following steps: constructing decision trees of the sample characteristic data by using a preset characteristic selection algorithm to obtain at least two decision trees, wherein the at least two decision trees comprise an initial characteristic set corresponding to the sample characteristic data and construction information of each characteristic in the initial characteristic set in each decision tree; calculating the global importance of each feature according to the uncertainty index and the splitting frequency of each piece of construction information in at least two pieces of construction information, thereby obtaining a global importance set corresponding to the initial feature set; selecting a target feature set meeting an importance condition from the initial feature set according to the global importance set; and when the data to be processed is acquired, performing feature data conversion processing on the data to be processed according to the target feature set. The data processing method provided by the embodiment of the invention improves the prediction processing efficiency and the prediction processing effect of the data to be processed.

Description

Data processing method and device and storage medium
Technical Field
The present invention relates to data processing technologies in the field of data mining, and in particular, to a data processing method and apparatus, and a storage medium.
Background
Feature selection, which refers to a process of selecting the most effective features from the initial feature set to reduce data dimensions, that is, feature selection is a process of selecting a target feature set from the initial feature set, and the target feature set is used as a subset of the initial feature set and can effectively describe data; through feature selection, the features valuable to research work can be extracted from the initial feature set corresponding to the high-dimensional data, so that researchers can be assisted to better understand research objects.
Currently, there are three main types of feature selection methods: the Filter method, the Wrapper method and the Embedded method; for the Embedded method, the commonly used method mainly includes regularization and decision tree. However, when feature selection is implemented by a decision tree, it is usually performed by using a single decision tree; because a single decision tree has high deviation and low stability, and the effectiveness of the selected target feature set is low, the processing efficiency is low and the processing effect is poor when the data to be processed is subjected to prediction processing according to the target feature set.
Disclosure of Invention
In order to solve the foregoing technical problems, embodiments of the present invention are intended to provide a data processing method and apparatus, and a storage medium, which can improve the prediction processing efficiency and the prediction processing effect of data to be processed.
The technical scheme of the invention is realized as follows:
in a first aspect, an embodiment of the present invention provides a data processing method, where the method includes:
When sample feature data are obtained according to preset sample data, a decision tree of the sample feature data is constructed by using a preset feature selection algorithm to obtain at least two decision trees, wherein the at least two decision trees comprise an initial feature set corresponding to the sample feature data and construction information of each feature in the initial feature set in each decision tree in the at least two decision trees;
calculating the global importance of each feature according to the uncertainty index and the splitting frequency of each piece of construction information in at least two pieces of construction information, thereby obtaining a global importance set corresponding to the initial feature set;
selecting a target feature set meeting an importance condition from the initial feature set according to the global importance set;
and when the data to be processed is acquired, performing feature data conversion on the data to be processed according to the target feature set so as to perform prediction processing according to the data to be processed after the conversion processing.
In the above scheme, before the constructing a decision tree of the sample feature data by using a preset feature selection algorithm to obtain at least two decision trees, the method further includes:
determining a base learner;
correspondingly, the constructing the decision tree of the sample feature data by using a preset feature selection algorithm to obtain at least two decision trees includes:
and according to the preset feature selection algorithm, using the base learner to iteratively construct decision trees on the sample feature data, so as to obtain the at least two decision trees.
In the above scheme, the calculating the global importance of each feature according to the uncertainty index and the splitting number of each of the at least two pieces of construction information includes:
calculating the uncertainty index mean value of each feature in the at least two decision trees according to the uncertainty index of each of the at least two pieces of construction information;
calculating the total splitting times of each feature in the at least two decision trees according to the splitting times of each of the at least two pieces of construction information;
and obtaining the global importance of each feature according to the uncertainty index mean value and the total splitting times.
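The combination step above can be sketched as follows. This is a minimal illustration only: it assumes the uncertainty index is a Gini index (where lower means stronger classification capability, so the mean is inverted) and that the two statistics are combined by a simple sum of min-max-normalized values; the text does not fix the exact combination rule.

```python
# Hedged sketch of S102: derive one global importance score per feature
# from (a) its mean uncertainty index and (b) its total split count
# across all decision trees. The sum-of-normalized-values rule is an
# assumption, not the patent's prescribed formula.
def normalize(values):
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    return [(v - lo) / span for v in values]

def global_importance(gini_per_tree, splits_per_tree):
    """Each argument: one list per tree, one entry per feature."""
    n_trees = len(gini_per_tree)
    # Mean uncertainty index of each feature across all decision trees.
    mean_gini = [sum(col) / n_trees for col in zip(*gini_per_tree)]
    # Total number of splits of each feature across all decision trees.
    total_splits = [sum(col) for col in zip(*splits_per_tree)]
    # Lower Gini index = stronger classification capability, so invert.
    inv_gini = [1.0 - g for g in normalize(mean_gini)]
    return [a + b for a, b in zip(inv_gini, normalize(total_splits))]

# Two decision trees, three features: feature 1 has the lowest mean
# Gini index and the most splits, so it scores highest.
scores = global_importance([[0.40, 0.10, 0.30], [0.35, 0.15, 0.25]],
                           [[1, 4, 2], [2, 5, 1]])
```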
In the foregoing solution, before the calculating the global importance of each feature according to the uncertainty index and the splitting number of each of the at least two pieces of construction information, the method further includes:
obtaining at least two branches of each feature in each of the at least two decision trees;
calculating a sub uncertainty index corresponding to each of the at least two branches;
and calculating the uncertainty index of each constructed information according to at least two sub uncertainty indexes, thereby obtaining the uncertainty index of at least two constructed information.
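One common way to roll the branch sub-indexes up into the uncertainty index of a piece of construction information, assuming the Gini index is the chosen uncertainty measure, is a sample-weighted mean of the branch impurities:

```python
# Hedged sketch: the Gini index of each branch is a "sub uncertainty
# index"; the split's uncertainty index is assumed here to be the
# sample-weighted mean of those branch values.
from collections import Counter

def gini(labels):
    """Gini impurity of one branch (a sub uncertainty index)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_uncertainty(branches):
    """Weight each branch's Gini index by its share of the samples."""
    total = sum(len(b) for b in branches)
    return sum(len(b) / total * gini(b) for b in branches)

# A pure branch contributes 0; a 50/50 branch contributes 0.5,
# weighted by half the samples -> overall index 0.25.
left, right = [0, 0, 0, 0], [0, 1, 1, 0]
value = split_uncertainty([left, right])
```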
In the above scheme, the selecting, according to the global importance set, a target feature set satisfying an importance condition from the initial feature set includes:
sorting the features in the initial feature set based on the global importance set to obtain a sorted initial feature set, wherein the sorted initial feature set comprises n features, and n is a positive integer greater than or equal to 1;
when the sorted initial feature set is arranged in reverse order of global importance, selecting the 1st feature from the sorted initial feature set, and determining the prediction accuracy P_1 of the 1st feature;
selecting the 2nd feature from the sorted initial feature set, and determining the prediction accuracy P_2 of the 2nd feature;
according to the importance condition, selecting the kth feature from the sorted initial feature set, and determining the prediction accuracy P_k of the kth feature; when a preset number c of accuracies, P_(k-c+1) to P_k, decrease sequentially, stopping the feature selection and taking the k-c+1 selected features as the target feature set, wherein k is a positive integer greater than or equal to 3 and less than or equal to n, and c is a positive integer greater than or equal to 3 and less than or equal to k.
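The stopping rule above can be sketched as follows. `evaluate` is a hypothetical callback that returns the prediction accuracy P_k obtained with the first k sorted features; everything else follows the rule: stop once c consecutive accuracies decrease, and keep the features selected before the run of drops.

```python
# Hedged sketch of the selection loop: features arrive sorted by
# decreasing global importance; accuracy is recomputed after each
# addition; a run of c consecutive drops triggers the stop.
def select_features(sorted_features, evaluate, c=3):
    accuracies = []
    for k in range(1, len(sorted_features) + 1):
        accuracies.append(evaluate(sorted_features[:k]))
        # Did the last c accuracies P_(k-c+1) .. P_k decrease strictly?
        if len(accuracies) >= c and all(
            accuracies[i] > accuracies[i + 1]
            for i in range(len(accuracies) - c, len(accuracies) - 1)
        ):
            # Keep the k-c+1 features selected before the drop run.
            return sorted_features[: k - c + 1]
    return sorted_features

# Toy accuracies (hypothetical): peak at 2 features, then 2 drops.
feats = ["f1", "f2", "f3", "f4", "f5"]
acc = {1: 0.70, 2: 0.80, 3: 0.78, 4: 0.75, 5: 0.74}
chosen = select_features(feats, lambda fs: acc[len(fs)], c=3)
```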
In the foregoing solution, after the features in the initial feature set are sorted based on the global importance set to obtain a sorted initial feature set, the method further includes:
when the sorted initial feature set is arranged in positive order of global importance, selecting the nth feature from the sorted initial feature set, and determining the prediction accuracy P_n of the nth feature;
selecting the (n-1)th feature from the sorted initial feature set, and determining the prediction accuracy P_(n-1) of the (n-1)th feature;
according to the importance condition, selecting the kth feature from the sorted initial feature set, and determining the prediction accuracy P_k of the kth feature; when a preset number c of accuracies, P_(k+c-1) to P_k, decrease sequentially, stopping the feature selection and taking the n-k-c+2 selected features as the target feature set, wherein k is a positive integer greater than or equal to 1 and less than or equal to n-2, and c is a positive integer greater than or equal to 3 and less than or equal to n-k+1.
In the above solution, said calculating an uncertainty index mean of each feature in the at least two decision trees according to the uncertainty index of each of the at least two pieces of construction information includes:
respectively carrying out normalization processing on the uncertainty index of each piece of construction information in at least two pieces of construction information to obtain at least two normalized uncertainty indexes of each feature;
and calculating the uncertainty index mean value of each feature in the at least two decision trees according to the at least two normalized uncertainty indexes.
In the above solution, said calculating, according to the number of splits of each of the at least two pieces of construction information, a sum of the number of splits of each feature in the at least two decision trees includes:
respectively carrying out normalization processing on the splitting times of each piece of construction information in at least two pieces of construction information to obtain at least two normalized splitting times of each feature;
and calculating the total splitting times of each feature in the at least two decision trees according to the at least two normalized splitting times.
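A minimal sketch of this per-tree normalization, assuming min-max scaling within each decision tree (the text does not fix the scheme): each tree's split counts are scaled into [0, 1] before being summed across trees.

```python
# Hedged sketch: normalize split counts tree by tree, then sum per
# feature across trees to get the total splitting times.
def normalize_per_tree(splits_per_tree):
    normalized = []
    for tree_counts in splits_per_tree:
        lo, hi = min(tree_counts), max(tree_counts)
        span = (hi - lo) or 1
        normalized.append([(c - lo) / span for c in tree_counts])
    return normalized

def total_splits(splits_per_tree):
    normalized = normalize_per_tree(splits_per_tree)
    return [sum(col) for col in zip(*normalized)]

# Two trees, three features; feature 1 is the most-split in both trees.
totals = total_splits([[1, 4, 2], [2, 5, 1]])
```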
In a second aspect, an embodiment of the present invention provides a data processing apparatus, where the apparatus includes: a processor, a memory and a communication bus, the memory communicating with the processor through the communication bus, the memory storing a program executable by the processor, the program, when executed, executing the data processing method as described above through the processor.
In a third aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a program is stored, and when the program is executed by a processor, the program implements the data processing method as described above.
The embodiment of the invention provides a data processing method and device and a storage medium, and the data processing method and device comprises the steps of firstly, when sample characteristic data are obtained according to preset sample data, constructing decision trees of the sample characteristic data by using a preset characteristic selection algorithm to obtain at least two decision trees, wherein the at least two decision trees comprise an initial characteristic set corresponding to the sample characteristic data and construction information of each characteristic in the initial characteristic set in each decision tree in the at least two decision trees; secondly, calculating the global importance of each feature according to the uncertainty index and the splitting frequency of each piece of construction information in at least two pieces of construction information, thereby obtaining a global importance set corresponding to the initial feature set; then, according to the global importance set, selecting a target feature set meeting the importance condition from the initial feature set; and finally, when the data to be processed is acquired, performing feature data conversion processing on the data to be processed according to the target feature set so as to perform prediction processing according to the data to be processed after the conversion processing. 
By adopting the technical implementation scheme, the feature selection of the initial feature set is performed based on at least two decision trees, so that the deviation is low; moreover, the target feature set is a set consisting of features selected according to two parameters of uncertainty indexes and splitting times, so that a scheme of selecting the features according to a plurality of decision trees and a plurality of importance parameters is realized, and therefore, the effectiveness of the selected target feature set is high; therefore, when the data to be processed is subjected to prediction processing according to the target feature set, the processing efficiency and the processing effect of the data to be processed can be improved.
Drawings
Fig. 1 is a flowchart illustrating an implementation of a data processing method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an exemplary process for obtaining at least two decision trees according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating an exemplary data processing method according to an embodiment of the present invention;
fig. 4 is a first schematic structural diagram of a data processing apparatus according to an embodiment of the present invention;
fig. 5 is a second schematic structural diagram of a data processing apparatus according to an embodiment of the present invention.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
Example one
The embodiment of the invention provides a data processing method, whose application scenario is as follows: after the data processing device performs feature selection according to preset sample data, data to be processed of the same type as the preset sample data is subjected to prediction processing according to the selected target feature set. Fig. 1 is a flowchart of an implementation of a data processing method according to an embodiment of the present invention; as shown in fig. 1, the data processing method includes:
s101, when sample feature data are obtained according to preset sample data, a decision tree of the sample feature data is constructed by using a preset feature selection algorithm to obtain at least two decision trees, wherein the at least two decision trees comprise an initial feature set corresponding to the sample feature data and construction information of each decision tree of each feature in the initial feature set in the at least two decision trees.
In the embodiment of the present invention, preset sample data is stored in advance in the data processing device, and the preset sample data is original data for the data processing device to perform feature selection, so that the data processing device needs to perform cleaning and conversion processing on the original data when performing the feature selection processing by using the original data, where the preset sample data after the cleaning and conversion processing is sample feature data, and thus the data processing device obtains the sample feature data according to the preset sample data. At this time, since the preset feature selection algorithm is preset in the data processing device, the data processing device can perform decision tree construction on the sample feature data by using the preset feature selection algorithm, so as to obtain at least two decision trees; and the at least two decision trees comprise an initial feature set corresponding to the sample feature data and construction information of each feature in the initial feature set in each decision tree in the at least two decision trees.
It should be noted that the preset feature selection algorithm represents an algorithm for constructing multiple decision trees from the sample feature data, for example, GBDT (Gradient Boosting Decision Tree). Since both decision-tree construction and feature selection consist in selecting features with classification capability from the initial feature set corresponding to the sample feature data, in the embodiment of the present invention the process in which the data processing apparatus constructs decision trees from the sample feature data serves as an initial feature selection on the initial feature set, and subsequent processing is performed based on the constructed decision trees to complete the feature selection processing of the initial feature set. Here, the classification capability of a feature characterizes the ability of the feature to effectively describe the sample feature data. In addition, the sample feature data is sample data on which feature selection can be performed directly, and the initial feature set is a set formed by the attributes corresponding to the sample feature data; for each feature in the initial feature set, corresponding construction information exists in the construction process of the at least two decision trees, and the construction information represents the information used for classifying the sample feature data according to each feature, such as the feature's values, Gini index, information gain ratio, entropy, number of splits, split branches, and the like.
Here, when the at least two decision trees are sorted according to the constructed time, the latter decision tree of the sorted at least two decision trees is constructed on the basis of reducing the deviation of the former decision tree.
S102, calculating the global importance of each feature according to the uncertainty index and the splitting times of each piece of construction information in at least two pieces of construction information, and accordingly obtaining a global importance set corresponding to the initial feature set.
In the embodiment of the present invention, after the data processing device obtains at least two decision trees corresponding to the initial feature set, at least two pieces of construction information corresponding to each feature are obtained, so that the data processing device can calculate the global importance corresponding to each feature in the initial feature set from two angles according to the uncertainty index and the splitting frequency of each piece of construction information in the at least two pieces of construction information, thereby obtaining the global importance set corresponding to the initial feature set.
It should be noted that the uncertainty index characterizes the uncertainty of each feature in the initial feature set in each of the at least two decision trees, such as the Gini index, entropy, or information gain ratio. The size of the uncertainty index may be positively or negatively correlated with the classification capability of the feature, depending on the specific index used: when the uncertainty index is the Gini index or the entropy, the smaller the index, the larger the classification capability of the feature, and the larger the index, the smaller the classification capability; conversely, when the uncertainty index is the information gain, the larger the index, the larger the classification capability of the feature, and the smaller the index, the smaller the classification capability. The splitting times characterize how many times each feature in the initial feature set is split on, with different feature values, in each of the at least two decision trees. In addition, the initial feature set comprises at least one feature, and the global importance set characterizes the set formed by the global importance corresponding to each feature in the initial feature set.
Specifically, the data processing apparatus may calculate at least two uncertainty indexes and at least two splitting times in at least two pieces of construction information of at least two decision trees corresponding to each feature according to a preset importance method, so as to obtain a global importance of each feature in the at least two decision trees, and further combine the global importance of each feature to obtain a global importance set of the initial feature set.
And S103, selecting a target feature set meeting the importance condition from the initial feature set according to the global importance set.
In the embodiment of the present invention, after the data processing apparatus obtains the global importance set corresponding to the initial feature set, the data processing apparatus can select the target feature set according to the global importance and the importance condition corresponding to each feature, so as to select a feature satisfying the importance condition from the initial feature set to form the target feature set, thereby completing the feature selection processing on the initial feature set.
It should be noted that the target feature set is a subset of the initial feature set, and the target feature set can effectively describe the sample feature data corresponding to the initial feature set. The importance condition may be a judgment condition larger than a preset threshold, or may be another judgment condition (for example, a quadratic drop method), and this is not particularly limited in the embodiment of the present invention.
And S104, when the data to be processed is acquired, performing feature data conversion processing on the data to be processed according to the target feature set, and performing prediction processing according to the data to be processed after the conversion processing.
In the embodiment of the present invention, when the data processing apparatus obtains the data to be processed, since the data processing apparatus has already obtained the target feature set at this time, the data to be processed can be subjected to the conversion processing of the feature data according to the target feature set, so that the prediction processing is performed according to the data to be processed after the conversion processing.
It should be noted that the data to be processed and the preset sample data belong to the same type of data, for example, the data to be processed and the preset sample data are both traffic data, and for example, the data to be processed and the preset sample data are both student consumption data.
In addition, the data processing device performs feature data conversion processing on the data to be processed according to the target feature set, which is equivalent to extracting data corresponding to features only including the target feature set from the data to be processed; at this time, the data processing device performs prediction processing based on the extracted data (data to be processed after conversion processing); the prediction processing may be a prediction algorithm in the prior art, and the embodiment of the present invention is not described herein again; in addition, the execution main body for performing the prediction processing according to the to-be-processed data after the conversion processing may be a data processing apparatus, or may be another device.
For example, when the data to be processed and the preset sample data are both traffic data, after the data processing device obtains a target feature set corresponding to the traffic data, the data processing device can perform conversion processing on the feature data of the data to be processed according to the target feature set, and then perform traffic flow prediction according to the data to be processed after the conversion processing.
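The feature-data conversion in S104 amounts to projecting each record onto the target feature set; the field names below are hypothetical, chosen only to echo the traffic example.

```python
# Hedged sketch: keep only the columns named in the target feature set
# before handing records to the predictor; all field names are made up
# for illustration.
target_feature_set = {"flow_rate", "hour_of_day", "road_segment"}

def convert(record):
    return {k: v for k, v in record.items() if k in target_feature_set}

raw = {"flow_rate": 120, "hour_of_day": 8, "road_segment": "A3",
       "sensor_id": "s-17", "firmware": "2.1"}
converted = convert(raw)  # drops the columns the predictor never needs
```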
It can be understood that, in the embodiment of the present invention, by constructing a plurality of decision trees for the initial feature set and considering the importance of each feature in the initial feature set from a plurality of angles, excellent features are effectively extracted from the initial feature set to form a target feature set, and feature selection of the initial feature set is completed; when subsequent prediction processing is carried out, the data processing device extracts data only containing the features in the target feature set from the data to be processed; compared with the data to be processed, the extracted data can reduce the calculation amount of prediction processing, so that the efficiency of the prediction processing is improved; meanwhile, the data extracted according to the target feature set can still effectively describe the data to be processed, so that the accuracy of prediction processing is ensured.
Further, in this embodiment of the present invention, before the data processing device constructs a decision tree of the sample feature data by using a preset feature selection algorithm in S101 to obtain at least two decision trees, the data processing method further includes S105, specifically:
and S105, determining a base learner.
In the embodiment of the present invention, before obtaining the at least two decision trees corresponding to the initial feature set by using the preset feature selection algorithm, the data processing apparatus needs to determine the algorithm used to construct each decision tree, that is, the base learner, for example, CART (Classification and Regression Trees).
Correspondingly, in this embodiment of the present invention, in S101, the constructing, by the data processing apparatus, decision trees of the sample feature data by using a preset feature selection algorithm to obtain at least two decision trees includes: the data processing apparatus iteratively constructs decision trees on the sample feature data by using the base learner according to the preset feature selection algorithm, so as to obtain the at least two decision trees.
That is to say, in the embodiment of the present invention, after the data processing apparatus determines the base learner, since the preset feature selection algorithm iteratively constructs decision trees on the sample feature data, the data processing apparatus can, according to the preset feature selection algorithm, iteratively construct decision trees on the sample feature data by using the base learner; the result of this iterative construction (all the constructed decision trees) is taken as the result of the initial feature selection on the initial feature set, thereby obtaining the at least two decision trees.
Specifically, the process of the data processing device for iteratively constructing the decision tree by using the basis learner to the sample feature data is as follows: the data processing device utilizes a base learner to construct a decision tree for the sample characteristic data to obtain a first decision tree; at this time, the data processing device constructs the decision tree for the sample feature data by reusing the base learner under the condition of reducing the deviation of the first decision tree according to the deviation of the first decision tree, so as to obtain a second decision tree; next, the data processing device constructs a decision tree for the sample characteristic data by reusing the base learner under the condition of reducing the deviation of the second decision tree according to the deviation of the second decision tree, so as to obtain a third decision tree; thus, the data processing device constructs a decision tree for the sample characteristic data by using the base learner according to the deviation of the previous decision tree under the condition of reducing the deviation of the previous decision tree, so as to obtain the current decision tree, the current decision tree is used as the previous decision tree, the base learner continues to iteratively construct the decision tree for the sample characteristic data until a preset iteration end condition is met, the iterative construction of the decision tree is ended, and all the decision trees are used as decision tree construction results, so that at least two decision trees are obtained. Here, the data processing apparatus uses the base learner to construct the decision tree for the sample feature data, which is a process of constructing the decision tree for the sample feature data according to the initial feature set in the prior art, and the embodiment of the present invention is not described herein again.
Exemplarily, fig. 2 is a schematic flowchart of an exemplary process for acquiring at least two decision trees according to an embodiment of the present invention. As shown in fig. 2, after the data processing apparatus obtains the sample feature data, it iteratively constructs decision trees on the sample feature data by using CART as the base learner: first, the data processing device constructs a decision tree for the sample feature data by using CART to obtain a first decision tree; then, the data processing device constructs a decision tree for the sample feature data by using CART according to the residual (deviation) of the first decision tree to obtain a second decision tree; in this way, the data processing device constructs the current decision tree for the sample feature data by using the base learner according to the residual of the previous decision tree, takes the current decision tree as the previous decision tree, and uses the base learner again to construct a decision tree for the sample feature data, until the preset iteration end condition is met and the iterative construction of decision trees ends; at this time, M decision trees are obtained in total, and the M decision trees are taken as the result of the preliminary feature selection: the at least two decision trees.
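The residual-driven loop described above can be sketched as follows. This is a simplified illustration, not the patent's implementation: depth-1 regression stumps stand in for full CART trees, and the names `fit_stump` and `boost` are hypothetical.

```python
def fit_stump(x, residuals):
    """Fit a one-split regression tree on a single feature: choose the
    threshold that minimises the squared error, predict the leaf means."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    best = None
    for cut in range(1, len(x)):
        left = [residuals[order[i]] for i in range(cut)]
        right = [residuals[order[i]] for i in range(cut, len(x))]
        mean_l = sum(left) / len(left)
        mean_r = sum(right) / len(right)
        sse = (sum((v - mean_l) ** 2 for v in left)
               + sum((v - mean_r) ** 2 for v in right))
        if best is None or sse < best[0]:
            best = (sse, x[order[cut - 1]], mean_l, mean_r)
    _, threshold, mean_l, mean_r = best
    return lambda v: mean_l if v <= threshold else mean_r


def boost(x, y, n_trees=3, learning_rate=0.5):
    """Iteratively construct trees; tree m is fitted to the residuals
    (deviation) left by trees 1..m-1, mirroring the loop in the text."""
    predictions = [0.0] * len(x)
    trees = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        tree = fit_stump(x, residuals)
        trees.append(tree)
        predictions = [pi + learning_rate * tree(xi)
                       for pi, xi in zip(predictions, x)]
    return trees, predictions
```

Each pass shrinks the ensemble's deviation on the training data, which is the "reducing the deviation of the previous decision tree" condition in the text.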
It can be understood that after the data processing apparatus performs the preliminary feature selection processing on the initial feature set by using the preset feature selection algorithm, a plurality of decision trees can be obtained, so that the subsequent feature selection processing can be performed according to the plurality of decision trees, the global importance of each feature is determined by the plurality of decision trees together, and the problem of low effectiveness of the selected feature caused by instability and large deviation of a single decision tree is solved.
Further, in the embodiment of the present invention, in S102, the global importance of each feature is calculated according to the uncertainty index and the splitting times of each of the at least two pieces of construction information, so as to obtain a global importance set corresponding to the initial feature set, which specifically includes S102a-S102c:
S102a, calculating the uncertainty index mean of each feature in the at least two decision trees according to the uncertainty index of each of the at least two pieces of construction information.
In the embodiment of the present invention, after the data processing device obtains the at least two decision trees, since the uncertainty index of each feature in the initial feature set has already been calculated in each iteratively constructed decision tree, the data processing device can obtain, from each piece of construction information in the at least two pieces of construction information, the uncertainty index of each feature in each of the at least two decision trees, that is, the at least two uncertainty indexes corresponding to each feature, so as to calculate the uncertainty index mean of each feature in the at least two decision trees according to the at least two uncertainty indexes corresponding to each feature.
That is, for each feature in the initial feature set, there is an uncertainty index in the construction information corresponding to each decision tree, and the number of decision trees is at least two, so each feature has at least two uncertainty indexes; at this time, the data processing device calculates the mean of the at least two uncertainty indexes, so as to obtain the uncertainty index mean of each feature; that is, the uncertainty index mean characterizes the mean of the at least two uncertainty indexes of each feature. In addition, the process of obtaining the uncertainty index mean by the data processing apparatus may also be another calculation performed on the at least two uncertainty indexes, which is not specifically limited in the embodiment of the present invention.
It can be understood that, since the data processing apparatus has already calculated the uncertainty index of each feature in each decision tree in the process of constructing the decision tree for the sample feature data, the accuracy of the uncertainty of the feature can be improved when the uncertainty index mean of each feature is obtained by integrating at least two uncertainty indexes corresponding to at least two decision trees and describing the uncertainty of each feature by using the uncertainty index mean.
Further, in the embodiment of the present invention, the data processing apparatus in S102a calculates the uncertainty index mean of each feature in the at least two decision trees according to the uncertainty index of each of the at least two pieces of construction information, which specifically includes S102a1-S102a2:
S102a1, respectively performing normalization processing on the uncertainty index of each of the at least two pieces of construction information to obtain at least two normalized uncertainty indexes of each feature.
It should be noted that the data processing device respectively performs normalization processing on the at least two uncertainty indexes of each feature in the at least two pieces of construction information to obtain at least two normalized uncertainty indexes of each feature; moreover, when the data processing apparatus performs normalization processing on the uncertainty index, it adopts an existing normalization processing technique, for example, [0-100] normalization, which is not described again in the embodiment of the present invention.
S102a2, calculating the uncertainty index mean value of each feature in at least two decision trees according to the at least two normalized uncertainty indexes.
In the embodiment of the present invention, after the data processing device obtains the at least two normalized uncertainty indexes of each feature, it calculates the mean of the at least two normalized uncertainty indexes, so as to obtain the uncertainty index mean of each feature in the at least two decision trees; here, the uncertainty index mean characterizes the mean of the at least two normalized uncertainty indexes of each feature.
By performing normalization processing on the uncertainty indexes, the data processing device simplifies, to a certain extent, the amount of calculation performed with the uncertainty indexes and improves the convenience of that calculation; meanwhile, the situation in which an extreme value of an uncertainty index influences the subsequent calculation result is avoided.
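One plausible reading of S102a1-S102a2 can be sketched as below. This is only an assumption: the text leaves the normalization open (e.g. a [0-100] scaling), and it does not pin down whether normalization runs across the trees of one feature or across the features of one tree; the sketch normalizes one feature's per-tree Gini values with a min-max scaling and then averages them.

```python
def min_max_normalize(values, lo=0.0, hi=1.0):
    """Scale a list of values linearly into [lo, hi]."""
    vmin, vmax = min(values), max(values)
    span = (vmax - vmin) or 1.0  # avoid division by zero for constant lists
    return [lo + (v - vmin) / span * (hi - lo) for v in values]


def uncertainty_index_mean(per_tree_indexes):
    """Normalize a feature's uncertainty index from each decision tree,
    then average the normalized values over the M trees (S102a2)."""
    normalized = min_max_normalize(per_tree_indexes)
    return sum(normalized) / len(normalized)
```

Passing `lo=0, hi=100` would give the [0-100] scaling mentioned in the text instead of the [0, 1] default used here.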
S102b, calculating the total splitting times of each feature in at least two decision trees according to the splitting times of each piece of the at least two pieces of construction information.
In the embodiment of the invention, in the process of constructing the decision tree for the initial feature set by using the preset feature selection algorithm, the number of times of splitting of each feature in each decision tree in the initial feature set is related to the importance of the feature, so that the data processing device also uses the splitting number of each piece of constructed information in at least two pieces of constructed information corresponding to the feature as one of the parameters for calculating the global importance of the feature. Here, for each feature, there are at least two splitting times in the at least two decision trees, and the data processing device calculates a sum of the splitting times of each feature in the at least two decision trees according to the at least two splitting times.
It is noted that the sum of the number of splits characterizes the sum of at least two split numbers of each feature. In addition, when the data processing apparatus calculates the total number of splitting times of each feature according to at least two splitting times, the data processing apparatus may directly sum the at least two splitting times, may sum the at least two splitting times according to different weights, or may use another calculation method, which is not specifically limited in this embodiment of the present invention.
It can be understood that, since the splitting times of the features in the decision tree are related to the global importance of the features, the data processing device can improve the accuracy of the global importance of the obtained features by obtaining the splitting time sum corresponding to each feature according to at least two splitting times of each feature and participating the splitting time sum in the calculation of the global importance of the features.
Further, in the embodiment of the present invention, the data processing apparatus in S102b calculates the total splitting times of each feature in the at least two decision trees according to the splitting times of each of the at least two pieces of construction information, which specifically includes S102b1-S102b2:
S102b1, respectively performing normalization processing on the splitting times of each piece of construction information in the at least two pieces of construction information to obtain at least two normalized splitting times of each feature.
It should be noted that, the data processing device respectively performs normalization processing on at least two splitting times of each feature in at least two pieces of construction information to obtain at least two normalized splitting times of each feature; moreover, when the data processing apparatus performs normalization processing on the split times, the existing normalization processing technology is adopted, and the embodiment of the invention is not described herein again.
S102b2, calculating the total splitting times of each feature in at least two decision trees according to the at least two normalized splitting times.
In the embodiment of the invention, after the data processing device obtains the at least two normalized splitting times of each feature, it calculates the sum of the at least two normalized splitting times, so as to obtain the sum of the splitting times of each feature in the at least two decision trees; here, the sum of the splitting times characterizes the sum of the at least two normalized splitting times of each feature.
By performing normalization processing on the splitting times, the data processing device simplifies, to a certain extent, the amount of calculation performed with the splitting times and improves the convenience of that calculation; meanwhile, the influence of an extreme value of the splitting times on the subsequent calculation result is avoided.
S102c, obtaining the global importance of each feature according to the uncertainty index mean and the sum of the splitting times.
In the embodiment of the invention, after the data processing device obtains the uncertainty index mean value and the splitting frequency sum corresponding to each feature, the data processing device integrates the uncertainty index mean value and the splitting frequency sum, and calculates the uncertainty index mean value and the splitting frequency sum by using a preset importance algorithm to obtain the global importance of each feature.
Accordingly, after the data processing apparatus obtains the global importance of each feature, a global importance set composed of the global importance of each feature is also obtained for the initial feature set.
Illustratively, when the at least two decision trees are M decision trees and the uncertainty index of each feature in each decision tree is the Gini index, the data processing apparatus calculates the global importance of each feature according to the preset importance algorithm of formula (1):

$$I(A) = \frac{1}{M}\sum_{j=1}^{M}\mathrm{Gini}(D, A)_j + \sum_{j=1}^{M}\mathrm{Division}(A)_j \tag{1}$$

where $A$ is a feature in the initial feature set, $I(A)$ is the global importance of feature $A$, $D$ is the parent node corresponding to the at least two branches of feature $A$, $\mathrm{Gini}(D, A)_j$ is the normalized uncertainty index of feature $A$ in the $j$th decision tree, $\frac{1}{M}\sum_{j=1}^{M}\mathrm{Gini}(D, A)_j$ is the uncertainty index mean of feature $A$, $\mathrm{Division}(A)_j$ is the normalized number of splits of feature $A$ in the $j$th decision tree, and $\sum_{j=1}^{M}\mathrm{Division}(A)_j$ is the sum of the splitting times of feature $A$.
Further, in this embodiment of the present invention, before the data processing device in S102 calculates the global importance of each feature according to the uncertainty index and the splitting number of each of the at least two pieces of the construction information, the processing method further includes S106-S108:
S106, acquiring at least two branches of each feature in each decision tree of the at least two decision trees.
In an embodiment of the present invention, after obtaining the at least two decision trees, the data processing apparatus determines, for each feature in the initial feature set, at least two branches of each feature in each of the at least two decision trees.
It should be noted that each feature in the initial feature set has at least two branches in each of the at least two decision trees. Here, the at least two branches characterize branch information obtained by splitting sub-sample feature data in the sample feature data according to the features.
S107, calculating a sub-uncertainty index corresponding to each branch of the at least two branches.
In an embodiment of the present invention, after the data processing apparatus obtains at least two branches of each feature in each decision tree, a sub-uncertainty index corresponding to each branch of the at least two branches can be calculated. Here, the sub-uncertainty index characterizes the uncertainty of the feature at each branch.
Illustratively, when the sub-uncertainty index is the Gini index of a branch, the data processing apparatus calculates the sub-uncertainty index according to formula (2):

$$\mathrm{Gini}(D_1) = 1 - \sum_{i=1}^{N}\left(\frac{|C_i|}{|D_1|}\right)^2 \tag{2}$$

where $D_1$ represents one of the at least two branches, $\mathrm{Gini}(D_1)$ represents the Gini index of branch $D_1$, $N$ represents the number of classes to be predicted in $D_1$ in the decision tree, $|D_1|$ denotes the number of samples in the branch, and $|C_i|$ represents the number of samples in the $i$th of the $N$ categories; here, a sample refers to a corresponding piece of data record in the sample feature data.
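Formula (2) transcribes directly into code: the Gini index of a branch from the per-class sample counts |C_i| within that branch.

```python
def gini_branch(class_counts):
    """Gini(D1) = 1 - sum_i (|C_i| / |D1|)^2, where class_counts lists the
    number of samples of each class falling into the branch."""
    total = sum(class_counts)  # |D1|, the number of samples in the branch
    return 1.0 - sum((c / total) ** 2 for c in class_counts)
```

A pure branch (all samples in one class) gives 0, and an even two-class split gives the maximum two-class value 0.5.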
S108, calculating the uncertainty index of the construction information according to the at least two sub-uncertainty indexes, thereby obtaining the uncertainty indexes of the at least two pieces of construction information.
In the embodiment of the present invention, after the data processing device obtains the sub uncertainty index corresponding to each of the at least two branches, the data processing device can obtain the at least two sub uncertainty indexes corresponding to the at least two branches, and at this time, the data processing device calculates the uncertainty index of each feature in each decision tree by using a preset calculation algorithm according to the at least two sub uncertainty indexes.
Illustratively, when the at least two sub-uncertainty indexes are the Gini indexes of two branches, the preset calculation algorithm is as shown in formula (3):

$$\mathrm{Gini}(D, A) = \frac{|D_1|}{|D|}\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\mathrm{Gini}(D_2) \tag{3}$$

where the parent node $D$ is divided by the feature $A$ into two branches $D_1$ and $D_2$ in one of the at least two decision trees ($D$, $D_1$ and $D_2$ satisfy formula (4)), $\mathrm{Gini}(D_1)$ represents the Gini index of branch $D_1$, $\mathrm{Gini}(D_2)$ represents the Gini index of branch $D_2$, $|D_1|$ denotes the number of samples in branch $D_1$, $|D_2|$ denotes the number of samples in branch $D_2$, $|D|$ represents the number of samples in the parent node $D$, and $\mathrm{Gini}(D, A)$ represents the uncertainty index of feature $A$ in the decision tree where the node $D$ composed of $D_1$ and $D_2$ is located. Here, $D$ is the node corresponding to the sub-sample feature data of the sample feature data in the decision tree.

$$D = D_1 + D_2 \tag{4}$$
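Formula (3) can likewise be sketched as a sample-weighted sum of branch Gini indexes; the generalization to more than two branches follows the same pattern.

```python
def gini_branch(class_counts):
    """Gini index of one branch, as in formula (2)."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)


def gini_split(branch_class_counts):
    """Gini(D, A) = sum_k (|D_k| / |D|) * Gini(D_k) over the branches that
    feature A produces at parent node D (formula (3))."""
    total = sum(sum(counts) for counts in branch_class_counts)  # |D|
    return sum(sum(counts) / total * gini_branch(counts)
               for counts in branch_class_counts)
```

A split that separates the classes perfectly yields 0, the best possible value for a candidate feature at that node.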
It should be noted that, steps S106 to S108 are steps of calculating an uncertainty index shown in the embodiment of the present invention, and the step of calculating the uncertainty index may be in other calculation manners, which is not limited in the embodiment of the present invention. In addition, S106-S108 pertain to the process of obtaining the build information in S101.
Further, in this embodiment of the present invention, in S103, the data processing apparatus selects, according to the global importance set, a target feature set that satisfies the importance condition from the initial feature set, which specifically includes S103a-S103g:
S103a, sorting the features in the initial feature set based on the global importance set to obtain a sorted initial feature set, where the sorted initial feature set includes n features, and n is a positive integer greater than or equal to 1.
It should be noted that, after obtaining a global importance set formed by the global importance of each feature, the data processing apparatus ranks the features in the initial feature set according to the global importance of each feature in the global importance set, so as to obtain a ranked initial feature set; here, the sorted initial feature set includes n features, and n is a positive integer equal to or greater than 1. In addition, the sorted initial feature sets may be arranged in a reverse order according to the global importance, may be arranged in a positive order according to the global importance, and may also be arranged in other arrangement manners, which is not specifically limited in this embodiment of the present invention.
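The ranking of S103a is a one-liner; the feature names and importance scores below are hypothetical, and `reverse=True` gives the reverse-order (descending) arrangement discussed next.

```python
features = ["A", "B", "C", "D"]
importance = {"A": 0.31, "B": 0.72, "C": 0.55, "D": 0.12}  # global importances
# reverse order by global importance: most important feature first
ranked = sorted(features, key=lambda f: importance[f], reverse=True)
```

Dropping `reverse=True` gives the positive-order (ascending) arrangement instead.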
In the embodiment of the invention, when the sorted initial feature set is arranged in reverse order by global importance, S103b-S103d are executed; and when the sorted initial feature set is arranged in positive order by global importance, S103e-S103g are executed. The details are as follows:
S103b, when the sorted initial feature set is arranged in reverse order according to the global importance, selecting the 1st feature from the sorted initial feature set, and determining the prediction accuracy P1 of the 1st feature.
In the embodiment of the present invention, when the features in the sorted initial feature set are arranged in reverse order according to the global importance, that is, when the 1st feature in the sorted initial feature set is the feature with the largest or highest global importance, the data processing apparatus selects the 1st feature from the sorted initial feature set and determines the prediction accuracy P1 of the 1st feature.
The prediction accuracy of the 1 st feature is a result of comparing a predicted value and an actual value obtained by obtaining and predicting data corresponding to the 1 st feature in the sample feature data.
S103c, selecting the 2nd feature from the sorted initial feature set, and determining the prediction accuracy P2 of the 2nd feature.
In the embodiment of the invention, after the data processing device selects the 1st feature, the feature with the highest global importance, from the sorted initial feature set, it then selects the 2nd feature, the feature with the second-highest global importance, from the sorted initial feature set, and takes the result of comparing the predicted values and the actual values of the data corresponding to the 1st feature and the 2nd feature in the sample feature data as the prediction accuracy P2 of the 2nd feature.
S103d, selecting the kth feature from the sorted initial feature set according to the importance condition, and determining the prediction accuracy Pk of the kth feature, until a preset number c of values Pk-c+1 to Pk decrease sequentially; at that point, stopping selecting features and taking the first k-c+1 selected features as the target feature set, where k is a positive integer greater than or equal to 3 and less than or equal to n, and c is a positive integer greater than or equal to 3 and less than or equal to k.
In the embodiment of the invention, the data processing device sequentially selects the kth feature in descending order of global importance, and takes the result of comparing the predicted values and the actual values of the data corresponding, in the sample feature data, to the kth feature together with the k-1 previously selected features as the prediction accuracy Pk of the kth feature. Meanwhile, according to the importance condition, the data processing device judges whether a preset number c of values Pk-c+1 to Pk decrease sequentially; when the values Pk-c+1 to Pk decrease sequentially, it stops selecting features and takes the first k-c+1 selected features as the target feature set; where k is a positive integer greater than or equal to 3 and less than or equal to n, and c is a positive integer greater than or equal to 3 and less than or equal to k.
By this point, the data processing apparatus has completed selecting a target feature set from the initial feature sets arranged in the reverse order.
S103e, when the sorted initial feature set is arranged in positive order according to the global importance, selecting the nth feature from the sorted initial feature set, and determining the prediction accuracy Pn of the nth feature.
In the embodiment of the invention, when the features in the sorted initial feature set are arranged in positive order according to the global importance, that is, when the nth feature in the sorted initial feature set is the feature with the largest or highest global importance, the data processing device selects the nth feature from the sorted initial feature set and determines the prediction accuracy Pn of the nth feature.
The prediction accuracy of the nth feature is a result of comparing a predicted value and an actual value obtained by obtaining and predicting data corresponding to the nth feature in the sample feature data.
S103f, selecting the (n-1)th feature from the sorted initial feature set, and determining the prediction accuracy Pn-1 of the (n-1)th feature.
In the embodiment of the invention, after the data processing device selects the nth feature, the feature with the highest global importance, from the sorted initial feature set, it selects the (n-1)th feature with the second-highest global importance from the sorted initial feature set, and takes the result of comparing the predicted values and the actual values of the data corresponding to the nth feature and the (n-1)th feature in the sample feature data as the prediction accuracy Pn-1 of the (n-1)th feature.
S103g, selecting the kth feature from the sorted initial feature set, and determining the prediction accuracy Pk of the kth feature, until a preset number c of values Pk+c-1 to Pk decrease sequentially; at that point, stopping selecting features and taking the selected n-k-c+2 features as the target feature set, where k is a positive integer greater than or equal to 1 and less than or equal to n-2, and c is a positive integer greater than or equal to 3 and less than or equal to n-k+1.
In the embodiment of the invention, the data processing device sequentially selects the kth feature in descending order of global importance, and takes the result of comparing the predicted values and the actual values of the data corresponding, in the sample feature data, to the kth feature together with the n-k previously selected features as the prediction accuracy Pk of the kth feature. Meanwhile, according to the importance condition, the data processing device judges whether a preset number c of values Pk+c-1 to Pk decrease sequentially; when the values decrease sequentially, it stops selecting features and takes the selected n-k+1-(c-1), namely n-k-c+2, features as the target feature set; where k is a positive integer greater than or equal to 1 and less than or equal to n-2, and c is a positive integer greater than or equal to 3 and less than or equal to n-k+1.
By this point, the data processing apparatus has completed selecting a target feature set from the initial feature sets in the forward order.
That is, in the embodiment of the present invention, after obtaining the sorted initial feature set, the data processing apparatus sequentially selects, from the sorted initial feature set, the feature with the highest remaining global importance in the global importance set, calculates the prediction accuracy of the selected features each time a feature is selected, and compares the prediction accuracies of a preset number of features in the order of selection; when it determines that the prediction accuracies of the preset number of features decrease sequentially, it stops selecting features, removes the most recently selected (preset number minus one) features from the selected features, and takes the remaining features as the target feature set.
It should be noted that the prediction accuracy rate refers to the result of comparing the predicted values with the actual values obtained by predicting the sample feature data with the selected features, and the prediction accuracy rate is calculated by using a prediction algorithm in the prior art (e.g., a decision tree, an autoregressive integrated moving average (ARIMA) model, or a neural network model), which is not described again in the embodiment of the present invention.
Exemplarily, when the preset number c is 3 and the sorted initial feature set is arranged in reverse order according to the global importance, after obtaining the sorted initial feature set, the data processing device marks the currently selected feature with the highest global importance as the qth feature of the target feature set, and then sequentially selects the (q+1)th feature and the (q+2)th feature; here, the prediction accuracies corresponding to the qth, (q+1)th and (q+2)th features are denoted Pq, Pq+1 and Pq+2. When the prediction accuracies corresponding to the qth, (q+1)th and (q+2)th features decrease sequentially (i.e., Pq > Pq+1 > Pq+2), the data processing device stops selecting features and takes the first q selected features as the target feature set. It can be understood that the data processing device selects the target feature set from the initial feature set by adopting this quadratic descent method, realizing a scheme that automatically and reasonably selects the number of features.
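The selection loop of S103b-S103d can be sketched as below. `accuracy_fn` is a hypothetical stand-in for the real prediction model (e.g. a decision tree) evaluated on the sample feature data with the cumulative subset; the stop criterion is the c consecutive strictly decreasing accuracies described above, after which the last c-1 picks are dropped.

```python
def select_features(ranked_features, accuracy_fn, c=3):
    """Walk down the features (descending global importance), track the
    prediction accuracy of the cumulative subset, and stop once the last c
    accuracies strictly decrease, keeping all but the last c - 1 picks."""
    selected, accuracies = [], []
    for feature in ranked_features:
        selected.append(feature)
        accuracies.append(accuracy_fn(selected))
        # stop when the last c accuracies Pk-c+1 .. Pk strictly decrease
        if len(accuracies) >= c and all(
            accuracies[-i - 1] > accuracies[-i] for i in range(1, c)
        ):
            return selected[: len(selected) - (c - 1)]
    return selected
```

With c = 3 this reproduces the Pq > Pq+1 > Pq+2 example: once three accuracies in a row decline, the two most recent features are discarded and the first q features are kept.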
Fig. 3 is a schematic flowchart of an exemplary data processing method according to an embodiment of the present invention. As shown in fig. 3, the data processing method starts, and the data processing apparatus cleans and preprocesses traffic data (preset sample data) to obtain traffic feature data (sample feature data). Then, the data processing device uses GBDT (the preset feature selection algorithm) to iteratively construct decision trees for the traffic feature data, obtaining M CART trees (the at least two decision trees). Then, the data processing device calculates the Gini index (uncertainty index) of each feature in the traffic feature set (initial feature set) corresponding to the traffic feature data in each of the M CART trees, and calculates the Gini index mean (uncertainty index mean) of each feature in the M CART trees; the data processing device counts the splitting times of each feature in the traffic feature data in each of the M CART trees, and calculates the total splitting times (the sum of the splitting times) of each feature in the M CART trees; and it performs a weighted combination of the Gini index mean and the total splitting times to obtain the global importance of each feature.
Then, the data processing device sequentially selects the features with the highest global importance from the traffic feature set according to the global importance of each feature, and calculates the prediction accuracy of the selected features; it judges whether the last preset number of prediction accuracies decrease sequentially; when they do, it ends the feature selection, takes the selected features minus the most recently selected (preset number minus one) features as the target feature set, and ends the flow of the data processing method; when the last preset number of prediction accuracies do not decrease sequentially, it continues to select the feature with the highest remaining global importance from the traffic feature set, calculates the prediction accuracy of the selected features, and again judges whether the last preset number of prediction accuracies decrease sequentially, until they do and the feature selection ends.
It will be appreciated that since the feature selection of the initial feature set is based on at least two decision trees, the bias is low; moreover, the target feature set is a set consisting of features selected according to two parameters, namely a first importance set and a second importance set, and a scheme of selecting features according to a plurality of decision trees and a plurality of importance parameters is realized, so that the effectiveness of the selected target feature set is high; thus, the effect of feature selection is improved.
Embodiment Two
Based on the same inventive concept of the first embodiment, an embodiment of the present invention provides a data processing apparatus 1, corresponding to a data processing method, and fig. 4 is a schematic structural diagram of the data processing apparatus provided in the first embodiment of the present invention, as shown in fig. 4, where the data processing apparatus 1 includes:
a constructing unit 10, configured to, when sample feature data is obtained according to preset sample data, construct a decision tree of the sample feature data by using a preset feature selection algorithm, to obtain at least two decision trees, where the at least two decision trees include an initial feature set corresponding to the sample feature data and construction information of each feature in the initial feature set in each decision tree in the at least two decision trees;
a calculating unit 11, configured to calculate a global importance of each feature according to an uncertainty index and a splitting frequency of each of the at least two pieces of construction information, so as to obtain a global importance set corresponding to the initial feature set;
a selecting unit 12, configured to select, according to the global importance set, a target feature set that meets an importance condition from the initial feature set;
and the processing unit 13 is configured to, when the data to be processed is acquired, perform feature data conversion processing on the data to be processed according to the target feature set, so as to perform prediction processing according to the data to be processed after the conversion processing.
Further, the data processing apparatus 1 further comprises a determination unit 14 for determining a base learner;
correspondingly, the constructing unit 10 is configured to iteratively construct a decision tree on the sample feature data by using the base learner according to the preset feature selection algorithm, so as to obtain the at least two decision trees.
Further, the calculating unit 11 is specifically configured to calculate an uncertainty index mean of each feature in the at least two decision trees according to the uncertainty index of each of the at least two pieces of construction information; calculating the total splitting times of each feature in the at least two decision trees according to the splitting times of each of the at least two pieces of construction information; and obtaining the global importance of each feature according to the uncertainty index mean value and the total splitting times.
Further, the data processing apparatus 1 further comprises an obtaining unit 15, configured to obtain at least two branches of each feature in each decision tree of the at least two decision trees; calculating a sub uncertainty index corresponding to each branch of the at least two branches; and calculating the uncertainty index of each piece of construction information according to the at least two sub uncertainty indexes, thereby obtaining the uncertainty index of the at least two pieces of construction information.
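Taking the Gini index as the uncertainty measure — an assumption, since the patent leaves the metric open — the per-branch sub uncertainty indexes can be aggregated into one index for the split like this:

```python
def gini(labels):
    """Gini impurity of one branch's class labels (the sub uncertainty index)."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_uncertainty(branches):
    """Sample-weighted mean of the per-branch sub uncertainty indexes."""
    total = sum(len(b) for b in branches)
    return sum(len(b) / total * gini(b) for b in branches)

# a split that sends 4 samples down one branch and 3 down the other
branches = [[0, 0, 0, 1], [1, 1, 1]]
u = split_uncertainty(branches)
```

A pure branch (all labels equal) contributes zero; the weighting keeps large branches from being drowned out by small ones.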
Further, the selecting unit 12 is specifically configured to rank the features in the initial feature set based on the global importance set to obtain a sorted initial feature set, where the sorted initial feature set includes n features, and n is a positive integer greater than or equal to 1; when the sorted initial feature set is arranged in reverse (descending) order of global importance, to select the 1st feature from the sorted initial feature set and determine the prediction accuracy P_1 of the 1st feature; to select the 2nd feature from the sorted initial feature set and determine the prediction accuracy P_2 of the 2nd feature; and, according to the importance condition, to select the k-th feature from the sorted initial feature set and determine the prediction accuracy P_k of the k-th feature, until a preset number c of accuracies P_{k-c+1} through P_k decrease in sequence, at which point feature selection stops and the first k-c+1 selected features are taken as the target feature set, where k is a positive integer greater than or equal to 3 and less than or equal to n, and c is a positive integer greater than or equal to 3 and less than or equal to k.
Further, the selecting unit 12 is specifically configured, when the sorted initial feature set is arranged in positive (ascending) order of global importance, to select the n-th feature from the sorted initial feature set and determine the prediction accuracy P_n of the n-th feature; to select the (n-1)-th feature from the sorted initial feature set and determine the prediction accuracy P_{n-1} of the (n-1)-th feature; and, according to the importance condition, to select the k-th feature from the sorted initial feature set and determine the prediction accuracy P_k of the k-th feature, until a preset number c of accuracies P_{k+c-1} through P_k decrease in sequence, at which point feature selection stops and the first n-k-c+2 selected features are taken as the target feature set, where k is a positive integer greater than or equal to 1 and less than or equal to n-2, and c is a positive integer greater than or equal to 3 and less than or equal to n-k+1.
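Both orderings share the same stopping rule: traverse the features most-important-first, grow the candidate set one feature per step, stop after c strictly decreasing accuracies, and discard the tail of the decline. A compact sketch, where `accuracy_of` is a hypothetical callback that trains and scores a model on a candidate feature subset:

```python
def select_features(ranked, accuracy_of, c=3):
    """Walk the importance-ranked features, adding one per step; once the
    last c accuracies strictly decrease, keep only the features selected
    up to the start of that decline (k - c + 1 of the k selected)."""
    accs = []
    for k in range(1, len(ranked) + 1):
        accs.append(accuracy_of(ranked[:k]))
        if k >= c and all(accs[i] > accs[i + 1] for i in range(k - c, k - 1)):
            return ranked[:k - c + 1]
    return ranked  # accuracy never declined for c steps: keep everything

# toy accuracy profile: peaks at 2 features, then declines
chosen = select_features(
    ["a", "b", "c", "d", "e"],
    accuracy_of=lambda feats: {1: 0.6, 2: 0.8, 3: 0.75, 4: 0.7, 5: 0.65}[len(feats)],
)
```

With c = 3, the decline at 3, 4, and 5 features triggers the stop and only the first two features survive.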
Further, the calculating unit 11 is specifically configured to perform normalization processing on the uncertainty index of each of the at least two pieces of construction information, respectively, to obtain at least two normalized uncertainty indexes of each feature; and calculating the uncertainty index mean value of each feature in the at least two decision trees according to the at least two normalized uncertainty indexes.
Further, the calculating unit 11 is specifically configured to perform normalization processing on the splitting frequency of each piece of the at least two pieces of construction information, respectively, to obtain at least two normalized splitting frequencies of each feature; and calculating the total splitting times of each feature in the at least two decision trees according to the at least two normalized splitting times.
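Both normalization steps (over the uncertainty indexes and over the split counts) can use the same scheme; min-max scaling is one common choice, assumed here since the patent does not fix the method:

```python
def min_max_normalize(values):
    """Scale one feature's per-tree statistics into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:            # constant across trees: nothing to discriminate
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# e.g. one feature's split counts across three decision trees
norm_splits = min_max_normalize([2, 4, 6])
```

Normalizing first keeps trees with many splits (or large impurity values) from dominating the subsequent mean and total.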
In practical applications, the constructing unit 10, the calculating unit 11, the selecting unit 12, the processing unit 13, the determining unit 14 and the obtaining unit 15 may be implemented by a processor 16 located on the data processing apparatus 1, specifically by a Central Processing Unit (CPU), a Micro Processor Unit (MPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), or the like.
An embodiment of the present invention further provides a data processing apparatus 1, and as shown in fig. 5, the data processing apparatus 1 includes: a processor 16, a memory 17 and a communication bus 18, wherein the memory 17 is in communication with the processor 16 via the communication bus 18, and the memory 17 stores a program executable by the processor 16, and when the program is executed, the data processing method according to the first embodiment is executed by the processor 16.
In practical applications, the memory 17 may be a volatile memory, such as a Random-Access Memory (RAM); or a non-volatile memory, such as a Read-Only Memory (ROM), a flash memory, a Hard Disk Drive (HDD) or a Solid-State Drive (SSD); or a combination of the above types of memories, and it provides instructions and data to the processor 16.
The embodiment of the present invention provides a computer-readable storage medium, on which a program is stored, and the program implements the data processing method according to the first embodiment when executed by the processor 16.
It will be appreciated that, since the feature selection over the initial feature set is based on at least two decision trees, the bias is low; moreover, the target feature set consists of features selected according to two parameters, namely the uncertainty index and the number of splits, realizing a scheme that selects features according to multiple decision trees and multiple importance parameters, so the selected target feature set is highly effective; thus, the effect of feature selection is improved.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention.

Claims (10)

1. A method of data processing, the method comprising:
when sample feature data are obtained according to preset sample data, a decision tree of the sample feature data is constructed by using a preset feature selection algorithm to obtain at least two decision trees, wherein the at least two decision trees comprise an initial feature set corresponding to the sample feature data and construction information of each feature in the initial feature set in each decision tree in the at least two decision trees;
calculating the global importance of each feature according to the uncertainty index and the splitting frequency of each piece of construction information in at least two pieces of construction information, thereby obtaining a global importance set corresponding to the initial feature set;
selecting a target feature set meeting an importance condition from the initial feature set according to the global importance set;
and when the data to be processed is acquired, performing feature data conversion on the data to be processed according to the target feature set so as to perform prediction processing according to the data to be processed after the conversion processing.
2. The method according to claim 1, wherein before the step of constructing the decision tree of the sample feature data by using the preset feature selection algorithm to obtain at least two decision trees, the method further comprises:
determining a base learner;
correspondingly, the constructing the decision tree of the sample feature data by using a preset feature selection algorithm to obtain at least two decision trees includes:
and according to the preset feature selection algorithm, iteratively constructing decision trees of the sample feature data by using the base learner, so as to obtain the at least two decision trees.
3. The method of claim 1, wherein the calculating the global importance of each feature according to the uncertainty index and the number of splits of each of the at least two pieces of construction information comprises:
calculating the uncertainty index mean value of each feature in the at least two decision trees according to the uncertainty index of each of the at least two pieces of construction information;
calculating the total splitting times of each feature in the at least two decision trees according to the splitting times of each of the at least two pieces of construction information;
and obtaining the global importance of each feature according to the uncertainty index mean value and the total splitting times.
4. The method of claim 1, wherein before the calculating the global importance of each feature according to the uncertainty index and the number of splits of each of the at least two pieces of construction information, the method further comprises:
obtaining at least two branches of each feature in each of the at least two decision trees;
calculating a sub uncertainty index corresponding to each of the at least two branches;
and calculating the uncertainty index of each piece of construction information according to the at least two sub uncertainty indexes, thereby obtaining the uncertainty indexes of the at least two pieces of construction information.
5. The method according to claim 1, wherein the selecting a target feature set satisfying an importance condition from the initial feature set according to the global importance set comprises:
sorting the features in the initial feature set based on the global importance set to obtain a sorted initial feature set, wherein the sorted initial feature set comprises n features, and n is a positive integer greater than or equal to 1;
when the sorted initial feature set is arranged in reverse (descending) order of global importance, selecting the 1st feature from the sorted initial feature set, and determining the prediction accuracy P_1 of the 1st feature;
selecting the 2nd feature from the sorted initial feature set, and determining the prediction accuracy P_2 of the 2nd feature;
and, according to the importance condition, selecting the k-th feature from the sorted initial feature set and determining the prediction accuracy P_k of the k-th feature, until a preset number c of accuracies P_{k-c+1} through P_k decrease in sequence; then stopping the feature selection and taking the first k-c+1 selected features as the target feature set, wherein k is a positive integer greater than or equal to 3 and less than or equal to n, and c is a positive integer greater than or equal to 3 and less than or equal to k.
6. The method of claim 5, wherein after the ranking the features in the initial feature set based on the global importance set to obtain a ranked initial feature set, the method further comprises:
when the sorted initial feature set is arranged in positive (ascending) order of global importance, selecting the n-th feature from the sorted initial feature set, and determining the prediction accuracy P_n of the n-th feature;
selecting the (n-1)-th feature from the sorted initial feature set, and determining the prediction accuracy P_{n-1} of the (n-1)-th feature;
and, according to the importance condition, selecting the k-th feature from the sorted initial feature set and determining the prediction accuracy P_k of the k-th feature, until a preset number c of accuracies P_{k+c-1} through P_k decrease in sequence; then stopping the feature selection and taking the first n-k-c+2 selected features as the target feature set, wherein k is a positive integer greater than or equal to 1 and less than or equal to n-2, and c is a positive integer greater than or equal to 3 and less than or equal to n-k+1.
7. The method according to claim 3, wherein the calculating the uncertainty index mean of each feature in the at least two decision trees according to the uncertainty index of each of the at least two pieces of construction information comprises:
respectively carrying out normalization processing on the uncertainty index of each piece of construction information in at least two pieces of construction information to obtain at least two normalized uncertainty indexes of each feature;
and calculating the uncertainty index mean value of each feature in the at least two decision trees according to the at least two normalized uncertainty indexes.
8. The method according to claim 3, wherein the calculating the total number of splits of each feature in the at least two decision trees according to the number of splits of each of the at least two pieces of construction information comprises:
respectively carrying out normalization processing on the splitting times of each piece of construction information in at least two pieces of construction information to obtain at least two normalized splitting times of each feature;
and calculating the total splitting times of each feature in the at least two decision trees according to the at least two normalized splitting times.
9. A data processing apparatus, characterized in that the apparatus comprises: a processor, a memory and a communication bus, the memory in communication with the processor through the communication bus, the memory storing a program executable by the processor, the program, when executed, causing the processor to perform the method of any of claims 1-8.
10. A computer-readable storage medium, on which a program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1 to 8.
CN201910754310.5A 2019-08-15 2019-08-15 Data processing method and device and storage medium Pending CN112395273A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910754310.5A CN112395273A (en) 2019-08-15 2019-08-15 Data processing method and device and storage medium

Publications (1)

Publication Number Publication Date
CN112395273A true CN112395273A (en) 2021-02-23

Family

ID=74601645


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113326432A (en) * 2021-06-04 2021-08-31 温州大学 Model optimization method based on decision tree and recommendation method
CN115602282A (en) * 2022-09-23 2023-01-13 北京华益精点生物技术有限公司 Guiding method for blood sugar monitoring and related equipment
CN117975994A (en) * 2024-04-01 2024-05-03 华南师范大学 Quality classification method and device for voice data and computer equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Deng Jing, "Traffic Flow Prediction Model Based on Data Mining Technology", Computer Systems & Applications *

Similar Documents

Publication Publication Date Title
JP6755849B2 (en) Pruning based on the class of artificial neural networks
CN111079899A (en) Neural network model compression method, system, device and medium
CN112395273A (en) Data processing method and device and storage medium
CN111950656B (en) Image recognition model generation method and device, computer equipment and storage medium
US20120123980A1 (en) Optimization technique using evolutionary algorithms
Xie et al. Visualization and Pruning of SSD with the base network VGG16
CN112200316B (en) GBDT learning method for online prediction task
Singh et al. Acceleration of deep convolutional neural networks using adaptive filter pruning
CN110796485A (en) Method and device for improving prediction precision of prediction model
JPWO2019146189A1 (en) Neural network rank optimizer and optimization method
CN110956277A (en) Interactive iterative modeling system and method
CN108509727B (en) Model selection processing method and device in data modeling
GB2599137A (en) Method and apparatus for neural architecture search
CN113837379A (en) Neural network training method and device, and computer readable storage medium
CN113139570A (en) Dam safety monitoring data completion method based on optimal hybrid valuation
CN110826692A (en) Automatic model compression method, device, equipment and storage medium
CN106709572B (en) A kind of data processing method and equipment
Peter et al. Resource-efficient dnns for keyword spotting using neural architecture search and quantization
JP6991960B2 (en) Image recognition device, image recognition method and program
CN112836817A (en) Compression method of convolutional neural network model
US10331799B2 (en) Generating a feature set
CN116089713A (en) Recommendation model training method, recommendation device and computer equipment
CN115481728A (en) Transmission line defect detection method, model pruning method, equipment and medium
CN111949530B (en) Test result prediction method and device, computer equipment and storage medium
CN114742221A (en) Deep neural network model pruning method, system, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210223