CN110880014B - Data processing method, device, computer equipment and storage medium - Google Patents

Data processing method, device, computer equipment and storage medium

Info

Publication number
CN110880014B
CN110880014B (granted publication of application CN201910965010.1A)
Authority
CN
China
Prior art keywords
data
feature
target
data set
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910965010.1A
Other languages
Chinese (zh)
Other versions
CN110880014A (en)
Inventor
秦文力
张密
韩丙卫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Property and Casualty Insurance Company of China Ltd
Original Assignee
Ping An Property and Casualty Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Property and Casualty Insurance Company of China Ltd filed Critical Ping An Property and Casualty Insurance Company of China Ltd
Priority to CN201910965010.1A
Publication of CN110880014A
Application granted
Publication of CN110880014B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/10 Pre-processing; Data cleansing
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2433 Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a data processing method, a device, computer equipment and a storage medium. Data to be processed is obtained and preprocessed to generate a standard data set; the standard data set is iteratively cleaned with an isolated forest algorithm to generate a target data set; features are extracted from the target data set to obtain its basic features; the weight of each basic feature in the target data set is determined according to a preset strategy; the necessary features of the target data set are determined according to those weights; feature construction is performed on each basic feature with a GBDT feature combination algorithm to generate cross combination features; and the cross combination features and the necessary features are combined to obtain target feature data. This avoids a curse of dimensionality in the features of the obtained target feature data and further improves the accuracy of data processing.

Description

Data processing method, device, computer equipment and storage medium
Technical Field
The present invention relates to the field of data processing, and in particular, to a data processing method, apparatus, computer device, and storage medium.
Background
With the development of technology in the big data age, applications that use various kinds of data for prediction, evaluation, or feedback are becoming common. Typically, such applications collect a large amount of sample data and run prediction, evaluation, or feedback on it via machine learning, clustering, and similar methods. This process places high demands on the data source, for example on its quantity, accuracy, and balance, so the acquired data needs to be processed in advance to ensure the accuracy of the subsequent prediction, evaluation, or feedback. At present, most traditional data processing methods stop at the level of data standardization or feature importance judgment and cannot process data efficiently and accurately.
Disclosure of Invention
The embodiment of the invention provides a data processing method, a data processing device, computer equipment and a storage medium, which are used for solving the problem of low accuracy of data processing.
A data processing method, comprising:
the method comprises the steps of obtaining data to be processed, and preprocessing the data to be processed to obtain a standard data set, wherein the preprocessing comprises at least one of missing value processing, outlier processing, duplicate removal processing or noise data processing;
Performing iterative cleaning on the standard data set by adopting an isolated forest algorithm to generate a target data set;
extracting features of the target data set to obtain basic features of the target data set;
determining the weight of each basic feature in the target data set according to a preset strategy;
determining necessary features of the target data set according to the weight of each basic feature in the target data set;
performing feature construction on each basic feature by adopting a GBDT feature combination algorithm to generate cross combination features;
and combining the cross combination feature and the necessary feature to obtain target feature data.
A data processing apparatus comprising:
the preprocessing module is used for acquiring data to be processed, preprocessing the data to be processed to obtain a standard data set, wherein the preprocessing comprises at least one of missing value processing, outlier processing, duplicate removal processing or noise data processing;
the iterative cleaning module is used for performing iterative cleaning on the standard data set by adopting an isolated forest algorithm to generate a target data set;
the feature extraction module is used for extracting features of the target data set to obtain basic features of the target data set;
The weight determining module is used for determining the weight of each basic feature in the target data set according to a preset strategy;
a necessary feature determining module, configured to determine a necessary feature of the target data set according to a weight of each of the basic features in the target data set;
the feature construction module is used for carrying out feature construction on each basic feature by adopting a GBDT feature combination algorithm to generate cross combination features;
and the combination module is used for combining the cross combination characteristic and the necessary characteristic to obtain target characteristic data.
A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the data processing method described above when executing the computer program.
A computer readable storage medium storing a computer program which, when executed by a processor, implements the data processing method described above.
The data processing method, device, computer equipment and storage medium obtain the data to be processed and preprocess it to generate a standard data set, wherein the preprocessing includes at least one of missing value processing, outlier processing, duplicate removal processing, or noise data processing; iteratively clean the standard data set with an isolated forest algorithm to generate a target data set; extract features from the target data set to obtain its basic features; determine the weight of each basic feature in the target data set according to a preset strategy; determine the necessary features of the target data set according to those weights; perform feature construction on each basic feature with a GBDT feature combination algorithm to generate cross combination features; and combine the cross combination features with the necessary features to obtain the target feature data. The isolated forest algorithm and the GBDT feature combination algorithm are thus combined into a single data processing framework: after the data is cleaned with the isolated forest algorithm, the basic features of the target data set are constructed with the GBDT feature combination algorithm to generate cross combination features, and the necessary features are finally combined with the cross combination features to obtain the target feature data. This not only avoids a curse of dimensionality in the features of the obtained target feature data and reflects the importance of the necessary features within it, but also further improves the accuracy of data processing and ensures the accuracy of the generated target feature data.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an application environment of a data processing method according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating an exemplary method of data processing in accordance with one embodiment of the present invention;
FIG. 3 is a diagram showing another example of a data processing method according to an embodiment of the present invention;
FIG. 4 is a diagram showing another example of a data processing method according to an embodiment of the present invention;
FIG. 5 is a diagram showing another example of a data processing method according to an embodiment of the present invention;
FIG. 6 is a diagram showing another example of a data processing method according to an embodiment of the present invention;
FIG. 7 is a schematic block diagram of a data processing apparatus in accordance with an embodiment of the present invention;
FIG. 8 is another functional block diagram of a data processing apparatus in accordance with one embodiment of the present invention;
FIG. 9 is another functional block diagram of a data processing apparatus in accordance with one embodiment of the present invention;
FIG. 10 is a schematic diagram of a computer device in accordance with an embodiment of the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The data processing method provided by the embodiment of the invention can be applied to the application environment shown in fig. 1. Specifically, the data processing method is applied to a data processing system comprising a client and a server as shown in fig. 1, which communicate through a network to solve the problem of low data processing accuracy. The client, also called the user end, refers to the program that corresponds to the server and provides local services for the user. The client may be installed on, but is not limited to, various personal computers, notebook computers, smartphones, tablet computers, and portable wearable devices. The server may be implemented as a stand-alone server or as a server cluster formed by a plurality of servers.
In one embodiment, as shown in fig. 2, a data processing method is provided, and the method is applied to the server in fig. 1, and the method includes the following steps:
S10: and obtaining data to be processed, and preprocessing the data to be processed to obtain a standard data set, wherein the preprocessing comprises at least one of missing value processing, outlier processing, duplicate removal processing and noise data processing.
The data to be processed refers to the raw data on which subsequent processing is to be performed. Optionally, the data to be processed may be user information (such as gender, age, occupation, etc.), website or web page clicking behavior (such as clicking time, number of clicks, frequency, etc.), user transaction data and behavior (such as payment product information, payment amount, payment mode, etc.), and the like.
Since the acquired data to be processed may include a lot of abnormal or redundant data, the data to be processed needs to be preprocessed after being obtained according to the actual requirement. Specifically, preprocessing the data to be processed includes at least one of missing value processing, outlier processing, duplicate removal processing, or noise data processing. Optionally, a distance-based method, a statistical principle method, or a density-based method may be used to perform outlier processing on the data to be processed, so as to remove the abnormal or redundant data it contains.
Furthermore, in order to ensure the cleanliness of the generated standard data set, missing value processing can be performed on the data to be processed after the outlier processing, so as to improve the accuracy and effectiveness of the standard data set. Optionally, a global constant filling method, an interpolation method, or a modeling method can be adopted to process the missing values; one of these can be selected according to the distribution, skewness, and missing value proportion of the data to be processed. The specific processing methods involved in preprocessing the data are not enumerated one by one here.
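A minimal sketch of this preprocessing step in Python, assuming pandas and NumPy; the mean-fill strategy, the 3-sigma outlier cut-off, and the synthetic sample data are illustrative assumptions rather than anything prescribed by the text:

```python
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Duplicate removal, missing-value filling (global mean as a stand-in
    for the filling methods above), and a simple 3-sigma outlier filter."""
    df = df.drop_duplicates()                       # duplicate removal processing
    df = df.fillna(df.mean(numeric_only=True))      # missing value processing
    numeric = df.select_dtypes(include=[np.number])
    z = (numeric - numeric.mean()) / numeric.std(ddof=0)
    return df[(z.abs() < 3.0).all(axis=1)]          # outlier processing

rng = np.random.default_rng(0)
raw = pd.DataFrame({"amount": rng.normal(100.0, 5.0, 200)})
raw.loc[0, "amount"] = np.nan      # a missing value
raw.loc[200] = 1000.0              # an obvious outlier
standard = preprocess(raw)         # outlier dropped, gap filled
```

In practice the filling method would instead be chosen per the distribution, skewness, and missing-value proportion discussed above.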
S20: and carrying out iterative cleaning on the standard data set by adopting an isolated forest algorithm to generate a target data set.
Because the preprocessing in step S10 can only remove abnormal data that is obvious or easy to handle, this step iteratively cleans the standard data set with an isolated forest algorithm to thoroughly remove the abnormal data remaining in it, further improving the cleanliness of the target data set and ensuring the accuracy of subsequent model training. The isolated forest (isolation forest) algorithm is a fast anomaly detection method with linear time complexity and high accuracy.
In a specific embodiment, the iterative cleaning of the standard dataset by using an isolated forest algorithm mainly comprises an iterative process of training the standard dataset (constructing trees for forests) and data screening the standard dataset. Specifically, training the standard dataset includes: constructing an initial isolated forest; and inputting the acquired standard data set into the initial isolated forest for training, and generating an isolated forest model to complete the training process of the standard data set. And then, using the isolated forest model to perform anomaly detection on the standard data set, namely inputting the standard data set into the isolated forest model generated by training to perform data screening, and outputting a normal data set with a part of abnormal data screened.
Furthermore, in order to ensure the cleanliness of the acquired target data set, the standard data set needs to be iteratively trained and screened: the normal data set output after the preliminary training and data screening is input again into the constructed initial isolated forest for another round of training and screening, until the number of output normal data sets meets the set requirement, at which point the target data set is generated. Optionally, the set requirement may be that the number of generated target data sets falls within a predetermined numerical range, or that it is equal to or smaller than a predetermined threshold value, or the like. The number of iterations is not particularly limited here. In practical applications, it depends on a trade-off between cleanliness of the data and information loss: the more iterations, the cleaner the data, but the more useful information is likely to be lost.
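The iterative train-and-screen loop described above can be sketched with scikit-learn's `IsolationForest` standing in for the isolated forest model of the text; the contamination rate, the count-based stopping rule, and the synthetic data are assumptions:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def iterative_clean(X, target_size, contamination=0.05, max_iter=10, seed=0):
    """Retrain an isolation forest on the surviving rows and drop the rows
    it flags as anomalous, until the remaining count reaches the 'set
    requirement' (here: at most target_size rows) or nothing is flagged."""
    X = np.asarray(X)
    for _ in range(max_iter):
        if len(X) <= target_size:
            break
        model = IsolationForest(contamination=contamination, random_state=seed)
        labels = model.fit_predict(X)      # +1 = normal, -1 = anomalous
        normal = X[labels == 1]
        if len(normal) == len(X):          # nothing flagged: stop early
            break
        X = normal
    return X

rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0.0, 1.0, size=(300, 2)),
                  rng.normal(8.0, 2.0, size=(15, 2))])   # 15 planted outliers
clean = iterative_clean(data, target_size=300)
```

The `max_iter` cap reflects the trade-off noted above: more iterations give cleaner data at the cost of losing useful information.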
S30: and extracting the characteristics of the target data set to obtain the basic characteristics of the target data set.
A basic feature refers to a set of feature data that reflects the nature of the target data set. For example, if the target data set is user information, its basic features may include name, gender, age, occupation, interest, etc.; if the target data set is web page clicking behavior, its basic features may include click time, number of clicks, frequency, etc. It will be appreciated that different types of target data sets have different corresponding basic features.
Specifically, data with characteristic expression is extracted from a target data set, and the data with characteristic expression is taken as basic characteristics of the target data set. Optionally, feature extraction of the target data set may be automatically implemented by using a feature extraction algorithm, so as to obtain basic features of the target data set. The feature extraction algorithm may be linear feature extraction (PCA) or nonlinear feature extraction, among others.
Preferably, in order to ensure that more comprehensive and accurate basic features can be extracted from the target data set, a pre-compiled feature extraction script can be obtained from a database of the server side, and then the corresponding feature extraction script is adopted to perform feature extraction on the target data set.
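As a small illustration of the linear feature-extraction option (PCA) mentioned above, assuming scikit-learn; the component count and the random data are arbitrary choices, not from the text:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
target_dataset = rng.normal(size=(100, 6))   # stand-in target data set

# Standardize, then project onto the top principal components to obtain
# a compact set of basic features.
scaled = StandardScaler().fit_transform(target_dataset)
pca = PCA(n_components=3)
basic_features = pca.fit_transform(scaled)   # shape (100, 3)
```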
S40: and determining the weight of each basic feature in the target data set according to a preset strategy.
Wherein the preset strategy refers to the method used to determine the weight of each basic feature in the target data set. The weight of each basic feature reflects its importance to the target data set. Optionally, the preset strategy may determine the weights by a subjective weighting method, by an objective weighting method, or by a combined weighting method. In one embodiment, to ensure the accuracy of the obtained weights, a combined weighting method, also called a subjective-objective combined weighting method, is used. The combination can be performed by multiplicative integration or by additive integration. For example, using additive integration, the weight of each basic feature in the target data set is determined by the formula w_i = a*a_i + (1-a)*b_i (0 <= a <= 1), where w_i denotes the combined weight of the i-th basic feature, and a_i and b_i denote its objective weight and subjective weight respectively. The objective weight is determined mainly from the amount of information contained in each basic feature; preferably, an entropy method may be employed to determine it. The subjective weight is determined by the decision maker according to the actual decision problem and his or her knowledge and experience.
Preferably, a binary comparison quantization principle and method may be employed to determine the subjective weight of each basic feature. It will be appreciated that when the decision maker has a preference among the different weighting methods, the coefficient a can be determined from the decision maker's preference information.
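The combined weighting above can be sketched as follows, pairing a textbook form of the entropy method (an assumption; the text does not spell out its formula) with hypothetical subjective weights, merged via w_i = a*a_i + (1-a)*b_i:

```python
import numpy as np

def entropy_weights(X):
    """Objective weights via the entropy method: features whose values
    are more dispersed (lower entropy) carry more information and
    therefore receive higher weight."""
    P = X / X.sum(axis=0)                   # column-normalize to proportions
    P = np.where(P == 0, 1e-12, P)          # avoid log(0)
    k = 1.0 / np.log(X.shape[0])
    e = -k * (P * np.log(P)).sum(axis=0)    # entropy of each feature
    d = 1.0 - e                             # degree of diversification
    return d / d.sum()

def combined_weights(objective, subjective, a=0.5):
    """Additive integration w_i = a*a_i + (1-a)*b_i, 0 <= a <= 1."""
    return a * np.asarray(objective) + (1 - a) * np.asarray(subjective)

X = np.abs(np.random.default_rng(3).normal(1.0, 0.3, size=(50, 4)))
obj = entropy_weights(X)
subj = np.array([0.4, 0.3, 0.2, 0.1])       # hypothetical expert weights
w = combined_weights(obj, subj, a=0.5)      # combined weight per feature
```

Since both weight vectors sum to one, any convex combination (0 <= a <= 1) again sums to one.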
S50: and determining the necessary characteristics of the target data set according to the weight of each basic characteristic in the target data set.
Specifically, after determining the weight of each basic feature in the target data set according to step S40, the weights of each basic feature may be ranked according to the order from large to small of the weights of each basic feature in the target data set; then, the basic features with the weights arranged in the first n bits are determined as necessary features of the target data set, and n can be determined in a self-defined manner according to the actual situation or the number of the basic features. For example: the basic features with weights ranked in the top 10 bits may be determined as the necessary features for the target dataset.
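A minimal sketch of this top-n selection; the feature names, weight values, and n are invented for illustration:

```python
# Rank basic features by their combined weight, descending, and keep the
# top n as the necessary features of the target data set.
weights = {"age": 0.31, "click_freq": 0.27, "gender": 0.18,
           "occupation": 0.14, "payment_mode": 0.10}
n = 3
necessary = sorted(weights, key=weights.get, reverse=True)[:n]
# necessary == ['age', 'click_freq', 'gender']
```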
S60: and adopting a GBDT characteristic combination algorithm to perform characteristic construction on each basic characteristic, and generating a cross combination characteristic.
Wherein GBDT is the gradient boosting decision tree. The GBDT feature combination algorithm refers to using the GBDT algorithm to construct combined features. The GBDT algorithm is an iterative decision tree algorithm composed of multiple decision trees. A cross combination feature is the feature vector output after feature construction is performed on each basic feature with the GBDT feature combination algorithm.
Specifically, firstly training a GBDT model according to the obtained basic features to construct a target GBDT model with N trees, then inputting the basic features into the constructed target GBDT model until each basic feature is distributed to leaf nodes of each tree of the target GBDT model, and finally combining the basic features corresponding to paths from root nodes to leaf nodes of each tree in the target GBDT model to generate cross combination features.
It can be understood that the cross combination features generated by performing feature construction on each basic feature with the GBDT feature combination algorithm are discrete data: each cross combination feature is a vector whose elements take the value 0 or 1, and each element corresponds to one leaf node of a tree in the target GBDT model. For example, a resulting cross combination feature may be {0, 1, 0}. Specifically, when a basic feature passes through a tree in the target GBDT model and falls on one of its leaf nodes, the element corresponding to that leaf node in the generated cross combination feature vector is 1, and the elements corresponding to the other leaf nodes of that tree are 0. The vector length of the cross combination feature equals the total number of leaf nodes over all trees in the target GBDT model.
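A hedged sketch of this step using scikit-learn's `GradientBoostingClassifier` as the GBDT model: `apply` returns the leaf index each sample reaches in every tree, and one-hot encoding those indices yields exactly the 0/1 vectors described above, with one element set per tree and total length equal to the number of leaves over all trees. The model size, labels, and data are assumptions:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))               # stand-in basic features
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # stand-in training target

gbdt = GradientBoostingClassifier(n_estimators=10, max_depth=3, random_state=0)
gbdt.fit(X, y)
leaves = gbdt.apply(X)[:, :, 0]             # (200, 10): leaf index per tree

# One-hot encode the leaf index of each tree; concatenated, each row has
# exactly one 1 per tree, matching the description in the text.
cross_features = np.hstack([
    (leaves[:, [j]] == np.unique(leaves[:, j])).astype(int)
    for j in range(leaves.shape[1])
])
```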
S70: and combining the cross combination features and the necessary features to obtain target feature data.
In order to embody the importance of the necessary features in the target data set, after the GBDT feature combination algorithm has been used to construct the cross combination features from the basic features, the necessary features determined in step S50 must be combined with the generated cross combination features to obtain the target feature data. Specifically, since a cross combination feature is a discrete feature vector whose values are 0 or 1, the necessary features are first converted into discrete necessary feature vectors by a preset encoding method before being combined with the cross combination features. The encoding mode can be One-Hot encoding or integer encoding. For example, if the necessary features include gender and age, the necessary feature vectors after One-Hot encoding are [1,0] and [0,1], respectively. The encoded necessary feature vectors are then combined with the cross combination features to obtain the target feature data. Optionally, the necessary feature vectors may be inserted at the head or tail position of the cross combination features.
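The final combination step can be illustrated with toy values; the categories, the example leaf vector, and the choice of head insertion are assumptions consistent with the One-Hot example above:

```python
# One-hot encode the necessary features, then prepend them to the
# cross combination feature vector to form the target feature data.
def one_hot(value, categories):
    return [1 if value == c else 0 for c in categories]

necessary_vec = (one_hot("female", ["female", "male"])      # gender -> [1, 0]
                 + one_hot("20-30", ["<20", "20-30", ">30"]))  # age band
cross_combined = [0, 1, 0, 0, 0, 1, 0]    # example GBDT leaf vector from S60
target_feature = necessary_vec + cross_combined  # inserted at the head
```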
In this embodiment, the data to be processed is obtained and preprocessed to generate a standard data set, wherein the preprocessing includes at least one of missing value processing, outlier processing, duplicate removal processing, or noise data processing; the standard data set is iteratively cleaned with an isolated forest algorithm to generate a target data set; features are extracted from the target data set to obtain its basic features; the weight of each basic feature in the target data set is determined according to a preset strategy; the necessary features of the target data set are determined according to those weights; feature construction is performed on each basic feature with a GBDT feature combination algorithm to generate cross combination features; and the cross combination features are combined with the necessary features to obtain the target feature data. The isolated forest algorithm and the GBDT feature combination algorithm are thus combined into a single data processing framework: after the data is cleaned with the isolated forest algorithm, the basic features of the target data set are constructed with the GBDT feature combination algorithm to generate cross combination features, and the necessary features are finally combined with the cross combination features to obtain the target feature data. This not only avoids a curse of dimensionality in the features of the obtained target feature data and reflects the importance of the necessary features within it, but also further improves the accuracy of data processing and ensures the accuracy of the generated target feature data.
In one embodiment, as shown in fig. 3, the method adopts an isolated forest algorithm to carry out iterative cleaning on the standard data set to generate the target data set, and specifically comprises the following steps:
S201: and acquiring a standard data set, inputting the standard data set into a preset isolated forest model for data screening, and obtaining normal detection data.
The isolated forest model refers to a model generated in advance by training on a large number of training sample data sets. A training sample data set is historical data stored in the database in advance for training the isolated forest model. In a specific embodiment, the training sample data set is of the same type as the standard data set.
Specifically, training on the training sample data to generate the isolated forest model includes: randomly selecting t sample points from the training sample data as a subsample and placing them in the root node of a tree; randomly designating a dimension (attribute) and randomly generating a cutting point p in the current node data, where p lies between the maximum and minimum values of the designated dimension in the current node data; using the cutting point to generate a hyperplane that divides the current node's data space into two subspaces, placing data whose value in the designated dimension is less than p on the left child of the current node and data whose value is greater than or equal to p on the right child; and recursively constructing new child nodes in this way until a child node contains only one data point (and can no longer be cut) or the child node has reached the defined height limit. Training is completed once t iTrees have been obtained, generating the isolated forest model.
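The iTree construction just described can be sketched directly; the sample points and the height limit are illustrative:

```python
import random

def build_itree(data, height, height_limit):
    """Recursive iTree construction as described above: pick a random
    dimension and a random cut point between its min and max, split the
    node's data, and recurse until a node holds at most one sample or
    the height limit is reached."""
    if len(data) <= 1 or height >= height_limit:
        return {"size": len(data)}                 # external (leaf) node
    dim = random.randrange(len(data[0]))           # random dimension
    lo = min(row[dim] for row in data)
    hi = max(row[dim] for row in data)
    if lo == hi:                                   # cannot cut this dimension
        return {"size": len(data)}
    p = random.uniform(lo, hi)                     # random cutting point
    return {"dim": dim, "split": p,
            "left": build_itree([r for r in data if r[dim] < p],
                                height + 1, height_limit),
            "right": build_itree([r for r in data if r[dim] >= p],
                                 height + 1, height_limit)}

random.seed(5)
tree = build_itree([[1.0, 2.0], [1.1, 2.2], [9.0, 9.0]], 0, height_limit=8)
```

An isolated forest would repeat this t times on random subsamples; anomalous points end up in leaves at shallow depths.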
Further, in this embodiment, before the isolated forest model is generated, its parameters also need to be set. The main parameters of the isolated forest model are max_features, n_estimators, and min_samples_leaf, where max_features is the maximum number of features the random forest allows a single decision tree to use, n_estimators is the number of subtrees to build, and min_samples_leaf is the minimum sample size of a leaf. Specifically, grid search cross validation can be performed on n_estimators and min_samples_leaf to find the optimal parameters for the specific situation; preferably, n_estimators uses the default value of 500 and min_samples_leaf is set to be greater than 50. Because the max_features setting is associated with the number of features subsequently used for feature construction by the GBDT feature combination algorithm, it strengthens the association between the GBDT model and the isolated forest model. In this embodiment, the max_features parameter of the isolated forest model and the max_features parameter of the GBDT model are set to the same size, preferably 100.
Specifically, the acquired standard data set is input into the preset isolated forest model, and each iTree of the model is traversed. Then, for each standard data item in the standard data set, the layer of each iTree on which it finally falls is computed, giving the average height value of that item in the isolated forest model. Finally, standard data whose average height value is lower than the set threshold is determined to be abnormal data, standard data whose average height value is equal to or higher than the set threshold is determined to be normal data, and that normal data is extracted to form the normal detection data. The set threshold can be set according to the actual situation of the standard data set.
S202: and judging whether the numerical value difference between the numerical value of the normal detection data and the set target value is larger than a preset threshold value or not.
The set target value refers to the preset quantity of target data to be generated. For example, the set target value may be 1000, 2000, 5000, etc., and can be customized according to the actual situation of the normal detection data. The preset threshold refers to a threshold for evaluating whether the number of normal detection data satisfies the set target value; for example, the preset threshold may be 0, 100, 200, etc. It should be noted that the preset threshold must be greater than or equal to 0, that is, the number of normal detection data must be greater than or equal to the set target value. Specifically, the normal detection data is acquired, the numerical difference between the number of normal detection data and the set target value is calculated, and it is judged whether this difference is greater than the preset threshold.
S203: and if the numerical value difference between the number value of the normal detection data and the set target value is larger than a preset threshold value, performing iterative training and data screening on the isolated forest model based on the normal detection data until the numerical value difference between the number value of the generated normal detection data and the set target value is equal to or smaller than the preset threshold value, and forming the obtained normal detection data into a target data set.
Specifically, if the numerical difference between the number of normal detection data and the set target value is greater than the preset threshold, iterative training and data screening are performed on the isolated forest model based on the normal detection data. That is, an isolated forest model is re-established by training on the generated normal detection data, and the normal detection data is input into the re-established model for data screening to obtain new normal detection data. This cycle of training an isolated forest model on the normal detection data and screening the data through it is repeated until the numerical difference between the number of generated normal detection data and the set target value is equal to or smaller than the preset threshold, at which point the obtained normal detection data forms the target data set.
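The retrain-and-rescreen cycle can be sketched as a loop that stops once the surviving count is within the preset threshold of the set target value. This is an illustrative sketch: `train` and `screen` stand in for the isolated-forest training and screening steps, and the no-progress guard is an assumption added here to guarantee termination.

```python
def iterative_clean(data, target_count, threshold, train, screen):
    """Iteratively retrain the model on the surviving data and re-screen
    until (len(normal) - target_count) <= threshold.

    train(data)  -> model trained on the given data
    screen(model, data) -> the rows judged normal by the model
    """
    normal = screen(train(data), data)
    while len(normal) - target_count > threshold:
        survivors = screen(train(normal), normal)   # retrain, then screen again
        if len(survivors) == len(normal):           # guard: nothing more removed
            break
        normal = survivors
    return normal
```

With each pass the model is rebuilt only from data that survived the previous screening, which is the iterative cleaning behaviour described above.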
In this embodiment, a standard data set is acquired and input into a preset isolated forest model for data screening to obtain normal detection data; whether the numerical difference between the number of normal detection data and the set target value is greater than the preset threshold is judged; if so, iterative training and data screening are performed on the isolated forest model based on the normal detection data until the numerical difference between the number of generated normal detection data and the set target value is equal to or smaller than the preset threshold, and the obtained normal detection data forms the target data set. The standard data set is thus cleaned by repeated iterations of the isolated forest algorithm, and the number of cleaning iterations is controlled according to the specific data conditions, further improving the cleanliness of the generated target data set.
In one embodiment, as shown in fig. 4, the standard data set is input into a preset isolated forest model for data screening to obtain normal detection data, which specifically includes the following steps:
s2011: and inputting the standard data set into the isolated forest model, traversing each random binary tree in the isolated forest model, and determining the average height value of each standard data in the standard data set in the isolated forest model.
Specifically, each standard data item in the standard data set is input into the isolated forest model and traverses each random binary tree in the model. Then, the height value (i.e. the layer) on which the standard data item finally falls is computed for each random binary tree in the isolated forest model. The height values of the standard data item over all random binary trees are averaged to obtain its average height value in the isolated forest model. For example, if the isolated forest model comprises 3 random binary trees and, after a standard data item is input into and traverses each tree, its height values on the trees are {6, 7, 8}, then its average height value in the isolated forest model is (6+7+8)/3 = 7.
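The average height value computation can be sketched as follows, assuming (as an illustration, not the patented format) that each tree is stored as a nested dict with `dim`/`cut`/`left`/`right` keys for internal nodes:

```python
def path_height(tree, row):
    """Depth of the node in which `row` finally falls in one tree
    (dict-based tree format assumed for illustration)."""
    h = 0
    while "dim" in tree:                 # descend until an external node
        tree = tree["left"] if row[tree["dim"]] < tree["cut"] else tree["right"]
        h += 1
    return h

def average_height(forest, row):
    """Average, over all trees, of the height at which `row` terminates."""
    return sum(path_height(t, row) for t in forest) / len(forest)
```

Records with a small average height are isolated quickly and are therefore flagged as abnormal, which matches the threshold comparison in step S2012.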
S2012: and determining standard data with average height value larger than a preset height threshold value as normal detection data.
The height threshold refers to a value used to screen abnormal data out of the standard data set. For example, the height threshold may be set to 3, 5, 7, etc., and the user may customize it based on the actual condition of the standard data set. Specifically, the average height value of each standard data item in the standard data set is compared with the preset height threshold one by one, standard data whose average height value is equal to or smaller than the preset height threshold is screened out, and standard data whose average height value is larger than the preset height threshold is determined as normal detection data.
In the embodiment, the average height value of each standard data in the standard data set in the isolated forest model is determined by inputting the standard data set into the isolated forest model and traversing each random binary tree in the isolated forest model; then, standard data with the average height value larger than a preset height threshold value is determined as normal detection data, so that the accuracy of the generated normal detection data is further improved.
In one embodiment, as shown in fig. 5, feature extraction is performed on a target data set to obtain basic features of the target data set, and specifically includes the following steps:
S301: and acquiring a characteristic parameter set, wherein the characteristic parameter set comprises M parameter identifiers, and M is a positive integer.
The feature parameter set refers to the preset set of features to be extracted from the target data set. Different types of target data sets have different feature parameter sets. For example, if the target data set is user information, its feature parameter set may include: name, gender, age, occupation, interests, etc.; if the target data set is web-page click behavior, its feature parameter set may include: click time, number of clicks, frequency, etc. Specifically, the feature parameter set includes M parameter identifications, where M is a positive integer. A parameter identification refers to an identification assigned to each characteristic parameter. For example, the parameter identification of the name may be name; of gender, gender; of age, age; of occupation, occupation; and of interest, interest.
S302: and acquiring a corresponding feature extraction script according to each parameter identifier.
The feature extraction script refers to text that can directly perform feature extraction on the target data set. In this embodiment, the feature extraction scripts are pre-compiled and stored in the server's database, so the corresponding feature extraction script can be obtained directly from the server's database according to the acquired parameter identification. For example: the corresponding feature extraction script &lt;name&gt; can be obtained from the server's database according to the parameter identification name; the script &lt;gender&gt; according to the parameter identification gender; the script &lt;age&gt; according to the parameter identification age; the script &lt;number&gt; according to the parameter identification number; and the script &lt;occupation&gt; according to the parameter identification occupation.
S303: and carrying out feature extraction on the target data set by adopting a feature extraction script to obtain basic features of the target data set.
A basic feature refers to a set of feature data reflecting the nature or attributes of the target data set. Specifically, since each feature extraction script has the function of directly extracting features from the target data set, the feature extraction scripts obtained in step S302 can directly perform feature extraction on the target data set to obtain its basic features.
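One way to picture steps S301–S303 is a registry that maps each parameter identification to an extraction callable standing in for its "script". The registry name, the identifiers, and the record field names below are illustrative assumptions, not the patented implementation:

```python
# Hypothetical registry: parameter identification -> feature extraction callable.
EXTRACTORS = {
    "name":   lambda rec: rec.get("name"),
    "gender": lambda rec: rec.get("gender"),
    "age":    lambda rec: rec.get("age"),
}

def extract_basic_features(dataset, parameter_ids):
    """Apply the script for each of the M parameter identifications to
    every record in the target data set, yielding its basic features."""
    return [{pid: EXTRACTORS[pid](rec) for pid in parameter_ids}
            for rec in dataset]
```

Looking each script up by its identification and applying it to every record mirrors obtaining the script per parameter identification (S302) and running it over the target data set (S303).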
In this embodiment, by acquiring a feature parameter set, the feature parameter set includes M parameter identifiers, where M is a positive integer; acquiring a corresponding feature extraction script according to each parameter identifier; extracting features of the target data set by adopting a feature extraction script to obtain basic features of the target data set; thereby enabling the basic features of the generated target data set to be more accurate and comprehensive.
In one embodiment, as shown in fig. 6, the GBDT feature combination algorithm is adopted to perform feature construction on each basic feature, so as to generate a cross combination feature, which specifically includes the following steps:
s601: sample characteristics are obtained, training is carried out on the sample characteristics, and a target GBDT model is generated.
Sample features refer to the features used in training to generate the target GBDT model. In a specific embodiment, the sample features are stored in advance in the server's database, and when feature construction needs to be performed on the basic features, the sample features can be obtained directly from the server's database. The sample features are of the same type as the basic features. The target GBDT model is constructed by training decision trees on the sample features. Specifically, the obtained sample features are input into preset decision trees for training to construct a target GBDT model with N trees.
Further, in this embodiment, when the sample features are trained to generate the target GBDT model, the parameters of the target GBDT model also need to be set. These parameters mainly comprise max_features, learning_rate, and n_estimators. Both learning_rate and n_estimators can be tuned by grid-search cross-validation to find the optimal parameter combination; in this embodiment, learning_rate is set to 0.01 and n_estimators to 500. Because the setting of max_features is associated with max_features in the earlier isolated forest model, it strengthens the association between the target GBDT model and the isolated forest model. In this embodiment, the max_features parameter of the target GBDT model and that of the isolated forest model are set to the same size. When the max_features parameter of the target GBDT model is set to be consistent with that of the isolated forest model, the two originally independent models produce a complementary effect, and the target GBDT model constructs the combined features according to the same feature-processing logic as in the isolated forest model. Preferably, the max_features parameter of the target GBDT model is set to 100.
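The grid-search cross-validation mentioned above can be sketched with scikit-learn. The synthetic data, the small parameter grid, and the fixed max_features value below are illustrative assumptions chosen so the sketch runs quickly; they are not the values of the embodiment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative synthetic sample features and labels.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Search learning_rate and n_estimators; keep max_features fixed so it can
# be shared with the isolated forest model, as the embodiment describes.
param_grid = {"learning_rate": [0.01, 0.1], "n_estimators": [20, 50]}
gbdt = GradientBoostingClassifier(max_features=5, random_state=0)
search = GridSearchCV(gbdt, param_grid, cv=3)
search.fit(X, y)
best = search.best_params_   # optimal combination found by cross-validation
```

In practice the grid would cover values around the embodiment's preferred settings (learning_rate 0.01, n_estimators 500) rather than this reduced toy grid.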
S602: inputting the basic features into a target GBDT model for feature construction, and generating cross combination features.
Specifically, the basic features are input into the constructed target GBDT model to generate the cross-combination features. Illustratively, suppose the target GBDT model is composed of two trees, the first with 3 leaf nodes and the second with 2 leaf nodes. If an input basic feature x finally falls on the second leaf node of the first tree and on the first leaf node of the second tree, then after feature construction by the target GBDT model the resulting cross-combination feature is [0,1,0,1,0], where the first three bits of the vector correspond to the 3 leaf nodes of the first tree and the last two bits to the 2 leaf nodes of the second tree.
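This leaf-index one-hot construction can be reproduced with scikit-learn, using `GradientBoostingClassifier.apply` to obtain the leaf each sample falls into per tree and then one-hot encoding those indices. The synthetic data and model sizes below are illustrative assumptions, not the embodiment's configuration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

gbdt = GradientBoostingClassifier(n_estimators=10, max_depth=2, random_state=0)
gbdt.fit(X, y)

# apply() returns, for each sample, the index of the leaf node it falls
# into in each tree (shape: n_samples x n_trees after dropping the class axis).
leaves = gbdt.apply(X)[:, :, 0]

# One-hot encode the leaf indices: one block of bits per tree, exactly the
# vector layout of the [0,1,0,1,0] example above.
cross_features = OneHotEncoder().fit_transform(leaves).toarray()
```

Each row of `cross_features` has exactly one 1 per tree, marking which leaf of that tree the sample reached.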
In this embodiment, sample features are acquired and trained on a preset GBDT model to construct a target GBDT model with N trees, where N is a positive integer; the basic features are then input into the target GBDT model for feature construction to generate the cross-combination features, thereby improving the accuracy and diversity of the generated cross-combination features.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
In one embodiment, a data processing apparatus is provided, where the data processing apparatus corresponds to the data processing method in the above embodiment one by one. As shown in fig. 7, the data processing apparatus includes a preprocessing module 10, an iterative cleaning module 20, a feature extraction module 30, a weight determination module 40, a necessary feature determination module 50, a feature construction module 60, and a combination module 70. The functional modules are described in detail as follows:
the preprocessing module 10 is configured to obtain data to be processed, and preprocess the data to be processed to obtain a standard data set, where preprocessing includes at least one of missing value processing, outlier processing, deduplication processing, or noise data processing;
the iterative cleaning module 20 is used for performing iterative cleaning on the standard data set by adopting an isolated forest algorithm to generate a target data set;
the feature extraction module 30 is configured to perform feature extraction on the target data set to obtain basic features of the target data set;
A weight determining module 40, configured to determine a weight of each basic feature in the target data set according to a preset policy;
a necessary feature determining module 50, configured to determine a necessary feature of the target data set according to the weight of each basic feature in the target data set;
a feature construction module 60, configured to perform feature construction on each basic feature by using GBDT feature combination algorithm, so as to generate a cross combination feature;
and the combination module 70 is configured to combine the cross-combination feature and the necessary feature to obtain target feature data.
Preferably, as shown in fig. 8, the iterative cleaning module 20 includes:
the data screening unit 201 is configured to obtain a standard data set, input the standard data set into a preset isolated forest model, and perform data screening to obtain normal detection data;
a judging unit 202 for judging whether the numerical difference between the number value of the normal detection data and the set target value is greater than a preset threshold;
and the target data set generating unit 203 is configured to perform iterative training and data screening on the isolated forest model based on the normal detection data when the numerical difference between the number value of the normal detection data and the set target value is greater than the preset threshold value, until the numerical difference between the number value of the generated normal detection data and the set target value is equal to or less than the preset threshold value, and then form the obtained normal detection data into the target data set.
Preferably, as shown in fig. 9, the data filtering unit 201 includes:
a traversal subunit 2011, configured to input the standard data set into an isolated forest model, traverse each random binary tree in the isolated forest model, and determine an average height value of each standard data in the standard data set in the isolated forest model;
the normal detection data determining subunit 2012 is configured to determine, as normal detection data, standard data having an average height value greater than a preset height threshold.
Preferably, the feature extraction module 30 includes:
the characteristic parameter set acquisition unit is used for acquiring a characteristic parameter set, wherein the characteristic parameter set comprises M parameter identifiers, and M is a positive integer;
the feature extraction script acquisition unit is used for acquiring a corresponding feature extraction script according to each parameter identifier;
and the feature extraction unit is used for carrying out feature extraction on the target data set by adopting a feature extraction script to obtain basic features of the target data set.
Preferably, the feature construction module 60 includes:
the sample feature acquisition unit is used for acquiring sample features, training the sample features and generating a target GBDT model;
and the feature construction unit is used for inputting the basic features into the target GBDT model for feature construction and generating cross combination features.
For specific limitations of the data processing apparatus, reference may be made to the above limitations of the data processing method, and no further description is given here. Each of the modules in the above-described data processing apparatus may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a server, whose internal structure may be as shown in fig. 10. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing the data used in the data processing method of the above embodiments. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by the processor to implement a data processing method.
In one embodiment, a computer device is provided that includes a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the data processing method of the above embodiments when the computer program is executed by the processor.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which, when executed by a processor, implements the data processing method in the above embodiments.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by a computer program instructing relevant hardware, the computer program being stored on a non-volatile computer-readable storage medium and, when executed, performing the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions.
The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and are intended to be included in the scope of the present invention.

Claims (8)

1. A method of data processing, comprising:
the method comprises the steps of obtaining data to be processed, and preprocessing the data to be processed to obtain a standard data set, wherein the preprocessing comprises at least one of missing value processing, outlier processing, duplicate removal processing or noise data processing;
Performing iterative cleaning on the standard data set by adopting an isolated forest algorithm to generate a target data set;
extracting features of the target data set to obtain basic features of the target data set; the basic feature refers to a set of feature data reflecting the nature of the target dataset;
determining the weight of each basic feature in the target data set according to a preset strategy;
determining necessary features of the target data set according to the weight of each basic feature in the target data set; the necessary features refer to basic features of which the weights are arranged in the first n bits;
performing feature construction on each basic feature by adopting a GBDT feature combination algorithm to generate cross combination features;
combining the cross combination features and the necessary features to obtain target feature data;
the iterative cleaning of the standard data set by adopting an isolated forest algorithm to generate a target data set comprises the following steps:
acquiring a standard data set, and inputting the standard data set into a preset isolated forest model for data screening to obtain normal detection data;
judging whether the numerical value difference between the numerical value of the normal detection data and the set target value is larger than a preset threshold value or not;
And if the numerical value difference between the number value of the normal detection data and the set target value is larger than the preset threshold value, performing iterative training and data screening on the isolated forest model based on the normal detection data until the generated numerical value difference between the number value of the normal detection data and the set target value is equal to or smaller than the preset threshold value, and forming the obtained normal detection data into a target data set.
2. The data processing method as claimed in claim 1, wherein said inputting the standard data set into the isolated forest model for data screening to obtain normal detection data includes:
inputting the standard data set into the isolated forest model, traversing each random binary tree in the isolated forest model, and determining the average height value of each standard data in the standard data set in the isolated forest model;
and determining the standard data with the average height value larger than a preset height threshold value as normal detection data.
3. The data processing method according to claim 1, wherein the feature extraction of the target data set to obtain the basic feature of the target data set includes:
Acquiring a characteristic parameter set, wherein the characteristic parameter set comprises M parameter identifiers, and M is a positive integer;
acquiring a corresponding feature extraction script according to each parameter identifier;
and carrying out feature extraction on the target data set by adopting the feature extraction script to obtain basic features of the target data set.
4. The data processing method of claim 1, wherein the feature constructing each of the basic features using GBDT feature combining algorithm to generate cross-combined features includes:
acquiring sample characteristics, training the sample characteristics, and generating a target GBDT model;
inputting the basic features into the target GBDT model for feature construction, and generating cross combination features.
5. A data processing apparatus, comprising:
the preprocessing module is used for acquiring data to be processed, preprocessing the data to be processed to obtain a standard data set, wherein the preprocessing comprises at least one of missing value processing, outlier processing, duplicate removal processing or noise data processing;
the iterative cleaning module is used for performing iterative cleaning on the standard data set by adopting an isolated forest algorithm to generate a target data set;
The feature extraction module is used for extracting features of the target data set to obtain basic features of the target data set; the basic feature refers to a set of feature data reflecting the nature of the target dataset;
the weight determining module is used for determining the weight of each basic feature in the target data set according to a preset strategy;
a necessary feature determining module, configured to determine a necessary feature of the target data set according to a weight of each of the basic features in the target data set; the necessary features refer to basic features of which the weights are arranged in the first n bits;
the feature construction module is used for carrying out feature construction on each basic feature by adopting a GBDT feature combination algorithm to generate cross combination features;
the combination module is used for combining the cross combination features and the necessary features to obtain target feature data;
the iterative cleaning module comprises:
the data screening unit is used for acquiring a standard data set, inputting the standard data set into a preset isolated forest model for data screening to obtain normal detection data;
a judging unit for judging whether the numerical value difference between the numerical value of the normal detection data and the set target value is larger than a preset threshold value;
And the target data set generation unit is used for carrying out iterative training and data screening on the isolated forest model based on the normal detection data when the numerical value difference between the number value of the normal detection data and the set target value is larger than the preset threshold value until the numerical value difference between the number value of the generated normal detection data and the set target value is equal to or smaller than the preset threshold value, and then forming the obtained normal detection data into a target data set.
6. The data processing apparatus of claim 5, wherein the data screening unit comprises:
a traversing subunit, configured to input the standard data set into the isolated forest model, traverse each random binary tree in the isolated forest model, and determine an average height value of each standard data in the standard data set in the isolated forest model;
and the normal detection data determining subunit is used for determining the standard data with the average height value larger than a preset height threshold value as normal detection data.
7. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the data processing method according to any of claims 1 to 4 when executing the computer program.
8. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the data processing method according to any one of claims 1 to 4.
CN201910965010.1A 2019-10-11 2019-10-11 Data processing method, device, computer equipment and storage medium Active CN110880014B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910965010.1A CN110880014B (en) 2019-10-11 2019-10-11 Data processing method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910965010.1A CN110880014B (en) 2019-10-11 2019-10-11 Data processing method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110880014A CN110880014A (en) 2020-03-13
CN110880014B true CN110880014B (en) 2023-09-05

Family

ID=69728069

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910965010.1A Active CN110880014B (en) 2019-10-11 2019-10-11 Data processing method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110880014B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111626573B (en) * 2020-05-11 2024-03-01 新奥新智科技有限公司 Target data determining method and device, readable medium and electronic equipment
CN112035453B (en) * 2020-08-27 2024-03-12 平安科技(深圳)有限公司 Recommendation method, device and storage medium based on GBDT high-order feature combination
CN112118259B (en) * 2020-09-17 2022-04-15 四川长虹电器股份有限公司 Unauthorized vulnerability detection method based on classification model of lifting tree
CN113128590B (en) * 2021-04-19 2022-03-15 浙江省水文管理中心 Equipment data optimization and fusion method
CN116244659B (en) * 2023-05-06 2023-07-28 杭州云信智策科技有限公司 Data processing method, device, equipment and medium for identifying abnormal equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108363714A * 2017-12-21 2018-08-03 北京至信普林科技有限公司 Method and system for ensemble machine learning facilitating use by data analysts
CN108960436A * 2018-07-09 2018-12-07 上海应用技术大学 Feature selection approach
CN109117956A * 2018-07-05 2019-01-01 浙江大学 Method for determining an optimal feature subset
CN110110757A * 2019-04-12 2019-08-09 国电南瑞科技股份有限公司 Random-forest-based method and device for screening suspicious power transmission and transformation data
KR20190100518A * 2018-02-07 2019-08-29 계명대학교 산학협력단 Method and system for detection of pedestrian crossing using a method of light weighted random forest classification by a soft target learning method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160358099A1 (en) * 2015-06-04 2016-12-08 The Boeing Company Advanced analytical infrastructure for machine learning
CN110309840B (en) * 2018-03-27 2023-08-11 创新先进技术有限公司 Risk transaction identification method, risk transaction identification device, server and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Feature selection algorithm based on random forest in a Hadoop environment; Zhang Xin; Wu Haitao; Cao Xuehong; Computer Technology and Development (计算机技术与发展), (07), pp. 94-98, 104 *

Similar Documents

Publication Publication Date Title
CN110880014B (en) Data processing method, device, computer equipment and storage medium
JP7343568B2 (en) Identifying and applying hyperparameters for machine learning
WO2021012783A1 (en) Insurance policy underwriting model training method employing big data, and underwriting risk assessment method
CN110956224B (en) Evaluation model generation and evaluation data processing method, device, equipment and medium
US9633311B2 (en) Decision tree learning
KR20200015048A (en) Method and apparatus for selecting model of machine learning based on meta-learning
CN112395500B (en) Content data recommendation method, device, computer equipment and storage medium
CN112016318B (en) Triage information recommendation method, device, equipment and medium based on interpretation model
US8954910B1 (en) Device mismatch contribution computation with nonlinear effects
CN109325118B (en) Unbalanced sample data preprocessing method and device and computer equipment
CN110909868A (en) Node representation method and device based on graph neural network model
CN112817563B (en) Target attribute configuration information determining method, computer device, and storage medium
CN112199344A (en) Log classification method and device
US20210357553A1 (en) Apparatus and method for option data object performance prediction and modeling
CN116909534B (en) Operator flow generating method, operator flow generating device and storage medium
CN109492844B (en) Method and device for generating business strategy
CN112348226A (en) Prediction data generation method, system, computer device and storage medium
CN111105267A (en) Recommendation method and device based on ALS algorithm and readable medium
CN116050949B (en) Method and system for generating and diagnosing quantization index system based on coordinate coding
CN112132260B (en) Training method, calling method, device and storage medium of neural network model
CN114676167B (en) User persistence model training method, user persistence prediction method and device
US20210311941A1 (en) Method and device for determining social rank of node in social network
CN115408186A (en) Root cause positioning method and device, computer equipment and storage medium
CN117633316A (en) Attribution analysis method, attribution analysis device, computer equipment and storage medium
CN113947417A (en) Training method and device of age identification model and age identification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant