CN110880014A

CN110880014A - Data processing method and device, computer equipment and storage medium

Info

Publication number: CN110880014A
Application number: CN201910965010.1A
Authority: CN
Inventors: 秦文力; 张密; 韩丙卫
Original assignee: Ping An Property and Casualty Insurance Company of China Ltd
Current assignee: Ping An Property and Casualty Insurance Company of China Ltd
Priority date: 2019-10-11
Filing date: 2019-10-11
Publication date: 2020-03-13
Anticipated expiration: 2039-10-11
Also published as: CN110880014B

Abstract

The invention discloses a data processing method and device, computer equipment and a storage medium. The method comprises: acquiring data to be processed and preprocessing it to generate a standard data set; performing iterative cleaning on the standard data set by adopting an isolated forest algorithm to generate a target data set; performing feature extraction on the target data set to obtain the basic features of the target data set; determining the weight of each basic feature in the target data set according to a preset strategy; determining the necessary features of the target data set according to the weight of each basic feature in the target data set; performing feature construction on each basic feature by adopting a GBDT feature combination algorithm to generate cross combination features; and combining the cross combination features and the necessary features to obtain target feature data. The method thereby avoids the curse of dimensionality in the resulting target feature data and further improves the accuracy of data processing.

Description

Data processing method and device, computer equipment and storage medium

Technical Field

The present invention relates to the field of data processing, and in particular, to a data processing method and apparatus, a computer device, and a storage medium.

Background

With the development of big data technology, applications that use various kinds of data for prediction, evaluation or feedback are becoming common. Such applications typically collect a large amount of sample data and then perform prediction, evaluation or feedback on it through machine learning or clustering methods. This process places high requirements on the data source, such as its quantity, accuracy and balance, so the acquired data must be processed in advance to better ensure the accuracy of subsequent prediction, evaluation or feedback. At present, most traditional data processing methods stop at data standardization or feature importance judgment, and cannot process the data efficiently and accurately.

Disclosure of Invention

The embodiment of the invention provides a data processing method, a data processing device, computer equipment and a storage medium, and aims to solve the problem of low accuracy of data processing.

A method of data processing, comprising:

acquiring data to be processed, and preprocessing the data to be processed to obtain a standard data set, wherein the preprocessing comprises at least one of missing value processing, abnormal value processing, duplicate removal processing or noise data processing;

performing iterative cleaning on the standard data set by adopting an isolated forest algorithm to generate a target data set;

performing feature extraction on the target data set to obtain basic features of the target data set;

determining the weight of each basic feature in the target data set according to a preset strategy;

determining necessary features of the target data set according to the weight of each basic feature in the target data set;

performing feature construction on each basic feature by adopting a GBDT feature combination algorithm to generate cross combination features;

and combining the cross combination features and the necessary features to obtain target feature data.

A data processing apparatus comprising:

the preprocessing module is used for acquiring data to be processed and preprocessing the data to be processed to obtain a standard data set, wherein the preprocessing comprises at least one of missing value processing, abnormal value processing, duplicate removal processing or noise data processing;

the iterative cleaning module is used for performing iterative cleaning on the standard data set by adopting an isolated forest algorithm to generate a target data set;

the feature extraction module is used for performing feature extraction on the target data set to obtain the basic features of the target data set;

the weight determining module is used for determining the weight of each basic feature in the target data set according to a preset strategy;

the necessary feature determining module is used for determining the necessary features of the target data set according to the weight of each basic feature in the target data set;

the feature construction module is used for performing feature construction on each basic feature by adopting a GBDT feature combination algorithm to generate cross combination features;

and the combination module is used for combining the cross combination features and the necessary features to obtain target feature data.

A computer device comprising a memory, a processor and a computer program stored in said memory and executable on said processor, said processor implementing the above data processing method when executing said computer program.

A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the above-mentioned data processing method.

According to the data processing method and device, the computer equipment and the storage medium, data to be processed is acquired and preprocessed to generate a standard data set, wherein the preprocessing comprises at least one of missing value processing, abnormal value processing, duplicate removal processing or noise data processing; iterative cleaning is performed on the standard data set by adopting an isolated forest algorithm to generate a target data set; feature extraction is performed on the target data set to obtain its basic features; the weight of each basic feature in the target data set is determined according to a preset strategy; the necessary features of the target data set are determined according to those weights; feature construction is performed on each basic feature by adopting a GBDT feature combination algorithm to generate cross combination features; and the cross combination features and the necessary features are combined to obtain target feature data. The isolated forest algorithm and the GBDT feature combination algorithm are thus combined into one data processing framework: after the data is cleaned by the isolated forest algorithm, the GBDT feature combination algorithm constructs cross combination features from the basic features of the target data set, and the necessary features are finally combined with the cross combination features to obtain the target feature data. This not only avoids the curse of dimensionality in the resulting target feature data and reflects the importance of the necessary features within it, but also further improves the accuracy of data processing and ensures the accuracy of the generated target feature data.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without inventive labor.

FIG. 1 is a diagram illustrating an application environment of a data processing method according to an embodiment of the present invention;

FIG. 2 is a diagram of an exemplary data processing method according to an embodiment of the present invention;

FIG. 3 is a diagram of another example of a data processing method according to an embodiment of the present invention;

FIG. 4 is a diagram of another example of a data processing method according to an embodiment of the present invention;

FIG. 5 is a diagram of another example of a data processing method according to an embodiment of the present invention;

FIG. 6 is a diagram of another example of a data processing method according to an embodiment of the present invention;

FIG. 7 is a functional block diagram of a data processing apparatus according to an embodiment of the present invention;

FIG. 8 is another functional block diagram of a data processing apparatus according to an embodiment of the present invention;

FIG. 9 is another functional block diagram of a data processing apparatus according to an embodiment of the present invention;

FIG. 10 is a schematic diagram of a computer device according to an embodiment of the invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The data processing method provided by the embodiment of the invention can be applied to the application environment shown in fig. 1. Specifically, the data processing method is applied to a data processing system that includes a client and a server as shown in fig. 1; the client and the server communicate with each other through a network, so as to solve the problem of low efficiency of data processing. The client, also called the user side, refers to a program that corresponds to the server and provides local services to the user. The client may be installed on, but is not limited to, various personal computers, laptops, smartphones, tablets, and portable wearable devices. The server can be implemented by an independent server or by a server cluster composed of a plurality of servers.

In an embodiment, as shown in fig. 2, a data processing method is provided, which is described by taking the application of the method to the server in fig. 1 as an example, and includes the following steps:

S10: Acquiring data to be processed, and preprocessing the data to be processed to obtain a standard data set, wherein the preprocessing comprises at least one of missing value processing, abnormal value processing, duplicate removal processing or noise data processing.

The data to be processed refers to the data that is awaiting data processing. Optionally, the data to be processed may be user information (such as gender, age, occupation, etc.), website or web page click behavior (such as click time, count, frequency, etc.), user transaction data and behavior (such as information on the purchased product, payment amount, payment method, etc.), and so on.

The acquired data to be processed may include abnormal or redundant data. Therefore, after the corresponding data to be processed is acquired according to actual requirements, it first needs to be preprocessed. Specifically, preprocessing the data to be processed includes at least one of missing value processing, abnormal value processing, deduplication processing, or noise data processing. Optionally, abnormal value processing may be performed on the abnormal or redundant data in the data to be processed by a distance-based method, a principle method, a density-based method or other methods, so as to eliminate the abnormal or redundant data in the data to be processed.

Furthermore, to ensure the cleanliness of the generated standard data set, missing value processing can be performed on the data to be processed after abnormal value processing, so as to improve the accuracy and validity of the standard data set. Optionally, missing values may be handled by a global constant filling method, an interpolation method, a modeling method, or the like; one of these methods can be selected according to the distribution of the data to be processed, its skewness, the proportion of missing values, and so on. The specific methods involved in preprocessing the data are not enumerated here.
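As a rough illustration of step S10, the following Python sketch chains deduplication, a simple distance-based abnormal value filter, and missing value filling, in the order described above. The function name `preprocess`, the median-distance rule, the single `max_distance` shared by all columns, and the sample rows are all assumptions made for illustration; the source does not prescribe any of them.

```python
from statistics import mean, median

def preprocess(rows, max_distance, fill_value=None):
    """Clean a list of equal-length tuples whose items are numbers or None."""
    # Step 1: deduplication, keeping the first occurrence of each row.
    unique = list(dict.fromkeys(rows))
    n_cols = len(unique[0])

    # Step 2: distance-based abnormal value processing (illustrative rule):
    # drop a row if any known value lies farther than max_distance
    # from the median of its column.
    medians = [median(r[c] for r in unique if r[c] is not None)
               for c in range(n_cols)]
    kept = [r for r in unique
            if all(v is None or abs(v - medians[c]) <= max_distance
                   for c, v in enumerate(r))]

    # Step 3: missing value processing. Fill with a global constant if one
    # is given, otherwise with the mean of the remaining column values.
    means = [mean(r[c] for r in kept if r[c] is not None)
             for c in range(n_cols)]
    return [tuple((fill_value if fill_value is not None else means[c])
                  if v is None else v
                  for c, v in enumerate(r))
            for r in kept]

# Hypothetical (age, payment amount) records.
raw = [(25, 100.0), (25, 100.0), (30, None), (35, 120.0), (32, 9000.0)]
standard = preprocess(raw, max_distance=50)
print(standard)  # [(25, 100.0), (30, 110.0), (35, 120.0)]
```

In practice each column would usually need its own distance scale; the single threshold here only keeps the sketch short.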

S20: Performing iterative cleaning on the standard data set by adopting an isolated forest algorithm to generate a target data set.

Preprocessing the data to be processed according to step S10 removes only obvious or easily handled abnormal data. Therefore, to further improve the cleanliness of the target data set and ensure the accuracy of subsequent model training, this step performs iterative cleaning on the standard data set by adopting an isolated forest algorithm, so as to thoroughly clean the abnormal data contained in the standard data set. The isolated forest (Isolation Forest) algorithm is a fast anomaly detection method with linear time complexity and high accuracy.

In a specific embodiment, the iterative cleaning of the standard data set by the isolated forest algorithm mainly consists of an iterative process of training on the standard data set (building the trees of the forest) and performing data screening on the standard data set. Specifically, training on the standard data set includes: constructing an initial isolated forest, and then inputting the acquired standard data set into the initial isolated forest for training to generate an isolated forest model. Anomaly detection is then performed on the standard data set with the isolated forest model, that is, the standard data set is input into the trained isolated forest model for data screening, and a normal data set from which part of the abnormal data has been screened out is output.

Further, to ensure the cleanliness of the acquired target data set, iterative training and iterative screening need to be performed on the standard data set. That is, the normal data set output after the initial training and data screening is input into the initial isolated forest again for iterative training and iterative screening, until the number of data items in the output normal data set meets the set requirement, at which point the target data set is generated. Optionally, the set requirement may be that the number of data items in the generated target data set falls within a preset numerical range, or is equal to or less than a preset critical value, or the like. The number of iterations is not specifically limited here. In practical applications, the number of iterations is a tradeoff between data cleanliness and information loss: the more iterations, the cleaner the data, but the more useful information is likely to be lost.

S30: Performing feature extraction on the target data set to obtain the basic features of the target data set.

Wherein, the basic feature refers to a set of feature data capable of reflecting the property of the target data set. For example: if the target data set is user information, the basic features of the target data set may include: name, gender, age, occupation or interest, etc.; if the target data set is a webpage clicking behavior, the basic features of the target data set may include: time, number, or frequency of clicks, etc. It will be appreciated that the different types of target data sets correspond to different base characteristics.

Specifically, data that is expressive of features is extracted from the target data set and used as the basic features of the target data set. Optionally, feature extraction on the target data set may be carried out automatically by a feature extraction algorithm to obtain the basic features of the target data set. The feature extraction algorithm may be a linear feature extraction method (such as PCA) or a nonlinear feature extraction method, among others.

Preferably, in order to ensure that more comprehensive and accurate basic features can be extracted from the target data set, a pre-compiled feature extraction script can be obtained from a database of the server, and then the corresponding feature extraction script is adopted to perform feature extraction on the target data set.

S40: Determining the weight of each basic feature in the target data set according to a preset strategy.

The preset strategy refers to the method used to determine the weight of each basic feature in the target data set. The weight of a basic feature in the target data set reflects the importance of that basic feature to the target data set. Optionally, the preset strategy may determine the weight of each basic feature in the target data set by a subjective weighting method, by an objective weighting method, or by a combined weighting method, among others. In one embodiment, to ensure the accuracy of the obtained weights, a combined weighting method is used to determine the weight of each basic feature in the target data set. The combined weighting method, also called the subjective and objective combined weighting method, can be a "multiplication" integration method or an "addition" integration method. For example, the weight of each basic feature in the target data set is determined by the "multiplication" integration method, whose formula is $w_i = a\,a_i + (1-a)\,b_i$ with $0 \le a \le 1$, where $w_i$ denotes the combined weight of the i-th basic feature, and $a_i$ and $b_i$ respectively denote the objective weight and the subjective weight of the i-th basic feature. The objective weight is determined from the amount of information contained in each basic feature; preferably, an entropy method may be used to determine the objective weight of each basic feature. The subjective weight is the weight that a decision maker reasonably assigns to each basic feature according to the actual decision problem and the decision maker's own knowledge and experience; preferably, a binary comparison quantization principle and method can be used to determine the subjective weight of each basic feature. It is understood that when the decision maker has a preference between the weighting methods, a can be determined according to the decision maker's preference information.
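To make the weighting in S40 concrete, the sketch below computes objective weights with one common formulation of the entropy method and then blends them with subjective weights using the formula given in the text, w_i = a·a_i + (1 − a)·b_i, which is a convex combination of the two weights. The entropy formulation (weights proportional to 1 − e_j), the function names, the sample columns and the subjective weights are assumptions for illustration; the source names the entropy method but does not spell out a variant.

```python
import math

def entropy_weights(columns):
    """Objective weights from the entropy method.

    columns maps a feature name to a list of non-negative values.
    Assumes at least one column is non-constant, otherwise the
    normalising sum below would be zero.
    """
    divergence = {}
    for name, values in columns.items():
        total = sum(values)
        # Proportions p_ij; zero entries contribute 0 * ln(0) := 0.
        probs = [v / total for v in values if v > 0]
        entropy = -sum(p * math.log(p) for p in probs) / math.log(len(values))
        divergence[name] = 1.0 - entropy  # more dispersion, more information
    norm = sum(divergence.values())
    return {name: d / norm for name, d in divergence.items()}

def combine_weights(objective, subjective, a):
    """Combined weight w_i = a * a_i + (1 - a) * b_i with 0 <= a <= 1."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must lie in [0, 1]")
    return {k: a * objective[k] + (1.0 - a) * subjective[k] for k in objective}

# Hypothetical feature columns and decision-maker weights.
columns = {"age": [20, 30, 50], "clicks": [1, 1, 98]}
objective = entropy_weights(columns)
subjective = {"age": 0.7, "clicks": 0.3}
combined = combine_weights(objective, subjective, a=0.6)
print(combined)
```

Because both input weight sets sum to 1, the combined weights also sum to 1 for any a in [0, 1].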

S50: Determining the necessary features of the target data set according to the weight of each basic feature in the target data set.

Specifically, after the weight of each basic feature in the target data set is determined according to step S40, the basic features may be sorted by weight from largest to smallest; the basic features ranked in the top n are then determined as the necessary features of the target data set, where n can be set as required according to the actual situation or the number of basic features. For example, the basic features whose weights rank in the top 10 can be determined as the necessary features of the target data set.
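The selection in S50 amounts to a sort followed by a cut at n. A minimal sketch, with a hypothetical function name and made-up weights:

```python
def select_necessary_features(weights, n):
    """Return the names of the n basic features with the largest weights."""
    ranked = sorted(weights, key=weights.get, reverse=True)
    return ranked[:n]

# Hypothetical combined weights for four basic features.
weights = {"age": 0.35, "gender": 0.10, "occupation": 0.30, "click_frequency": 0.25}
necessary = select_necessary_features(weights, n=2)
print(necessary)  # ['age', 'occupation']
```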

S60: Performing feature construction on each basic feature by adopting a GBDT feature combination algorithm to generate cross combination features.

GBDT stands for gradient boosting decision tree. The GBDT feature combination algorithm refers to the use of the GBDT algorithm to construct combination features. The GBDT algorithm is an iterative decision tree algorithm composed of multiple decision trees. The cross combination feature refers to the feature vector output after feature construction is performed on each basic feature by the GBDT feature combination algorithm.

Specifically, a GBDT model is trained according to the acquired basic features to construct a target GBDT model with N trees, then the basic features are input into the constructed target GBDT model until each basic feature is distributed to a leaf node of each tree of the target GBDT model, and finally the basic features corresponding to a path from a root node to the leaf node of each tree in the target GBDT model are combined to generate cross combination features.

It can be understood that, after the GBDT feature combination algorithm is used to perform feature construction on each basic feature, the generated cross combination features are discrete data. Each cross combination feature is a vector whose elements take the value 0 or 1, and each element of the vector corresponds to a leaf node of a tree in the target GBDT model. For example, the generated cross combination feature may be {0,1,0,0}. Specifically, when a basic feature passes through a tree in the target GBDT model and falls on a leaf node of that tree, the element corresponding to that leaf node in the generated cross combination feature vector is 1, and the elements corresponding to the other leaf nodes of that tree are 0. The vector length of the cross combination feature is equal to the total number of leaf nodes contained in all trees of the target GBDT model.
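The leaf encoding described above can be sketched without any GBDT library, assuming the leaf index that a sample reaches in each tree is already known. The function name `encode_leaves` and the tree sizes are hypothetical; training the GBDT itself is outside this sketch.

```python
def encode_leaves(leaf_indices, leaves_per_tree):
    """One-hot encode the leaf reached in each tree and concatenate the blocks.

    leaf_indices[k] is the 0-based index of the leaf reached in tree k;
    leaves_per_tree[k] is the number of leaves of tree k.
    """
    vector = []
    for leaf, n_leaves in zip(leaf_indices, leaves_per_tree):
        if not 0 <= leaf < n_leaves:
            raise ValueError("leaf index out of range")
        block = [0] * n_leaves
        block[leaf] = 1  # the reached leaf is 1, all other leaves of this tree are 0
        vector.extend(block)
    return vector

# Two trees with 4 and 3 leaves; the sample lands on leaf 1 of the first
# tree and leaf 2 of the second tree.
cross = encode_leaves([1, 2], [4, 3])
print(cross)  # [0, 1, 0, 0, 0, 0, 1]
```

The first block reproduces the {0,1,0,0} example from the text, and the total length (7) equals the sum of the leaf counts. With scikit-learn, leaf indices of this kind can be obtained from a fitted `GradientBoostingClassifier` through its `apply` method, although the source does not name a specific library.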

S70: Combining the cross combination features and the necessary features to obtain target feature data.

To reflect the importance of the necessary features in the target data set, after the GBDT feature combination algorithm has been used to perform feature construction on each basic feature and generate the cross combination features, the necessary features determined in step S50 and the generated cross combination features need to be combined to obtain the target feature data. Specifically, the cross combination features are discrete feature vectors whose elements take the value 0 or 1. Therefore, before the cross combination features and the necessary features are combined, a preset encoding method is first used to encode the necessary features into discrete necessary feature vectors. The encoding method may be One-Hot encoding or integer encoding. For example, if the necessary features include gender and age, the necessary feature vectors after One-Hot encoding are [1,0] and [0,1] respectively. The encoded necessary feature vectors and the cross combination features are then combined to obtain the target feature data. Optionally, the necessary feature vector may be inserted at the head position or the tail position of the cross combination feature.
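One possible reading of S70 is sketched below: each necessary feature value is One-Hot encoded over a category list, and the result is placed at the head or tail of the cross combination feature. The category lists, the age buckets and the function names are assumptions for illustration and are not taken from the source.

```python
def one_hot(value, categories):
    """One-Hot encode a single categorical value over a fixed category list."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]

def combine(necessary_vector, cross_vector, position="head"):
    """Insert the necessary feature vector at the head or tail of the cross vector."""
    if position == "head":
        return necessary_vector + cross_vector
    if position == "tail":
        return cross_vector + necessary_vector
    raise ValueError("position must be 'head' or 'tail'")

# Hypothetical necessary features for one sample.
gender_vec = one_hot("female", ["male", "female"])      # [0, 1]
age_vec = one_hot("30-39", ["18-29", "30-39", "40+"])   # [0, 1, 0]
cross = [0, 1, 0, 0]                                    # cross combination feature

target = combine(gender_vec + age_vec, cross, position="head")
print(target)  # [0, 1, 0, 1, 0, 0, 1, 0, 0]
```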

In this embodiment, the data to be processed is acquired and preprocessed to generate a standard data set, where the preprocessing includes at least one of missing value processing, abnormal value processing, deduplication processing, or noise data processing; iterative cleaning is performed on the standard data set by adopting an isolated forest algorithm to generate a target data set; feature extraction is performed on the target data set to obtain its basic features; the weight of each basic feature in the target data set is determined according to a preset strategy; the necessary features of the target data set are determined according to those weights; feature construction is performed on each basic feature by adopting a GBDT feature combination algorithm to generate cross combination features; and the cross combination features and the necessary features are combined to obtain target feature data. The isolated forest algorithm and the GBDT feature combination algorithm are thus combined into one data processing framework: after the data is cleaned by the isolated forest algorithm, the GBDT feature combination algorithm constructs cross combination features from the basic features of the target data set, and the necessary features are finally combined with the cross combination features to obtain the target feature data. This not only avoids the curse of dimensionality in the resulting target feature data and reflects the importance of the necessary features within it, but also further improves the accuracy of data processing and ensures the accuracy of the generated target feature data.

In an embodiment, as shown in fig. 3, an isolated forest algorithm is used to perform iterative cleaning on a standard data set to generate a target data set, which specifically includes the following steps:

S201: Acquiring a standard data set, inputting the standard data set into a preset isolated forest model for data screening, and acquiring normal detection data.

The isolated forest model refers to a model generated by adopting a large number of training sample data sets for training in advance. The training sample data set refers to historical data which are stored in a database in advance and used for training the isolated forest model. In a particular embodiment, the training sample data set is of the same type of data set as the standard data set.

Specifically, training on the training sample data to generate the isolated forest model comprises: randomly selecting ψ sample points from the training sample data as a subsample and placing them in the root node of a tree; randomly specifying a dimension (attribute) and randomly generating a cut point p in the current node data, where the cut point lies between the maximum value and the minimum value of the specified dimension in the current node data; generating a hyperplane from the cut point, which divides the data space of the current node into 2 subspaces, with data smaller than p in the specified dimension placed in the left child of the current node and data greater than or equal to p placed in the right child of the current node; recursing in the child nodes and continuing to construct new child nodes until a child node contains only one piece of data (and cannot be cut further) or reaches the defined height; and finishing training once t iTrees have been obtained, thereby generating the isolated forest model.
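The tree-building procedure just described can be sketched in plain Python as follows. This is a minimal illustration, not the patented implementation: the nested-dict tree layout, the function names, the fixed seed and the toy data are assumptions, and the sketch restricts the random dimension to those whose values still differ so that a cut can always separate the data.

```python
import random

def build_itree(points, height_limit, rng, depth=0):
    """Recursively build one iTree as nested dicts."""
    # Stop when the node holds at most one point or the defined height is reached.
    if len(points) <= 1 or depth >= height_limit:
        return {"size": len(points)}
    # Assumption: only dimensions with some spread are eligible for cutting.
    dims = [d for d in range(len(points[0]))
            if min(p[d] for p in points) != max(p[d] for p in points)]
    if not dims:  # all remaining points are identical, cannot cut further
        return {"size": len(points)}
    dim = rng.choice(dims)
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    cut = rng.uniform(lo, hi)  # cut point p between the min and max of the dimension
    left = [p for p in points if p[dim] < cut]    # smaller than p: left child
    right = [p for p in points if p[dim] >= cut]  # greater or equal: right child
    return {"dim": dim, "cut": cut,
            "left": build_itree(left, height_limit, rng, depth + 1),
            "right": build_itree(right, height_limit, rng, depth + 1)}

def build_forest(data, n_trees, subsample_size, height_limit, seed=0):
    """Build t = n_trees iTrees, each from a random subsample of psi points."""
    rng = random.Random(seed)
    return [build_itree(rng.sample(data, min(subsample_size, len(data))),
                        height_limit, rng)
            for _ in range(n_trees)]

def path_length(tree, point):
    """Number of edges from the root to the leaf that the point falls into."""
    depth = 0
    while "dim" in tree:
        tree = tree["left"] if point[tree["dim"]] < tree["cut"] else tree["right"]
        depth += 1
    return depth

# A tight 4x4 grid of points plus one far-away point.
data = [(x / 10, y / 10) for x in range(4) for y in range(4)] + [(100.0, 100.0)]
forest = build_forest(data, n_trees=25, subsample_size=len(data), height_limit=8)
outlier_avg = sum(path_length(t, (100.0, 100.0)) for t in forest) / len(forest)
inlier_avg = sum(path_length(t, (0.1, 0.1)) for t in forest) / len(forest)
print(outlier_avg, inlier_avg)
```

The far-away point is typically isolated after one or two cuts, whereas an interior grid point needs at least four, which is why a low average height indicates abnormal data.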

Further, in this embodiment, the parameters of the isolated forest model are set before the isolated forest model is generated. The parameters of the isolated forest model mainly comprise max_features, n_estimators, and min_sample_leaf, where max_features refers to the maximum number of features that a random forest allows a single decision tree to use, n_estimators refers to the number of subtrees to be built, and min_sample_leaf refers to the minimum sample leaf size. Specifically, the optimal values of n_estimators and min_sample_leaf can be found by grid search with cross-validation as the case may be; preferably, n_estimators uses the default parameter 500, and min_sample_leaf is set to be greater than 50. The setting of max_features is associated with the number of features subsequently constructed by the GBDT feature combination algorithm. Therefore, to strengthen the association between the GBDT model and the isolated forest model, in this embodiment the max_features parameter of the isolated forest model is set to the same value as the max_features parameter of the GBDT model; preferably, max_features is set to 100.

Specifically, the acquired standard data set is input into the preset isolated forest model, and each iTree of the isolated forest model is traversed. Then, the level at which each standard data item in the standard data set finally lands on each iTree is calculated, giving the average height value of each standard data item in the isolated forest model. Finally, standard data whose average height value is lower than a set threshold is determined to be abnormal data, standard data whose average height value is equal to or higher than the set threshold is determined to be normal data, and the normal data is extracted to form the normal detection data. The set threshold can be set according to the actual situation of the standard data set.

S202: Judging whether the difference between the quantity of normal detection data and the set target value is greater than a preset threshold.

The set target value refers to the preset number of target data items to be generated. For example, the set target value can be 1000, 2000 or 5000, and can be set according to the actual situation of the normal detection data. The preset threshold refers to the threshold used to evaluate whether the quantity of normal detection data satisfies the set target value. For example, the preset threshold may be 0, 100 or 200. It should be noted that the preset threshold must be greater than or equal to 0, i.e. the quantity of normal detection data must be greater than or equal to the set target value. Specifically, the normal detection data is acquired, the difference between the quantity of normal detection data and the set target value is calculated, and it is judged whether this difference is greater than the preset threshold.

S203: If the difference between the quantity of normal detection data and the set target value is greater than the preset threshold, performing iterative training and data screening on the isolated forest model based on the normal detection data until the difference between the quantity of the generated normal detection data and the set target value is equal to or less than the preset threshold, and forming the obtained normal detection data into a target data set.

Specifically, if the difference between the quantity of normal detection data and the set target value is greater than the preset threshold, iterative training and data screening are performed based on the normal detection data: the normal detection data is used to train and re-establish the isolated forest model, and the same normal detection data is then input into the re-established isolated forest model for data screening to obtain new normal detection data. This cycle of training an isolated forest model on the current normal detection data and screening that data with the model is repeated until the difference between the quantity of the generated normal detection data and the set target value is equal to or less than the preset threshold, at which point the obtained normal detection data forms the target data set.
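The control flow of S201 to S203 can be sketched independently of the model itself. In the sketch below, `fit` and `screen` are placeholders for training and screening with an isolated forest; the toy versions used in the demonstration (a median "model" and a rule that drops the single farthest value) are stand-ins chosen only to keep the example short and are not an isolated forest. The `max_rounds` guard and the early exit when nothing is removed are additional assumptions to guarantee termination.

```python
from statistics import median

def iterative_clean(data, initial_model, fit, screen,
                    target_count, threshold, max_rounds=100):
    """Repeat train-then-screen until len(normal) - target_count <= threshold."""
    normal = screen(initial_model, data)           # S201: screen with the preset model
    rounds = 1
    while len(normal) - target_count > threshold and rounds < max_rounds:  # S202
        model = fit(normal)                        # S203: retrain on surviving data
        screened = screen(model, normal)           # and screen that data again
        if len(screened) == len(normal):           # nothing removed: stop early
            break
        normal = screened
        rounds += 1
    return normal

# Toy stand-ins (NOT an isolated forest): the "model" is the median, and
# screening drops the one value farthest from it.
def fit_median(values):
    return median(values)

def drop_farthest(center, values):
    worst = max(values, key=lambda v: abs(v - center))
    kept = list(values)
    kept.remove(worst)
    return kept

standard = [1, 2, 3, 4, 5, 50, 100]
target = iterative_clean(standard, initial_model=4.0, fit=fit_median,
                         screen=drop_farthest, target_count=5, threshold=0)
print(target)  # [1, 2, 3, 4, 5]
```

Here the first pass removes 100, the difference 6 − 5 = 1 still exceeds the threshold 0, so a second pass removes 50 and the loop stops with exactly five items.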

In this embodiment, the standard data set is acquired and input into a preset isolated forest model for data screening to obtain normal detection data; whether the numerical difference between the quantity value of the normal detection data and the set target value is greater than a preset threshold is judged; if the numerical difference is greater than the preset threshold, iterative training and data screening are performed on the isolated forest model based on the normal detection data until the numerical difference between the quantity value of the generated normal detection data and the set target value is equal to or less than the preset threshold, and the obtained normal detection data is formed into a target data set. The standard data set is thus iteratively cleaned multiple times by the isolated forest algorithm, and the number of cleaning iterations is controlled according to the specific data conditions, thereby further improving the cleanliness of the generated target data set.

In an embodiment, as shown in fig. 4, inputting the standard data set into a preset isolated forest model for data screening to obtain normal detection data specifically includes the following steps:

S2011: and inputting the standard data set into the isolated forest model, traversing each random binary tree in the isolated forest model, and determining the average height value of each standard data in the standard data set in the isolated forest model.

Specifically, each standard data in the standard data set is input into the isolated forest model and traverses each random binary tree in the isolated forest model. Then, for each random binary tree, the height value of the standard data is calculated, i.e. the level of the tree at which the standard data finally lands. The height values of the standard data on all random binary trees are averaged to obtain the average height value of each standard data in the standard data set in the isolated forest model. For example: if the isolated forest model comprises 3 random binary trees and, after a standard data is input into each random binary tree for traversal, its height values on the three trees are {6, 7, 8}, then the average height value of the standard data in the isolated forest model is (6+7+8)/3 = 7.

S2012: and determining the standard data with the average height value larger than a preset height threshold value as normal detection data.

Wherein the height threshold refers to a value used for screening abnormal data out of the standard data set. For example: the height threshold can be set to 3, 5 or 7, and the user can customize the setting according to the actual situation of the standard data set. Specifically, the average height value of each standard data in the standard data set is compared with the preset height threshold one by one, the standard data whose average height value is equal to or less than the preset height threshold are screened out, and the standard data whose average height value is greater than the preset height threshold are determined as normal detection data.

In this embodiment, the standard data set is input into the isolated forest model, and each random binary tree in the isolated forest model is traversed to determine an average height value of each standard data in the standard data set in the isolated forest model; then, the standard data with the average height value larger than the preset height threshold value is determined as normal detection data, so that the accuracy of the generated normal detection data is further improved.
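Steps S2011 and S2012 can be illustrated with a small pure-Python sketch. It is only an approximation under stated assumptions: the names `build_tree`, `height`, `average_heights` and `screen_normal`, the tree count, the depth limit and the random seed are all hypothetical, and the raw landing depth is used as the height value as described above, without the path-length normalisation found in common isolation forest libraries.

```python
import random
from typing import List, Sequence

Point = Sequence[float]


class Node:
    """Internal node if `feature` is set, otherwise a leaf."""

    def __init__(self, feature=None, split=None, left=None, right=None):
        self.feature, self.split, self.left, self.right = feature, split, left, right


def build_tree(points: List[Point], rng: random.Random, depth: int, max_depth: int) -> Node:
    """Grow one random binary tree by splitting on a random feature and value."""
    if len(points) <= 1 or depth >= max_depth:
        return Node()
    feature = rng.randrange(len(points[0]))
    lo = min(p[feature] for p in points)
    hi = max(p[feature] for p in points)
    if lo == hi:
        return Node()
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[feature] < split]
    right = [p for p in points if p[feature] >= split]
    return Node(feature, split,
                build_tree(left, rng, depth + 1, max_depth),
                build_tree(right, rng, depth + 1, max_depth))


def height(tree: Node, point: Point) -> int:
    """Level at which the point finally lands in one tree."""
    depth, node = 0, tree
    while node.feature is not None:
        node = node.left if point[node.feature] < node.split else node.right
        depth += 1
    return depth


def average_heights(points: List[Point], n_trees: int = 100,
                    max_depth: int = 12, seed: int = 0) -> List[float]:
    """S2011: average each point's height over all random binary trees."""
    rng = random.Random(seed)
    trees = [build_tree(list(points), rng, 0, max_depth) for _ in range(n_trees)]
    return [sum(height(t, p) for t in trees) / n_trees for p in points]


def screen_normal(points: List[Point], height_threshold: float) -> List[Point]:
    """S2012: keep only points whose average height exceeds the threshold."""
    averages = average_heights(points)
    return [p for p, h in zip(points, averages) if h > height_threshold]


# Sixteen clustered values plus one far-away value that is isolated almost immediately.
points = [(i * 0.1,) for i in range(16)] + [(100.0,)]
normal = screen_normal(points, height_threshold=2.0)
print(len(normal), (100.0,) in normal)
```

The far-away value is separated from the rest after roughly one split in almost every tree, so its average height stays below the threshold, while the clustered values need several more splits and are kept.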

In an embodiment, as shown in fig. 5, the feature extraction is performed on the target data set to obtain the basic features of the target data set, and the method specifically includes the following steps:

S301: and acquiring a characteristic parameter set, wherein the characteristic parameter set comprises M parameter identifications, and M is a positive integer.

The feature parameter set refers to a preset set of the features that need to be extracted from the target data set. Different types of target data sets correspond to different feature parameter sets. For example: if the target data set is user information, the feature parameter set of the target data set may include: name, gender, age, occupation, interests, etc.; if the target data set is webpage clicking behavior, the feature parameter set of the target data set may include: time of clicks, number of clicks, frequency of clicks, etc. Specifically, the feature parameter set includes M parameter identifications, where M is a positive integer. A parameter identification is an identifier assigned to each feature parameter. For example: the parameter identification of the name may be name; the parameter identification of the gender may be gender; the parameter identification of the age may be age; the parameter identification of the occupation may be occupational; the parameter identification of the interest may be interest.

S302: and acquiring a corresponding feature extraction script according to each parameter identifier.

The feature extraction script refers to a script that can directly perform feature extraction on the target data set. In this embodiment, the feature extraction scripts are written in advance and stored in the database of the server, so that the corresponding feature extraction script can be obtained directly from the database of the server according to the obtained parameter identification. For example: according to the parameter identification name of the name, the corresponding feature extraction script <name> can be obtained from the database of the server; according to the parameter identification gender of the gender, the corresponding feature extraction script <gender> can be obtained from the database of the server; according to the parameter identification age of the age, the corresponding feature extraction script <age> can be obtained from the database of the server; according to the parameter identification Number, the corresponding feature extraction script <Number> can be obtained from the database of the server; and according to the parameter identification occupational of the occupation, the corresponding feature extraction script <occupational> can be obtained from the database of the server, and so on.

S303: and performing feature extraction on the target data set by adopting the feature extraction script to obtain the basic features of the target data set.

Where a base feature refers to a set of feature data that reflects a property or attribute of a target data set. Specifically, each feature extraction script has a function of performing direct feature extraction on the target data set, so that the feature extraction script obtained in step S302 can be used to perform feature extraction on the target data set directly to obtain the basic features of the target data set.
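The lookup described in steps S301 to S303 can be sketched as a mapping from parameter identifications to extraction routines. This is a hedged illustration only: `SCRIPT_STORE` is a hypothetical in-memory stand-in for the database of the server, and the record layout and the function `extract_basic_features` are assumptions made for the example.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
Extractor = Callable[[List[Record]], List[Any]]

# Hypothetical stand-in for the server database that stores the pre-written scripts.
SCRIPT_STORE: Dict[str, Extractor] = {
    "name": lambda rows: [r.get("name") for r in rows],
    "gender": lambda rows: [r.get("gender") for r in rows],
    "age": lambda rows: [r.get("age") for r in rows],
}


def extract_basic_features(target_data: List[Record],
                           parameter_ids: List[str]) -> Dict[str, List[Any]]:
    """Run one extraction script per parameter identification."""
    features: Dict[str, List[Any]] = {}
    for pid in parameter_ids:                # S301: the M parameter identifications
        script = SCRIPT_STORE[pid]           # S302: fetch the script by identification
        features[pid] = script(target_data)  # S303: extract the feature from the data set
    return features


users = [
    {"name": "A", "gender": "F", "age": 30},
    {"name": "B", "gender": "M", "age": 41},
]
basic = extract_basic_features(users, ["name", "age"])
print(basic)
```

Keeping the scripts keyed by identification means that a different type of target data set only needs a different feature parameter set, not a change to the extraction loop.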

In this embodiment, by obtaining a feature parameter set, the feature parameter set includes M parameter identifiers, where M is a positive integer; acquiring a corresponding feature extraction script according to each parameter identifier; extracting the features of the target data set by adopting a feature extraction script to obtain the basic features of the target data set; thereby making the basic features of the generated target data set more accurate and comprehensive.

In an embodiment, as shown in fig. 6, a GBDT feature combination algorithm is used to perform feature construction on each basic feature to generate a cross-combination feature, which specifically includes the following steps:

S601: and acquiring sample characteristics, training the sample characteristics, and generating a target GBDT model.

Wherein, the sample features refer to the features used for training to generate the target GBDT model. In a specific embodiment, the sample features are pre-stored in the database of the server, and when feature construction needs to be performed on each basic feature, the sample features can be obtained directly from the database of the server. It should be noted that the sample features are of the same type as the basic features. The target GBDT model is a GBDT model with N trees, which is constructed after decision tree training is carried out on the sample features. Specifically, the obtained sample features are input into a preset decision tree for training so as to construct a target GBDT model with N trees.

Further, in this embodiment, when training the sample features and generating the target GBDT model, the parameters of the target GBDT model need to be set. The parameters of the target GBDT model mainly include max_features, learning_rate and n_estimators. For both learning_rate and n_estimators, the optimal parameter combination can be found by grid search with cross-validation; in this embodiment, learning_rate is set to 0.01 and n_estimators is set to 500. The setting of max_features is tied to the max_features parameter of the random forest model used earlier, which strengthens the association between the target GBDT model and the random forest model. In this embodiment, the max_features parameter of the target GBDT model and the max_features parameter of the random forest model are set to the same value. When the two parameters are consistent, the two originally independent models can complement each other, and the target GBDT model can construct combined features according to the same feature processing logic as the random forest model. Preferably, the max_features parameter of the target GBDT model is set to 100.

S602: and inputting the basic features into a target GBDT model for feature construction to generate cross combination features.

Specifically, the basic features are input into the constructed target GBDT model to generate the cross combination features. Illustratively, suppose the target GBDT model consists of two trees, where the first tree has 3 leaf nodes and the second tree has 2 leaf nodes. If an input basic feature x finally falls on the second leaf node of the first tree and on the first leaf node of the second tree, then after feature construction by the target GBDT model, the generated cross combination feature is [0,1,0,1,0], where the first three bits of the vector correspond to the 3 leaf nodes of the first tree and the last two bits correspond to the 2 leaf nodes of the second tree.
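The encoding in this example can be reproduced with a short sketch. It assumes the leaf on which a sample lands in each tree is already known, so it shows only the one-hot concatenation and not the training of the target GBDT model; the function name `encode_leaves` and the zero-based leaf indices are illustrative choices.

```python
from typing import List


def encode_leaves(leaf_indices: List[int], leaves_per_tree: List[int]) -> List[int]:
    """Concatenate one one-hot block per tree, marking the leaf the sample reached."""
    if len(leaf_indices) != len(leaves_per_tree):
        raise ValueError("one leaf index is required per tree")
    vector: List[int] = []
    for leaf, n_leaves in zip(leaf_indices, leaves_per_tree):
        if not 0 <= leaf < n_leaves:
            raise ValueError("leaf index out of range")
        block = [0] * n_leaves
        block[leaf] = 1
        vector.extend(block)
    return vector


# Two trees with 3 and 2 leaf nodes; x lands on the second leaf of the first tree
# (index 1) and on the first leaf of the second tree (index 0).
cross_features = encode_leaves([1, 0], [3, 2])
print(cross_features)  # [0, 1, 0, 1, 0]
```

The length of the resulting vector equals the total number of leaf nodes over all N trees, and exactly one bit per tree is set.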

In the embodiment, a target GBDT model with N trees is constructed by obtaining sample characteristics and training the sample characteristics based on a preset GBDT model, wherein N is a positive integer; inputting the basic features into a target GBDT model for feature construction to generate cross combination features; thereby improving the accuracy and diversity of the generated cross-combination features.

It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.

In an embodiment, a data processing apparatus is provided, and the data processing apparatus corresponds to the data processing method in the above embodiment one to one. As shown in fig. 7, the data processing apparatus includes a preprocessing module 10, an iterative cleaning module 20, a feature extraction module 30, a weight determination module 40, an essential feature determination module 50, a feature construction module 60, and a combination module 70. The functional modules are explained in detail as follows:

the preprocessing module 10 is configured to acquire data to be processed, and preprocess the data to be processed to obtain a standard data set, where the preprocessing includes at least one of missing value processing, abnormal value processing, deduplication processing, or noise data processing;

the iterative cleaning module 20 is used for performing iterative cleaning on the standard data set by adopting an isolated forest algorithm to generate a target data set;

the feature extraction module 30 is configured to perform feature extraction on the target data set to obtain basic features of the target data set;

the weight determining module 40 is used for determining the weight of each basic feature in the target data set according to a preset strategy;

a necessary feature determination module 50 for determining necessary features of the target data set according to the weight of each basic feature in the target data set;

a feature construction module 60, configured to perform feature construction on each basic feature by using a GBDT feature combination algorithm, so as to generate a cross-combination feature;

and the combination module 70 is used for combining the cross combination characteristics and the necessary characteristics to obtain target characteristic data.
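The data flow between the seven modules can be sketched as a chain of callables. This is a schematic illustration only: the class name `DataProcessingApparatus`, the field names and every toy lambda below are hypothetical placeholders, and none of them implements the actual preprocessing, isolated forest, weighting or GBDT logic.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class DataProcessingApparatus:
    preprocess: Callable[[Any], Any]             # preprocessing module 10
    iterative_clean: Callable[[Any], Any]        # iterative cleaning module 20
    extract_features: Callable[[Any], Any]       # feature extraction module 30
    determine_weights: Callable[[Any], Any]      # weight determination module 40
    select_necessary: Callable[[Any, Any], Any]  # necessary feature determination module 50
    construct_cross: Callable[[Any], Any]        # feature construction module 60
    combine: Callable[[Any, Any], Any]           # combination module 70

    def run(self, raw: Any) -> Any:
        standard = self.preprocess(raw)            # standard data set
        target = self.iterative_clean(standard)    # target data set
        basic = self.extract_features(target)      # basic features
        weights = self.determine_weights(basic)    # weight of each basic feature
        necessary = self.select_necessary(basic, weights)
        cross = self.construct_cross(basic)        # cross combination features
        return self.combine(cross, necessary)      # target feature data


# Toy placeholders, used only to show how the outputs are passed along.
apparatus = DataProcessingApparatus(
    preprocess=lambda raw: list(dict.fromkeys(x for x in raw if x is not None)),
    iterative_clean=lambda rows: rows,
    extract_features=lambda rows: {"value": rows, "double": [2 * r for r in rows]},
    determine_weights=lambda feats: {"value": 0.9, "double": 0.1},
    select_necessary=lambda feats, w: {k: v for k, v in feats.items() if w[k] >= 0.5},
    construct_cross=lambda feats: {
        "value_x_double": [a * b for a, b in zip(feats["value"], feats["double"])]
    },
    combine=lambda cross, necessary: {**necessary, **cross},
)
result = apparatus.run([3, None, 3, 1, 2])
print(result)
```

Note that the cross combination features are built from all basic features, while the necessary features are the subset selected by weight; the two are merged only in the final step.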

Preferably, as shown in fig. 8, the iterative cleaning module 20 includes:

the data screening unit 201 is configured to obtain a standard data set, input the standard data set into a preset isolated forest model, and perform data screening to obtain normal detection data;

a determining unit 202, configured to determine whether a value difference between the value of the normal detection data and the set target value is greater than a preset threshold;

and the target data set generating unit 203 is configured to perform iterative training and data screening on the isolated forest model based on the normal detection data when a value difference between the quantity value of the normal detection data and the set target value is greater than a preset threshold, and to form a target data set from the obtained normal detection data until the value difference between the quantity value of the generated normal detection data and the set target value is equal to or less than the preset threshold.

Preferably, as shown in fig. 9, the data filtering unit 201 includes:

the traversal subunit 2011 is configured to input the standard data set into the isolated forest model, traverse each random binary tree in the isolated forest model, and determine an average height value of each standard data in the standard data set in the isolated forest model;

the normal detection data determining subunit 2012 is configured to determine, as the normal detection data, the standard data with the average height value greater than the preset height threshold.

Preferably, the feature extraction module 30 comprises:

the characteristic parameter set acquisition unit is used for acquiring a characteristic parameter set, wherein the characteristic parameter set comprises M parameter identifications, and M is a positive integer;

the characteristic extraction script acquisition unit is used for acquiring a corresponding characteristic extraction script according to each parameter identifier;

and the feature extraction unit is used for extracting features of the target data set by adopting the feature extraction script to obtain the basic features of the target data set.

Preferably, the feature construction module 60 comprises:

the sample characteristic acquisition unit is used for acquiring sample characteristics, training the sample characteristics and generating a target GBDT model;

and the characteristic construction unit is used for inputting the basic characteristics into the target GBDT model for characteristic construction to generate cross combination characteristics.

For specific limitations of the data processing apparatus, reference may be made to the above limitations of the data processing method, which are not described herein again. The modules in the data processing apparatus described above may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call them and execute the operations corresponding to the modules.

In one embodiment, a computer device is provided, which may be a server, and its internal structure may be as shown in fig. 10. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used to store the data used in the data processing method in the above-described embodiments. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a data processing method.

In one embodiment, a computer device is provided, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the computer program, the data processing method in the above embodiments is implemented.

In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which, when executed by a processor, implements the data processing method in the above-described embodiments.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.

The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims

1. A data processing method, comprising:

2. The data processing method of claim 1, wherein the iteratively cleaning the standard data set with the isolated forest algorithm to generate a target data set comprises:

acquiring a standard data set, inputting the standard data set into a preset isolated forest model for data screening to obtain normal detection data;

judging whether the numerical difference between the numerical value of the normal detection data and a set target value is greater than a preset threshold value or not;

if the numerical difference between the numerical value of the normal detection data and the set target value is larger than the preset threshold, performing iterative training and data screening on the isolated forest model based on the normal detection data until the generated numerical difference between the numerical value of the normal detection data and the set target value is equal to or smaller than the preset threshold, and forming the obtained normal detection data into a target data set.

3. The data processing method of claim 2, wherein the inputting the standard data set into the isolated forest model for data screening to obtain normal detection data comprises:

inputting the standard data set into the isolated forest model, traversing each random binary tree in the isolated forest model, and determining an average height value of each standard data in the standard data set in the isolated forest model;

and determining the standard data with the average height value larger than a preset height threshold value as normal detection data.

4. The data processing method of claim 1, wherein the performing feature extraction on the target data set to obtain basic features of the target data set comprises:

acquiring a characteristic parameter set, wherein the characteristic parameter set comprises M parameter identifiers, and M is a positive integer;

acquiring a corresponding feature extraction script according to each parameter identifier;

and performing feature extraction on the target data set by adopting the feature extraction script to obtain the basic features of the target data set.

5. The data processing method of claim 1, wherein said performing feature construction on each of said basic features using GBDT feature combination algorithm to generate cross-combination features comprises:

acquiring sample characteristics, training the sample characteristics, and generating a target GBDT model;

and inputting the basic features into the target GBDT model for feature construction to generate cross combination features.

6. A data processing apparatus, comprising:

7. The data processing apparatus of claim 6, wherein the iterative cleaning module comprises:

the data screening unit is used for acquiring a standard data set, inputting the standard data set into a preset isolated forest model for data screening to obtain normal detection data;

the judging unit is used for judging whether the numerical difference between the numerical value of the normal detection data and the set target value is larger than a preset threshold value or not;

and the target data set generating unit is used for performing iterative training and data screening on the isolated forest model based on the normal detection data when the numerical difference between the numerical value of the normal detection data and the set target value is greater than the preset threshold value, and forming the obtained normal detection data into a target data set until the numerical difference between the generated numerical value of the normal detection data and the set target value is equal to or less than the preset threshold value.

8. The data processing apparatus of claim 7, wherein the data filtering unit comprises:

a traversal subunit, configured to input the standard data set into the isolated forest model, traverse each random binary tree in the isolated forest model, and determine an average height value of each standard data in the standard data set in the isolated forest model;

and the normal detection data determining subunit is used for determining the standard data of which the average height value is greater than a preset height threshold value as normal detection data.

9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the data processing method according to any one of claims 1 to 5 when executing the computer program.

10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the data processing method according to any one of claims 1 to 5.