CN117194942A

CN117194942A - Feature processing method and device based on causal relationship and electronic equipment

Info

Publication number: CN117194942A
Application number: CN202311243678.8A
Authority: CN
Inventors: 张小静; 刘兆涵; 方磊; 尚明栋
Original assignee: Beijing Zetyun Tech Co ltd
Current assignee: Beijing Zetyun Tech Co ltd
Priority date: 2023-09-25
Filing date: 2023-09-25
Publication date: 2023-12-08

Abstract

The application provides a causal relationship-based feature processing method and device and electronic equipment, and relates to the technical field of machine learning and deep learning. The method comprises the following steps: determining causal relationships among a plurality of initial feature variables in an initial data set to obtain a causal structure matrix, wherein the causal structure matrix represents the causal relationship between any two of the plurality of initial feature variables; determining a target feature variable from the plurality of initial feature variables, and determining, based on the causal structure matrix, a plurality of related feature variables having an adjacency relationship with the target feature variable; performing a feature derivation operation based on the plurality of related feature variables to obtain a plurality of derived feature variables; and performing feature evaluation based on the plurality of derived feature variables to determine a target feature combination. The application effectively screens out invalid feature variables and reduces the number of feature variables; in addition, because the feature variables in the target feature combination have adjacency relationships, the quality of the target feature combination is improved.

Description

Feature processing method and device based on causal relationship and electronic equipment

Technical Field

The application relates to the technical field of machine learning and deep learning, in particular to the technical field of automatic feature engineering, and specifically relates to a feature processing method and device based on causal relation and electronic equipment.

Background

Feature extraction is a technique that uses domain knowledge and insight to screen useful feature variables from raw data and processes them to obtain feature variables with predictive power. In the related art, feature variables are generally extracted from data by an expansion-compression method or a learning method. However, both approaches extract related feature variables based on the correlation between variables and then derive a large number of features from them, many of which are invalid; feature screening is then performed to delete the invalid feature variables and obtain the valid ones.

Therefore, the related art has the problem that the feature variables derived based on correlation are numerous and unreliable, so that the finally extracted feature variables have poor effectiveness.

Disclosure of Invention

The embodiment of the application provides a feature processing method, a device and electronic equipment based on causal relation, which are used for solving the problem that the feature variables of extracted data have poor effectiveness because more invalid feature variables exist in the feature variables which are initially extracted in the related technology.

To solve the above problems, the present application is achieved as follows:

in a first aspect, an embodiment of the present application provides a causal relationship-based feature processing method, including:

determining a causal relation among a plurality of initial characteristic variables in an initial data set to obtain a causal structure matrix, wherein the causal structure matrix is used for representing the causal relation among any two initial characteristic variables in the plurality of initial characteristic variables;

determining a target feature variable from the plurality of initial feature variables, and determining a plurality of related feature variables having an adjacent relationship with the target feature variable based on the causal structure matrix;

performing feature derivative operation based on the related feature variables to obtain derivative feature variables;

and carrying out feature evaluation based on the plurality of derivative feature variables to determine a target feature combination.

In a second aspect, an embodiment of the present application provides a feature processing method based on a target model, including:

acquiring information of the target feature combination and data to be processed according to the first aspect;

the target model performs feature processing on the data to be processed based on the information of the target feature combination to obtain feature variables corresponding to the data to be processed; the target model is a model which is obtained based on the target feature combination training.

In a third aspect, an embodiment of the present application further provides a feature processing apparatus based on causal relationship, including:

the first processing module is used for determining the causal relation among a plurality of initial characteristic variables in an initial data set to obtain a causal structure matrix, and the causal structure matrix is used for representing the causal relation among any two initial characteristic variables in the plurality of initial characteristic variables;

a second processing module configured to determine a target feature variable from the plurality of initial feature variables, and determine a plurality of related feature variables having an adjacency relationship with the target feature variable based on the causal structure matrix;

the third processing module is used for carrying out feature derivative operation based on the related feature variables to obtain derivative feature variables;

and the evaluation module is used for carrying out feature evaluation based on the plurality of derivative feature variables and determining a target feature combination.

In a fourth aspect, an embodiment of the present application provides a feature processing apparatus based on a target model, including:

the acquisition module is used for acquiring information of the target feature combination and data to be processed, which are obtained by a feature processing method based on a causal relationship;

the processing module is used for carrying out feature processing on the data to be processed based on the information of the target feature combination by the target model to obtain feature variables corresponding to the data to be processed; the target model is a model which is obtained based on the target feature combination training.

In a fifth aspect, an embodiment of the present application further provides an electronic device, including a processor, a memory, and a computer program stored on the memory and executable on the processor, where the computer program, when executed by the processor, implements the steps of the causal relationship-based feature processing method described in the first aspect, or implements the steps of the target model-based feature processing method described in the second aspect.

In a sixth aspect, an embodiment of the present application further provides a readable storage medium storing a program which, when executed by a processor, implements the steps of the causal relationship-based feature processing method described in the first aspect, or implements the steps of the target model-based feature processing method described in the second aspect.

In the embodiment of the application, causal relationships among a plurality of initial feature variables in an initial data set are determined to obtain a causal structure matrix, wherein the causal structure matrix represents the causal relationship between any two of the plurality of initial feature variables; a target feature variable is determined from the plurality of initial feature variables, and a plurality of related feature variables having an adjacency relationship with the target feature variable are determined based on the causal structure matrix; a feature derivation operation is performed based on the plurality of related feature variables to obtain a plurality of derived feature variables; and feature evaluation is performed based on the plurality of related feature variables and the plurality of derived feature variables to determine a target feature combination. Compared with the related art, the method effectively screens out invalid feature variables and reduces the number of feature variables. In addition, the feature variables in the target feature combination obtained by the method have adjacency relationships, which improves the quality of the extracted target feature combination, improves model performance, and reduces training time. Training with these feature variables is also more efficient than training with the feature variables screened in the related art, so the method can be used to generate machine learning models or deep learning models quickly, automatically, and at scale.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments of the present application will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort to a person of ordinary skill in the art.

FIG. 1 is a flow chart of a causal relationship-based feature process provided by an embodiment of the present application;

FIG. 2 is a schematic illustration of a causal relationship provided by an embodiment of the present application;

FIG. 3 is a schematic diagram of screening a target feature combination according to an embodiment of the present application;

FIG. 4 is a flowchart of a feature processing method based on a target model according to an embodiment of the present application;

FIG. 5 is a graph showing the comparison of scores of data obtained by different methods according to the embodiments of the present application;

FIG. 6 is a comparison of feature variable extraction by different methods provided by embodiments of the present application;

FIG. 7 is a block diagram of a causal relationship-based feature processing apparatus according to an embodiment of the present application;

FIG. 8 is a block diagram of a feature processing apparatus based on a target model according to an embodiment of the present application;

FIG. 9 is a block diagram of an electronic device according to an embodiment of the present application.

Detailed Description

The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

Referring to fig. 1, fig. 1 is a flowchart of a causal relationship-based feature processing method according to an embodiment of the present application. As shown in fig. 1, the method includes the following steps:

Step 101, determining causal relationships among a plurality of initial feature variables in an initial data set to obtain a causal structure matrix, wherein the causal structure matrix represents the causal relationship between any two of the plurality of initial feature variables.

The plurality of initial feature variables in the initial data set are feature variables after preprocessing the original data in the initial data set. In some embodiments, the preprocessing is, for example, filling blank fields in the original data, converting data types in the original data, deleting error data in the original data, normalizing, and the like, which is not limited by the embodiment of the present application. Based on the preprocessed raw data, a plurality of initial feature variables included in the initial data set are determined.
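The preprocessing described above can be sketched as follows. This is only a minimal illustration: the function name `preprocess`, the sample rows, and the choice of mean filling and min-max normalization are assumptions, and erroneous entries are treated as blanks here rather than deleted.

```python
from statistics import mean

def preprocess(rows, columns):
    """Convert values to float, fill blanks with the column mean, min-max normalize."""
    cleaned = {}
    for col in columns:
        values = []
        for row in rows:
            try:
                values.append(float(row.get(col)))   # data type conversion
            except (TypeError, ValueError):
                values.append(None)                  # blank or erroneous entry
        present = [v for v in values if v is not None]
        fill = mean(present) if present else 0.0     # assumption: fill with the mean
        filled = [fill if v is None else v for v in values]
        lo, hi = min(filled), max(filled)
        span = hi - lo
        cleaned[col] = [(v - lo) / span if span else 0.0 for v in filled]
    return cleaned

# Hypothetical raw records with one blank field and one erroneous field.
rows = [
    {"age": "20", "income": "100"},
    {"age": "", "income": "300"},
    {"age": "40", "income": "bad"},
]
features = preprocess(rows, ["age", "income"])
```

Each key of `features` then plays the role of one initial feature variable.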

The causal relationships among the plurality of initial feature variables in the initial data set are obtained by applying a causal discovery algorithm to the preprocessed original data; the causal discovery algorithm is used to analyze the causal relationship between any two of the plurality of initial feature variables. In some embodiments, the causal discovery algorithm may be, for example, any one or more of the PC (Peter-Clark) algorithm, the NOTEARS algorithm, independent component analysis (ICA), and the linear non-Gaussian acyclic model (LiNGAM) algorithm. The causal relationships between feature variables may also be obtained, for example, by voting among a plurality of causal discovery algorithms.
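One possible reading of the voting scheme is an element-wise majority vote over the 0/1 matrices produced by several algorithms. The sketch below assumes that reading; the function name `vote_causal_matrices`, the `min_votes` parameter, and the three input matrices are hypothetical and do not come from any particular library.

```python
def vote_causal_matrices(matrices, min_votes=None):
    """Element-wise vote over 0/1 causal structure matrices of equal size."""
    if not matrices:
        raise ValueError("at least one matrix is required")
    n = len(matrices[0])
    if min_votes is None:
        min_votes = len(matrices) // 2 + 1   # strict majority by default
    voted = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            votes = sum(m[i][j] for m in matrices)
            voted[i][j] = 1 if votes >= min_votes else 0
    return voted

# Hypothetical outputs of three causal discovery algorithms over three variables.
m_a = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
m_b = [[0, 1, 1], [0, 0, 1], [0, 0, 0]]
m_c = [[0, 1, 0], [0, 0, 0], [0, 0, 0]]
consensus = vote_causal_matrices([m_a, m_b, m_c])
```

An edge is kept only when at least `min_votes` algorithms report it, so raising `min_votes` makes the resulting causal structure more conservative.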

The causal structure matrix is a matrix obtained through the causal discovery algorithm and contains the causal relationships among the variables, where a causal relationship indicates whether one initial feature variable is a cause feature variable of another initial feature variable. The causal structure matrix can be used to determine whether each of the plurality of initial feature variables has a cause feature variable and a result feature variable and, where the i-th initial feature variable has a cause feature variable and/or a result feature variable, to identify the cause feature variable and/or result feature variable corresponding to the i-th initial feature variable.

It should be understood that the related art screens feature variables based only on correlation and ignores the causal relationships between them. In the embodiment of the application, feature variables are screened through causal relationships, and the feature variables obtained in this way can improve the performance of the trained model.

Step 102, determining a target characteristic variable from the plurality of initial characteristic variables, and determining a plurality of related characteristic variables with adjacent relation to the target characteristic variable based on the causal structure matrix.

The target feature variable is one of the plurality of initial feature variables, for example a feature variable determined from the plurality of initial feature variables according to the actual processing task. A related feature variable is an initial feature variable that is determined based on the causal structure matrix and has an adjacency relationship with the target feature variable. The adjacency relationship may be direct adjacency to the target feature variable, for example: the related feature variable may be a cause feature variable or a result feature variable of the target feature variable. Alternatively, the adjacency relationship may be indirect adjacency to the target feature variable, for example: the related feature variable may be a cause feature variable or a result feature variable of a cause feature variable or result feature variable of the target feature variable, and so on.

Illustratively, consider the following chain of relationships between feature variables: I-A-G, where I, A, and G are feature variables. I and G are each directly adjacent to A: I and A have a first-order relationship, and A and G have a first-order relationship. I and G have a second-order relationship and are indirectly adjacent feature variables. It should be understood that two feature variables may also have a relationship of more than two orders, in which case they are still indirectly adjacent feature variables.

It should be appreciated that the related feature variables may be set according to the requirements for screening feature variables. Where the required association or causal relationship between feature variables is strict, the related feature variables may be limited to the cause feature variables or result feature variables of the target feature variable (for example, feature variables having a first-order relationship with the target feature variable). Where the required association or causal relationship is less strict, the number of related feature variables may be increased to include, in addition to the cause feature variables and result feature variables of the target feature variable, the cause feature variables of those cause feature variables, the result feature variables of those result feature variables, and so on (for example, feature variables having a second-order or higher-order relationship with the target feature variable).
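The notion of first-order and higher-order relationships can be illustrated with a breadth-first search over the causal structure matrix, ignoring edge direction. This is a sketch under assumptions: the helper names `adjacency_orders` and `related_variables` and the `max_order` parameter are hypothetical, and the edge directions chosen for the I-A-G chain are arbitrary because the text does not specify them.

```python
from collections import deque

def adjacency_orders(matrix, target):
    """Return {variable index: order of its relationship to target}."""
    n = len(matrix)
    orders = {target: 0}
    queue = deque([target])
    while queue:
        u = queue.popleft()
        for v in range(n):
            linked = matrix[u][v] == 1 or matrix[v][u] == 1  # cause or result
            if linked and v not in orders:
                orders[v] = orders[u] + 1
                queue.append(v)
    del orders[target]
    return orders

def related_variables(matrix, target, max_order=1):
    """Indices of variables whose relationship order is at most max_order."""
    orders = adjacency_orders(matrix, target)
    return sorted(v for v, k in orders.items() if k <= max_order)

names = ["I", "A", "G"]
chain = [[0, 1, 0],   # I -> A (direction assumed for illustration)
         [0, 0, 1],   # A -> G (direction assumed for illustration)
         [0, 0, 0]]
first_order_of_i = [names[v] for v in related_variables(chain, 0, max_order=1)]
up_to_second_of_i = [names[v] for v in related_variables(chain, 0, max_order=2)]
```

A strict requirement corresponds to `max_order=1`, while a looser requirement corresponds to a larger `max_order`.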

Step 103, performing a feature derivation operation based on the plurality of related feature variables to obtain a plurality of derived feature variables.

The feature derivation operation is to perform a preset operation on a part of feature variables in a plurality of related feature variables, so as to obtain a plurality of derived feature variables. The feature derivation operation may be to carry out calculation by introducing part of the feature variables into a preset formula, or may be to carry out cross operation on part of the feature variables. The cross operation is to realize cross derivation of the features by a method of a cross operator, the cross operator comprises the modes of adding, subtracting, multiplying, dividing, averaging or grouping a plurality of initial feature variables, and the feature variables related to the target feature variables are expanded by the cross operation, so that the number of the feature variables to be screened is increased, and the accuracy of the feature variables in the finally obtained target data is improved.
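As a minimal sketch of the cross operators named above, the following applies an element-wise operator to two feature columns. The names `CROSS_OPERATORS` and `cross` are hypothetical, the grouping operator is omitted for brevity, and returning 0.0 on division by zero is an assumption rather than something the text prescribes.

```python
# Element-wise cross operators over two numeric feature columns.
CROSS_OPERATORS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
    "div": lambda a, b: a / b if b != 0 else 0.0,  # assumption: 0.0 on zero divisor
    "mean": lambda a, b: (a + b) / 2,
}

def cross(col_a, col_b, op):
    """Derive a new feature column from two columns of equal length."""
    fn = CROSS_OPERATORS[op]
    return [fn(a, b) for a, b in zip(col_a, col_b)]

product = cross([1.0, 2.0], [3.0, 4.0], "mul")
ratio = cross([1.0, 2.0], [0.0, 4.0], "div")
```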

Step 104, performing feature evaluation based on the plurality of derived feature variables to determine a target feature combination.

The target feature combination is a plurality of features selected from a plurality of related feature variables and a plurality of derivative feature variables through feature evaluation. It should be understood that the number of the plurality of related feature variables and the plurality of derived feature variables obtained as described above is reduced relative to the number of feature variables obtained by the expansion-compression method or the learning method in the related art, and the quality of the feature variables is further improved; and screening a plurality of related characteristic variables and a plurality of derivative characteristic variables to obtain a target characteristic combination, wherein invalid characteristic variables can be screened and removed, and the effectiveness of the extracted target characteristic combination is improved.

Further, a one-shot global screening strategy may be selected for screening the related feature variables and the derived feature variables. Such a strategy further improves the screening efficiency, reduces the number of output feature variables, and can also strengthen the interaction effects among feature variables. By screening the plurality of related feature variables and the plurality of derived feature variables, the number of feature variables in the target feature combination can be effectively reduced compared with the larger number of features computed when determining a feature combination in the related art.

Among them, one-shot global screening strategies typically select evolutionary algorithms, such as genetic algorithms, differential evolutionary algorithms, and the like.
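To make the idea of an evolutionary screening strategy concrete, the sketch below runs a very small genetic algorithm over 0/1 masks, where each bit says whether a candidate feature variable is kept. Everything here is an illustrative assumption: the function `genetic_select`, its parameters, and the toy fitness function are hypothetical, and a real fitness would typically come from training and scoring a model.

```python
import random

def genetic_select(n_features, fitness, pop_size=20, generations=30,
                   mutation_rate=0.1, seed=0):
    """Search for a high-fitness 0/1 feature mask with a simple genetic algorithm."""
    rng = random.Random(seed)
    # Start from "keep every feature" plus random masks.
    population = [tuple([1] * n_features)]
    while len(population) < pop_size:
        population.append(tuple(rng.randint(0, 1) for _ in range(n_features)))
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        elite = ranked[: pop_size // 2]            # elitism: the best half survives
        children = []
        while len(elite) + len(children) < pop_size:
            p1, p2 = rng.sample(elite, 2)
            cut = rng.randint(1, n_features - 1) if n_features > 1 else 0
            child = list(p1[:cut] + p2[cut:])      # one-point crossover
            for k in range(n_features):
                if rng.random() < mutation_rate:   # bit-flip mutation
                    child[k] = 1 - child[k]
            children.append(tuple(child))
        population = elite + children
    return max(population, key=fitness)

# Toy fitness (assumption): usefulness of kept features minus a per-feature cost.
WEIGHTS = [0.9, 0.1, 0.8, 0.05, 0.7, 0.02]

def toy_fitness(mask):
    return sum(w * b for w, b in zip(WEIGHTS, mask)) - 0.3 * sum(mask)

best = genetic_select(len(WEIGHTS), toy_fitness)
```

Because the best half of each generation is carried over unchanged, the returned mask is never worse than the initial "keep everything" mask under the given fitness.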

The feature evaluation is to screen feature variables in the target feature combination from a plurality of related feature variables and a plurality of derivative feature variables based on a preset condition, wherein the preset condition can be feature quantity or performance parameters, and the feature evaluation is performed on the plurality of related feature variables and the plurality of derivative feature variables through the preset condition to obtain feature variables in the target feature combination meeting the condition.
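A preset condition based on feature quantity or on a performance parameter might look like the following sketch. The function name `evaluate_features`, the parameters `max_features` and `min_score`, and the score values are all hypothetical.

```python
def evaluate_features(scores, max_features=None, min_score=None):
    """Keep feature names that satisfy a score threshold and/or a count limit."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if min_score is not None:                 # performance-parameter condition
        ranked = [(name, s) for name, s in ranked if s >= min_score]
    if max_features is not None:              # feature-quantity condition
        ranked = ranked[:max_features]
    return [name for name, _ in ranked]

# Hypothetical scores for two related variables and two derived variables.
scores = {"x1": 0.62, "x1*x2": 0.81, "x3": 0.15, "x2/x3": 0.40}
top_two = evaluate_features(scores, max_features=2)
above_threshold = evaluate_features(scores, min_score=0.3)
```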

In one embodiment, the determining a plurality of related feature variables having an adjacency relationship with the target feature variable based on the causal structure matrix comprises:

determining a cause characteristic variable of the target characteristic variable, a result characteristic variable of the target characteristic variable and a cause characteristic variable corresponding to the result characteristic variable of the target characteristic variable from the plurality of initial characteristic variables based on the structural information of the cause and effect structural matrix;

and determining at least one of the reason characteristic variable of the target characteristic variable, the result characteristic variable of the target characteristic variable and the reason characteristic variable corresponding to the result characteristic variable of the target characteristic variable as the plurality of related characteristic variables.

The structural information of the causal structure matrix is used for representing whether a causal relationship exists between two initial feature variables, and the causal relationship between any two initial feature variables can be determined through the causal structure matrix. For example, a value $M_{ij}$ in the causal structure matrix characterizes whether there is a causal relationship between initial feature variable i and initial feature variable j: where $M_{ij} = 1$, initial feature variable i is a cause feature variable of initial feature variable j, and initial feature variable j is a result feature variable of initial feature variable i; where $M_{ij} = 0$, there is no causal relationship between initial feature variable i and initial feature variable j. The cause feature variables and result feature variables corresponding to a selected feature variable can be determined through the causal structure matrix.
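Under the convention stated above (entry (i, j) equal to 1 means variable i is a cause of variable j), reading causes and results off the matrix reduces to scanning a column or a row. The helper names and the three-variable example below are hypothetical.

```python
def causes_of(matrix, j):
    """Indices i with matrix[i][j] == 1, i.e. the cause feature variables of j."""
    return [i for i in range(len(matrix)) if matrix[i][j] == 1]

def results_of(matrix, i):
    """Indices j with matrix[i][j] == 1, i.e. the result feature variables of i."""
    return [j for j in range(len(matrix)) if matrix[i][j] == 1]

# Hypothetical matrix over three variables: 0 -> 1 and 1 -> 2.
M = [[0, 1, 0],
     [0, 0, 1],
     [0, 0, 0]]
causes_of_1 = causes_of(M, 1)
results_of_1 = results_of(M, 1)
```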

The plurality of related feature variables form the Markov blanket feature set corresponding to the target feature variable, which includes the cause feature variables of the target feature variable, the result feature variables of the target feature variable, and the cause feature variables corresponding to the result feature variables of the target feature variable (where the plurality of initial feature variables contains no cause feature variable of the target feature variable, no result feature variable, or no cause feature variable corresponding to a result feature variable, the Markov blanket feature set does not include that category). The initial feature variables in the Markov blanket feature set have a strong causal relationship with the target feature variable, whereas the initial feature variables outside the Markov blanket feature set have a weak causal relationship with it. The target feature combination corresponding to the target feature variable can be obtained by performing cross operations on and filtering the initial feature variables in the Markov blanket feature set, so that invalid feature variables are further removed and the quality of the extracted target feature combination is improved.

The causal relationships among the initial feature variables are obtained from the causal structure matrix, as shown in fig. 2, where the initial feature variable at the tail of an arrow is a cause feature variable and the initial feature variable that the arrow points to is a result feature variable. If initial feature variable Y is the target feature variable, the plurality of related feature variables corresponding to the target feature variable (i.e., the feature variables in the Markov blanket feature set) are feature variables A, C, F, H, and K. Performing cross operations on and screening feature variables A, C, F, H, and K then yields the target feature combination.
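The Markov blanket lookup can be sketched directly from the causal structure matrix. The edge list below is hypothetical: the actual edges of fig. 2 are not reproduced in the text, so these edges were chosen only so that the blanket of Y comes out as A, C, F, H, and K; the function name `markov_blanket` is likewise an assumption.

```python
def markov_blanket(matrix, target):
    """Causes, results, and causes-of-results of target (matrix[i][j] == 1: i causes j)."""
    n = len(matrix)
    causes = {i for i in range(n) if matrix[i][target] == 1}
    results = {j for j in range(n) if matrix[target][j] == 1}
    co_causes = {i for r in results for i in range(n)
                 if matrix[i][r] == 1 and i != target}
    return causes | results | co_causes

names = ["A", "B", "C", "F", "H", "K", "Y"]
index = {name: k for k, name in enumerate(names)}
# Hypothetical edges (cause, result); B is included as a variable outside the blanket.
edges = [("A", "Y"), ("C", "Y"), ("Y", "F"), ("Y", "H"), ("K", "H"), ("B", "A")]
matrix = [[0] * len(names) for _ in names]
for cause, result in edges:
    matrix[index[cause]][index[result]] = 1

blanket = sorted(names[k] for k in markov_blanket(matrix, index["Y"]))
```

In this hypothetical graph, B is a cause of A but not of Y or of any result of Y, so it falls outside the blanket and is excluded from the later cross operations.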

In one embodiment, the performing feature derivation operation based on the plurality of related feature variables to obtain a plurality of derived feature variables includes:

and executing crossover operation on each two relevant characteristic variables in the plurality of relevant characteristic variables corresponding to the target characteristic variable to obtain the plurality of derivative characteristic variables.

The cross operation is to realize cross derivation of the features by a method of a cross operator, the cross operator comprises the modes of adding, subtracting, multiplying, dividing, averaging or grouping two initial feature variables, and the like, and the feature variables related to the target feature variables are expanded by the cross operation, so that the number of the feature variables to be screened is increased, and the accuracy of the feature variables in the finally obtained target data is improved.

Illustratively, cross operations are performed on feature variables A, C, F, H, and K corresponding to the target feature variable to obtain a plurality of derived feature variables (such as A×C, A×F, etc.). The plurality of derived feature variables and feature variables A, C, F, H, and K are then screened to obtain a target feature combination that includes one of the derived feature variables and/or one of feature variables A, C, F, H, and K.

It should be understood that performing a cross operation on any two of the plurality of related feature variables corresponding to the target feature variable yields derived feature variables that are highly correlated with the target feature variable. Moreover, compared with performing cross operations on an arbitrary number of initial feature variables, performing cross operations on any two related feature variables corresponding to the target feature variable yields fewer derived feature variables, which improves the screening efficiency and reduces the time needed to obtain the target feature combination.
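The reduction in the number of derived feature variables can be seen by counting pairs. The sketch below crosses every pair of five related feature variables with a single multiplication operator and compares the pair count with that of a hypothetical set of twelve initial feature variables; the function name, the column values, and the figure of twelve are assumptions for illustration.

```python
from itertools import combinations
from math import comb

def pairwise_derived(features, op_symbol="*", op=lambda a, b: a * b):
    """Cross every pair of feature columns with one operator."""
    derived = {}
    for (name_a, col_a), (name_b, col_b) in combinations(features.items(), 2):
        derived[f"{name_a}{op_symbol}{name_b}"] = [op(a, b) for a, b in zip(col_a, col_b)]
    return derived

# Hypothetical columns for the five related feature variables.
related = {"A": [1, 2], "C": [3, 4], "F": [5, 6], "H": [7, 8], "K": [9, 10]}
derived = pairwise_derived(related)

pairs_from_related = comb(len(related), 2)   # pairs among 5 related variables
pairs_from_all = comb(12, 2)                 # pairs among 12 hypothetical initial variables
```

With several cross operators the counts scale by the number of operators, but the ratio between the two cases stays the same.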

In one embodiment, the two related feature variables are two cause feature variables corresponding to the target feature variable; or,

the two related feature variables are two result feature variables corresponding to the target feature variable; or,

one of the two related feature variables is a cause feature variable corresponding to the target feature variable, and the other is a result feature variable corresponding to the target feature variable.

It should be appreciated that, among the plurality of related feature variables, some have a closer causal relationship with the target feature variable, such as the cause feature variables and result feature variables of the target feature variable, while others have a more distant causal relationship with it, such as the cause feature variables and result feature variables of a first feature variable corresponding to the target feature variable. Derived feature variables obtained by crossing initial feature variables that have a distant causal relationship with the target feature variable are poorly correlated with the target feature variable and are usually removed during screening; derived feature variables obtained by crossing initial feature variables that have a close causal relationship with the target feature variable are better correlated with it and can be retained after screening.

As shown in fig. 3, after redundant derived feature variables are deleted from the plurality of derived feature variables, the plurality of related feature variables and the remaining derived feature variables are screened by an evolutionary algorithm, so that invalid feature variables are screened out and the effectiveness of the extracted target feature combination is improved.

In the embodiment of the application, in order to improve screening efficiency and reduce time, any two related characteristic variables are set as reason characteristic variables or result characteristic variables of target characteristic variables, and the number of derived characteristic variables is reduced, namely, any two related characteristic variables are two reason characteristic variables corresponding to the target characteristic variables; or any two related characteristic variables are two result characteristic variables corresponding to the target characteristic variable; or, one of any two related characteristic variables is a reason characteristic variable corresponding to the target characteristic variable, and the other is a result characteristic variable corresponding to the target characteristic variable, so that the time for extracting the characteristic variable can be further reduced.
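Restricting the cross operation to cause feature variables and result feature variables of the target can be sketched as below. The function name `restricted_pairs` and the assignment of A and C as causes and F and H as results are assumptions made only for illustration.

```python
from itertools import combinations

def restricted_pairs(causes, results):
    """Pairs drawn only from direct causes and direct results of the target."""
    direct = list(causes) + [r for r in results if r not in causes]
    return list(combinations(direct, 2))

# Hypothetical roles: A and C are causes of the target, F and H are its results.
pairs = restricted_pairs(["A", "C"], ["F", "H"])
```

The resulting pairs cover the three cases listed above: two causes, two results, or one cause and one result.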

In one embodiment, the performing feature evaluation based on the plurality of derived feature variables to determine a target feature combination includes:

determining scores corresponding to the plurality of derivative feature variables based on the initial data set and a preset machine learning model, wherein the scores corresponding to the plurality of derivative feature variables comprise scores of each derivative feature variable and/or scores of each derivative feature combination in a plurality of derivative feature combinations;

determining the target feature combination based on the scores corresponding to the plurality of derived feature variables and/or the scores of each of the plurality of derived feature combinations.

The preset machine learning model is a preset algorithm model, and the score of each derivative characteristic variable is obtained by training the preset machine learning model based on the initial data set and each derivative characteristic variable; alternatively, the scoring of each derivative feature combination is obtained by training a pre-set machine learning model based on the initial dataset and each derivative feature combination.

Scoring the plurality of derived feature variables based on the initial data set may mean determining a score for the similarity between the initial data set and the plurality of derived feature variables, where a higher score indicates a higher similarity; it may also mean determining a score for the performance improvement in model training after the plurality of derived feature variables are added to the initial data set, where a higher score indicates a greater performance improvement.

Determining the target feature combination based on the scores corresponding to the plurality of derivative feature variables specifically means screening the plurality of derivative feature variables by their scores to obtain the target feature combination. It should be appreciated that, before the plurality of derived feature variables and the plurality of related feature variables are screened together, the plurality of derived feature variables are first screened once, and derived feature variables that are clearly unrelated to the plurality of initial feature variables are deleted; this reduces the number of derived feature variables and further speeds up the screening of the plurality of derived feature variables and the plurality of related feature variables. Screening the plurality of derived feature variables may mean scoring each derivative feature variable and then screening each derivative feature variable based on its score; alternatively, derived feature combinations each comprising a plurality of derived feature variables may be scored, and the variables in the derived feature combinations screened based on the scores of the derived feature combinations.

The scoring rank threshold is a preset number threshold used to determine how many derivative feature variables remain after deletion. For example, if the scoring rank threshold is set to 3, the derivative feature variables ranked below third by score are deleted and only the 3 top-ranked derivative feature variables are retained; because so few derivative feature variables remain, the time for extracting feature variables can be further reduced. As another example, the plurality of derivative feature variables are combined into different derivative feature combinations, the rank threshold is set to 3, the derivative feature combinations are ranked by score, the feature variables in the derivative feature combinations ranked below third are deleted, and only the feature variables in the three top-ranked derivative feature combinations are retained.
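By way of example only, rank-based retention with a scoring rank threshold of 3 might look like the following sketch. The function name `keep_top_ranked`, the feature names and the score values are made up for illustration and do not come from the application.

```python
def keep_top_ranked(scores, rank_threshold):
    """Keep the `rank_threshold` best-scoring items, best first."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:rank_threshold]]


# Hypothetical scores for five derived feature variables.
scores = {"a*b": 0.031, "a/c": -0.004, "b-c": 0.012, "a+d": 0.020, "c*d": 0.001}
kept = keep_top_ranked(scores, rank_threshold=3)
print(kept)  # ['a*b', 'a+d', 'b-c']
```

The same function applies unchanged when the dictionary keys identify derived feature combinations rather than individual derived feature variables.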

In the embodiment of the application, scores corresponding to a plurality of derivative feature variables are determined based on an initial data set and a preset machine learning model, and the scores corresponding to the plurality of derivative feature variables comprise scores of each derivative feature variable and/or scores of each derivative feature combination in a plurality of derivative feature combinations; and determining the target feature combination based on scores corresponding to the plurality of derivative feature variables, thereby reducing the number of the derivative feature variables and improving the model training efficiency.

In one embodiment, the determining the scores corresponding to the plurality of derived feature variables based on the initial dataset and a preset machine learning model includes:

adding each derivative feature variable or each derivative feature combination in a plurality of derivative feature combinations to the initial data set respectively to obtain a plurality of intermediate data sets;

training the preset machine learning model based on the plurality of intermediate data sets respectively to obtain a plurality of first machine learning models;

training the preset machine learning model based on the initial data set to obtain a second machine learning model;

and determining scores corresponding to the plurality of derived feature variables based on performance evaluation results of the plurality of first machine learning models and the second machine learning model.

It should be understood that the derivative feature variables do not belong to the initial data but are obtained by crossover operations on the initial feature variables, and that they are screened after being combined with the initial feature variables. Some of the plurality of derivative feature variables have low correlation with the initial feature variables, and training a model on these derivative feature variables together with the initial data may not improve the performance of the trained model; training a model on the other derivative feature variables together with the initial data can effectively improve the performance of the trained model. The former are deleted and the latter are retained, which improves the effectiveness of the target feature combination after screening. Each derivative feature variable is therefore scored, and the plurality of derivative feature variables are screened by score, so that the number of derivative feature variables is reduced.

Specifically, the preset machine learning model is trained on each of the plurality of intermediate data sets to obtain a plurality of first machine learning models, and is trained on the initial data set to obtain a second machine learning model. Performance evaluation is then carried out on the plurality of first machine learning models and the second machine learning model to obtain first training results of the first machine learning models and a second training result of the second machine learning model, where a first training result characterizes the performance of a model trained on an intermediate data set and the second training result characterizes the performance of the model trained on the initial data set. By comparing the first training results with the second training result, the performance improvement that adding a derivative feature variable or a derivative feature combination to the initial data set brings to model training can be determined, and the score corresponding to that derivative feature variable or derivative feature combination is then calculated. The higher the score rank, the greater the performance improvement of the trained model; the lower the score rank, the smaller the performance improvement, or the performance of the trained model may even be reduced.

The algorithm of the preset machine learning model can be an algorithm such as logistic regression or random forest.

In the embodiment of the application, a plurality of intermediate data sets are obtained by respectively adding each derivative feature variable or each derivative feature combination in a plurality of derivative feature combinations to the initial data set; the preset machine learning model is trained on the plurality of intermediate data sets respectively to obtain a plurality of first machine learning models; the preset machine learning model is trained on the initial data set to obtain a second machine learning model; and scores corresponding to the plurality of derived feature variables are determined based on performance evaluation results of the plurality of first machine learning models and the second machine learning model, so that the derived feature variables or the derived feature combinations can be screened by score and the number of derived feature variables is reduced.
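The scoring procedure of this embodiment can be illustrated with the following minimal sketch. Several choices here are assumptions made only to keep the example dependency-free: a nearest-centroid classifier stands in for the preset machine learning model (the description mentions logistic regression or random forest), the performance evaluation result is taken to be hold-out accuracy, and the score is taken to be the accuracy gain of a first model over the second model. All function names and the toy data are hypothetical.

```python
def train_centroids(rows, labels):
    """Fit a nearest-centroid classifier: one mean vector per class."""
    sums, counts = {}, {}
    for row, y in zip(rows, labels):
        acc = sums.setdefault(y, [0.0] * len(row))
        for k, v in enumerate(row):
            acc[k] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in total] for y, total in sums.items()}


def accuracy(model, rows, labels):
    """Fraction of rows whose nearest class centroid matches the label."""
    def predict(row):
        return min(model, key=lambda y: sum((a - b) ** 2
                                            for a, b in zip(row, model[y])))
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def evaluate(rows, labels):
    """Train on even-indexed rows and evaluate on odd-indexed rows."""
    model = train_centroids(rows[0::2], labels[0::2])
    return accuracy(model, rows[1::2], labels[1::2])


def score_derived(initial_rows, labels, derived_columns):
    """Score each derived column by its accuracy gain over the baseline."""
    baseline = evaluate(initial_rows, labels)            # second model
    scores = {}
    for name, column in derived_columns.items():
        # Intermediate data set: initial data plus one derived column.
        intermediate = [row + [v] for row, v in zip(initial_rows, column)]
        scores[name] = evaluate(intermediate, labels) - baseline  # first model
    return scores


# Toy data: the label is 1 exactly when x1 and x2 have the same sign.
rows = [[1, 1], [1, 1], [-1, -1], [-1, -1], [1, -1], [1, -1], [-1, 1], [-1, 1]]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
derived = {
    "x1*x2": [a * b for a, b in rows],
    "x1+x2": [a + b for a, b in rows],
}
scores = score_derived(rows, labels, derived)
print(scores)  # {'x1*x2': 0.5, 'x1+x2': 0.0}
```

In this toy case the product feature lifts accuracy from 0.5 to 1.0 and would be retained, whereas the sum feature brings no gain and would be a candidate for deletion.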

In one embodiment, the determining the target feature combination based on the scores corresponding to the plurality of derived feature variables and/or the scores of each derived feature combination of the plurality of derived feature combinations includes:

determining a target derivative feature variable based on the scores corresponding to the plurality of derivative feature variables and/or the scores of each derivative feature combination in the plurality of derivative feature combinations;

and carrying out iterative search processing on the plurality of related feature variables and the target derivative feature variable based on an evolutionary algorithm until reaching an iteration cut-off condition to obtain the target feature combination.

The evolutionary algorithm is an algorithm corresponding to a one-shot global screening strategy and may be, for example, a genetic algorithm or a differential evolution algorithm. The algorithm iterates over the plurality of related feature variables and the derivative feature variables remaining after redundant derivative feature variables have been deleted, to obtain a target feature combination that contains fewer feature variables while retaining the most effective ones.

The iterative search processing repeatedly searches over the target derived feature variable and the plurality of related feature variables until a set of features meeting the iteration cut-off condition is found; that set of features is the target feature combination. The iteration cut-off condition constrains the feature variables in the target feature combination obtained by feature evaluation, so that the resulting target feature combination satisfies the condition. The iteration cut-off condition includes at least one of a feature quantity and a performance parameter. For example, in the case where the iteration cut-off condition is the feature quantity, the number of feature variables included in the target feature combination equals the feature quantity of the iteration cut-off condition. As another example, in the case where the iteration cut-off condition is a performance parameter, a model trained based on the target feature parameters can meet the requirement of the performance parameter.

In the embodiment of the application, the target feature combination is obtained by carrying out iterative processing on the plurality of related feature variables and the target derivative feature variable based on an evolutionary algorithm until an iterative cutoff condition is reached, so that the feature variable included in the target feature combination meets the iterative cutoff condition, thereby facilitating the improvement of the speed of training the model in the model training process and the improvement of the performance of the trained model.
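One possible form of the iterative search is sketched below as a small genetic algorithm over feature subsets. This is an illustrative sketch under stated assumptions rather than the algorithm of the application: the function `evolve_feature_subset`, its parameters, the single-point crossover with one bit-flip mutation, and the toy `fitness` function with made-up `weights` are all hypothetical. The cut-off is assumed to combine a feature quantity limit (`max_features`) with a performance target (`target_fitness`), plus a generation limit as a safeguard.

```python
import random


def evolve_feature_subset(candidates, fitness, max_features, target_fitness,
                          pop_size=20, max_generations=50, seed=0):
    """Search for a feature subset; assumes len(candidates) >= 2."""
    rng = random.Random(seed)
    n = len(candidates)

    def decode(mask):
        return [c for c, keep in zip(candidates, mask) if keep]

    def score(mask):
        chosen = decode(mask)
        if not chosen or len(chosen) > max_features:
            return float("-inf")          # violates the feature quantity limit
        return fitness(chosen)

    # Single-feature masks guarantee at least one feasible individual.
    population = [tuple(i == j for j in range(n)) for i in range(n)]
    while len(population) < pop_size:
        population.append(tuple(rng.random() < 0.5 for _ in range(n)))
    best = max(population, key=score)

    for _ in range(max_generations):
        if score(best) >= target_fitness:  # iteration cut-off condition reached
            break
        parents = sorted(population, key=score, reverse=True)[: pop_size // 2]
        children = []
        while len(children) < pop_size - 1:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)      # single-point crossover
            child = list(a[:cut] + b[cut:])
            flip = rng.randrange(n)        # one bit-flip mutation
            child[flip] = not child[flip]
            children.append(tuple(child))
        population = [best] + children     # elitism: never lose the best mask
        best = max(population, key=score)
    return decode(best)


# Hypothetical usefulness of each candidate, with a small per-feature penalty.
weights = {"x1": 0.2, "x2": 0.1, "x1*x2": 0.5, "x3": 0.0, "x1/x3": -0.1}


def fitness(chosen):
    return sum(weights[c] for c in chosen) - 0.05 * len(chosen)


result = evolve_feature_subset(list(weights), fitness,
                               max_features=3, target_fitness=0.6)
print(result, fitness(result))
```

In practice the `fitness` callable would be replaced by an evaluation of a model trained on the chosen features; the per-feature penalty is one simple way to favour smaller combinations.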

As shown in fig. 4, the application further provides a feature processing method based on a target model, which is applied to the target model. In some embodiments, the target model includes a target feature combination obtained by the method shown in fig. 1, and the method includes:

step 401, obtaining information of a target feature combination and data to be processed, which are obtained by a feature processing method based on a causal relationship;

step 402, performing feature processing on the data to be processed by a target model based on the information of the target feature combination to obtain feature variables corresponding to the data to be processed; the target model is a model which is obtained based on the target feature combination training.

It should be understood that the feature engineering processing of the data to be processed includes at least one of the following:

performing feature derivation processing on the data to be processed;

performing feature screening on the data to be processed and/or the data obtained by the feature derivation processing;

and carrying out iterative search processing on the data to be processed and/or the data subjected to feature screening.

The target model is a model trained based on a target feature combination, and the target feature combination is obtained based on causal relationships. The information of the target feature combination may be the feature quantity in the target feature combination and/or the feature parameters of the target feature combination. By acquiring this information, the target model can perform feature engineering processing on the data to be processed so as to screen out, from the data to be processed, feature variables that meet the requirements. The feature engineering processing comprises several parts, such as feature derivation, feature screening and iterative search processing. The feature data obtained by performing feature engineering processing on the data to be processed are correlated with one another, which effectively improves the quality of the feature data; at the same time, the number of feature variables is smaller, which improves the training speed.

The target model performs feature derivation processing on the data to be processed based on the information of the target feature combination to obtain a plurality of derived variables; performing feature screening on the plurality of derived variables to obtain target derived variables; and carrying out iterative search processing on the target derivative variable and the data to be processed to obtain the required characteristic data.
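For illustration only, the sketch below shows one simplified way in which stored information about a target feature combination could be applied to a record of data to be processed. It assumes, beyond what the description states, that the information is kept as a list in which a string names an original variable and a `(left, op, right)` tuple names a derived variable; `CROSS_OPS`, `apply_target_combination` and the field names are hypothetical.

```python
import operator

# Crossover operators assumed for this sketch.
CROSS_OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def apply_target_combination(record, combination):
    """Compute the feature variables named in `combination` for one record."""
    features = {}
    for spec in combination:
        if isinstance(spec, str):          # an original (related) variable
            features[spec] = record[spec]
        else:                              # a derived variable: (left, op, right)
            left, op, right = spec
            features[f"{left}{op}{right}"] = CROSS_OPS[op](record[left],
                                                           record[right])
    return features


combination = ["age", ("income", "*", "tenure"), ("income", "-", "debt")]
record = {"age": 41, "income": 5000, "tenure": 3, "debt": 1200, "zip": 94107}
features = apply_target_combination(record, combination)
print(features)  # {'age': 41, 'income*tenure': 15000, 'income-debt': 3800}
```

Fields of the record that are not part of the target feature combination, such as `zip` here, are simply not carried forward.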

In one embodiment, the target model is obtained by:

responding to the trigger message, and displaying the information of the target feature combination, wherein the information of the target feature combination comprises the feature quantity and the feature parameters of the target feature combination;

and responding to the screening operation of the target feature combination, and training an initial model based on the screened target feature combination to obtain the target model.

The trigger message is a message indicating that training of the target model is ready to start, and the information of the target feature combination obtained by the feature processing method based on the causal relationship is displayed when the trigger message is received. The information of the target feature combination comprises the feature quantity and the feature parameters of the target feature combination. The initial model is trained with the target feature combination to obtain the target model, so that the target model can process the data to be processed based on the information of the target feature combination to obtain feature variables conforming to the feature quantity and the feature parameters.

The screening operation is an operation of screening the initial data set to obtain a target feature combination, and under the condition that the screening operation is completed, feature variables in the target feature combination accord with information of the target feature combination, and training is carried out on the initial model through the screened target feature combination to obtain a target model. In the embodiment of the application, the initial model is trained based on the target feature combination to obtain the target model, and because the number of feature variables in the target feature combination is smaller than that of feature variables in the related technology and the feature variables in the target feature combination have causal relation, the effectiveness of the extracted target feature combination can be effectively improved, the performance of the model is further improved, and the time consumption of training is reduced.

In some embodiments, training an initial model by a target feature combination to obtain the target model; in other embodiments, the target feature combination may be combined with a preset training set, and the initial model is trained through the combined training set to obtain the target model. The specific mode of training the initial model based on the target feature combination is not limited by the embodiment of the application. The preset training set may be a data set required for the disclosed model training, or other processed data sets that may be used for the model training, which is not limited in the embodiment of the present application.

Further, the target feature combination is obtained by feature evaluation based on an evolutionary algorithm and an iteration cut-off condition, and the iteration cut-off condition can be set according to user requirements. For example, the iteration cut-off condition is at least one of the following: the number of feature variables in the target feature combination is reduced to a preset value; a model trained based on the target feature parameters reaches the performance parameters; and the training time required for model training reaches a preset time. With the iteration cut-off condition in place, the initial model can be trained based on the target feature combination, and the resulting trained model can meet the user's requirements.
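A user-configurable iteration cut-off condition of this kind could be represented as in the following sketch. The class `CutoffCondition` and its field names are hypothetical, and the choice that every configured criterion must hold at the same time is an assumption; the description only states that the condition is at least one of the three criteria.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CutoffCondition:
    """Any field left as None is not part of the configured condition."""
    max_features: Optional[int] = None
    min_performance: Optional[float] = None
    max_training_seconds: Optional[float] = None

    def is_met(self, n_features, performance, training_seconds):
        checks = []
        if self.max_features is not None:
            checks.append(n_features <= self.max_features)
        if self.min_performance is not None:
            checks.append(performance >= self.min_performance)
        if self.max_training_seconds is not None:
            checks.append(training_seconds <= self.max_training_seconds)
        # Assumption: all configured criteria must hold; none configured -> False.
        return bool(checks) and all(checks)


cond = CutoffCondition(max_features=10, min_performance=0.9)
print(cond.is_met(n_features=8, performance=0.92, training_seconds=30.0))   # True
print(cond.is_met(n_features=12, performance=0.95, training_seconds=30.0))  # False
```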

For example, automatic feature engineering methods in the related art are compared with the method of the present application on feature variables extracted from the same data. The iteration cut-off condition in the method is that the model trained based on the target feature parameters reaches the performance parameters, that is, the model trained with the target feature combination performs better. As shown in the table in fig. 5, the automatic feature engineering methods in the related art are AutoFeat, openFE, FETCH and MACFE, the original data is labelled Original, and the causal-relationship-based automatic feature engineering method of the present application is labelled CasualFE. A larger average value and a larger best value/total value in the table indicate a better trained model. As fig. 5 shows, the model trained with the features obtained by the method of the present application performs better than the models trained with the features obtained by the automatic feature engineering methods in the related art.

As another example, the application can set the iteration cut-off condition as the training time required for model training and the number of feature variables, that is, the method screens out fewer feature variables and the model training time is shorter. As shown in the table in fig. 6, the automatic feature engineering methods in the related art are AutoFeat, openFE, FETCH and MACFE, the original data is labelled Original, and the causal-relationship-based automatic feature engineering method of the present application is labelled CasualFE. As can be seen from fig. 6, the method of the present application yields a smaller feature quantity S and a shorter training time T, so the time for model training can be effectively reduced.

Referring to fig. 7, fig. 7 is a block diagram of a feature processing apparatus based on causal relationship according to an embodiment of the present application, and as shown in fig. 7, a feature processing apparatus 700 based on causal relationship includes:

a first processing module 701, configured to determine a causal relationship between a plurality of initial feature variables in an initial dataset, and obtain a causal structure matrix, where the causal structure matrix is used to characterize a causal relationship between any two initial feature variables in the plurality of initial feature variables;

a second processing module 702 configured to determine a target feature variable from the plurality of initial feature variables and determine a plurality of related feature variables having an adjacency relationship with the target feature variable based on the causal structure matrix;

a third processing module 703, configured to perform a feature derivation operation based on the multiple related feature variables, to obtain multiple derived feature variables;

an evaluation module 704, configured to perform feature evaluation based on the plurality of derived feature variables, and determine a target feature combination.
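The cooperation of the four modules may be pictured with the following structural sketch, in which each module is an injected callable. The class `CausalFeatureProcessor`, its field names and the stub callables are hypothetical placeholders; in particular, the stub used for module 701 returns a fixed matrix and does not perform causal discovery.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CausalFeatureProcessor:
    discover_structure: Callable  # first processing module 701
    select_related: Callable      # second processing module 702
    derive_features: Callable     # third processing module 703
    evaluate: Callable            # evaluation module 704

    def run(self, dataset, target):
        matrix = self.discover_structure(dataset)
        related = self.select_related(matrix, target)
        derived = self.derive_features(dataset, related)
        return self.evaluate(dataset, related, derived)


# Stub callables, for demonstration only.
processor = CausalFeatureProcessor(
    discover_structure=lambda ds: [[0, 0, 1], [0, 0, 1], [0, 0, 0]],
    select_related=lambda m, t: [i for i in range(len(m)) if m[i][t] or m[t][i]],
    derive_features=lambda ds, rel: [f"cross({a},{b})"
                                     for a in rel for b in rel if a < b],
    evaluate=lambda ds, rel, der: rel + der,
)
result = processor.run({"x0": [1, 2], "x1": [3, 4], "x2": [5, 6]}, target=2)
print(result)  # [0, 1, 'cross(0,1)']
```

Keeping the modules as separate callables mirrors the apparatus structure, so that any one stage can be replaced without touching the others.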

In one embodiment, each of the plurality of related feature variables is a cause feature variable of the target feature variable, a result feature variable of the target feature variable, or a cause feature variable corresponding to the result feature variable of the target feature variable;

The second processing module 702 includes at least one of:

a first processing unit, configured to determine, based on structural information of the causal structural matrix, a cause feature variable of the target feature variable, an effect feature variable of the target feature variable, and a cause feature variable corresponding to the effect feature variable of the target feature variable from the plurality of initial feature variables;

and the second processing unit is used for determining at least one of the reason characteristic variable of the target characteristic variable, the result characteristic variable of the target characteristic variable and the reason characteristic variable corresponding to the result characteristic variable of the target characteristic variable as the plurality of related characteristic variables.

In one embodiment, the third processing module 703 includes:

and the third processing unit is used for executing crossover operation on each two related characteristic variables in the plurality of related characteristic variables corresponding to the target characteristic variable to obtain the plurality of derivative characteristic variables.

In one embodiment, the evaluation module 704 includes:

a scoring unit, configured to determine scores corresponding to the plurality of derivative feature variables based on the initial data set and a preset machine learning model, where the scores corresponding to the plurality of derivative feature variables include a score for each derivative feature variable and/or a score for each derivative feature combination in a plurality of derivative feature combinations;

and the determining unit is used for determining the target feature combination based on the scores corresponding to the derivative feature variables and/or the scores of each derivative feature combination in the derivative feature combinations.

In one embodiment, the scoring unit comprises:

an adding subunit, configured to add each derivative feature variable or each derivative feature combination of the plurality of derivative feature combinations to the initial data set respectively, to obtain a plurality of intermediate data sets;

the first sub-training unit is used for respectively training the preset machine learning model based on the plurality of intermediate data sets to obtain a plurality of first machine learning models;

the second sub-training unit is used for training the preset machine learning model based on the initial data set to obtain a second machine learning model;

and the first scoring subunit is used for determining scores corresponding to the derivative feature variables based on performance evaluation results of the first machine learning models and the second machine learning models.

In one embodiment, the determining unit comprises:

a second scoring subunit, configured to determine a target derived feature variable based on scores corresponding to the plurality of derived feature variables and/or scores of each derived feature combination in the plurality of derived feature combinations;

and the determining subunit is used for carrying out iterative search processing on the plurality of related characteristic variables and the target derivative characteristic variable based on an evolution algorithm until reaching an iteration cut-off condition to obtain the target characteristic combination.

The feature processing device based on causal relationship provided in the embodiment of the present application is capable of implementing each process of each embodiment of the feature processing method based on causal relationship shown in fig. 1, technical features are in one-to-one correspondence, and the same technical effects can be achieved, so that repetition is avoided, and no description is repeated here.

It should be noted that, the determining device for the feature combination in the embodiment of the present application may be a device, or may be a component, an integrated circuit, or a chip in an electronic device.

Referring to fig. 8, fig. 8 is a block diagram of a feature processing apparatus based on a target model according to an embodiment of the present application, which is applied to the target model. As shown in fig. 8, the model-based feature processing apparatus 800 includes:

an obtaining module 801, configured to obtain information of a target feature combination and data to be processed, which are obtained by a feature processing method based on a causal relationship;

the processing module 802 is configured to perform feature processing on the data to be processed according to the information of the target feature combination by using a target model, so as to obtain feature variables corresponding to the data to be processed; the target model is a model which is obtained based on the target feature combination training.

In one embodiment, the target model is obtained by:

The feature processing device based on the target model provided in the embodiment of the present application is capable of implementing each process of each embodiment of the model-based feature processing method shown in fig. 4, with technical features in one-to-one correspondence and the same technical effects achieved; to avoid repetition, no further description is given here.

Referring to fig. 9, fig. 9 is a schematic structural diagram of an electronic device provided by the embodiment of the present application, where the electronic device includes a memory 901, a processor 902, and a program or instructions stored on the memory 901 and executable on the processor 902. When the program or instructions are executed by the processor 902, any step in the method embodiment corresponding to fig. 1 or any step in the method embodiment corresponding to fig. 4 may be implemented with the same beneficial effects, which are not described again here.

The processor 902 may be, among other things, a central processing unit (Central Processing Unit, CPU), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA), or a graphics processing unit (Graphics Processing Unit, GPU).

Those of ordinary skill in the art will appreciate that all or a portion of the steps of implementing the methods of the embodiments described above may be implemented by hardware associated with program instructions, where the program may be stored on a readable medium.

The embodiment of the present application further provides a readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, may implement any step in the method embodiment corresponding to fig. 1 or any step in the method embodiment corresponding to fig. 4 and achieve the same technical effect; to avoid repetition, no further description is given here. The readable storage medium may be, for example, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.

The terms "first," "second," and the like in embodiments of the present application are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. Furthermore, the use of "and/or" in the present application means at least one of the connected objects; for example, "A and/or B and/or C" covers 7 cases: A alone, B alone, C alone, both A and B, both B and C, both A and C, and all of A, B and C.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

From the above description of the embodiments, it will be clear to those skilled in the art that the above-described example method may be implemented by means of software plus a necessary general hardware platform, but of course also by means of hardware, although in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising several instructions for causing a terminal (which may be a mobile phone, a computer, a server, an air conditioner, or a second terminal device, etc.) to perform the method of the various embodiments of the present application.

The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those having ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, which are to be protected by the present application.

Claims

1. A causal relationship-based feature processing method, comprising:

2. The method of claim 1, wherein the determining a plurality of related feature variables having an adjacency with the target feature variable based on the causal structure matrix comprises:

3. The method of claim 1, wherein performing the feature derivation operation based on the plurality of related feature variables results in a plurality of derived feature variables, comprising:

4. A method according to any one of claims 1 to 3, wherein said evaluating features based on said plurality of derived feature variables, determining a target feature combination, comprises:

5. The method of claim 4, wherein determining the scores corresponding to the plurality of derived feature variables based on the initial dataset and a pre-set machine learning model comprises:

6. The method of claim 4, wherein the determining the target feature combination based on the scores corresponding to the plurality of derivative feature variables and/or the scores of each of the plurality of derivative feature combinations comprises:

7. A feature processing method based on a target model, the method comprising:

acquiring information of a target feature combination and data to be processed;

8. The method of claim 7, wherein the target model is obtained by:

9. A causal relationship-based feature processing apparatus, comprising:

10. An electronic device comprising a processor, a memory and a computer program stored on the memory and executable on the processor, which when executed by the processor implements the steps of the causal relationship based feature processing method of any of claims 1 to 8.