CN111563067B - Feature processing method and device - Google Patents


Info

Publication number
CN111563067B
CN111563067B (application CN202010372184.XA)
Authority
CN
China
Prior art keywords
iteration
log file
feature
model
current
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010372184.XA
Other languages
Chinese (zh)
Other versions
CN111563067A (en)
Inventor
吴作鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bank of China Ltd
Original Assignee
Bank of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bank of China Ltd
Priority to CN202010372184.XA
Publication of CN111563067A
Application granted
Publication of CN111563067B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/18: File system types
    • G06F16/1805: Append-only file systems, e.g. using logs or journals to store data
    • G06F16/1815: Journaling file systems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/30: Monitoring
    • G06F11/34: Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3466: Performance evaluation by tracing or monitoring
    • G06F11/3476: Data logging
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a feature processing method and a feature processing device. An iteration model identifier uniquely corresponding to a feature combination is generated from the feature combination of all features to be evaluated in the current feature iteration, and the identifier is used as the log file name of the current feature iteration. When a target log file whose name matches the iteration model identifier is found, the target log file is parsed and the current model evaluation score of the current feature iteration is read from it. Because the iteration model identifier and the model evaluation score of each single model training are both recorded in a log file named after the identifier, when the feature iteration process is terminated by an uncertain factor, the corresponding model evaluation score can be recovered from that log file simply by recomputing the identifier at the point of termination, which reduces the time wasted on repeated model training and improves feature processing efficiency.

Description

Feature processing method and device
Technical Field
The invention relates to the technical field of computers, in particular to a feature processing method and device.
Background
When machine learning techniques are applied to production problems, a large amount of model training is usually required to obtain the best-performing model. During training, the effect of a large number of features must be evaluated, especially features constructed through feature derivation: some features have a positive effect on the model, while others disturb it. At present, these features are usually evaluated by iteratively training the model while gradually adding or gradually removing features, and the features with good effect are screened out according to the final evaluation score of each feature.
At present, in the feature screening process, feature quality is evaluated through iterative training. Once model training is unexpectedly terminated by some uncertain factor, the whole iterative training process may have to be restarted. Restarting wastes all the time already spent on model training before the program crashed, so the overall training process takes a long time.
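For concreteness, the following Python sketch illustrates the conventional forward-selection style loop described above. The helper names and the toy scoring function are illustrative assumptions, not part of the disclosure; the point is that a crash partway through loses every score computed so far.

```python
# Illustrative sketch of the conventional iterative feature screening loop.
# train_and_evaluate() is a toy stand-in for an expensive model training run.

def train_and_evaluate(features):
    """Pretend to train a model on `features` and return an evaluation score."""
    useful = {"age", "family_dep"}                     # hypothetical helpful features
    return len(useful & set(features)) - 0.1 * len(features)

def forward_selection(candidate_features):
    """Gradually add features, keeping each one only if it raises the score."""
    selected, best_score = [], float("-inf")
    for feature in candidate_features:
        trial = selected + [feature]
        score = train_and_evaluate(trial)              # expensive step, repeated many times
        if score > best_score:
            selected, best_score = trial, score
        # A crash here loses every score computed so far; the whole loop restarts.
    return selected, best_score

print(forward_selection(["age", "family_dep", "deployed_time"]))
```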
Disclosure of Invention
In view of this, the present invention discloses a feature processing method and apparatus. When the feature iteration process is terminated by an uncertain factor, the iteration model identifier at the point of termination can be recomputed and the corresponding model evaluation score can be obtained from the log file named after that identifier, which reduces the time wasted on repeated model training and improves feature processing efficiency.
A method of feature processing, comprising:
generating an iteration model identifier uniquely corresponding to the feature combination based on the feature combination of all features to be evaluated of the current feature iteration, and taking the iteration model identifier as the name of a log file of the current feature iteration;
judging whether a log file with the same log file name as the iterative model identification exists or not, and recording the log file as a target log file, wherein the log file records the iterative model identification obtained by calculation in single model training and a model evaluation score obtained by training;
and if so, analyzing the target log file, and acquiring the current model evaluation score of the current characteristic iteration from the analyzed target log file.
Optionally, when the iterative model identifier is an MD5 value, the generating, based on the feature combination of all the features to be evaluated of the current feature iteration, an iterative model identifier uniquely corresponding to the feature combination, and using the iterative model identifier as the name of the log file of the current feature iteration specifically includes:
generating an MD5 value for the feature combination by using the MD5 message-digest algorithm, and taking the MD5 value as the log file name of the current feature iteration.
Optionally, the method further includes:
if not, performing model training on all the features to be evaluated to obtain the current model evaluation score of the current feature iteration, and storing the iteration model identification and the current model evaluation score into a log file taking the iteration model identification as the name of the log file in a corresponding relation mode.
Optionally, after obtaining the current model evaluation score, the method further includes:
judging whether the current characteristic iteration is the last characteristic iteration of all the characteristic iterations;
if so, selecting, from all the model evaluation scores generated in the iteration processes, the feature combination with the highest model evaluation score as the best feature combination obtained by screening.
A feature processing apparatus comprising:
the identification generation unit is used for generating an iteration model identification which is uniquely corresponding to the characteristic combination based on the characteristic combination of all the characteristics to be evaluated of the current characteristic iteration, and taking the iteration model identification as the name of the log file of the current characteristic iteration;
the first judging unit is used for judging whether a log file with the same log file name as the iterative model identification exists and recording the log file as a target log file, wherein the log file records the iterative model identification obtained by calculation during single model training and a model evaluation score obtained by training;
and the analysis unit is used for analyzing the target log file and acquiring the current model evaluation score of the current feature iteration from the analyzed target log file when the first judging unit determines that such a log file exists.
Optionally, the identifier generating unit is specifically configured to:
generating an MD5 value for the feature combination by using the MD5 message-digest algorithm, and taking the MD5 value as the log file name of the current feature iteration.
Optionally, the method further includes:
and the training unit is used for performing model training on all the features to be evaluated when the first judging unit determines that no such log file exists, obtaining the current model evaluation score of the current feature iteration, and storing the iteration model identifier and the current model evaluation score, as a corresponding pair, into a log file named after the iteration model identifier.
Optionally, the method further includes:
a second judging unit, configured to judge whether the current feature iteration is the last feature iteration of all feature iterations after the analyzing unit or the training unit obtains the current model evaluation score;
and the searching unit is used for selecting, when the second judging unit determines that the current feature iteration is the last one, the feature combination with the highest model evaluation score from all the model evaluation scores generated in all the iteration processes as the best feature combination obtained by screening.
According to the above technical scheme, the invention discloses a feature processing method and a feature processing device. An iteration model identifier uniquely corresponding to the feature combination is generated from the feature combination of all features to be evaluated in the current feature iteration, and the identifier is used as the log file name of the current feature iteration. When a log file whose name matches the iteration model identifier, namely a target log file, is found, the target log file is parsed and the current model evaluation score of the current feature iteration is read from it. Because the iteration model identifier calculated during a single model training and the model evaluation score obtained from that training are recorded in a log file named after the identifier, when the feature iteration process is terminated by an uncertain factor, the corresponding model evaluation score can be recovered from that log file by recomputing the identifier at the point of termination, which reduces the time wasted on repeated model training and improves feature processing efficiency.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is apparent that the drawings described below are only embodiments of the present invention; for those skilled in the art, other drawings can be obtained from the disclosed drawings without creative effort.
FIG. 1 is a flow chart of a feature processing method disclosed in an embodiment of the present invention;
FIG. 2 is a flow chart of another feature processing method disclosed in the embodiments of the present invention;
FIG. 3 is a flow chart of another feature processing method disclosed in the embodiments of the present invention;
FIG. 4 is a schematic structural diagram of a feature processing apparatus according to an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of another feature processing apparatus according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of another feature processing apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention discloses a feature processing method and a feature processing device. An iteration model identifier uniquely corresponding to the feature combination is generated from the feature combination of all features to be evaluated in the current feature iteration, and the identifier is used as the log file name of the current feature iteration. When a log file whose name matches the iteration model identifier, namely a target log file, is found, the target log file is parsed and the current model evaluation score of the current feature iteration is read from it. Because the iteration model identifier calculated during a single model training and the model evaluation score obtained from that training are both recorded in a log file named after the identifier, when the feature iteration process is terminated by an uncertain factor, the corresponding model evaluation score can be recovered from that log file by recomputing the identifier at the point of termination, which reduces the time wasted on repeated model training and improves feature processing efficiency.
Referring to fig. 1, a flowchart of a feature processing method according to an embodiment of the present invention includes:
step S101, generating an iteration model identification uniquely corresponding to the feature combination based on the feature combination of all features to be evaluated of the current feature iteration, and taking the iteration model identification as the name of a log file of the current feature iteration;
the iteration model identifications are used for distinguishing different iteration steps, and the iteration model identifications generated by the same iteration step are the same.
Optionally, the iteration model identifier may be an MD5 value obtained with the MD5 message-digest algorithm.
The implementation of step S101 may specifically include: generating an MD5 value for the feature combination of all features to be evaluated of the current feature iteration by using the MD5 message-digest algorithm, and taking that MD5 value as the log file name of the current feature iteration.
The MD5 Message-Digest Algorithm is a widely used cryptographic hash function that produces a 128-bit (16-byte) hash value and is commonly used to check the integrity of transmitted messages.
Of course, in practical applications, other methods, such as a hash algorithm, may also be used to generate the iterative model identifier, which is determined according to practical situations, and the present invention is not limited herein.
The feature combination includes: model features, the evaluation algorithm, and the model parameters.
The invention determines an iteration model identifier for the training task. The identifier is an MD5 value generated from the feature combination of all the features to be evaluated and serves as the unique identifier of the current feature iteration; as long as all the features to be evaluated remain unchanged, the iteration model identifier is guaranteed to stay consistent.
Specifically, the names of all the features to be evaluated are combined and spliced into one character string, an MD5 value is generated for that string with the MD5 message-digest algorithm, and the MD5 value is used as the log file name, so that the corresponding log record can easily be found from the MD5 value as long as the features to be evaluated have not changed.
For example, assume that there are three features to be evaluated: age, family_dep and deployed_time. The three feature names are combined and spliced into one character string, and applying the MD5 message-digest algorithm to that string yields the MD5 value bccabac92B7a7138F8146EF08606a67EB. In this way, no matter how long the spliced string of features to be evaluated is, it is always converted into a fixed 32-character string, i.e., the MD5 value.
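As a concrete illustration, the following Python sketch derives such an identifier with the standard hashlib library. The splicing convention (sorting and joining with "|") and the inclusion of the evaluation algorithm and model parameters are assumptions made for this example, not requirements of the disclosure.

```python
import hashlib

def iteration_model_id(features, algorithm, params):
    """Derive a deterministic 32-character MD5 identifier for one feature iteration.

    The same feature combination (feature names, evaluation algorithm, model
    parameters) always yields the same identifier, so the log file name is
    reproducible as long as the features to be evaluated do not change.
    """
    parts = sorted(features) + [algorithm] + [f"{k}={v}" for k, v in sorted(params.items())]
    spliced = "|".join(parts)                      # splicing convention assumed for this sketch
    return hashlib.md5(spliced.encode("utf-8")).hexdigest()

# Hypothetical feature combination matching the example above.
model_id = iteration_model_id(["age", "family_dep", "deployed_time"], "auc", {"max_depth": 4})
log_file_name = f"{model_id}.log"                  # iteration model identifier as log file name
```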
Step S102, judging whether a log file with the same log file name as the iterative model identification exists or not, recording the log file as a target log file, and if so, executing step S103;
after the single model training is finished, the iterative model identification obtained by calculation in the single model training and the model evaluation score obtained by training are stored in the log file with the iterative model identification as the name of the log file.
Therefore, an iterative model identifier obtained by calculation during single model training and a model evaluation score obtained by training are recorded in the log file, and the iterative model identifier is obtained based on a feature combination of features to be evaluated during single model training.
It can be understood that a plurality of log files may already have been generated before the present feature iteration training. When a log file whose name is identical to the iteration model identifier generated in the present feature iteration is found, that log file is recorded as the target log file.
It should be noted that, in the feature screening process, multiple feature iteration trainings need to be performed on all the features to be evaluated, and a log file is generated after each feature iteration training is completed. To distinguish the log files, the log file generated in the first feature iteration training can be named with the iteration model identifier plus an "01" suffix, the log file generated in the second feature iteration training with the iteration model identifier plus an "02" suffix, and so on.
And step S103, analyzing the target log file, and acquiring the current model evaluation score of the current characteristic iteration from the analyzed target log file.
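A minimal sketch of steps S102 and S103 is shown below, assuming the key:value record format described later (identifier and score separated by a colon) and a hypothetical logs directory; the function and path names are illustrative.

```python
import os

def cached_score(model_id, log_dir="logs"):
    """Return the model evaluation score recorded for `model_id`, or None.

    Assumes each completed training run appended a line of the form
    "<iteration model identifier>:<model evaluation score>" to a log file
    named after the identifier.
    """
    target = os.path.join(log_dir, f"{model_id}.log")   # candidate target log file
    if not os.path.exists(target):                      # step S102: no matching log file
        return None
    with open(target, encoding="utf-8") as fh:          # step S103: parse the target log file
        for line in fh:
            key, _, value = line.strip().partition(":")
            if key == model_id:
                return float(value)                     # cached current model evaluation score
    return None
```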
In summary, the feature processing method disclosed in the present invention generates an iteration model identifier uniquely corresponding to the feature combination of all features to be evaluated in the current feature iteration, uses the identifier as the log file name of the current feature iteration, and, when a log file whose name matches the identifier (the target log file) is found, parses the target log file and reads the current model evaluation score of the current feature iteration from it. Because the iteration model identifier calculated during a single model training and the model evaluation score obtained from that training are both recorded in a log file named after the identifier, when the feature iteration process is terminated by an uncertain factor, the corresponding model evaluation score can be recovered from that log file by recomputing the identifier at the point of termination, which reduces the time wasted on repeated model training and improves feature processing efficiency.
To further optimize the above embodiment, referring to FIG. 2, which is a flowchart of a feature processing method disclosed in another embodiment of the present invention, the method may further include the following step when the determination in step S102 is negative:
and S104, performing model training on all the features to be evaluated to obtain the current model evaluation score of the current feature iteration, and storing the iteration model identification and the current model evaluation score into a log file with the iteration model identification as the name of the log file in a corresponding relationship mode.
In practical application, feature iteration information can be generated according to features to be evaluated, an evaluation algorithm and model parameters, a feature iteration process is executed based on the feature iteration information, and model training is performed on all the features to be evaluated.
The MD5 value obtained in the feature iteration and the current model evaluation score obtained from training may be recorded in the log file as a key:value pair.
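A sketch of this step, reusing the helpers from the earlier sketches (iteration_model_id and train_and_evaluate, both illustrative assumptions), might look as follows:

```python
import os

def train_and_log(features, algorithm, params, log_dir="logs"):
    """Train on all features to be evaluated and persist the score keyed by the identifier."""
    model_id = iteration_model_id(features, algorithm, params)
    score = train_and_evaluate(features)                        # single model training
    os.makedirs(log_dir, exist_ok=True)
    with open(os.path.join(log_dir, f"{model_id}.log"), "a", encoding="utf-8") as fh:
        fh.write(f"{model_id}:{score}\n")                       # key:value record in the log file
    return score
```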
To further optimize the foregoing embodiment, referring to FIG. 3, which is a flowchart of a feature processing method disclosed in yet another embodiment of the present invention, on the basis of the embodiment shown in FIG. 2 the method may further include the following steps after the current model evaluation score of the current feature iteration is obtained, that is, after step S103 or step S104:
step S105, judging whether the current feature iteration is the last feature iteration of all the feature iterations, if not, returning to execute step S101, and if so, executing step S106;
when feature screening is performed, multiple feature iteration processes are usually required to be performed, and only after all feature iteration processes are finished, an optimal feature combination can be screened out.
Therefore, after each feature iteration is finished, whether the feature iteration is the last feature iteration of all the feature iterations needs to be judged, if not, the next feature iteration is continuously executed, and if so, the subsequent feature screening operation is continuously executed.
Step S106, selecting, from all the model evaluation scores generated in the iteration processes, the feature combination with the highest model evaluation score as the best feature combination obtained by screening.
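Putting steps S101 through S106 together, the overall loop could be sketched as follows. It reuses the illustrative helpers defined above (iteration_model_id, cached_score, train_and_log), and the list of planned feature combinations is an assumed input rather than something prescribed by the disclosure.

```python
def best_feature_combination(planned_iterations, algorithm, params):
    """Run every planned feature iteration, reusing cached scores, and pick the best.

    `planned_iterations` is a list of feature combinations, one per feature
    iteration; after the last iteration the combination with the highest model
    evaluation score is returned.
    """
    results = []
    for features in planned_iterations:
        model_id = iteration_model_id(features, algorithm, params)   # step S101
        score = cached_score(model_id)                               # steps S102-S103
        if score is None:
            score = train_and_log(features, algorithm, params)       # step S104
        results.append((score, tuple(features)))
    best_score, best_features = max(results)                         # steps S105-S106
    return list(best_features), best_score
```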
In summary, the feature processing method disclosed in the present invention generates an iteration model identifier uniquely corresponding to the feature combination of all features to be evaluated in the current feature iteration, uses the identifier as the log file name of the current feature iteration, and, when the target log file whose name matches the identifier is found, parses the target log file and reads the current model evaluation score of the current feature iteration from it. When the target log file is not found, model training is performed on all the features to be evaluated to obtain the current model evaluation score of the feature iteration, and the iteration model identifier and the current model evaluation score are stored as a corresponding pair in a log file named after the identifier for later use. Because the identifier calculated during a single model training and the score obtained from that training are both recorded in a log file named after the identifier, when the feature iteration process is terminated by an uncertain factor, the corresponding model evaluation score can be recovered from that log file by recomputing the identifier at the point of termination, which reduces the time wasted on repeated model training and improves feature processing efficiency. Moreover, no additional post-processing is needed: once all the feature iterations are finished, the best feature combination can be screened out directly.
Corresponding to the embodiment of the method, the invention also discloses a characteristic processing device.
Referring to fig. 4, a schematic structural diagram of a feature processing apparatus according to an embodiment of the present invention includes:
an identifier generating unit 201, configured to generate an iterative model identifier uniquely corresponding to a feature combination based on the feature combination of all features to be evaluated of a current feature iteration, and use the iterative model identifier as a log file name of the current feature iteration;
the iterative model identifiers are used for distinguishing different iterative steps, and the iterative model identifiers generated by the same iterative step are the same.
The characteristic combination comprises: model features, evaluation algorithms, and model parameters.
Optionally, the iteration model identifier may be an MD5 value obtained with the MD5 message-digest algorithm.
Therefore, the identifier generating unit 201 may specifically be configured to:
generating an MD5 value for the feature combination by using the MD5 message-digest algorithm, and taking the MD5 value as the log file name of the current feature iteration.
Specifically, the names of all the features to be evaluated are combined and spliced into one character string, an MD5 value is generated for that string with the MD5 message-digest algorithm, and the MD5 value is used as the log file name, so that the corresponding log record can easily be found from the MD5 value as long as the features to be evaluated have not changed.
A first judging unit 202, configured to judge whether a log file with a log file name that is the same as the iterative model identifier exists, and record the log file as a target log file, where an iterative model identifier obtained through calculation during single model training and a model evaluation score obtained through training are recorded in the log file;
after the single model training is finished, the iterative model identification obtained by calculation in the single model training and the model evaluation score obtained by training are stored in the log file with the iterative model identification as the name of the log file.
Therefore, an iterative model identifier obtained by calculation during single model training and a model evaluation score obtained by training are recorded in the log file, and the iterative model identifier is obtained based on a feature combination of features to be evaluated during single model training.
It can be understood that a plurality of log files may already have been generated before the present feature iteration training. When a log file whose name is identical to the iteration model identifier generated in the present feature iteration is found, that log file is recorded as the target log file.
It should be noted that, in the feature screening process, feature iteration training needs to be performed on all the features to be evaluated multiple times, and a log file is generated after each feature iteration training is completed. To distinguish the log files, the log file generated in the first feature iteration training can be named with the iteration model identifier plus an "01" suffix, the log file generated in the second feature iteration training with the iteration model identifier plus an "02" suffix, and so on.
An analyzing unit 203, configured to, when the first determining unit 202 determines that such a log file exists, analyze the target log file and obtain the current model evaluation score of the current feature iteration from the analyzed target log file.
In summary, the feature processing apparatus disclosed in the present invention generates an iteration model identifier uniquely corresponding to the feature combination of all features to be evaluated in the current feature iteration, uses the identifier as the log file name of the current feature iteration, and, when a log file whose name matches the identifier (the target log file) is found, parses the target log file and reads the current model evaluation score of the current feature iteration from it. Because the iteration model identifier calculated during a single model training and the model evaluation score obtained from that training are both recorded in a log file named after the identifier, when the feature iteration process is terminated by an uncertain factor, the corresponding model evaluation score can be recovered from that log file by recomputing the identifier at the point of termination, which reduces the time wasted on repeated model training and improves feature processing efficiency.
To further optimize the above embodiment, referring to FIG. 5, which is a schematic structural diagram of a feature processing apparatus disclosed in another embodiment of the present invention, the apparatus may further include, on the basis of the embodiment shown in FIG. 4:
a training unit 204, configured to perform model training on all the features to be evaluated under the condition that the first determining unit 202 determines that the features are not evaluated, obtain a current model evaluation score of the current feature iteration, and store the iteration model identifier and the current model evaluation score in a log file with the iteration model identifier as a log file name in a form of a corresponding relationship.
In practical application, feature iteration information can be generated according to features to be evaluated, an evaluation algorithm and model parameters, a feature iteration process is executed based on the feature iteration information, and model training is performed on all the features to be evaluated.
The MD5 value obtained in the feature iteration and the current model evaluation score obtained from training may be recorded in the log file as a key:value pair.
To further optimize the above embodiment, referring to FIG. 6, which is a schematic structural diagram of a feature processing apparatus disclosed in yet another embodiment of the present invention, the apparatus may further include, on the basis of the embodiment shown in FIG. 5:
a second determining unit 205, configured to determine whether the current feature iteration is the last feature iteration of all feature iterations after the analyzing unit 203 or the training unit 204 obtains the current model evaluation score;
when feature screening is performed, multiple feature iteration processes are usually required to be performed, and only after all the feature iteration processes are finished, the optimal feature combination can be screened out.
Therefore, after each feature iteration is finished, whether the feature iteration is the last feature iteration of all the feature iterations needs to be judged, if not, the next feature iteration is continuously executed, and if so, the subsequent feature screening operation is continuously executed.
A searching unit 206, configured to, when the second determining unit 205 determines that the current feature iteration is the last one, select the feature combination with the highest model evaluation score from all the model evaluation scores generated in all the iteration processes as the best feature combination obtained by screening.
If the second determining unit 205 determines that it is not the last feature iteration, the process returns to the identifier generating unit 201.
In summary, the feature processing apparatus disclosed in the present invention generates an iteration model identifier uniquely corresponding to the feature combination of all features to be evaluated in the current feature iteration, uses the identifier as the log file name of the current feature iteration, and, when the target log file whose name matches the identifier is found, parses the target log file and reads the current model evaluation score of the current feature iteration from it. When the target log file is not found, model training is performed on all the features to be evaluated to obtain the current model evaluation score of the feature iteration, and the iteration model identifier and the current model evaluation score are stored as a corresponding pair in a log file named after the identifier for later use. Because the identifier calculated during a single model training and the score obtained from that training are both recorded in a log file named after the identifier, when the feature iteration process is terminated by an uncertain factor, the corresponding model evaluation score can be recovered from that log file by recomputing the identifier at the point of termination, which reduces the time wasted on repeated model training and improves feature processing efficiency. Moreover, no additional post-processing is needed: once all the feature iterations are finished, the best feature combination can be screened out directly.
Finally, it should also be noted that, herein, relational terms such as first and second are used only to distinguish one entity or action from another and do not necessarily require or imply any actual relationship or order between such entities or actions. Moreover, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a/an ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises that element.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (8)

1. A feature processing method, comprising:
generating an iteration model identifier uniquely corresponding to the feature combination based on the feature combination of all features to be evaluated of the current feature iteration, and taking the iteration model identifier as the name of a log file of the current feature iteration;
judging whether a log file with the same log file name as the iterative model identification exists, wherein the log file records the iterative model identification obtained by calculation in single model training and the model evaluation score obtained by training;
if so, recording the log file with the same log file name as the iterative model identification as a target log file, analyzing the target log file, and acquiring the current model evaluation score of the current characteristic iteration from the analyzed target log file.
2. The feature processing method according to claim 1, wherein when the iterative model identifier is an MD5 value, the generating of an iterative model identifier uniquely corresponding to the feature combination based on the feature combination of all features to be evaluated of the current feature iteration and taking the iterative model identifier as a log file name of the current feature iteration specifically includes:
generating an MD5 value for the feature combination by using the MD5 message-digest algorithm, and taking the MD5 value as the log file name of the current feature iteration.
3. The feature processing method according to claim 1, further comprising:
if not, performing model training on all the features to be evaluated to obtain the current model evaluation score of the current feature iteration, and storing the iteration model identification and the current model evaluation score into a log file taking the iteration model identification as the name of the log file in a corresponding relation mode.
4. The feature processing method according to claim 3, further comprising, after obtaining the current model evaluation score:
judging whether the current characteristic iteration is the last characteristic iteration of all the characteristic iterations;
if so, finding the feature combination with the highest model evaluation score from all the model evaluation scores generated in the iterative process as the best feature combination obtained by screening.
5. A feature processing apparatus, characterized by comprising:
the identification generation unit is used for generating an iteration model identification uniquely corresponding to the characteristic combination based on the characteristic combination of all the characteristics to be evaluated of the current characteristic iteration, and taking the iteration model identification as the name of the log file of the current characteristic iteration;
the first judging unit is used for judging whether a log file with the same log file name as the iterative model identification exists, wherein the log file records the iterative model identification obtained by calculation in single model training and the model evaluation score obtained by training;
and the analysis unit is used for recording the log file with the same log file name as the iterative model identifier as a target log file, analyzing the target log file and acquiring the current model evaluation score of the current characteristic iteration from the analyzed target log file under the condition that the first judgment unit judges that the log file name is the same as the iterative model identifier.
6. The feature processing apparatus according to claim 5, wherein the identifier generating unit is specifically configured to:
generating an MD5 value for the feature combination by using the MD5 message-digest algorithm, and taking the MD5 value as the log file name of the current feature iteration.
7. The feature processing apparatus according to claim 5, characterized by further comprising:
and the training unit is used for performing model training on all the features to be evaluated when the first judging unit determines that no such log file exists, obtaining the current model evaluation score of the current feature iteration, and storing the iteration model identifier and the current model evaluation score, as a corresponding pair, into a log file named after the iteration model identifier.
8. The feature processing apparatus according to claim 7, characterized by further comprising:
a second judging unit, configured to judge whether the current feature iteration is the last feature iteration of all feature iterations after the analyzing unit or the training unit obtains the current model evaluation score;
and the searching unit is used for selecting, when the second judging unit determines that the current feature iteration is the last one, the feature combination with the highest model evaluation score from all the model evaluation scores generated in all the iteration processes as the best feature combination obtained by screening.
CN202010372184.XA 2020-05-06 2020-05-06 Feature processing method and device Active CN111563067B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010372184.XA CN111563067B (en) 2020-05-06 2020-05-06 Feature processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010372184.XA CN111563067B (en) 2020-05-06 2020-05-06 Feature processing method and device

Publications (2)

Publication Number Publication Date
CN111563067A (en) 2020-08-21
CN111563067B (en) 2023-04-14

Family

ID=72070811

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010372184.XA Active CN111563067B (en) 2020-05-06 2020-05-06 Feature processing method and device

Country Status (1)

Country Link
CN (1) CN111563067B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105975604A (en) * 2016-05-12 2016-09-28 清华大学 Distribution iterative data processing program abnormity detection and diagnosis method
CN108537289A (en) * 2018-04-24 2018-09-14 百度在线网络技术(北京)有限公司 Training method, device and the storage medium of data identification model
CN108881283A (en) * 2018-07-13 2018-11-23 杭州安恒信息技术股份有限公司 Assess model training method, device and the storage medium of network attack
CN109711555A (en) * 2018-12-21 2019-05-03 北京瀚海星云科技有限公司 A kind of method and system of predetermined depth learning model single-wheel iteration time
CN110298379A (en) * 2019-05-23 2019-10-01 中国平安人寿保险股份有限公司 Assessment models selection method, device, computer equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2672394C1 (en) * 2017-07-26 2018-11-14 Общество С Ограниченной Ответственностью "Яндекс" Methods and systems for evaluation of training objects through a machine training algorithm
EP3544236B1 (en) * 2018-03-21 2022-03-09 Telefonica, S.A. Method and system for training and validating machine learning algorithms in data network environments

Also Published As

Publication number Publication date
CN111563067A (en) 2020-08-21

Similar Documents

Publication Publication Date Title
CN111340242B (en) Model joint training method and device for protecting privacy
JP5150266B2 (en) Automatic identification of repeated material in audio signals
JP5460887B2 (en) Classification rule generation device and classification rule generation program
JP2011516989A (en) Search result ranking using edit distance and document information
CN106445643B (en) It clones, the method and apparatus of upgrading virtual machine
US20210263979A1 (en) Method, system and device for identifying crawler data
CN109063433B (en) False user identification method and device and readable storage medium
WO2020063524A1 (en) Method and system for determining legal instrument
CN112328499A (en) Test data generation method, device, equipment and medium
CN115982053A (en) Method, device and application for detecting software source code defects
CN106354587A (en) Mirror image server and method for exporting mirror image files of virtual machine
JP2017027495A (en) Verification device, classification system, verification method, classification method, and computer program
CN111563067B (en) Feature processing method and device
CN112905370A (en) Topological graph generation method, anomaly detection method, device, equipment and storage medium
US20180309854A1 (en) Protocol model generator and modeling method thereof
Brunelle et al. Archiving deferred representations using a two-tiered crawling approach
CN113255742A (en) Policy matching degree calculation method and system, computer equipment and storage medium
CN112698861A (en) Source code clone identification method and system
CN112437022A (en) Network flow identification method, equipment and computer storage medium
CN116401229A (en) Database data verification method, device and equipment
JP2020525949A (en) Media search method and device
CN111198818B (en) Information acquisition method and device
CN113885789A (en) Method, system, device and medium for verifying data consistency after metadata repair
CN112052245B (en) Method and device for judging attack behavior in network security training
JP7456289B2 (en) Judgment program, judgment method, and information processing device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant