WO2020259325A1 - Feature processing method applicable to machine learning, and device - Google Patents

Feature processing method applicable to machine learning, and device

Info

Publication number
WO2020259325A1
WO2020259325A1 PCT/CN2020/095934 CN2020095934W
Authority
WO
WIPO (PCT)
Prior art keywords
feature
relationship
processing
dependency
dependent
Prior art date
Application number
PCT/CN2020/095934
Other languages
French (fr)
Chinese (zh)
Inventor
兰冲
Original Assignee
深圳前海微众银行股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳前海微众银行股份有限公司
Publication of WO2020259325A1 publication Critical patent/WO2020259325A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval of structured data, e.g. relational data
    • G06F16/22 - Indexing; Data structures therefor; Storage structures
    • G06F16/2282 - Tablespace storage structures; Management thereof
    • G06F16/28 - Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 - Relational databases

Definitions

  • the present invention relates to the technical field of financial technology (Fintech), and in particular to a feature processing method and device suitable for machine learning.
  • in the prior art, features are usually stored in a Hive data warehouse, which provides the ability to process features with SQL and to store them.
  • the embodiments of the present invention provide a feature processing method and device suitable for machine learning, which at least solve the prior-art problem that features and feature processing logic are not managed in a unified way, which reduces the accuracy of supervision data analysis.
  • an embodiment of the present invention provides a feature processing method suitable for machine learning, including:
  • the feature table at least comprises a feature list, an associated feature library, a dependent feature table, an associated business, and feature processing logic.
  • the feature list includes at least one feature, the dependent feature table is used to record the other feature tables that have a dependency relationship with each feature table, and the feature processing request includes the feature that needs to be processed;
  • parallel feature processing is performed on the feature tables that currently have no dependency relationship, so as to obtain training data from the features after parallel processing.
  • the data features in the database are saved in the form of a feature table.
  • the feature table includes multiple features and the processing logic of those features, and to facilitate feature processing, each feature table records the feature tables it depends on; in this way, the dependency relationships between feature tables can be maintained and reused, reducing the cost of repeatedly computing those dependency relationships.
  • the features are managed through the feature table, which clearly expresses the dependencies between features, brings convenience to feature addition, deletion, and maintenance, and makes subsequent training data more accurate, thereby improving the accuracy of regulatory data analysis.
  • the determining the feature dependency relationship according to the feature to be processed and each feature in the feature pool includes:
  • the feature to be processed is taken as the root node, and the feature table that has a direct dependency relationship or an indirect dependency relationship with the root node is taken as an upper node to construct a feature dependency tree.
  • the dependency relationships between features can be better organized in the form of a feature dependency tree, which is convenient for feature processing and feature management.
  • the determining, based on the feature dependency relationship, the feature tables that currently have no dependency, and the adding of those feature tables to the feature processing path as a parallel subtask, includes:
  • the method further includes:
  • the processed features are passed through multiple consecutive processing steps to obtain machine features.
  • an embodiment of the present invention provides a feature processing device suitable for machine learning, including:
  • the obtaining unit is used to obtain the feature processing request corresponding to the data, and construct a feature pool according to each feature in each feature table.
  • the feature table is at least composed of a feature list, an associated feature library, a dependent feature table, an associated business, and feature processing logic,
  • the feature list includes at least one feature
  • the dependent feature table is used to record other feature tables that have a dependency relationship with each feature table
  • the feature processing request includes the feature to be processed;
  • the feature processing path determination unit is used to determine the feature dependency relationship according to the feature to be processed and each feature in the feature pool, determine from the feature dependency relationship the feature tables that currently have no dependency, and add those feature tables to the feature processing path as a parallel subtask;
  • the feature processing unit is configured to perform parallel feature processing on the feature tables with no dependency relationship according to the feature processing path, so as to obtain training data based on the parallel-processed features.
  • the feature processing path determining unit is specifically configured to:
  • the feature to be processed is taken as the root node, and the feature table that has a direct dependency relationship or an indirect dependency relationship with the root node is taken as an upper node to construct a feature dependency tree.
  • the feature processing path determining unit is specifically configured to:
  • the feature processing unit is further configured to:
  • the processed features are passed through multiple consecutive processing steps to obtain machine features.
  • an embodiment of the present invention provides a computer device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor; when the processor executes the computer program, the steps of the above feature processing method suitable for machine learning are implemented.
  • an embodiment of the present invention provides a computer-readable storage medium that stores a computer program executable by a computer device; when the program runs on the computer device, the computer device executes the steps of the above feature processing method suitable for machine learning.
  • FIG. 1 is a schematic flowchart of a feature processing method suitable for machine learning according to an embodiment of the present invention
  • FIG. 2 is a schematic diagram of a feature management structure provided by an embodiment of the present invention.
  • FIG. 3 is a schematic diagram of service level management of a feature table provided by an embodiment of the present invention.
  • Figure 4 is a schematic diagram of a feature dependency tree provided by an embodiment of the present invention.
  • FIG. 5 is a schematic flowchart of a feature processing pipeline provided by an embodiment of the present invention.
  • FIG. 6 is a schematic flowchart of a feature processing method suitable for machine learning according to an embodiment of the present invention.
  • FIG. 7 is a schematic structural diagram of a feature processing device suitable for machine learning according to an embodiment of the present invention.
  • FIG. 8 is a schematic structural diagram of a computer device provided by an embodiment of the present invention.
  • Feature engineering: the process of obtaining, sorting, and processing, from data, features that can be understood and easily handled by computer programs. Its main purpose is to provide input data for machine learning training, evaluation, and prediction.
  • Machine learning: the process by which a computer program automatically analyzes data to obtain rules and uses those rules to predict unknown data.
  • Missing value processing: the handling applied when feature data is missing, such as filling with 0.
  • Machine features: features processed into the form required by machine learning algorithms.
  • One-hot code: a code that maps the multiple values of a feature into multiple bits; the bit corresponding to the feature's value is 1, and the other bits are 0.
  • Topological sorting: a sorting algorithm that ranks the elements with no dependencies first.
  • In-degree: for a node in a directed graph, the number of edges pointing to the node.
  • Out-degree: for a node in a directed graph, the number of edges pointing from the node to other nodes.
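The definitions above (topological sorting, in-degree, out-degree) can be illustrated with a minimal Python sketch of Kahn's algorithm; the table names and edge format here are hypothetical, not prescribed by the patent:

```python
from collections import deque

def topological_sort(edges, nodes):
    """Kahn's algorithm: repeatedly output nodes whose in-degree is zero.

    edges is a list of (u, v) pairs meaning "v depends on u" (an edge u -> v);
    the result lists every dependency before its dependents."""
    in_degree = {n: 0 for n in nodes}
    out_edges = {n: [] for n in nodes}
    for u, v in edges:
        out_edges[u].append(v)
        in_degree[v] += 1
    queue = deque(n for n in nodes if in_degree[n] == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in out_edges[n]:
            in_degree[m] -= 1  # delete the in-degree contributed by n
            if in_degree[m] == 0:
                queue.append(m)
    if len(order) != len(nodes):
        raise ValueError("dependency cycle detected")
    return order

# e.g. feature table 1 depends on feature table 4, table 2 on table 5:
order = topological_sort([("t4", "t1"), ("t5", "t2")], ["t1", "t2", "t4", "t5"])
```

Elements with no dependencies (in-degree zero) are emitted first, matching the glossary definition of topological sorting.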
  • Machine learning in the prior art usually requires some training data, which is determined through feature engineering.
  • in the prior art, the Hive data warehouse is usually used to store features, providing the ability to process features with SQL and to store them.
  • an embodiment of the present invention provides a feature processing method suitable for machine learning, as shown in FIG. 1 specifically, including the following steps:
  • Step S101: in the embodiment of the present invention, if a feature processing request corresponding to the data is obtained, feature processing is performed through a feature pool composed of the features related to the feature processing request.
  • These features are defined in the form of a feature table, and the feature table is at least composed of a feature list, a feature library, a dependent feature table, a business, and feature processing logic.
  • the feature list includes at least one feature, the dependent feature table is used to record other feature tables that have a dependency relationship with each feature table, and the feature processing request includes the feature to be processed.
  • the feature processing request corresponding to the data may be a request for extracting and processing some features and related data, for example, a residential area suitable for the elderly, a residential area suitable for office workers, and so on.
  • the original data is obtained.
  • the original data can include information provided by BD maps, SG maps, etc.
  • the features and attributes can be community facility information, traffic convenience, etc.
  • these features are then modeled by technologies such as statistical models or machine learning models. The feature processing process can be divided into two stages: in the first stage, the original data is processed into natural features.
  • natural features focus on the meaning of the feature itself, such as a customer's age, occupation, and annual income, or the size of a company's staff and its office location. Some natural features can be obtained directly from the original data, while others require complex processing logic. In the second stage, natural features are processed into machine features.
  • the processing of machine features depends on the input requirements of the machine learning algorithm, and different algorithms require different processing. For example, deep learning algorithms often need categorical attributes processed into one-hot codes, while decision tree algorithms can process categorical attributes directly.
  • the features are stored in the database through the feature table.
  • the feature table t includes multiple natural features f.
  • the feature library in the embodiment of the present invention may or may not correspond to a library in the data warehouse.
  • likewise, the feature table in the embodiment of the present invention may or may not correspond to a table in the data warehouse; there is no logical dependency between them.
  • the multiple features included in a feature table are defined in the form of a feature list; that is, each feature table includes part of the overall feature list, that part includes at least one feature, and the representation of a feature in the feature list can be as shown in Table 1.
  • Table 1 is only one way of identifying features.
  • elements of the feature representation can also be deleted or added.
  • the feature table in addition to the feature list, also includes a dependent feature table.
  • for example, the feature table includes feature A, and feature A has a dependency relationship with feature B.
  • feature B belongs to feature table B, so feature table B is recorded in the dependent feature table.
  • the feature table also records the associated feature library, the associated business, and the feature processing logic.
  • the associated feature library indicates which library the feature table belongs to
  • the associated business indicates which business the features in the feature table belong to.
  • the embodiment of the present invention provides a general service level division method, although other division methods are also possible.
  • the service level can be regarded as a mark of the feature table, and the same feature table can carry multiple service marks of the same level, such as feature table 3: application 1, application 2.
  • marks of different levels on one feature table, such as feature table 1: model 1, application 2, are not allowed.
  • the processing logic includes a processing program and a program configuration; the processing program can be an SQL statement or another program that can run in a specific environment, and the program configuration must be completed before running.
  • the processing program is only responsible for processing the features and does not care how the target feature data is saved. For example, if the processing program is SQL, it will not contain logic like insert into [target table] or insert overwrite [target table]; instead, the saving behavior of features is controlled and tracked by the runtime system.
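As an illustrative sketch only (all field values, table names, and the SQL text are hypothetical, and the patent does not prescribe any concrete data structure), the five parts of a feature table described above could be modeled as:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FeatureTable:
    name: str
    feature_list: List[str]      # at least one feature
    feature_library: str         # the library this table belongs to
    dependent_tables: List[str]  # other feature tables this table depends on
    business: List[str]          # business marks of the same level
    processing_sql: str          # transformation only; saving is the runtime's job

# hypothetical example: feature table 1 depends on feature table 4
table1 = FeatureTable(
    name="feature_table_1",
    feature_list=["near_tertiary_hospital"],
    feature_library="community_features",
    dependent_tables=["feature_table_4"],
    business=["application 1", "application 2"],
    processing_sql="SELECT id, near_hospital FROM feature_table_4 WHERE grade = 3",
)
# per the description, the processing program contains no INSERT logic
assert "INSERT" not in table1.processing_sql.upper()
```

The assertion mirrors the point above: the processing program transforms features, while persisting the result is controlled and tracked by the runtime system.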
  • Step S102: determine the feature dependency relationship according to the feature to be processed and each feature in the feature pool, determine from the feature dependency relationship the feature tables that currently have no dependency, and add those feature tables to the feature processing path as a parallel subtask.
  • the feature tables that currently have no dependency can be determined from the dependency relationships, and those feature tables are added to the feature processing path as a subtask of parallel processing. In this way, the feature processing path can be determined, thereby improving feature processing efficiency and facilitating feature management.
  • the feature dependency relationship can be determined layer by layer. For example, the feature to be processed is feature A, and feature A is stored in feature table 1, feature table 2, and feature table 3; feature table 1 has a dependency relationship with feature table 4, feature table 2 with feature table 5, and feature table 3 with feature table 6. Therefore, when processing feature A, feature table 4, feature table 5, and feature table 6 must be processed first, and then feature table 1, feature table 2, and feature table 3.
  • the foregoing may be determined by a simple topological sorting method; that is, the processing path is obtained by dependency-sorting the features to be processed.
  • the feature that needs to be processed is used as the root node, and every feature table that has a direct or indirect dependency relationship with the root node is used as an upper node to build a feature dependency tree. In other words, take the feature to be processed as the root node and gradually add nodes upward to form a tree of dependencies.
  • for example, the feature to be processed is feature A, and feature A is stored in feature table 1, feature table 2, and feature table 3; feature table 1 has a dependency relationship with feature table 4, feature table 2 with feature table 5, and feature table 3 with feature table 6; feature table 4, feature table 5, and feature table 6 are in turn obtained by processing the features in original table 1, original table 2, and original table 3. This forms the dependency tree shown in Figure 4.
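The dependency tree of this example can be sketched as a nested mapping; the exact edges below are a simplified, assumed reading of the Figure 4 example (one original table per feature table), not the patent's authoritative structure:

```python
def build_dependency_tree(root, deps):
    """Return a nested dict: each key is a table the parent directly depends
    on, mapped to that table's own dependency subtree (empty dict at leaves)."""
    return {d: build_dependency_tree(d, deps) for d in deps.get(root, [])}

# simplified edges assumed from the Figure 4 example
deps = {
    "feature_A": ["feature_table_1", "feature_table_2", "feature_table_3"],
    "feature_table_1": ["feature_table_4"],
    "feature_table_2": ["feature_table_5"],
    "feature_table_3": ["feature_table_6"],
    "feature_table_4": ["original_table_1"],
    "feature_table_5": ["original_table_2"],
    "feature_table_6": ["original_table_3"],
}
tree = {"feature_A": build_dependency_tree("feature_A", deps)}
```

Here feature A is the root, its upper nodes are the tables it directly depends on, and each level above records that table's own dependencies, matching the layer-by-layer construction described above.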
  • the processing sequence between features can be quickly determined through topological sorting.
  • one batch of feature tables can be processed in parallel to improve the efficiency of feature processing.
  • original table 1, original table 2, and original table 3 can form one batch, since there is no dependency relationship among them.
  • the information provided by the BD map is stored in original table 1, and the information provided by the BD map is relatively complete; the information provided by the SG map is stored in original table 2, in which the school information is relatively complete; original table 3 stores the information provided by the WW map, in which the supermarket information is relatively complete.
  • original table 1, original table 2, and original table 3 can be processed at the same time to obtain feature table 4, i.e. the specific information of each hospital (public or private, whether it is a tertiary hospital, etc.); feature table 5, i.e. the specific information of each school (public or private; whether it is a university, middle school, or elementary school; whether it is a national key school; etc.); and feature table 6, i.e. the specific information of each supermarket (whether it focuses on fresh food or daily necessities, its star ratings for service quality and product quality, and so on).
  • feature table 4, feature table 5, and feature table 6 can then be processed at the same time to obtain feature table 1, i.e. information on the neighborhoods near tertiary hospitals; feature table 2, i.e. information on the neighborhoods near 211 universities; and feature table 3, i.e. information on the neighborhoods near supermarkets with five-star service quality and product quality. In this way, the feature A that needs to be processed, i.e. residential areas suitable for the elderly, can be determined.
  • a feature processing path generation algorithm which specifically includes:
  • while the set C is not empty, the current traversal is S1 and C1; for example, S1 is feature table 4 and C1 is original table 1, and original table 1 is a dependency table of feature table 4, so one in-degree of feature table 4 is deleted.
  • continuing the traversal, S2 is feature table 6 and C2 is original table 2; these two tables have no dependency relationship, so nothing is deleted.
  • continuing, S3 is feature table 5 and C3 is original table 2; original table 2 is a dependency table of feature table 5, so one in-degree of feature table 5 is deleted, and so on, until the in-degree of feature table 6 is also deleted.
  • the C table is then updated to hold the feature tables whose in-degrees have been fully deleted; those feature tables form a parallel subtask, the C table is cleared, and the above steps are repeated until the in-degrees of all feature tables have been deleted, forming multiple parallel subtasks.
  • each step of the algorithm groups the tables that currently have no dependencies into a parallel subtask, and parallel operation speeds up overall execution.
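A hedged Python sketch of this batching idea, equivalent to a layered topological sort that deletes in-degrees batch by batch (table names are illustrative shorthand, not the patent's identifiers):

```python
def parallel_batches(deps):
    """deps maps each table to the list of tables it depends on. Returns a
    list of batches; every table in a batch has no remaining dependencies,
    so each batch can be run as a single parallel subtask."""
    remaining = {t: set(d) for t, d in deps.items()}
    batches = []
    while remaining:
        ready = sorted(t for t, d in remaining.items() if not d)
        if not ready:
            raise ValueError("dependency cycle detected")
        batches.append(ready)
        for t in ready:
            del remaining[t]
        for d in remaining.values():
            d.difference_update(ready)  # delete the in-degrees these tables contributed
    return batches

# illustrative names following the Figure 4 example: originals o1-o3,
# feature tables t4-t6 built from them, feature tables t1-t3 built from those
deps = {
    "o1": [], "o2": [], "o3": [],
    "t4": ["o1"], "t5": ["o2"], "t6": ["o3"],
    "t1": ["t4"], "t2": ["t5"], "t3": ["t6"],
}
batches = parallel_batches(deps)
# → [["o1", "o2", "o3"], ["t4", "t5", "t6"], ["t1", "t2", "t3"]]
```

Each returned batch is one parallel subtask: its tables have no remaining dependencies on unprocessed tables, so they can run concurrently.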
  • Step S103: perform parallel feature processing on the feature tables with no dependency relationship according to the feature processing path, so as to obtain training data based on the parallel-processed features.
  • the required feature can be processed through the determined feature processing path. This completes the first stage, in which original data is processed into natural features; the natural features then need to be processed into machine features.
  • machine features can be obtained through multiple consecutive processing steps, and after each processing step the result can be saved to facilitate subsequent feature reuse. For example, in the embodiment of the present invention, consider the categorical attribute of whether the customer smokes: yes
  • the process of obtaining machine features through multiple consecutive processing steps may also be referred to as a machine feature processing pipeline.
  • the processing from natural features to machine features is performed in the dimension of a single feature.
  • multiple features can also share a pipeline.
  • multiple processing steps form a processing pipeline: each step receives the output of the previous step and, after processing, outputs to the next step. Each step may or may not output a step state.
  • each step in the pipeline needs to support processing one or more features, because there may be only one feature at the pipeline's input, but an intermediate step may turn one feature into multiple features.
  • the one-hot code adds a feature for each value of the original feature. For example, the feature of whether a customer smokes is treated as two features: the customer smokes, and the customer does not smoke.
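A minimal sketch of this one-hot treatment, using the two-value smoking example from the description:

```python
def one_hot(value, categories):
    """Map one categorical value into len(categories) bits: the bit for the
    matching category is 1 and every other bit is 0, so a single feature
    becomes multiple machine features."""
    return [1 if value == category else 0 for category in categories]

# the "customer smokes" feature becomes two machine features
assert one_hot("yes", ["yes", "no"]) == [1, 0]  # customer smokes
assert one_hot("no", ["yes", "no"]) == [0, 1]   # customer does not smoke
```

This is also why a pipeline step must support multiple features: after this step, one input feature has become two.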
  • the intermediate state of feature processing can be saved, and intermediate steps can be configured in a custom way.
  • for example, if the normalization step in the above example is configured with a mean and a variance, the state of the normalization step can be saved for feature reuse.
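A minimal sketch of a pipeline step that saves its intermediate state, here a normalization step keeping mean and variance (class and method names are hypothetical; the patent does not specify an API):

```python
class NormalizeStep:
    """A pipeline step that standardizes values and saves its state (mean and
    variance) so the transformation can be reused on later data."""
    def __init__(self):
        self.state = None

    def run(self, values):
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        self.state = {"mean": mean, "variance": variance}
        std = variance ** 0.5 or 1.0  # avoid dividing by zero for constant input
        return [(v - mean) / std for v in values]

def run_pipeline(steps, values):
    """Each step receives the previous step's output; every step's saved
    state is collected so intermediate results can be reused later."""
    states = []
    for step in steps:
        values = step.run(values)
        states.append(step.state)
    return values, states

normalized, states = run_pipeline([NormalizeStep()], [1.0, 3.0])
# → normalized == [-1.0, 1.0], states[0] == {"mean": 2.0, "variance": 1.0}
```

The collected states are the saved intermediate results described above: applying the same saved mean and variance to new data reuses the step without recomputing it.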
  • the feature processing method suitable for machine learning provided by the embodiments of the present application will be described below in conjunction with a specific implementation scenario.
  • the method is used to extract feature S, which is located in feature table 1.
  • Feature table 1 has an associated relationship with feature table 2, feature table 3, and feature table 4, and feature table 2 has an associated relationship with feature table 5 and feature table 6, as shown in Figure 6:
  • Step S601 Obtain a characteristic processing request corresponding to the data
  • Step S602 construct a feature pool with features in feature table 1, feature table 2, feature table 3, feature table 4, feature table 5, and feature table 6;
  • Step S603 Construct a dependency tree for the features in the feature pool.
  • the dependency tree can be embodied as: feature S is the root node; the upper node of the root node is feature table 1; the upper nodes of feature table 1 are feature table 2, feature table 3, and feature table 4; and the upper nodes of feature table 2 are feature table 5 and feature table 6;
  • Step S604: find the set of tables that currently have no dependencies, and delete from the dependency graph the associations involving the tables in that set, so as to generate the next batch of non-dependent tables, until all tables have been added to the processing sequence, obtaining the processing sequence:
  • feature table 5, feature table 6 > feature table 2, feature table 3, feature table 4 > feature table 1;
  • Step S605 performing feature processing according to the processing sequence to obtain feature S;
  • Step S606: pass feature S through multiple steps to obtain machine feature T, and save the feature results of the multiple steps.
  • the device 700 includes:
  • the obtaining unit 701 is configured to obtain a feature processing request corresponding to the data, and construct a feature pool according to each feature in each feature table.
  • the feature table is at least composed of a feature list, an associated feature library, a dependent feature table, an associated business, and feature processing logic.
  • the feature list includes at least one feature
  • the dependent feature table is used to record other feature tables that have a dependency relationship with each feature table
  • the feature processing request includes features that need to be processed
  • the feature processing path determination unit 702 is configured to determine the feature dependency relationship according to the feature to be processed and each feature in the feature pool, determine from the feature dependency relationship the feature tables that currently have no dependency, and add those feature tables to the feature processing path as a parallel subtask;
  • the feature processing unit 703 is configured to perform parallel feature processing on the feature tables with no dependency relationship according to the feature processing path, so as to obtain training data based on the parallel-processed features.
  • the characteristic processing path determining unit 702 is specifically configured to:
  • the feature to be processed is taken as the root node, and the feature table that has a direct dependency relationship or an indirect dependency relationship with the root node is taken as an upper node to construct a feature dependency tree.
  • the characteristic processing path determining unit 702 is specifically configured to:
  • the feature processing unit 703 is further configured to:
  • the processed features are passed through multiple consecutive processing steps to obtain machine features.
  • an embodiment of the present application provides a computer device, as shown in FIG. 8, including at least one processor 801 and a memory 802 connected to the at least one processor.
  • the embodiment of the present application does not limit the specific connection between the processor 801 and the memory 802; the connection between the processor 801 and the memory 802 shown in FIG. 8 is taken as an example.
  • the bus can be divided into address bus, data bus, control bus, etc.
  • the memory 802 stores instructions executable by the at least one processor 801; by executing the instructions stored in the memory 802, the at least one processor 801 can execute the steps of the aforementioned feature processing method suitable for machine learning.
  • the processor 801 is the control center of the computer device; it can use various interfaces and lines to connect the various parts of the terminal device, and obtain the client address by running or executing the instructions stored in the memory 802 and calling the data stored in the memory 802.
  • the processor 801 may include one or more processing units, and the processor 801 may integrate an application processor and a modem processor.
  • the application processor mainly processes the operating system, user interface, and application programs.
  • the modem processor mainly handles wireless communication. It can be understood that the foregoing modem processor may also not be integrated into the processor 801.
  • the processor 801 and the memory 802 may be implemented on the same chip, and in some embodiments, they may also be implemented on separate chips.
  • the processor 801 may be a general-purpose processor, such as a central processing unit (CPU), a digital signal processor, an application-specific integrated circuit (ASIC), a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of the present application.
  • the general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in the embodiments of the present application may be directly embodied as being executed and completed by a hardware processor, or executed and completed by a combination of hardware and software modules in the processor.
  • the memory 802 can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules.
  • the memory 802 may include at least one type of storage medium, for example, may include flash memory, hard disk, multimedia card, card-type memory, random access memory (Random Access Memory, RAM), static random access memory (Static Random Access Memory, SRAM), Programmable Read Only Memory (PROM), Read Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), magnetic memory, disk , CD, etc.
  • the memory 802 may also be any other medium that can be used to carry or store the desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
  • the memory 802 in the embodiment of the present application may also be a circuit or any other device capable of realizing a storage function for storing program instructions and/or data.
  • the embodiments of the present application provide a computer-readable storage medium that stores a computer program executable by a computer device; when the program runs on the computer device, the computer device can execute the steps of the above feature processing method suitable for machine learning.
  • a person of ordinary skill in the art can understand that all or part of the steps in the above method embodiments can be implemented by a program instructing relevant hardware.
  • the foregoing program can be stored in a computer-readable storage medium; when the program is executed, the steps of the foregoing method embodiments are performed; and the foregoing storage medium includes: removable storage devices, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), magnetic disks, optical disks, and other media that can store program code.
  • when the above-mentioned integrated unit of this application is implemented in the form of a software function module and sold or used as an independent product, it can also be stored in a computer-readable storage medium.
  • the computer software product is stored in a storage medium and includes several instructions for A computer device (which may be a personal computer, a server, or a network device, etc.) executes all or part of the methods described in the various embodiments of the present application.
  • the aforementioned storage media include media that can store program code, such as removable storage devices, ROMs, RAMs, magnetic disks, or optical discs.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Stored Programmes (AREA)

Abstract

A feature processing method applicable to machine learning, and a device. The method comprises: upon acquiring a feature processing request corresponding to data, constructing a feature pool according to the features in each feature table, wherein each feature table at least consists of a feature list, a feature library to which the table belongs, a dependency feature table, a business to which the table belongs, and feature processing logic; the feature list comprises one or more features; the dependency feature table is used to record other feature tables having a dependency relationship with the feature table; and the feature processing request comprises the features to be processed (S101); determining feature dependency relationships according to the features to be processed and the features in the feature pool, and determining a feature processing path according to the feature dependency relationships (S102); and performing feature processing according to the feature processing path, so as to obtain training data from the features that have undergone concurrent processing (S103).

Description

Feature processing method and device suitable for machine learning
Cross-reference to related applications
This application claims priority to the Chinese patent application No. 201910562484.1, filed with the Chinese Patent Office on June 26, 2019 and entitled "Feature processing method and device suitable for machine learning", which is incorporated herein by reference in its entirety.
Technical field
The present invention relates to the field of financial technology (Fintech), and in particular to a feature processing method and device suitable for machine learning.
Background
With the development of computer technology, more and more technologies are being applied in the financial field, and the traditional financial industry is gradually transforming toward financial technology (Fintech). Feature processing technology is no exception; however, the security and real-time requirements of the financial industry also impose higher requirements on the technology.
In current supervision systems, a large amount of data usually needs to be analyzed. Generally, the analysis is performed through the feature engineering of a machine learning algorithm model. Feature engineering is the process of transforming raw data into features that better describe the underlying problem to the predictive model, thereby improving the model's accuracy on unseen data. In the prior art, a Hive data warehouse is usually used to store features, and the data warehouse can provide the capabilities of processing features with SQL and of storing features. However, the prior art has no unified management of features and feature processing logic, and the dependency relationships between features cannot be clearly expressed, which brings inconvenience to feature addition, deletion, and maintenance. As a result, the training data in the algorithm model is inaccurate, which correspondingly reduces the accuracy of the analysis of supervision data.
Summary of the invention
In view of this, the embodiments of the present invention provide a feature processing method and device suitable for machine learning, which at least solve the prior-art problem that the lack of unified management of features and feature processing logic reduces the accuracy of supervision data analysis.
In one aspect, an embodiment of the present invention provides a feature processing method suitable for machine learning, including:
after a feature processing request corresponding to data is obtained, constructing a feature pool according to the features in each feature table, where each feature table is at least composed of a feature list, a feature library to which the table belongs, a dependency feature table, a business to which the table belongs, and feature processing logic; the feature list includes at least one feature; the dependency feature table is used to record other feature tables that have a dependency relationship with the feature table; and the feature processing request includes the features to be processed;
determining feature dependency relationships according to the features to be processed and the features in the feature pool, determining, according to the feature dependency relationships, the feature tables that currently have no dependency, and adding the feature tables that currently have no dependency to a feature processing path as parallel subtasks;
performing parallel feature processing on the feature tables that currently have no dependency according to the feature processing path, so as to obtain training data according to the features after the parallel processing.
In the embodiments of the present invention, the data features in the database are saved in the form of feature tables. A feature table includes multiple features and the processing logic of these features and, to facilitate feature processing, also records the feature tables with which it has a dependency relationship. In this way, the dependency relationships between feature tables can be maintained and reused, reducing the cost of repeatedly computing them. When a feature processing task is performed, the features in all feature tables used in the task are built into feature dependency relationships; the feature tables that currently have no dependency are determined from these relationships and added to the feature processing path as parallel subtasks, and parallel feature processing is performed on them according to the feature processing path. Multiple feature tables are thus processed in parallel, which speeds up feature processing. Moreover, because features are managed through feature tables, the dependency relationships between features can be clearly expressed, which brings convenience to feature addition, deletion, and maintenance, makes the subsequent training data more accurate, and thereby improves the accuracy of supervision data analysis.
Optionally, the determining feature dependency relationships according to the features to be processed and the features in the feature pool includes:
taking the features to be processed as the root node, and taking the feature tables that have a direct or indirect dependency relationship with the root node as upper-layer nodes, to construct a feature dependency tree.
In the embodiments of the present invention, the dependency relationships between features can be better sorted out in the form of a feature dependency tree, which facilitates feature processing.
Optionally, the determining, according to the feature dependency relationships, the feature tables that currently have no dependency, and adding the feature tables that currently have no dependency to the feature processing path as parallel subtasks includes:
determining the feature tables in the feature dependency tree that currently have no dependency, adding them as parallel subtasks to a first processing path in a feature processing path table, deleting the associations between these feature tables and the other feature tables in the feature dependency tree, and returning to the step of determining the feature tables in the feature dependency tree that currently have no dependency, so as to add the feature tables that then have no dependency as parallel subtasks to a second processing path in the feature processing path table, until all feature tables in the dependency tree have been added to the feature processing path table.
In the embodiments of the present invention, by progressively determining the processing sequence within the feature dependency tree, multiple feature tables can be processed at the same time and the processing order between features can be sorted out, which improves the efficiency of feature processing.
Optionally, after the parallel feature processing is performed on the feature tables that currently have no dependency according to the feature processing path, the method further includes:
passing the processed features through multiple consecutive processing steps to obtain machine features.
In the embodiments of the present invention, the multiple consecutive processing steps allow multiple intermediate states to exist in the feature processing project; any step can be modified through configuration without modifying the other steps, and the feature processing results of the intermediate states can be used flexibly.
In one aspect, an embodiment of the present invention provides a feature processing device suitable for machine learning, including:
an obtaining unit, configured to: after a feature processing request corresponding to data is obtained, construct a feature pool according to the features in each feature table, where each feature table is at least composed of a feature list, a feature library to which the table belongs, a dependency feature table, a business to which the table belongs, and feature processing logic; the feature list includes at least one feature; the dependency feature table is used to record other feature tables that have a dependency relationship with the feature table; and the feature processing request includes the features to be processed;
a feature processing path determining unit, configured to determine feature dependency relationships according to the features to be processed and the features in the feature pool, determine, according to the feature dependency relationships, the feature tables that currently have no dependency, and add the feature tables that currently have no dependency to a feature processing path as parallel subtasks;
a feature processing unit, configured to perform parallel feature processing on the feature tables that currently have no dependency according to the feature processing path, so as to obtain training data based on the features after the parallel processing.
Optionally, the feature processing path determining unit is specifically configured to:
take the features to be processed as the root node, and take the feature tables that have a direct or indirect dependency relationship with the root node as upper-layer nodes, to construct a feature dependency tree.
Optionally, the feature processing path determining unit is specifically configured to:
determine the feature tables in the feature dependency tree that currently have no dependency, add them as parallel subtasks to a first processing path in a feature processing path table, delete the associations between these feature tables and the other feature tables in the feature dependency tree, and return to the step of determining the feature tables in the feature dependency tree that currently have no dependency, so as to add the feature tables that then have no dependency as parallel subtasks to a second processing path in the feature processing path table, until all feature tables in the dependency tree have been added to the feature processing path table.
Optionally, the feature processing unit is further configured to:
pass the processed features through multiple consecutive processing steps to obtain machine features.
In one aspect, an embodiment of the present invention provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the steps of the feature processing method suitable for machine learning.
In one aspect, an embodiment of the present invention provides a computer-readable storage medium storing a computer program executable by a computer device, where, when the program runs on the computer device, the computer device is caused to perform the steps of the feature processing method suitable for machine learning.
Brief description of the drawings
FIG. 1 is a schematic flowchart of a feature processing method suitable for machine learning according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a feature management structure according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of business-level management of feature tables according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a feature dependency tree according to an embodiment of the present invention;
FIG. 5 is a schematic flowchart of a feature processing pipeline according to an embodiment of the present invention;
FIG. 6 is a schematic flowchart of a feature processing method suitable for machine learning according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a feature processing device suitable for machine learning according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed description of the embodiments
To make the purpose, technical solutions, and beneficial effects of this application clearer, this application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain this application, not to limit it.
To facilitate the understanding of the embodiments in this specification, some terms are explained first.
Feature engineering: the process of obtaining, organizing, and processing, from data, features that a computer program can understand and conveniently handle; its main purpose is to provide input data for machine learning training, evaluation, and prediction.
Machine learning: the process by which a computer program automatically analyzes data to obtain patterns and uses the patterns to make predictions on unknown data.
Normalization: the process of mapping values to the interval [0, 1].
Missing value processing: how feature data is handled when it is missing, e.g. filling with 0.
Natural features: features that humans can understand.
Machine features: features processed by machine learning algorithms.
One-hot code: one-hot encoding maps the multiple values of a feature to multiple bits; the bit corresponding to the feature value is 1, and all other bits are 0.
Topological sorting: a sorting algorithm that places elements with no dependencies first.
In-degree: for a node in a directed graph, the number of edges pointing to that node.
Out-degree: for a node in a directed graph, the number of edges from that node to other nodes.
Machine learning in the prior art usually requires training data, which is determined through feature engineering. In the prior art, a Hive data warehouse is usually used to store features, and the data warehouse can provide the capabilities of processing features with SQL and of storing features. However, the prior art has no unified management of features and feature processing logic, and the dependency relationships between features cannot be clearly expressed, which brings inconvenience to feature addition, deletion, and maintenance.
Based on the problems in the prior art, an embodiment of the present invention provides a feature processing method suitable for machine learning. As shown in FIG. 1, the method includes the following steps.
Step S101: in the embodiments of the present invention, after the feature processing request corresponding to the data is obtained, feature processing is performed through a feature pool composed of the features related to the feature processing request. These features are defined in the form of feature tables, where each feature table is at least composed of a feature list, a feature library to which the table belongs, a dependency feature table, a business to which the table belongs, and feature processing logic; the feature list includes at least one feature; the dependency feature table is used to record other feature tables that have a dependency relationship with the feature table; and the feature processing request includes the features to be processed.
Specifically, in the embodiments of the present invention, the feature processing request corresponding to the data may be a request to extract and process certain features and related data, for example, residential areas suitable for the elderly, residential areas suitable for office workers, and so on. Generally, raw data is first obtained; in the above example, the raw data may include information provided by the BD map, the SG map, and the like. Data processing technologies are then used to obtain, process, and extract meaningful features and attributes from the data; in the above example, the features and attributes may be community facility information, whether transportation is convenient, and so on. Finally, these features are usually modeled with technologies such as statistical models or machine learning models. The feature processing process can be divided into two stages. In the first stage, the raw data is processed into natural features, which focus on the meaning of the feature itself, such as a customer's age, occupation, and annual income, or a company's number of employees and office location. Some natural features can be obtained directly from the raw data, while others require complex processing logic. In the second stage, the natural features are processed into machine features; how machine features are processed depends on the input requirements of the machine learning algorithm, and different algorithms require different processing. For example, deep learning algorithms often require categorical attributes to be processed into one-hot codes, whereas decision tree algorithms can handle categorical attributes directly.
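The two stages can be sketched minimally as follows (Python for illustration only; the field names, occupation list, and normalization are hypothetical simplifications, not the patent's actual processing logic):

```python
# Stage 1: raw data -> natural features (human-understandable).
def to_natural(raw, current_year=2020):
    return {
        "age": current_year - raw["birth_year"],
        "occupation": raw["occupation"],
    }

# Stage 2: natural features -> machine features. The encoding depends on the
# target algorithm: deep learning typically needs one-hot codes, while a
# decision tree can consume the categorical value directly.
OCCUPATIONS = ["teacher", "engineer", "doctor"]  # hypothetical category list

def to_machine(natural, algo="deep"):
    if algo == "deep":
        onehot = [1 if natural["occupation"] == o else 0 for o in OCCUPATIONS]
        return [natural["age"] / 100.0] + onehot   # crude normalization to [0, 1]
    return [natural["age"], natural["occupation"]]  # tree: raw category is fine
```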
In the embodiments of the present invention, features are stored in the database through feature tables. As shown in FIG. 2, there are multiple feature tables t in the database K, and each feature table t includes multiple natural features f.
It should be noted that a feature library in the embodiments of the present invention may or may not correspond to a library in the data warehouse; similarly, a feature table in the embodiments of the present invention may or may not correspond to a table in the data warehouse. There is no logical dependency between them.
In the embodiments of the present invention, the multiple features included in a feature table are defined in the form of a feature list; that is, each feature table includes a feature list part, and the feature list part contains at least one feature. A feature may be represented in the feature list as shown in Table 1.
Table 1
Feature ID | Chinese name | English name | Data type | Description | Attributes
Of course, Table 1 is only one way of identifying a feature; representation elements may also be deleted from or added to Table 1.
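A feature definition along the lines of Table 1, together with the constituents of a feature table named in step S101, might be represented as follows (a sketch only; the field names are illustrative and not the patent's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class Feature:                       # one row of Table 1
    feature_id: str
    name_cn: str
    name_en: str
    data_type: str
    description: str = ""
    attributes: str = ""

@dataclass
class FeatureTable:
    name: str
    library: str                     # feature library the table belongs to
    business: list                   # business labels (same level only)
    depends_on: list                 # names of feature tables this table depends on
    features: list = field(default_factory=list)  # the feature list
    processing_logic: str = ""       # e.g. a SQL statement producing the features
```

For example, a table that depends on feature table 4 would carry `depends_on=["feature_table_4"]` in its dependency feature table part.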
In the embodiments of the present invention, in addition to the feature list, a feature table also includes a dependency feature table. For example, if a feature table includes feature A, feature A has a dependency relationship with feature B, and feature B belongs to feature table B, then the dependency feature table includes feature table B.
In the embodiments of the invention, a feature table also includes the feature library to which it belongs, the business to which it belongs, and the feature processing logic. The feature library indicates which library the feature table belongs to, and the business indicates which kind of business the features in the feature table belong to.
In the embodiments of the present invention, three kinds of business can be defined: the first is operational data layer business, which can be understood as the user's input information; the second is common dimensional model business, that is, features obtained by model processing or judgment of the user's input information; and the third is application data layer business, that is, features applied directly in certain applications. Exemplarily, as shown in FIG. 3, the embodiment of the present invention provides a general business-level division; of course, other divisions are also possible. A business level can be regarded as a label of a feature table, and the same feature table can have multiple business labels of the same level, for example, feature table 3: application 1, application 2. However, the same feature table cannot carry business-level labels across levels; for example, feature table 1: model 1, application 2 is not allowed.
In the embodiments of the present invention, the processing logic includes a processing program and a program configuration; the processing program may be a SQL statement or another program that can run in a specific environment, and the program configuration must be completed before running. It should be noted that the processing program is only responsible for producing the features and does not care how the target feature data is saved; for example, if the processing program is SQL, it does not contain logic such as insert into [target table] or insert overwrite [target table]. Instead, the saving of features is controlled and tracked by the system at runtime.
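The split between the processing program and persistence could look like this (a sketch assuming a generic `execute` callable supplied by the runtime; the table and column names are hypothetical):

```python
# The processing program is a plain SELECT: it produces the features but
# contains no INSERT INTO / INSERT OVERWRITE. Where the result is written is
# decided and tracked by the runtime, not by the program itself.
job = {
    "program": "SELECT hospital_id, is_grade3a FROM raw_hospital_info",
    "config": {"engine": "hive"},   # program configuration, completed before running
}

def run_job(job, target_table, execute):
    assert "insert" not in job["program"].lower()  # save logic lives in the runtime
    rows = execute(job["program"])                 # run the processing program
    # ... here the runtime would write `rows` into target_table and record the save ...
    return rows
```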
Step S102: determine feature dependency relationships according to the features to be processed and the features in the feature pool, determine, according to the feature dependency relationships, the feature tables that currently have no dependency, and add the feature tables that currently have no dependency to the feature processing path as parallel subtasks.
In the embodiments of the present invention, based on the dependency relationships between the features in the feature pool and the features to be processed, the feature tables that currently have no dependency can be determined and added to the feature processing path as parallel subtasks. In this way, the feature processing path can be determined, which improves feature processing efficiency and also facilitates feature management.
In the embodiments of the present invention, the feature dependency relationships can be determined layer by layer. For example, suppose the feature to be processed is feature A, and feature A is stored in feature table 1, feature table 2, and feature table 3; feature table 1 has a dependency relationship with feature table 4, feature table 2 with feature table 5, and feature table 3 with feature table 6. Therefore, when processing feature A, feature table 4, feature table 5, and feature table 6 need to be processed first, and then feature table 1, feature table 2, and feature table 3.
Optionally, in the embodiments of the present invention, the above can be determined with a simple topological sorting method; that is, the features to be processed are sorted by dependency to finally obtain the processing path.
Optionally, in the embodiments of the present invention, in order to display the feature processing path clearly and facilitate rapid feature processing, the features to be processed are taken as the root node, and the feature tables that have a direct or indirect dependency relationship with the root node are taken as upper-layer nodes, to construct a feature dependency tree. That is, the features to be processed are taken as the root node, and nodes are then built up layer by layer to form a tree-shaped dependency relationship.
Exemplarily, suppose the feature to be processed is feature A, which is stored in feature table 1, feature table 2, and feature table 3; feature table 1 has a dependency relationship with feature table 4, feature table 2 with feature table 5, and feature table 3 with feature table 6; and feature table 4, feature table 5, and feature table 6 are obtained by processing the features in original table 1, original table 2, and original table 3. The resulting dependency tree is shown in FIG. 4.
After the dependency tree is determined, the processing sequence between features can be quickly determined through topological sorting.
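For the dependency tree of FIG. 4, the topological sort can be done layer by layer, so that each layer is a batch of tables with no unmet dependencies (a minimal sketch in Python; the table names mirror the example above and are illustrative):

```python
def processing_batches(deps):
    """deps maps each table to the set of tables it depends on.
    Returns batches of tables; every table in a batch has no unmet
    dependencies, so each batch can be processed in parallel."""
    remaining = {t: set(d) for t, d in deps.items()}
    batches = []
    while remaining:
        ready = sorted(t for t, d in remaining.items() if not d)
        if not ready:
            raise ValueError("cyclic dependency between feature tables")
        batches.append(ready)
        for t in ready:
            del remaining[t]
        for d in remaining.values():
            d.difference_update(ready)   # drop dependencies that are now satisfied
    return batches

# The graph of FIG. 4: original tables 1-3 feed feature tables 4-6,
# which in turn feed feature tables 1-3.
fig4 = {
    "original_1": set(), "original_2": set(), "original_3": set(),
    "feature_4": {"original_1"}, "feature_5": {"original_2"}, "feature_6": {"original_3"},
    "feature_1": {"feature_4"}, "feature_2": {"feature_5"}, "feature_3": {"feature_6"},
}
```

Here `processing_batches(fig4)` yields three batches: the original tables, then feature tables 4 to 6, then feature tables 1 to 3.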
Optionally, in this embodiment of the present invention, a batch of features may be processed in parallel to improve the efficiency of feature processing. For example, in the example above, raw table 1, raw table 2, and raw table 3 can form one batch, since there is no dependency relationship among them. Raw table 1 stores information provided by the BD map, in which hospital information is relatively complete; raw table 2 stores information provided by the SG map, in which school information is relatively complete; and raw table 3 stores information provided by the WW map, in which supermarket information is relatively complete. In this way, raw tables 1, 2, and 3 can undergo feature processing at the same time to obtain feature table 4 (specific information about each hospital: whether it is public or private, whether it is a Grade-A tertiary hospital, and so on), feature table 5 (specific information about each school: whether it is public or private, whether it is a university, middle school, or primary school, whether it is a national key school, and so on), and feature table 6 (specific information about each supermarket: whether it focuses on fresh food or daily necessities, its star ratings for service quality and product quality, and so on).
Further, feature tables 4, 5, and 6 can then be processed at the same time to obtain feature table 1 (information about neighborhoods near Grade-A tertiary hospitals), feature table 2 (information about neighborhoods near Project 211 universities), and feature table 3 (information about neighborhoods near supermarkets with five-star service and product quality). In this way, feature A, the feature that needs to be processed, can be determined, namely neighborhoods suitable for elderly residents.
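As a rough illustration of this kind of batch-level parallelism, the tables of one batch could be handed to a thread pool together; `process_table` below is a hypothetical stand-in for the actual per-table feature-processing logic, not a function defined by the embodiment:

```python
from concurrent.futures import ThreadPoolExecutor

def process_table(table_name):
    # Placeholder for the real feature processing applied to one table.
    return f"features({table_name})"

# Tables in one batch have no dependency relationships among themselves,
# so the whole batch can be submitted as a single parallel subtask.
batch = ["raw_table_1", "raw_table_2", "raw_table_3"]
with ThreadPoolExecutor() as pool:
    # Executor.map preserves the order of its inputs.
    results = list(pool.map(process_table, batch))
```

Only after the whole batch has completed would the next batch in the processing sequence be submitted, since its tables depend on these results.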
Optionally, an embodiment of the present invention proposes a feature processing path generation algorithm, which specifically includes:
(1) Initialize the processing sequence R; the initial sequence R is empty. Initialize the set S as the set of all feature tables; in the previous example, it contains feature tables 1, 2, 3, 4, 5, and 6. Initialize the temporary set C as the set of all raw data tables; in the previous example, it contains raw tables 1, 2, and 3.
(2) While the set C is non-empty, that is, while there are still raw data tables, traverse all feature tables in the set S, marking the table currently visited as Si.
(3) Traverse all tables in the set C (these may be raw tables or feature tables), marking the table currently visited as Cj.
(4) If Cj is determined to be a dependency table of Si, the dependency graph contains an edge from Cj to Si. As shown in Figure 4, the edge from raw table 1 to feature table 4, the edge from raw table 2 to feature table 5, and the edge from raw table 3 to feature table 6 are deleted; this can be understood as removing the out-edges from Cj to Si.
(5) Loop over step (3), then execute step (2).
(6) Take the non-raw tables (C1, C2, ...) out of the current set C, combine them into one parallel subtask task(C1|C2|...), and append it to the end of the processing sequence R; then clear the set C to an empty set.
(7) Traverse all feature tables in the set S, find all tables whose in-degree is 0, delete them from S, and add them to the set C.
(8) Return to step (2).
(9) If the set S is determined to be non-empty at this point, report that a circular dependency exists and exit the path computation.
(10) When the procedure ends, all subtasks in the sequence R constitute the processing path of the extraction task.
To better understand this method, the feature dependency tree in Figure 4 is taken as an example. First, the processing sequence R is initialized to empty; then the set S is initialized to all feature tables, and the temporary set C is initialized to all raw data tables.
In the first iteration, the set C is not empty. The tables currently traversed are S1 and C1; for example, S1 is feature table 4 and C1 is raw table 1. Since raw table 1 is a dependency table of feature table 4, the corresponding in-degree of feature table 4 is deleted. The traversal then continues: S2 is feature table 6 and C2 is raw table 2, and these two tables have no dependency relationship. Continuing, S3 is feature table 5 and C3 is raw table 2; raw table 2 is a dependency table of feature table 5, so the in-degree of feature table 5 is deleted, and the process continues until the in-degree of feature table 6 is deleted as well. The set C is then updated, at which point it contains feature tables; the procedure above continues to delete the in-degrees of the feature tables, and the feature tables whose in-degrees have been deleted are combined into one parallel subtask. The set C is then cleared, and the steps above are repeated until the in-degrees of all feature tables have been deleted, yielding multiple parallel subtasks.
That is to say, in this embodiment of the present invention, the set of tables that currently have no dependencies must be found, and the edges linking those tables to the rest of the dependency graph must be deleted, thereby generating the next batch of dependency-free tables, until all tables have been added to the processing sequence. The difference from standard topological sorting is that, at each step, the algorithm combines the tables that currently have no dependencies into one parallel subtask; running them in parallel speeds up overall execution.
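Assuming each table's dependency list is known in advance, the batched topological sort described in steps (1) to (10) can be sketched in Python. The function name `build_processing_path` and the data layout are illustrative choices, not part of the embodiment:

```python
def build_processing_path(raw_tables, feature_tables, depends_on):
    """Group tables into parallel batches by level-wise topological sorting.

    raw_tables:     names of the raw data tables (they have no dependencies)
    feature_tables: names of the feature tables
    depends_on:     maps each feature table to the tables it depends on
    """
    raw = set(raw_tables)
    remaining = set(feature_tables)              # set S: tables not yet scheduled
    in_degree = {t: len(depends_on.get(t, ())) for t in remaining}
    path = []                                    # sequence R of parallel batches
    current = set(raw)                           # set C: tables ready to process

    while current:
        # Delete the edges from the current batch into still-pending tables.
        for done in current:
            for table in remaining:
                if done in depends_on.get(table, ()):
                    in_degree[table] -= 1
        # The non-raw tables of the batch form one parallel subtask.
        batch = sorted(t for t in current if t not in raw)
        if batch:
            path.append(batch)
        # Next batch: tables whose dependencies are now all satisfied.
        current = {t for t in remaining if in_degree[t] == 0}
        remaining -= current

    if remaining:                                # leftover tables imply a cycle
        raise ValueError(f"circular dependency detected: {sorted(remaining)}")
    return path
```

On a graph shaped like the earlier example (raw tables feeding feature tables 4 to 6, which in turn feed feature tables 1 to 3, with hypothetical edges), this returns two batches, each runnable as one parallel subtask; a cyclic graph raises an error, matching step (9).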
Step S103: Perform parallel feature processing on the feature tables that currently have no dependency relationships according to the feature processing path, so as to obtain training data from the features after parallel processing.
In this embodiment of the present invention, the required features can be processed according to the determined feature processing path. Once this processing is complete, the first stage, in which raw features are processed into natural features, is finished; the natural features then need to be processed into machine features.
In this embodiment of the present invention, machine features can be obtained through multiple consecutive processing steps, and the result of each processing step can be saved to facilitate later feature reuse. For example, when numericizing the categorical attribute "does the customer smoke: yes|no", the correspondence between category and value needs to be recorded, for example, smokes -> 1, does not smoke -> 0. Other transformations, such as mean-variance normalization, need to record the mean and variance of the feature. Therefore, in this embodiment, the machine feature, namely 1 or 0, can be obtained after the mean step and the variance step.
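A minimal sketch of recording such step states might look as follows; the function names `fit_categorical` and `fit_standardize`, and the sample values, are illustrative assumptions rather than names used by the embodiment:

```python
import statistics

def fit_categorical(values):
    """Record the category -> number mapping so it can be replayed later."""
    return {category: index for index, category in enumerate(sorted(set(values)))}

def fit_standardize(values):
    """Record the mean and (population) standard deviation of a feature."""
    return {"mean": statistics.fmean(values), "std": statistics.pstdev(values)}

# Categorical step state, e.g. smokes -> 1, does not smoke -> 0.
mapping = fit_categorical(["smokes", "does not smoke", "smokes"])

# Normalization step state: the saved mean/variance let new data be
# transformed exactly the same way when the feature is reused.
state = fit_standardize([160.0, 170.0, 180.0])
normalized = [(v - state["mean"]) / state["std"] for v in [160.0, 170.0, 180.0]]
```

The point is that each fitted state is a small, serializable record; applying the transformation to later data only replays the saved state instead of refitting it.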
In this embodiment of the present invention, the process of obtaining machine features through multiple consecutive processing steps may also be called a machine-feature processing pipeline. As shown in Figure 5, the processing from natural features to machine features is performed at the granularity of a single feature, although multiple features may also share one pipeline. Multiple processing steps form a processing pipeline: each step on the pipeline receives the output of the previous step and, after processing, passes its output to the next step. Each step may or may not output a step state.
Each step in the pipeline needs to support processing one or more features, because there may be only one feature entering the pipeline, yet an intermediate step may turn one feature into several. For example, one-hot encoding adds a new feature for each possible value of the original feature: the feature "does the customer smoke" is processed into two features, "customer smokes" and "customer does not smoke".
That is to say, through the processing pipeline, the intermediate states of feature processing can be saved, and the intermediate steps can be configured in a customized way. For example, splitting the normalization process in the example above into a mean step and a variance step makes it possible to save the features produced during normalization, so that they can be reused.
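The pipeline idea can be sketched as follows, assuming hypothetical `Pipeline` and `OneHotStep` classes that are not part of the embodiment; note how the one-hot step saves its observed categories as step state and expands one feature into several:

```python
class OneHotStep:
    """Pipeline step that expands one categorical feature into one
    0/1 feature per observed value, keeping the categories as step state."""
    def __init__(self):
        self.categories = {}  # per-feature state, saved so the step can be replayed

    def run(self, features):
        out = {}
        for name, values in features.items():
            cats = self.categories.setdefault(name, sorted(set(values)))
            for cat in cats:
                out[f"{name}={cat}"] = [1 if v == cat else 0 for v in values]
        return out

class Pipeline:
    """Each step receives the previous step's output and passes its own on."""
    def __init__(self, steps):
        self.steps = steps

    def run(self, features):  # features: {feature name: list of values}
        for step in self.steps:
            features = step.run(features)
        return features

machine = Pipeline([OneHotStep()]).run({"smokes": ["yes", "no", "yes"]})
```

Here one input feature yields the two machine features "smokes=no" and "smokes=yes", and the saved `categories` state allows the same expansion to be applied to new data later.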
To better explain the embodiments of the present application, a feature processing method suitable for machine learning provided by an embodiment of the present application is described below in conjunction with a specific implementation scenario. The method is used to extract feature S, which is located in feature table 1; feature table 1 is associated with feature tables 2, 3, and 4, and feature table 2 is associated with feature tables 5 and 6, as shown in Figure 6:
Step S601: Obtain the feature processing request corresponding to the data.
Step S602: Build a feature pool from the features in feature tables 1, 2, 3, 4, 5, and 6.
Step S603: Build a dependency tree from the features in the feature pool. The dependency tree can be represented as follows: feature S is the root node, the node above the root is feature table 1, the nodes above feature table 1 are feature tables 2, 3, and 4, and the nodes above feature table 2 are feature tables 5 and 6.
Step S604: Find the set of tables that currently have no dependencies and delete their associations with the rest of the dependency graph, thereby generating the next batch of dependency-free tables, until all tables have been added to the processing sequence. The resulting processing sequence is: feature tables 5 and 6 > feature tables 2, 3, and 4 > feature table 1.
Step S605: Perform feature processing according to the processing sequence to obtain feature S.
Step S606: Pass feature S through multiple steps to obtain machine feature T, and save the feature results of those steps.
Based on the same technical concept, an embodiment of the present application provides a feature processing device suitable for machine learning. As shown in FIG. 7, the device 700 includes:
an obtaining unit 701, configured to, after obtaining the feature processing request corresponding to the data, build a feature pool from the features in each feature table, where each feature table is composed of at least a feature list, the feature library it belongs to, a dependency feature table, the business it belongs to, and feature processing logic; the feature list includes at least one feature; the dependency feature table is used to record the other feature tables that have a dependency relationship with each feature table; and the feature processing request includes the features that need to be processed;
a feature processing path determination unit 702, configured to determine feature dependency relationships according to the features to be processed and the features in the feature pool, determine the feature tables that currently have no dependency relationships according to the feature dependency relationships, and add the feature tables that currently have no dependency relationships to the feature processing path as parallel subtasks; and
a feature processing unit 703, configured to perform parallel feature processing on the feature tables that currently have no dependency relationships according to the feature processing path, so as to obtain training data from the features after parallel processing.
Optionally, the feature processing path determination unit 702 is specifically configured to:
take the feature to be processed as the root node, and take the feature tables that have a direct or indirect dependency relationship with the root node as upper-layer nodes, to build a feature dependency tree.
Optionally, the feature processing path determination unit 702 is specifically configured to:
determine the feature tables in the feature dependency tree that currently have no dependency relationships, add them as a parallel subtask to the first processing path in the feature processing path table, delete the associations between those feature tables and the other feature tables in the feature dependency tree, and return to the step of determining the feature tables in the feature dependency tree that currently have no dependency relationships, adding the feature tables that currently have no dependency relationships as a parallel subtask to the second processing path in the feature processing path table, until all feature tables in the dependency tree have been added to the feature processing path table.
Optionally, the feature processing unit 703 is further configured to:
pass the processed features through multiple consecutive processing steps to obtain machine features.
Based on the same technical concept, an embodiment of the present application provides a computer device. As shown in FIG. 8, the device includes at least one processor 801 and a memory 802 connected to the at least one processor. The embodiment of the present application does not limit the specific connection medium between the processor 801 and the memory 802; in FIG. 8, a bus connection between the processor 801 and the memory 802 is taken as an example. A bus may be divided into an address bus, a data bus, a control bus, and so on.
In this embodiment of the present application, the memory 802 stores instructions executable by the at least one processor 801, and by executing the instructions stored in the memory 802, the at least one processor 801 can perform the steps included in the aforementioned feature processing method suitable for machine learning.
The processor 801 is the control center of the computer device and can use various interfaces and lines to connect the various parts of the terminal device, obtaining the client address by running or executing the instructions stored in the memory 802 and invoking the data stored in the memory 802. Optionally, the processor 801 may include one or more processing units, and may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, applications, and the like, while the modem processor mainly handles wireless communication. It can be understood that the modem processor may also not be integrated into the processor 801. In some embodiments, the processor 801 and the memory 802 may be implemented on the same chip; in other embodiments, they may be implemented on separate chips.
The processor 801 may be a general-purpose processor, such as a central processing unit (CPU), a digital signal processor, an application-specific integrated circuit (ASIC), a field-programmable gate array or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of the present application may be executed and completed directly by a hardware processor, or by a combination of hardware and software modules in the processor.
As a non-volatile computer-readable storage medium, the memory 802 can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory 802 may include at least one type of storage medium, for example, flash memory, a hard disk, a multimedia card, a card-type memory, random access memory (RAM), static random access memory (SRAM), programmable read-only memory (PROM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), magnetic memory, a magnetic disk, an optical disc, and so on. The memory 802 may be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 802 in this embodiment of the present application may also be a circuit or any other device capable of realizing a storage function, used to store program instructions and/or data.
Based on the same technical concept, an embodiment of the present application provides a computer-readable storage medium storing a computer program executable by a computer device; when the program runs on the computer device, it causes the computer device to execute the steps of the feature processing method suitable for machine learning.
A person of ordinary skill in the art can understand that all or some of the steps of the above method embodiments may be implemented by hardware instructed by a program. The aforementioned program may be stored in a computer-readable storage medium; when executed, the program performs the steps of the above method embodiments. The aforementioned storage medium includes various media that can store program code, such as a removable storage device, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
Alternatively, if the above integrated unit of the present application is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the various embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a removable storage device, ROM, RAM, a magnetic disk, or an optical disc.
The above are only specific implementations of the present application, but the protection scope of the present application is not limited thereto. Any person skilled in the art can easily conceive of changes or substitutions within the technical scope disclosed in the present application, and such changes or substitutions shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

  1. A feature processing method suitable for machine learning, wherein the method comprises:
    after obtaining a feature processing request corresponding to data, building a feature pool from the features in each feature table, wherein each feature table is composed of at least a feature list, the feature library it belongs to, a dependency feature table, the business it belongs to, and feature processing logic; the feature list comprises at least one feature; the dependency feature table is used to record the other feature tables that have a dependency relationship with each feature table; and the feature processing request comprises the features that need to be processed;
    determining feature dependency relationships according to the features to be processed and the features in the feature pool, determining the feature tables that currently have no dependency relationships according to the feature dependency relationships, and adding the feature tables that currently have no dependency relationships to a feature processing path as parallel subtasks; and
    performing parallel feature processing on the feature tables that currently have no dependency relationships according to the feature processing path, so as to obtain training data from the features after parallel processing.
  2. The method according to claim 1, wherein the determining feature dependency relationships according to the features to be processed and the features in the feature pool comprises:
    taking the feature to be processed as a root node, and taking the feature tables that have a direct or indirect dependency relationship with the root node as upper-layer nodes, to build a feature dependency tree.
  3. The method according to claim 2, wherein the determining the feature tables that currently have no dependency relationships according to the feature dependency relationships, and adding the feature tables that currently have no dependency relationships to the feature processing path as parallel subtasks, comprises:
    determining the feature tables in the feature dependency tree that currently have no dependency relationships, adding them as a parallel subtask to a first processing path in a feature processing path table, deleting the associations between those feature tables and the other feature tables in the feature dependency tree, and returning to the step of determining the feature tables in the feature dependency tree that currently have no dependency relationships, adding the feature tables that currently have no dependency relationships as a parallel subtask to a second processing path in the feature processing path table, until all feature tables in the dependency tree have been added to the feature processing path table.
  4. The method according to claim 1, wherein after performing parallel feature processing on the feature tables that currently have no dependency relationships according to the feature processing path, the method further comprises:
    passing the processed features through multiple consecutive processing steps to obtain machine features.
  5. A feature processing device suitable for machine learning, wherein the device comprises:
    an obtaining unit, configured to, after obtaining a feature processing request corresponding to data, build a feature pool from the features in each feature table, wherein each feature table is composed of at least a feature list, the feature library it belongs to, a dependency feature table, the business it belongs to, and feature processing logic; the feature list comprises at least one feature; the dependency feature table is used to record the other feature tables that have a dependency relationship with each feature table; and the feature processing request comprises the features that need to be processed;
    a feature processing path determination unit, configured to determine feature dependency relationships according to the features to be processed and the features in the feature pool, determine the feature tables that currently have no dependency relationships according to the feature dependency relationships, and add the feature tables that currently have no dependency relationships to a feature processing path as parallel subtasks; and
    a feature processing unit, configured to perform parallel feature processing on the feature tables that currently have no dependency relationships according to the feature processing path, so as to obtain training data from the features after parallel processing.
  6. The device according to claim 5, wherein the feature processing path determination unit is specifically configured to:
    take the feature to be processed as a root node, and take the feature tables that have a direct or indirect dependency relationship with the root node as upper-layer nodes, to build a feature dependency tree.
  7. The device according to claim 6, wherein the feature processing path determination unit is specifically configured to:
    determine the feature tables in the feature dependency tree that currently have no dependency relationships, add them as a parallel subtask to a first processing path in a feature processing path table, delete the associations between those feature tables and the other feature tables in the feature dependency tree, and return to the step of determining the feature tables in the feature dependency tree that currently have no dependency relationships, adding the feature tables that currently have no dependency relationships as a parallel subtask to a second processing path in the feature processing path table, until all feature tables in the dependency tree have been added to the feature processing path table.
  8. The device according to claim 5, wherein the feature processing unit is further configured to:
    pass the processed features through multiple consecutive processing steps to obtain machine features.
  9. A computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein when the processor executes the computer program, the steps of the method according to any one of claims 1 to 4 are implemented.
  10. A computer-readable storage medium, storing a computer program executable by a computer device, wherein when the program runs on the computer device, the computer is caused to execute the method according to any one of claims 1 to 4.
PCT/CN2020/095934 2019-06-26 2020-06-12 Feature processing method applicable to machine learning, and device WO2020259325A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910562484.1 2019-06-26
CN201910562484.1A CN110275889B (en) 2019-06-26 2019-06-26 Feature processing method and device suitable for machine learning

Publications (1)

Publication Number Publication Date
WO2020259325A1 true WO2020259325A1 (en) 2020-12-30

Family

ID=67963408

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/095934 WO2020259325A1 (en) 2019-06-26 2020-06-12 Feature processing method applicable to machine learning, and device

Country Status (2)

Country Link
CN (1) CN110275889B (en)
WO (1) WO2020259325A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110275889B (en) * 2019-06-26 2023-11-24 深圳前海微众银行股份有限公司 Feature processing method and device suitable for machine learning
CN111581305B (en) * 2020-05-18 2023-08-08 抖音视界有限公司 Feature processing method, device, electronic equipment and medium
CN111752967A (en) * 2020-06-12 2020-10-09 第四范式(北京)技术有限公司 SQL-based data processing method and device, electronic equipment and storage medium
CN111859928A (en) * 2020-07-30 2020-10-30 网易传媒科技(北京)有限公司 Feature processing method, device, medium and computing equipment

Citations (6)

Publication number Priority date Publication date Assignee Title
US20090037466A1 (en) * 2007-07-31 2009-02-05 Cross Micah M Method and system for resolving feature dependencies of an integrated development environment with extensible plug-in features
CN103019651A (en) * 2012-08-02 2013-04-03 青岛海信传媒网络技术有限公司 Parallel processing method and device for complex tasks
CN103645948A (en) * 2013-11-27 2014-03-19 南京师范大学 Dependency-based parallel computing method for intensive data
CN108537543A (en) * 2018-03-30 2018-09-14 百度在线网络技术(北京)有限公司 Method for parallel processing, device, equipment and the storage medium of block chain data
CN108595157A (en) * 2018-04-28 2018-09-28 百度在线网络技术(北京)有限公司 Processing method, device, equipment and the storage medium of block chain data
CN110275889A (en) * 2019-06-26 2019-09-24 深圳前海微众银行股份有限公司 Feature processing method and device suitable for machine learning

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
WO2014178829A1 (en) * 2013-04-30 2014-11-06 Hewlett-Packard Development Company, L.P. Dependencies between feature flags
US10666507B2 (en) * 2017-06-30 2020-05-26 Microsoft Technology Licensing, Llc Automatic reconfiguration of dependency graph for coordination of device configuration


Also Published As

Publication number Publication date
CN110275889B (en) 2023-11-24
CN110275889A (en) 2019-09-24

Similar Documents

Publication Publication Date Title
WO2020259325A1 (en) Feature processing method applicable to machine learning, and device
US11379755B2 (en) Feature processing tradeoff management
US20220391763A1 (en) Machine learning service
US20200050968A1 (en) Interactive interfaces for machine learning model evaluations
US9053171B2 (en) Clustering data points
US11182691B1 (en) Category-based sampling of machine learning data
US10339465B2 (en) Optimized decision tree based models
US9519862B2 (en) Domains for knowledge-based data quality solution
JP2021518024A (en) How to generate data for machine learning algorithms, systems
CN112434024B (en) Relational database-oriented data dictionary generation method, device, equipment and medium
US9754015B2 (en) Feature rich view of an entity subgraph
AU2012217093B2 (en) Method, system and computer program to provide fares detection from rules attributes
US11379466B2 (en) Data accuracy using natural language processing
CN113760891B (en) Data table generation method, device, equipment and storage medium
CN107506484B (en) Operation and maintenance data association auditing method, system, equipment and storage medium
WO2019223104A1 (en) Method and apparatus for determining event influencing factors, terminal device, and readable storage medium
CN114385652A (en) Data blood relationship construction method and system, electronic device and storage medium
US20100106538A1 (en) Determining disaster recovery service level agreements for data components of an application
WO2023098034A1 (en) Business data report classification method and apparatus
CN113641654B (en) Marketing treatment rule engine method based on real-time event
CN115934161A (en) Code change influence analysis method, device and equipment
CN115543428A (en) Simulated data generation method and device based on strategy template
KR20190010091A (en) Anonymization Device for Preserving Utility of Data and Method thereof
CN113901046A (en) Virtual dimension table construction method and device
CN113641705A (en) Marketing disposal rule engine method based on calculation engine

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20832495

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20832495

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 05/04/2022)
