CN106708609B - Feature generation method and system - Google Patents

Publication number: CN106708609B
Application number: CN201510784474.4A
Authority: CN (China)
Legal status: Active
Language: Chinese (zh)
Other versions: CN106708609A
Inventors
冯天恒
王雯晋
乔彦辉
王学庆
周胜臣
方炜超
娄鹏
Current Assignee: Advanced New Technologies Co Ltd
Original Assignee: Alibaba Group Holding Ltd
Application filed by Alibaba Group Holding Ltd
Priority: CN201510784474.4A

Abstract

The application relates to the field of internet technology, and in particular to a feature generation method and system, which solve the problems of insufficient big-data processing capability and low evaluation efficiency when fitness evaluation is performed on a large number of newly generated features. In the embodiment of the present application, the whole iteration process comprises: an initialization task executed by a selected child node, each generation of fitness evaluation tasks executed in parallel by a plurality of child nodes, an iteration task executed by a selected child node, and an output task executed after all fitness evaluation tasks have completed; the master node is responsible for the coordinated scheduling of the entire iterative process. Because each generation of fitness evaluation tasks can be executed by a plurality of child nodes in parallel, the efficiency of the whole feature generation process is improved; and because the master node indicates the feature expressions to be evaluated to the child nodes executing the fitness evaluation tasks in the form of encoded individuals, the amount of data transmitted can be reduced.

Description

Feature generation method and system
Technical Field
The present application relates to the field of internet technologies, and in particular, to a feature generation method and system.
Background
With the development of internet information technology, the variety of business services provided to users through the internet keeps increasing, and how to better provide such services is an important problem for the internet industry. Model-based classification can effectively improve the level of business service: for example, users can be classified by income level into high, middle and low categories, and different information recommendation services can then be provided based on a user's income category.
When classification is performed with a model, a number of features must be input, and a good feature set can effectively improve the accuracy of model classification. In many cases the amount of information contained in a single feature is limited, while significant classification performance can be achieved after transformation through feature combination. Therefore, new features can be generated from the original feature set so that they reflect the classification capability implicit in it. At the same time, to prevent the large number of invalid or redundant features produced by such transformations from degrading classification accuracy, fitness evaluation must be performed on the newly generated features.
At present, fitness evaluation of a large number of newly generated features generally suffers from insufficient big-data processing capacity and low evaluation efficiency, which limits further optimization of the new features and prevents valuable features from being obtained in a timely and effective manner.
Disclosure of Invention
The embodiment of the application provides a feature generation method and system to solve the problems of insufficient big-data processing capacity and low evaluation efficiency in fitness evaluation of a large number of newly generated features, and further provides a feature generation algorithm for effectively obtaining high-value new features.
The embodiment of the application provides a feature generation method, which comprises the following steps:
Step A: after receiving the evaluation results sent by the plurality of child nodes executing the N-th generation fitness evaluation tasks, the master node issues an output task to a selected child node if N equals the maximum number of iterations, and otherwise issues an iteration task to a selected child node.
Step B: the child node executing the output task determines and outputs the n feature expressions with the highest fitness based on the evaluation results of the N-th generation fitness evaluation tasks; these are the first n feature expressions when ranked from high to low by fitness.
Step C: the child node executing the iteration task generates, based on the evaluation results of the N-th generation fitness evaluation tasks, an encoding file containing a plurality of encoded individuals, and sends it to the master node; the encoded individuals include the n encoded individuals corresponding to the n feature expressions with the highest fitness evaluated in the N-th generation.
Step D: the master node generates a plurality of (N+1)-th generation fitness evaluation tasks based on the encoding file and issues each to a different child node, each fitness evaluation task containing one encoded individual.
Step E: each child node executing a fitness evaluation task calculates the fitness of the feature expression indicated by the encoded individual in its allocated task and sends the calculated fitness to the master node as an evaluation result; N is then incremented by 1 and the process returns to Step A.
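Steps A-E above can be sketched as a single-process simulation (the function names and the idea of passing the per-node work as callables are illustrative assumptions; in the embodiment the work is distributed across child nodes):

```python
def run_master(max_iterations, init_individuals, evaluate_on_nodes,
               run_iteration_task, run_output_task):
    """Simplified sketch of the master node's control loop (steps A-E)."""
    individuals = init_individuals        # encoded individuals for generation 1
    for n in range(1, max_iterations + 1):
        # Step E (simulated): each individual is one fitness evaluation task;
        # in the embodiment these run in parallel on different child nodes.
        results = evaluate_on_nodes(individuals)
        if n == max_iterations:
            # Steps A/B: last generation, so issue the output task.
            return run_output_task(results)
        # Steps A/C/D: otherwise issue an iteration task; the selected child
        # node returns the encoded individuals for generation n + 1.
        individuals = run_iteration_task(results)
```

Here `evaluate_on_nodes` stands in for the parallel fitness evaluation and `run_iteration_task` for the genetic step described later; both are placeholders.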
Optionally, the encoded individuals are generated in a depth-first coding (DFP) mode. In Step C, generating an encoding file containing a plurality of encoded individuals based on the evaluation results of the N-th generation fitness evaluation tasks comprises:
Step C1: based on the evaluation results of the N-th generation fitness evaluation tasks, the child node executing the iteration task selects the n feature expressions with the highest fitness from the m feature expressions evaluated in the N-th generation;
Step C2: randomly select two feature expressions from the m feature expressions, select one sub-expression from each according to a preset crossover probability and cross them, and retain one feature expression after the random crossover; repeat this step m-n times to obtain m-n retained feature expressions after random crossover;
Step C3: perform mutation on elements of the retained m-n feature expressions according to a preset mutation probability to obtain m-n randomly mutated feature expressions;
Step C4: take the encoded individuals corresponding to the n feature expressions with the highest fitness together with the m-n mutated feature expressions as the m encoded individuals contained in the (N+1)-th generation fitness evaluation tasks.
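The random crossover of step C2 can be illustrated on feature expressions represented as nested tuples `(operator, left, right)`; the representation and the function names are assumptions for illustration, not the implementation of the embodiment:

```python
import random

def subtrees(expr, path=()):
    """Yield (path, subtree) pairs for every node of a nested-tuple tree."""
    yield path, expr
    if isinstance(expr, tuple):           # (operator, operand, operand, ...)
        for i, child in enumerate(expr[1:], start=1):
            yield from subtrees(child, path + (i,))

def replace_at(expr, path, new):
    """Return a copy of expr with the subtree at `path` replaced by `new`."""
    if not path:
        return new
    i = path[0]
    return expr[:i] + (replace_at(expr[i], path[1:], new),) + expr[i + 1:]

def crossover(a, b, rng=random):
    """Step C2 sketch: pick one sub-expression from each of two feature
    expressions, cross them, and retain one expression after the crossover."""
    path_a, _ = rng.choice(list(subtrees(a)))
    _, sub_b = rng.choice(list(subtrees(b)))
    return replace_at(a, path_a, sub_b)
```

For example, X1 + X2 × X3 would be represented as `('+', 'X1', ('*', 'X2', 'X3'))`.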
Optionally, in Step C3, performing mutation on elements of the retained m-n feature expressions after random crossover comprises:
for any feature expression, randomly selecting one of the following processing modes for mutation:
replacing a single feature node in the feature expression with a sub-expression, where a single feature node refers to one datum or one operator in the feature expression;
reducing a sub-expression in the feature expression to a single feature node;
replacing a single feature node in the feature expression with a randomly generated single feature node;
replacing the feature expression with a randomly generated new feature expression.
Optionally, in Step C1, selecting the n feature expressions with the highest fitness from the m feature expressions evaluated in the N-th generation comprises:
if feature expressions with identical fitness exist among the m feature expressions, removing the k redundant ones so that no two remaining feature expressions share the same fitness;
selecting the n feature expressions with the highest fitness from the remaining feature expressions, and replacing m with m-k in Steps C2-C4.
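This deduplicate-then-select variant of step C1 can be sketched as follows (an illustrative reading that keeps one expression per fitness value; the names are assumptions):

```python
def select_top_n(scored, n):
    """scored: list of (feature_expression, fitness) pairs.  Remove redundant
    expressions so that no two share the same fitness (keeping the first of
    each), then return the n expressions with the highest fitness and k,
    the number of expressions removed."""
    seen, unique = set(), []
    for expr, fit in scored:
        if fit not in seen:
            seen.add(fit)
            unique.append((expr, fit))
    k = len(scored) - len(unique)
    unique.sort(key=lambda pair: pair[1], reverse=True)
    return [expr for expr, _ in unique[:n]], k
```

The returned k is what the text subtracts from m in the subsequent steps.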
Optionally, before Step A, the method further includes:
after receiving the feature generation task, the master node acquires the data file required for executing it from a data server and transmits the acquired data file to each cluster computing machine in the cluster system;
in Step E, performing the fitness calculation by the child node executing a fitness evaluation task comprises:
the child node reads from the cluster computing machine the feature data indicated by the encoded individual in its allocated fitness evaluation task, substitutes the read feature data into the feature expression corresponding to the encoded individual, and calculates the fitness of the resulting expression by calling the fitness evaluation function on the cluster computing machine.
Optionally, before Step A, the method further includes:
the master node issues an initialization task corresponding to the received feature generation task to a selected child node;
the child node executing the initialization task randomly generates an encoding file containing a plurality of initialized encoded individuals by calling an initialization function on the cluster computing machine;
the master node generates a plurality of first-generation fitness evaluation tasks based on the initialized encoded individuals and issues each generated task to a different child node.
Optionally, in Step B, determining and outputting the n feature expressions with the highest fitness based on the evaluation results of the N-th generation fitness evaluation tasks comprises:
the child node executing the output task determines the n feature expressions with the highest fitness by reading the evaluation results of the N-th generation fitness evaluation tasks stored in the file system by the master node, outputs a feature generation result report fed back to the user indicating those n feature expressions, and outputs the feature data corresponding to them for subsequent use.
An embodiment of the present application provides a feature generation system, including:
a master node, configured to issue an output task to a selected child node if, after receiving the evaluation results sent by the plurality of child nodes executing the N-th generation fitness evaluation tasks, N is determined to equal the maximum number of iterations, and otherwise to issue an iteration task to a selected child node; and further configured to generate a plurality of (N+1)-th generation fitness evaluation tasks based on the encoding file generated by the child node executing the iteration task and to issue each to a different child node, each fitness evaluation task containing one encoded individual;
a child node configured to execute the output task, determining and outputting the n feature expressions with the highest fitness based on the evaluation results of the N-th generation fitness evaluation tasks, these being the first n feature expressions when ranked from high to low by fitness;
a child node configured to execute the iteration task, generating an encoding file containing a plurality of encoded individuals based on the evaluation results of the N-th generation fitness evaluation tasks and sending it to the master node, the encoded individuals including the n encoded individuals corresponding to the n feature expressions with the highest fitness evaluated in the N-th generation; and
child nodes configured to execute the fitness evaluation tasks, each calculating the fitness of the feature expression indicated by the encoded individual in its allocated task and sending the calculated fitness to the master node as an evaluation result.
By adopting the method or the system, each generation of fitness evaluation tasks can be executed by a plurality of child nodes in parallel, which improves fitness evaluation efficiency, improves the efficiency of the whole feature generation process, and ensures the timeliness of new feature generation. When the master node issues fitness evaluation tasks to the child nodes, it does not transmit the feature data directly to the child nodes executing them; instead it indicates the feature expressions to be evaluated in the form of encoded individuals, which reduces the amount of data transmitted, increases transmission efficiency and reduces memory occupation. In addition, the iterative algorithm based on the depth-first coding (DFP) mode provided by the embodiment of the application effectively guarantees the integrity of sub-expressions, and the optimal feature expressions of the previous iteration are retained in each iteration, so that the optimal feature expressions of the whole iterative process are obtained after the last iteration completes.
Drawings
Fig. 1 is a flowchart of a feature generation method provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of task scheduling based on an iterative computation framework;
FIG. 3 is a schematic diagram of crossover between feature expressions;
FIG. 4 is a schematic diagram of the first mutation mode;
FIG. 5 is a schematic diagram of the second mutation mode;
FIG. 6 is a schematic diagram of the third mutation mode;
fig. 7 is a schematic structural diagram of a feature generation system according to an embodiment of the present application.
Detailed Description
In the embodiment of the present application, the whole iteration process comprises: an initialization task executed by a selected child node, each generation of fitness evaluation tasks executed in parallel by a plurality of child nodes, an iteration task executed by a selected child node, and an output task executed after all fitness evaluation tasks have completed; the master node is responsible for distributing suitable tasks to each child node and for the coordinated scheduling of the whole iterative process. Because each generation of fitness evaluation tasks can be executed by a plurality of child nodes in parallel, the fitness evaluation efficiency is improved, the efficiency of the whole feature generation process is improved, and the timeliness of new feature generation is ensured. When the master node issues fitness evaluation tasks to the child nodes, it does not transmit the feature data directly; instead it indicates the feature expressions to be evaluated in the form of encoded individuals, which reduces the amount of data transmitted, increases transmission efficiency and reduces memory occupation.
In addition, for the iteration task, the embodiment of the application provides an iterative algorithm based on a depth-first coding (DFP) mode, also called the feature generation algorithm. The algorithm encodes feature expressions in the DFP mode, which effectively preserves the integrity of sub-expressions during feature generation; and, exploiting genetic inheritance, the optimal feature expressions of the previous iteration are retained in each iteration, so that the optimal feature expressions of the whole iterative process are obtained once the last iteration finishes.
To make the purpose, technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the attached drawings. All other embodiments obtained by a person of ordinary skill in the art, based on the embodiments of the present application and without creative effort, fall within the protection scope of the present application.
Fig. 1 is a flowchart of the feature generation method provided in an embodiment of the present application, and Fig. 2 is a schematic diagram of task scheduling based on an iterative computation framework. The method includes the following steps:
s101: and the main node issues the initialization task corresponding to the characteristic generation task received by the main node to the selected sub-node.
In the above steps, the master node may randomly select a child node that executes the initialization task or select a child node that executes the initialization task according to the load condition of each child node, and may indicate the following parameter information as shown in table one to the child node that executes the initialization task:
Table one: parameter information indicated to the child node executing the initialization task

Parameter   Type     Meaning
filename    String   name of the data file where the data samples are located
fieldSize   Int      number of fields in the data file (each field identifies one data feature)
popSize     Int      number of encoded individuals generated in the initialization task
parameter   String   algorithm parameters, stored as key-value pairs
Here, the string (String) parameter filename indicates the data file where the data samples are located. The integer (Int) parameter fieldSize indicates the number of fields in the data file, where each field identifies one data feature; the feature expressions described below are expressions over multiple data features and operators. For example, in the feature expression X4 = X1 + X2 × X3, X1, X2 and X3 are different data features, and "+" and "×" are operators. The integer (Int) parameter popSize, the number of individuals in the population, indicates the number of encoded individuals generated in the initialization task, each encoded individual corresponding to one feature expression. The algorithm parameter (parameter) is a string-type parameter stored in the form of key-value pairs, and may contain, for example, the maximum depth of the generated feature expressions.
The parameter information may be entered by a user through a front-end interface. Specifically, the user imports data and sets requirements in a web client under the guidance of the front-end interface, and finally initiates a task request to the back end. This comprises three sub-processes: submitting data, selecting an algorithm, and setting parameters. Submitting data means that the user enters, through the front-end interface, the name of the data table (data file) corresponding to the data to be processed, selects the fields to be processed and sets the field types. Selecting an algorithm means that after the data are submitted, candidate algorithms are suggested according to the selected field types; the user may choose a corresponding algorithm according to actual requirements, or submit a self-defined one. In the embodiment of the application, the algorithm to be selected by the user is mainly the fitness evaluation algorithm, and a dedicated feature generation algorithm is provided for feature generation. Setting parameters means configuring the parameters of the selected algorithm; all parameters have default values for the user's reference.
After these three sub-processes are finished, all the related information is collected into one task request and sent to the back end, after which the back-end computation starting from S101 is executed. The back-end operation is the process of calling the related algorithms to perform computation; it is isolated from the user, but the running state of a task can be queried through the front-end interface and is displayed to the user in the form of a rolling log. After all fitness evaluation tasks have been executed, result output can be performed: a result file (such as the json file introduced later) is read, parsed and displayed to the user in a visual form.
S102: the child node executing the initialization task generates an encoding file containing a plurality of initialized encoded individuals by calling the initialization function on the cluster computing machine.
Here, the slave node executing the initialization task calls the relevant algorithm source code (that is, the initialization function) in the source-code library on the cluster computing machine, via a script, according to the master node's instruction, to generate the encoding file; an intermediate data file may be generated at the same time, and both are returned to the master node. The intermediate data file contains intermediate results that may be needed for subsequent analysis; in the initialization task, since no intermediate result has been generated yet, the data in it can be set to null or default values, and it can be stored in the file system of the master node for use by subsequent iterative computation. In the subsequent scheduling process, the encoding file and intermediate data file generated by each iteration are likewise stored in the master node's file system, so that if a system fault occurs, computation can resume from the iteration preceding the fault, giving strong recoverability.
For example, suppose a data table has four fields [Y, X1, X2, X3], each corresponding to one data feature. For the feature expression X4 = X1 + X2 × X3, if the code of "+" is -1, the code of "×" is -3, and the codes of X1, X2 and X3 are 1, 2 and 3 respectively, then the feature expression is encoded in the depth-first coding (DFP) mode as [-1, 1, -3, 2, 3].
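Under the codes of this example (-1 for "+", -3 for "×", and the field index for each feature), a DFP-encoded individual can be decoded and evaluated recursively. The decoder below is a sketch consistent with the example only, not code from the embodiment:

```python
OPS = {-1: lambda a, b: a + b, -3: lambda a, b: a * b}  # codes from the example

def decode_eval(code, row):
    """Evaluate a depth-first (prefix) encoded individual against one data
    row, where row[i] holds the value of feature Xi; negative codes are
    operators and positive codes are feature indices (as in the example)."""
    it = iter(code)
    def walk():
        c = next(it)
        if c < 0:                          # operator: read both operands first
            left, right = walk(), walk()
            return OPS[c](left, right)
        return row[c]                      # feature node
    return walk()
```

For X4 = X1 + X2 × X3 encoded as [-1, 1, -3, 2, 3] and a row with X1=1, X2=2, X3=3, this yields 7.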
S103: the master node generates a plurality of first-generation fitness evaluation tasks based on the initialized encoded individuals and issues each generated task to a different child node; each fitness evaluation task contains one encoded individual.
If the application is applied to model classification, a Gini coefficient can be adopted to measure the fitness.
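The exact Gini formulation is not specified here; one common choice for binary classification is the rank-based Gini coefficient, Gini = 2*AUC - 1. The sketch below assumes that formulation (all names are illustrative):

```python
def gini_fitness(scores, labels):
    """Rank-based Gini coefficient for a binary target: Gini = 2*AUC - 1,
    with AUC computed from pairwise comparisons of positive vs. negative
    samples (ties count half).  The exact formula used in the embodiment
    is an assumption here."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg))
    return 2 * auc - 1
```

A perfectly separating feature scores 1, and an uninformative one scores 0.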
In the embodiment of the application, to reduce the amount of data transmitted, the master node does not transmit the feature data directly to the child nodes executing the fitness evaluation tasks when issuing them; instead it indicates the feature expression to be evaluated in the form of an encoded individual, which reduces the amount of data transmitted, increases transmission efficiency and reduces memory occupation.
In S103, the master node may wait for the initialization task to finish in thread-sleep mode, then determine the number of first-generation fitness evaluation tasks according to the number of encoded individuals in the encoding file, and create each fitness evaluation task. Specifically, for each fitness evaluation task a subtask identifier (ID) and a population individual identifier (ID) may be generated; the subtask ID, population individual ID, encoded individual, algorithm parameters and so on are stored in the task queue as the execution information of the task, and each fitness evaluation task is then taken out of the queue and distributed to a child node. Here, the child nodes that execute each generation of fitness evaluation tasks may be selected by the master node randomly or according to the current load of each child node.
Table two shows the parameter information received by a child node executing a fitness evaluation task:
Table two: parameter information received by a child node executing a fitness evaluation task

Parameter   Type     Meaning
filename    String   name of the data file
jobID       String   subtask identifier
popID       String   population individual identifier
individual  String   encoded individual
parameter   String   algorithm parameters, stored as key-value pairs
Among the above parameters, the data file name (filename), task identifier (jobID), population individual identifier (popID), encoded individual (individual) and algorithm parameter (parameter) are all string-type parameters.
S104: each child node executing a fitness evaluation task calculates the fitness of the feature expression indicated by the encoded individual in its allocated task and sends the calculated fitness to the master node as an evaluation result.
Here, the child node executing the fitness evaluation task processes the distributed task by calling the corresponding fitness evaluation function, writes the fitness result, task ID, population individual ID and encoded individual into a json file named by the task ID, and returns it to the master node. Using the task ID, the master node may send an execution-progress query request containing the task ID to any child node executing a fitness evaluation task, and receive the execution-progress information that the node returns based on that ID. Using the population individual ID, after receiving an evaluation result containing the population individual ID and encoded individual, the master node may match them against the population individual ID and encoded individual in the task it issued to that child node, to check the correctness of the evaluation result fed back.
In S104, the child node executing a fitness evaluation task may read only the feature data of the fields indicated by the encoded individual, rather than the feature data of all fields, reducing memory occupation and better supporting parallel task processing. Preferably, after receiving the feature generation task, the master node first obtains the data file required for executing it from the data server and transmits it to each cluster computing machine in the cluster system; correspondingly, the child node reads the feature data indicated by the encoded individual in its allocated task from the cluster computing machine, substitutes it into the corresponding feature expression, and calculates the fitness by calling the fitness evaluation function on the cluster computing machine. To make this reading convenient, the master node downloads the feature data from the data server to the cluster computing machine where the child node is located in advance. In practice the child node could also read the required feature data directly from the data server, but this would reduce evaluation efficiency considerably.
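The idea of reading only the fields an encoded individual references can be sketched under the example coding scheme, where positive codes are field indices (an assumption consistent with the earlier DFP example):

```python
def fields_used(code):
    """Field indices referenced by a DFP-encoded individual; positive codes
    are feature indices and negative codes are operators (per the example)."""
    return sorted({c for c in code if c > 0})

def load_columns(table, code):
    """Read only the columns the individual needs (the memory-saving idea of
    S104); `table` maps field index to a column of feature data."""
    return {i: table[i] for i in fields_used(code)}
```

A child node can then evaluate the expression against just these columns instead of the full data file.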
S105: after receiving the evaluation results sent by the plurality of child nodes executing the N-th generation fitness evaluation tasks, the master node judges whether N equals the maximum number of iterations; if so, it proceeds to S106, otherwise to S108. Here N is a positive integer greater than or equal to 1, increasing with the number of iterations.
In a specific implementation, when all fitness evaluation tasks issued by the master node in a given iteration have been executed, the master node collects the evaluation results returned by all child nodes and generates a csv file stored in the file system for later use. The master node then judges the iteration process: if the termination condition is met, the iteration terminates and a child node is instructed to execute the output task; if not, a child node is instructed to execute the iteration task, that is, to generate a new encoding file and intermediate data file and return them to the master node.
S106: the master node issues the output task to a selected child node.
Here, the child node executing the output task may be selected by the master node randomly or according to the current load of each child node.
Table three below shows the parameter information received by the child node executing the output task:
Table three: parameter information received by the child node executing the output task

Parameter   Type     Meaning
parameter   String   algorithm parameters, stored as key-value pairs (for example, the number of feature expressions with the highest fitness to retain)
the algorithm parameter (parameter) in table three is a String type parameter, and is stored in the form of a key-value pair, for example, the number of feature expressions with the highest retained fitness may be used, and the return value of the evaluation function is the fitness.
S107: the child node executing the output task determines and outputs the n feature expressions with the highest fitness based on the evaluation results of the N-th generation fitness evaluation tasks, where n is a positive integer greater than or equal to 1.
Here, since each iteration in the embodiment of the present application inherits the optimal n feature expressions from the previous iteration, the n feature expressions with the best evaluation results across all iterations can be selected directly from the evaluation results of the last iteration.
In a specific implementation, if the iterative process terminates, the master node collects the evaluation results and stores them in the file system; the child node executing the output task determines the n feature expressions with the highest fitness by reading those results, outputs a feature generation result report fed back to the user indicating them, and outputs the features corresponding to them for subsequent use. For example, the child node executing the output task outputs a json file and a csv file: the json file stores the formatted result and, after being returned to the front end, generates the feature generation result report displayed to the user; the csv file stores the feature data corresponding to the finally generated and retained feature expressions and is uploaded to the server for subsequent use by the user. Afterwards, the system can automatically delete all related files and release the hard-disk space.
S108: and the main node issues an iteration task to the selected child node.
Here, the child node that executes the iterative task may be selected randomly by the master node, or may be selected by the master node according to the current load of each child node.
As shown in Table Four, the parameter information received by the child node executing the iterative task is:
[Table Four: parameter information for the iterative task — image not reproduced]
The algorithm parameters in Table Four may include the crossover probability (pCross) and the mutation probability (pMutation), which are described below in the description of the iterative algorithm, i.e., the feature generation algorithm.
S109: the sub-node executing the iterative task generates a coding file containing a plurality of coding individuals based on the evaluation result of the Nth generation fitness evaluation task, and sends the coding file to the main node; the plurality of coding individuals comprise N coding individuals corresponding to N characteristic expressions with highest fitness evaluated by the N-th generation fitness evaluation task.
In this step, the child node selected to execute the iterative task calls an iterative function through a script and, according to the evaluation result (that is, the fitness) of the Nth-generation fitness evaluation task, generates the coding file for executing the (N+1)th-generation fitness evaluation task. It may also call the intermediate data file that the master node stored when the (N-1)th-generation fitness evaluation task was executed, generate the intermediate data file for the Nth-generation fitness evaluation task, and return it to the master node. The intermediate data file may include intermediate results that may be needed for subsequent analysis, such as the n feature expressions with the best evaluation results and their respective fitness values.
In the embodiment of the application, in order to keep high-fitness feature expressions in subsequent iterations, the n feature expressions with the highest fitness (that is, the best evaluation results) from the previous iteration are retained in each iteration; besides these n feature expressions, new feature expressions can be randomly generated or obtained through transformation in each iteration. In the preferred embodiment of the present application, in order to retain advantageous features while continuing to generate features with higher fitness, an iterative algorithm based on a Depth-First Programming (DFP) coding mode is innovatively proposed. Different from the conventional coding mode based on Gene Expression Programming (GEP), the DFP-based coding method traverses all branches of the feature expression in a depth-first manner, and can effectively and completely retain sub-expressions.
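A minimal sketch of the idea behind DFP coding (the `Node` and `dfp_encode` names are assumptions for illustration, not the patent's code): an expression tree serialized in depth-first order keeps every complete sub-expression as one contiguous slice of the encoding:

```python
class Node:
    """Expression-tree node: an operator with children, or a leaf feature."""
    def __init__(self, value, children=()):
        self.value = value
        self.children = list(children)

def dfp_encode(node):
    """Depth-first (pre-order) flattening of the expression tree."""
    tokens = [node.value]
    for child in node.children:
        tokens.extend(dfp_encode(child))
    return tokens

# (A - B) / (C + D), the example expression used later in the text
tree = Node("/", [Node("-", [Node("A"), Node("B")]),
                  Node("+", [Node("C"), Node("D")])])
tokens = dfp_encode(tree)
# Every complete sub-expression is one contiguous slice, e.g. (C + D):
sub = tokens[4:7]
```

This contiguity is what lets later crossover and mutation steps retain sub-expressions whole.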
The iterative task execution process based on the DFP mode in the embodiment of the application specifically comprises the following steps:
Step 1: the child node executing the iterative task selects, based on the evaluation result of the Nth-generation fitness evaluation task, the n feature expressions with the highest fitness from the m feature expressions evaluated by that task. Here, m is a positive integer greater than n.
In a traditional genetic algorithm, dominant individuals are selected by assigning selection probabilities to the individuals in the population from high to low according to fitness, and then drawing individuals according to those probabilities. It is often difficult to reliably retain a dominant individual in this manner. The method adopted in the embodiment of the application instead directly retains the optimal feature expressions generated by the previous iteration, rather than using probabilistic selection. Preferably, to avoid producing too many redundant features and increasing the complexity of feature generation, redundant features may be removed after step 1. Specifically, if feature expressions with the same fitness exist among the m feature expressions, k redundant feature expressions are removed so that no feature expressions with the same fitness remain; the n feature expressions with the highest fitness are then selected from the remaining expressions, and m is replaced by m-k in steps 2-4. Here, k is a positive integer smaller than m.
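Step 1 with the optional redundancy removal might look like the following sketch; the helper name and the pair-based representation are assumptions:

```python
def select_elite(scored, n):
    """`scored` is a list of (feature_expression, fitness) pairs from the
    Nth-generation evaluation. Remove duplicate-fitness expressions, then
    keep the n fittest."""
    seen, unique = set(), []
    for expr, fit in scored:
        if fit in seen:
            continue                      # redundant: same fitness already kept
        seen.add(fit)
        unique.append((expr, fit))
    unique.sort(key=lambda pair: pair[1], reverse=True)
    return unique[:n], unique             # elites, deduplicated population

elites, population = select_elite(
    [("A+B", 0.5), ("A-B", 0.5), ("C*D", 0.9), ("A/D", 0.7)], n=2)
```

Treating equal fitness as redundancy is the simple criterion stated in the text; a real system might compare the expressions themselves.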
Step 2: randomly select two feature expressions from the m feature expressions, select one sub-expression from each of the two according to a preset crossover probability for crossover, and retain one feature expression after the random crossover; repeat this step m-n times to obtain m-n retained feature expressions after random crossover.
To ensure the legality of the newly generated feature expressions, DFP-based crossover finds a complete sub-expression in each of the two feature expressions and crosses them; the expressions are represented in the form of expression trees, as shown in FIG. 3. Since a depth-first coding method is adopted, a complete sub-expression is a contiguous character string in the encoding and is therefore easy to locate.
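Because a complete sub-expression is a contiguous slice of a depth-first (prefix) encoding, locating and exchanging sub-expressions reduces to slice arithmetic. The sketch below assumes binary operators and illustrative helper names:

```python
import random

ARITY = {"+": 2, "-": 2, "*": 2, "/": 2}   # binary operators; features have arity 0

def subtree_span(tokens, start):
    """Length of the complete sub-expression beginning at `start` in a
    depth-first (prefix) token list: scan until all operands are supplied."""
    need, i = 1, start
    while need:
        need += ARITY.get(tokens[i], 0) - 1
        i += 1
    return i - start

def crossover(parent_a, parent_b, rng=random):
    """Pick one complete sub-expression in each parent and splice the one
    from parent_b into parent_a, yielding one retained child."""
    ia = rng.randrange(len(parent_a))
    ib = rng.randrange(len(parent_b))
    la = subtree_span(parent_a, ia)
    lb = subtree_span(parent_b, ib)
    return parent_a[:ia] + parent_b[ib:ib + lb] + parent_a[ia + la:]

a = ["/", "-", "A", "B", "+", "C", "D"]    # (A-B)/(C+D)
b = ["+", "A", "B"]                        # A+B
child = crossover(a, b)                    # always a well-formed expression
```

Whatever slice points are drawn, the child remains a legal expression, which is the property the text attributes to DFP coding.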
Step 3: perform mutation processing on elements in the retained m-n feature expressions after random crossover according to a preset mutation probability, to obtain m-n feature expressions after random mutation.
In this step, for the m-n feature expressions other than the n feature expressions with the highest fitness, each feature expression is mutated according to the mutation probability (pMutation). For example, among the m-n feature expressions, a feature expression Xi is selected and a random number p uniformly distributed on [0, 1] is generated; if p < pMutation, Xi is mutated.
To ensure the legality of the mutated feature expression and the integrity of its sub-expressions, the mutation modes can be divided into the following four types, one of which is randomly selected when a mutation is performed:
First: replace a single feature node in the feature expression with a sub-expression; a single feature node refers to a data feature or an operator in the feature expression.
In this mutation mode, a single feature node in the feature expression is randomly replaced with a sub-expression, as shown in FIG. 4. Since an excessively complicated feature has disadvantages in interpretability and the like, the maximum depth of the added sub-expression is preferably 2.
Second: reduce a sub-expression in the feature expression to a single feature node.
As shown in FIG. 5, this mutation operation can be regarded as the inverse of the first. For ease of coding, when a sub-expression tree is pruned and its starting node has both a left and a right subtree, the current node and its left subtree may be chosen for pruning.
Third: replace a single feature node in the feature expression with a randomly generated single feature node.
In this mutation mode, a data feature node remains a data feature node after mutation, and an operator node remains an operator node; the number of subtrees of the node cannot change. As shown in FIG. 6, the operator "-" is replaced with "/".
Fourth: replace the feature expression with a randomly generated new feature expression.
Since the first three mutation modes operate on the original m feature expressions, a new feature expression may be randomly generated with a certain probability to replace a selected one, in order to avoid local optima.
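The four mutation modes can be sketched over a prefix encoding as follows. All helper names are assumptions, and two simplifications are made for brevity: mode one replaces only leaf (data-feature) nodes, and new sub-expressions are capped at the preferred depth of 2:

```python
import random

FEATURES = ["A", "B", "C", "D"]                # available data features (assumed)
OPERATORS = {"+": 2, "-": 2, "*": 2, "/": 2}   # operator -> arity

def subtree_span(tokens, start):
    """Length of the complete sub-expression starting at `start`."""
    need, i = 1, start
    while need:
        need += OPERATORS.get(tokens[i], 0) - 1
        i += 1
    return i - start

def random_expression(max_depth=2, rng=random):
    """Random feature expression in depth-first (prefix) form."""
    if max_depth == 0 or rng.random() < 0.5:
        return [rng.choice(FEATURES)]
    op = rng.choice(sorted(OPERATORS))
    return ([op] + random_expression(max_depth - 1, rng)
                 + random_expression(max_depth - 1, rng))

def mutate(tokens, p_mutation, rng=random):
    """Apply one randomly chosen mutation mode with probability p_mutation."""
    if rng.random() >= p_mutation:
        return list(tokens)
    mode = rng.randrange(4)
    if mode == 0:
        # 1) replace a single (leaf) feature node with a sub-expression, depth <= 2
        leaves = [i for i, t in enumerate(tokens) if t in FEATURES]
        i = rng.choice(leaves)
        return tokens[:i] + random_expression(2, rng) + tokens[i + 1:]
    if mode == 1:
        # 2) reduce a complete sub-expression to a single feature node
        i = rng.randrange(len(tokens))
        span = subtree_span(tokens, i)
        return tokens[:i] + [rng.choice(FEATURES)] + tokens[i + span:]
    if mode == 2:
        # 3) replace a single node with a random node of the same kind (arity kept)
        i = rng.randrange(len(tokens))
        pool = sorted(OPERATORS) if tokens[i] in OPERATORS else FEATURES
        return tokens[:i] + [rng.choice(pool)] + tokens[i + 1:]
    # 4) replace the whole expression with a randomly generated one
    return random_expression(2, rng)
```

Each branch edits a complete sub-expression or a single node of fixed arity, so every output remains a legal expression.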
Step 4: determine the coding individuals corresponding respectively to the n feature expressions with the highest fitness and the m-n feature expressions after random mutation as the m coding individuals contained in the (N+1)th-generation fitness evaluation task.
S110: the master node generates a plurality of (N+1)th-generation fitness evaluation tasks based on the coding file, sends each of them to a different child node, executes S104, then increments N by 1 and returns to step S105.
To further illustrate the ideas of the embodiments of the present application, the following is further described by a specific example.
As shown in Table Five, there are 50 data samples, each belonging to one of three iris species: Iris setosa, Iris versicolor and Iris virginica. Each sample initially has four data features: sepal length, sepal width, petal length and petal width.
[Table Five: iris sample data — image not reproduced]
In the front-end interface, the user selects the feature generation algorithm based on depth-first coding (DFP) and selects the Gini coefficient to evaluate fitness. Meanwhile, the user sets the crossover probability pCross and the mutation probability pMutation; for example, pCross = 0.5 and pMutation = 0.6.
In the initialization task, 50 feature expressions are randomly generated according to the DFP coding mode, where each feature expression represents a newly generated feature. For example, the DFP coding corresponding to the newly generated feature expression (A-B)/(C+D) is shown in Table Six below:
[Table Six: DFP coding of the feature expression (A-B)/(C+D) — image not reproduced]
In the fitness evaluation task, the fitness Gini(i) of each of the 50 feature expressions is calculated, where i = {1, 2, 3, ..., 49, 50}. In the iteration task, the feature expressions are selected, crossed and mutated to search for optimal new features. Specifically, the optimal 5 feature expressions (namely the 5 with the highest Gini coefficient) are first selected and retained. Then two feature expressions Xi and Xj, for example Xi = (A-B)*(C+D) and Xj = A+(C+D-B), are randomly selected from the 50, and a random number p uniformly distributed on [0, 1] is generated. If p < pCross, feature crossover is performed on Xi and Xj: as shown in FIG. 3, a complete sub-expression is selected from each, for example the left subtree (namely the left sub-expression) of Xi and the right subtree (C+D-B) of Xj, the two sub-expressions are exchanged, and one of the resulting new expressions is retained at random; otherwise, a feature expression is retained directly without crossover. This is repeated 45 times to obtain the 45 feature expressions other than the retained optimal 5. Each of these 45 feature expressions is then mutated according to the mutation probability pMutation: for each, a random number p uniformly distributed on [0, 1] is generated, and if p < pMutation, the expression is mutated using one of the four mutation modes described above.
As shown in Table Seven, after the newly generated feature expressions are added, the accuracy of model classification is significantly improved compared with the original features.
[Table Seven: model classification accuracy with the original and newly generated features — image not reproduced]
Based on the same inventive concept, an embodiment of the present application further provides a feature generation system corresponding to the feature generation method. Since the principle by which the system solves the problem is similar to that of the feature generation method in the embodiment of the present application, the implementation of the system may refer to the implementation of the method, and repeated details are not described again.
As shown in fig. 7, a schematic structural diagram of a feature generation system provided in the embodiment of the present application includes:
the master node 71 is configured to, after receiving evaluation results sent by a plurality of child nodes executing an nth-generation fitness evaluation task, issue an output task to a selected child node if it is determined that N is equal to the maximum iteration number, and otherwise, issue an iteration task to the selected child node; the system is also used for generating a plurality of N + 1-generation fitness evaluation tasks based on the coding files generated by the sub-nodes executing the iterative tasks, and respectively issuing each N + 1-generation fitness evaluation task to different sub-nodes, wherein each fitness evaluation task comprises a coding individual;
the child node 72 for executing the output task is used for determining and outputting the n feature expressions with the highest fitness based on the evaluation result of the Nth-generation fitness evaluation task; the n feature expressions with the highest fitness are the first n feature expressions when arranged from high to low according to fitness;
the child node 73 for executing the iterative task is used for generating a coding file containing a plurality of coding individuals based on the evaluation result of the Nth-generation fitness evaluation task, and sending the coding file to the master node; the plurality of coding individuals include the n coding individuals corresponding to the n feature expressions with the highest fitness evaluated by the Nth-generation fitness evaluation task;
and the sub-node 74 executing the fitness evaluation task is configured to perform fitness calculation for the feature expression indicated by the coding individual in the allocated fitness evaluation task, and send the calculated fitness as an evaluation result to the master node.
Optionally, the coding individuals are generated by adopting a depth-first coding (DFP) mode; the child node 73 that executes the iterative task is specifically configured to:
based on the evaluation result of the N-th generation fitness evaluation task, selecting N characteristic expressions with highest fitness from the m characteristic expressions evaluated by the N-th generation fitness evaluation task;
randomly selecting two feature expressions from the m feature expressions, respectively selecting one sub-expression from the two feature expressions for crossing according to a preset crossing probability, and reserving one feature expression after random crossing; repeating the step m-n times to obtain m-n reserved characteristic expressions after random crossing;
carrying out mutation treatment on elements in the reserved m-n characteristic expressions after random crossing according to a preset mutation probability to obtain m-n characteristic expressions after random mutation;
and determining the coding individuals respectively corresponding to the N characteristic expressions with the highest fitness and the m-N characteristic expressions after the random variation processing as m coding individuals contained in the (N + 1) th generation fitness evaluation task.
Optionally, the child node 73 executing the iterative task is specifically configured to perform mutation processing on the elements in the m-n retained feature expressions after the random intersection according to the following steps:
for any characteristic expression, randomly selecting one of the following processing modes for mutation processing:
replacing a single feature node in the feature expression with a sub-expression; the single characteristic node refers to one data or one operator in the characteristic expression;
reducing a sub-expression in the feature expression into a single feature node;
replacing a single characteristic node in the characteristic expression by a randomly generated single characteristic node;
the feature expression is replaced with a new feature expression generated randomly.
Optionally, the child node 73 for executing the iterative task is specifically configured to select N feature expressions with the highest fitness from the m feature expressions evaluated by the nth-generation fitness evaluation task according to the following steps:
if the feature expressions with the same fitness exist in the m feature expressions, removing redundant k feature expressions so that the feature expressions with the same fitness do not exist in the rest feature expressions; and selecting n characteristic expressions with highest fitness from the rest characteristic expressions, and subtracting k from m.
Optionally, the master node 71 is further configured to:
after receiving the feature generation task, acquiring a data file required for executing the feature generation task from a data server, and transmitting the acquired data file to each cluster computing machine in a cluster system;
the child node 74 that executes the fitness evaluation task is specifically configured to:
and reading the characteristic data indicated by the coding individuals in the allocated fitness evaluation task from the cluster computing machine, substituting the read characteristic data into the characteristic expression corresponding to the coding individual, and calling a fitness evaluation function on the cluster computing machine to calculate the fitness of the characteristic expression into which the characteristic data is substituted.
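The fitness evaluation described here — substitute the sample's feature data into the expression, then score the resulting feature — might be sketched as below. The Gini-based score (impurity decrease at a median split) is an assumed concrete choice, since the text only names the Gini coefficient, and all helper names are illustrative:

```python
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
       "*": lambda a, b: a * b, "/": lambda a, b: a / b if b else 0.0}

def eval_prefix(tokens, row):
    """Evaluate a depth-first (prefix) encoded feature expression on one
    sample, looking leaf feature names up in the `row` dict."""
    def helper(i):
        t = tokens[i]
        if t in OPS:
            left, i = helper(i + 1)
            right, i = helper(i)
            return OPS[t](left, right), i
        return row[t], i + 1
    value, _ = helper(0)
    return value

def gini_impurity(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def fitness(tokens, rows, labels):
    """Assumed scoring: compute the generated feature for every sample and
    return the Gini impurity decrease of a median-threshold split."""
    values = [eval_prefix(tokens, r) for r in rows]
    threshold = sorted(values)[len(values) // 2]
    left = [lab for v, lab in zip(values, labels) if v < threshold]
    right = [lab for v, lab in zip(values, labels) if v >= threshold]
    if not left or not right:
        return 0.0                        # degenerate split: no gain
    n = len(labels)
    children = (len(left) / n) * gini_impurity(left) \
             + (len(right) / n) * gini_impurity(right)
    return gini_impurity(labels) - children
```

A feature that separates the classes perfectly at its median scores the full parent impurity as its gain.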
Optionally, the master node 71 is further configured to: sending the initialization task corresponding to the feature generation task received by the main node to a selected child node;
the system further comprises: the child node 75 for executing the initialization task is configured to randomly generate an encoding file including a plurality of initialized encoding individuals by calling an initialization function on the cluster computing machine where the child node is located;
the master node 71 is further configured to: based on the plurality of initialized coding individuals generated by the child node 75 executing the initialization task, a plurality of first-generation fitness evaluation tasks are generated, and each generated first-generation fitness evaluation task is issued to different child nodes respectively.
Optionally, the child node 72 executing the output task is specifically configured to:
and determining N characteristic expressions with highest fitness by calling the evaluation result of the Nth generation fitness evaluation task stored in the file system by the main node, outputting a characteristic generation result report which is fed back to the user and used for indicating the N characteristic expressions with the highest fitness, and outputting the characteristic data corresponding to the N characteristic expressions with the highest fitness for subsequent calling.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (14)

1. A method of feature generation, the method comprising:
step A, after receiving evaluation results sent by a plurality of sub-nodes executing the N-th generation fitness evaluation task, if N is determined to be equal to the maximum iteration number, the main node issues an output task to a selected sub-node, otherwise, the main node issues an iteration task to the selected sub-node;
b, the child node executing the output task determines and outputs N characteristic expressions with the highest fitness based on the evaluation result of the N-th generation fitness evaluation task; the n characteristic expressions with the highest fitness are the first n characteristic expressions which are arranged from high to low according to the fitness;
c, generating a coding file containing a plurality of coding individuals by using a depth-first coding (DFP) mode based on the evaluation result of the Nth-generation fitness evaluation task by the child node executing the iterative task, and sending the coding file to the master node; the plurality of coding individuals comprise n coding individuals corresponding to the n characteristic expressions with highest fitness evaluated by the Nth-generation fitness evaluation task;
d, the main node generates a plurality of N +1 generation fitness evaluation tasks based on the coding file, and sends each N +1 generation fitness evaluation task to different sub-nodes respectively, wherein each fitness evaluation task comprises a coding individual;
step E, the sub-node executing the fitness evaluation task calculates the fitness according to the feature expression indicated by the coding individual in the allocated fitness evaluation task, and sends the calculated fitness as an evaluation result to the main node; and adding 1 to N, and returning to the step A.
2. The method of claim 1, wherein in the step C, the sub-node executing the iterative task generates a code file including a plurality of coding individuals by adopting a depth-first coding (DFP) mode based on the evaluation result of the nth-generation fitness evaluation task, and the method comprises:
step C1, the sub-nodes executing the iterative tasks select N characteristic expressions with highest fitness from the m characteristic expressions evaluated by the N generation fitness evaluation task based on the evaluation result of the N generation fitness evaluation task;
step C2, randomly selecting two feature expressions from the m feature expressions, respectively selecting one sub-expression from the two feature expressions for crossing according to a preset crossing probability, and reserving one feature expression after random crossing; repeating the step m-n times to obtain m-n reserved characteristic expressions after random crossing;
step C3, carrying out mutation treatment on elements in the reserved m-n characteristic expressions after random crossing according to a preset mutation probability to obtain m-n characteristic expressions after random mutation;
and step C4, determining the coding individuals corresponding to the N characteristic expressions with the highest fitness and the m-N characteristic expressions after the random variation processing as m coding individuals contained in the N +1 th generation fitness evaluation task.
3. The method of claim 2, wherein in step C3, the mutation processing on the elements in the reserved m-n feature expressions after random crossing comprises:
for any feature expression in the m-n feature expressions, randomly selecting one of the following processing modes for mutation processing:
replacing a single feature node in the feature expression with a sub-expression; the single characteristic node refers to one data or one operator in the characteristic expression;
reducing a sub-expression in the feature expression into a single feature node;
replacing a single characteristic node in the characteristic expression by a randomly generated single characteristic node;
the feature expression is replaced with a new feature expression generated randomly.
4. The method according to claim 2 or 3, wherein in step C1, the step of selecting N feature expressions with highest fitness from the m feature expressions evaluated by the N-th generation fitness evaluation task by the child nodes executing the iterative task based on the evaluation result of the N-th generation fitness evaluation task comprises:
if the feature expressions with the same fitness exist in the m feature expressions, removing redundant k feature expressions so that the feature expressions with the same fitness do not exist in the rest feature expressions;
and selecting n characteristic expressions with the highest fitness from the rest characteristic expressions, and subtracting k from m in the steps C2-C4.
5. The method of claim 1, further comprising, prior to step a:
after receiving the feature generation task, the master node acquires a data file required for executing the feature generation task from a data server and transmits the acquired data file to each cluster computing machine in the cluster system;
in step E, the performing fitness calculation by the child node executing the fitness evaluation task includes:
and the sub-node executing the fitness evaluation task reads the characteristic data indicated by the coding individual in the distributed fitness evaluation task from the cluster computing machine, substitutes the read characteristic data into the characteristic expression corresponding to the coding individual, and carries out fitness calculation on the characteristic expression substituted with the characteristic data by calling the fitness evaluation function on the cluster computing machine.
6. The method of claim 1, further comprising, prior to step a:
the main node sends an initialization task corresponding to the characteristic generation task received by the main node to a selected sub-node;
the child node executing the initialization task randomly generates an encoding file containing a plurality of initialized encoding individuals by calling an initialization function on the cluster computing machine;
and the main node generates a plurality of first-generation fitness evaluation tasks based on the plurality of initialized coding individuals and respectively issues each generated first-generation fitness evaluation task to different sub-nodes.
7. The method as claimed in claim 1, wherein in step B, the sub-node executing the output task determines and outputs N feature expressions with the highest fitness based on the evaluation result of the nth-generation fitness evaluation task, including:
the child node executing the output task determines N characteristic expressions with highest fitness by calling the evaluation result of the nth generation fitness evaluation task stored in the file system by the master node, outputs a characteristic generation result report which is fed back to the user and used for indicating the N characteristic expressions with the highest fitness, and outputs characteristic data corresponding to the N characteristic expressions with the highest fitness for subsequent calling.
8. A feature generation system, comprising:
the main node is used for issuing an output task to a selected sub-node if N is determined to be equal to the maximum iteration number after receiving evaluation results sent by a plurality of sub-nodes executing the N-th generation fitness evaluation task, and otherwise, issuing an iteration task to the selected sub-node; the system is also used for generating a plurality of N + 1-generation fitness evaluation tasks based on the coding files generated by the sub-nodes executing the iterative tasks, and respectively issuing each N + 1-generation fitness evaluation task to different sub-nodes, wherein each fitness evaluation task comprises a coding individual;
the child node is used for executing the output task and determining and outputting N characteristic expressions with the highest fitness based on the evaluation result of the N-th generation fitness evaluation task; the n characteristic expressions with the highest fitness are the first n characteristic expressions which are arranged from high to low according to the fitness;
the child node is used for executing the iterative task, generating a coding file containing a plurality of coding individuals by adopting a depth-first coding (DFP) mode based on the evaluation result of the Nth-generation fitness evaluation task, and sending the coding file to the master node; the plurality of coding individuals comprise n coding individuals corresponding to the n characteristic expressions with highest fitness evaluated by the Nth-generation fitness evaluation task;
and the child node executing the fitness evaluation task is used for calculating the fitness according to the characteristic expression indicated by the coding individuals in the allocated fitness evaluation task and sending the calculated fitness as an evaluation result to the master node.
9. The system of claim 8, wherein the child node executing the iteration task is specifically configured to:
select, based on the evaluation results of the Nth-generation fitness evaluation tasks, the n feature expressions with the highest fitness from the m feature expressions evaluated by the Nth-generation fitness evaluation tasks;
randomly select two feature expressions from the m feature expressions, select one sub-expression from each of the two feature expressions for crossover according to a preset crossover probability, and retain one feature expression after the random crossover; repeat this step m-n times to obtain m-n retained feature expressions after random crossover;
perform mutation processing on the elements in the retained m-n feature expressions after random crossover according to a preset mutation probability, to obtain m-n feature expressions after random mutation; and
determine the coding individuals respectively corresponding to the n feature expressions with the highest fitness and to the m-n feature expressions after mutation processing as the m coding individuals contained in the (N+1)th-generation fitness evaluation tasks.
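Claim 9's breeding step can be sketched outside the claim language as an elitist genetic-programming generation: the n fittest expressions survive unchanged, and the remaining m-n slots are filled by single-offspring crossover of random pairs. The nested-list tree form (`[operator, left, right]` with column names as leaves) and all function names are illustrative assumptions; the mutation step of the next clause is omitted here.

```python
import random

# Assumed tree form for a feature expression: a leaf is a data column name,
# an internal node is [operator, left_subtree, right_subtree].

def subtrees(expr, path=()):
    """Enumerate (path, subtree) pairs of an expression tree, root included."""
    yield path, expr
    if isinstance(expr, list):
        for i, child in enumerate(expr[1:], start=1):
            yield from subtrees(child, path + (i,))

def replace_at(expr, path, new):
    """Return a copy of expr with the subtree at path replaced by new."""
    if not path:
        return new
    copy = list(expr)
    copy[path[0]] = replace_at(copy[path[0]], path[1:], new)
    return copy

def crossover(parent_a, parent_b, p_cross, rng=random):
    """With probability p_cross, graft a random sub-expression of parent_b
    into a random position of parent_a and retain the single offspring."""
    if rng.random() >= p_cross:
        return parent_a
    path_a, _ = rng.choice(list(subtrees(parent_a)))
    _, sub_b = rng.choice(list(subtrees(parent_b)))
    return replace_at(parent_a, path_a, sub_b)

def next_generation(scored, m, n, p_cross, rng=random):
    """scored: list of (fitness, expression) pairs from generation N.
    Returns m expressions: the n fittest unchanged, plus m-n crossover
    offspring (claim 9, before the mutation step)."""
    ranked = sorted(scored, key=lambda fe: fe[0], reverse=True)
    elites = [expr for _, expr in ranked[:n]]
    pool = [expr for _, expr in scored]
    offspring = [crossover(*rng.sample(pool, 2), p_cross, rng)
                 for _ in range(m - n)]
    return elites + offspring
```

Keeping the elites verbatim guarantees the best fitness found so far never regresses between generations.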
10. The system of claim 9, wherein the child node executing the iteration task is specifically configured to perform mutation processing on the elements in the retained m-n feature expressions after random crossover according to the following steps:
for any feature expression, randomly selecting one of the following processing modes for mutation processing:
replacing a single feature node in the feature expression with a sub-expression, wherein a single feature node refers to one piece of data or one operator in the feature expression;
reducing a sub-expression in the feature expression to a single feature node;
replacing a single feature node in the feature expression with a randomly generated single feature node; or
replacing the feature expression with a randomly generated new feature expression.
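The four mutation modes of claim 10 can be illustrated on the same nested-list expression form. The operator set, column names, and helper names below are illustrative assumptions, and mode three simplifies "single feature node" to a leaf (data) replacement rather than also swapping operators.

```python
import random

OPS = ['+', '-', '*', '/']     # hypothetical operator set
COLS = ['x1', 'x2', 'x3']      # hypothetical data columns

def paths(expr, p=()):
    """Yield the path of every node in the expression tree."""
    yield p
    if isinstance(expr, list):
        for i, child in enumerate(expr[1:], start=1):
            yield from paths(child, p + (i,))

def get_at(expr, p):
    for i in p:
        expr = expr[i]
    return expr

def set_at(expr, p, new):
    if not p:
        return new
    out = list(expr)
    out[p[0]] = set_at(out[p[0]], p[1:], new)
    return out

def rand_expr(rng, depth=2):
    if depth == 0 or rng.random() < 0.4:
        return rng.choice(COLS)
    return [rng.choice(OPS), rand_expr(rng, depth - 1), rand_expr(rng, depth - 1)]

def mutate(expr, p_mut, rng=random):
    """Apply one of claim 10's four mutation modes, chosen at random, with
    probability p_mut; otherwise return the expression unchanged."""
    if rng.random() >= p_mut:
        return expr
    all_paths = list(paths(expr))
    leaves = [p for p in all_paths if not isinstance(get_at(expr, p), list)]
    subs = [p for p in all_paths if isinstance(get_at(expr, p), list)]
    mode = rng.randrange(4)
    if mode == 0:                       # single feature node -> sub-expression
        return set_at(expr, rng.choice(leaves), rand_expr(rng))
    if mode == 1:                       # sub-expression -> single feature node
        return set_at(expr, rng.choice(subs), rng.choice(COLS)) if subs else expr
    if mode == 2:                       # single node -> random single node
        return set_at(expr, rng.choice(leaves), rng.choice(COLS))
    return rand_expr(rng)               # whole expression -> new random one
```

Modes 0 and 1 grow and shrink the tree, so expression depth stays self-regulating across generations.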
11. The system according to claim 9 or 10, wherein the child node executing the iteration task is specifically configured to select the n feature expressions with the highest fitness from the m feature expressions evaluated by the Nth-generation fitness evaluation tasks according to the following steps:
if feature expressions with the same fitness exist among the m feature expressions, removing the k redundant feature expressions so that no two of the remaining feature expressions share the same fitness; and selecting the n feature expressions with the highest fitness from the remaining feature expressions, with m reduced by k accordingly.
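Claim 11's duplicate-fitness pruning followed by top-n selection is a short routine; the function name and the `(fitness, expression)` pair format are illustrative assumptions.

```python
def select_top_n(scored, n):
    """scored: list of (fitness, expression) pairs from the Nth-generation
    evaluation. Keeps only one expression per distinct fitness value
    (claim 11's removal of the k redundant expressions), then returns the
    n fittest of the remainder together with k, the number removed."""
    seen, unique = set(), []
    for fit, expr in scored:
        if fit in seen:
            continue                    # redundant: same fitness already kept
        seen.add(fit)
        unique.append((fit, expr))
    k = len(scored) - len(unique)
    unique.sort(key=lambda fe: fe[0], reverse=True)
    return [expr for _, expr in unique[:n]], k
```

Returning k lets the caller shrink m for the subsequent crossover step, matching the "m reduced by k" bookkeeping in the claim.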
12. The system of claim 8, wherein the master node is further configured to:
after receiving a feature generation task, acquire the data file required for executing the feature generation task from a data server, and transmit the acquired data file to each cluster computing machine in the cluster system;
and the child node executing the fitness evaluation task is specifically configured to:
read, from the cluster computing machine, the feature data indicated by the coding individual in the allocated fitness evaluation task, substitute the read feature data into the feature expression corresponding to the coding individual, and call a fitness evaluation function on the cluster computing machine to calculate the fitness of the feature expression into which the feature data has been substituted.
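The substitute-then-score step of claim 12 can be illustrated in miniature. The patent leaves the actual fitness evaluation function on the cluster machines unspecified, so the correlation-style score below is a toy stand-in; the tree form, row dictionaries, and function names are all assumptions.

```python
def eval_expr(expr, row):
    """Substitute one data row into a feature-expression tree and evaluate it.
    A leaf names a column of the row; an internal node is [op, left, right]."""
    if not isinstance(expr, list):
        return row[expr]
    op, left, right = expr
    a, b = eval_expr(left, row), eval_expr(right, row)
    return {'+': a + b, '-': a - b, '*': a * b,
            '/': a / b if b else 0.0}[op]   # guard division by zero

def fitness(expr, rows, labels):
    """Toy fitness: absolute correlation between the generated feature values
    and the labels (a stand-in for whatever evaluation function the cluster
    computing machines actually expose)."""
    vals = [eval_expr(expr, row) for row in rows]
    mv = sum(vals) / len(vals)
    ml = sum(labels) / len(labels)
    cov = sum((v - mv) * (l - ml) for v, l in zip(vals, labels))
    sv = sum((v - mv) ** 2 for v in vals) ** 0.5
    sl = sum((l - ml) ** 2 for l in labels) ** 0.5
    return abs(cov / (sv * sl)) if sv and sl else 0.0
```

Because only the coding individual travels from master to child, each child reconstructs the expression locally and pulls the bulky feature data from the cluster machine it already sits on.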
13. The system of claim 8, wherein the master node is further configured to: send the initialization task corresponding to the feature generation task received by the master node to a selected child node;
the child node executing the initialization task is configured to randomly generate, by calling an initialization function on the cluster computing machine, a coding file containing a plurality of initialized coding individuals;
and the master node is further configured to: generate a plurality of first-generation fitness evaluation tasks based on the plurality of initialized coding individuals generated by the child node executing the initialization task, and issue each generated first-generation fitness evaluation task to a different child node.
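One plausible reading of the depth-first coding (DFP) the claims mention is a prefix (depth-first) flattening of the expression tree, which keeps the coding individuals compact for transmission from master to children. The encoding scheme, operator set, and column names below are illustrative assumptions.

```python
import random

OPS = {'+', '-', '*', '/'}     # hypothetical operator set
COLS = ['x1', 'x2', 'x3']      # hypothetical data columns

def encode(expr):
    """Depth-first (prefix) flattening of an expression tree into a token
    list: operators first, then their operands, recursively."""
    if not isinstance(expr, list):
        return [expr]
    op, left, right = expr
    return [op] + encode(left) + encode(right)

def decode(tokens):
    """Rebuild the tree; a token is an internal node iff it is an operator."""
    it = iter(tokens)
    def build():
        t = next(it)
        return [t, build(), build()] if t in OPS else t
    return build()

def init_population(size, rng=random, depth=3):
    """Sketch of the initialization task: randomly generate `size` coding
    individuals (already in encoded form) for the first generation."""
    def rand_expr(d):
        if d == 0 or rng.random() < 0.3:
            return rng.choice(COLS)
        return [rng.choice(list(OPS)), rand_expr(d - 1), rand_expr(d - 1)]
    return [encode(rand_expr(depth)) for _ in range(size)]
```

Because every binary operator consumes exactly two operands, the prefix token list decodes unambiguously without parentheses or lengths.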
14. The system of claim 8, wherein the child node executing the output task is specifically configured to:
determine the n feature expressions with the highest fitness by calling the evaluation results of the Nth-generation fitness evaluation tasks stored in the file system by the master node, output a feature generation result report fed back to the user for indicating the n feature expressions with the highest fitness, and output the feature data corresponding to the n feature expressions with the highest fitness for subsequent calling.
CN201510784474.4A 2015-11-16 2015-11-16 Feature generation method and system Active CN106708609B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510784474.4A CN106708609B (en) 2015-11-16 2015-11-16 Feature generation method and system

Publications (2)

Publication Number Publication Date
CN106708609A CN106708609A (en) 2017-05-24
CN106708609B true CN106708609B (en) 2020-06-26

Family

ID=58931599

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510784474.4A Active CN106708609B (en) 2015-11-16 2015-11-16 Feature generation method and system

Country Status (1)

Country Link
CN (1) CN106708609B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109905340B (en) * 2019-03-11 2020-04-24 北京邮电大学 Feature optimization function selection method and device and electronic equipment
CN111178656A (en) * 2019-07-31 2020-05-19 腾讯科技(深圳)有限公司 Credit model training method, credit scoring device and electronic equipment
CN112380215B (en) * 2020-11-17 2023-07-28 北京融七牛信息技术有限公司 Automatic feature generation method based on cross aggregation
CN114356422A (en) * 2022-03-21 2022-04-15 四川新迎顺信息技术股份有限公司 Graph calculation method, device and equipment based on big data and readable storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4961152A (en) * 1988-06-10 1990-10-02 Bolt Beranek And Newman Inc. Adaptive computing system
US6182057B1 (en) * 1996-12-12 2001-01-30 Fujitsu Limited Device, method, and program storage medium for executing genetic algorithm
CN101419610A (en) * 2007-10-22 2009-04-29 索尼株式会社 Information processing device, information processing method, and program
CN101464922A (en) * 2009-01-22 2009-06-24 中国人民解放军国防科学技术大学 Computer architecture scheme parallel simulation optimization method based on cluster system
CN103336790A (en) * 2013-06-06 2013-10-02 湖州师范学院 Hadoop-based fast neighborhood rough set attribute reduction method
CN104239144A (en) * 2014-09-22 2014-12-24 珠海许继芝电网自动化有限公司 Multilevel distributed task processing system
CN104798043A (en) * 2014-06-27 2015-07-22 华为技术有限公司 Data processing method and computer system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A Survey of Parallel Genetic Algorithm Research; Gao Jiaquan, He Guixia; Journal of Zhejiang University of Technology; 2007-02-28; Vol. 35, No. 1; p. 57, left column, paragraph 2 *

Also Published As

Publication number Publication date
CN106708609A (en) 2017-05-24

Similar Documents

Publication Publication Date Title
CN106708609B (en) Feature generation method and system
CN110956272A (en) Method and system for realizing data processing
CN109992404A (en) PC cluster resource regulating method, device, equipment and medium
DE102012216029A1 (en) A SCALABLE ADAPTABLE MAP REDUCE FRAMEWORK WITH DISTRIBUTED DATA
CN108875955A (en) Gradient based on parameter server promotes the implementation method and relevant device of decision tree
CN108388474A (en) Intelligent distributed management of computing system and method based on DAG
CN109522104B (en) Method for optimizing scheduling of two target tasks of Iaas by using differential evolution algorithm
CN111967971A (en) Bank client data processing method and device
CN106708875B (en) Feature screening method and system
CN103106253A (en) Data balance method based on genetic algorithm in MapReduce calculation module
CN106534302A (en) Multi-task demand service combination method and system
CN106796533A (en) It is adaptive selected the system and method for execution pattern
CN114327844A (en) Memory allocation method, related device and computer readable storage medium
CN109872049B (en) Resource allocation optimization method and device
CN110858805A (en) Method and device for predicting network traffic of cell
JP2018515844A (en) Data processing method and system
CN108241534A (en) A kind of task processing, distribution, management, the method calculated and device
Gu et al. A discrete particle swarm optimization algorithm with adaptive inertia weight for solving multiobjective flexible job-shop scheduling problem
CN105808582A (en) Parallel generation method and device of decision tree on the basis of layered strategy
CN112200316A (en) GBDT learning method for online prediction task
CN105005501B (en) A kind of second order optimizing and scheduling task method towards cloud data center
US9384238B2 (en) Block partitioning for efficient record processing in parallel computing environment
Albrecht Computing all hybridization networks for multiple binary phylogenetic input trees
Hůla et al. Graph neural networks for scheduling of SMT solvers
Xie et al. A combination of boosting and bagging for kdd cup 2009-fast scoring on a large database

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20200921

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Patentee after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Patentee before: Advanced innovation technology Co.,Ltd.

Effective date of registration: 20200921

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Patentee after: Advanced innovation technology Co.,Ltd.

Address before: Cayman Islands Grand Cayman capital building, a four storey No. 847 mailbox

Patentee before: Alibaba Group Holding Ltd.
