Summary of the Invention
The embodiments of the present application provide a feature generation method and system that address the insufficient big-data processing capability and low evaluation efficiency commonly encountered when evaluating the fitness of a large number of newly generated features. A feature generation algorithm that effectively obtains high-value new features is also provided.
An embodiment of the present application provides a feature generation method, including:
Step A: after receiving the evaluation results sent by the multiple child nodes that execute the Nth-generation fitness evaluation tasks, the master node issues an output task to a selected child node if it determines that N equals the maximum number of iterations; otherwise, it issues an iteration task to a selected child node.
Step B: the child node executing the output task determines and outputs the n feature expressions with the highest fitness, based on the evaluation results of the Nth-generation fitness evaluation tasks. The n feature expressions with the highest fitness are the first n feature expressions when the expressions are sorted by fitness in descending order.
Step C: the child node executing the iteration task generates, based on the evaluation results of the Nth-generation fitness evaluation tasks, a coding file containing multiple encoded individuals and sends it to the master node. The multiple encoded individuals include the n encoded individuals corresponding to the n highest-fitness feature expressions found by the Nth-generation fitness evaluation tasks.
Step D: the master node generates multiple (N+1)th-generation fitness evaluation tasks based on the coding file and issues each of them to a different child node, where each fitness evaluation task contains one encoded individual.
Step E: each child node executing a fitness evaluation task computes the fitness of the feature expression indicated by the encoded individual in its assigned task and sends the computed fitness to the master node as the evaluation result. N is incremented by 1 and the method returns to Step A.
Optionally, the encoded individuals are generated using depth-first programming (DFP) encoding. In Step C, the child node executing the iteration task generating, based on the evaluation results of the Nth-generation fitness evaluation tasks, a coding file containing multiple encoded individuals includes:
Step C1: the child node executing the iteration task selects, based on the evaluation results of the Nth-generation fitness evaluation tasks, the n feature expressions with the highest fitness from the m feature expressions evaluated by those tasks.
Step C2: two feature expressions are randomly selected from the m feature expressions; according to a preset crossover probability, one subexpression is selected from each of the two feature expressions and the two are crossed over; one feature expression resulting from the random crossover is retained. This step is repeated m-n times to obtain m-n retained feature expressions after random crossover.
Step C3: according to a preset mutation probability, elements of the m-n retained feature expressions are mutated, yielding m-n feature expressions after random mutation.
Step C4: the encoded individuals corresponding to the n highest-fitness feature expressions and to the m-n randomly mutated feature expressions are determined as the m encoded individuals contained in the (N+1)th-generation fitness evaluation tasks.
Optionally, in Step C3, mutating elements of the m-n retained feature expressions includes:
For any feature expression, randomly selecting one of the following mutation operations:
replacing a single feature node in the feature expression with a subexpression, where a single feature node is one data item or one operator in the feature expression;
reducing a subexpression in the feature expression to a single feature node;
replacing a single feature node in the feature expression with a randomly generated single feature node;
replacing the feature expression with a randomly generated new feature expression.
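The four mutation operations listed above can be sketched on a depth-first encoded expression as follows. This is only an illustrative sketch under stated assumptions: operators are assumed to be binary and encoded as negative integers, fields as positive integers, and the names `OPERATORS`, `FIELDS`, `subexpr_end`, `random_expr`, and `mutate` are hypothetical. To keep every result a valid expression, the sketch expands only field nodes into subexpressions and replaces a node only with a node of the same kind; the source does not state these restrictions.

```python
import random

OPERATORS = [-1, -2, -3]  # hypothetical codes for binary operators
FIELDS = [1, 2, 3]        # hypothetical codes for data fields


def subexpr_end(codes, start):
    """Return the index one past the complete subexpression starting at start."""
    needed, pos = 1, start
    while needed > 0:
        # A binary operator fills one slot and opens two; a field fills one slot.
        needed += 1 if codes[pos] < 0 else -1
        pos += 1
    return pos


def random_expr(rng, depth=2):
    """Generate a random valid expression in depth-first order."""
    if depth <= 1 or rng.random() < 0.3:
        return [rng.choice(FIELDS)]
    return ([rng.choice(OPERATORS)]
            + random_expr(rng, depth - 1)
            + random_expr(rng, depth - 1))


def mutate(codes, p_mutation, rng):
    """Apply one randomly chosen mutation with probability p_mutation."""
    if rng.random() >= p_mutation:
        return list(codes)
    mode = rng.randrange(4)
    if mode == 0:  # single field node -> subexpression
        i = rng.choice([k for k, c in enumerate(codes) if c > 0])
        return codes[:i] + random_expr(rng) + codes[i + 1:]
    if mode == 1:  # subexpression -> single field node
        ops = [k for k, c in enumerate(codes) if c < 0]
        if not ops:
            return list(codes)
        i = rng.choice(ops)
        return codes[:i] + [rng.choice(FIELDS)] + codes[subexpr_end(codes, i):]
    if mode == 2:  # single node -> random node of the same kind
        i = rng.randrange(len(codes))
        pool = FIELDS if codes[i] > 0 else OPERATORS
        return codes[:i] + [rng.choice(pool)] + codes[i + 1:]
    return random_expr(rng, depth=3)  # whole expression -> new random expression
```

Because each operation replaces one complete subexpression with another complete subexpression, the mutated list can always be decoded back into a well-formed expression.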
Optionally, in Step C1, the child node executing the iteration task selecting, based on the evaluation results of the Nth-generation fitness evaluation tasks, the n feature expressions with the highest fitness from the m feature expressions evaluated by those tasks includes:
If feature expressions with identical fitness exist among the m feature expressions, rejecting k redundant feature expressions so that no two of the remaining feature expressions have identical fitness;
selecting the n feature expressions with the highest fitness from the remaining feature expressions, and subtracting k from m in Steps C2 to C4.
Optionally, before Step A, the method further includes:
After receiving a feature generation task, the master node obtains from a data server the data file required to execute the feature generation task and transmits the obtained data file to every cluster computing machine in the cluster system.
In Step E, the child node executing a fitness evaluation task computing the fitness includes:
The child node executing the fitness evaluation task reads, from the cluster computing machine on which it resides, the feature data indicated by the encoded individual in its assigned fitness evaluation task, substitutes the feature data it has read into the feature expression corresponding to the encoded individual, and computes the fitness of the feature expression with the feature data substituted in by calling the fitness evaluation function on that cluster computing machine.
Optionally, before Step A, the method further includes:
The master node issues, to a selected child node, the initialization task corresponding to the feature generation task received by the master node.
The child node executing the initialization task randomly generates a coding file containing multiple initialized encoded individuals by calling the initialization function on the cluster computing machine on which it resides.
The master node generates multiple first-generation fitness evaluation tasks based on the multiple initialized encoded individuals and issues each generated first-generation fitness evaluation task to a different child node.
Optionally, in Step B, the child node executing the output task determining and outputting the n feature expressions with the highest fitness, based on the evaluation results of the Nth-generation fitness evaluation tasks, includes:
The child node executing the output task determines the n feature expressions with the highest fitness by calling the evaluation results of the Nth-generation fitness evaluation tasks stored in the master node's file system, outputs a feature generation result report that is fed back to the user and indicates the n feature expressions with the highest fitness, and outputs, for subsequent calls, the feature data corresponding to the n feature expressions with the highest fitness.
An embodiment of the present application provides a feature generation system, including:
A master node, configured to, after receiving the evaluation results sent by the multiple child nodes that execute the Nth-generation fitness evaluation tasks, issue an output task to a selected child node if it determines that N equals the maximum number of iterations, and otherwise issue an iteration task to a selected child node; and further configured to generate multiple (N+1)th-generation fitness evaluation tasks based on the coding file generated by the child node executing the iteration task, and to issue each (N+1)th-generation fitness evaluation task to a different child node, where each fitness evaluation task contains one encoded individual.
A child node executing the output task, configured to determine and output the n feature expressions with the highest fitness, based on the evaluation results of the Nth-generation fitness evaluation tasks. The n feature expressions with the highest fitness are the first n feature expressions when the expressions are sorted by fitness in descending order.
A child node executing the iteration task, configured to generate, based on the evaluation results of the Nth-generation fitness evaluation tasks, a coding file containing multiple encoded individuals and to send it to the master node. The multiple encoded individuals include the n encoded individuals corresponding to the n highest-fitness feature expressions found by the Nth-generation fitness evaluation tasks.
A child node executing a fitness evaluation task, configured to compute the fitness of the feature expression indicated by the encoded individual in its assigned fitness evaluation task and to send the computed fitness to the master node as the evaluation result.
With the above method or system, each generation of fitness evaluation tasks can be executed in parallel by multiple child nodes, which improves the efficiency of fitness evaluation and, in turn, the efficiency of the whole feature generation process, ensuring that new features are generated in a timely manner. When issuing fitness evaluation tasks to child nodes, the master node does not transfer the feature data directly to the child nodes that execute the tasks; instead, it indicates the feature expression to be evaluated in the form of an encoded individual, which reduces the volume of transmitted data, increases transmission efficiency, and reduces memory usage. In addition, the iterative algorithm based on depth-first programming (DFP) encoding proposed in the embodiments of the present application effectively preserves the integrity of subexpressions, and every iteration retains the best feature expressions from the previous iteration. Therefore, once the last iteration completes, the best feature expressions of the whole iterative process are obtained.
Detailed Description
In the embodiments of the present application, the whole iterative process includes: an initialization task executed by one selected child node, each generation of fitness evaluation tasks executed in parallel by multiple child nodes, an iteration task executed by one selected child node, and an output task executed after all fitness evaluation tasks have been completed. The master node is responsible for assigning the appropriate task to each child node and for coordinating and scheduling the whole iterative process. Because each generation of fitness evaluation tasks is executed in parallel by multiple child nodes, the efficiency of fitness evaluation is improved, which in turn improves the efficiency of the whole feature generation process and ensures that new features are generated in a timely manner. When issuing fitness evaluation tasks to child nodes, the master node does not transfer the feature data directly to the child nodes that execute the tasks; instead, it indicates the feature expression to be evaluated in the form of an encoded individual, which reduces the volume of transmitted data, increases transmission efficiency, and reduces memory usage.
In addition, for the iteration task, the embodiments of the present application propose an iterative algorithm based on depth-first programming (Depth-First Programming, DFP) encoding, also referred to as the feature generation algorithm. The algorithm encodes expressions using the DFP encoding, which effectively preserves the integrity of subexpressions during feature generation. It also exploits inheritance: every iteration retains the best feature expressions from the previous iteration, so once the last iteration completes, the best feature expressions of the whole iterative process are obtained.
To make the purpose, technical solutions, and advantages of the present application clearer, the application is described in further detail below with reference to the accompanying drawings. It should be noted that all other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application, without creative effort, fall within the scope of protection of the present application.
FIG. 1 is a flowchart of the feature generation method provided by an embodiment of the present application, and FIG. 2 is a schematic diagram of task scheduling based on an iterative computing framework. The method includes the following steps:
S101: The master node issues, to a selected child node, the initialization task corresponding to the feature generation task received by the master node.
In the above step, the master node may select the child node that executes the initialization task at random or according to the load of each child node, and may indicate to that child node the parameter information shown in Table 1 below:
Parameter name | Parameter type | Meaning
filename | String | Data file name
fieldSize | Int | Total number of fields in the data file
popSize | Int | Number of individuals in the population
parameter | String | Algorithm parameters
Table 1
Here, the data file name (filename), a parameter of string (String) type, indicates the data file that holds the data samples. The total number of data file fields (fieldSize), a parameter of integer (Int) type, indicates the number of fields in the data file. Each field identifies one data feature, and the feature expressions described below are expressions over data features and operators. For example, in the feature expression X4=X1+X2×X3, X1, X2, and X3 are different data features, and "+" and "×" are operators. The number of individuals in the population (popSize), a parameter of integer (Int) type, indicates the number of encoded individuals generated in the initialization task; each encoded individual corresponds to one feature expression. The algorithm parameters (parameter) are a parameter of String type stored as key-value pairs; they may include, for example, the maximum depth of the generated feature expressions (the depth of X1+X2×X3 is 3).
The above parameter information may be entered by the user through a front-end interface. Specifically, on a web client and guided by the front-end interface, the user imports data and configures requirements, and finally submits a task request to the back end. This involves three sub-processes: submitting data, selecting an algorithm, and setting parameters. Submitting data means that the user enters, through the front-end interface, the name of the data table (data file) that holds the data to be processed, selects the fields to be processed, and sets the field types. Selecting an algorithm means that, after the data is submitted, the system filters algorithms according to the selected field types and offers recommendations; the user can select an algorithm as needed or submit a custom algorithm. In the embodiments of the present application, the algorithm the user needs to select is mainly the fitness evaluation algorithm; for feature generation, the embodiments provide a dedicated feature generation algorithm. Setting parameters means configuring the parameters of the selected algorithm; every parameter has a default value for reference. After these three sub-processes finish, all relevant information is gathered into one task request and sent to the back end, which then carries out the back-end computation starting from S101. Back-end computation is the process of invoking the relevant algorithms. It is isolated from the user, who can query the running status of the task through the front-end interface, where it is displayed as a scrolling log. After all fitness evaluation tasks have been executed, the execution results can be output: the file storing the results (such as the json file introduced later) is read, parsed, and presented to the user in a visual form.
S102: The child node executing the initialization task generates a coding file containing multiple initialized encoded individuals by calling the initialization function on the cluster computing machine on which it resides.
Here, the child node (slave node) executing the initialization task follows the master node's instruction and uses a script to call the relevant algorithm source code (that is, the initialization function) in the source code library on its cluster computing machine to generate the coding file. It may also generate an intermediate data file at the same time, and it returns both the coding file and the intermediate data file to the master node. The intermediate data file may contain intermediate results needed by subsequent analysis. In the initialization task, no intermediate results have been produced yet, so the data in this intermediate data file can be set to null or to default values. The master node can save the files in its file system so that subsequent iterative computation can call them. In addition, during subsequent scheduling, the coding file and intermediate data file produced by every iteration are also stored in the master node's file system. Therefore, if a system failure occurs, computation can resume from the last iteration before the failure, which gives the system strong recoverability.
In specific implementations, an encoded individual identifies a feature expression; each code in the encoded individual identifies either a field (that is, a feature) involved in the feature expression or an operator in the feature expression. For example, suppose a data table has four fields [Y, X1, X2, X3], each corresponding to one data feature in the table. For the feature expression X4=X1+X2×X3, if the code of "+" is -1, the code of "×" is -3, the code of X1 is 1, the code of X2 is 2, and the code of X3 is 3, then encoding this feature expression in the depth-first programming (Depth-First Programming, DFP) manner yields [-1, 1, -3, 2, 3].
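The encoding in this example corresponds to a pre-order (depth-first) traversal of the expression tree. The following minimal sketch reproduces it, assuming a nested-tuple tree representation and binary operators; the names `OP_CODES`, `FIELD_CODES`, `encode`, and `decode` are hypothetical, and only the five code values come from the example in the text.

```python
# Code values taken from the example; the tree representation is an assumption.
OP_CODES = {"+": -1, "*": -3}
FIELD_CODES = {"X1": 1, "X2": 2, "X3": 3}


def encode(tree):
    """Encode a nested-tuple expression tree by depth-first (pre-order) traversal."""
    if isinstance(tree, str):
        return [FIELD_CODES[tree]]
    op, left, right = tree
    return [OP_CODES[op]] + encode(left) + encode(right)


def decode(codes):
    """Rebuild the expression tree from its depth-first encoding."""
    names = {v: k for k, v in {**OP_CODES, **FIELD_CODES}.items()}

    def parse(pos):
        code = codes[pos]
        if code > 0:  # positive codes are fields (leaf nodes)
            return names[code], pos + 1
        left, pos = parse(pos + 1)  # negative codes are binary operators
        right, pos = parse(pos)
        return (names[code], left, right), pos

    tree, end = parse(0)
    if end != len(codes):
        raise ValueError("encoding contains trailing codes")
    return tree


expr = ("+", "X1", ("*", "X2", "X3"))  # X1 + X2 × X3
encoded = encode(expr)                 # [-1, 1, -3, 2, 3]
```

In this layout every subexpression occupies one contiguous slice of the list (here `[-3, 2, 3]` for X2×X3), which is what allows a subexpression to be moved or replaced as a unit.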
S103: The master node generates multiple first-generation fitness evaluation tasks based on the multiple initialized encoded individuals and issues each generated first-generation fitness evaluation task to a different child node, where each fitness evaluation task contains one encoded individual.
If the present application is applied to a classification model, fitness can be measured using the Gini coefficient.
In the embodiments of the present application, to reduce the volume of transmitted data, the master node does not transfer the feature data directly to the child nodes that execute the fitness evaluation tasks when issuing those tasks. Instead, it indicates the feature expression to be evaluated in the form of an encoded individual, which reduces the volume of transmitted data, increases transmission efficiency, and reduces memory usage.
In S103, the master node can wait for the initialization task to finish by putting its thread to sleep, then determine the number of first-generation fitness evaluation tasks to generate from the number of encoded individuals in the coding file, and create each fitness evaluation task. Specifically, for each fitness evaluation task, it can generate a subtask identifier (ID) and a population individual identifier (ID), store the subtask ID, population individual ID, encoded individual, algorithm parameters, and so on in a task queue as the task execution information of the fitness evaluation task, and then take each fitness evaluation task from the task queue and distribute it to a child node for execution. Here, the child nodes that execute each generation of fitness evaluation tasks may be selected by the master node at random or according to the current load of each child node.
Table 2 below shows the parameter information received by a child node executing a fitness evaluation task:
Parameter name | Parameter type | Meaning
filename | String | Data file name
JobID | String | Task ID
popID | String | Population individual ID
individual | String | Encoded individual
parameter | String | Algorithm parameters
Table 2
Among the above parameters, the data file name (filename), task ID (JobID), population individual ID (popID), encoded individual (individual), and algorithm parameters (parameter) are all parameters of string type.
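As an illustration of how the master node might turn the encoded individuals of a coding file into queued fitness evaluation tasks carrying the Table 2 fields, consider the sketch below. It is a sketch under assumptions, not the implementation described by the application: the function name `build_tasks`, the use of `uuid` for task IDs, the list index as population individual ID, and the comma-separated serialization of an encoded individual are all hypothetical.

```python
import uuid
from collections import deque


def build_tasks(filename, individuals, parameter):
    """Create one fitness evaluation task per encoded individual.

    Every field is a string, matching the parameter types in Table 2.
    """
    queue = deque()
    for index, individual in enumerate(individuals):
        queue.append({
            "filename": filename,
            "JobID": uuid.uuid4().hex,  # hypothetical task ID scheme
            "popID": str(index),        # hypothetical population individual ID
            "individual": ",".join(str(code) for code in individual),
            "parameter": parameter,
        })
    return queue


tasks = build_tasks("data.csv", [[-1, 1, -3, 2, 3], [2]], "maxDepth=3")
```

The master node would then pop tasks from the queue and hand each one to a different child node.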
S104: Each child node executing a fitness evaluation task computes the fitness of the feature expression indicated by the encoded individual in its assigned fitness evaluation task and sends the computed fitness to the master node as the evaluation result.
Here, the child node executing a fitness evaluation task processes the task assigned to it. Specifically, it calls the corresponding fitness evaluation function to perform the computation, writes the fitness result, task ID, population individual ID, and encoded individual into a json file named after the task ID, and returns the file to the master node's file system. Regarding the task ID: the master node can send an execution progress query containing the task ID to any child node executing a fitness evaluation task and receive the execution progress information that the child node returns for that task ID. Regarding the population individual ID: after receiving an evaluation result containing the population individual ID and the encoded individual from a child node executing a fitness evaluation task, the master node can match the population individual ID and encoded individual sent by the child node against those in the fitness evaluation task issued to that child node, in order to verify the accuracy of the evaluation result fed back by the child node.
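The result hand-off and the matching check described above can be sketched as follows. The json field names and the helper names `result_filename`, `make_result`, and `verify_result` are assumptions made for illustration; the text specifies only which items the file contains and that it is named after the task ID.

```python
import json


def result_filename(task):
    """The result file is named after the task ID."""
    return f"{task['JobID']}.json"


def make_result(task, fitness):
    """Serialize what a child node writes after evaluating its task."""
    return json.dumps({
        "JobID": task["JobID"],
        "popID": task["popID"],
        "individual": task["individual"],
        "fitness": fitness,
    })


def verify_result(issued_task, result_json):
    """Master-side check: the returned popID and encoded individual must
    match the ones in the task that was issued to the child node."""
    result = json.loads(result_json)
    return (result.get("popID") == issued_task["popID"]
            and result.get("individual") == issued_task["individual"])


task = {"JobID": "job-1", "popID": "0", "individual": "-1,1,-3,2,3"}
payload = make_result(task, 0.42)
```

A result that fails `verify_result` would indicate that the child node evaluated, or reported on, a different individual than the one it was given.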
In S104, the child node executing a fitness evaluation task can read only the feature data of the fields referenced by the encoded individual, without reading the feature data of all fields, which reduces memory usage and makes parallel task processing easier. Preferably, after receiving the feature generation task, the master node first obtains from the data server the data file required to execute the feature generation task and transmits it to every cluster computing machine in the cluster system. Correspondingly, the child node executing a fitness evaluation task reads, from the cluster computing machine on which it resides, the feature data indicated by the encoded individual in its assigned task, substitutes the feature data into the feature expression corresponding to the encoded individual, and calls the fitness evaluation function on that machine to compute the fitness of the feature expression with the feature data substituted in. Here, to make it easy for the child node to read the feature data indicated by the encoded individual, the master node downloads the feature data from the data server to the cluster computing machines hosting the child nodes in advance. In practice, a child node could also read the required feature data directly from the data server, but this would considerably reduce evaluation efficiency.
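The idea of reading only the referenced fields and substituting them into the expression can be illustrated with the sketch below. It assumes the depth-first encoding with positive field codes and negative binary-operator codes from the earlier example; `needed_fields`, `evaluate_expression`, and the in-memory `columns` dictionary are hypothetical stand-ins for the data file access, and the fitness evaluation function itself is deliberately left out because the text does not define it.

```python
# Operator codes from the earlier example: -1 is "+", -3 is "×".
OPS = {
    -1: lambda a, b: a + b,
    -3: lambda a, b: a * b,
}


def needed_fields(codes):
    """Field codes the encoded individual actually references."""
    return sorted({code for code in codes if code > 0})


def evaluate_expression(codes, columns):
    """Substitute column data into a depth-first encoded expression.

    columns maps a field code to a list of sample values; only the
    fields returned by needed_fields have to be loaded.
    """
    def parse(pos):
        code = codes[pos]
        if code > 0:
            return columns[code], pos + 1
        left, pos = parse(pos + 1)
        right, pos = parse(pos)
        return [OPS[code](a, b) for a, b in zip(left, right)], pos

    values, _ = parse(0)
    return values


codes = [-1, 1, -3, 2, 3]  # X1 + X2 × X3
columns = {1: [1.0, 2.0], 2: [3.0, 4.0], 3: [5.0, 6.0]}
new_feature = evaluate_expression(codes, columns)  # [16.0, 26.0]
```

The resulting column of new feature values is what a fitness evaluation function would then score against the target field.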
S105: After receiving the evaluation results sent by the multiple child nodes that execute the Nth-generation fitness evaluation tasks, the master node judges whether N equals the maximum number of iterations; if so, it proceeds to S106; otherwise, it proceeds to S108. Here, N is a positive integer that increases with the number of iterations.
In a specific implementation, once all fitness evaluation tasks distributed by the master node in a given iteration have been completed, the master node aggregates the evaluation results returned by the child nodes and generates a csv file that is stored in the file system for later use. The master node also checks the state of the iterative process: if the termination condition has been met, it ends the iteration and instructs a child node to execute the output task; if not, it instructs a child node to execute the iteration task, that is, to generate a new coding file and a new intermediate data file and return them to the master node.
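The scheduling decision described above can be summarized in a small single-process sketch. Here a thread pool stands in for the child nodes, and `evaluate` and `iterate` are caller-supplied placeholders for the fitness evaluation task and the iteration task; the function name `run` and this in-memory structure are assumptions made purely for illustration and do not reflect how tasks are actually transported between nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def run(initial_population, evaluate, iterate, max_iterations, n):
    """Evaluate each generation in parallel, then either output the top n
    (last generation) or build the next generation (otherwise)."""
    population = initial_population
    generation = 1
    with ThreadPoolExecutor() as pool:  # threads stand in for child nodes
        while True:
            fitnesses = list(pool.map(evaluate, population))
            results = list(zip(population, fitnesses))
            if generation == max_iterations:  # output task
                results.sort(key=lambda pair: pair[1], reverse=True)
                return results[:n]
            population = iterate(results)     # iteration task
            generation += 1


def toy_evaluate(individual):
    return float(sum(individual))


def toy_iterate(results):
    # Keep the best individual and add one new candidate.
    best = max(results, key=lambda pair: pair[1])[0]
    return [best, [best[0] + 1]]


top = run([[1], [2]], toy_evaluate, toy_iterate, max_iterations=3, n=1)
```

With the toy functions shown, the best individual is carried into each new generation, so the final generation contains the best individual seen so far.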
S106: The master node issues the output task to a selected child node.
Here, the child node that executes the output task may be selected by the master node at random or according to the current load of each child node.
Table 3 below shows the parameter information received by the child node executing the output task:
Parameter name | Parameter type | Meaning
filename | String | Data file name
popSize | Int | Number of individuals in the population
parameter | String | Algorithm parameters
PVfilename | String | Encoded individuals and their corresponding evaluation function return values
Midfilename | String | Intermediate data file name
IterNum | Int | Current iteration number
Table 3
Here, the algorithm parameters (parameter) in Table 3 are a parameter of String type stored as key-value pairs; they may include, for example, the number of highest-fitness feature expressions to retain. The evaluation function return value is the fitness.
S107: The child node executing the output task determines and outputs the n feature expressions with the highest fitness, based on the evaluation results of the Nth-generation fitness evaluation tasks. Here, n is a positive integer. Because every iteration in the embodiments of the present application inherits the n best feature expressions from the previous iteration, the n feature expressions with the best evaluation results across all iterations can be selected directly from the evaluation results of the last iteration.
In a specific implementation, when the iterative process ends, the master node aggregates the evaluation results and stores the aggregated results in the file system. The child node executing the output task calls the evaluation results in the file system to determine the n feature expressions with the highest fitness, outputs a feature generation result report that is fed back to the user and indicates those n feature expressions, and outputs, for subsequent calls, the feature data corresponding to those n feature expressions. For example, the child node executing the output task outputs one json file and one csv file. The json file stores the formatted results; after it is returned to the front end, the feature generation result report shown to the user is generated from it. The csv file stores the feature data corresponding to the feature expressions that were finally generated and retained; it can ultimately be uploaded to a server for the user to call later. In addition, the system can automatically delete all related files to free disk space.
S108: The master node issues the iteration task to a selected child node.
Here, the child node that executes the iteration task may be selected by the master node at random or according to the current load of each child node.
Table 4 below shows the parameter information received by the child node executing the iteration task:
Parameter name | Parameter type | Meaning
filename | String | Data file name
popSize | Int | Number of individuals in the population
parameter | String | Algorithm parameters
PVfilename | String | Encoded individuals and their corresponding evaluation function return values
Midfilename | String | Intermediate data file name
IterNum | Int | Current iteration number
Table 4
Here, the algorithm parameters in Table 4 may include the crossover probability (pCross) and the mutation probability (pMutation); see the description of the iterative algorithm, that is, the feature generation algorithm, below.
S109: The child node executing the iteration task generates, based on the evaluation results of the Nth-generation fitness evaluation tasks, a coding file containing multiple encoded individuals and sends it to the master node. The multiple encoded individuals include the n encoded individuals corresponding to the n highest-fitness feature expressions found by the Nth-generation fitness evaluation tasks.
In this step, the child node selected by the master node to execute the iteration task uses a script to call the iteration function and, according to the evaluation results (that is, the fitness values) of the Nth-generation fitness evaluation tasks, generates the coding file used to execute the (N+1)th-generation fitness evaluation tasks. It can also call the intermediate data file that the master node saved after the (N-1)th-generation fitness evaluation tasks were executed, generate the intermediate data file for the state after the Nth-generation fitness evaluation tasks, and return it to the master node. The intermediate data file may contain intermediate results needed by subsequent analysis, such as the n feature expressions with the best evaluation results from the Nth-generation fitness evaluation tasks and their respective fitness values.
In the embodiments of the present application, so that feature expressions with higher fitness survive into subsequent iterations, every iteration retains the n feature expressions with the highest fitness (that is, the best evaluation results) from the previous iteration. Besides these n feature expressions, each iteration can obtain new feature expressions by random generation or by transformation. In a preferred embodiment of the present application, to retain advantageous features while further generating features with higher fitness, an iterative algorithm based on depth-first programming (Depth-First Programming, DFP) encoding is proposed. Unlike the traditional encoding based on gene expression programming (Gene Expression Programming, GEP), the DFP-based encoding traverses all branches of a feature expression in depth-first order, so that subexpressions can be preserved intact.
The execution of the iteration task based on the DFP encoding in the embodiments of the present application specifically includes:
Step 1: The child node executing the iteration task selects, based on the evaluation results of the Nth-generation fitness evaluation tasks, the n feature expressions with the highest fitness from the m feature expressions evaluated by those tasks. Here, m is a positive integer greater than n.
In a traditional genetic algorithm, when advantageous individuals are selected, each individual in
the population is assigned a selection probability in descending order of fitness, and the
advantageous individuals are then chosen according to the assigned probabilities. This approach,
however, often fails to retain advantageous individuals effectively. The embodiment of the present
application instead directly retains a number of the best feature expressions produced by the
previous iteration, rather than selecting them probabilistically. Preferably, in order to avoid an
excess of redundant features, which would increase the complexity of feature generation, redundant
features may additionally be rejected at step 1. Specifically, if feature expressions with
identical fitness exist among the m feature expressions, k redundant feature expressions are
rejected so that no two of the remaining feature expressions have the same fitness; the n feature
expressions with the highest fitness are selected from the remaining feature expressions, and m
in steps 2 to 4 is reduced by k. Here, k is a positive integer less than m.
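Step 1 together with the optional redundancy rejection can be sketched as follows. This is only an
illustration under stated assumptions: the population is a list of (expression, fitness) pairs,
"identical fitness" is taken as exact equality, and the function name select_elite is
hypothetical.

```python
from typing import List, Tuple

Scored = Tuple[str, float]  # (feature expression, fitness)


def select_elite(scored: List[Scored], n: int) -> Tuple[List[Scored], List[Scored]]:
    """Reject expressions whose fitness duplicates an earlier one, then keep the top n.

    Returns (elite, pool): the n fittest expressions and the whole deduplicated
    population, whose size is m - k.
    """
    seen = set()
    pool = []
    # sorted() is stable, so among equal fitness values the first listed one is kept
    for expr, fitness in sorted(scored, key=lambda pair: pair[1], reverse=True):
        if fitness in seen:
            continue  # redundant: same fitness as an expression already kept
        seen.add(fitness)
        pool.append((expr, fitness))
    return pool[:n], pool


population = [("A+B", 0.40), ("B+A", 0.40), ("A*C", 0.55), ("C/D", 0.30), ("A-D", 0.55)]
elite, pool = select_elite(population, n=2)
k = len(population) - len(pool)
print(elite)  # [('A*C', 0.55), ('A+B', 0.4)]
print(k)      # 2
```

In this toy population m is 5 and k is 2, so steps 2 to 4 would operate on 3 feature expressions.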
Step 2: Two feature expressions are selected at random from the m feature expressions; according
to a preset crossover probability, one subexpression is selected from each of the two feature
expressions and the two subexpressions are exchanged; one feature expression is retained after
the random crossover. This step is repeated m-n times to obtain the m-n feature expressions
retained after random crossover.
In order to ensure the validity of newly generated feature expressions, crossover based on DFP
encoding finds one complete subexpression in each of the two feature expressions and exchanges
them; this is represented here in the form of expression trees, as shown in Figure 3. Because
depth-first encoding is used, a complete subexpression is a contiguous string in the encoding and
is therefore easy to locate. This crossover mode guarantees both the validity of the feature
expressions after crossover and the integrity of the subexpressions, so that advantageous
subexpressions can be inherited effectively. In addition, in order to avoid being trapped in a
local optimum, the n best feature expressions preferably also take part in crossover with the
other feature expressions. For example, two feature expressions Xi and Xj are selected at random
from the m feature expressions, and a random number p uniformly distributed on [0,1] is generated.
If p < crossover probability (pCross), subexpression crossover is performed on
Xi = (A-B) × (C+D) and Xj = A + (C+D-B); for example, the subexpression (A-B) of Xi may be
exchanged with the subexpression (C+D-B) of Xj. The action of "randomly selecting two feature
expressions from the m feature expressions evaluated by the Nth-generation fitness evaluation
tasks" is repeated m-n times (whether crossover actually takes place each time is probabilistic).
Each time, one new feature expression is retained after crossover or, if no crossover was
performed, one of the randomly selected feature expressions is retained.
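The subexpression exchange in step 2 can be sketched on a prefix (depth-first) token list. The
sketch assumes binary operators only and a pre-order encoding; swap_subtrees and crossover are
hypothetical names, and the choice of which child to retain is made at random here because the
text only says that one feature expression is retained.

```python
import random
from typing import List, Tuple

OPERATORS = {"+", "-", "*", "/"}  # assumption: every operator is binary


def subtree_end(tokens: List[str], start: int) -> int:
    """Return the index one past the complete subexpression starting at `start`."""
    need, i = 1, start
    while need > 0:
        need += 1 if tokens[i] in OPERATORS else -1
        i += 1
    return i


def swap_subtrees(xi: List[str], i: int, xj: List[str], j: int) -> Tuple[List[str], List[str]]:
    """Exchange the subexpression rooted at xi[i] with the one rooted at xj[j]."""
    end_i, end_j = subtree_end(xi, i), subtree_end(xj, j)
    child_i = xi[:i] + xj[j:end_j] + xi[end_i:]
    child_j = xj[:j] + xi[i:end_i] + xj[end_j:]
    return child_i, child_j


def crossover(xi: List[str], xj: List[str], p_cross: float, rng: random.Random) -> List[str]:
    """Retain one expression: a crossed child if p < p_cross, else one of the parents."""
    if rng.random() >= p_cross:
        return list(rng.choice([xi, xj]))
    i, j = rng.randrange(len(xi)), rng.randrange(len(xj))
    return rng.choice(swap_subtrees(xi, i, xj, j))


xi = ["*", "-", "A", "B", "+", "C", "D"]   # (A-B) * (C+D)
xj = ["+", "A", "-", "+", "C", "D", "B"]   # A + ((C+D) - B)
child_i, child_j = swap_subtrees(xi, 1, xj, 2)
print(child_i)  # ['*', '-', '+', 'C', 'D', 'B', '+', 'C', 'D']  i.e. (C+D-B) * (C+D)
print(child_j)  # ['+', 'A', '-', 'A', 'B']                      i.e. A + (A-B)
```

Because whole slices are exchanged, both children remain valid expressions regardless of which
positions are drawn.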
Step 3: According to a preset mutation probability, mutation is applied to elements of the m-n
feature expressions retained after random crossover, obtaining m-n feature expressions after
random mutation.
That is, for the m-n feature expressions other than the n feature expressions with the highest
fitness, each feature expression is mutated according to the mutation probability (pMutation).
For example, a feature expression Xi is selected at random from the m-n feature expressions and a
random number p uniformly distributed on [0,1] is generated; if p < pMutation, Xi is mutated.
In order to ensure the validity of the feature expression after mutation, as well as the
integrity of its subexpressions, mutation is divided into the following four modes; for any
particular mutation, one of them may be selected at random:
The first: a single feature node in the feature expression is replaced with a subexpression. A
single feature node refers to one data feature or one operator in the feature expression.
In this mutation mode, a single feature node in the feature expression is randomly replaced with
a subexpression, as shown in Figure 4. Because overly complex features are at a disadvantage in
aspects such as interpretability, the maximum depth of the added subexpression is preferably 2.
The second: a subexpression in the feature expression is reduced to a single feature node.
As shown in Figure 5, this mutation operation can be regarded as the inverse of the first. Owing
to the needs of the encoding, when a subexpression tree is cut, if the start node has both a left
subtree and a right subtree, the current node and its left subtree may always be chosen for
removal.
The third: a single feature node in the feature expression is replaced with a randomly generated
single feature node.
In this mutation mode, a data feature node remains a data feature node after mutation, and an
operator node remains an operator node after mutation; the number of subtrees of the node must
not change. As shown in Figure 6, the operator "-" is replaced by "/".
The fourth: the feature expression is replaced with a randomly generated new feature expression.
Because the first to third mutation modes above all make changes on the basis of the m original
feature expressions, in order to avoid being trapped in a local optimum, a new feature expression
may be generated at random, with a certain probability, to replace the selected feature
expression.
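The four mutation modes can be sketched on a prefix token list as follows. This is an
illustration, not the application's implementation: it assumes binary operators and a pre-order
encoding, it simplifies the second mode to replacing a whole subexpression with one data feature,
and the names random_expr and mutate as well as the depth limits are hypothetical.

```python
import random
from typing import List

FEATURES = ["A", "B", "C", "D"]
OPERATORS = ["+", "-", "*", "/"]  # assumption: every operator is binary


def subtree_end(tokens: List[str], start: int) -> int:
    """Return the index one past the complete subexpression starting at `start`."""
    need, i = 1, start
    while need > 0:
        need += 1 if tokens[i] in OPERATORS else -1
        i += 1
    return i


def random_expr(rng: random.Random, depth: int) -> List[str]:
    """Random prefix expression with at most `depth` nested operator levels."""
    if depth == 0 or rng.random() < 0.3:
        return [rng.choice(FEATURES)]
    return [rng.choice(OPERATORS)] + random_expr(rng, depth - 1) + random_expr(rng, depth - 1)


def mutate(tokens: List[str], p_mutation: float, rng: random.Random) -> List[str]:
    """Apply one randomly chosen mutation mode with probability p_mutation."""
    if rng.random() >= p_mutation:
        return list(tokens)
    mode = rng.randrange(4)  # the four modes are equally likely
    leaves = [k for k, t in enumerate(tokens) if t not in OPERATORS]
    ops = [k for k, t in enumerate(tokens) if t in OPERATORS]
    if mode == 0:  # first mode: grow a data feature into a subexpression (depth <= 2)
        k = rng.choice(leaves)
        grown = [rng.choice(OPERATORS)] + random_expr(rng, 1) + random_expr(rng, 1)
        return tokens[:k] + grown + tokens[k + 1:]
    if mode == 1 and ops:  # second mode (simplified): shrink a subexpression to a feature
        k = rng.choice(ops)
        return tokens[:k] + [rng.choice(FEATURES)] + tokens[subtree_end(tokens, k):]
    if mode == 2:  # third mode: swap one node for another of the same kind
        k = rng.randrange(len(tokens))
        kind = OPERATORS if tokens[k] in OPERATORS else FEATURES
        return tokens[:k] + [rng.choice([s for s in kind if s != tokens[k]])] + tokens[k + 1:]
    # fourth mode (also used when there is no subexpression to shrink): fresh expression
    return random_expr(rng, 3)


rng = random.Random(42)
base = ["/", "-", "A", "B", "+", "C", "D"]  # (A-B)/(C+D)
print(mutate(base, 0.6, rng))
```

Each branch edits whole subexpressions or swaps a node for one of the same kind, so the result is
always a valid prefix expression.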
Step 4: The coded individuals corresponding respectively to the n feature expressions with the
highest fitness and to the m-n feature expressions after random mutation are determined as the m
coded individuals included in the (N+1)th-generation fitness evaluation tasks.
S110: The host node generates multiple (N+1)th-generation fitness evaluation tasks based on the
coding file and issues each (N+1)th-generation fitness evaluation task to a different child node;
S104 is performed, after which N is incremented by 1 and the flow returns to step S105.
In order to further illustrate the idea of the embodiment of the present application, a specific
example is described below.
As shown in Table 5, there are 50 data samples, each belonging to one of three species of the
genus Iris, namely mountain iris, Iris versicolor and Virginia iris. Each sample initially has
four data features: calyx length, calyx width, petal length and petal width.
Calyx length A | Calyx width B | Petal length C | Petal width D | Species Y
5.1 | 3.5 | 1.4 | 0.2 | Mountain iris
4.9 | 3 | 1.4 | 0.2 | Mountain iris
4.7 | 3.2 | 1.3 | 0.2 | Mountain iris
6.6 | 2.9 | 4.6 | 1.3 | Iris versicolor
5.2 | 2.7 | 3.9 | 1.4 | Iris versicolor
5 | 2 | 3.5 | 1 | Iris versicolor
6.3 | 2.8 | 5.1 | 1.5 | Iris versicolor
6.1 | 2.6 | 5.6 | 1.4 | Iris versicolor
7.7 | 3 | 6.1 | 2.3 | Iris versicolor
…… | …… | …… | …… | Iris versicolor
Table 5
In the front-end interface, the user selects the feature generation algorithm based on
depth-first encoding (DFP) and selects the Gini coefficient for evaluating fitness. The user also
sets the crossover probability pCross and the mutation probability pMutation, for example
pCross=0.5 and pMutation=0.6.
The four data features above are denoted {A, B, C, D}, and the operator set is {+, -, ×, /}.
In the initialization task, 50 feature expressions are generated at random according to the DFP
encoding; each feature expression represents one newly generated feature. For example, the DFP
encoding corresponding to the newly generated feature expression (A-B)/(C+D) is shown in Table 6
below:
Table 6
In the fitness evaluation tasks, the respective fitness Gini(i) of the 50 feature expressions is
calculated, where i = {1, 2, 3, ..., 49, 50}. In the iteration task, selection, crossover and
mutation operations are applied to the feature expressions in order to search for optimal new
features. Specifically, the 5 best feature expressions (i.e. the 5 feature expressions with the
highest Gini coefficient) are selected first.
Then, with the 5 best feature expressions retained, two feature expressions Xi and Xj are
selected at random from the 50 feature expressions and a random number p uniformly distributed on
[0,1] is generated. If p < pCross, feature crossover is performed on Xi = (A-B) × (C+D) and
Xj = A + (C+D-B): as shown in Figure 3, the left subtree of Xi (i.e. the subexpression on the
left) is exchanged with the right subtree of Xj (i.e. the subexpression on the right). This action
of randomly selecting two feature expressions is repeated 45 times, and each time one feature
expression is retained after the random crossover (if crossover is performed according to the
crossover probability, one new feature expression after crossover is retained; otherwise, one
feature expression that has not undergone crossover is retained).
Finally, with the 5 best feature expressions still retained, a feature expression Xi is selected
at random from the 45 feature expressions obtained after random crossover (which, depending on
the probability, may or may not have undergone crossover), and a random number p uniformly
distributed on [0,1] is generated. If p < pMutation, Xi is mutated, with one of the first to
fourth mutation modes above selected at random with equal probability. This operation of randomly
selecting a feature expression for mutation is repeated 45 times, yielding 45 feature expressions
that, depending on the probability, have or have not been mutated. Together with the 5 best
feature expressions, these 50 feature expressions are evaluated again, and the above process is
repeated.
As shown in Table 7, after the newly generated feature expressions are added, the accuracy of
model classification is clearly improved compared with using only the original features.
Table 7
Based on the same inventive concept, the embodiment of the present application further provides a
feature generation system corresponding to the feature generation method. Because the principle
by which the system solves the problem is similar to that of the feature generation method of the
embodiment of the present application, the implementation of the system may refer to the
implementation of the method, and repeated parts are not described again.
Figure 7 is a schematic structural diagram of the feature generation system provided by the
embodiment of the present application, which includes:
A host node 71, configured to, after receiving the evaluation results sent by the multiple child
nodes performing the Nth-generation fitness evaluation tasks, issue an output task to one
selected child node if N is determined to be equal to the maximum number of iterations, and
otherwise issue an iteration task to one selected child node; and further configured to generate
multiple (N+1)th-generation fitness evaluation tasks based on the coding file generated by the
child node performing the iteration task, and to issue each (N+1)th-generation fitness evaluation
task to a different child node, wherein each fitness evaluation task contains one coded
individual;
A child node 72 performing the output task, configured to determine and output the n feature
expressions with the highest fitness based on the evaluation results of the Nth-generation
fitness evaluation tasks; the n feature expressions with the highest fitness refer to the first n
feature expressions after sorting by fitness from high to low;
A child node 73 performing the iteration task, configured to generate, based on the evaluation
results of the Nth-generation fitness evaluation tasks, a coding file containing multiple coded
individuals and to send it to the host node; wherein the multiple coded individuals include the n
coded individuals corresponding to the n feature expressions with the highest fitness evaluated
by the Nth-generation fitness evaluation tasks;
A child node 74 performing a fitness evaluation task, configured to calculate the fitness of the
feature expression indicated by the coded individual in the allocated fitness evaluation task,
and to send the calculated fitness to the host node as the evaluation result.
Optionally, the coded individuals are generated using depth-first encoding (DFP); the child node
73 performing the iteration task is specifically configured to:
Select, based on the evaluation results of the Nth-generation fitness evaluation tasks, the n
feature expressions with the highest fitness from the m feature expressions evaluated by the
Nth-generation fitness evaluation tasks;
Select two feature expressions at random from the m feature expressions and, according to a
preset crossover probability, select one subexpression from each of the two feature expressions
and exchange them, retaining one feature expression after the random crossover; repeat this step
m-n times to obtain the m-n feature expressions retained after random crossover;
Apply mutation, according to a preset mutation probability, to elements of the m-n feature
expressions retained after random crossover, obtaining m-n feature expressions after random
mutation;
Determine the coded individuals corresponding respectively to the n feature expressions with the
highest fitness and to the m-n feature expressions after random mutation as the m coded
individuals included in the (N+1)th-generation fitness evaluation tasks.
Optionally, the child node 73 performing the iteration task is specifically configured to apply
mutation to elements of the m-n feature expressions retained after random crossover according to
the following steps:
For any feature expression, select one of the following processing modes at random and apply the
mutation:
Replace a single feature node in the feature expression with a subexpression, where a single
feature node refers to one data feature or one operator in the feature expression;
Reduce a subexpression in the feature expression to a single feature node;
Replace a single feature node in the feature expression with a randomly generated single feature
node;
Replace the feature expression with a randomly generated new feature expression.
Optionally, the child node 73 performing the iteration task is specifically configured to select
the n feature expressions with the highest fitness from the m feature expressions evaluated by
the Nth-generation fitness evaluation tasks according to the following steps:
If feature expressions with identical fitness exist among the m feature expressions, reject k
redundant feature expressions so that no two of the remaining feature expressions have the same
fitness; select the n feature expressions with the highest fitness from the remaining feature
expressions, and reduce m by k.
Optionally, the host node 71 is further configured to:
After receiving a feature generation task, obtain from a data server the data file needed to
perform the feature generation task, and transmit the obtained data file to every cluster
computing machine in the cluster system;
The child node 74 performing a fitness evaluation task is specifically configured to:
Read, from the cluster computing machine on which it is located, the feature data indicated by
the coded individual in the allocated fitness evaluation task, substitute the feature data read
into the feature expression corresponding to the coded individual, and calculate the fitness of
the feature expression with the feature data substituted in by calling the fitness function on
the cluster computing machine on which it is located.
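As an illustration of substituting feature data into a feature expression, the following sketch
evaluates a prefix-encoded expression on individual data rows. It assumes a pre-order token list
and binary operators; the function name evaluate and the handling of division by zero (returning
NaN) are assumptions, and the fitness function itself (for example one based on the Gini
coefficient) is deliberately not shown.

```python
import operator
from typing import Dict, List, Tuple

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


def evaluate(tokens: List[str], row: Dict[str, float]) -> float:
    """Compute the value of a prefix-encoded feature expression for one data sample."""

    def walk(pos: int) -> Tuple[float, int]:
        token = tokens[pos]
        if token not in OPS:
            return row[token], pos + 1       # data feature: look up its value
        left, pos = walk(pos + 1)            # operator: evaluate left operand ...
        right, pos = walk(pos)               # ... then right operand
        if token == "/" and right == 0:
            return float("nan"), pos         # assumption: undefined values become NaN
        return OPS[token](left, right), pos

    value, _ = walk(0)
    return value


new_feature = ["/", "-", "A", "B", "+", "C", "D"]   # (A-B)/(C+D)
rows = [
    {"A": 5.1, "B": 3.5, "C": 1.4, "D": 0.2},
    {"A": 4.9, "B": 3.0, "C": 1.4, "D": 0.2},
]
column = [evaluate(new_feature, row) for row in rows]
print(column)  # values close to 1.0 and 1.1875
```

The resulting column of new feature values is what a fitness function would then score against
the class label.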
Optionally, the host node 71 is further configured to: issue, to one selected child node, the
initialization task corresponding to the feature generation task received by the host node;
The system further includes: a child node 75 performing the initialization task, configured to
randomly generate a coding file containing multiple initialized coded individuals by calling the
initialization function on the cluster computing machine on which it is located;
The host node 71 is further configured to: generate multiple first-generation fitness evaluation
tasks based on the multiple initialized coded individuals generated by the child node 75
performing the initialization task, and issue each generated first-generation fitness evaluation
task to a different child node.
Optionally, the child node 72 performing the output task is specifically configured to:
Determine the n feature expressions with the highest fitness by calling the evaluation results of
the Nth-generation fitness evaluation tasks stored by the host node in the file system; output a
feature generation result report, fed back to the user, that indicates the n feature expressions
with the highest fitness; and output the feature data corresponding to the n feature expressions
with the highest fitness for subsequent calls.
Those skilled in the art should understand that the embodiments of the present application may be
provided as a method, a system or a computer program product. Therefore, the application may take
the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment
combining software and hardware aspects. Moreover, the application may take the form of a
computer program product implemented on one or more computer-usable storage media (including but
not limited to magnetic disk storage, CD-ROM and optical memory) containing computer-usable
program code.
The application is described with reference to flow charts and/or block diagrams of the method,
device (system) and computer program product according to the embodiments of the present
application. It should be understood that each flow and/or block in the flow charts and/or block
diagrams, and combinations of flows and/or blocks in the flow charts and/or block diagrams, can
be implemented by computer program instructions. These computer program instructions may be
provided to the processor of a general-purpose computer, a special-purpose computer, an embedded
processor or other programmable data processing equipment to produce a machine, so that the
instructions executed by the processor of the computer or other programmable data processing
equipment produce a device for implementing the functions specified in one or more flows of the
flow charts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of
directing a computer or other programmable data processing equipment to work in a specific
manner, so that the instructions stored in the computer-readable memory produce an article of
manufacture including an instruction device, the instruction device implementing the functions
specified in one or more flows of the flow charts and/or one or more blocks of the block
diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data
processing equipment, so that a series of operation steps is performed on the computer or other
programmable equipment to produce computer-implemented processing, whereby the instructions
executed on the computer or other programmable equipment provide steps for implementing the
functions specified in one or more flows of the flow charts and/or one or more blocks of the
block diagrams.
Although preferred embodiments of the application have been described, those skilled in the art,
once they learn of the basic inventive concept, can make other changes and modifications to these
embodiments. Therefore, the appended claims are intended to be interpreted as including the
preferred embodiments and all changes and modifications that fall within the scope of the
application.
Obviously, those skilled in the art can make various changes and modifications to the application
without departing from the spirit and scope of the application. Thus, if these changes and
modifications of the application fall within the scope of the claims of the application and their
technical equivalents, the application is also intended to include these changes and
modifications.