CN106708609A - Characteristics generation method and system - Google Patents

Characteristics generation method and system

Info

Publication number
CN106708609A
Authority
CN
China
Prior art keywords
task
fitness
generation
feature
feature expression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510784474.4A
Other languages
Chinese (zh)
Other versions
CN106708609B (en)
Inventor
冯天恒
王雯晋
乔彦辉
王学庆
周胜臣
方炜超
娄鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Advantageous New Technologies Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201510784474.4A priority Critical patent/CN106708609B/en
Publication of CN106708609A publication Critical patent/CN106708609A/en
Application granted granted Critical
Publication of CN106708609B publication Critical patent/CN106708609B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The invention relates to the field of Internet technology, and in particular to a feature generation method and a feature generation system, used to solve the problems of insufficient big-data processing capacity and low evaluation efficiency when fitness evaluation is performed on a large number of newly generated features. In the embodiments, the whole iterative process comprises an initialization task executed by a selected child node, fitness evaluation tasks of each generation executed in parallel by multiple child nodes, an iteration task executed by a selected child node, and an output task executed after all fitness evaluation tasks have completed; a master node is responsible for coordinating and scheduling the whole iterative process. Because the fitness evaluation tasks of each generation can be executed in parallel by multiple child nodes, the efficiency of the whole feature generation process is improved; and because the master node indicates, in the form of encoded individuals, the feature expressions to be evaluated to the child nodes that execute the fitness evaluation tasks, the amount of data transmitted can be reduced.

Description

Feature generation method and system
Technical field
The present application relates to the field of Internet technology, and in particular to a feature generation method and system.
Background art
With the development of Internet information technology, more and more kinds of business services are provided to users over the Internet, and how to better provide business services to users is a major issue in the Internet industry. Model-based classification can effectively improve the level of business service. For example, users' income levels may be classified into three categories, high, medium and low, and different information-recommendation services can then be provided to different users based on this classification.
When classifying based on a model, multiple features need to be input, and a good feature set can effectively improve the accuracy of model classification. In many cases the amount of information contained in a single feature is limited, while significant classification performance can be obtained after features are combined and transformed. Therefore, new features can be generated based on the original feature set, so that these new features reflect the latent classification capability of the original feature set. At the same time, in order to avoid the influence of a large number of invalid or redundant features produced by such transformations on the accuracy of model classification, fitness evaluation needs to be performed on the newly generated features.
At present, when fitness evaluation is performed on a large number of newly generated features, the problems of insufficient big-data processing capacity and low evaluation efficiency commonly arise, which limits further optimization of the newly generated features and makes it impossible to obtain valuable features in a timely and effective manner.
Summary of the invention
The embodiments of the present application provide a feature generation method and system, used to solve the problems of insufficient big-data processing capacity and low evaluation efficiency when fitness evaluation is performed on a large number of newly generated features, and also provide a feature generation algorithm that effectively obtains high-value new features.
An embodiment of the present application provides a feature generation method, comprising:
Step A: after receiving the evaluation results sent by the multiple child nodes that execute the Nth-generation fitness evaluation tasks, the master node issues an output task to a selected child node if it determines that N is equal to the maximum number of iterations, and otherwise issues an iteration task to a selected child node;
Step B: the child node that executes the output task determines and outputs the n feature expressions with the highest fitness based on the evaluation results of the Nth-generation fitness evaluation tasks; the n feature expressions with the highest fitness are the first n feature expressions when ranked from high to low by fitness;
Step C: the child node that executes the iteration task generates, based on the evaluation results of the Nth-generation fitness evaluation tasks, an encoding file containing multiple encoded individuals and sends it to the master node; the multiple encoded individuals include the n encoded individuals corresponding to the n feature expressions with the highest fitness as evaluated by the Nth-generation fitness evaluation tasks;
Step D: the master node generates multiple (N+1)th-generation fitness evaluation tasks based on the encoding file, and issues each (N+1)th-generation fitness evaluation task to a different child node, wherein each fitness evaluation task contains one encoded individual;
Step E: each child node that executes a fitness evaluation task performs a fitness calculation on the feature expression indicated by the encoded individual in its allocated fitness evaluation task, and sends the calculated fitness to the master node as an evaluation result; N is incremented by 1 and the process returns to step A.
Optionally, the encoded individuals are generated by means of depth-first programming (DFP) encoding. In step C, the child node that executes the iteration task generating, based on the evaluation results of the Nth-generation fitness evaluation tasks, an encoding file containing multiple encoded individuals includes:
Step C1: the child node that executes the iteration task selects, based on the evaluation results of the Nth-generation fitness evaluation tasks, the n feature expressions with the highest fitness from the m feature expressions evaluated by the Nth-generation fitness evaluation tasks;
Step C2: two feature expressions are randomly selected from the m feature expressions; according to a preset crossover probability, one sub-expression is selected from each of the two feature expressions and the two sub-expressions are exchanged, and one feature expression resulting from the random crossover is retained; this step is repeated m−n times to obtain the m−n retained feature expressions after random crossover;
Step C3: according to a preset mutation probability, mutation processing is performed on elements in the m−n retained feature expressions after random crossover, to obtain m−n feature expressions after random mutation;
Step C4: the encoded individuals respectively corresponding to the n feature expressions with the highest fitness and to the m−n feature expressions after random mutation are determined as the m encoded individuals contained in the (N+1)th-generation fitness evaluation tasks.
Optionally, in step C3, performing mutation processing on elements in the m−n retained feature expressions after random crossover includes:
for any feature expression, randomly selecting one of the following processing modes to perform mutation:
replacing a single feature node in the feature expression with a sub-expression, where a single feature node is one data feature or one operator in the feature expression;
reducing a sub-expression in the feature expression to a single feature node;
replacing a single feature node in the feature expression with a randomly generated single feature node;
replacing the feature expression with a randomly generated new feature expression.
Optionally, in step C1, the child node that executes the iteration task selecting, based on the evaluation results of the Nth-generation fitness evaluation tasks, the n feature expressions with the highest fitness from the m feature expressions evaluated by the Nth-generation fitness evaluation tasks includes:
if fitness-identical feature expressions exist among the m feature expressions, rejecting the k redundant feature expressions so that no fitness-identical feature expressions remain among the remaining feature expressions;
selecting the n feature expressions with the highest fitness from the remaining feature expressions, and subtracting k from m in steps C2 to C4.
Optionally, before step A, the method further includes:
after receiving the feature generation task, the master node obtains from a data server the data file required for executing the feature generation task, and distributes the obtained data file to every cluster computing machine in the cluster system;
in step E, the child node that executes a fitness evaluation task performing the fitness calculation includes:
the child node that executes the fitness evaluation task reads, from the cluster computing machine on which it is located, the feature data indicated by the encoded individual in its allocated fitness evaluation task, substitutes the read feature data into the feature expression corresponding to the encoded individual, and performs the fitness calculation on the feature expression after the feature data are substituted in, by calling the fitness evaluation function on the cluster computing machine on which it is located.
Optionally, before step A, the method further includes:
the master node issues, to a selected child node, the initialization task corresponding to the feature generation task received by the master node;
the child node that executes the initialization task randomly generates an encoding file containing multiple initialized encoded individuals by calling the initialization function on the cluster computing machine on which it is located;
the master node generates multiple first-generation fitness evaluation tasks based on the multiple initialized encoded individuals, and issues each generated first-generation fitness evaluation task to a different child node.
Optionally, in step B, the child node that executes the output task determining and outputting the n feature expressions with the highest fitness based on the evaluation results of the Nth-generation fitness evaluation tasks includes:
the child node that executes the output task determines the n feature expressions with the highest fitness by calling the evaluation results of the Nth-generation fitness evaluation tasks stored by the master node in the file system, outputs a feature generation result report that is fed back to the user and indicates the n feature expressions with the highest fitness, and outputs the feature data corresponding to the n feature expressions with the highest fitness for subsequent calls.
An embodiment of the present application provides a feature generation system, comprising:
a master node, configured to: after receiving the evaluation results sent by the multiple child nodes that execute the Nth-generation fitness evaluation tasks, issue an output task to a selected child node if it determines that N is equal to the maximum number of iterations, and otherwise issue an iteration task to a selected child node; and further configured to generate multiple (N+1)th-generation fitness evaluation tasks based on the encoding file generated by the child node that executes the iteration task, and issue each (N+1)th-generation fitness evaluation task to a different child node, wherein each fitness evaluation task contains one encoded individual;
a child node that executes the output task, configured to determine and output the n feature expressions with the highest fitness based on the evaluation results of the Nth-generation fitness evaluation tasks; the n feature expressions with the highest fitness are the first n feature expressions when ranked from high to low by fitness;
a child node that executes the iteration task, configured to generate, based on the evaluation results of the Nth-generation fitness evaluation tasks, an encoding file containing multiple encoded individuals and send it to the master node, wherein the multiple encoded individuals include the n encoded individuals corresponding to the n feature expressions with the highest fitness as evaluated by the Nth-generation fitness evaluation tasks;
child nodes that execute the fitness evaluation tasks, each configured to perform a fitness calculation on the feature expression indicated by the encoded individual in its allocated fitness evaluation task, and send the calculated fitness to the master node as an evaluation result.
With the above method or system, because the fitness evaluation tasks of each generation can be executed in parallel by multiple child nodes, fitness evaluation efficiency is improved, which in turn improves the efficiency of the whole feature generation process and ensures the timeliness of new feature generation. When issuing fitness evaluation tasks to the child nodes, the master node does not directly transfer the feature data to the child nodes that execute the fitness evaluation tasks, but instead indicates the feature expressions to be evaluated in the form of encoded individuals, so that the amount of data transmitted can be reduced, transmission efficiency is increased, and memory usage is reduced. In addition, the iterative algorithm proposed in the embodiments of the present application, based on depth-first programming (DFP) encoding, effectively guarantees the integrity of sub-expressions, and in each iteration the best several feature expressions from the previous iteration are retained, so that after the last iteration is completed the best several feature expressions of the whole iterative process can be obtained.
Brief description of the drawings
Fig. 1 is a flow chart of the feature generation method provided by an embodiment of the present application;
Fig. 2 is a schematic diagram of task scheduling based on the iterative computing framework;
Fig. 3 is a schematic diagram of feature expression crossover;
Fig. 4 is a schematic diagram of the first mutation mode;
Fig. 5 is a schematic diagram of the second mutation mode;
Fig. 6 is a schematic diagram of the third mutation mode;
Fig. 7 is a structural schematic diagram of the feature generation system provided by an embodiment of the present application.
Detailed description of the embodiments
In the embodiments of the present application, the whole iterative process includes: an initialization task executed by a selected child node, fitness evaluation tasks of each generation executed in parallel by multiple child nodes, an iteration task executed by a selected child node, and an output task executed after all fitness evaluation tasks have completed. The master node is responsible for assigning the appropriate task to each child node and for coordinating and scheduling the whole iterative process. Because the fitness evaluation tasks of each generation can be executed in parallel by multiple child nodes, fitness evaluation efficiency is improved, which in turn improves the efficiency of the whole feature generation process and ensures the timeliness of new feature generation. When issuing fitness evaluation tasks to the child nodes, the master node does not directly transfer the feature data to the child nodes that execute the fitness evaluation tasks, but instead indicates the feature expressions to be evaluated in the form of encoded individuals, so that the amount of data transmitted can be reduced, transmission efficiency is increased, and memory usage is reduced.
In addition, for the iteration task, the embodiments of the present application propose an iterative algorithm based on depth-first programming (DFP) encoding, also referred to as the feature generation algorithm. The algorithm encodes with the DFP encoding scheme and effectively guarantees the integrity of sub-expressions during feature generation; it also exploits inheritance: in each iteration the best several feature expressions from the previous iteration are retained, so that after the last iteration is completed the best several feature expressions of the whole iterative process can be obtained.
In order to make the purposes, technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings. It should be noted that, based on the embodiments of the present application, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the scope of protection of the present application.
Fig. 1 is a flow chart of the feature generation method provided by an embodiment of the present application, and Fig. 2 is a schematic diagram of task scheduling based on the iterative computing framework. The method comprises the following steps:
S101: the master node issues, to a selected child node, the initialization task corresponding to the feature generation task received by the master node.
In the above step, the master node may randomly select a child node to execute the initialization task, or select one according to the load of each child node, and may indicate the following parameter information, shown in Table 1, to the child node that executes the initialization task:
Parameter name | Parameter type | Meaning
filename | String | Data file name
fieldSize | Int | Total number of fields in the data file
popSize | Int | Number of individuals in the population
parameter | String | Algorithm parameters
Table 1
Here, the String-type parameter filename indicates the data file in which the data samples are located, and the Int-type parameter fieldSize indicates the number of fields in the data file, where each field identifies one data feature. The feature expressions described below are expressions over data features and operators, for example the feature expression X4 = X1 + X2 × X3, where X1, X2 and X3 are different data features and "+" and "×" are operators. The Int-type parameter popSize indicates the number of encoded individuals generated in the initialization task, each encoded individual corresponding to one feature expression. The parameter parameter is of type String, stored in key-value form, and may include, for example, the maximum depth of the generated feature expressions (for example, the depth of X1 + X2 × X3 is 3).
The above parameter information may be input by the user through a front-end interface. Specifically, on a web client the user is guided by the front-end interface to import data and set requirements, and finally initiates a task request to the back end. This involves three sub-processes: submitting data, selecting an algorithm, and setting parameters. Submitting data means that the user enters, through the front-end interface, the name of the data table (data file) containing the data to be processed, selects the fields to be processed and sets the field types. Selecting an algorithm means that, after submitting the data, the user selects an algorithm according to the selected field types; some recommendations are provided, and the user may select a suitable algorithm according to actual needs or submit a custom algorithm. In the embodiments of the present application, the algorithm the user needs to select is mainly the fitness evaluation algorithm; for feature generation, the embodiments of the present application provide a dedicated feature generation algorithm. Setting parameters means setting the parameters of the selected algorithm; all parameters have default values for reference. After these three sub-processes finish, all the relevant information is collected into one task request and sent to the back end, and the back-end computing process is then executed starting from S101. The back-end computation, i.e. the process of calling the relevant algorithms to perform the calculation, is isolated from the user; the task running status can be queried through the front-end interface and shown to the user in the form of a rolling log. After all fitness evaluation tasks have completed, the execution results can be output; specifically, the file storing the results (such as the json file introduced later) is read, parsed and presented to the user in a certain visual form.
S102: the child node that executes the initialization task generates an encoding file containing multiple initialized encoded individuals by calling the initialization function on the cluster computing machine on which it is located.
Here, the child node (Slave node) that executes the initialization task, according to the instruction of the master node, calls the relevant algorithm source code (namely the initialization function) in the source code library on the cluster computing machine on which it is located, in script mode, to generate the encoding file; an intermediate data file may also be generated, and the encoding file and the intermediate data file are returned to the master node. The intermediate data file may contain intermediate results needed by subsequent analysis; in the initialization task, since no intermediate results have been produced yet, the data in this intermediate data file may be set to null or to default values. The master node (Master node) may save these files in its file system to be called by subsequent iterative calculations. In addition, in the subsequent scheduling process, the encoding file and intermediate data file produced by each iteration are also all stored in the file system of the master node; therefore, once a system failure occurs, calculation can resume from the iteration before the failure, giving strong recoverability.
In a specific implementation, an encoded individual is used to identify a feature expression; each code in the encoded individual identifies one field (namely one feature) involved in the feature expression or one operator in the feature expression. For example, for a data table with four fields [Y, X1, X2, X3], each field corresponding to one data feature in the data table, consider the feature expression X4 = X1 + X2 × X3. If the code of "+" is −1, the code of "×" is −3, the code of X1 is 1, the code of X2 is 2 and the code of X3 is 3, then the encoding of this feature expression in depth-first programming (DFP) form is [−1, 1, −3, 2, 3].
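The depth-first encoding of an expression tree can be illustrated with a short sketch (a non-authoritative Python illustration; the Node class and the convention that operators carry negative codes and data features positive ones are assumptions chosen to reproduce the [−1, 1, −3, 2, 3] example, not part of the patent text):

```python
# Minimal sketch of depth-first programming (DFP) encoding of a feature expression tree.
# Assumed convention: operators get negative codes, data features get positive codes.

OP_CODES = {'+': -1, '-': -2, '*': -3, '/': -4}    # assumed operator codes
FEATURE_CODES = {'X1': 1, 'X2': 2, 'X3': 3}        # assumed field codes

class Node:
    def __init__(self, value, children=()):
        self.value = value              # operator symbol or feature name
        self.children = list(children)

def dfp_encode(node):
    """Pre-order (depth-first) traversal: the codes of a complete
    sub-expression always occupy a contiguous slice of the result."""
    code = [OP_CODES.get(node.value, FEATURE_CODES.get(node.value))]
    for child in node.children:
        code += dfp_encode(child)
    return code

# X4 = X1 + X2 * X3
expr = Node('+', [Node('X1'), Node('*', [Node('X2'), Node('X3')])])
print(dfp_encode(expr))   # -> [-1, 1, -3, 2, 3]
```

Because the traversal is pre-order, every complete sub-expression maps to a contiguous segment of the code list, which is the property the crossover and mutation operations below rely on.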
S103: the master node generates multiple first-generation fitness evaluation tasks based on the multiple initialized encoded individuals, and issues each generated first-generation fitness evaluation task to a different child node; each fitness evaluation task contains one encoded individual.
If the application is used for model classification, the fitness can be measured using the Gini coefficient.
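As one possible reading of a Gini-based fitness, the impurity reduction obtained when a candidate feature is used to split the labelled samples could serve as the score; the following sketch is only an illustrative assumption, since the patent does not fix the exact formula:

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a set of class labels."""
    if not labels:
        return 0.0
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def gini_fitness(feature_values, labels):
    """Assumed fitness: Gini impurity reduction when the labelled samples are
    split on the candidate feature at its median (higher is better)."""
    threshold = sorted(feature_values)[len(feature_values) // 2]
    left  = [y for x, y in zip(feature_values, labels) if x <= threshold]
    right = [y for x, y in zip(feature_values, labels) if x > threshold]
    weighted = (len(left) * gini_impurity(left) +
                len(right) * gini_impurity(right)) / len(labels)
    return gini_impurity(labels) - weighted

# toy usage: a candidate feature column and the class labels of the samples
values = [1.2, 0.4, 3.1, 2.8, 0.9, 3.3]
labels = ['a', 'a', 'b', 'b', 'a', 'b']
print(round(gini_fitness(values, labels), 3))
```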
In the embodiments of the present application, in order to reduce the amount of data transmitted, when issuing fitness evaluation tasks to the child nodes, the master node does not directly transfer the feature data to the child nodes that execute the fitness evaluation tasks, but instead indicates the feature expressions to be evaluated in the form of encoded individuals, so that the amount of data transmitted can be reduced, transmission efficiency is increased and memory usage is reduced.
In S103, the master node may wait for the initialization task to finish by means of thread sleep, then determine the number of first-generation fitness evaluation tasks to be generated according to the number of encoded individuals in the encoding file, and create each fitness evaluation task. Specifically, for each fitness evaluation task, a sub-task identity (ID) and a population individual identity (ID) may be generated, and the task execution information of the fitness evaluation task, such as the sub-task ID, the population individual ID, the encoded individual and the algorithm parameters, may be stored in a task queue; each fitness evaluation task is then taken out of the task queue and issued to a child node that executes fitness evaluation tasks. Here, the child nodes that execute each generation of fitness evaluation tasks may be selected randomly by the master node, or selected by the master node according to the current load of each child node.
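The master node's task creation and dispatch described here can be pictured with a short sketch (illustrative Python; the Task container, the use of UUIDs for the IDs and the dispatch function are hypothetical names, not an API prescribed by the patent):

```python
import queue
import random
import uuid

class Task:
    """Assumed container for the fields listed in Table 2."""
    def __init__(self, filename, individual, parameter):
        self.filename = filename
        self.job_id = str(uuid.uuid4())      # sub-task ID
        self.pop_id = str(uuid.uuid4())      # population individual ID
        self.individual = individual         # encoded individual, e.g. [-1, 1, -3, 2, 3]
        self.parameter = parameter

def dispatch_generation(individuals, child_nodes, filename, parameter):
    """Create one fitness evaluation task per encoded individual and
    hand each one to a (here randomly chosen) child node."""
    tasks = queue.Queue()
    for ind in individuals:
        tasks.put(Task(filename, ind, parameter))
    assignments = []
    while not tasks.empty():
        task = tasks.get()
        node = random.choice(child_nodes)    # or pick by current load instead
        assignments.append((node, task))
    return assignments
```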
Table 2 shows the parameter information received by a child node that executes a fitness evaluation task:
Parameter name | Parameter type | Meaning
filename | String | Data file name
JobID | String | Task ID
popID | String | Population individual ID
individual | String | Encoded individual
parameter | String | Algorithm parameters
Table 2
Among the above parameters, the data file name (filename), task ID (JobID), population individual ID (popID), encoded individual (individual) and algorithm parameters (parameter) are all String-type parameters.
S104: each child node that executes a fitness evaluation task performs a fitness calculation on the feature expression indicated by the encoded individual in its allocated fitness evaluation task, and sends the calculated fitness to the master node as an evaluation result.
Here, the child node that executes the fitness evaluation task processes the task it is assigned; specifically, it calls the corresponding fitness function to perform the calculation, writes the fitness calculation result, the task ID, the population individual ID and the encoded individual into a json file named after the task ID, and returns the file to the file system of the master node. With the task ID, the master node can send an execution-progress query request containing the task ID to any child node that executes a fitness evaluation task, and receive the execution-progress information returned by that child node based on the task ID. With the population individual ID, after receiving an evaluation result containing the population individual ID and the encoded individual sent by a child node that executes a fitness evaluation task, the master node can match the population individual ID and encoded individual sent by that child node against the population individual ID and encoded individual issued to that child node in the fitness evaluation task, in order to verify the accuracy of the evaluation result fed back by the child node.
In S104, the child node that executes a fitness evaluation task can read only the feature data of the fields corresponding to the encoded individual, rather than the feature data of all fields, thereby reducing memory usage and better supporting parallel task processing. Preferably, after receiving the feature generation task, the master node may first obtain from the data server the data file required for executing the feature generation task, and distribute the obtained data file to every cluster computing machine in the cluster system. Correspondingly, the child node that executes a fitness evaluation task reads, from the cluster computing machine on which it is located, the feature data indicated by the encoded individual in its allocated fitness evaluation task, substitutes the read feature data into the feature expression corresponding to the encoded individual, and performs the fitness calculation on the feature expression after the feature data are substituted in, by calling the fitness function on the cluster computing machine on which it is located. Here, to make it easy for the child node to read the feature data indicated by the encoded individual, the master node downloads the feature data in advance from the data server to the cluster computing machines on which the child nodes are located. In an actual implementation, the child node could also read the required feature data directly from the data server, but that would naturally reduce evaluation efficiency to a large extent.
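A worker-side evaluation step consistent with this description might look as follows (an illustrative Python sketch reusing the assumed code tables of the earlier encoding sketch; the decoding of a DFP individual back into a callable expression is an assumption, not the patent's own code):

```python
def dfp_decode(code, op_names, feature_names):
    """Rebuild a nested (op, left, right) tuple from a DFP code list.
    Assumes all operators are binary, as in the examples."""
    pos = 0
    def parse():
        nonlocal pos
        c = code[pos]; pos += 1
        if c < 0:                       # operator
            return (op_names[c], parse(), parse())
        return feature_names[c]         # data feature
    return parse()

def evaluate_expr(tree, row):
    """Evaluate a decoded expression tree on one sample (a dict of feature values)."""
    if isinstance(tree, str):
        return row[tree]
    op, left, right = tree
    a, b = evaluate_expr(left, row), evaluate_expr(right, row)
    if op == '+': return a + b
    if op == '-': return a - b
    if op == '*': return a * b
    return a / b if b else 0.0          # '/', guarding against division by zero

# usage: decode [-1, 1, -3, 2, 3], compute X1 + X2*X3 per sample, then feed the
# resulting column and the class labels into the fitness function shown earlier
op_names = {-1: '+', -2: '-', -3: '*', -4: '/'}
feature_names = {1: 'X1', 2: 'X2', 3: 'X3'}
tree = dfp_decode([-1, 1, -3, 2, 3], op_names, feature_names)
samples = [{'X1': 1.0, 'X2': 2.0, 'X3': 3.0}, {'X1': 0.5, 'X2': 1.0, 'X3': 4.0}]
print([evaluate_expr(tree, s) for s in samples])   # -> [7.0, 4.5]
```

Only the columns named by the individual (here X1, X2, X3) need to be loaded, which matches the memory-saving point made above.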
S105: after receiving the evaluation results sent by the multiple child nodes that execute the Nth-generation fitness evaluation tasks, the master node judges whether N is equal to the maximum number of iterations; if so, S106 is executed, otherwise S108 is executed. Here N is a positive integer greater than or equal to 1 and increases as the number of iterations increases.
In a specific implementation, when all fitness evaluation tasks distributed by the master node in a given iteration have been executed, the master node collects the evaluation results returned by each child node and generates a csv file that is stored in the file system for later calls. At the same time, the master node judges the iterative process: if the iteration has satisfied the end condition, the iteration ends and a certain child node is instructed to execute the output task; if the end condition is not satisfied, a certain child node is instructed to execute the iteration task, i.e. to generate a new encoding file and a new intermediate data file and return them to the master node.
S106: the master node issues an output task to a selected child node.
Here, the child node that executes the output task may be selected randomly by the master node, or selected by the master node according to the current load of each child node.
Table 3 shows the parameter information received by the child node that executes the output task:
Parameter name | Parameter type | Meaning
filename | String | Data file name
popSize | Int | Number of individuals in the population
parameter | String | Algorithm parameters
PVfilename | String | Encoded individuals and corresponding evaluation-function return values
Midfilename | String | Intermediate data file name
IterNum | Int | Current iteration number
Table 3
The algorithm parameters (parameter) in Table 3 are a String-type parameter stored in key-value form; they may include, for example, the number of highest-fitness feature expressions to retain. The evaluation-function return value is the fitness.
S107: the child node that executes the output task determines and outputs the n feature expressions with the highest fitness based on the evaluation results of the Nth-generation fitness evaluation tasks. Here n is a positive integer greater than or equal to 1.
Because in the embodiments of the present application each iteration inherits the n best feature expressions from the previous iteration, the n feature expressions with the best evaluation results over all iterations can be selected directly based on the evaluation results of the last iteration.
In a specific implementation, if the iterative process has ended, the master node aggregates the evaluation results and stores the aggregated evaluation results in the file system. The child node that executes the output task determines the n feature expressions with the highest fitness by calling the evaluation results in the file system, and outputs a feature generation result report that is fed back to the user and indicates the n feature expressions with the highest fitness, together with the features corresponding to the n feature expressions with the highest fitness for subsequent calls. For example, the child node that executes the output task outputs a json file and a csv file: the json file is used to save the formatted results and is returned to the front end to generate the feature generation result report shown to the user, while the csv file is used to save the feature data corresponding to the finally generated and retained feature expressions and can ultimately be uploaded to the server for subsequent calls by the user. In addition, the system can automatically delete all related files to free up disk space.
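The output step could be pictured with a small sketch (illustrative Python; the file names, the report layout and the `feature_columns` mapping are assumptions — the patent only states that a json file holds the formatted report and a csv file holds the retained feature data):

```python
import csv
import json

def write_output(best, feature_columns, json_path="result.json", csv_path="features.csv"):
    """best: assumed list of (expression_string, fitness) pairs;
    feature_columns: assumed dict mapping expression_string -> computed feature values."""
    with open(json_path, "w") as f:                  # formatted report for the front end
        json.dump([{"expression": e, "fitness": fit} for e, fit in best], f, indent=2)
    with open(csv_path, "w", newline="") as f:       # retained feature data for later calls
        writer = csv.writer(f)
        writer.writerow([e for e, _ in best])
        for row in zip(*(feature_columns[e] for e, _ in best)):
            writer.writerow(row)
```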
S108: the master node issues an iteration task to a selected child node.
Here, the child node that executes the iteration task may be selected randomly by the master node, or selected by the master node according to the current load of each child node.
Table 4 shows the parameter information received by the child node that executes the iteration task:
Parameter name | Parameter type | Meaning
filename | String | Data file name
popSize | Int | Number of individuals in the population
parameter | String | Algorithm parameters
PVfilename | String | Encoded individuals and corresponding evaluation-function return values
Midfilename | String | Intermediate data file name
IterNum | Int | Current iteration number
Table 4
The algorithm parameters in Table 4 may include the crossover probability (pCross) and the mutation probability (pMutation); see the description of the iterative algorithm, namely the feature generation algorithm, below.
S109: the child node that executes the iteration task generates, based on the evaluation results of the Nth-generation fitness evaluation tasks, an encoding file containing multiple encoded individuals and sends it to the master node; the multiple encoded individuals include the n encoded individuals corresponding to the n feature expressions with the highest fitness as evaluated by the Nth-generation fitness evaluation tasks.
In this step, the child node selected by the master node to execute the iteration task calls the iteration function in script mode and, according to the evaluation results (namely the fitness values) of the Nth-generation fitness evaluation tasks, generates the encoding file for executing the (N+1)th-generation fitness evaluation tasks. It may also, by calling the intermediate data file saved by the master node after the (N−1)th-generation fitness evaluation tasks were executed, generate the intermediate data file after the Nth-generation fitness evaluation tasks are executed and return it to the master node. The intermediate data file may contain intermediate results needed by subsequent analysis, such as the n feature expressions with the best evaluation results from the Nth-generation fitness evaluation tasks and their respective fitness values.
In the embodiments of the present application, in order to retain the feature expressions with higher fitness in subsequent iterations, each iteration retains the n feature expressions with the highest fitness (namely the best evaluation results) from the previous iteration; apart from these n feature expressions, new feature expressions can be generated randomly or obtained by transformation in each iteration. In a preferred embodiment of the present application, in order to retain advantageous features while further generating features with higher fitness, an iterative algorithm based on depth-first programming (DFP) encoding is innovatively proposed. Unlike the traditional coding scheme based on gene expression programming (GEP), the DFP-based coding method traverses all branches of a feature expression in depth-first order, which allows sub-expressions to be retained intact.
The iteration task execution process based on the DFP scheme in the embodiments of the present application specifically includes:
Step 1: the child node that executes the iteration task selects, based on the evaluation results of the Nth-generation fitness evaluation tasks, the n feature expressions with the highest fitness from the m feature expressions evaluated by the Nth-generation fitness evaluation tasks. Here m is a positive integer greater than n.
In a traditional genetic algorithm, when selecting advantageous individuals, selection probabilities are assigned according to the fitness of the individuals in the population from high to low, and advantageous individuals are then selected according to the assigned probabilities. However, this approach often fails to retain advantageous individuals reliably. The embodiments of the present application instead directly retain the several best feature expressions produced by the previous iteration, rather than selecting them probabilistically. Preferably, to avoid excessive redundant features that would increase the complexity of feature generation, redundant features can be rejected after step 1: specifically, if fitness-identical feature expressions exist among the m feature expressions, the k redundant feature expressions are rejected so that no fitness-identical feature expressions remain among the remaining feature expressions; the n feature expressions with the highest fitness are then selected from the remaining feature expressions, and k is subtracted from m in steps 2 to 4. Here k is a positive integer smaller than m.
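A sketch of this elitist selection with redundancy removal might look as follows (illustrative Python; representing a candidate as an (encoded individual, fitness) pair is an assumption):

```python
def select_elite(evaluated, n):
    """Keep the n best candidates, first dropping fitness-duplicates.

    `evaluated` is an assumed list of (dfp_code, fitness) pairs; a candidate whose
    fitness equals an already-kept one is treated as redundant and rejected.
    Returns (elite, survivors, k) where k is the number of rejected duplicates.
    """
    seen, survivors = set(), []
    for code, fit in sorted(evaluated, key=lambda cf: cf[1], reverse=True):
        if fit in seen:
            continue                      # redundant: same fitness already kept
        seen.add(fit)
        survivors.append((code, fit))
    k = len(evaluated) - len(survivors)
    return survivors[:n], survivors, k
```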
Step 2: two feature expressions are randomly selected from the m feature expressions; according to a preset crossover probability, one sub-expression is selected from each of the two feature expressions and the two sub-expressions are exchanged, and one feature expression resulting from the random crossover is retained. This step is repeated m−n times to obtain the m−n retained feature expressions after random crossover.
In order to guarantee the legitimacy of the newly generated feature expressions, crossover based on DFP encoding finds one complete sub-expression in each of the two feature expressions and exchanges them; this is represented here in the form of expression trees, as shown in Fig. 3. Because the depth-first coding method is used, a complete sub-expression is a contiguous character string in the encoding, so it is easy to find. This crossover mode not only guarantees the legitimacy of the feature expressions after crossover, but also guarantees the integrity of sub-expressions, so that advantageous sub-expressions can be inherited effectively. In addition, to avoid getting trapped in local optima, the n preferred feature expressions also take part in crossover with other feature expressions during the crossover process. For example, among the m feature expressions, two feature expressions Xi and Xj are randomly selected and a random number p uniformly distributed on [0, 1] is generated; if p < the crossover probability (pCross), sub-expression crossover is performed on the feature expressions Xi = (A−B) × (C+D) and Xj = A + (C+D−B), for example exchanging the sub-expression (A−B) of Xi with the sub-expression (C+D−B) of Xj. The action of randomly selecting two feature expressions from the m feature expressions evaluated by the Nth-generation fitness evaluation tasks is repeated m−n times (whether crossover actually happens each time is probabilistic); each time, either the new feature expression after crossover is retained or, if no crossover took place, one of the randomly selected feature expressions is retained.
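A sketch of DFP-based sub-expression crossover on the flat code lists is given below (illustrative Python reusing the assumed conventions of the earlier encoding sketch, where operators are binary and carry negative codes, so the span of a complete sub-expression can be located directly in the list):

```python
import random

def subtree_span(code, start):
    """Return the end index (exclusive) of the complete sub-expression
    starting at `start` in a DFP code list (binary operators assumed)."""
    need, i = 1, start
    while need:
        need += 1 if code[i] < 0 else -1   # an operator adds one pending operand
        i += 1
    return i

def crossover(code_a, code_b, p_cross, rng=random):
    """With probability p_cross, swap one randomly chosen complete sub-expression
    between the two parents and return one child; otherwise return one parent."""
    if rng.random() >= p_cross:
        return list(rng.choice((code_a, code_b)))
    ia = rng.randrange(len(code_a)); ja = subtree_span(code_a, ia)
    ib = rng.randrange(len(code_b)); jb = subtree_span(code_b, ib)
    return code_a[:ia] + code_b[ib:jb] + code_a[ja:]

# (A-B)*(C+D) with assumed codes A..D = 1..4:  [-3, -2, 1, 2, -1, 3, 4]
# A+(C+D-B):                                   [-1, 1, -2, -1, 3, 4, 2]
child = crossover([-3, -2, 1, 2, -1, 3, 4], [-1, 1, -2, -1, 3, 4, 2], p_cross=1.0)
print(child)
```

Because a complete sub-expression is replaced by another complete sub-expression, the child is always a legal DFP code, which is the legitimacy property stressed above.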
Step 3: according to a preset mutation probability, mutation processing is performed on elements in the m−n retained feature expressions after random crossover, to obtain m−n feature expressions after random mutation.
That is, for the m−n feature expressions other than the n feature expressions with the highest fitness, each feature expression is mutated according to the mutation probability (pMutation). For example, among the m−n feature expressions a feature expression Xi is randomly selected and a random number p uniformly distributed on [0, 1] is generated; if p < pMutation, Xi is mutated.
In order to guarantee the legitimacy of the feature expression after mutation and the integrity of sub-expressions, mutation can be divided into the following four modes, one of which is randomly selected when a specific mutation is performed:
First mode: a single feature node in the feature expression is replaced with a sub-expression; a single feature node is one data feature or one operator in the feature expression.
In this mutation mode, a single feature node in the feature expression is randomly replaced with a sub-expression, as shown in Fig. 4. Because overly complex features have disadvantages in aspects such as interpretability, the maximum depth of the inserted sub-expression is preferably limited to 2.
Second mode: a sub-expression in the feature expression is reduced to a single feature node.
As shown in Fig. 5, this mutation operation can be regarded as the inverse of the first. To keep the encoding valid, when a sub-expression tree is cut, if the start node has both a left subtree and a right subtree, the present node and its left subtree are always chosen to be cut off.
Third mode: a single feature node in the feature expression is replaced with a randomly generated single feature node.
In this mutation mode, a data feature node is still a data feature node after mutation, and an operator node is still an operator node after mutation, so the number of subtrees of the node does not change. As shown in Fig. 6, the operator "−" is replaced with "/".
Fourth mode: the feature expression is replaced with a randomly generated new feature expression.
Because the first three mutation modes are all changes made on the basis of the m original feature expressions, in order to avoid getting trapped in local optima, with a certain probability a randomly generated new feature expression can replace the selected feature expression.
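As an illustration (not the patent's own code), the choice among the four mutation modes might be sketched as follows in Python, operating on the DFP code lists of the earlier sketches; `random_subexpr`, the code tables and the simplification of applying the first mode only to data-feature leaves are assumptions:

```python
import random

def subtree_span(code, start):                       # same helper as in the crossover sketch
    need, i = 1, start
    while need:
        need += 1 if code[i] < 0 else -1
        i += 1
    return i

def random_subexpr(max_depth, feature_codes=(1, 2, 3, 4), op_codes=(-1, -2, -3, -4)):
    """Assumed helper: generate a small random DFP-coded sub-expression."""
    if max_depth <= 1 or random.random() < 0.5:
        return [random.choice(feature_codes)]
    return ([random.choice(op_codes)]
            + random_subexpr(max_depth - 1)
            + random_subexpr(max_depth - 1))

def mutate(code, p_mutation):
    """Apply one of the four mutation modes with probability p_mutation."""
    if random.random() >= p_mutation:
        return list(code)
    mode = random.randrange(4)
    i = random.randrange(len(code))
    j = subtree_span(code, i)
    if mode == 0 and code[i] > 0:       # mode 1: leaf node -> sub-expression (max depth 2)
        return code[:i] + random_subexpr(2) + code[i + 1:]
    if mode == 1:                       # mode 2: sub-expression -> single feature node
        return code[:i] + [random.choice((1, 2, 3, 4))] + code[j:]
    if mode == 2:                       # mode 3: node -> random node of the same kind
        pool = (-1, -2, -3, -4) if code[i] < 0 else (1, 2, 3, 4)
        return code[:i] + [random.choice(pool)] + code[i + 1:]
    if mode == 3:                       # mode 4: whole expression replaced
        return random_subexpr(4)
    return list(code)                   # mode 1 drawn on an operator node: left unchanged here

print(mutate([-1, 1, -3, 2, 3], p_mutation=1.0))
```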
Step 4: the encoded individuals respectively corresponding to the n feature expressions with the highest fitness and to the m−n feature expressions after random mutation are determined as the m encoded individuals contained in the (N+1)th-generation fitness evaluation tasks.
S110: the master node generates multiple (N+1)th-generation fitness evaluation tasks based on the encoding file, and issues each (N+1)th-generation fitness evaluation task to a different child node; S104 is executed, after which N is incremented by 1 and the process returns to step S105.
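Putting S101 to S110 together, the master node's control flow could be summarized as follows (a compressed, single-process Python sketch under the assumptions of the earlier snippets; in the described system the `evaluate` calls would instead be fitness evaluation tasks dispatched in parallel to different child nodes):

```python
def feature_generation(initial_codes, evaluate, iterate, max_iter, n_best):
    """Skeleton of the iterative process coordinated by the master node.

    evaluate(code) -> fitness   stands in for one fitness evaluation task;
    iterate(evaluated)          stands in for the iteration task and returns
                                the next generation's encoded individuals.
    """
    codes = initial_codes                              # initialization task (S101/S102)
    for generation in range(1, max_iter + 1):          # S103/S110 then S104
        evaluated = [(code, evaluate(code)) for code in codes]   # parallel on child nodes
        if generation == max_iter:                     # S105 -> S106/S107: output task
            evaluated.sort(key=lambda cf: cf[1], reverse=True)
            return evaluated[:n_best]
        codes = iterate(evaluated)                     # S105 -> S108/S109: iteration task
```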
To further illustrate the idea of the embodiments of the present application, a further explanation is given below through a specific example.
As shown in Table 5, there are 50 data samples, each belonging to one of three species of the genus Iris, namely Iris setosa, Iris versicolor and Iris virginica; each sample initially has four data features: sepal length, sepal width, petal length and petal width.
Sepal length A | Sepal width B | Petal length C | Petal width D | Species Y
5.1 | 3.5 | 1.4 | 0.2 | Iris setosa
4.9 | 3 | 1.4 | 0.2 | Iris setosa
4.7 | 3.2 | 1.3 | 0.2 | Iris setosa
6.6 | 2.9 | 4.6 | 1.3 | Iris versicolor
5.2 | 2.7 | 3.9 | 1.4 | Iris versicolor
5 | 2 | 3.5 | 1 | Iris versicolor
6.3 | 2.8 | 5.1 | 1.5 | Iris versicolor
6.1 | 2.6 | 5.6 | 1.4 | Iris versicolor
7.7 | 3 | 6.1 | 2.3 | Iris versicolor
… | … | … | … | Iris versicolor
Table 5
On the front-end interface, the user selects the feature generation algorithm based on depth-first programming (DFP) encoding and selects the Gini coefficient to evaluate fitness. Meanwhile, the user sets the crossover probability pCross and the mutation probability pMutation; for example, pCross = 0.5 and pMutation = 0.6.
The above four data features are denoted {A, B, C, D}, and the operator set is {+, −, ×, /}. In the initialization task, 50 feature expressions are generated at random according to the DFP encoding scheme, each feature expression representing a newly generated feature; for example, the DFP encoding of the newly generated feature expression (A−B)/(C+D) is shown in Table 6:
/ | − | A | B | + | C | D
Table 6
In the fitness evaluation tasks, the respective fitness values Gini(i) of the 50 feature expressions are calculated, where i = {1, 2, 3, …, 49, 50}. In the iteration task, selection, crossover and mutation operations are performed on the feature expressions to realize the search for optimal new features. Specifically, the 5 best feature expressions (i.e. the 5 feature expressions with the highest Gini coefficients) are selected first. Then, with the 5 best feature expressions retained, two feature expressions Xi and Xj are randomly selected from the 50 feature expressions and a random number p uniformly distributed on [0, 1] is generated; if p < pCross, feature crossover is performed on Xi = (A−B) × (C+D) and Xj = A + (C+D−B): as shown in Fig. 3, the left subtree of Xi (i.e. the sub-expression on the left) and the right subtree of Xj (i.e. the sub-expression on the right) are selected and exchanged. This action of randomly selecting two feature expressions is repeated 45 times, each time retaining one feature expression after the random crossover (if crossover took place according to the crossover probability, the new feature expression after crossover is retained; otherwise, a feature expression that did not undergo crossover is retained). Finally, with the 5 best feature expressions still retained, from the 45 feature expressions after the random crossover (each of which may or may not have been crossed according to the probability), a feature expression Xi is randomly selected and a random number p uniformly distributed on [0, 1] is generated; if p < pMutation, Xi is mutated, with one of the four mutation modes described above chosen at random with equal probability. This operation of randomly selecting one feature expression and mutating it is repeated 45 times. The 45 feature expressions that were mutated (or not) according to the probability, plus the 5 best feature expressions, form the 50 feature expressions that continue to be evaluated, and the above process is repeated.
As shown in Table 7, after the newly generated feature expressions are added, the accuracy of model classification is clearly improved compared with using only the original features.
Table 7
Based on the same inventive concept, the embodiments of the present application also provide a feature generation system corresponding to the feature generation method. Since the principle by which the system solves the problem is similar to that of the feature generation method of the embodiments of the present application, the implementation of the system can refer to the implementation of the method, and repeated parts are not described again.
As shown in Fig. 7, the feature generation system provided by an embodiment of the present application includes:
a master node 71, configured to: after receiving the evaluation results sent by the multiple child nodes that execute the Nth-generation fitness evaluation tasks, issue an output task to a selected child node if it determines that N is equal to the maximum number of iterations, and otherwise issue an iteration task to a selected child node; and further configured to generate multiple (N+1)th-generation fitness evaluation tasks based on the encoding file generated by the child node that executes the iteration task, and issue each (N+1)th-generation fitness evaluation task to a different child node, wherein each fitness evaluation task contains one encoded individual;
a child node 72 that executes the output task, configured to determine and output the n feature expressions with the highest fitness based on the evaluation results of the Nth-generation fitness evaluation tasks; the n feature expressions with the highest fitness are the first n feature expressions when ranked from high to low by fitness;
a child node 73 that executes the iteration task, configured to generate, based on the evaluation results of the Nth-generation fitness evaluation tasks, an encoding file containing multiple encoded individuals and send it to the master node, wherein the multiple encoded individuals include the n encoded individuals corresponding to the n feature expressions with the highest fitness as evaluated by the Nth-generation fitness evaluation tasks;
child nodes 74 that execute the fitness evaluation tasks, each configured to perform a fitness calculation on the feature expression indicated by the encoded individual in its allocated fitness evaluation task, and send the calculated fitness to the master node as an evaluation result.
Optionally, the encoded individuals are generated by means of depth-first programming (DFP) encoding, and the child node 73 that executes the iteration task is specifically configured to:
select, based on the evaluation results of the Nth-generation fitness evaluation tasks, the n feature expressions with the highest fitness from the m feature expressions evaluated by the Nth-generation fitness evaluation tasks;
randomly select two feature expressions from the m feature expressions and, according to a preset crossover probability, select one sub-expression from each of the two feature expressions and exchange them, retaining one feature expression after the random crossover; repeat this step m−n times to obtain the m−n retained feature expressions after random crossover;
according to a preset mutation probability, perform mutation processing on elements in the m−n retained feature expressions after random crossover, to obtain m−n feature expressions after random mutation;
determine the encoded individuals respectively corresponding to the n feature expressions with the highest fitness and to the m−n feature expressions after random mutation as the m encoded individuals contained in the (N+1)th-generation fitness evaluation tasks.
Optionally, the child node 73 that executes the iteration task is specifically configured to perform mutation processing on elements in the m−n retained feature expressions after random crossover according to the following steps:
for any feature expression, randomly selecting one of the following processing modes to perform mutation:
replacing a single feature node in the feature expression with a sub-expression, where a single feature node is one data feature or one operator in the feature expression;
reducing a sub-expression in the feature expression to a single feature node;
replacing a single feature node in the feature expression with a randomly generated single feature node;
replacing the feature expression with a randomly generated new feature expression.
Optionally, the child node 73 that executes the iteration task is specifically configured to select the n feature expressions with the highest fitness from the m feature expressions evaluated by the Nth-generation fitness evaluation tasks according to the following steps:
if fitness-identical feature expressions exist among the m feature expressions, rejecting the k redundant feature expressions so that no fitness-identical feature expressions remain among the remaining feature expressions; selecting the n feature expressions with the highest fitness from the remaining feature expressions, and subtracting k from m.
Optionally, the master node 71 is further configured to:
after receiving the feature generation task, obtain from the data server the data file required for executing the feature generation task, and distribute the obtained data file to every cluster computing machine in the cluster system;
and the child node 74 that executes a fitness evaluation task is specifically configured to:
read, from the cluster computing machine on which it is located, the feature data indicated by the encoded individual in its allocated fitness evaluation task, substitute the read feature data into the feature expression corresponding to the encoded individual, and perform the fitness calculation on the feature expression after the feature data are substituted in, by calling the fitness function on the cluster computing machine on which it is located.
Optionally, the master node 71 is further configured to issue, to a selected child node, the initialization task corresponding to the feature generation task received by the master node;
the system further includes a child node 75 that executes the initialization task, configured to randomly generate an encoding file containing multiple initialized encoded individuals by calling the initialization function on the cluster computing machine on which it is located;
and the master node 71 is further configured to generate multiple first-generation fitness evaluation tasks based on the multiple initialized encoded individuals generated by the child node 75 that executes the initialization task, and issue each generated first-generation fitness evaluation task to a different child node.
Optionally, the child node 72 that executes the output task is specifically configured to:
determine the n feature expressions with the highest fitness by calling the evaluation results of the Nth-generation fitness evaluation tasks stored by the master node in the file system, output a feature generation result report that is fed back to the user and indicates the n feature expressions with the highest fitness, and output the feature data corresponding to the n feature expressions with the highest fitness for subsequent calls.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage and the like) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that realizes the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operation steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although having been described for the preferred embodiment of the application, those skilled in the art once know base This creative concept, then can make other change and modification to these embodiments.So, appended right will Ask and be intended to be construed to include preferred embodiment and fall into having altered and changing for the application scope.
Obviously, those skilled in the art can carry out various changes and modification without deviating from this Shen to the application Spirit and scope please.So, if the application these modification and modification belong to the application claim and Within the scope of its equivalent technologies, then the application is also intended to comprising these changes and modification.

Claims (14)

1. A feature generation method, characterized in that the method comprises:
Step A: after receiving the evaluation results sent by a plurality of child nodes that perform Nth-generation fitness evaluation tasks, a master node issues an output task to a selected child node if it is determined that N equals the maximum number of iterations, and otherwise issues an iteration task to a selected child node;
Step B: the child node performing the output task determines and outputs, based on the evaluation results of the Nth-generation fitness evaluation tasks, the n feature expressions with the highest fitness, wherein the n feature expressions with the highest fitness refer to the first n feature expressions after sorting by fitness from high to low;
Step C: the child node performing the iteration task generates, based on the evaluation results of the Nth-generation fitness evaluation tasks, a coding file containing a plurality of coded individuals and sends it to the master node, wherein the plurality of coded individuals include the n coded individuals corresponding to the n feature expressions with the highest fitness obtained from the evaluation of the Nth-generation fitness evaluation tasks;
Step D: the master node generates a plurality of (N+1)th-generation fitness evaluation tasks based on the coding file and delivers each (N+1)th-generation fitness evaluation task to a different child node, wherein each fitness evaluation task contains one coded individual;
Step E: the child node performing a fitness evaluation task performs fitness calculation on the feature expression indicated by the coded individual in the allocated fitness evaluation task and sends the calculated fitness to the master node as an evaluation result; N is incremented by 1, and the method returns to Step A.
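For readers who find the control flow of Steps A-E easier to follow in code, here is a hedged, single-process Python sketch of the overall loop. It deliberately collapses the master/child distribution into ordinary function calls; all names, the toy individuals, and the stand-in tasks are assumptions for illustration only.

```python
import random

def coordinator_loop(initial_individuals, max_iterations, n,
                     evaluate, iterate, output):
    """Steps A-E collapsed into one process: evaluate every individual of the
    current generation, then either run the output task (when N reaches the
    maximum number of iterations) or the iteration task that breeds the next
    generation."""
    individuals = initial_individuals
    for generation in range(1, max_iterations + 1):
        # Step E: one fitness value per coded individual (computed by many
        # child nodes in parallel in the text).
        results = [(ind, evaluate(ind)) for ind in individuals]
        if generation == max_iterations:
            return output(results, n)        # Step B: output task
        individuals = iterate(results, n)    # Steps C and D: next generation

# Toy usage with stand-in tasks: individuals are numbers, fitness is the value.
evaluate = lambda ind: float(ind)
iterate = lambda results, n: [ind + random.randint(0, 2) for ind, _ in
                              sorted(results, key=lambda r: r[1],
                                     reverse=True)[:n]] * 2
output = lambda results, n: sorted(results, key=lambda r: r[1], reverse=True)[:n]
print(coordinator_loop([1, 2, 3, 4], max_iterations=5, n=2,
                       evaluate=evaluate, iterate=iterate, output=output))
```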
2. The method according to claim 1, characterized in that the coded individuals are generated in a depth-first (DFP) encoding manner, and in Step C, the child node performing the iteration task generating, based on the evaluation results of the Nth-generation fitness evaluation tasks, the coding file containing the plurality of coded individuals comprises:
Step C1: the child node performing the iteration task selects, based on the evaluation results of the Nth-generation fitness evaluation tasks, the n feature expressions with the highest fitness from the m feature expressions evaluated by the Nth-generation fitness evaluation tasks;
Step C2: two feature expressions are randomly selected from the m feature expressions; according to a preset crossover probability, one subexpression is selected from each of the two feature expressions for crossover, and one feature expression obtained after the random crossover is retained; this step is repeated m-n times to obtain the m-n retained feature expressions after random crossover;
Step C3: according to a preset mutation probability, mutation processing is performed on elements in the m-n retained feature expressions after random crossover, to obtain m-n feature expressions after random mutation;
Step C4: the coded individuals respectively corresponding to the n feature expressions with the highest fitness and the m-n feature expressions after random mutation are determined as the m coded individuals contained in the (N+1)th-generation fitness evaluation tasks.
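The selection-crossover-mutation flow of Steps C1-C4 can be sketched as follows (illustrative only). The crossover and mutation operators are passed in as callables, and the toy expressions, probabilities, and helper names are assumptions; claim 3 describes the actual mutation manners, which are sketched separately after that claim.

```python
import random

def iteration_task(evaluated, n, crossover, mutate, crossover_p=0.9):
    """Sketch of Steps C1-C4. `evaluated` is the Nth-generation result: a list
    of m (feature_expression, fitness) pairs. Returns the m expressions that
    the coded individuals of generation N+1 would correspond to."""
    m = len(evaluated)
    # C1: the n highest-fitness expressions survive unchanged.
    ranked = sorted(evaluated, key=lambda item: item[1], reverse=True)
    survivors = [expr for expr, _ in ranked[:n]]
    pool = [expr for expr, _ in evaluated]

    offspring = []
    for _ in range(m - n):
        # C2: pick two parents at random; with the preset crossover probability
        # cross one subexpression from each and keep a single crossed result.
        parent_a, parent_b = random.sample(pool, 2)
        child = crossover(parent_a, parent_b) if random.random() < crossover_p else parent_a
        # C3: mutation processing on the retained expression.
        offspring.append(mutate(child))

    # C4: survivors plus mutated offspring are the m individuals of generation N+1.
    return survivors + offspring

# Toy usage: expressions are strings, crossover splices halves, mutation
# occasionally appends a marker. Real operators would act on expression trees.
toy_crossover = lambda a, b: a[:len(a) // 2] + b[len(b) // 2:]
toy_mutate = lambda e: e + "'" if random.random() < 0.1 else e
generation_n = [("income/age", 0.9), ("age+1", 0.2),
                ("log(income)", 0.7), ("age*2", 0.1)]
print(iteration_task(generation_n, n=2, crossover=toy_crossover, mutate=toy_mutate))
```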
3. The method according to claim 2, characterized in that in Step C3, performing mutation processing on the elements in the m-n retained feature expressions after random crossover comprises:
for any one of the m-n feature expressions, randomly selecting one of the following processing manners to perform mutation processing:
replacing a single characteristic node in the feature expression with a subexpression, wherein a single characteristic node refers to one data item or one operator in the feature expression;
reducing a subexpression in the feature expression to a single characteristic node;
replacing a single characteristic node in the feature expression with a randomly generated single characteristic node; or
replacing the feature expression with a randomly generated new feature expression.
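As a hedged illustration of the four mutation manners in claim 3, the sketch below represents feature expressions as nested tuples of the form (operator, left, right) and applies one randomly chosen manner per call. The operator set, the feature names, and the naive choice of which node to mutate are assumptions made for brevity.

```python
import random

OPERATORS = ("add", "sub", "mul", "div")   # assumed binary operator set
FEATURES = ("income", "age", "balance")    # assumed raw feature columns

def random_node():
    """A randomly generated single characteristic node (here: a raw feature)."""
    return random.choice(FEATURES)

def random_expression(depth=2):
    """A small random feature expression: nested (operator, left, right) tuples."""
    if depth == 0 or random.random() < 0.3:
        return random_node()
    return (random.choice(OPERATORS),
            random_expression(depth - 1), random_expression(depth - 1))

def mutate(expr):
    """Apply one of the four mutation manners chosen at random:
    1) replace a single node with a subexpression,
    2) reduce a subexpression to a single node,
    3) replace a single node with another randomly generated node,
    4) replace the whole expression with a new random expression.
    For brevity the mutated position is always a direct child, not a node
    chosen anywhere in the tree."""
    mode = random.randint(1, 4)
    if mode == 4 or not isinstance(expr, tuple):
        return random_expression()                     # manner 4 (or a bare leaf)
    operator, left, right = expr
    if mode == 1:
        return (operator, random_expression(), right)  # node -> subexpression
    if mode == 2:
        return random_node()                           # subexpression -> node
    return (operator, left, random_node())             # node -> new node

print(mutate(("add", "income", ("div", "balance", "age"))))
```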
4. The method according to claim 2 or 3, characterized in that in Step C1, the child node performing the iteration task selecting, based on the evaluation results of the Nth-generation fitness evaluation tasks, the n feature expressions with the highest fitness from the m feature expressions evaluated by the Nth-generation fitness evaluation tasks comprises:
if feature expressions with identical fitness exist among the m feature expressions, rejecting the k redundant feature expressions so that no feature expressions with identical fitness remain among the remaining feature expressions; and
selecting, from the remaining feature expressions, the n feature expressions with the highest fitness, and subtracting k from m in Steps C2 to C4.
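A minimal sketch of the de-duplication in claim 4, under the assumption that evaluation results arrive as (expression, fitness) pairs: expressions sharing an identical fitness are treated as redundant, k of them are dropped, m is reduced by k, and the n fittest survivors are returned.

```python
def select_with_dedup(evaluated, n):
    """If several feature expressions share an identical fitness, keep only one
    of them (the k redundant ones are rejected and m shrinks by k), then return
    the n highest-fitness survivors together with the reduced m."""
    seen_fitness = set()
    deduped = []
    for expression, fit in evaluated:
        if fit in seen_fitness:
            continue                      # redundant expression, rejected
        seen_fitness.add(fit)
        deduped.append((expression, fit))
    k = len(evaluated) - len(deduped)     # number of rejected expressions
    reduced_m = len(evaluated) - k        # m is reduced by k
    top_n = sorted(deduped, key=lambda item: item[1], reverse=True)[:n]
    return top_n, reduced_m

results = [("income/age", 0.9), ("age+1", 0.2), ("2*income/(2*age)", 0.9)]
print(select_with_dedup(results, n=2))
```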
5. The method according to claim 1, characterized in that before Step A, the method further comprises:
after receiving a feature generation task, the master node obtaining, from a data server, the data file required for performing the feature generation task, and transmitting the obtained data file to every cluster computing machine in the cluster system;
and in Step E, the child node performing the fitness evaluation task performing fitness calculation comprises:
the child node performing the fitness evaluation task reading, from the local cluster computing machine, the feature data indicated by the coded individual in the allocated fitness evaluation task, substituting the read feature data into the feature expression corresponding to the coded individual, and performing fitness calculation, by calling the fitness evaluation function on the local cluster computing machine, on the feature expression with the feature data substituted in.
6. The method according to claim 1, characterized in that before Step A, the method further comprises:
the master node issuing, to a selected child node, an initialization task corresponding to the feature generation task received by the master node;
the child node performing the initialization task randomly generating, by calling the initialization function on the local cluster computing machine, a coding file containing a plurality of initialized coded individuals; and
the master node generating a plurality of first-generation fitness evaluation tasks based on the plurality of initialized coded individuals, and delivering each generated first-generation fitness evaluation task to a different child node.
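The initialization task of claim 6 can be pictured as below: each coded individual is written out in depth-first (prefix) order, which is one plausible reading of the DFP coding, and the randomly generated individuals are dumped to a coding file. The operator set, feature names, tree depth, and file name are all hypothetical.

```python
import json
import random

OPERATORS = ("add", "sub", "mul", "div")   # assumed binary operator set
FEATURES = ("income", "age", "balance")    # assumed raw feature columns

def random_individual(depth=2):
    """One initialized coded individual: an expression tree flattened in
    depth-first (prefix) order, e.g. ['div', 'income', 'age']."""
    if depth == 0 or random.random() < 0.3:
        return [random.choice(FEATURES)]               # leaf: a raw feature
    left = random_individual(depth - 1)
    right = random_individual(depth - 1)
    return [random.choice(OPERATORS)] + left + right   # operator, then children

def initialization_task(population_size, path="initial_individuals.json"):
    """Randomly generate `population_size` coded individuals and write them to
    a coding file (a hypothetical JSON file) for the master node to pick up."""
    individuals = [random_individual() for _ in range(population_size)]
    with open(path, "w") as coding_file:
        json.dump(individuals, coding_file)
    return individuals

print(initialization_task(4)[:2])
```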
7. The method according to claim 1, characterized in that in Step B, the child node performing the output task determining and outputting, based on the evaluation results of the Nth-generation fitness evaluation tasks, the n feature expressions with the highest fitness comprises:
the child node performing the output task determining, based on the evaluation results of the Nth-generation fitness evaluation tasks stored in the file system through the master node, the n feature expressions with the highest fitness; outputting and feeding back to the user a feature generation result report indicating the n feature expressions with the highest fitness; and outputting, for subsequent calls, the feature data corresponding to the n feature expressions with the highest fitness.
8. A feature generation system, characterized in that the system comprises:
a master node, configured to: after receiving the evaluation results sent by a plurality of child nodes performing Nth-generation fitness evaluation tasks, issue an output task to a selected child node if it is determined that N equals the maximum number of iterations, and otherwise issue an iteration task to a selected child node; and further configured to generate a plurality of (N+1)th-generation fitness evaluation tasks based on a coding file generated by the child node performing the iteration task, and deliver each (N+1)th-generation fitness evaluation task to a different child node, wherein each fitness evaluation task contains one coded individual;
a child node performing the output task, configured to determine and output, based on the evaluation results of the Nth-generation fitness evaluation tasks, the n feature expressions with the highest fitness, wherein the n feature expressions with the highest fitness refer to the first n feature expressions after sorting by fitness from high to low;
a child node performing the iteration task, configured to generate, based on the evaluation results of the Nth-generation fitness evaluation tasks, the coding file containing a plurality of coded individuals and send it to the master node, wherein the plurality of coded individuals include the n coded individuals corresponding to the n feature expressions with the highest fitness obtained from the evaluation of the Nth-generation fitness evaluation tasks; and
a child node performing a fitness evaluation task, configured to perform fitness calculation on the feature expression indicated by the coded individual in the allocated fitness evaluation task, and send the calculated fitness to the master node as an evaluation result.
9. The system according to claim 8, characterized in that the coded individuals are generated in a depth-first (DFP) encoding manner, and the child node performing the iteration task is specifically configured to:
select, based on the evaluation results of the Nth-generation fitness evaluation tasks, the n feature expressions with the highest fitness from the m feature expressions evaluated by the Nth-generation fitness evaluation tasks;
randomly select two feature expressions from the m feature expressions, select, according to a preset crossover probability, one subexpression from each of the two feature expressions for crossover, and retain one feature expression obtained after the random crossover, this step being repeated m-n times to obtain the m-n retained feature expressions after random crossover;
perform, according to a preset mutation probability, mutation processing on elements in the m-n retained feature expressions after random crossover, to obtain m-n feature expressions after random mutation; and
determine the coded individuals respectively corresponding to the n feature expressions with the highest fitness and the m-n feature expressions after random mutation as the m coded individuals contained in the (N+1)th-generation fitness evaluation tasks.
10. The system according to claim 9, characterized in that the child node performing the iteration task is specifically configured to perform mutation processing on the elements in the m-n retained feature expressions after random crossover according to the following steps:
for any one of the feature expressions, randomly selecting one of the following processing manners to perform mutation processing:
replacing a single characteristic node in the feature expression with a subexpression, wherein a single characteristic node refers to one data item or one operator in the feature expression;
reducing a subexpression in the feature expression to a single characteristic node;
replacing a single characteristic node in the feature expression with a randomly generated single characteristic node; or
replacing the feature expression with a randomly generated new feature expression.
11. The system according to claim 9 or 10, characterized in that the child node performing the iteration task is specifically configured to select the n feature expressions with the highest fitness from the m feature expressions evaluated by the Nth-generation fitness evaluation tasks according to the following steps:
if feature expressions with identical fitness exist among the m feature expressions, rejecting the k redundant feature expressions so that no feature expressions with identical fitness remain among the remaining feature expressions; and selecting, from the remaining feature expressions, the n feature expressions with the highest fitness, and subtracting k from m.
12. The system according to claim 8, characterized in that the master node is further configured to:
after receiving a feature generation task, obtain, from a data server, the data file required for performing the feature generation task, and transmit the obtained data file to every cluster computing machine in the cluster system;
and the child node performing the fitness evaluation task is specifically configured to:
read, from the local cluster computing machine, the feature data indicated by the coded individual in the allocated fitness evaluation task, substitute the read feature data into the feature expression corresponding to the coded individual, and perform fitness calculation, by calling the fitness function on the local cluster computing machine, on the feature expression with the feature data substituted in.
13. The system according to claim 8, characterized in that the master node is further configured to: issue, to a selected child node, an initialization task corresponding to the feature generation task received by the master node;
the system further comprises: a child node performing the initialization task, configured to randomly generate, by calling the initialization function on the local cluster computing machine, a coding file containing a plurality of initialized coded individuals; and
the master node is further configured to: generate a plurality of first-generation fitness evaluation tasks based on the plurality of initialized coded individuals generated by the child node performing the initialization task, and deliver each generated first-generation fitness evaluation task to a different child node.
14. The system according to claim 8, characterized in that the child node performing the output task is specifically configured to:
determine, based on the evaluation results of the Nth-generation fitness evaluation tasks stored in the file system through the master node, the n feature expressions with the highest fitness; output and feed back to the user a feature generation result report indicating the n feature expressions with the highest fitness; and output, for subsequent calls, the feature data corresponding to the n feature expressions with the highest fitness.
CN201510784474.4A 2015-11-16 2015-11-16 Feature generation method and system Active CN106708609B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510784474.4A CN106708609B (en) 2015-11-16 2015-11-16 Feature generation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510784474.4A CN106708609B (en) 2015-11-16 2015-11-16 Feature generation method and system

Publications (2)

Publication Number Publication Date
CN106708609A (en) 2017-05-24
CN106708609B CN106708609B (en) 2020-06-26

Family

ID=58931599

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510784474.4A Active CN106708609B (en) 2015-11-16 2015-11-16 Feature generation method and system

Country Status (1)

Country Link
CN (1) CN106708609B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4961152A (en) * 1988-06-10 1990-10-02 Bolt Beranek And Newman Inc. Adaptive computing system
US6182057B1 (en) * 1996-12-12 2001-01-30 Fujitsu Limited Device, method, and program storage medium for executing genetic algorithm
CN101419610A (en) * 2007-10-22 2009-04-29 索尼株式会社 Information processing device, information processing method, and program
CN101464922A (en) * 2009-01-22 2009-06-24 中国人民解放军国防科学技术大学 Computer architecture scheme parallel simulation optimization method based on cluster system
CN103336790A (en) * 2013-06-06 2013-10-02 湖州师范学院 Hadoop-based fast neighborhood rough set attribute reduction method
CN104798043A (en) * 2014-06-27 2015-07-22 华为技术有限公司 Data processing method and computer system
CN104239144A (en) * 2014-09-22 2014-12-24 珠海许继芝电网自动化有限公司 Multilevel distributed task processing system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Gao Jiaquan, He Guixia: "A Survey of Research on Parallel Genetic Algorithms", Journal of Zhejiang University of Technology *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109905340A (en) * 2019-03-11 2019-06-18 北京邮电大学 A kind of characteristic optimization function choosing method, device and electronic equipment
CN111178656A (en) * 2019-07-31 2020-05-19 腾讯科技(深圳)有限公司 Credit model training method, credit scoring device and electronic equipment
CN112380215A (en) * 2020-11-17 2021-02-19 北京融七牛信息技术有限公司 Automatic feature generation method based on cross aggregation
CN114356422A (en) * 2022-03-21 2022-04-15 四川新迎顺信息技术股份有限公司 Graph calculation method, device and equipment based on big data and readable storage medium

Also Published As

Publication number Publication date
CN106708609B (en) 2020-06-26

Similar Documents

Publication Publication Date Title
CN106897810B (en) Business processing method and system, workflow engine and system and business system
CN106708609A (en) Characteristics generation method and system
CN108228166A (en) A kind of back-end code generation method and system based on template
CN105677763B (en) A kind of image quality measure system based on Hadoop
CN107330641A (en) A kind of real-time risk control system of financial derivatives based on Storm stream process framework and regulation engine and method
EP3076310B1 (en) Variable virtual split dictionary for search optimization
CN104317942A (en) Massive data comparison method and system based on hadoop cloud platform
CN106202092A (en) The method and system that data process
CN103392169A (en) Sorting
CN110245091A (en) A kind of method, apparatus and computer storage medium of memory management
CN109189393A (en) Method for processing business and device
CN110347888A (en) Processing method, device and the storage medium of order data
CN111179016A (en) Electricity sales package recommendation method, equipment and storage medium
CN103577455A (en) Data processing method and system for database aggregating operation
CN110413539A (en) A kind of data processing method and device
CN108629375A (en) Power customer sorting technique, system, terminal and computer readable storage medium
CN106708875B (en) Feature screening method and system
CN104156505B (en) A kind of Hadoop cluster job scheduling method and devices based on user behavior analysis
CN105005501B (en) A kind of second order optimizing and scheduling task method towards cloud data center
KR20140076010A (en) A system for simultaneous and parallel processing of many twig pattern queries for massive XML data and method thereof
CN115018624A (en) Decision engine and method based on wind control strategy
CN114385921B (en) Bidding recommendation method, system, equipment and storage medium
CN109558403A (en) Data aggregation method and device, computer installation and computer readable storage medium
CN102651755B (en) Service discovering method and system based on multi-feature matching
CN113065734A (en) Index system-based decision tree construction method, system, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20200921

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Patentee after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Patentee before: Advanced innovation technology Co.,Ltd.

Effective date of registration: 20200921

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Patentee after: Advanced innovation technology Co.,Ltd.

Address before: P.O. Box 847, Fourth Floor, Capital Building, Grand Cayman, Cayman Islands

Patentee before: Alibaba Group Holding Ltd.