Summary of the invention
The technical problem to be solved by the present invention is how to optimize feature engineering so as to improve model accuracy.
To this end, according to a first aspect, an embodiment of the invention discloses a method for determining feature engineering variables, comprising:
obtaining a feature variable data set for machine learning, the feature variable data set comprising feature variables of multiple different types; deriving the feature variables of each type according to preset rules to obtain an expanded data set containing the augmented feature variables; and screening feature variables from the expanded data set and combining them to obtain a feature variable combination for machine learning.
Optionally, deriving the feature variables of each type according to preset rules to obtain the expanded data set containing the augmented feature variables comprises: extracting, according to the type of a selected feature variable, the derivation rule that matches that type; and deriving the selected feature variable according to the matching derivation rule to obtain multiple feature variables expanded from the selected feature variable, which serve as the augmented feature variables in the expanded data set.
Optionally, screening feature variables from the expanded data set and combining them to obtain the feature variable combination for machine learning comprises: querying the feature combinations of the feature variables in the expanded data set; and determining, from the different feature combinations, a preferred feature combination as the feature variable combination for machine learning.
Optionally, determining, from the different feature combinations, the preferred feature combination as the feature variable combination for machine learning comprises: obtaining an importance index that characterizes feature variable combinations for machine learning; and selecting, from the different feature combinations, the feature combination associated with the importance index as the feature variable combination for machine learning.
Optionally, between obtaining the feature variable data set for machine learning and deriving the feature variables of each type according to preset rules to obtain the expanded data set containing the augmented feature variables, the method further comprises: preprocessing the feature variables of different types to obtain the feature variable data set.
According to a second aspect, an embodiment of the invention discloses a device for determining feature engineering variables, comprising:
a data acquisition module, configured to obtain a feature variable data set for machine learning, the feature variable data set comprising feature variables of multiple different types; a feature derivation module, configured to derive the feature variables of each type according to preset rules to obtain an expanded data set containing the augmented feature variables; and a feature screening module, configured to screen feature variables from the expanded data set and combine them to obtain a feature variable combination for machine learning.
Optionally, the feature derivation module comprises: a rule extraction unit, configured to extract, according to the type of a selected feature variable, the derivation rule that matches that type; and a feature expansion unit, configured to derive the selected feature variable according to the matching derivation rule to obtain multiple feature variables expanded from the selected feature variable, which serve as the augmented feature variables in the expanded data set.
Optionally, the device further comprises: a combination query module, configured to query the feature combinations of the feature variables in the expanded data set; and a combination determination module, configured to determine, from the different feature combinations, a preferred feature combination as the feature variable combination for machine learning.
Optionally, the combination determination module comprises: an index acquisition unit, configured to obtain an importance index that characterizes feature variable combinations for machine learning; and a combination selection unit, configured to select, from the different feature combinations, the feature combination associated with the importance index as the feature variable combination for machine learning.
Optionally, the device further comprises: a preprocessing module, configured to preprocess the feature variables of different types to obtain the feature variable data set.
According to a third aspect, an embodiment of the invention discloses a computer device comprising a processor, the processor being configured to execute a computer program stored in a memory to implement the following method:
obtaining a feature variable data set for machine learning, the feature variable data set comprising feature variables of multiple different types; deriving the feature variables of each type according to preset rules to obtain an expanded data set containing the augmented feature variables; and screening feature variables from the expanded data set and combining them to obtain a feature variable combination for machine learning.
According to a fourth aspect, an embodiment of the invention discloses a computer-readable storage medium on which a computer program is stored, a processor being configured to execute the computer program stored in the storage medium to implement the following method:
obtaining a feature variable data set for machine learning, the feature variable data set comprising feature variables of multiple different types; deriving the feature variables of each type according to preset rules to obtain an expanded data set containing the augmented feature variables; and screening feature variables from the expanded data set and combining them to obtain a feature variable combination for machine learning.
The technical solution of the present invention has the following advantages:
With the method and device for determining feature engineering variables provided by the embodiments of the invention, after the feature variable data set for machine learning is obtained, the feature variables of each type are derived according to preset rules to obtain an expanded data set containing the augmented feature variables. This enhances the interpretability of the feature variables and enriches the information they carry. Consequently, the feature variable combination obtained by screening and combining feature variables from the expanded data set can contain richer and more interpretable information, which optimizes the feature variables used for machine learning, so that high-quality feature combinations can be used in later modeling to improve model accuracy.
In addition, because the feature variables of each type are derived according to preset rules and feature variables are then screened from the expanded data set and combined, the derivation, screening, and combination of feature variables can be performed automatically. Feature engineering can therefore be automated and its efficiency improved.
Detailed description of the embodiments
The technical solution of the present invention is described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, rather than all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the invention without creative effort shall fall within the protection scope of the invention.
In the description of the present invention, it should be noted that orientation or positional relationships indicated by terms such as "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", and "outer" are based on the orientation or positional relationships shown in the drawings. They are used only to facilitate and simplify the description of the invention, and do not indicate or imply that the device or element referred to must have a particular orientation or be constructed and operated in a particular orientation; they therefore should not be construed as limiting the invention. In addition, the terms "first", "second", and "third" are used for descriptive purposes only and should not be understood as indicating or implying relative importance.
In the description of the present invention, it should be noted that, unless otherwise expressly specified and limited, the terms "mounted", "connected", and "coupled" shall be understood in a broad sense. For example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct or indirect through an intermediary; it may be an internal connection between two elements; and it may be wireless or wired. A person of ordinary skill in the art can understand the specific meaning of these terms in the present invention according to the specific circumstances.
In addition, the technical features involved in the different embodiments of the invention described below may be combined with one another as long as they do not conflict.
In order to optimize feature engineering and improve its efficiency, this embodiment discloses a method for determining feature engineering variables. Referring to FIG. 1, which is a flowchart of the method disclosed in this embodiment, the method comprises:
Step S100: obtain a feature variable data set for machine learning. In a specific embodiment, the feature variables may be obtained from a data front end such as a mobile terminal or a PC, or the feature variable data set may be obtained from external storage. In this embodiment, the feature variable data set comprises feature variables of multiple different types; specifically, a feature variable may be a character-type variable, a time-type variable, a categorical variable, or a numeric variable. In a specific embodiment, after the feature variable data set is obtained, the data type of each feature variable may be identified automatically so that type-specific data processing can be applied. In a specific implementation, an existing recognition method may be used to identify the data type of a feature variable.
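The automatic type identification mentioned in step S100 is left to existing methods. As a minimal sketch only, the following shows one way a column could be assigned to one of the four types; the function name `infer_type`, the date format, and the `max_categories` cap are all assumptions made for illustration and are not part of the disclosure.

```python
from datetime import datetime


def infer_type(values, max_categories=10):
    """Guess 'numeric', 'time', 'categorical' or 'character' for one column.

    Hypothetical helper: the date format and the category cap are assumptions.
    """
    present = [v for v in values if v is not None and v != ""]
    if not present:
        return "character"

    def all_parse(parser):
        # True only if every present value is accepted by the parser.
        try:
            for v in present:
                parser(v)
            return True
        except (TypeError, ValueError):
            return False

    if all_parse(float):
        return "numeric"
    if all_parse(lambda v: datetime.strptime(str(v), "%Y-%m-%d")):
        return "time"
    distinct = set(present)
    # Repeated values drawn from a small set are treated as categories.
    if len(distinct) <= max_categories and len(distinct) < len(present):
        return "categorical"
    return "character"
```

A production system would more likely rely on the type inference of its data-loading library; the point here is only that each column ends up tagged with a type that later steps can dispatch on.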
Step S200: derive the feature variables of each type according to preset rules to obtain an expanded data set containing the augmented feature variables. In this embodiment, new feature variables are generated by transforming the original feature variables; these new feature variables are the augmented feature variables obtained by deriving the original ones, and they enhance the interpretability of the feature variables. In a specific embodiment, a different derivation approach is used for each type of feature variable to obtain the augmented feature variables of that type. Specifically, deriving the feature variables of each type to obtain the expanded data set containing the augmented feature variables comprises: extracting, according to the type of a selected feature variable, the derivation rule that matches that type; and deriving the selected feature variable according to the matching derivation rule to obtain multiple feature variables expanded from the selected feature variable, which serve as the augmented feature variables in the expanded data set. As examples: for a time-type feature variable, augmented feature variables may be derived using a time-splitting derivation rule; for a continuous feature variable, augmented feature variables may be derived using statistical methods; and for a numeric feature variable, new feature variables may be derived using, for example, the mean, the median, the growth rate, and the four arithmetic operations (addition, subtraction, multiplication, and division).
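The type-matched derivation of step S200 can be pictured as a lookup table from type to rule. The sketch below is illustrative only: the rule functions, the derived column names, the date format, and the `DERIVATION_RULES` registry are assumptions, chosen to mirror the examples given above (time splitting; mean, median, and growth rate for numeric variables).

```python
from datetime import datetime
from statistics import mean, median


def derive_time(name, values):
    """Time-splitting rule: break each date into year, month and weekday."""
    parsed = [datetime.strptime(v, "%Y-%m-%d") for v in values]
    return {
        f"{name}_year": [d.year for d in parsed],
        f"{name}_month": [d.month for d in parsed],
        f"{name}_weekday": [d.weekday() for d in parsed],
    }


def derive_numeric(name, values):
    """Numeric rule: offsets from the mean and median, plus growth rate."""
    avg, med = mean(values), median(values)
    # Growth rate relative to the previous row; undefined for the first row
    # and whenever the previous value is zero.
    growth = [None] + [
        (cur - prev) / prev if prev else None
        for prev, cur in zip(values, values[1:])
    ]
    return {
        f"{name}_minus_mean": [v - avg for v in values],
        f"{name}_minus_median": [v - med for v in values],
        f"{name}_growth_rate": growth,
    }


# Hypothetical registry: feature type -> matching derivation rule.
DERIVATION_RULES = {"time": derive_time, "numeric": derive_numeric}


def expand(dataset, types):
    """Return the original columns plus every derived column."""
    expanded = dict(dataset)
    for name, values in dataset.items():
        rule = DERIVATION_RULES.get(types[name])
        if rule is not None:
            expanded.update(rule(name, values))
    return expanded
```

Keeping the rules in a registry means a new type or a new rule can be added without touching `expand`, which is one plausible reading of "extracting the derivation rule that matches the type".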
Step S300: screen feature variables from the expanded data set and combine them to obtain a feature variable combination for machine learning. In this embodiment, feature variables are screened automatically according to the attributes of the feature variables required by machine learning, and the screened feature variables are then combined to obtain the feature variable combination. Because the feature variables are screened according to the attributes required by machine learning, the screened feature variables fit the needs of machine learning more closely and are more targeted, which can improve model accuracy.
In an optional embodiment, when step S300 is executed, screening feature variables from the expanded data set and combining them to obtain the feature variable combination for machine learning comprises: querying the feature combinations of the feature variables in the expanded data set; and determining, from the different feature combinations, a preferred feature combination as the feature variable combination for machine learning. In a specific implementation, all possible feature variable combinations may be generated by exhaustive enumeration, for example by enumerating every combination of the variables; for n feature variables, 2^n - 1 feature variable combinations (all non-empty subsets) can be produced. Alternatively, feature variables may be combined by setting a correlation coefficient threshold: for example, when the correlation coefficient of two feature variables exceeds the set threshold, the two feature variables are strongly associated and may be combined. It should be noted that, in a specific embodiment, feature variables may be combined in pairs or in groups of more than two. When determining, from the different feature combinations, the preferred feature combination as the feature variable combination for machine learning, an importance index that characterizes feature variable combinations for machine learning may be obtained, and the feature combination associated with the importance index may be selected from the different feature combinations as the feature variable combination for machine learning. As an example, when linear analysis is performed in machine learning, the linear coefficient K is a relatively important index, so a feature combination related to the linear coefficient K may be selected as the feature variable combination for machine learning. When selecting the feature combination associated with the importance index, algorithms such as simulated annealing or a genetic algorithm may be used to screen out the feature combination that improves the machine learning model the most.
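The two ways of forming combinations described above, exhaustive enumeration and correlation thresholding, can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the Pearson coefficient is assumed as the correlation measure, the absolute value of the coefficient is compared with the threshold, and `score` stands in for whatever importance index is used. Exhaustive search replaces the simulated annealing or genetic algorithm mentioned in the text and is only practical for small n.

```python
from itertools import combinations
from math import sqrt


def all_combinations(features):
    """Yield every non-empty subset of the features: 2**n - 1 in total."""
    for size in range(1, len(features) + 1):
        yield from combinations(features, size)


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def correlated_pairs(columns, threshold):
    """Pairs whose absolute correlation exceeds the threshold (assumption:
    strong negative correlation also counts as strong association)."""
    return [
        (a, b)
        for a, b in combinations(columns, 2)
        if abs(pearson(columns[a], columns[b])) > threshold
    ]


def best_combination(features, score):
    """Exhaustively pick the combination with the highest importance score.

    `score` is a caller-supplied stand-in for the importance index.
    """
    return max(all_combinations(features), key=score)
```

Because the number of subsets doubles with every added variable, a heuristic search such as the algorithms named in the text becomes necessary once the expanded data set holds more than a few dozen variables.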
In order to improve the efficiency and accuracy of data processing, in an optional embodiment, the following step may also be performed between step S100 and step S200:
Step S400: preprocess the feature variables of different types to obtain the feature variable data set. In this embodiment, each feature variable may be preprocessed according to its specific type. To aid understanding by those skilled in the art, the preprocessing of numeric, categorical, time-type, and character-type feature variables is described separately below.
(1) Numeric feature variable data. Preprocessing such as nondimensionalization, missing-value correction, and discretization may be performed automatically, where: nondimensionalization converts data on different scales to the same scale, for example by standardization, interval scaling, or regularization; missing-value correction may consist of deleting missing values, imputing data, ignoring missing values, and so on; and discretization may be feature binning, feature binarization, and so on. Feature binning cuts a variable into categorical levels according to its numeric values. Feature binarization converts numeric data into a Boolean attribute: a threshold is set, values greater than the threshold are assigned 1, and values less than or equal to the threshold are assigned 0.
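As a rough illustration of the numeric preprocessing just listed, the sketch below implements one variant of each operation on plain Python lists. The function names are hypothetical, and the specific choices (mean imputation, population standard deviation, caller-supplied bin edges) are assumptions rather than anything prescribed by the text.

```python
from statistics import mean, pstdev


def fill_missing(values):
    """Missing-value correction by imputing the mean of the present values."""
    present = [v for v in values if v is not None]
    fill = mean(present)
    return [fill if v is None else v for v in values]


def standardize(values):
    """Standardization: zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


def min_max_scale(values):
    """Interval scaling onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]


def bin_feature(values, edges):
    """Feature binning: the bin index is the number of edges the value reaches."""
    return [sum(v >= e for e in edges) for v in values]


def binarize(values, threshold):
    """Feature binarization: 1 above the threshold, 0 at or below it."""
    return [1 if v > threshold else 0 for v in values]
```

The binarization rule matches the text exactly (strictly greater than the threshold maps to 1); the other functions are interchangeable with any equivalent scaling or imputation scheme.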
(2) Categorical feature variable data. Preprocessing such as category merging and numeric encoding may be performed automatically, where: category merging may be binning, binarization, and so on, that is, merging a high-dimensional categorical variable into a low-dimensional variable; and numeric encoding may be dummy-variable encoding, one-hot encoding, and so on. Dummy-variable encoding uses an N-bit state register to encode N possible values, where N is a positive integer; each state is represented by its own register bit, and only one bit is active at any time. Suppose the values of an attribute form a non-numeric discrete set [discrete value 1, discrete value 2, ..., discrete value m], where m is a positive integer; the attribute is then encoded as an m-tuple in which exactly one component is 1 and all the others are 0. One-hot encoding likewise uses an N-bit state register to encode N states, each state having its own register bit, with only one bit active at any time, thereby encoding the categories numerically.
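The m-tuple encoding described above can be sketched in a few lines. The function name `one_hot` and the choice to order categories alphabetically are assumptions for illustration; any fixed ordering of the discrete values would serve.

```python
def one_hot(values):
    """Encode each value as a tuple with exactly one component set to 1.

    Returns the ordered category list together with the encoded rows, so the
    position of each 1 can be mapped back to its category.
    """
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        encoded.append(tuple(row))
    return categories, encoded
```

For example, with the values "red", "green", "red", "blue", the categories are ordered blue, green, red, so "red" is encoded as (0, 0, 1).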
(3) Time-type feature variable data. Conversion and splitting may be performed: a time-type feature variable may be converted into the number of days from a base date (or into hours, minutes, or another time unit), or it may be split into different time periods. It should be noted that, once time-type feature variables have been converted into intervals from a base time, they become numeric feature variables, so the converted time data may then be preprocessed in the same way as the numeric feature variable data described above; the details are not repeated here.
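Both operations on time-type variables, conversion to an interval from a base date and splitting into time periods, can be sketched as follows. The function names, the timestamp format, and the four period boundaries are assumptions chosen for illustration.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"  # assumed timestamp format


def days_since(values, base):
    """Whole days elapsed between the base date and each timestamp."""
    base_dt = datetime.strptime(base, FMT)
    return [(datetime.strptime(v, FMT) - base_dt).days for v in values]


def time_segment(values):
    """Split each timestamp into a coarse period of the day.

    The boundaries (6, 12 and 18 o'clock) are illustrative assumptions.
    """
    def segment(hour):
        if hour < 6:
            return "night"
        if hour < 12:
            return "morning"
        if hour < 18:
            return "afternoon"
        return "evening"

    return [segment(datetime.strptime(v, FMT).hour) for v in values]
```

The output of `days_since` is an ordinary numeric column and can be passed through the numeric preprocessing described in item (1), while the output of `time_segment` is categorical and can be encoded as in item (2).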
(4) Character-type feature variable data. Character data may be classified and split; specifically, operations such as word segmentation and semantic splitting may be performed.
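Word segmentation in its simplest form can be sketched as below. This assumes text in which words are delimited by spaces or punctuation; the function name `split_words` is hypothetical, and text in languages without such delimiters would need a dedicated segmenter instead.

```python
import re


def split_words(text):
    """Lower-case the text and return its runs of word characters."""
    return re.findall(r"\w+", text.lower())
```

Semantic splitting, which groups or separates text by meaning, is beyond what a regular expression can do and is not attempted here.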
This embodiment also discloses a device for determining feature engineering variables. Referring to FIG. 2, which is a schematic structural diagram of the device disclosed in this embodiment, the device comprises a data acquisition module 100, a feature derivation module 200, and a feature screening module 300, wherein:
the data acquisition module 100 is configured to obtain a feature variable data set for machine learning, the feature variable data set comprising feature variables of multiple different types; the feature derivation module 200 is configured to derive the feature variables of each type according to preset rules to obtain an expanded data set containing the augmented feature variables; and the feature screening module 300 is configured to screen feature variables from the expanded data set and combine them to obtain a feature variable combination for machine learning.
In an optional embodiment, the feature derivation module 200 comprises: a rule extraction unit, configured to extract, according to the type of a selected feature variable, the derivation rule that matches that type; and a feature expansion unit, configured to derive the selected feature variable according to the matching derivation rule to obtain multiple feature variables expanded from the selected feature variable, which serve as the augmented feature variables in the expanded data set.
In an optional embodiment, the device further comprises: a combination query module, configured to query the feature combinations of the feature variables in the expanded data set; and a combination determination module, configured to determine, from the different feature combinations, a preferred feature combination as the feature variable combination for machine learning.
In an optional embodiment, the combination determination module comprises: an index acquisition unit, configured to obtain an importance index that characterizes feature variable combinations for machine learning; and a combination selection unit, configured to select, from the different feature combinations, the feature combination associated with the importance index as the feature variable combination for machine learning.
In an optional embodiment, the device further comprises a preprocessing module 400, configured to preprocess the feature variables of different types to obtain the feature variable data set.
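The way the modules fit together can be pictured as a small pipeline object. The sketch below is only one possible arrangement: the class name `FeatureVariablePipeline`, its constructor arguments, and the toy callables passed in at the end are all hypothetical, and each module is reduced to a plain function.

```python
class FeatureVariablePipeline:
    """Chains the modules in the order described above (hypothetical class)."""

    def __init__(self, acquire, derive, select, preprocess=None):
        self.acquire = acquire        # data acquisition module 100
        self.preprocess = preprocess  # optional preprocessing module 400
        self.derive = derive          # feature derivation module 200
        self.select = select          # feature screening module 300

    def run(self, source):
        data = self.acquire(source)
        if self.preprocess is not None:
            data = self.preprocess(data)
        expanded = self.derive(data)
        return self.select(expanded)


# Toy stand-ins for the four modules, for illustration only.
pipeline = FeatureVariablePipeline(
    acquire=lambda source: dict(source),
    preprocess=lambda data: {k: [v or 0 for v in vs] for k, vs in data.items()},
    derive=lambda data: {
        **data,
        **{f"{k}_squared": [v * v for v in vs] for k, vs in data.items()},
    },
    select=lambda expanded: sorted(expanded)[:2],
)
result = pipeline.run({"x": [1, None, 3]})
```

Passing the modules in as callables keeps the preprocessing step optional, as it is in the description, and lets each module be replaced independently.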
In addition, this embodiment also discloses a computer device comprising a processor, the processor being configured to execute a computer program stored in a memory to implement the following method:
obtaining a feature variable data set for machine learning, the feature variable data set comprising feature variables of multiple different types; deriving the feature variables of each type according to preset rules to obtain an expanded data set containing the augmented feature variables; and screening feature variables from the expanded data set and combining them to obtain a feature variable combination for machine learning.
In an optional embodiment, deriving the feature variables of each type according to preset rules to obtain the expanded data set containing the augmented feature variables comprises: extracting, according to the type of a selected feature variable, the derivation rule that matches that type; and deriving the selected feature variable according to the matching derivation rule to obtain multiple feature variables expanded from the selected feature variable, which serve as the augmented feature variables in the expanded data set.
In an optional embodiment, screening feature variables from the expanded data set and combining them to obtain the feature variable combination for machine learning comprises: querying the feature combinations of the feature variables in the expanded data set; and determining, from the different feature combinations, a preferred feature combination as the feature variable combination for machine learning.
In an optional embodiment, determining, from the different feature combinations, the preferred feature combination as the feature variable combination for machine learning comprises: obtaining an importance index that characterizes feature variable combinations for machine learning; and selecting, from the different feature combinations, the feature combination associated with the importance index as the feature variable combination for machine learning.
In an optional embodiment, between obtaining the feature variable data set for machine learning and deriving the feature variables of each type according to preset rules to obtain the expanded data set containing the augmented feature variables, the method further comprises: preprocessing the feature variables of different types to obtain the feature variable data set.
Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments may be completed by a computer program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like. A computer processor is configured to execute the computer program stored in the storage medium to implement the following method:
obtaining a feature variable data set for machine learning, the feature variable data set comprising feature variables of multiple different types; deriving the feature variables of each type according to preset rules to obtain an expanded data set containing the augmented feature variables; and screening feature variables from the expanded data set and combining them to obtain a feature variable combination for machine learning.
In an optional embodiment, deriving the feature variables of each type according to preset rules to obtain the expanded data set containing the augmented feature variables comprises: extracting, according to the type of a selected feature variable, the derivation rule that matches that type; and deriving the selected feature variable according to the matching derivation rule to obtain multiple feature variables expanded from the selected feature variable, which serve as the augmented feature variables in the expanded data set.
In an optional embodiment, screening feature variables from the expanded data set and combining them to obtain the feature variable combination for machine learning comprises: querying the feature combinations of the feature variables in the expanded data set; and determining, from the different feature combinations, a preferred feature combination as the feature variable combination for machine learning.
In an optional embodiment, determining, from the different feature combinations, the preferred feature combination as the feature variable combination for machine learning comprises: obtaining an importance index that characterizes feature variable combinations for machine learning; and selecting, from the different feature combinations, the feature combination associated with the importance index as the feature variable combination for machine learning.
In an optional embodiment, between obtaining the feature variable data set for machine learning and deriving the feature variables of each type according to preset rules to obtain the expanded data set containing the augmented feature variables, the method further comprises: preprocessing the feature variables of different types to obtain the feature variable data set.
With the method and device for determining feature engineering variables provided by this embodiment, after the feature variable data set for machine learning is obtained, the feature variables of each type are derived according to preset rules to obtain an expanded data set containing the augmented feature variables. This enhances the interpretability of the feature variables and enriches the information they carry. Consequently, the feature variable combination obtained by screening and combining feature variables from the expanded data set can contain richer and more interpretable information, which optimizes the feature variables used for machine learning, so that high-quality feature combinations can be used in later modeling to improve model accuracy.
In addition, because the feature variables of each type are derived according to preset rules and feature variables are then screened from the expanded data set and combined, the derivation, screening, and combination of feature variables can be performed automatically. Feature engineering can therefore be automated and its efficiency improved.
Obviously, the above embodiments are merely examples given for clarity of description and do not limit the possible implementations. A person of ordinary skill in the art may make other changes or variations in different forms on the basis of the above description. It is neither necessary nor possible to enumerate all implementations here. Any obvious changes or variations derived from the above description remain within the protection scope of the invention.