CN108875815A - Method and device for determining feature engineering variables - Google Patents

Info

Publication number
CN108875815A
Authority
CN
China
Prior art keywords
variable
characteristic variable
feature
characteristic
machine learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810564705.4A
Other languages
Chinese (zh)
Inventor
徐靖然
姜凤英
罗晓生
林庆治
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Research Credit Micro Loan Co Ltd
Original Assignee
Shenzhen Research Credit Micro Loan Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Research Credit Micro Loan Co Ltd
Priority to CN201810564705.4A
Publication of CN108875815A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

A method and device for determining feature engineering variables. The method includes: obtaining a feature variable data set for machine learning, the data set containing multiple different types of feature variables; deriving each type of feature variable according to preset rules to obtain an expanded data set containing the augmented feature variables; and screening feature variables from the expanded data set and combining them to obtain a feature variable combination for machine learning. Derivation enhances the interpretability of the feature variables and enriches the information they carry, so the feature variable combination screened from the expanded data set can contain richer and more explanatory information. The feature variables used for machine learning are thereby optimized, and later modeling can use a good feature combination to improve the accuracy of the model.

Description

Method and device for determining feature engineering variables
Technical field
The present invention relates to the technical field of data processing, and in particular to a method and device for determining feature engineering variables.
Background art
As the ability to collect, store, transmit, and process data has grown rapidly, all industries have accumulated large amounts of data that need to be analyzed effectively. Machine learning meets this pressing need of the big data era and is widely used for data processing and analysis across industries. Feature engineering is an important part of machine learning, and its quality affects the performance of the model. It is the process of using domain knowledge about the data to create features that let a machine learning algorithm reach its best performance.
In other words, feature engineering transforms raw data into feature variables that can be used for machine learning. These feature variables describe the information and characteristics of the raw data from many dimensions and angles, and a model built on them generalizes well, that is, its performance on unseen data is optimal or close to optimal. In the prior art, feature engineering generally relies on the professional skill of the modeler: the data is identified manually, a preprocessing method is selected manually according to the model's algorithm and application scenario, features are derived according to the modeler's expertise, and filter-type feature selection is applied to the derived features. The whole process requires extensive manual intervention to set or select thresholds. The screening of feature engineering variables therefore depends heavily on manual work and is prone to error. Moreover, as data volumes grow, traditional feature engineering takes longer and longer to compute and places ever higher demands on the modeler's experience and ability, which seriously limits modeling efficiency and the benefit that large data volumes can bring to model accuracy.
Therefore, how to optimize feature engineering to improve model accuracy is a technical problem to be solved urgently, and how to improve the efficiency of feature engineering is a second one.
Summary of the invention
The technical problem to be solved by the present invention is how to optimize feature engineering to improve the accuracy of the model.
To this end, according to a first aspect, an embodiment of the present invention discloses a method for determining feature engineering variables, including:
obtaining a feature variable data set for machine learning, the feature variable data set containing multiple different types of feature variables; deriving each type of feature variable according to preset rules to obtain an expanded data set containing the augmented feature variables; and screening feature variables from the expanded data set and combining them to obtain a feature variable combination for machine learning.
Optionally, deriving each type of feature variable according to preset rules to obtain the expanded data set includes: extracting, according to the type of a selected feature variable, the derivation rule that matches that type; and deriving the selected feature variable according to the matching derivation rule to obtain multiple expanded feature variables, which serve as the augmented feature variables in the expanded data set.
Optionally, screening feature variables from the expanded data set and combining them to obtain the feature variable combination for machine learning includes: querying the feature combinations of the feature variables in the expanded data set; and determining a preferred feature combination from the different feature combinations as the feature variable combination for machine learning.
Optionally, determining a preferred feature combination from the different feature combinations as the feature variable combination for machine learning includes: obtaining an importance index that characterizes feature variable combinations for machine learning; and selecting, from the different feature combinations, the feature combination associated with the importance index as the feature variable combination for machine learning.
Optionally, between obtaining the feature variable data set for machine learning and deriving each type of feature variable according to preset rules to obtain the expanded data set, the method further includes: preprocessing the different types of feature variables to obtain the feature variable data set.
According to a second aspect, an embodiment of the present invention discloses a device for determining feature engineering variables, including:
a data acquisition module for obtaining a feature variable data set for machine learning, the feature variable data set containing multiple different types of feature variables; a feature derivation module for deriving each type of feature variable according to preset rules to obtain an expanded data set containing the augmented feature variables; and a feature screening module for screening feature variables from the expanded data set and combining them to obtain a feature variable combination for machine learning.
Optionally, the feature derivation module includes: a rule extraction unit for extracting, according to the type of a selected feature variable, the derivation rule that matches that type; and a feature expansion unit for deriving the selected feature variable according to the matching derivation rule to obtain multiple expanded feature variables, which serve as the augmented feature variables in the expanded data set.
Optionally, the device further includes: a combination query module for querying the feature combinations of the feature variables in the expanded data set; and a combination determination module for determining a preferred feature combination from the different feature combinations as the feature variable combination for machine learning.
Optionally, the combination determination module includes: an index selection unit for obtaining an importance index that characterizes feature variable combinations for machine learning; and a combination selection unit for selecting, from the different feature combinations, the feature combination associated with the importance index as the feature variable combination for machine learning.
Optionally, the device further includes: a preprocessing module for preprocessing the different types of feature variables to obtain the feature variable data set.
According to a third aspect, an embodiment of the present invention discloses a computer apparatus including a processor, the processor being configured to execute a computer program stored in a memory to implement the following method:
obtaining a feature variable data set for machine learning, the feature variable data set containing multiple different types of feature variables; deriving each type of feature variable according to preset rules to obtain an expanded data set containing the augmented feature variables; and screening feature variables from the expanded data set and combining them to obtain a feature variable combination for machine learning.
According to a fourth aspect, an embodiment of the present invention discloses a computer-readable storage medium on which a computer program is stored, a processor being configured to execute the computer program stored in the storage medium to implement the following method:
obtaining a feature variable data set for machine learning, the feature variable data set containing multiple different types of feature variables; deriving each type of feature variable according to preset rules to obtain an expanded data set containing the augmented feature variables; and screening feature variables from the expanded data set and combining them to obtain a feature variable combination for machine learning.
The technical solution of the present invention has the following advantages:
In the method and device for determining feature engineering variables provided by the embodiments of the present invention, after the feature variable data set for machine learning is obtained, each type of feature variable is derived according to preset rules to obtain an expanded data set containing the augmented feature variables. This enhances the interpretability of the feature variables and enriches the information they carry. The feature variable combination obtained by screening and combining feature variables from the expanded data set can therefore contain richer and more explanatory information, which optimizes the feature variables used for machine learning so that later modeling can use a good feature combination to improve the accuracy of the model.
In addition, deriving each type of feature variable according to preset rules and then screening and combining feature variables from the expanded data set allows derivation, screening, and combination to be performed automatically. Feature engineering is thereby automated and its efficiency improved.
Brief description of the drawings
To explain the specific embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed for describing them are briefly introduced below. The drawings described below show some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for determining feature engineering variables disclosed in an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a device for determining feature engineering variables disclosed in an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that orientation or positional terms such as "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", and "outer" are based on the orientations or positions shown in the drawings. They are used only to simplify the description and do not indicate or imply that the device or element referred to must have a particular orientation or be constructed and operated in a particular orientation, so they should not be construed as limiting the invention. In addition, the terms "first", "second", and "third" are used for descriptive purposes only and should not be understood as indicating or implying relative importance.
In the description of the present invention, it should also be noted that, unless otherwise expressly specified and limited, the terms "installed", "coupled", and "connected" should be understood broadly. A connection may, for example, be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct or indirect through an intermediary; it may be internal between two elements; and it may be wireless or wired. Those of ordinary skill in the art can understand the specific meaning of these terms in the present invention according to the specific situation.
In addition, the technical features involved in the different embodiments of the invention described below can be combined with one another as long as they do not conflict.
To optimize feature engineering and improve its efficiency, this embodiment discloses a method for determining feature engineering variables. Referring to Fig. 1, which is a flowchart of the method disclosed in this embodiment, the method includes:
Step S100: obtain a feature variable data set for machine learning. In a specific embodiment, the feature variables may be obtained from a data front end such as a mobile terminal or a PC, or the feature variable data set may be obtained from external storage. In this embodiment, the feature variable data set contains multiple different types of feature variables; specifically, a feature variable may be a character-type, time-type, categorical, or numeric variable. In a specific embodiment, after the feature variable data set is obtained, the data type of each feature variable can be identified automatically so that different data processing can be applied. In practice, an existing identification method can be used to identify the data types of the feature variables.
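The automatic type identification mentioned in step S100 is left to "an existing identification method". As a rough illustration only, the sketch below guesses one of the four types named above by trial parsing. The function name `infer_variable_type`, the `max_categories` threshold, the `%Y-%m-%d` date format, and the rule that repeated values signal a categorical variable are all assumptions made for this example, not part of the patent.

```python
from datetime import datetime


def infer_variable_type(values, max_categories=10):
    """Guess whether a column is 'numeric', 'time', 'categorical', or 'character'.

    Hypothetical heuristic: try to parse every non-missing value as a number,
    then as an ISO date; otherwise decide by how often values repeat.
    """
    present = [v for v in values if v not in (None, "")]
    if not present:
        return "character"

    def all_parse(parser):
        # True only if every present value is accepted by the parser.
        try:
            for v in present:
                parser(v)
            return True
        except (TypeError, ValueError):
            return False

    if all_parse(float):
        return "numeric"
    if all_parse(lambda v: datetime.strptime(str(v), "%Y-%m-%d")):
        return "time"
    distinct = set(present)
    # Few distinct values that repeat: treat the column as categorical.
    if len(distinct) <= max_categories and len(distinct) < len(present):
        return "categorical"
    return "character"
```

A real system would likely need more date formats and a more careful rule for separating categorical values from free text.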
Step S200: derive each type of feature variable according to preset rules to obtain an expanded data set containing the augmented feature variables. In this embodiment, new feature variables are generated by transforming the original feature variables; these new variables are the augmented feature variables obtained by deriving from the originals, and they enhance the interpretability of the feature variables. In a specific embodiment, different derivation methods are used for different types of feature variables to obtain the augmented feature variables of each type. Specifically, deriving each type of feature variable to obtain the expanded data set includes: extracting, according to the type of a selected feature variable, the derivation rule that matches that type; and deriving the selected feature variable according to the matching derivation rule to obtain multiple expanded feature variables, which serve as the augmented feature variables in the expanded data set. As examples: for time feature variables, a derivation rule based on time splitting can be used; for continuous feature variables, statistical methods can be used; for numeric feature variables, new feature variables can be derived using, for example, the mean, the median, the growth rate, and the four arithmetic operations.
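One way to picture the "extract the matching rule, then derive" logic of step S200 is a lookup table from variable type to derivation function. The sketch below is a minimal illustration under that assumption; the names `DERIVATION_RULES`, `derive_time`, `derive_numeric`, and `expand`, and the specific derived columns, are hypothetical choices. The growth rate here treats consecutive rows as consecutive periods, which is also an assumption.

```python
from datetime import date
from statistics import mean, median


def derive_time(name, values):
    # Time splitting: break each date into calendar components.
    return {
        f"{name}_year": [d.year for d in values],
        f"{name}_month": [d.month for d in values],
        f"{name}_weekday": [d.weekday() for d in values],
    }


def derive_numeric(name, values):
    # Mean, median, and growth rate combined with simple arithmetic.
    avg, med = mean(values), median(values)
    growth = [None] + [
        (cur - prev) / prev if prev != 0 else None
        for prev, cur in zip(values, values[1:])
    ]
    return {
        f"{name}_minus_mean": [v - avg for v in values],
        f"{name}_minus_median": [v - med for v in values],
        f"{name}_growth_rate": growth,
    }


# Preset rules: one derivation function per variable type.
DERIVATION_RULES = {"time": derive_time, "numeric": derive_numeric}


def expand(dataset, types):
    """Return the original columns plus every derived column."""
    expanded = dict(dataset)
    for name, values in dataset.items():
        rule = DERIVATION_RULES.get(types[name])  # rule extraction by type
        if rule is not None:
            expanded.update(rule(name, values))   # feature expansion
    return expanded
```

For example, `expand({"income": [100.0, 110.0, 121.0]}, {"income": "numeric"})` keeps `income` and adds three derived columns. Types without a registered rule pass through unchanged.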
Step S300: screen feature variables from the expanded data set and combine them to obtain a feature variable combination for machine learning. In this embodiment, feature variables are screened automatically according to the attributes of the feature variables that the machine learning task requires, and the screened feature variables are then combined to obtain the feature variable combination. Because screening follows the attributes required for machine learning, the screened feature variables fit the needs of the machine learning task more closely and are more targeted, which can improve the accuracy of the model.
In an optional embodiment, when step S300 is executed, screening feature variables from the expanded data set and combining them to obtain the feature variable combination for machine learning includes: querying the feature combinations of the feature variables in the expanded data set; and determining a preferred feature combination from the different feature combinations as the feature variable combination for machine learning. In practice, all possible feature variable combinations can be generated by enumeration, for example by enumerating every possible feature combination: n feature variables yield 2^n - 1 feature variable combinations (every non-empty subset). Alternatively, feature variables can be combined by setting a correlation coefficient threshold: when the correlation coefficient of two feature variables exceeds the set threshold, the two variables are strongly related and can be combined. It should be noted that, in a specific embodiment, feature variables may be combined in pairs or in larger groups. When determining a preferred feature combination from the different feature combinations, an importance index that characterizes feature variable combinations for machine learning can be obtained, and the feature combination associated with the importance index is selected from the different feature combinations as the feature variable combination for machine learning. As an example, when linear analysis is performed in machine learning, the linear coefficient K is a relatively important index, so the feature combination related to the linear coefficient K can be selected as the feature variable combination for machine learning. When selecting the feature combination associated with the importance index, algorithms such as simulated annealing or genetic algorithms can be used to screen out the feature combination that most improves the machine learning model.
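The two ways of forming combinations described above, exhaustive enumeration and a correlation threshold, can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the correlation is taken to be the Pearson coefficient compared by absolute value, and `best_combination` uses a plain exhaustive maximum over a caller-supplied `importance` score rather than the annealing or genetic search mentioned in the text.

```python
from itertools import combinations
from math import sqrt


def all_feature_combinations(names):
    """Yield every non-empty subset of the feature names (2**n - 1 subsets)."""
    for size in range(1, len(names) + 1):
        yield from combinations(names, size)


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def correlated_pairs(columns, threshold):
    """Pairs of features whose absolute correlation exceeds the threshold."""
    return [
        (a, b)
        for a, b in combinations(columns, 2)
        if abs(pearson(columns[a], columns[b])) > threshold
    ]


def best_combination(candidates, importance):
    """Pick the candidate with the highest importance score (exhaustive)."""
    return max(candidates, key=importance)
```

Exhaustive enumeration grows exponentially with the number of features, which is presumably why the text suggests heuristic search for larger feature sets.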
To improve data processing efficiency and accuracy, in an optional embodiment, the following may also be performed between step S100 and step S200:
Step S400: preprocess the different types of feature variables to obtain the feature variable data set. In this embodiment, each feature variable can be preprocessed according to its specific type. For ease of understanding by those skilled in the art, preprocessing is described below for numeric, categorical, time-type, and character-type feature variables in turn.
(1) Numeric feature variable data. Preprocessing such as nondimensionalization, missing-value correction, and discretization can be performed automatically, where: nondimensionalization converts data of different scales to the same scale, for example by standardization, interval scaling, or data normalization; missing-value correction may delete missing values, fill them in, or ignore them; discretization may be feature binning or feature binarization. Feature binning cuts a variable into categories of different levels according to its numeric value. Feature binarization converts numeric data into a Boolean attribute: a threshold is set, values greater than the threshold are assigned 1, and values less than or equal to the threshold are assigned 0.
(2) Categorical feature variable data. Preprocessing such as category merging and numeric encoding can be performed automatically, where: category merging may be binning, binarization, and so on, that is, merging a high-dimensional categorical variable into a low-dimensional one; numeric encoding may be dummy-variable encoding, one-hot encoding, and so on. Dummy-variable encoding uses an N-bit state register to encode N possible values, where N is a positive integer; each state is represented by its own register bit, and only one bit is active at any time. Suppose the values of an attribute form a non-numeric discrete set [discrete value 1, discrete value 2, ..., discrete value m], where m is a positive integer; the attribute is then encoded as an m-tuple in which exactly one component is 1 and all the others are 0. One-hot encoding likewise uses an N-bit state register to encode N states, each state having its own register bit with only one bit active at any time, so that each category receives a numeric code.
(3) Time-type feature variable data. Conversion and splitting can be performed: a time-type feature variable can be converted into the number of days from a base date (or a number of hours, minutes, or another time unit), and can be split into different time periods. It should be noted that once a time-type feature variable has been converted into an interval from a reference time, it becomes a numeric feature variable, so the converted data can then be preprocessed in the same way as the numeric feature variable data described above, which is not repeated here.
(4) Character-type feature variable data. Character data can be classified and split, for example by word segmentation or semantic splitting.
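The four kinds of preprocessing in items (1) to (4) can be illustrated with a few small functions. This is a minimal sketch, not the patent's implementation: every function name is hypothetical, missing values are assumed to be `None` and are filled with the mean, binning is assumed to be equal-width, rare categories are merged into an assumed `"OTHER"` label, the time periods are an arbitrary four-way split of the day, and word segmentation is reduced to a simple regular expression for Latin text.

```python
import re
from collections import Counter
from datetime import datetime
from statistics import mean


# (1) Numeric: missing-value filling, interval scaling, binning, binarization.
def fill_missing(values):
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_scale(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]


def equal_width_bins(values, n_bins):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(values)
    # The maximum would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def binarize(values, threshold):
    # 1 if greater than the threshold, 0 if less than or equal to it.
    return [1 if v > threshold else 0 for v in values]


# (2) Categorical: merge rare categories, then one-hot encode.
def merge_rare_categories(values, min_count, other="OTHER"):
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


def one_hot_encode(values):
    categories = sorted(set(values))
    position = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[position[v]] = 1  # exactly one component is 1
        rows.append(row)
    return categories, rows


# (3) Time: interval from a base time, and splitting into periods.
def interval_from_base(base, moments, unit="days"):
    seconds_per_unit = {"days": 86400, "hours": 3600, "minutes": 60}[unit]
    return [(m - base).total_seconds() / seconds_per_unit for m in moments]


def part_of_day(moment):
    if moment.hour < 6:
        return "night"
    if moment.hour < 12:
        return "morning"
    if moment.hour < 18:
        return "afternoon"
    return "evening"


# (4) Character: naive word splitting on letters and digits.
def split_words(text):
    return re.findall(r"[a-z0-9]+", text.lower())
```

Text in languages without whitespace word boundaries, such as Chinese, would need a dedicated segmenter in place of `split_words`.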
This embodiment also discloses a device for determining feature engineering variables. Referring to Fig. 2, which is a schematic structural diagram of the device disclosed in this embodiment, the device includes a data acquisition module 100, a feature derivation module 200, and a feature screening module 300, where:
The data acquisition module 100 is used to obtain a feature variable data set for machine learning, the feature variable data set containing multiple different types of feature variables; the feature derivation module 200 is used to derive each type of feature variable according to preset rules to obtain an expanded data set containing the augmented feature variables; and the feature screening module 300 is used to screen feature variables from the expanded data set and combine them to obtain a feature variable combination for machine learning.
In an optional embodiment, the feature derivation module 200 includes: a rule extraction unit for extracting, according to the type of a selected feature variable, the derivation rule that matches that type; and a feature expansion unit for deriving the selected feature variable according to the matching derivation rule to obtain multiple expanded feature variables, which serve as the augmented feature variables in the expanded data set.
In an optional embodiment, the device further includes: a combination query module for querying the feature combinations of the feature variables in the expanded data set; and a combination determination module for determining a preferred feature combination from the different feature combinations as the feature variable combination for machine learning.
In an optional embodiment, the combination determination module includes: an index selection unit for obtaining an importance index that characterizes feature variable combinations for machine learning; and a combination selection unit for selecting, from the different feature combinations, the feature combination associated with the importance index as the feature variable combination for machine learning.
In an optional embodiment, the device further includes a preprocessing module 400 for preprocessing the different types of feature variables to obtain the feature variable data set.
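The module structure described above could be wired together in software roughly as follows. This is only a structural sketch: the class name `FeatureVariableDeterminer`, the `Dataset` alias, and the choice to model each module as a plain callable are assumptions for illustration, with the optional preprocessing module 400 running between acquisition and derivation as in the method.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Dataset = Dict[str, List[float]]


@dataclass
class FeatureVariableDeterminer:
    acquire: Callable[[], Dataset]                 # data acquisition module 100
    derive: Callable[[Dataset], Dataset]           # feature derivation module 200
    screen: Callable[[Dataset], Tuple[str, ...]]   # feature screening module 300
    preprocess: Optional[Callable[[Dataset], Dataset]] = None  # module 400

    def run(self) -> Tuple[str, ...]:
        data = self.acquire()
        if self.preprocess is not None:
            data = self.preprocess(data)
        expanded = self.derive(data)
        return self.screen(expanded)
```

Keeping each module behind a callable makes it possible to swap derivation rules or screening strategies without touching the rest of the flow.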
In addition, this embodiment also discloses a computer apparatus including a processor, the processor being configured to execute a computer program stored in a memory to implement the following method:
obtaining a feature variable data set for machine learning, the feature variable data set containing multiple different types of feature variables; deriving each type of feature variable according to preset rules to obtain an expanded data set containing the augmented feature variables; and screening feature variables from the expanded data set and combining them to obtain a feature variable combination for machine learning.
In an optional embodiment, deriving each type of feature variable according to preset rules to obtain the expanded data set includes: extracting, according to the type of a selected feature variable, the derivation rule that matches that type; and deriving the selected feature variable according to the matching derivation rule to obtain multiple expanded feature variables, which serve as the augmented feature variables in the expanded data set.
In an optional embodiment, screening feature variables from the expanded data set and combining them to obtain the feature variable combination for machine learning includes: querying the feature combinations of the feature variables in the expanded data set; and determining a preferred feature combination from the different feature combinations as the feature variable combination for machine learning.
In an optional embodiment, determining a preferred feature combination from the different feature combinations as the feature variable combination for machine learning includes: obtaining an importance index that characterizes feature variable combinations for machine learning; and selecting, from the different feature combinations, the feature combination associated with the importance index as the feature variable combination for machine learning.
In an optional embodiment, between obtaining the feature variable data set for machine learning and deriving each type of feature variable according to preset rules to obtain the expanded data set, the method further includes: preprocessing the different types of feature variables to obtain the feature variable data set.
Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be carried out by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, may include the processes of the method embodiments described above. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like. A computer processor executes the computer program stored in the storage medium to implement the following method:
A feature variable data set for machine learning is obtained, the feature variable data set containing feature variables of a plurality of different types; each type of feature variable is derived according to preset rules to obtain an expanded data set of expanded feature variables; and feature variables are screened from the expanded data set and combined to obtain a feature variable combination for machine learning.
In an optional embodiment, deriving each type of feature variable according to preset rules to obtain the expanded data set of expanded feature variables includes: extracting, according to the type of a selected feature variable, the derivation rule that matches that type; and deriving the selected feature variable according to the derivation rule that matches that type, so that the multiple feature variables expanded from the selected feature variable serve as the expanded feature variables in the expanded data set.
In an optional embodiment, screening feature variables from the expanded data set and combining them to obtain the feature variable combination for machine learning includes: querying the feature combinations of the feature variables in the expanded data set; and determining, from the different feature combinations, a preferred feature combination to serve as the feature variable combination for machine learning.
In an optional embodiment, determining, from the different feature combinations, the preferred feature combination to serve as the feature variable combination for machine learning includes: obtaining an importance index that characterizes feature variable combinations for machine learning; and selecting, from the different feature combinations, the feature combination associated with the importance index to serve as the feature variable combination for machine learning.
In an optional embodiment, between obtaining the feature variable data set for machine learning and deriving each type of feature variable according to preset rules to obtain the expanded data set of expanded feature variables, the method further includes: preprocessing the different types of feature variables to obtain the feature variable data set.
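The screening step above, in which a preferred feature combination is chosen according to an importance index, might look like the following sketch. The description does not define the importance index; the mean absolute Pearson correlation with a target column is used here purely as an assumed stand-in, and the function names are hypothetical.

```python
from itertools import combinations
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either column is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def importance(combo, data, target):
    # Assumed importance index: mean absolute correlation of the
    # combination's variables with the target.
    return mean(abs(pearson(data[name], target)) for name in combo)


def select_combination(data, target, size):
    """Enumerate all combinations of the given size and keep the best one."""
    candidates = combinations(sorted(data), size)
    return max(candidates, key=lambda combo: importance(combo, data, target))


data = {
    "a": [1, 2, 3, 4],
    "b": [1, 1, 1, 1],
    "c": [4, 3, 2, 1],
    "d": [1, 3, 2, 4],
}
target = [0, 0, 1, 1]
best = select_combination(data, target, size=2)
```

Exhaustive enumeration is only practical for small expanded data sets; for larger ones a greedy or model-based ranking would be a more realistic choice, which is a design decision outside what the description states.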
With the method and device for determining feature engineering variables provided in this embodiment, after the feature variable data set for machine learning is obtained, each type of feature variable is derived according to preset rules to obtain an expanded data set of expanded feature variables. This enhances the interpretability of the feature variables and enriches the information they carry. The feature variable combination obtained by screening and combining feature variables from the expanded data set can therefore contain richer and more interpretable information, which optimizes the feature variables used for machine learning, so that high-quality feature combinations can be used in later modeling to improve the accuracy of the model.
In addition, because each type of feature variable is derived according to preset rules and feature variables are then screened from the expanded data set and combined, the derivation, screening, and combination of feature variables can all be performed automatically. This automates feature engineering and improves its efficiency.
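A minimal sketch of how the automated flow described above (preprocess, derive, then screen and combine) could be chained is given below. The mean-fill preprocessing, the `square_rule` derivation, and the `keep_derived` screening function are all hypothetical placeholders chosen for brevity, not steps specified in the description.

```python
from typing import Callable, Dict, List

Dataset = Dict[str, list]


def preprocess(raw: Dataset) -> Dataset:
    """Assumed preprocessing: fill missing numeric values (None) with the
    column mean and drop columns that contain no values at all."""
    cleaned: Dataset = {}
    for name, values in raw.items():
        present = [v for v in values if v is not None]
        if not present:
            continue
        fill = sum(present) / len(present)
        cleaned[name] = [fill if v is None else v for v in values]
    return cleaned


def run_pipeline(
    raw: Dataset,
    derive: Callable[[Dataset], Dataset],
    screen: Callable[[Dataset], List[str]],
) -> Dataset:
    """Run preprocessing, derivation, and screening without manual steps."""
    features = preprocess(raw)      # feature variable data set
    expanded = derive(features)     # expanded data set
    selected = screen(expanded)     # names in the chosen combination
    return {name: expanded[name] for name in selected}


def square_rule(ds: Dataset) -> Dataset:
    # Placeholder derivation: add a squared copy of every column.
    derived = {f"{name}_sq": [v * v for v in vals] for name, vals in ds.items()}
    return {**ds, **derived}


def keep_derived(ds: Dataset) -> List[str]:
    # Placeholder screening: keep only the derived columns.
    return [name for name in ds if name.endswith("_sq")]


raw = {"x": [1.0, None, 3.0], "empty": [None, None, None]}
result = run_pipeline(raw, square_rule, keep_derived)
```

Passing the derivation and screening steps in as callables keeps the pipeline independent of any particular rule set or importance index.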
Obviously, the above embodiments are merely examples given for clarity of description and do not limit the implementations. Those of ordinary skill in the art can make other changes or variations in different forms on the basis of the above description. It is neither necessary nor possible to enumerate all implementations here. Obvious changes or variations derived from the above remain within the protection scope of the invention.

Claims (12)

1. A method for determining feature engineering variables, characterized by comprising:
obtaining a feature variable data set for machine learning, the feature variable data set containing feature variables of a plurality of different types;
deriving each type of feature variable according to preset rules to obtain an expanded data set of expanded feature variables;
screening feature variables from the expanded data set and combining them to obtain a feature variable combination for machine learning.
2. The method for determining feature engineering variables according to claim 1, characterized in that deriving each type of feature variable according to preset rules to obtain the expanded data set of expanded feature variables comprises:
extracting, according to the type of a selected feature variable, the derivation rule that matches that type;
deriving the selected feature variable according to the derivation rule that matches that type, so that the multiple feature variables expanded from the selected feature variable serve as the expanded feature variables in the expanded data set.
3. The method for determining feature engineering variables according to claim 1, characterized in that screening feature variables from the expanded data set and combining them to obtain the feature variable combination for machine learning comprises:
querying the feature combinations of the feature variables in the expanded data set;
determining, from the different feature combinations, a preferred feature combination to serve as the feature variable combination for machine learning.
4. The method for determining feature engineering variables according to claim 3, characterized in that determining, from the different feature combinations, the preferred feature combination to serve as the feature variable combination for machine learning comprises:
obtaining an importance index that characterizes feature variable combinations for machine learning;
selecting, from the different feature combinations, the feature combination associated with the importance index to serve as the feature variable combination for machine learning.
5. The method for determining feature engineering variables according to any one of claims 1-4, characterized in that, between obtaining the feature variable data set for machine learning and deriving each type of feature variable according to preset rules to obtain the expanded data set of expanded feature variables, the method further comprises:
preprocessing the different types of feature variables to obtain the feature variable data set.
6. A device for determining feature engineering variables, characterized by comprising:
a data acquisition module, configured to obtain a feature variable data set for machine learning, the feature variable data set containing feature variables of a plurality of different types;
a feature derivation module, configured to derive each type of feature variable according to preset rules to obtain an expanded data set of expanded feature variables;
a feature screening module, configured to screen feature variables from the expanded data set and combine them to obtain a feature variable combination for machine learning.
7. The device for determining feature engineering variables according to claim 6, characterized in that the feature derivation module comprises:
a rule extraction unit, configured to extract, according to the type of a selected feature variable, the derivation rule that matches that type;
a feature expansion unit, configured to derive the selected feature variable according to the derivation rule that matches that type, so that the multiple feature variables expanded from the selected feature variable serve as the expanded feature variables in the expanded data set.
8. The device for determining feature engineering variables according to claim 6, characterized by further comprising:
a combination query module, configured to query the feature combinations of the feature variables in the expanded data set;
a combination determining module, configured to determine, from the different feature combinations, a preferred feature combination to serve as the feature variable combination for machine learning.
9. The device for determining feature engineering variables according to claim 8, characterized in that the combination determining module comprises:
an index obtaining unit, configured to obtain an importance index that characterizes feature variable combinations for machine learning;
a combination selection unit, configured to select, from the different feature combinations, the feature combination associated with the importance index to serve as the feature variable combination for machine learning.
10. The device for determining feature engineering variables according to any one of claims 6-9, characterized by further comprising:
a preprocessing module, configured to preprocess the different types of feature variables to obtain the feature variable data set.
11. A computer device, characterized by comprising a processor, wherein the processor is configured to execute a computer program stored in a memory to implement the method according to any one of claims 1-5.
12. A computer-readable storage medium having a computer program stored thereon, characterized in that a processor is configured to execute the computer program stored in the storage medium to implement the method according to any one of claims 1-5.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810564705.4A 2018-06-04 2018-06-04 Feature Engineering variable determines method and device

Publications (1)

Publication Number Publication Date
CN108875815A 2018-11-23

Family ID: 64336210

Country Status (1)

Country Link
CN (1) CN108875815A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104063747A (en) * 2014-06-26 2014-09-24 上海交通大学 Performance abnormality prediction method in distributed system and system
CN105786860A (en) * 2014-12-23 2016-07-20 华为技术有限公司 Data processing method and device in data modeling
CN107168965A (en) * 2016-03-07 2017-09-15 阿里巴巴集团控股有限公司 Feature Engineering strategy determines method and device
CN107392217A (en) * 2016-05-17 2017-11-24 上海点融信息科技有限责任公司 Computer implemented information processing method and device
CN107688865A (en) * 2017-07-31 2018-02-13 上海恺英网络科技有限公司 Identify the method and apparatus of potential high consumption user in online game
CN107784322A (en) * 2017-09-30 2018-03-09 东软集团股份有限公司 Abnormal deviation data examination method, device, storage medium and program product
CN107808246A (en) * 2017-10-26 2018-03-16 上海维信荟智金融科技有限公司 The intelligent evaluation method and system of collage-credit data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KEDAR POTDAR: "A Comparative Study of Categorical Variable Encoding Techniques for Neural Network Classifiers", 《INTERNATIONAL JOURNAL OF COMPUTER APPLICATIONS》 *
江鹏: "面向非平衡数据集的多簇IB算法研究", 《中国优秀硕士学位论文全文数据库信息科技辑》 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11062792B2 (en) 2017-07-18 2021-07-13 Analytics For Life Inc. Discovering genomes to use in machine learning techniques
US11139048B2 (en) 2017-07-18 2021-10-05 Analytics For Life Inc. Discovering novel features to use in machine learning techniques, such as machine learning techniques for diagnosing medical conditions
CN109657723A (en) * 2018-12-20 2019-04-19 四川新网银行股份有限公司 A method of enhancing higher-dimension category feature ability to express
CN110717182A (en) * 2019-10-14 2020-01-21 杭州安恒信息技术股份有限公司 Webpage Trojan horse detection method, device and equipment and readable storage medium
WO2021084471A1 (en) * 2019-10-31 2021-05-06 International Business Machines Corporation Artificial intelligence transparency
US11651276B2 (en) 2019-10-31 2023-05-16 International Business Machines Corporation Artificial intelligence transparency
WO2021196843A1 (en) * 2020-03-31 2021-10-07 支付宝(杭州)信息技术有限公司 Derived variable selection method and apparatus for risk identification model
CN113496287A (en) * 2020-04-07 2021-10-12 广州华工弈高科技有限公司 Automatic feature engineering method and device based on regional data
CN111985553A (en) * 2020-08-18 2020-11-24 北京云从科技有限公司 Feature construction method and device, machine readable medium and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
Inventor after: Jiang Fengying, Lin Qingzhi
Inventor before: Xu Jingran, Jiang Fengying, Luo Xiaosheng, Lin Qingzhi
CB02 Change of applicant information
Address after: 518000 Unit A, B, C, D, Unit 21, Unit A, Unit 22, Unit C, Unit D, Block 11, Keyuan Road, Yuehai Street, Nanshan District, Shenzhen City, Guangdong Province
Applicant after: Shenzhen feidai small loan Co., Ltd
Address before: 518000 Unit A, B, C, D, Unit 21, Unit A, Unit 22, Unit C, Unit D, Block 11, Keyuan Road, Yuehai Street, Nanshan District, Shenzhen City, Guangdong Province
Applicant before: SHENZHEN YANXIN PETTY LOAN Co.,Ltd.
RJ01 Rejection of invention patent application after publication
Application publication date: 20181123