CN110442417A - Feature extraction method, machine learning method, and device thereof
- Publication number: CN110442417A (application CN201910743847.1A)
- Authority: CN (China)
- Prior art keywords: item, feature extraction, data record, field, configuration item
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/451—Execution arrangements for user interfaces
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Abstract
A feature extraction method, a machine learning method, and corresponding devices are provided. The feature extraction method includes: obtaining a data record; obtaining feature extraction configuration items that define how predetermined features are to be extracted from the data record, where the feature extraction configuration item of each predetermined feature includes a source field item and a processing method item, the source field item designates the fields of the data record involved in that predetermined feature as source fields, and the processing method item specifies a reference to a data processing function programmed in advance as executable code, the data processing function performing, on the field values of the source fields designated by the source field item, the data processing for extracting that predetermined feature; and performing data processing on the field values of the data record based on the feature extraction configuration items to obtain the feature values of the predetermined features. The feature extraction and machine learning techniques according to embodiments of the present invention enhance programming flexibility and code reusability, and are particularly suitable for big data applications.
Description
This application is a divisional application of the patent application filed on January 8, 2016, with application No. 201610011587.5, entitled "Feature extraction method, machine learning method, and device thereof".
Technical field
The present invention relates generally to the field of information technology, and more specifically to a feature extraction method, a machine learning method, and corresponding devices.
Background
In information technology fields such as data mining and machine learning, the objects being processed are data. Before massive amounts of data are processed, feature extraction is usually performed on the data.
Features serve as the raw material of data processing. In brief, each data record may include multiple fields, and a feature may represent a field itself, a part of a field, a combination of fields, a transformation of a field, or some other processing result, so as to better reflect the internal associations and latent meaning of the data distribution. Taking the field of data mining as an example, features are the raw material of a machine learning system and have a significant impact on the final model. Extracting features efficiently and accurately helps the learning process better distill the patterns in the data and analyze, from multiple angles, the internal associations and underlying meaning of the data distribution. In machine learning this process is known as feature engineering. The output of feature engineering is the material of machine learning; its quality directly determines how accurately the machine learning problem is characterized, and in turn affects the quality of the model.
In fact, this is not limited to feature engineering in the machine learning field. Any existing data processing system usually requires feature extraction, and to extract the corresponding features from the contents of each field, a programmer generally has to write executable program code for each class of feature.
For example, to obtain the year in the time field of every record in given data ("data"), the following Python program can be executed:
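# param: data (list of dict) - records, each stored as a dictionary of fields
# param: the 'time' field of each record is a 'YYYY-MM-DD' formatted date string
# return: list - the year for each record
def getYearOf(data):
    # Collect the time field of every record.
    timeFields = [rec['time'] for rec in data]
    # Split on '-' and keep the part at index 0 (the year).
    years = list(map(lambda x: x.split('-')[0], timeFields))
    return years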
The above program defines a piece of code that extracts, for each data record (rec) in the data source (data), the year of the time field as a year feature. It first extracts the time field from each record of the data source, then, according to the specific format of the time field (yyyy-mm-dd), splits the value on "-" and takes the yyyy part (that is, the part at index 0), maps it to the feature years, and returns the extracted year values.
As can be seen, this program places strong constraints on both the format of the data (the time field) and the output of the feature extraction. That is, this feature extraction code is customized for data of a specific format and for a specific output. Therefore, in general, if the format of the given data differs and/or the feature output to be obtained differs, entirely different code has to be written for the specific format and the algorithm used. Even if only the input order of the fields of the data record or the output order of the features differs, a completely customized set of code has to be rewritten. This not only imposes a heavy workload on programmers but also incurs considerable overhead at run time. Given the diversity of practical application scenarios and data specifications, this brute-force approach is difficult to extend and reuse.
Therefore, the existing approach of developing a separate processing flow for each data format and each kind of extracted content amounts to an exhaustive traversal of the problem space. As a result, the development complexity of feature extraction grows nonlinearly, and the run-time complexity is also difficult to bound.
Summary of the invention
The present invention has been made in view of the foregoing.
According to one aspect of the present invention, a method for performing feature extraction on data records is provided, which may include: a data record obtaining step of obtaining a data record; a feature extraction configuration item obtaining step of obtaining feature extraction configuration items that define how predetermined features are to be extracted from the data record, where the feature extraction configuration item of each predetermined feature includes a source field item and a processing method item, the source field item designates the fields of the data record involved in that predetermined feature as source fields, and the processing method item specifies a reference to a data processing function programmed in advance as executable code, the data processing function performing, on the field values of the source fields designated by the source field item, the data processing for extracting that predetermined feature; and a feature value obtaining step of performing data processing on the field values of the data record based on the feature extraction configuration items to obtain the feature values of the predetermined features.
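The relationship between configuration items, pre-programmed data processing functions, and a generic main program can be sketched as follows. This is only a minimal illustration under stated assumptions: the dictionary layout of a configuration item, the registry PROCESSING_FUNCTIONS, and the names extract_year, combine, and extract_features are hypothetical and do not come from the patent text.

```python
# Hypothetical registry of data processing functions programmed in advance.
def extract_year(values):
    # Expects one 'YYYY-MM-DD' style string; returns the year part.
    return values[0].split("-")[0]

def combine(values):
    # Joins several field values into one combined feature value.
    return "_".join(str(v) for v in values)

PROCESSING_FUNCTIONS = {"year": extract_year, "combine": combine}

# Feature extraction configuration items (this layout is an assumption).
# Each item names its source fields and references a processing function.
CONFIG_ITEMS = [
    {"feature": "birth_year", "source_fields": ["birthday"], "method": "year"},
    {"feature": "job_housing", "source_fields": ["job", "housing"], "method": "combine"},
]

def extract_features(record, config_items, functions=PROCESSING_FUNCTIONS):
    """Generic main program: applies every configuration item to one record."""
    features = {}
    for item in config_items:
        func = functions[item["method"]]                       # resolve the reference
        values = [record[name] for name in item["source_fields"]]
        features[item["feature"]] = func(values)
    return features

record = {"age": 37, "job": "teacher", "housing": "y", "birthday": "1979-3-1"}
print(extract_features(record, CONFIG_ITEMS))
```

In this sketch, adding a new feature means adding a configuration item (and, if needed, one function to the registry); extract_features itself stays unchanged.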
Further, in the feature extraction method according to an embodiment of the present invention, the feature extraction configuration item obtaining step may include: reading the feature extraction configuration items from a configuration file in which they are set, or obtaining them according to an input operation of the user, where the configuration file is stored locally or received remotely.
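Reading configuration items from a locally stored configuration file might look like the sketch below. The JSON format, the key names source_fields and method, and the function load_config_items are assumptions made for illustration; the patent does not prescribe a file format at this point.

```python
import json
import os
import tempfile

def load_config_items(path):
    """Reads feature extraction configuration items from a JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        items = json.load(f)
    for item in items:
        # Each item must name its source field(s) and its processing method.
        if "source_fields" not in item or "method" not in item:
            raise ValueError(f"incomplete configuration item: {item!r}")
    return items

# Write a small example file so that the sketch is self-contained.
example = [{"feature": "birth_year", "source_fields": ["birthday"], "method": "year"}]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False, encoding="utf-8") as tmp:
    json.dump(example, tmp)
    config_path = tmp.name

loaded_items = load_config_items(config_path)
os.remove(config_path)
print(loaded_items)
```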
Further, in the feature extraction method according to an embodiment of the present invention, the feature extraction configuration item obtaining step may include: displaying to the user an interface for setting feature extraction configuration items; generating, according to input operations performed by the user on the interface, a configuration file in which the feature extraction configuration items are set; and reading the feature extraction configuration items from the generated configuration file.
Further, in the feature extraction method according to an embodiment of the present invention, the interface for setting feature extraction configuration items may be a graphical user interface, and the graphical user interface may include a text editing interface for manually editing the configuration file and/or a selection-input interface that displays content options of the feature extraction configuration items for manual selection.
Further, in the feature extraction method according to an embodiment of the present invention, in the feature extraction configuration item obtaining step, the display may be switched between the text editing interface and the selection-input interface in response to an interface switching operation input by the user, and the feature extraction configuration item settings made in the interface before the switch are synchronously displayed in the interface after the switch.
Further, in the feature extraction method according to an embodiment of the present invention, the selection-input interface displays at least the fields of the data record that can serve as source fields and the feature extraction configuration items of the predetermined features that have been set.
Further, in the feature extraction method according to an embodiment of the present invention, in the case where the graphical user interface includes the selection-input interface, the step of displaying to the user the interface for setting feature extraction configuration items may include: displaying a field selected by the user from among the fields as a set source field; displaying a processing method list near the source field when the source field is selected; and displaying a processing method selected by the user from the processing method list as the set processing method.
Further, in the feature extraction method according to an embodiment of the present invention, the processing method list includes all processing methods, all of which are in an active state; or the processing method list includes all processing methods, but only the processing methods applicable to the source field item are in an active state; or the processing method list includes only the processing methods applicable to the source field item.
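The three alternatives for populating the processing method list can be sketched as a single function with a mode switch. The table METHOD_FIELD_TYPES, the method names in it, and the mode names are hypothetical; the sketch only models which entries are listed and which are active, not the interface itself.

```python
# Hypothetical table: which field types each processing method can be applied to.
METHOD_FIELD_TYPES = {
    "discretize": {"number"},
    "year": {"date"},
    "one_hot": {"string", "number"},
}

def build_method_list(field_type, mode):
    """Returns (method, active) pairs for the list shown next to a source field.

    mode "all_active":  every method is listed and every method is active
    mode "all_listed":  every method is listed, only applicable ones are active
    mode "applicable":  only applicable methods are listed
    """
    entries = []
    for method in sorted(METHOD_FIELD_TYPES):
        applicable = field_type in METHOD_FIELD_TYPES[method]
        if mode == "all_active":
            entries.append((method, True))
        elif mode == "all_listed":
            entries.append((method, applicable))
        elif mode == "applicable":
            if applicable:
                entries.append((method, True))
        else:
            raise ValueError(f"unknown mode: {mode}")
    return entries

print(build_method_list("date", "all_listed"))
```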
Further, in the feature extraction method according to an embodiment of the present invention, the feature extraction configuration item of each predetermined feature may further include a processing parameter item corresponding to the processing method item, the processing parameter item defining the parameters involved in the data processing function.
Further, in the feature extraction method according to an embodiment of the present invention, the feature extraction configuration item of each predetermined feature may further include a storage location identifier, which indicates the storage area in memory of the computation coefficient corresponding to the feature value of that predetermined feature.
Further, in the feature extraction method according to an embodiment of the present invention, in the feature value obtaining step, data processing may be performed in parallel on individual data records among the data records, or on groups each consisting of multiple data records.
Further, in the feature extraction method according to an embodiment of the present invention, in the feature value obtaining step, data processing may be performed in parallel by a distributed computing cluster.
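Parallel processing over groups of data records can be illustrated on a single machine as follows. A local thread pool is used here purely as a stand-in for a distributed computing cluster, and the functions extract_group, split_into_groups, and extract_parallel are hypothetical names introduced for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor

def extract_group(records):
    # Placeholder extraction: the year of each record's 'time' field.
    return [rec["time"].split("-")[0] for rec in records]

def split_into_groups(records, group_size):
    # Each group consists of up to group_size consecutive records.
    return [records[i:i + group_size] for i in range(0, len(records), group_size)]

def extract_parallel(records, group_size=2, workers=4):
    groups = split_into_groups(records, group_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in the same order as the input groups.
        results = pool.map(extract_group, groups)
        return [value for group in results for value in group]

records = [{"time": "2015-01-02"}, {"time": "2016-01-08"}, {"time": "2014-05-06"}]
print(extract_parallel(records))
```

Because each record (or group) is processed independently of the others, the same partitioning idea carries over to a cluster, where each worker would receive its own groups.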
According to another aspect of the present invention, a computer-executed machine learning method is provided, which may include: a data record obtaining step of obtaining a data record; a feature extraction configuration item obtaining step of obtaining feature extraction configuration items that define how predetermined features are to be extracted from the data record, where the feature extraction configuration item of each predetermined feature includes a source field item and a processing method item, the source field item designates the fields of the data record involved in that predetermined feature as source fields, and the processing method item specifies a reference to a data processing function programmed in advance as executable code, the data processing function performing, on the field values of the source fields designated by the source field item, the data processing for extracting that predetermined feature; a feature value obtaining step of performing data processing on the field values of the data record based on the feature extraction configuration items to obtain the feature values of the predetermined features; a sample obtaining step of forming a feature vector, based at least in part on the feature values obtained in the feature value obtaining step, as a sample for machine learning; and a machine learning step of performing machine learning based on the sample.
Further, in the machine learning method according to an embodiment of the present invention, in the machine learning step, at least one of model training, model testing, and model application is performed based on the sample.
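The sample obtaining step, in which extracted feature values are turned into a feature vector, could be sketched as below. The text does not say how the vector is built; this sketch assumes a simple hashing scheme (CRC32 of "name=value" modulo a fixed dimension), and the names to_feature_vector and make_sample are hypothetical.

```python
import zlib

def to_feature_vector(feature_values, dimension=16):
    """Hashes 'name=value' pairs into a fixed-length 0/1 vector (assumed scheme)."""
    vector = [0] * dimension
    for name, value in sorted(feature_values.items()):
        index = zlib.crc32(f"{name}={value}".encode("utf-8")) % dimension
        vector[index] = 1
    return vector

def make_sample(feature_values, label=None):
    # A sample pairs the feature vector with an optional label;
    # unlabeled samples can still be used for model application.
    return {"x": to_feature_vector(feature_values), "y": label}

sample = make_sample({"birth_year": "1979"}, label="y")
print(sample)
```

Samples built this way could then be passed to model training, model testing, or model application, whichever the machine learning step requires.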
According to another aspect of the present invention, a computing device for performing feature extraction on data records is provided, including a storage unit and a processor, the storage unit storing a set of computer-executable instructions which, when executed by the processor, performs the following steps: a data record obtaining step of obtaining a data record; a feature extraction configuration item obtaining step of obtaining feature extraction configuration items that define how predetermined features are to be extracted from the data record, where the feature extraction configuration item of each predetermined feature includes a source field item and a processing method item, the source field item designates the fields of the data record involved in that predetermined feature as source fields, and the processing method item specifies a reference to a data processing function programmed in advance as executable code, the data processing function performing, on the field values of the source fields designated by the source field item, the data processing for extracting that predetermined feature; and a feature value obtaining step of performing data processing on the field values of the data record based on the feature extraction configuration items to obtain the feature values of the predetermined features.
According to another aspect of the present invention, a computing device for performing machine learning is provided, including a storage unit and a processor, the storage unit storing a set of computer-executable instructions which, when executed by the processor, performs the following steps: a data record obtaining step of obtaining a data record; a feature extraction configuration item obtaining step of obtaining feature extraction configuration items that define how predetermined features are to be extracted from the data record, where the feature extraction configuration item of each predetermined feature includes a source field item and a processing method item, the source field item designates the fields of the data record involved in that predetermined feature as source fields, and the processing method item specifies a reference to a data processing function programmed in advance as executable code, the data processing function performing, on the field values of the source fields designated by the source field item, the data processing for extracting that predetermined feature; a feature value obtaining step of performing data processing on the field values of the data record based on the feature extraction configuration items to obtain the feature values of the predetermined features; a sample obtaining step of forming a feature vector, based at least in part on the feature values obtained in the feature value obtaining step, as a sample for machine learning; and a machine learning step of performing machine learning based on the sample.
According to another aspect of the present invention, a feature extraction device for performing feature extraction on data records is provided, which may include: a data record obtaining unit configured to obtain a data record; a feature extraction configuration item obtaining unit configured to obtain feature extraction configuration items that define how predetermined features are to be extracted from the data record, where the feature extraction configuration item of each predetermined feature includes a source field item and a processing method item, the source field item designates the fields of the data record involved in that predetermined feature as source fields, and the processing method item specifies a reference to a data processing function programmed in advance as executable code, the data processing function performing, on the field values of the source fields designated by the source field item, the data processing for extracting that predetermined feature; and a feature value obtaining unit configured to perform data processing on the field values of the data record based on the feature extraction configuration items to obtain the feature values of the predetermined features.
Further, in the feature extraction device according to an embodiment of the present invention, the feature extraction configuration item obtaining unit may read the feature extraction configuration items from a configuration file in which they are set, or obtain them according to an input operation of the user, where the configuration file is stored locally or received remotely.
Further, in the feature extraction device according to an embodiment of the present invention, the feature extraction configuration item obtaining unit may display to the user an interface for setting feature extraction configuration items, generate, according to input operations performed by the user on the interface, a configuration file in which the feature extraction configuration items are set, and read the feature extraction configuration items from the generated configuration file.
Further, in the feature extraction device according to an embodiment of the present invention, the interface for setting feature extraction configuration items may be a graphical user interface, and the graphical user interface may include a text editing interface for manually editing the configuration file and/or a selection-input interface that displays content options of the feature extraction configuration items for manual selection.
Further, in the feature extraction device according to an embodiment of the present invention, the feature extraction configuration item obtaining unit may switch between the text editing interface and the selection-input interface in response to an interface switching operation input by the user, and the feature extraction configuration item settings made in the interface before the switch are synchronously displayed in the interface after the switch.
Further, in the feature extraction device according to an embodiment of the present invention, the selection-input interface may display at least the fields of the data record that can serve as source fields and the feature extraction configuration items of the predetermined features that have been set.
Further, in the feature extraction device according to an embodiment of the present invention, in the case where the graphical user interface includes the selection-input interface, the feature extraction configuration item obtaining unit may display a field selected by the user from among the fields as a set source field, display a processing method list near the source field when the source field is selected, and display a processing method selected by the user from the processing method list as the set processing method.
Further, in the feature extraction device according to an embodiment of the present invention, the processing method list may include all processing methods, all of which are in an active state; or the processing method list may include all processing methods, but only the processing methods applicable to the source field item are in an active state; or the processing method list may include only the processing methods applicable to the source field item.
Further, in the feature extraction device according to an embodiment of the present invention, the feature extraction configuration item of each predetermined feature may further include a processing parameter item corresponding to the processing method item, and the processing parameter item may be used to define the parameters involved in the data processing function.
Further, in the feature extraction device according to an embodiment of the present invention, the feature extraction configuration item of each predetermined feature may further include a storage location identifier, which indicates the storage area in memory of the computation coefficient corresponding to the feature value of that predetermined feature.
Further, in the feature extraction device according to an embodiment of the present invention, the feature value obtaining unit may perform data processing in parallel on individual data records among the data records, or on groups each consisting of multiple data records.
Further, in the feature extraction device according to an embodiment of the present invention, the feature value obtaining unit may perform data processing in parallel by means of a distributed computing cluster.
According to another aspect of the present invention, a machine learning device is provided, which may include: a data record obtaining unit configured to obtain a data record; a feature extraction configuration item obtaining unit configured to obtain feature extraction configuration items that define how predetermined features are to be extracted from the data record, where the feature extraction configuration item of each predetermined feature includes a source field item and a processing method item, the source field item designates the fields of the data record involved in that predetermined feature as source fields, and the processing method item specifies a reference to a data processing function programmed in advance as executable code, the data processing function performing, on the field values of the source fields designated by the source field item, the data processing for extracting that predetermined feature; a feature value obtaining unit configured to perform data processing on the field values of the data record based on the feature extraction configuration items to obtain the feature values of the predetermined features; a sample obtaining unit configured to form a feature vector, based at least in part on the feature values obtained by the feature value obtaining unit, as a sample for machine learning; and a machine learning unit configured to perform machine learning based on the sample.
Further, in the machine learning device according to an embodiment of the present invention, the machine learning unit may perform at least one of model training, model testing, and model application based on the sample.
The feature extraction and machine learning techniques according to embodiments of the present invention allow each feature extraction configuration item to be changed as needed, independently of the feature extraction main program, so that feature extraction can be effectively "abstracted" and "expressed" according to the scenario. There is no need to substantially modify the feature extraction main program, and data processing functions can be flexibly written or added independently, which enhances programming flexibility and code reusability. Accordingly, for different databases, the same feature extraction main program and the corresponding data processing functions can be used simply by defining feature extraction configuration items as needed, which enhances programming flexibility, ease of maintenance, and code reusability.
Brief description of the drawings
These and/or other aspects and advantages of the present invention will become clearer and easier to understand from the following detailed description of embodiments of the present invention taken in conjunction with the accompanying drawings, in which:
Fig. 1 shows an overall flowchart of a feature extraction method according to an embodiment of the present invention.
Fig. 2 shows an example of the content of a feature extraction configuration file.
Fig. 3 shows an example of executing the feature extraction process in a distributed manner when the data records are sample data for machine learning.
Fig. 4A shows an example of a graphical user interface for configuring feature extraction according to an exemplary embodiment of the present invention.
Fig. 4B shows an example of part of the graphical user interface, in which a processing method list is displayed when the user selects a single field (for example, the "age" field) in the left area.
Fig. 4C shows an example of part of the graphical user interface, in which a processing method list is displayed when the user selects multiple fields in the left area.
Fig. 5 shows an exemplary graphical user interface having an area in which feature extraction configuration items can be edited as text.
Fig. 6 shows an overall flowchart of a machine learning method according to an embodiment of the present invention that applies the feature extraction method of the above embodiment.
Fig. 7 shows a configuration block diagram of a computing device according to an embodiment of the present invention.
Detailed description
To enable those skilled in the art to better understand the present invention, the invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
Fig. 1 shows an overall flowchart of a feature extraction method 100 according to an embodiment of the present invention. The method may be executed by a computer program or by a dedicated feature extraction device. As an example, the program implementing the method may be packaged as a dedicated software package (for example, a lib library), so that a consistent feature extraction service is obtained whether the package is called online or offline, overcoming the prior-art defect that feature extraction results are inconsistent because the online and offline environments differ.
In step S110, a data record is obtained. Data may be presented in the form of data records; a data record is a complete set of related information corresponding to one row of a data source. For example, all the information about one customer in a customer mailing list is one data record.
The data record here is the raw material for subsequent feature extraction; each data record may have fields of various types and their corresponding field values. As an example, the following shows a single data record that describes a customer's personal information together with his or her loan repayment record; the features extracted from such data records will be used to train a model of customer loan risk:
No. | Age | Job | Owns a house | Contact | Birthday | Loan repayment label (label) |
1 | 37 | Teacher | Yes ("y") | Spouse | 1979-3-1 | Yes ("y") |
As listed in the table above, the data record describes some basic information about the customer, for example age (age), job (job), whether the customer owns a house (housing), contact person (contact), and birthday (birthday), and further includes a label (label) about the customer's loan repayment: specifically, "y" indicates that the customer has a positive loan repayment record, and "n" indicates that the customer has a negative loan repayment record. As an example, the features extracted from such data records may be used as training samples to train, based on a machine learning algorithm, a model for predicting customer loan risk.
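For illustration, the data record in the table can be held as a Python dictionary keyed by the field names used in the text (age, job, housing, contact, birthday, label); the key "no" for the record number and the helper split_label are assumptions added for this sketch.

```python
# The example record from the table, keyed by the field names in the text.
record = {
    "no": 1, "age": 37, "job": "teacher", "housing": "y",
    "contact": "spouse", "birthday": "1979-3-1", "label": "y",
}

def split_label(rec, label_field="label"):
    """Separates the prediction target from the remaining fields.

    Returns (fields, label); label is None when the record is unlabeled.
    """
    fields = {k: v for k, v in rec.items() if k != label_field}
    return fields, rec.get(label_field)

fields, label = split_label(record)
print(fields, label)
```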
It should be understood that a data record may have a variety of fields describing information of various kinds, and the content and format of the fields are not restricted. Moreover, a data record does not necessarily carry a label about the prediction target; it may carry no label at all.
Here, the way data records are obtained is not limited and may include various ways of obtaining online or offline data. For example, one or more data files may be stored in advance on a local hard disk or in a distributed file storage system, and the data records obtained by reading the data files; alternatively, the data records may be obtained by accessing a local or remote database; alternatively, the data records may be generated in real time, for example obtained in real time, via a specific communication protocol (for example, an interface description language (IDL)), from a device that calls the feature extraction service. As an example, a complete data record may be generated by splicing multiple data records together.
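Splicing several partial data records into a complete one might be done as in the sketch below. The text does not specify how records are matched, so joining on a shared key field (here "no") is an assumption, as is the function name splice_records.

```python
def splice_records(partials, key="no"):
    """Merges partial records that share the same key into complete records."""
    merged = {}
    for partial in partials:
        # Later partial records add to (or overwrite) fields of earlier ones.
        merged.setdefault(partial[key], {}).update(partial)
    return list(merged.values())

partials = [
    {"no": 1, "age": 37},
    {"no": 1, "job": "teacher"},
    {"no": 2, "age": 25},
]
print(splice_records(partials))
```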
Here acquisition data record can be the primary data that obtains and record, is also possible to once obtain a plurality of number
According to record.
In step S120, feature extraction configuration items that define how predetermined features are extracted from the data record are obtained. The feature extraction configuration item of each predetermined feature includes a source field item and a processing method item. The source field item defines the fields of the data record involved in that predetermined feature as source fields. The processing method item specifies a reference to a data processing function that has been programmed in advance as executable code, where the data processing function performs, on the field values of the source fields defined by the source field item, the data processing for extracting that predetermined feature.
In one example, the feature extraction configuration item of each predetermined feature further includes a processing parameter item corresponding to the processing method item; the processing parameter item defines the parameters used by the data processing function. Examples of processing parameter items include format parameters, extraction interval parameters, division threshold parameters, mapping rule parameters, and so on. By setting the parameter item independently of the processing method item, similar data processing functions can be effectively merged, so that a separate function need not be written for every parameter detail, which further improves the coding efficiency of data processing.
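The separation of the processing method item from the processing parameter item can be illustrated with a minimal sketch. The class name `FeatureConfigItem`, the function `bucketize` and the `width` parameter are hypothetical names chosen for illustration only; they do not come from the patent text. The point shown is that one pre-programmed function, referenced by name, can serve several features that differ only in their parameters.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FeatureConfigItem:
    """One feature extraction configuration item (attribute names are illustrative)."""
    feature: str                 # feature name
    depends: List[str]           # source field item
    method: str                  # processing method item: a reference, by name, to a function
    args: Dict[str, str] = field(default_factory=dict)  # processing parameter item


def bucketize(value: str, width: str = "10") -> str:
    """Hypothetical data processing function: lower bound of the interval containing value."""
    w = int(width)
    return str(int(float(value)) // w * w)


# Registry of pre-programmed data processing functions, looked up by reference name.
FUNCTIONS: Dict[str, Callable[..., str]] = {"Bucketize": bucketize}


def apply_item(item: FeatureConfigItem, record: Dict[str, str]) -> str:
    """Resolve the referenced function and run it on the source field values."""
    func = FUNCTIONS[item.method]
    values = [record[name] for name in item.depends]
    return func(*values, **item.args)


# Two features reuse the same function and differ only in the parameter item.
age_by_10 = FeatureConfigItem("F_AGE_10", ["age"], "Bucketize", {"width": "10"})
age_by_5 = FeatureConfigItem("F_AGE_5", ["age"], "Bucketize", {"width": "5"})
record = {"age": "37", "job": "technician"}
```

With this layout, adding a new interval width requires only a new configuration item, not a new function.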
In one example, the feature extraction configuration item of each predetermined feature may also include a storage location identifier, which indicates the storage region in memory of the calculation coefficients corresponding to the feature values of that predetermined feature. Taking machine learning as an example, the calculation coefficient (for example, a weight in a machine learning algorithm) corresponding to each feature value of each predetermined feature is stored at a corresponding location in memory. However, as the dimensionality of sample features keeps growing (even reaching hundreds of millions) while memory space remains very limited, it is difficult to allocate a one-to-one storage address for every feature value. The storage addresses of the calculation coefficients therefore need to be mapped into a limited address space so that they can be looked up; to this end, a hash transformation is applied to the storage location of the calculation coefficient corresponding to each feature value to obtain the mapped memory address. Hash transformation, however, introduces address collisions: different sample features may be confused with one another, which can introduce considerable error into the machine learning calculation. For this reason, the memory space may be partitioned based on the storage location identifier so that at least the calculation coefficients of different kinds of predetermined features are stored separately. For example, the storage location identifier may be designed as N bits (N being an integer) corresponding to the feature kind and used as the high-order byte(s) of the memory address, while the address after hash transformation is used as the low-order byte(s). The combined memory addresses are then separated in the memory space by feature kind, so that the calculation coefficients of different kinds of features do not mistakenly overwrite one another.
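The address scheme described above can be sketched as follows. This is only an illustration under stated assumptions: the bit widths `TYPE_BITS` and `HASH_BITS`, the function name `coefficient_address`, and the use of CRC32 as the hash are all choices made here for the example, not details given in the patent text.

```python
import zlib

TYPE_BITS = 4    # assumed N: width of the storage location identifier
HASH_BITS = 20   # assumed width of the hashed low-order part


def coefficient_address(feature_type_id: int, feature_value: str) -> int:
    """Combine a feature-kind identifier (high-order part) with a hashed value (low-order part)."""
    if not 0 <= feature_type_id < (1 << TYPE_BITS):
        raise ValueError("feature_type_id does not fit in TYPE_BITS")
    # CRC32 stands in for the hash transformation; keep only the low HASH_BITS bits.
    low = zlib.crc32(feature_value.encode("utf-8")) & ((1 << HASH_BITS) - 1)
    return (feature_type_id << HASH_BITS) | low


# The same raw value under two feature kinds lands in two disjoint address regions.
addr_age = coefficient_address(1, "35")
addr_year = coefficient_address(5, "35")
```

Collisions can still occur between values of the same feature kind, but a value of one kind can no longer overwrite the coefficient of another kind.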
In one example, the feature extraction configuration items are set in a configuration file in advance. Correspondingly, in step S120, the feature extraction configuration items are read from the configuration file in which they are set, for example, by parsing the configuration file. The configuration file may be stored locally or received remotely and, as an example, may be written manually by a software programmer. A pre-stored configuration file written by a programmer may be read from a local database, or the configuration file may be received from another device over a network. Here, suppose the extracted features are to be used for model training in machine learning; when writing the configuration file, the programmer then determines the features required by the model according to the model to be built and the actual modeling scenario, and designs a configuration item for each feature to obtain the configuration file. Alternatively, an interface (for example, a graphical user interface) for setting feature extraction configuration items may be presented to a software user, and the configuration file may be generated automatically according to the input operations the user performs on the interface. A method by which a user customizes feature extraction configuration items through such an interface is described in detail by way of example later with reference to the drawings.
In another example, the feature extraction configuration items may be obtained according to the user's input operations. As an example, the feature extraction configuration items may be obtained directly in real time during the feature extraction process without forming any configuration file; for example, a graphical user interface may pop up in real time while the program is running to guide the user in selecting feature extraction configuration items, thereby obtaining them.
According to an exemplary embodiment of the present invention, each feature extraction configuration item can be changed as needed independently of the feature extraction main program, so that feature extraction can be effectively "abstracted" and "expressed" according to the scenario. The feature extraction main program need not be substantially changed, while data processing functions can be flexibly written or added independently, which enhances programming flexibility and code reusability. Accordingly, for different databases, as long as the feature extraction configuration items are defined as needed, the same feature extraction main program and the corresponding data processing functions can be used, which enhances programming flexibility, maintainability and code reusability.
Fig. 2 shows an example of feature extraction configuration items stored in a configuration file.
The configuration file shown in Fig. 2 has 10 rows in total. The first row defines the 6 fields of the data record: age (age), job (job), housing (house), contact (contact person), birthday (birthday) and y (label). The second to tenth rows define the feature extraction configuration item for each feature, which may include a source field item and a processing method item. In addition, to manage the extraction of each feature more effectively, a feature name may be set for each feature, and a corresponding processing parameter item may be set for certain processing methods.
As shown in Fig. 2, each of the second to tenth rows is divided into three or four columns. The first column specifies the name of the extracted feature; as can be seen from Fig. 2, the 9 feature names are "F_AGE", "F_JOB", "F_HOUSING", "F_CONTACT", "F_YEAR", "F_MONTH", "F_DATE", "F_PROFILE" and ".label". For each feature, the second column specifies the corresponding source field item, that is, which field or fields of the data record the extracted feature originates from. The third column specifies the processing method item, that is, a reference to the intermediate processing that leads from the source fields to the output feature; this reference designates the data processing function to be called, which may be a pre-programmed software module, routine, library function, or the like. The fourth column (if present) specifies the parameters corresponding to the processing method item. Specifically, in the example shown in Fig. 2, "feature" sets the name of the extracted feature; "depends" sets the source fields of the extracted feature; "method" specifies which data processing should be performed on the source fields specified by "depends" to obtain the predetermined feature value (that is, which data processing function is called); and "args" sets the format of the data involved in the data processing method.
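How such a configuration file might be read can be sketched as below. The text names the columns (feature, depends, method, args) but does not reproduce the literal syntax of Fig. 2, so the file layout used here (a comma-separated field row followed by `key=value` columns separated by `;`) and the function name `parse_config` are assumptions made purely for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical configuration text; the real Fig. 2 syntax is not given in the description.
SAMPLE = """
age, job, housing, contact, birthday, y
feature=F_AGE; depends=age; method=Direct
feature=F_YEAR; depends=birthday; method=GetYearOfDate; args=YYYY-mm-dd
feature=F_PROFILE; depends=age,job; method=Combine
"""


def parse_config(text: str) -> Tuple[List[str], List[Dict[str, Optional[object]]]]:
    """Return the declared record fields and one dict per feature extraction configuration item."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    fields = [name.strip() for name in lines[0].split(",")]
    items = []
    for line in lines[1:]:
        # Each column is "key=value"; the fourth column (args) is optional.
        pairs = (col.split("=", 1) for col in line.split(";"))
        columns = {key.strip(): value.strip() for key, value in pairs}
        depends = [d.strip() for d in columns["depends"].split(",")]
        unknown = [d for d in depends if d not in fields]
        if unknown:
            raise ValueError(f"unknown source field(s): {unknown}")
        items.append({
            "feature": columns["feature"],
            "depends": depends,
            "method": columns["method"],
            "args": columns.get("args"),
        })
    return fields, items


fields, items = parse_config(SAMPLE)
```

Checking the source fields against the first row catches a misspelled field name at parse time rather than during extraction.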
Here, the processing method item is used to call the corresponding function that performs the predetermined extraction processing on the field values of the source fields. By way of example and not limitation, the data processing corresponding to some processing methods is given below; some of these processing methods are specific to the field of machine learning:
1. Direct (direct extraction): outputs the source field as is. Example: "1" -> "1".
2. ExpNormalizer (exponential discretization): outputs the base-2 logarithm of a numeric source field. Example: "2" -> "1".
3. Combine (field combination): joins multiple source fields with "|" as the separator and outputs the combination. Example: "1", "2" -> "1|2".
4. DataCalc (date interval): calculates the time interval between two dates (in days). Example: "1900-01-02", "1900-01-10" -> "9".
5. GetYearOfDate (year of date): extracts the year from a date field. Example: "1900-01-02" -> "1900".
6. GetMonthOfDate (month of date): extracts the month from a date field. Example: "1900-01-02" -> "01".
7. GetDayOfDate (day of date): extracts the day from a date field. Example: "1900-01-02" -> "02".
8. NumberFloor (round down): rounds a numeric field down. Example: "7.89" -> "7".
9. LabelDirect (numeric label): a sample labeling method in machine learning; the source field is output directly as the label, and the label must be an integer.
10. LabelBeta (field label): a sample labeling method in machine learning; if the source field contains "pos", the sample is labeled as a positive sample, otherwise as a negative sample.
11. LabelBinary (binary label): a sample labeling method in machine learning; if the source field is "1", the sample is labeled as a positive sample, otherwise as a negative sample.
It should be noted that the data field names, processing method definitions and so on described above in connection with Fig. 2 are merely illustrative and may be designed differently as needed.
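Several of the processing methods listed above can be sketched as plain functions held in a registry keyed by the name used in the processing method item. This is an illustrative sketch, not the patent's implementation: the Python function names, the `METHODS` registry, the rounding down in `exp_normalizer` (the text only says the base-2 logarithm is output) and the 1/0 encoding in `label_binary` are assumptions. DataCalc and the other label methods are left out.

```python
import math
from typing import Callable, Dict


def direct(value: str) -> str:
    """Direct: output the source field as is."""
    return value


def exp_normalizer(value: str) -> str:
    """ExpNormalizer: base-2 logarithm of a numeric field (rounding down is an assumption)."""
    return str(math.floor(math.log2(float(value))))


def combine(*values: str) -> str:
    """Combine: join several source fields with '|'."""
    return "|".join(values)


def get_year_of_date(value: str) -> str:
    """GetYearOfDate: year part of a YYYY-mm-dd date string."""
    return value.split("-")[0]


def get_month_of_date(value: str) -> str:
    """GetMonthOfDate: month part of a YYYY-mm-dd date string."""
    return value.split("-")[1]


def get_day_of_date(value: str) -> str:
    """GetDayOfDate: day part of a YYYY-mm-dd date string."""
    return value.split("-")[2]


def number_floor(value: str) -> str:
    """NumberFloor: round a numeric field down."""
    return str(math.floor(float(value)))


def label_binary(value: str) -> int:
    """LabelBinary: positive sample if the field is '1' (1/0 encoding is an assumption)."""
    return 1 if value == "1" else 0


# The processing method item refers to a function by its registered name.
METHODS: Dict[str, Callable] = {
    "Direct": direct,
    "ExpNormalizer": exp_normalizer,
    "Combine": combine,
    "GetYearOfDate": get_year_of_date,
    "GetMonthOfDate": get_month_of_date,
    "GetDayOfDate": get_day_of_date,
    "NumberFloor": number_floor,
    "LabelBinary": label_binary,
}
```

A new processing method is added by writing one function and registering it; the main program that looks functions up by name does not change.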
In one example, executable program code may be formed based on the configuration items set in a configuration file such as the one shown in Fig. 2. For example, a dedicated parsing program may parse the configuration items set in the configuration file to form the corresponding executable program code; when executed, this code performs on the source fields the data processing specified by the processing method item and assigns the obtained feature value to the defined feature. In one example, the executable program code obtained by parsing the configuration items may be saved as one complete structure for subsequent execution.
Returning to Fig. 1, after the feature extraction configuration item obtaining step S120 is completed, the method proceeds to step S130.
In step S130, data processing is performed on the field values of the data record based on the feature extraction configuration items to obtain the feature values of the predetermined features. Here, as an example, the executable program code obtained by parsing the configuration file may be run, or the feature extraction main program may be run according to feature extraction configuration items input in real time, so that the predetermined data processing is performed on the field values of the relevant source fields in the data record that has been read, to obtain the corresponding feature values.
Specifically, still taking the feature extraction configuration items shown in Fig. 2 as an example, executing step S130 outputs the age (age) field of each record as is and assigns it to feature F_AGE; similarly, the job (job) field, housing (housing) field and contact (contact) field of each record are output as is and assigned to features F_JOB, F_HOUSING and F_CONTACT respectively; the year (YYYY), month (mm) and day (dd) of the birthday (birthday) field in YYYY-mm-dd format are extracted and assigned to features F_YEAR, F_MONTH and F_DATE; the age (age) field and the job (job) field are output together as is and assigned to feature F_PROFILE; and the label (y) field is output directly to the feature label.
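The behaviour just described for step S130 can be sketched as a small main loop that walks the configuration items and applies the referenced function to the source field values. The `CONFIG` list mirrors the nine features named in the text, but the in-memory tuple layout, the sample record values and the use of a plain pass-through for the label are assumptions for illustration; the actual content of Fig. 2 is not reproduced here.

```python
from typing import Callable, Dict, List, Tuple

# Minimal registry of data processing functions (see the processing method list in the text).
METHODS: Dict[str, Callable[..., str]] = {
    "Direct": lambda value: value,
    "Combine": lambda *values: "|".join(values),
    "GetYearOfDate": lambda value: value.split("-")[0],
    "GetMonthOfDate": lambda value: value.split("-")[1],
    "GetDayOfDate": lambda value: value.split("-")[2],
}

# (feature name, source fields, processing method) for each configuration item.
CONFIG: List[Tuple[str, List[str], str]] = [
    ("F_AGE", ["age"], "Direct"),
    ("F_JOB", ["job"], "Direct"),
    ("F_HOUSING", ["housing"], "Direct"),
    ("F_CONTACT", ["contact"], "Direct"),
    ("F_YEAR", ["birthday"], "GetYearOfDate"),
    ("F_MONTH", ["birthday"], "GetMonthOfDate"),
    ("F_DATE", ["birthday"], "GetDayOfDate"),
    ("F_PROFILE", ["age", "job"], "Combine"),
    (".label", ["y"], "Direct"),  # pass-through label; the method used in Fig. 2 is not stated
]


def extract_features(record: Dict[str, str],
                     config: List[Tuple[str, List[str], str]] = CONFIG) -> Dict[str, str]:
    """Feature extraction main loop: one feature value per configuration item."""
    features = {}
    for feature, depends, method in config:
        func = METHODS[method]
        features[feature] = func(*(record[name] for name in depends))
    return features


# Hypothetical data record with the six fields named in the text.
record = {"age": "35", "job": "technician", "housing": "yes",
          "contact": "cellular", "birthday": "1984-03-21", "y": "y"}
features = extract_features(record)
```

The loop itself contains no feature-specific logic, so changing which features are extracted only requires changing `CONFIG`.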
The features extracted in this way can be combined into a feature vector, or combined with other features to form a feature vector. These feature vectors can be used for any subsequent data statistics, analysis, calculation and/or other processing.
As an example, the feature vectors can be used as training samples in machine learning. The above feature extraction is performed on every data record to form a training sample set, which can be applied to a machine learning algorithm or other algorithms for data mining.
According to an exemplary embodiment of the present invention, owing to the compact structure of the abstracted configuration items, data dependence is confined to the data record currently being processed. Correspondingly, a data record table can simply be split into row-based file shards, and feature extraction can then be performed on the resulting row shards in parallel. That is, in the feature value obtaining step, data processing may be performed in parallel on individual data records or on groups each made up of multiple data records. For example, in one example, in the feature value obtaining step, feature extraction is performed on the data records row by row, that is, the columns of each data record are traversed and data processing is performed according to the configured feature source fields and processing methods. Here, as an example, in an offline application scenario in which feature extraction is performed on historical data, a distributed computing cluster may be used to perform the feature value obtaining step on each row.
Fig. 3 shows an example of performing the feature extraction process in a distributed manner when the data records are sample data in machine learning. The sample data source may be a data record table in which each row is one data record and each column corresponds to one field. Here, the data record table may be split into row-based file shards to obtain individual row shards, and the feature values of the row shards may then be obtained in parallel. For example, the feature values of the row shards may be extracted in parallel by the working nodes of a distributed computing cluster.
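Because each row depends only on its own field values, row-sharded parallel extraction can be sketched on a single machine. In this sketch a thread pool from the Python standard library stands in for the working nodes of a distributed computing cluster, which is an assumption made to keep the example self-contained; the function names, the shard size and the two illustrative features are likewise hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

Row = Dict[str, str]


def split_rows(rows: List[Row], shard_size: int) -> List[List[Row]]:
    """Row-based splitting of a data record table into shards."""
    return [rows[i:i + shard_size] for i in range(0, len(rows), shard_size)]


def extract_row(row: Row) -> Row:
    """Per-row extraction; it reads only the current row, so rows are independent."""
    return {"F_AGE": row["age"], "F_PROFILE": row["age"] + "|" + row["job"]}


def extract_shard(shard: List[Row]) -> List[Row]:
    """Work done by one worker: extract features for every row in its shard."""
    return [extract_row(row) for row in shard]


def extract_parallel(rows: List[Row], shard_size: int = 2, workers: int = 4) -> List[Row]:
    """Extract features shard by shard in parallel, preserving the original row order."""
    shards = split_rows(rows, shard_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in the order of the input shards.
        results = list(pool.map(extract_shard, shards))
    return [features for shard in results for features in shard]


rows = [{"age": str(20 + i), "job": "job%d" % i} for i in range(5)]
parallel_result = extract_parallel(rows)
sequential_result = [extract_row(row) for row in rows]
```

The parallel and sequential results are identical, which is the property that makes row-based sharding safe for this kind of extraction.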
In another example, in addition to performing feature extraction on the rows in parallel, data processing may also be performed in parallel within a row, feature by feature, to obtain the feature value of each feature.
It should be noted that, in Fig. 1, the data record obtaining step and the feature extraction configuration item obtaining step are listed one after the other in space, but this does not imply a temporal order. In fact, the order in which the data record obtaining step and the feature extraction configuration item obtaining step are executed is not limited; as long as the logical relationship of the context is not violated, the steps may be performed in parallel or in the reverse order.
An example of a method by which a user sets feature extraction configuration items through a graphical user interface according to an embodiment of the present invention is described below with reference to the drawings. It should be noted that the graphical user interface here is only an example, and the present invention may use any other form of input interface. The feature extraction configuration items set through the interface may be used to form a corresponding configuration file from which each feature extraction configuration item is subsequently read, or they may be applied directly to the feature extraction main program without generating any configuration file.
Fig. 4A shows an example of a graphical user interface 200 for configuring feature extraction according to an exemplary embodiment of the present invention. The graphical user interface 200 of Fig. 4A may be applied to a modeling platform for model training and, after suitable modification, to any other feature extraction scenario. Here, the input table 201 "bank basic data" may represent the raw data of a bank, the target value 202 "y" represents the label of the training samples, and the output table 203 "bankdata_out" represents the extracted feature table.
The above graphical user interface 200 may display at least the fields of the data record that can serve as source fields and the feature extraction configuration items of the predetermined features that have been set. In addition, as an example, other information about the data source or the data output may also be displayed. Specifically, as shown in Fig. 4A, the left area shows the fields of the data record in the input table, including the field name 204 and the field attribute 205; the right area shows the configuration page for configuring features. As an example, the configuration page may include a selection-input interface that displays the content options of the feature extraction configuration items for manual selection, in which each row corresponds to one specific feature and is configured with the source item 206, processing method 207 and feature name 208 of that feature.
As an example, according to the user's setting operations on the fields shown in the left area, the feature configuration items set by the user may be displayed correspondingly in the right area. In one example, the user may manually edit the configuration items displayed in the right area.
Specifically, the fields of the data record may first be displayed on the graphical user interface (for example, in the left area). When the user selects (for example, by clicking) one or more of the displayed fields, the selected field or fields are set as the source field in the configuration page, and, at the same time as the source field is selected, a processing method list is displayed on the graphical user interface. Here, as an example, the processing method list may be displayed near the source field selected by the user so that the user can select from it the processing method to be shown in the configuration page. In the processing method list, all processing methods may be in an active state; alternatively, the list may include only the processing methods applicable to the selected source field item; alternatively, the list may include all processing methods, with the applicable processing methods shown as active and the inapplicable ones shown as disabled.
Fig. 4B shows an example of a partial graphical user interface 300 in which a processing method list 302 is displayed to the user when a single field (for example, the "age" field) 301 in the left area is selected by the user. For example, when the user clicks the "age" field 301, the processing method list 302 pops up to the right of, and near, the "age" field for selection. All processing methods may be listed in the processing method list 302, with the processing method currently selected by the user highlighted. Alternatively, only the processing methods applicable to the selected "age" field may be displayed in the processing method list 302, or only the processing methods applicable to the selected "age" field may be activated in the processing method list 302 (for example, shown as selectable or highlighted) while the other processing methods are shown as disabled.
Fig. 4C shows an example of a partial graphical user interface 400 in which a processing method list 404 is displayed to the user when multiple fields 401, 402 and 403 in the left area are selected by the user. This indicates that the user may select more than one source field 401, 402 and 403 on the left; correspondingly, the processing method list 404 may pop up so that the user can choose the processing method to be applied to these source fields. Similarly, the processing method list 404 may be popped up in any appropriate manner and need not include all processing methods; accordingly, the processing methods shown in the processing method list 404 may be adjusted dynamically according to the source fields selected on the left.
Besides the above selection-input interface, which displays the content options of the feature extraction configuration items for manual selection (for example, by mouse click), other forms of interface may be used to set feature extraction configuration items. One example is a text editing interface for manually editing the configuration file, which allows the user to write the "configuration file" directly in the text editing interface. Because the content of a configuration file is repetitive, the "configuration file" can be written quickly through text editing operations (for example, copy, paste, drag, and so on).
Fig. 5 shows an exemplary graphical user interface 500 with an area in which the feature extraction configuration items can be edited as text. The left side of the graphical user interface 500 is similar to the graphical user interfaces shown in Fig. 4B and Fig. 4C, except that the right area of the graphical user interface 500 shows a text editing interface 501 for manually editing the configuration file. In the text editing interface 501, the user can manually edit the feature extraction configuration items, including the feature name item, source field item, processing method item, processing parameter item, and so on. Through text editing operations performed in the text editing interface (such as copy, paste, drag, and so on), the user can set the feature extraction configuration items efficiently.
The above two graphical user interfaces may be displayed on the screen at the same time, or may be displayed on the screen separately according to the user's choice. For example, in response to an interface switching operation input by the user, the display is switched (display switching or activation switching) between the text editing interface and the selection-input interface, and the feature extraction configuration item settings made in the interface before the switch are displayed synchronously in the interface after the switch. Correspondingly, the user can exploit the operational convenience of both configuration interfaces to set multiple feature extraction modes more effectively. For example, the user may first complete representative feature extraction configurations in the selection-input interface through selection input such as clicking, and then switch to the text editing interface; because the previously set results are displayed synchronously in the text editing interface, the user can quickly complete the extraction settings for a large number of features with operations such as copy and paste.
The above feature extraction approach can be applied to any suitable scenario; it is described below taking the field of machine learning as an example.
In the existing field of machine learning, training, testing or applying a model based on a large amount of structured or unstructured data often requires considerable manpower in the feature engineering stage; for example, programmers need to write the extraction code of each feature in advance for specific feature extraction rules. Correspondingly, in modeling products provided for customers, such as modeling platforms, the data input to the modeling platform often has to be training data that has already been extracted (that is, already extracted feature vectors), and it is difficult for users to flexibly set or adjust the objects and rules of feature extraction, which limits the use of the modeling platform.
According to another embodiment of the present invention, a machine learning method using the above feature extraction method is provided. The machine learning method can be applied to systems, such as modeling platforms, that make it convenient for users (for example, business personnel) to carry out data modeling. It is described below with reference to Fig. 6.
Fig. 6 shows an overall flowchart of a machine learning method 600 according to an embodiment of the present invention that applies the feature extraction method of the above embodiment. Here, as an example, the program implementing the method may be packaged as a dedicated software package (for example, a lib library). For example, steps S610, S620 and S630 may be packaged as a separate software package, so that a consistent feature extraction service is obtained whether the package is called online or offline, which overcomes the prior-art defect that feature extraction results are inconsistent because the online and offline environments differ. In addition, step S640 may also be packaged as a separate software package, so that machine learning can be performed based on the extracted features whether that package is called online or offline.
In step S610, a data record obtaining step is performed to obtain data records.
In step S620, a feature extraction configuration item obtaining step is performed to obtain feature extraction configuration items that define how predetermined features are extracted from the data record. The feature extraction configuration item of each predetermined feature includes a source field item and a processing method item. The source field item defines the fields of the data record involved in that predetermined feature as source fields. The processing method item specifies a reference to a data processing function that has been programmed in advance as executable code, where the data processing function performs, on the field values of the source fields defined by the source field item, the data processing for extracting that predetermined feature.
In step S630, a feature value obtaining step is performed, in which data processing is performed on the field values of the data record based on the feature extraction configuration items to obtain the feature values of the predetermined features.
For the specific implementation and function of the above steps S610, S620 and S630, reference may be made to steps S110, S120 and S130 described in connection with Fig. 1, which are not repeated here.
In step S640, a sample obtaining step is performed, in which a feature vector is formed based at least in part on the feature values obtained in the feature value obtaining step and serves as a sample for machine learning.
In one example, this feature extraction on the data records yields all dimensions of the feature vector; that is, for each data record, data processing is performed on the relevant fields of the data record based on the feature extraction configuration items to obtain the feature value of each feature, and the feature values of all these dimensions are combined to form a complete machine learning sample.
In another example, the feature values obtained may cover only some of the dimensions and may be combined with features of other dimensions to form the final feature vector. The form and source of the other features are not limited here; they may come from outside, or may be obtained locally using a similar or a different feature extraction method.
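The two cases described above (extracted feature values forming the whole vector, or only part of it) can be sketched with one small helper. The function name `build_sample`, the `.label` key and the idea of fixing the vector layout with an explicit `feature_order` list are assumptions for illustration; the text does not prescribe how the feature vector is assembled.

```python
from typing import Dict, List, Optional, Tuple


def build_sample(extracted: Dict[str, str],
                 feature_order: List[str],
                 other_features: Optional[Dict[str, str]] = None,
                 label_key: str = ".label") -> Tuple[List[str], Optional[str]]:
    """Form (feature vector, label) from extracted values plus optional features from elsewhere."""
    merged = dict(extracted)
    if other_features:
        # Features of other dimensions, for example supplied externally.
        merged.update(other_features)
    label = merged.pop(label_key, None)
    # A fixed order makes every sample use the same dimension layout.
    vector = [merged[name] for name in feature_order]
    return vector, label


extracted = {"F_AGE": "35", "F_JOB": "technician", ".label": "y"}

# Case 1: the extracted features are all the dimensions of the vector.
full_sample = build_sample(extracted, ["F_AGE", "F_JOB"])

# Case 2: the extracted features are combined with a feature from another source.
combined_sample = build_sample(extracted, ["F_AGE", "F_JOB", "F_REGION"],
                               other_features={"F_REGION": "north"})
```

Keeping the label out of the vector lets the same helper serve labeled training records and unlabeled records to be predicted.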
In step S650, a machine learning step is performed, in which machine learning is carried out based on the samples. Here, at least one of model training, model testing and model application may be carried out based on the samples.
Here, when model training is carried out, the machine learning algorithm used is not particularly limited and may be any of various machine learning methods such as neural networks, Bayesian networks, support vector machines, decision trees, genetic algorithms and expert systems. It should be noted that, after a model has been built based on the training data, test samples may be obtained by applying the same feature extraction method to the data records used to test model performance; these test samples are input into the trained model so that the performance of the model can be judged. In addition, application samples may be obtained by applying the same feature extraction method to the data records to be predicted with the model, and the application samples are input into the model to obtain the corresponding prediction results. The problem addressed by the model is not limited here; depending on the task performed, it may be, for example, judging whether a factory workpiece is defective, judging whether the factory environment is safe, judging a person's creditworthiness, and so on.
The machine learning method according to the embodiment of the present invention uses the feature extraction method according to the embodiment of the present invention. It is particularly suitable for feature extraction and sample set acquisition on big data, and also makes it convenient for users of a modeling platform to take part directly in each stage of machine learning, for example, building, training and applying a model.
According to another embodiment of the present invention, a feature extraction device for performing feature extraction on data records is provided, including: a data record obtaining unit configured to obtain data records; a feature extraction configuration item obtaining unit configured to obtain feature extraction configuration items that define how predetermined features are extracted from the data record, where the feature extraction configuration item of each predetermined feature includes a source field item and a processing method item, the source field item defines the fields of the data record involved in that predetermined feature as source fields, and the processing method item specifies a reference to a data processing function that has been programmed in advance as executable code, the data processing function performing, on the field values of the source fields defined by the source field item, the data processing for extracting that predetermined feature; and a feature value obtaining unit configured to perform data processing on the field values of the data record based on the feature extraction configuration items to obtain the feature values of the predetermined features.
According to another embodiment of the present invention, a machine learning device is provided, which may include: a data record acquisition unit configured to acquire a data record; a feature extraction configuration item acquisition unit configured to acquire feature extraction configuration items that define how predetermined features are extracted from the data record, wherein the feature extraction configuration item of each predetermined feature includes a source field item and a processing method item, the source field item defines the fields of the data record involved in that predetermined feature as source fields, and the processing method item specifies a reference to a data processing function programmed in advance as executable code, wherein the data processing function performs, on the field values of the source fields defined by the source field item, the data processing for extracting that predetermined feature; a feature value acquisition unit configured to perform data processing on the field values of the data record based on the feature extraction configuration items to obtain the feature values of the predetermined features; a training sample acquisition unit configured to form a feature vector, based at least in part on the feature values obtained by the feature value acquisition unit, as a sample for machine learning; and a machine learning unit configured to perform machine learning based on the sample.
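The last two units (forming a feature vector from feature values, then learning from the samples) might look like the sketch below. The specification does not name a learning algorithm; logistic regression trained by stochastic gradient descent is used here purely as an assumed stand-in, and the feature names, labels, and hyperparameters are hypothetical.

```python
import math
from typing import Dict, List, Tuple

FEATURE_ORDER = ["age_bucket", "is_member"]  # hypothetical feature names


def to_vector(feature_values: Dict[str, float]) -> List[float]:
    """Arrange extracted feature values in a fixed order to form a feature vector."""
    return [float(feature_values[name]) for name in FEATURE_ORDER]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights: List[float], x: List[float]) -> float:
    """Return the predicted probability of label 1; the bias is the last weight."""
    z = sum(w * xi for w, xi in zip(weights, x)) + weights[-1]
    return sigmoid(z)


def train(samples: List[Tuple[List[float], int]],
          epochs: int = 500, lr: float = 0.1) -> List[float]:
    """Fit logistic-regression weights by stochastic gradient descent."""
    n = len(samples[0][0])
    weights = [0.0] * (n + 1)
    for _ in range(epochs):
        for x, y in samples:
            error = predict(weights, x) - y
            for i in range(n):
                weights[i] -= lr * error * x[i]
            weights[-1] -= lr * error
    return weights


# Feature values as they might come out of the feature value acquisition unit,
# paired with hypothetical labels.
extracted = [
    ({"age_bucket": 2, "is_member": 1}, 1),
    ({"age_bucket": 5, "is_member": 0}, 0),
    ({"age_bucket": 3, "is_member": 1}, 1),
    ({"age_bucket": 4, "is_member": 0}, 0),
]
samples = [(to_vector(values), label) for values, label in extracted]
weights = train(samples)
print([round(predict(weights, x), 3) for x, _ in samples])
```

Any other learner could take the place of `train`; the point of the sketch is only that the learner consumes feature vectors and never touches the raw data record.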
It should be noted that the above feature extraction device and machine learning device may rely entirely on the execution of a computer program to realize the corresponding functions; that is, each unit corresponds to a step in the functional architecture of the computer program, so that the whole device can be invoked through a dedicated software package (for example, a lib library) to realize the corresponding feature extraction or machine learning function online or offline.
Alternatively, each of the above units may be implemented in hardware, software, firmware, middleware, microcode, or any combination thereof. When implemented in software, firmware, middleware, or microcode, the program code or code segments for performing the necessary tasks may be stored in a computer-readable medium such as a storage medium, and a processor may perform the necessary tasks.
Here, an embodiment of the present invention may be implemented as a computing device including a storage unit and a processor, wherein the storage unit stores a set of computer-executable instructions which, when executed by the processor, performs the above feature extraction method and/or machine learning method.
FIG. 7 shows a configuration block diagram of a computing device 1100 according to an embodiment of the present invention.
As shown in FIG. 7, the computing device 1100 includes a central processing unit 1110, a memory 1130, a display 1140, a network interface 1150, and an input device 1200 that can be connected in a wired or wireless manner. The memory 1130, the display 1140, the network interface 1150, and the input device 1200 are connected to the central processing unit 1110 via a bus 1120. The memory 1130 includes an internal memory 1131 and an external memory 1132. During normal operation of the computing device 1100, the operating system and various application programs reside in the internal memory 1131; the external memory 1132 may be a ROM, a hard disk, or a solid-state disk, and may store the BIOS, data, programs, and the like.
The memory stores a set of computer instructions capable of implementing the feature extraction method and/or machine learning method of the embodiments of the present invention; when the set of computer instructions is executed by the central processing unit, the feature extraction method and/or machine learning method according to the embodiments of the present invention is performed. It should be noted that the central processing unit here may be a physically or logically distributed computing cluster and is not limited to a single computing machine.
In particular, according to an embodiment of the present invention, a computing device for performing feature extraction on a data record is provided, including a storage unit and a processor, wherein the storage unit stores a set of computer-executable instructions which, when executed by the processor, performs the following steps: a data record acquisition step of acquiring a data record; a feature extraction configuration item acquisition step of acquiring feature extraction configuration items that define how predetermined features are extracted from the data record, wherein the feature extraction configuration item of each predetermined feature includes a source field item and a processing method item, the source field item defines the fields of the data record involved in that predetermined feature as source fields, and the processing method item specifies a reference to a data processing function programmed in advance as executable code, wherein the data processing function performs, on the field values of the source fields defined by the source field item, the data processing for extracting that predetermined feature; and a feature value acquisition step of performing data processing on the field values of the data record based on the feature extraction configuration items to obtain the feature values of the predetermined features.
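As one possible reading of the feature extraction configuration item acquisition step, the items could be read from a configuration file and validated before use. The sketch below assumes a JSON layout, which the specification does not prescribe; the key names, the set `KNOWN_METHODS`, and the optional `slot` field (standing in for a storage location identifier) are all hypothetical. The configuration is parsed from an in-memory string so that the example is self-contained; a real file would be read the same way after opening it.

```python
import json
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical names of data processing functions that already exist as code.
KNOWN_METHODS = {"identity", "bucketize", "combine"}


@dataclass
class ConfigItem:
    feature: str
    source_fields: List[str]    # source field item
    method: str                 # processing method item (a function reference)
    slot: Optional[int] = None  # hypothetical storage location identifier


def load_config_items(text: str) -> List[ConfigItem]:
    """Parse configuration text and reject references to unknown functions."""
    items: List[ConfigItem] = []
    for raw in json.loads(text):
        if raw["method"] not in KNOWN_METHODS:
            raise ValueError(f"unknown processing method: {raw['method']}")
        items.append(ConfigItem(raw["feature"], list(raw["source_fields"]),
                                raw["method"], raw.get("slot")))
    return items


CONFIG_TEXT = """
[
  {"feature": "age_bucket", "source_fields": ["age"], "method": "bucketize", "slot": 0},
  {"feature": "city_gender", "source_fields": ["city", "gender"], "method": "combine"}
]
"""

items = load_config_items(CONFIG_TEXT)
for item in items:
    print(item.feature, item.source_fields, item.method, item.slot)
```

Validating the method name at load time means a typo in the configuration is reported before any data record is processed.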
According to an embodiment of the present invention, a computing device for performing machine learning is provided, including a storage unit and a processor, wherein the storage unit stores a set of computer-executable instructions which, when executed by the processor, performs the following steps: a data record acquisition step of acquiring a data record; a feature extraction configuration item acquisition step of acquiring feature extraction configuration items that define how predetermined features are extracted from the data record, wherein the feature extraction configuration item of each predetermined feature includes a source field item and a processing method item, the source field item defines the fields of the data record involved in that predetermined feature as source fields, and the processing method item specifies a reference to a data processing function programmed in advance as executable code, wherein the data processing function performs, on the field values of the source fields defined by the source field item, the data processing for extracting that predetermined feature; a feature value acquisition step of performing data processing on the field values of the data record based on the feature extraction configuration items to obtain the feature values of the predetermined features; a sample acquisition step of forming a feature vector, based at least in part on the feature values obtained in the feature value acquisition step, as a sample for machine learning; and a machine learning step of performing machine learning based on the sample.
Various embodiments of the present invention have been described above. The above description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (10)
1. A method for performing feature extraction on a data record, comprising:
a data record acquisition step of acquiring a data record;
a feature extraction configuration item acquisition step of acquiring feature extraction configuration items that define how predetermined features are extracted from the data record, wherein the feature extraction configuration item of each predetermined feature includes a source field item and a processing method item, the source field item defines the fields of the data record involved in that predetermined feature as source fields, and the processing method item specifies a reference to a data processing function programmed in advance as executable code, wherein the data processing function performs, on the field values of the source fields defined by the source field item, the data processing for extracting that predetermined feature; and
a feature value acquisition step of performing data processing on the field values of the data record based on the feature extraction configuration items to obtain the feature values of the predetermined features.
2. The method according to claim 1, wherein the feature extraction configuration item acquisition step comprises: reading the feature extraction configuration items from a configuration file in which feature extraction configuration items are set, or acquiring the feature extraction configuration items according to an input operation of a user, wherein the configuration file is stored locally or received remotely.
3. The method according to claim 1, wherein the feature extraction configuration item acquisition step comprises:
displaying to a user an interface for setting feature extraction configuration items;
generating, according to an input operation performed by the user on the interface, a configuration file in which the feature extraction configuration items are set; and
reading the feature extraction configuration items from the generated configuration file.
4. The method according to claim 3, wherein the interface for setting feature extraction configuration items is a graphical user interface that includes a text editing interface for manually editing the configuration file and/or a selection-guided interface that displays content options of the feature extraction configuration items for manual selection.
5. The method according to claim 4, wherein, in the feature extraction configuration item acquisition step, the display switches between the text editing interface and the selection-guided interface in response to an interface switching operation input by the user, and the feature extraction configuration item settings made under the interface before switching are synchronously displayed under the interface after switching.
6. The method according to any one of claims 1 to 5, wherein the feature extraction configuration item of each predetermined feature further includes a storage location identifier indicating the storage region in memory of the calculation coefficients corresponding to the feature values of that predetermined feature.
7. A computer-implemented machine learning method, comprising:
a data record acquisition step of acquiring a data record;
a feature extraction configuration item acquisition step of acquiring feature extraction configuration items that define how predetermined features are extracted from the data record, wherein the feature extraction configuration item of each predetermined feature includes a source field item and a processing method item, the source field item defines the fields of the data record involved in that predetermined feature as source fields, and the processing method item specifies a reference to a data processing function programmed in advance as executable code, wherein the data processing function performs, on the field values of the source fields defined by the source field item, the data processing for extracting that predetermined feature;
a feature value acquisition step of performing data processing on the field values of the data record based on the feature extraction configuration items to obtain the feature values of the predetermined features;
a sample acquisition step of forming a feature vector, based at least in part on the feature values obtained in the feature value acquisition step, as a sample for machine learning; and
a machine learning step of performing machine learning based on the sample.
8. A computing device for performing feature extraction on a data record, including a storage unit and a processor, wherein the storage unit stores a set of computer-executable instructions which, when executed by the processor, performs the following steps:
a data record acquisition step of acquiring a data record;
a feature extraction configuration item acquisition step of acquiring feature extraction configuration items that define how predetermined features are extracted from the data record, wherein the feature extraction configuration item of each predetermined feature includes a source field item and a processing method item, the source field item defines the fields of the data record involved in that predetermined feature as source fields, and the processing method item specifies a reference to a data processing function programmed in advance as executable code, wherein the data processing function performs, on the field values of the source fields defined by the source field item, the data processing for extracting that predetermined feature; and
a feature value acquisition step of performing data processing on the field values of the data record based on the feature extraction configuration items to obtain the feature values of the predetermined features.
9. A computing device for performing machine learning, including a storage unit and a processor, wherein the storage unit stores a set of computer-executable instructions which, when executed by the processor, performs the following steps:
a data record acquisition step of acquiring a data record;
a feature extraction configuration item acquisition step of acquiring feature extraction configuration items that define how predetermined features are extracted from the data record, wherein the feature extraction configuration item of each predetermined feature includes a source field item and a processing method item, the source field item defines the fields of the data record involved in that predetermined feature as source fields, and the processing method item specifies a reference to a data processing function programmed in advance as executable code, wherein the data processing function performs, on the field values of the source fields defined by the source field item, the data processing for extracting that predetermined feature;
a feature value acquisition step of performing data processing on the field values of the data record based on the feature extraction configuration items to obtain the feature values of the predetermined features;
a sample acquisition step of forming a feature vector, based at least in part on the feature values obtained in the feature value acquisition step, as a sample for machine learning; and
a machine learning step of performing machine learning based on the sample.
10. A feature extraction device for performing feature extraction on a data record, comprising:
a data record acquisition unit configured to acquire a data record;
a feature extraction configuration item acquisition unit configured to acquire feature extraction configuration items that define how predetermined features are extracted from the data record, wherein the feature extraction configuration item of each predetermined feature includes a source field item and a processing method item, the source field item defines the fields of the data record involved in that predetermined feature as source fields, and the processing method item specifies a reference to a data processing function programmed in advance as executable code, wherein the data processing function performs, on the field values of the source fields defined by the source field item, the data processing for extracting that predetermined feature; and
a feature value acquisition unit configured to perform data processing on the field values of the data record based on the feature extraction configuration items to obtain the feature values of the predetermined features.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910743847.1A CN110442417A (en) | 2016-01-08 | 2016-01-08 | Feature Extraction Method, machine learning method and its device |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610011587.5A CN105677353A (en) | 2016-01-08 | 2016-01-08 | Feature extraction method and machine learning method and device thereof |
CN201910743847.1A CN110442417A (en) | 2016-01-08 | 2016-01-08 | Feature Extraction Method, machine learning method and its device |
Related Parent Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610011587.5A Division CN105677353A (en) | 2016-01-08 | 2016-01-08 | Feature extraction method and machine learning method and device thereof |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110442417A (en) | 2019-11-12 |
Family
ID=56299543
Family Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910743847.1A Pending CN110442417A (en) | 2016-01-08 | 2016-01-08 | Feature Extraction Method, machine learning method and its device |
CN201610011587.5A Pending CN105677353A (en) | 2016-01-08 | 2016-01-08 | Feature extraction method and machine learning method and device thereof |
Family Applications After (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610011587.5A Pending CN105677353A (en) | 2016-01-08 | 2016-01-08 | Feature extraction method and machine learning method and device thereof |
Country Status (1)
Country | Link |
---|---|
CN (2) | CN110442417A (en) |
Families Citing this family (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113610239B (en) * | 2016-09-27 | 2024-04-12 | 第四范式(北京)技术有限公司 | Feature processing method and feature processing system for machine learning |
CN108154237B (en) * | 2016-12-06 | 2022-04-05 | 华为技术有限公司 | Data processing system and method |
CN106779088B (en) * | 2016-12-06 | 2019-04-23 | 第四范式(北京)技术有限公司 | Method and system for executing a machine learning process |
CN107169574A (en) * | 2017-05-05 | 2017-09-15 | 第四范式(北京)技术有限公司 | Method and system for performing prediction using a nested machine learning model |
CN107169573A (en) * | 2017-05-05 | 2017-09-15 | 第四范式(北京)技术有限公司 | Method and system for performing prediction using a composite machine learning model |
CN107273131A (en) * | 2017-06-22 | 2017-10-20 | 艾凯克斯(嘉兴)信息科技有限公司 | Machine learning method applied to a configurable BOM |
CN113220688A (en) * | 2017-07-04 | 2021-08-06 | 第四范式(北京)技术有限公司 | Method and device for splicing data records |
CN107766946B (en) * | 2017-09-28 | 2020-06-23 | 第四范式(北京)技术有限公司 | Method and system for generating combined features of machine learning samples |
CN108008942B (en) * | 2017-11-16 | 2020-04-07 | 第四范式(北京)技术有限公司 | Method and system for processing data records |
CN108090516A (en) * | 2017-12-27 | 2018-05-29 | 第四范式(北京)技术有限公司 | Method and system for automatically generating features of machine learning samples |
CN108228861B (en) * | 2018-01-12 | 2020-09-01 | 第四范式(北京)技术有限公司 | Method and system for performing feature engineering for machine learning |
CN108681426B (en) * | 2018-05-25 | 2020-08-11 | 第四范式(北京)技术有限公司 | Method and system for performing feature processing on data |
CN110209902B (en) * | 2018-08-17 | 2023-11-14 | 第四范式(北京)技术有限公司 | Method and system for visualizing feature generation process in machine learning process |
CN109144648B (en) * | 2018-08-21 | 2020-06-23 | 第四范式(北京)技术有限公司 | Method and system for uniformly performing feature extraction |
CN111273953B (en) * | 2018-11-19 | 2021-07-16 | Oppo广东移动通信有限公司 | Model processing method, device, terminal and storage medium |
CN110427222A (en) * | 2019-06-24 | 2019-11-08 | 北京达佳互联信息技术有限公司 | Data loading method, device, electronic equipment and storage medium |
CN110334131A (en) * | 2019-07-09 | 2019-10-15 | 西安点告网络科技有限公司 | Method and apparatus for feature extraction for a machine learning model |
CN110569271B (en) * | 2019-09-17 | 2022-11-15 | 第四范式(北京)技术有限公司 | Data processing method and system for extracting features |
CN110633078B (en) * | 2019-09-20 | 2020-12-15 | 第四范式(北京)技术有限公司 | Method and device for automatically generating feature calculation codes |
CN110795424A (en) * | 2019-09-30 | 2020-02-14 | 北京淇瑀信息科技有限公司 | Feature engineering variable data request processing method and device and electronic equipment |
CN110851500B (en) * | 2019-11-07 | 2022-10-28 | 北京集奥聚合科技有限公司 | Method for generating expert characteristic dimension required by machine learning modeling |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060271533A1 (en) * | 2005-05-26 | 2006-11-30 | Kabushiki Kaisha Toshiba | Method and apparatus for generating time-series data from Web pages |
CN101958987A (en) * | 2009-07-14 | 2011-01-26 | 中国电信股份有限公司 | Method and system for dynamically converting telecommunications service data |
CN102243649A (en) * | 2011-06-07 | 2011-11-16 | 上海交通大学 | Semi-automatic information extraction processing device of ontology |
CN102622354A (en) * | 2011-01-27 | 2012-08-01 | 北京世纪读秀技术有限公司 | Aggregated data quick searching method based on feature vector |
CN103914478A (en) * | 2013-01-06 | 2014-07-09 | 阿里巴巴集团控股有限公司 | Webpage training method and system and webpage prediction method and system |
CN104424263A (en) * | 2013-08-29 | 2015-03-18 | 腾讯科技(深圳)有限公司 | Data recording method and data recording device |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100280990A1 (en) * | 2009-04-30 | 2010-11-04 | Castellanos Maria G | Etl for process data warehouse |
CN101763261B (en) * | 2009-12-28 | 2013-01-23 | 山东中创软件商用中间件股份有限公司 | Method and system for extracting, converting and loading data |
CN104881488B (en) * | 2015-06-05 | 2017-04-05 | 焦点科技股份有限公司 | Configurable information extraction method based on relation table |
Also Published As
Publication number | Publication date |
---|---|
CN105677353A (en) | 2016-06-15 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||