CN108960269A - Feature acquisition method, apparatus, and computing device for a data set

Info

Publication number
CN108960269A
Authority
CN
China
Prior art keywords
data set
feature
deep learning
learning model
sample data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810284529.9A
Other languages
Chinese (zh)
Other versions
CN108960269B (en
Inventor
袁锦程
赵闻飙
王维强
许辽萨
叶芸
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Advantageous New Technologies Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201810284529.9A
Publication of CN108960269A
Application granted
Publication of CN108960269B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211 Selection of the most significant subset of features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

This specification provides a feature acquisition method, apparatus, and computing device for a data set. The scheme uses deep learning to automate feature engineering: a deep learning model is trained in advance on a raw data set and its corresponding preset features. When a feature set is needed for a sample data set, a matching raw data set is found based on the similarity of statistical information, and the deep learning model trained on that raw data set is used to output the feature set of the sample data set.

Description

Feature acquisition method, apparatus, and computing device for a data set
Technical field
The embodiments of this specification relate to the field of machine learning, and in particular to a feature acquisition method, apparatus, and computing device for a data set.
Background
In the field of machine learning, it is often said that "data and features determine the upper limit of machine learning, and models only approach that limit." In other words, good data and features are the prerequisite for any model to perform at its best. Features are the information extracted from data that is useful for predicting the outcome. Feature engineering refers to processing data with domain knowledge and skill so as to extract as many useful features as possible for the model to use; the quality of feature engineering affects the predictive performance of the whole model.
Summary of the invention
To overcome the problems in the related art, this specification provides a feature acquisition method, apparatus, and computing device for a data set.
According to a first aspect of the embodiments of this specification, a feature acquisition method for a data set is provided, the method comprising:
obtaining a sample data set and determining statistical information of the sample data set;
searching, based on the similarity of the statistical information, for a raw data set that matches the sample data set, and obtaining the deep learning model corresponding to the raw data set, wherein the model parameters of the deep learning model were obtained in advance by training on the raw data set and its corresponding preset features;
taking the sample data set as input and outputting the feature set of the sample data set using the deep learning model.
Optionally, the statistical information includes one or more of the following: total amount of data, black sample proportion, number of attributes, mean of attribute values, variance of attribute values, covariance of attribute values, range of attribute values, interquartile range of attribute values, skewness of attribute values, or kurtosis of attribute values.
Optionally, the difference between the statistical information of the sample data set and that of the raw data set is below a set threshold.
Optionally, before taking the sample data set as input and outputting the feature set of the sample data set using the deep learning model, the method further comprises:
displaying the deep learning model together with a parameter adjustment interface for the deep learning model, obtaining a parameter adjustment value entered by the user through the adjustment interface, and adjusting the model parameters of the deep learning model according to the parameter adjustment value.
Optionally, the method further comprises:
adding, to the database that stores the raw data set and the corresponding deep learning model, a record of the correspondence between the sample data set and the deep learning model.
Optionally, the method further comprises:
obtaining a test data set, calculating the prediction accuracy of the features in the feature set according to the test data set, and screening the features according to the calculation result.
Optionally, the method further comprises:
displaying an additional-feature input interface, obtaining additional features entered by the user through the interface, and adding them to the feature set before the prediction accuracy is calculated.
Optionally, before the additional features are added to the feature set, the method further comprises:
determining whether a linear relationship exists between an additional feature and a feature in the feature set, and deleting one feature from each pair of features that have a linear relationship.
According to a second aspect of the embodiments of this specification, a feature acquisition apparatus for a data set is provided, the apparatus comprising:
an obtaining module, configured to obtain a sample data set and determine statistical information of the sample data set;
a searching module, configured to search, based on the similarity of the statistical information, for a raw data set that matches the sample data set, and to obtain the deep learning model corresponding to the raw data set, wherein the model parameters of the deep learning model were obtained in advance by training on the raw data set and its corresponding preset features;
an output module, configured to take the sample data set as input and output the feature set of the sample data set using the deep learning model.
Optionally, the statistical information includes one or more of the following: total amount of data, black sample proportion, number of attributes, mean of attribute values, variance of attribute values, covariance of attribute values, range of attribute values, interquartile range of attribute values, skewness of attribute values, or kurtosis of attribute values.
Optionally, the difference between the statistical information of the sample data set and that of the raw data set is below a set threshold.
Optionally, the apparatus further comprises a parameter adjustment module, configured to:
before the sample data set is taken as input and the feature set of the sample data set is output using the deep learning model, display the deep learning model together with a parameter adjustment interface for the deep learning model, obtain a parameter adjustment value entered by the user through the adjustment interface, and adjust the model parameters of the deep learning model according to the parameter adjustment value.
Optionally, the apparatus further comprises a record adding module, configured to:
add, to the database that stores the raw data set and the corresponding deep learning model, a record of the correspondence between the sample data set and the deep learning model.
Optionally, the apparatus further comprises a feature screening module, configured to:
obtain a test data set, calculate the prediction accuracy of the features in the feature set according to the test data set, and screen the features according to the calculation result.
Optionally, the apparatus further comprises an additional-feature obtaining module, configured to:
display an additional-feature input interface, obtain additional features entered by the user through the interface, and add them to the feature set before the prediction accuracy is calculated.
Optionally, the apparatus further comprises a feature processing module, configured to:
before the additional features are added to the feature set, determine whether a linear relationship exists between an additional feature and a feature in the feature set, and delete one feature from each pair of features that have a linear relationship.
According to a third aspect of the embodiments of this specification, a computing device is provided, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
obtain a sample data set and determine statistical information of the sample data set;
search, based on the similarity of the statistical information, for a raw data set that matches the sample data set, and obtain the deep learning model corresponding to the raw data set, wherein the model parameters of the deep learning model were obtained in advance by training on the raw data set and its corresponding preset features;
take the sample data set as input and output the feature set of the sample data set using the deep learning model.
The technical solutions provided by the embodiments of this specification can have the following beneficial effects:
In the embodiments of this specification, deep learning can be used to automate feature engineering. Specifically, a deep learning model can be trained in advance on a raw data set and its corresponding preset features. When a feature set is needed for a sample data set, a matching raw data set can be found based on the similarity of statistical information, and the deep learning model trained on that raw data set is used to output the feature set of the sample data set. This embodiment can greatly reduce the user's workload and improve the efficiency of feature engineering: by relying on the learning ability of the deep learning model itself, accurate features can be output without depending on the user's understanding of and experience with the business scenario.
It should be understood that the above general description and the following detailed description are only exemplary and explanatory and do not limit this specification.
Brief description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with this specification and, together with the description, serve to explain its principles.
Fig. 1 is a schematic diagram of a feature acquisition scenario for a data set according to an exemplary embodiment of this specification.
Fig. 2 is a flowchart of a feature acquisition method for a data set according to an exemplary embodiment of this specification.
Fig. 3 is a hardware structure diagram of a computing device in which the feature acquisition apparatus for a data set of this specification is located.
Fig. 4 is a block diagram of a feature acquisition apparatus for a data set according to an exemplary embodiment of this specification.
Detailed description
Exemplary embodiments are described in detail here, with examples illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this specification. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this specification as detailed in the appended claims.
The terms used in this specification are for the purpose of describing particular embodiments only and are not intended to limit this specification. The singular forms "a", "said", and "the" used in this specification and the appended claims are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, and so on may be used in this specification to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of this specification, first information may also be called second information, and similarly, second information may also be called first information. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
In a machine learning task, after a sample data set is obtained, feature engineering usually has to be carried out first and the model trained afterwards. Feature engineering is the most time-consuming and laborious part of a machine learning task, yet also the most indispensable. For this reason, the embodiments of this specification provide a feature acquisition scheme for a data set. The scheme uses deep learning to automate feature engineering: a deep learning model is trained in advance on a raw data set and its corresponding preset features. When a feature set is needed for a sample data set, a matching raw data set can be found based on the similarity of statistical information, and the deep learning model trained on that raw data set is used to output the feature set of the sample data set. The embodiments of this specification are described in detail below.
Fig. 1 is a schematic diagram of a feature acquisition scenario for a data set according to an exemplary embodiment of this specification. Fig. 1 shows two stages: the preparation stage of the deep learning model and the feature acquisition stage.
The purpose of the preparation stage is to accumulate multiple raw data sets and their corresponding deep learning models, where each corresponding deep learning model is obtained by training on a raw data set and the preset features for that raw data set. The features describing a raw data set need to be designed in advance so that the deep learning model can learn the relationship between the raw data set and its features; for ease of distinction and description, this embodiment calls them the preset features of the raw data set. In practice, raw data sets can be collected from open-source databases, which provide a number of classic data sets covering common business scenarios such as computer vision, natural language, or speech recognition. In other examples, raw data sets may also be collected by technical staff, according to the needs of the business scenario, from in-house databases or other databases.
A deep learning model can be understood as a neural network with many layers. The neural networks of this embodiment may include fully connected neural networks, convolutional neural networks (CNN, Convolutional Neural Network), recurrent neural networks (RNN, Recurrent Neural Network), long short-term memory networks (LSTM, Long Short-Term Memory), and so on. Because different neural networks may suit data sets with different characteristics, this embodiment can prepare multiple deep learning models built on different neural networks. Further, a suitable deep learning model can be chosen according to the characteristics of each raw data set and trained with the preset features of the raw data set as its learning task. Training can be understood as the process of adjusting the parameters of the neural network; when training ends, the optimal parameters of the model (the parameter set) have been determined, and the trained deep learning model is the model with those tuned parameters.
In the embodiments of this specification, the raw data sets and corresponding deep learning models can be stored; optionally, a database dedicated to storing the correspondence between raw data sets and deep learning models can be built. Optionally, what is stored for a deep learning model may include the model class (indicating the type of neural network the model uses, such as CNN or RNN) and the trained model parameters.
For the feature acquisition stage, Fig. 2 is a flowchart of a feature acquisition method for a data set according to an embodiment of this specification, comprising the following steps:
In step 202, a sample data set is obtained and the statistical information of the sample data set is determined;
In step 204, based on the similarity of the statistical information, a raw data set that matches the sample data set is searched for, and the deep learning model corresponding to the raw data set is obtained, wherein the model parameters of the deep learning model were obtained in advance by training on the raw data set and its corresponding preset features;
In step 206, the sample data set is taken as input, and the feature set of the sample data set is output using the deep learning model.
This embodiment calls the data set whose features need to be obtained the sample data set. To obtain the features of the sample data set automatically and accurately, this embodiment uses the stored content described above to find a raw data set that is highly similar to the sample data set. Because the sample data set and that raw data set are similar, the deep learning model corresponding to the raw data set can output accurate features for the sample data set. This embodiment can greatly reduce the user's workload and improve the efficiency of feature engineering: by relying on the learning ability of the deep learning model itself, accurate features can be output without depending on the user's understanding of and experience with the business scenario.
A data set usually contains many data records. To determine the similarity between data sets quickly and accurately, this embodiment measures it with statistical information. Optionally, the statistical information includes one or more of the following: total amount of data, black sample proportion, number of attributes, mean of attribute values, variance of attribute values, covariance of attribute values, range of attribute values, interquartile range of attribute values, skewness of attribute values, or kurtosis of attribute values. Specifically, the sample data set and the raw data set each usually contain many records. Each record is a description of an event or object, and an item reflecting the behavior or nature of the event or object in some respect is called an attribute. For example, a record about a user may contain specific information such as the user's age, gender, occupation, average annual income, or average annual transaction amount. Age, gender, occupation, average annual income, and average annual transaction amount are the attributes, and each specific item of information about the user carried in the record is the attribute value of the corresponding attribute. By calculating statistical information such as the total amount of data, the black sample proportion, and the attribute statistics above, this embodiment can effectively determine the similarity between data sets.
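As a minimal sketch of how such statistical information might be computed, the following uses only the Python standard library. It assumes that every attribute has already been converted to a number and that a label of 1 marks a black sample; the function names `attribute_stats` and `dataset_stats` are hypothetical and do not come from the specification. Covariance between attributes is left out for brevity.

```python
import statistics
from typing import Dict, Sequence


def attribute_stats(values: Sequence[float]) -> Dict[str, float]:
    """Summary statistics for one numeric attribute (one column of the data set)."""
    n = len(values)
    mean = statistics.fmean(values)
    variance = statistics.pvariance(values, mu=mean)  # population variance
    std = variance ** 0.5
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    if std == 0:
        # Skewness and kurtosis are undefined for a constant column; report 0 here.
        skewness = kurtosis = 0.0
    else:
        skewness = sum((v - mean) ** 3 for v in values) / (n * std ** 3)
        # Excess kurtosis: 0 for a normal distribution.
        kurtosis = sum((v - mean) ** 4 for v in values) / (n * std ** 4) - 3.0
    return {
        "mean": mean,
        "variance": variance,
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "skewness": skewness,
        "kurtosis": kurtosis,
    }


def dataset_stats(rows: Sequence[Sequence[float]], labels: Sequence[int]) -> Dict[str, object]:
    """Statistical information for a whole data set (assumption: label 1 = black sample)."""
    columns = list(zip(*rows))  # transpose records into per-attribute columns
    return {
        "total": len(rows),
        "black_ratio": sum(labels) / len(labels),
        "n_attributes": len(columns),
        "attributes": [attribute_stats(column) for column in columns],
    }
```

Each record is a row of attribute values, so `dataset_stats(rows, labels)` returns one dictionary per data set that can be stored next to the corresponding model and compared later.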
Specifically, similarity can be measured as follows: calculate the difference between the statistical information of the sample data set and that of the raw data set, and set a threshold that characterizes the two as similar. When the difference between the statistical information of the sample data set and that of the raw data set is below the set threshold, the sample data set and the raw data set can be judged to be highly similar, and the two match. The threshold can be configured flexibly as needed. Optionally, where several kinds of statistical information are considered, a corresponding threshold can be set for each kind, for example: the difference in total amount of data is below 5%, the difference in black sample proportion is below 10%, the difference in the mean of attribute values is below 20%, and so on. Whether the sample data set matches a raw data set can also be determined by combining the differences of the various kinds of statistical information, for example by giving priority to the total amount of data and the black sample proportion, or by assigning a weight to each kind of statistical information and making the matching judgment after multiplying each difference by its weight.
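The two matching strategies described above, a threshold per statistic and a weighted combination, can be sketched as follows. The relative-difference definition, the dictionary keys, and the function names are assumptions made for illustration; the specification does not fix how a "difference" is computed.

```python
from typing import Dict


def relative_diff(a: float, b: float) -> float:
    """Relative difference in [0, 1] for non-negative values; 0 when both are zero."""
    denom = max(abs(a), abs(b))
    return 0.0 if denom == 0 else abs(a - b) / denom


def matches_by_thresholds(sample: Dict[str, float], raw: Dict[str, float],
                          thresholds: Dict[str, float]) -> bool:
    """True when every listed statistic differs by less than its own threshold."""
    return all(relative_diff(sample[name], raw[name]) < limit
               for name, limit in thresholds.items())


def weighted_difference(sample: Dict[str, float], raw: Dict[str, float],
                        weights: Dict[str, float]) -> float:
    """Weighted average of the per-statistic differences (smaller means more similar)."""
    total_weight = sum(weights.values())
    return sum(weight * relative_diff(sample[name], raw[name])
               for name, weight in weights.items()) / total_weight
```

With thresholds such as `{"total": 0.05, "black_ratio": 0.10, "mean": 0.20}`, the first function mirrors the 5% / 10% / 20% example in the text; the second returns a single score that can be compared against one overall threshold.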
As an example, in practice, to simplify calculation across the many attributes involved in the data, all attribute values can be quantified and encoded uniformly, or normalized uniformly. Taking normalization of all attribute values as an example, the matching judgment can proceed as follows: for each attribute, compute the mean of that attribute's values over all records of the sample data set, and compare it with the mean of the same attribute over the records of the raw data set. If, for 80% (a set threshold, flexibly configurable) or more of the attributes, the difference in means is within 20% (a set threshold, flexibly configurable), it can be determined that the two are similar on the index of the mean of attribute values. Taking the variance index as an example, for each attribute, compute the variance of that attribute's values over all records of the sample data set, and compare it with the variance of the same attribute over the records of the raw data set. If, for 80% (a set threshold, flexibly configurable) or more of the attributes, the variance difference is within 30% (a set threshold, flexibly configurable), it can be determined that the two are similar on the index of the variance of attribute values. In practice, technical staff can flexibly configure how the similarity of statistical information is judged as needed; this embodiment places no limitation on this.
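The per-attribute comparison above can be sketched as a single function. It assumes that the two lists hold the same index (for example, the mean) for the same attributes in the same order, and that "within 20%" means a relative difference; `similar_on_index` is a hypothetical name.

```python
from typing import Sequence


def similar_on_index(sample_values: Sequence[float], raw_values: Sequence[float],
                     tolerance: float = 0.20, min_fraction: float = 0.80) -> bool:
    """True when at least min_fraction of the attributes differ by at most tolerance."""
    if len(sample_values) != len(raw_values) or not sample_values:
        return False
    within = 0
    for s, r in zip(sample_values, raw_values):
        denom = max(abs(s), abs(r))
        diff = 0.0 if denom == 0 else abs(s - r) / denom  # relative difference
        if diff <= tolerance:
            within += 1
    return within / len(sample_values) >= min_fraction
```

Calling it with `tolerance=0.30` on the per-attribute variances gives the variance check from the text; both thresholds stay configurable.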
The database stores the correspondence between raw data sets and deep learning models. Optionally, it can also store the statistical information of each raw data set, so that when feature extraction is needed for a sample data set, the statistical information of the raw data sets can be read quickly and used to judge whether the two match.
After a raw data set matching the sample data set is found through the above processing, the sample data set can be taken as input, and the deep learning model corresponding to the raw data set can be used to output the feature set of the sample data set. In practice, the single best-matching raw data set can be found according to the statistical information. It will be understood that, in searching for a raw data set that matches the sample data set, two or more raw data sets may match the sample data set well. Technical staff can choose the deep learning model of one of these raw data sets to obtain the feature set as needed, or use the deep learning models of several raw data sets; each deep learning model outputs its own feature set, and the user can select the feature set needed.
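A rough sketch of the lookup could keep one record per raw data set and pick the closest one. Everything named here (`RegistryEntry`, `stat_distance`, `find_best_match`, the averaged relative difference, the 0.2 threshold) is an assumption for illustration; the stored "model" is represented as a plain callable rather than a real neural network.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence

Row = Sequence[float]


@dataclass
class RegistryEntry:
    dataset_id: str                                   # hypothetical identifier of a raw data set
    stats: Dict[str, float]                           # stored statistical information
    model_class: str                                  # e.g. "CNN" or "RNN"
    model: Callable[[List[Row]], List[List[float]]]   # trained model: records -> features


def stat_distance(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Mean relative difference over the statistics both sides have."""
    keys = a.keys() & b.keys()
    if not keys:
        return float("inf")
    total = 0.0
    for key in keys:
        denom = max(abs(a[key]), abs(b[key]))
        total += 0.0 if denom == 0 else abs(a[key] - b[key]) / denom
    return total / len(keys)


def find_best_match(sample_stats: Dict[str, float], registry: List[RegistryEntry],
                    threshold: float = 0.2) -> Optional[RegistryEntry]:
    """Return the closest entry, or None when nothing is below the threshold."""
    best = min(registry, key=lambda e: stat_distance(sample_stats, e.stats), default=None)
    if best is None or stat_distance(sample_stats, best.stats) >= threshold:
        return None
    return best
```

Once an entry is returned, `entry.model(sample_rows)` stands in for running the stored deep learning model on the sample data set. Returning every entry below the threshold instead of only the closest one would cover the case where several raw data sets match.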
In practice, the selected deep learning model may not meet the user's needs, or the user may need to adjust its parameters. In this embodiment, before the sample data set is taken as input and the feature set of the sample data set is output using the deep learning model, the method may further include:
displaying the deep learning model together with a parameter adjustment interface for the deep learning model, obtaining a parameter adjustment value entered by the user through the adjustment interface, and adjusting the model parameters of the deep learning model according to the parameter adjustment value.
The parameter adjustment interface of this embodiment can be implemented as a visual window or in a similar way. The interface can provide interactive elements such as input boxes so that the user can adjust the parameters of the selected deep learning model. After the parameter adjustment value entered by the user is obtained through the interface, the model parameters of the deep learning model are adjusted accordingly, so that feature acquisition better meets the user's needs.
The feature set output by the deep learning model contains many features. Some of these features may describe the event or object well, while others may not. For this reason, this embodiment can also evaluate the prediction accuracy of the features in the feature set in order to screen them: removing irrelevant or redundant features reduces the number of features, shortens model training time, and improves model accuracy. In practice, the user may have designed other features in addition to the automatically obtained feature set. Optionally, this embodiment also displays an additional-feature input interface, obtains the additional features entered by the user through it, and adds them to the feature set before the prediction accuracy is calculated, so that the subsequent feature screening can select from both the user-provided additional features and the automatically obtained features.
The additional features provided by the user may be collinear with features in the feature set. Collinearity means that a strong linear relationship exists between independent variables; a pair of features with a linear relationship may adversely affect the prediction result and make the model unstable. For this reason, before the additional features are added to the feature set, the method further includes:
determining whether a linear relationship exists between an additional feature and a feature in the feature set, and deleting one feature from each pair of features that have a linear relationship. Whether a linear relationship exists between features can be judged with the Pearson correlation coefficient. The Pearson correlation coefficient is a simple method that helps in understanding the relationship between a feature and the response variable; it measures the linear correlation between variables. Its value lies in the interval [-1, 1]: -1 indicates a perfect negative correlation (as one variable falls, the other rises), +1 indicates a perfect positive correlation, and 0 indicates no linear correlation.
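The collinearity check can be sketched with a hand-written Pearson coefficient. Two choices here are assumptions rather than part of the specification: the cut-off of 0.95 for "has a linear relationship", and dropping the additional feature (rather than the existing one) from a collinear pair. The function names are hypothetical.

```python
import math
from typing import Dict, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient in [-1, 1]; 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return cov / (norm_x * norm_y)


def drop_collinear(feature_set: Dict[str, Sequence[float]],
                   additional: Dict[str, Sequence[float]],
                   threshold: float = 0.95) -> Dict[str, Sequence[float]]:
    """Keep only the additional features that are not collinear with an existing feature."""
    kept = {}
    for name, values in additional.items():
        if all(abs(pearson(values, existing)) < threshold
               for existing in feature_set.values()):
            kept[name] = values
    return kept
```

The surviving additional features can then be merged into the feature set before the prediction accuracy is calculated.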
After the above processing, the features in the feature set can be screened. Optionally, a test data set can be obtained, the prediction accuracy of the features in the feature set can be calculated according to the test data set, and the features whose prediction accuracy is above a set threshold can be selected according to the calculation result and displayed.
The way the prediction accuracy of the features in the feature set is calculated can be chosen flexibly as needed. As an example, a GBDT-like (Gradient Boosting Decision Tree) scoring function can be used to score the importance of each feature, and the prediction accuracy of the feature is then determined from the score.
As an example, the GBDT-like scoring function basically works as follows:
Step 1: Starting from a tree of depth 0, enumerate all available features for each leaf node.
Step 2: For each feature, sort the training samples belonging to the node in ascending order of the feature value, determine the best split point of the feature by a linear scan, and record the maximum gain of the feature (the gain obtained with the best split point).
Step 3: Select the feature with the largest gain as the split feature, use its best split point as the split position, grow two new leaf nodes (left and right) from the node, and associate the corresponding sample set with each new node.
Step 4: Return to Step 1 and repeat recursively until a specified condition is met.
The gain of each split is computed as follows: suppose the current node is denoted C, and after the split the left child node is denoted L and the right child node R. The gain of the split is defined as the objective function value of the current node minus the sum of the objective function values of the two child nodes. Finally, the features are ranked by gain to obtain the importance of each feature, and this value characterizes the prediction accuracy of the feature.
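The split search and the gain definition can be illustrated for a single node. This sketch assumes the sum of squared errors as the objective function, which the specification does not specify, and it ranks features by the gain of one root split only; a full GBDT would accumulate gains over all splits and trees. The names `best_split` and `rank_features` are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def sse(n: int, total: float, total_sq: float) -> float:
    """Sum of squared errors around the mean, from running sums (assumed objective)."""
    return 0.0 if n == 0 else total_sq - total * total / n


def best_split(xs: Sequence[float], ys: Sequence[float]) -> Tuple[Optional[float], float]:
    """Scan the sorted feature values once; return (best threshold, best gain)."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    sum_y = sum(y for _, y in pairs)
    sum_y2 = sum(y * y for _, y in pairs)
    parent = sse(n, sum_y, sum_y2)  # objective of the current node C
    left_n, left_sum, left_sq = 0, 0.0, 0.0
    best_threshold, best_gain = None, 0.0
    for i in range(1, n):
        x_prev, y_prev = pairs[i - 1]
        left_n += 1
        left_sum += y_prev
        left_sq += y_prev * y_prev
        if x_prev == pairs[i][0]:
            continue  # cannot split between equal feature values
        left = sse(left_n, left_sum, left_sq)                          # objective of L
        right = sse(n - left_n, sum_y - left_sum, sum_y2 - left_sq)    # objective of R
        gain = parent - (left + right)
        if gain > best_gain:
            best_threshold, best_gain = (x_prev + pairs[i][0]) / 2, gain
    return best_threshold, best_gain


def rank_features(features: Dict[str, Sequence[float]],
                  ys: Sequence[float]) -> List[Tuple[str, float]]:
    """Rank features by the gain of their best root split, largest first."""
    gains = [(name, best_split(xs, ys)[1]) for name, xs in features.items()]
    return sorted(gains, key=lambda item: item[1], reverse=True)
```

Because the left-child sums are updated incrementally, each feature needs one sort and one pass over its values, which corresponds to the linear scan in Step 2.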
In other examples, feature screening can also be carried out by cross-validation. Alternatively, AUC (Area Under Curve, a model evaluation metric) can be used as the standard for assessing prediction accuracy, and features can be screened based on the assessment result; for example, features that cause the AUC to drop can be deleted. In other examples, the IV (Information Value) or the PSI value can be calculated, among various other approaches, which can be configured flexibly as needed in practical applications.
Optionally, the above screening of the features in the feature set can take place either before or after the additional features are added. That is, the additional features can first be merged into the feature set and the screening performed afterwards; this can be configured flexibly as needed in practical applications.
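As one illustration of the IV option mentioned above, the sketch below computes the Information Value of a binned feature using the commonly used formula IV = Σ (good% − bad%) · ln(good% / bad%). The binning itself, the convention that label 1 marks a black sample, the smoothing constant `eps` and the function name are assumptions made for this example; both classes are assumed to be present.

```python
import math
from typing import List


def information_value(bins: List[int], labels: List[int],
                      eps: float = 1e-6) -> float:
    """Information Value of a feature that has already been binned.

    bins[i] is the bin index of sample i; labels[i] is 1 for a black
    sample and 0 otherwise (an assumed convention).
    """
    total_bad = sum(labels)
    total_good = len(labels) - total_bad
    iv = 0.0
    for b in set(bins):
        bad = sum(1 for x, y in zip(bins, labels) if x == b and y == 1)
        good = sum(1 for x, y in zip(bins, labels) if x == b and y == 0)
        # eps avoids log(0) when a bin holds only one class
        p_good = max(good / total_good, eps)
        p_bad = max(bad / total_bad, eps)
        iv += (p_good - p_bad) * math.log(p_good / p_bad)
    return iv
```

A feature whose bins contain the same share of good and bad samples gets an IV of 0, while a feature whose bins separate the classes gets a large IV, so features can be screened by comparing IV against a chosen cut-off.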
In other examples, the features in the feature set can be screened first and then merged with the additional features input by the user. In this case, after collinearity has been removed, the result of merging the additional features into the feature set can also be checked for whether it produces a thousand quartile effect. If it does, the optimization effect can be compared against the raw data sets in the database using a t-test (Student's t-test). If the effect (judged by AUC, F1 score, KS value or the like) is significant (for example, at a p-value threshold of 0.05), the result is output for the user to review; if the effect is not significant, feature screening can be performed again to ensure that effective features are selected.
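The t-test comparison can be sketched as below. This is only an illustration under assumptions: it computes the classical two-sample Student's t statistic with pooled variance on two lists of evaluation scores (for example, per-fold AUC values, which are hypothetical here), and leaves the significance decision to a critical value supplied by the caller from a t table rather than computing a p-value.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def student_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Two-sample Student's t statistic (pooled variance) and degrees of freedom."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, df


def is_significant(t: float, critical: float) -> bool:
    """Two-sided decision against a critical value taken from a t table."""
    return abs(t) > critical


# Hypothetical per-fold AUC scores with and without the merged features.
with_merged = [0.80, 0.82, 0.81, 0.83, 0.79]
baseline = [0.70, 0.72, 0.71, 0.73, 0.69]
t_value, dof = student_t(with_merged, baseline)
# 2.306 is the two-sided 0.05 critical value for 8 degrees of freedom.
significant = is_significant(t_value, 2.306)
```

If the two score lists have identical, constant values the pooled variance is zero and the statistic is undefined, so a real implementation would need to guard that case.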
Optionally, in this embodiment the finally selected features can be displayed, and it can be verified whether the selected features predict accurately. As an example, the verification can be carried out in various ways, such as cross-validation with the actual business model corresponding to the sample data set. If the selected features perform well, a record of the correspondence between the sample data set and the deep learning model can also be added to the database that stores the raw data sets and their corresponding deep learning models. The database thereby keeps accumulating and comes to cover more raw data sets and corresponding deep learning models, which continuously improves the accuracy of the feature acquisition scheme of this embodiment.
Corresponding to the foregoing embodiments of the feature acquisition method for a data set, this specification also provides embodiments of a feature acquisition device for a data set and of the computing equipment to which it is applied.
The embodiments of the feature acquisition device for a data set in this specification can be applied to computing equipment, such as a computer or a server. The device embodiments can be implemented in software, in hardware, or in a combination of the two. Taking a software implementation as an example, the device in the logical sense is formed when the processor of the equipment on which it resides reads the corresponding computer program instructions from non-volatile memory into memory and runs them. At the hardware level, Fig. 3 is a hardware structure diagram of the computing equipment on which the feature acquisition device for a data set of this specification resides. In addition to the processor 310, memory 330, network interface 320 and non-volatile memory 340 shown in Fig. 3, the server or other computing equipment on which the device 331 of the embodiment resides may include other hardware according to the actual function of that equipment, which is not described further here.
As shown in Fig. 4, which is a block diagram of a feature acquisition device for a data set according to an exemplary embodiment of this specification, the device includes:
An acquisition module 41, configured to: obtain a sample data set and determine the statistical information of the sample data set;
A search module 42, configured to: use similarity to the statistical information to search for a raw data set that matches the sample data set, and obtain the deep learning model corresponding to the raw data set, the model parameters of the deep learning model having been obtained in advance by training with the raw data set and its corresponding set features;
An output module 43, configured to: take the sample data set as input and output the feature set of the sample data set using the deep learning model.
Optionally, the statistical information includes at least one or more of the following: total amount of data, black sample proportion, number of attributes, mean of attribute values, variance of attribute values, covariance of attribute values, range of attribute values, interquartile range of attribute values, skewness of attribute values, or kurtosis of attribute values.
Optionally, the difference between the statistical information of the sample data set and that of the raw data set is lower than a set threshold.
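A minimal sketch of how such statistical information might be computed is given below, using only the Python standard library. The row layout (a list of dictionaries), the `label` key with 1 marking a black sample, and the function names are assumptions for illustration; covariance between attributes is omitted for brevity, and skewness and kurtosis are taken as the standardized third and fourth moments.

```python
from statistics import mean, pvariance, quantiles
from typing import Dict, List


def attribute_stats(values: List[float]) -> Dict[str, float]:
    """Per-attribute statistics; needs at least two values."""
    m = mean(values)
    var = pvariance(values, m)
    std = var ** 0.5
    q1, _, q3 = quantiles(values, n=4)
    n = len(values)
    # Standardized moments, defined as 0 for a constant column.
    skew = sum((v - m) ** 3 for v in values) / (n * std ** 3) if std else 0.0
    kurt = sum((v - m) ** 4 for v in values) / (n * std ** 4) if std else 0.0
    return {
        "mean": m,
        "variance": var,
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "skewness": skew,
        "kurtosis": kurt,
    }


def dataset_stats(rows: List[Dict[str, float]],
                  label_key: str = "label") -> Dict[str, object]:
    """Statistical information of a data set given as a list of rows."""
    attrs = [k for k in rows[0] if k != label_key]
    return {
        "total": len(rows),
        "black_ratio": sum(r[label_key] for r in rows) / len(rows),
        "attribute_count": len(attrs),
        "attributes": {a: attribute_stats([r[a] for r in rows]) for a in attrs},
    }
```

For example, `dataset_stats([{"x": 1, "label": 0}, {"x": 2, "label": 0}, {"x": 3, "label": 0}, {"x": 4, "label": 1}])` reports four samples, a black sample proportion of 0.25 and one attribute with mean 2.5.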
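The threshold-based matching can be illustrated with the sketch below. It is an assumption-laden example rather than the matching rule of the embodiment: each data set is summarized as a flat dictionary of statistics, the difference is measured as the largest relative gap over shared statistics, and `registry` is a hypothetical in-memory stand-in for the database that maps a model identifier to the statistics of its raw data set.

```python
from typing import Dict, Optional


def relative_gap(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Largest relative difference over the statistics both summaries share."""
    keys = a.keys() & b.keys()
    return max(
        abs(a[k] - b[k]) / max(abs(a[k]), abs(b[k]), 1e-12) for k in keys
    )


def find_model(sample_stats: Dict[str, float],
               registry: Dict[str, Dict[str, float]],
               threshold: float) -> Optional[str]:
    """Return the id of the model whose raw data set is closest to the
    sample data set, or None if no gap is below the threshold."""
    best_id, best_gap = None, threshold
    for model_id, raw_stats in registry.items():
        gap = relative_gap(sample_stats, raw_stats)
        if gap < best_gap:
            best_id, best_gap = model_id, gap
    return best_id
```

Returning `None` when nothing falls under the threshold makes the "no matching raw data set" case explicit, so the caller can decide how to proceed.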
Optionally, the device further includes a parameter adjustment module, configured to:
Before the sample data set is taken as input and the feature set of the sample data set is output using the deep learning model, display the deep learning model and a parameter adjustment interface for the deep learning model, obtain a parameter adjustment value input by the user through the adjustment interface, and adjust the model parameters of the deep learning model according to the parameter adjustment value.
Optionally, the device further includes a record addition module, configured to:
Add, to the database that stores the raw data set and the corresponding deep learning model, a record of the correspondence between the sample data set and the deep learning model.
Optionally, the device further includes a feature screening module, configured to:
Obtain a test data set, calculate the prediction accuracy of the features in the feature set according to the test data set, and select features according to the calculation result and display them.
Optionally, the device further includes an additional feature acquisition module, configured to:
Display an additional feature input interface, obtain the additional features input by the user through the interface, and add them to the feature set before the prediction accuracy is calculated.
Optionally, the device further includes a feature processing module, configured to:
Before the additional features are added to the feature set, determine the linear relationship between the additional features and the features in the feature set, and delete one of the features in each pair of features that has a linear relationship.
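One way to sketch the linear-relationship check is with the Pearson correlation coefficient, as below. The use of Pearson correlation, the 0.95 cut-off, the choice to drop the additional feature (rather than the existing one) from a linearly related pair, and the function names are all assumptions made for this illustration.

```python
import math
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient; 0.0 if either column is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def merge_without_collinear(feature_set: Dict[str, List[float]],
                            additional: Dict[str, List[float]],
                            limit: float = 0.95) -> Dict[str, List[float]]:
    """Add each additional feature unless it is (nearly) linearly related
    to a feature that is already kept."""
    merged = dict(feature_set)
    for name, col in additional.items():
        if all(abs(pearson(col, kept)) < limit for kept in merged.values()):
            merged[name] = col
    return merged
```

For instance, an additional feature that is exactly twice an existing feature has a correlation of 1 with it and is left out, while an unrelated additional feature is merged in.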
Correspondingly, this specification also provides computing equipment, which includes a processor and a memory for storing instructions executable by the processor, wherein the processor is configured to:
Obtain a sample data set and determine the statistical information of the sample data set;
Use similarity to the statistical information to search for a raw data set that matches the sample data set, and obtain the deep learning model corresponding to the raw data set, the model parameters of the deep learning model having been obtained in advance by training with the raw data set and its corresponding set features;
Take the sample data set as input and output the feature set of the sample data set using the deep learning model.
For the implementation of the functions and roles of the modules in the above device, see the implementation of the corresponding steps in the above method; details are not repeated here.
Since the device embodiments essentially correspond to the method embodiments, the relevant parts can be found in the description of the method embodiments. The device embodiments described above are merely illustrative. The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules; that is, they may be located in one place or distributed over multiple network modules. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of this specification. Those of ordinary skill in the art can understand and implement it without creative effort.
Specific embodiments of this specification have been described above. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims can be performed in an order different from that in the embodiments and still achieve the desired result. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired result. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Those skilled in the art will readily conceive of other embodiments of this specification after considering the specification and practicing the invention disclosed here. This specification is intended to cover any variations, uses or adaptations of this specification that follow its general principles and include common knowledge or customary technical means in the art not disclosed in this specification. The specification and examples are to be regarded as illustrative only; the true scope and spirit of this specification are indicated by the following claims.
It should be understood that this specification is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes can be made without departing from its scope. The scope of this specification is limited only by the appended claims.
The above are merely preferred embodiments of this specification and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of this specification shall fall within its scope of protection.

Claims (17)

1. A feature acquisition method for a data set, the method comprising:
obtaining a sample data set and determining the statistical information of the sample data set;
using similarity to the statistical information to search for a raw data set that matches the sample data set, and obtaining the deep learning model corresponding to the raw data set, the model parameters of the deep learning model having been obtained in advance by training with the raw data set and its corresponding set features;
taking the sample data set as input and outputting the feature set of the sample data set using the deep learning model.
2. The method according to claim 1, wherein the statistical information includes at least one or more of the following: total amount of data, black sample proportion, number of attributes, mean of attribute values, variance of attribute values, covariance of attribute values, range of attribute values, interquartile range of attribute values, skewness of attribute values, or kurtosis of attribute values.
3. The method according to claim 1 or 2, wherein the difference between the statistical information of the sample data set and that of the raw data set is lower than a set threshold.
4. The method according to claim 1, wherein before the sample data set is taken as input and the feature set of the sample data set is output using the deep learning model, the method further comprises:
displaying the deep learning model and a parameter adjustment interface for the deep learning model, obtaining a parameter adjustment value input by the user through the adjustment interface, and adjusting the model parameters of the deep learning model according to the parameter adjustment value.
5. The method according to claim 1, further comprising:
adding, to the database that stores the raw data set and the corresponding deep learning model, a record of the correspondence between the sample data set and the deep learning model.
6. The method according to claim 1, further comprising:
obtaining a test data set, calculating the prediction accuracy of the features in the feature set according to the test data set, and screening the features according to the calculation result.
7. The method according to claim 6, further comprising:
displaying an additional feature input interface, obtaining the additional features input by the user through the interface, and adding them to the feature set before the prediction accuracy is calculated.
8. The method according to claim 7, wherein before the additional features are added to the feature set, the method further comprises:
determining the linear relationship between the additional features and the features in the feature set, and deleting one of the features in each pair of features that has a linear relationship.
9. A feature acquisition device for a data set, the device comprising:
an acquisition module, configured to: obtain a sample data set and determine the statistical information of the sample data set;
a search module, configured to: use similarity to the statistical information to search for a raw data set that matches the sample data set, and obtain the deep learning model corresponding to the raw data set, the model parameters of the deep learning model having been obtained in advance by training with the raw data set and its corresponding set features;
an output module, configured to: take the sample data set as input and output the feature set of the sample data set using the deep learning model.
10. The device according to claim 9, wherein the statistical information includes at least one or more of the following: total amount of data, black sample proportion, number of attributes, mean of attribute values, variance of attribute values, covariance of attribute values, range of attribute values, interquartile range of attribute values, skewness of attribute values, or kurtosis of attribute values.
11. The device according to claim 9 or 10, wherein the difference between the statistical information of the sample data set and that of the raw data set is lower than a set threshold.
12. The device according to claim 9, further comprising a parameter adjustment module, configured to:
before the sample data set is taken as input and the feature set of the sample data set is output using the deep learning model, display the deep learning model and a parameter adjustment interface for the deep learning model, obtain a parameter adjustment value input by the user through the adjustment interface, and adjust the model parameters of the deep learning model according to the parameter adjustment value.
13. The device according to claim 9, further comprising a record addition module, configured to:
add, to the database that stores the raw data set and the corresponding deep learning model, a record of the correspondence between the sample data set and the deep learning model.
14. The device according to claim 9, further comprising a feature screening module, configured to:
obtain a test data set, calculate the prediction accuracy of the features in the feature set according to the test data set, and screen the features according to the calculation result.
15. The device according to claim 14, further comprising an additional feature acquisition module, configured to:
display an additional feature input interface, obtain the additional features input by the user through the interface, and add them to the feature set before the prediction accuracy is calculated.
16. The device according to claim 15, further comprising a feature processing module, configured to:
before the additional features are added to the feature set, determine the linear relationship between the additional features and the features in the feature set, and delete one of the features in each pair of features that has a linear relationship.
17. Computing equipment, comprising:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to:
obtain a sample data set and determine the statistical information of the sample data set;
use similarity to the statistical information to search for a raw data set that matches the sample data set, and obtain the deep learning model corresponding to the raw data set, the model parameters of the deep learning model having been obtained in advance by training with the raw data set and its corresponding set features;
take the sample data set as input and output the feature set of the sample data set using the deep learning model.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810284529.9A CN108960269B (en) 2018-04-02 2018-04-02 Feature acquisition method and device for data set and computing equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810284529.9A CN108960269B (en) 2018-04-02 2018-04-02 Feature acquisition method and device for data set and computing equipment

Publications (2)

Publication Number Publication Date
CN108960269A true CN108960269A (en) 2018-12-07
CN108960269B CN108960269B (en) 2022-05-27

Family

ID=64498650

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810284529.9A Active CN108960269B (en) 2018-04-02 2018-04-02 Feature acquisition method and device for data set and computing equipment

Country Status (1)

Country Link
CN (1) CN108960269B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104463233A (en) * 2014-12-30 2015-03-25 深圳市捷顺科技实业股份有限公司 Vehicle logo recognition method and device
CN104504412A (en) * 2014-11-28 2015-04-08 苏州大学 Method and system for extracting and identifying handwriting stroke features
CN106446931A (en) * 2016-08-30 2017-02-22 苏州大学 Feature extraction and classification method and system based on support vector data description
CN107368892A (en) * 2017-06-07 2017-11-21 无锡小天鹅股份有限公司 Model training method and device based on machine learning

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11062792B2 (en) 2017-07-18 2021-07-13 Analytics For Life Inc. Discovering genomes to use in machine learning techniques
US11139048B2 (en) 2017-07-18 2021-10-05 Analytics For Life Inc. Discovering novel features to use in machine learning techniques, such as machine learning techniques for diagnosing medical conditions
CN110275880A (en) * 2019-05-21 2019-09-24 阿里巴巴集团控股有限公司 Data analysing method, device, server and readable storage medium storing program for executing
CN110781174A (en) * 2019-10-15 2020-02-11 支付宝(杭州)信息技术有限公司 Feature engineering modeling method and system using pca and feature intersection
CN113254742A (en) * 2021-07-14 2021-08-13 深圳市赛野展览展示有限公司 Display device based on 5G deep learning artificial intelligence
CN113254742B (en) * 2021-07-14 2021-11-30 深圳市赛野展览展示有限公司 Display device based on 5G deep learning artificial intelligence
CN114912544A (en) * 2022-06-06 2022-08-16 北京百度网讯科技有限公司 Automatic characteristic engineering model training method and automatic characteristic engineering method
CN114912544B (en) * 2022-06-06 2023-11-14 北京百度网讯科技有限公司 Training method of automatic feature engineering model and automatic feature engineering method
CN115346561A (en) * 2022-08-15 2022-11-15 南京脑科医院 Method and system for estimating and predicting depression mood based on voice characteristics
CN115346561B (en) * 2022-08-15 2023-11-24 南京医科大学附属脑科医院 Depression emotion assessment and prediction method and system based on voice characteristics

Also Published As

Publication number Publication date
CN108960269B (en) 2022-05-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20201021

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant before: Advanced innovation technology Co.,Ltd.

Effective date of registration: 20201021

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Advanced innovation technology Co.,Ltd.

Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands

Applicant before: Alibaba Group Holding Ltd.

GR01 Patent grant