CN108021984A - Method and system for determining the feature importance of machine learning samples - Google Patents

Method and system for determining the feature importance of machine learning samples

Info

Publication number
CN108021984A
Authority
CN
China
Prior art keywords
feature
model
machine learning
pool model
importance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610935697.0A
Other languages
Chinese (zh)
Inventor
罗远飞
涂威威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
4Paradigm Beijing Technology Co Ltd
Original Assignee
4Paradigm Beijing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 4Paradigm Beijing Technology Co Ltd
Priority to CN202110542599.1A (published as CN113435602A)
Priority to CN201610935697.0A (published as CN108021984A)
Publication of CN108021984A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211: Selection of the most significant subset of features
    • G06F18/2113: Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A method and system for determining the feature importance of machine learning samples are provided. The method includes: (A) acquiring historical data records, where each historical data record includes a label for a machine learning problem and at least one piece of attribute information; (B) training at least one feature pool model using the acquired historical data records, where a feature pool model is a machine learning model that provides a prediction result for the machine learning problem based on at least some of the features; and (C) obtaining the performance of the at least one feature pool model and determining the importance of each feature according to that performance, where, in step (B), the feature pool model is trained by performing a discretization operation on at least one continuous feature among the at least some features. With the method and system, the importance of each feature in a machine learning sample can be determined effectively.

Description

Method and system for determining the feature importance of machine learning samples
Technical field
The present invention relates generally to the field of artificial intelligence and, more particularly, to a method and system for determining the feature importance of machine learning samples.
Background
With the emergence of massive data, artificial intelligence technology has developed rapidly. To mine value from massive data, samples suitable for machine learning must be produced from data records.
Here, each data record can be regarded as a description of an event or object, corresponding to an example or sample. A data record contains items that reflect the behavior or nature of the event or object in some respect; these items may be called "attributes".
In practice, the prediction performance of a machine learning model depends on the choice of model, the available data, the extraction of features, and so on. How the features of a machine learning sample are extracted from the attributes of the raw data records has a large effect on the model's performance. Accordingly, from the standpoint of both model training and model understanding, it is highly desirable to know how important each feature of a machine learning sample is. For example, the expected split gain of each feature can be computed from a tree model trained with XGBoost, and feature importance can then be derived from it. Although this approach accounts for interactions between features, its training cost is high, and different parameter settings strongly affect the resulting feature importance.
In fact, feature importance is hard to determine intuitively. It usually requires technicians not only to master machine learning but also to understand the actual prediction problem in depth, and prediction problems are often bound up with the differing practical experience of different industries, so satisfactory results are very hard to achieve.
Summary of the invention
Exemplary embodiments of the present invention aim to overcome the prior-art defect that the importance of each feature of a machine learning sample is difficult to determine effectively.
According to an exemplary embodiment of the present invention, a method for determining the importance of each feature of a machine learning sample is provided, including: (A) acquiring historical data records, where each historical data record includes a label for a machine learning problem and at least one piece of attribute information used to generate the features of a machine learning sample; (B) training at least one feature pool model using the acquired historical data records, where a feature pool model is a machine learning model that provides a prediction result for the machine learning problem based on at least some of the features; and (C) obtaining the performance of the at least one feature pool model and determining the importance of each feature according to the obtained performance, where, in step (B), the feature pool model is trained by performing a discretization operation on at least one continuous feature among the at least some features.
Optionally, in the method, in step (C), the importance of a feature on which a feature pool model is based is determined according to the difference between the performance of the feature pool model on an original test data set and its performance on a transformed test data set, where the transformed test data set is obtained by replacing, in the original test data set, the values of the target feature whose importance is to be determined with one of the following: a null value, a random value, or values obtained by shuffling the order of the target feature's original values.
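The comparison between the original and the transformed test data set can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `auc`, `replacement_importance`, and `predict` are hypothetical, the stand-in model is a plain function rather than a trained feature pool model, and encoding the null value as 0.0 is an assumption.

```python
import random

def auc(labels, scores):
    """Chance that a random positive outscores a random negative; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def replacement_importance(predict, rows, labels, feature, mode="shuffle", seed=0):
    """AUC on the original test rows minus AUC after replacing `feature`."""
    rng = random.Random(seed)
    original = [row[feature] for row in rows]
    if mode == "null":
        replaced = [0.0] * len(rows)      # assumption: null is encoded as 0.0
    elif mode == "random":
        replaced = [rng.uniform(min(original), max(original)) for _ in rows]
    else:                                  # "shuffle": permute the original values
        replaced = original[:]
        rng.shuffle(replaced)
    transformed = [{**row, feature: value} for row, value in zip(rows, replaced)]
    base_auc = auc(labels, [predict(row) for row in rows])
    return base_auc - auc(labels, [predict(row) for row in transformed])

# Hypothetical test set: the stand-in model scores records by "x" and ignores "noise".
xs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
noise = [5, 1, 4, 2, 3, 7, 6, 8]
rows = [{"x": x, "noise": n} for x, n in zip(xs, noise)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
predict = lambda row: row["x"]

print(replacement_importance(predict, rows, labels, "x", mode="null"))  # 0.5
print(replacement_importance(predict, rows, labels, "noise"))           # 0.0
```

A larger drop in AUC suggests that the model relies more heavily on the replaced feature. No retraining is needed, because only the test data changes.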
Optionally, in the method, the at least one feature pool model includes one all-feature model, where the all-feature model is a machine learning model that provides a prediction result for the machine learning problem based on all of the features.
Optionally, in the method, the at least one feature pool model includes multiple machine learning models that provide prediction results for the machine learning problem based on different feature groups, where, in step (C), the importance of each feature is determined according to the differences between the performance of the at least one feature pool model on the original test data set.
Optionally, in the method, the at least one feature pool model includes one or more main feature pool models and at least one sub feature pool model corresponding to each main feature pool model, where a sub feature pool model is a machine learning model that provides a prediction result for the machine learning problem based on the features of its corresponding main feature pool model other than the target feature whose importance is to be determined, and where, in step (C), the importance of the corresponding target feature is determined according to the difference between the performance of the main feature pool model and that of each of its sub feature pool models on the original test data set.
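The main/sub feature pool comparison above can be sketched as follows. This is a sketch under stated assumptions: `train_rate_model` is a hypothetical toy trainer that stands in for whatever algorithm actually trains the feature pool models, and `drop_feature_importance` and the data are illustrative only.

```python
from collections import defaultdict

def auc(labels, scores):
    """Chance that a random positive outscores a random negative; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def train_rate_model(rows, labels, features):
    """Toy stand-in trainer: score = sum of per-value positive rates over `features`."""
    stats = {f: defaultdict(lambda: [0, 0]) for f in features}  # value -> [positives, count]
    for row, y in zip(rows, labels):
        for f in features:
            stats[f][row[f]][0] += y
            stats[f][row[f]][1] += 1

    def predict(row):
        score = 0.0
        for f in features:
            positives, count = stats[f].get(row[f], (0, 0))
            score += positives / count if count else 0.5
        return score
    return predict

def drop_feature_importance(train, test, main_features):
    """Main-model AUC minus the AUC of each sub model trained without one target feature."""
    (train_rows, train_labels), (test_rows, test_labels) = train, test
    main = train_rate_model(train_rows, train_labels, main_features)
    main_auc = auc(test_labels, [main(r) for r in test_rows])
    importance = {}
    for target in main_features:
        rest = [f for f in main_features if f != target]
        sub = train_rate_model(train_rows, train_labels, rest)
        importance[target] = main_auc - auc(test_labels, [sub(r) for r in test_rows])
    return importance

# Hypothetical data: "a" determines the label, "b" is constant.
data_rows = [{"a": 0, "b": 0}, {"a": 0, "b": 0}, {"a": 1, "b": 0}, {"a": 1, "b": 0}]
data_labels = [0, 0, 1, 1]
result = drop_feature_importance((data_rows, data_labels), (data_rows, data_labels), ["a", "b"])
print(result)  # {'a': 0.5, 'b': 0.0}
```

Unlike replacing values in the test set, this variant retrains one sub model per target feature, so its cost grows with the number of features being evaluated.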
Optionally, in the method, the at least one feature pool model includes multiple single-feature models, where a single-feature model is a machine learning model that provides a prediction result for the machine learning problem based on a single target feature whose importance is to be determined, and where, in step (C), the importance of the corresponding target features is determined according to the differences between the performance of the single-feature models on the original test data set.
Optionally, in the method, the discretization operation includes a basic binning operation and at least one additional operation.
Optionally, in the method, the at least one additional operation includes at least one of the following kinds of operation: a logarithm operation, an exponential operation, an absolute value operation, and a Gaussian transformation operation.
Optionally, in the method, the at least one additional operation includes an additional binning operation that uses the same binning method as the basic binning operation but different binning parameters; or the at least one additional operation includes an additional binning operation that uses a different binning method from the basic binning operation.
Optionally, in the method, the basic binning operation and the additional binning operation correspond respectively to equal-width binning operations of different widths or to equal-depth binning operations of different depths.
Optionally, in the method, the different widths or different depths numerically form a geometric sequence or an arithmetic sequence.
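Equal-width binning at several widths that form a geometric sequence can be sketched as follows. The function names and the default parameters (`base_width`, `ratio`, `count`) are hypothetical, and placing the first bin boundary at 0 is an assumption.

```python
import math

def equal_width_bin(value, width):
    """Index of the equal-width bin containing `value`, with bin boundaries at multiples of `width`."""
    return math.floor(value / width)

def multi_width_bins(value, base_width=1.0, ratio=10.0, count=3):
    """Bin one continuous value at widths base_width, base_width*ratio, base_width*ratio**2, ..."""
    widths = [base_width * ratio ** i for i in range(count)]
    return [(width, equal_width_bin(value, width)) for width in widths]

print(multi_width_bins(61.5))  # [(1.0, 61), (10.0, 6), (100.0, 0)]
```

Each (width, bin index) pair can then serve as a separate discrete feature, so a single continuous value is described at several granularities at once.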
Optionally, in the method, the step of performing the basic binning operation and/or the additional binning operation includes: additionally providing an outlier bin, so that continuous features with outlier values are assigned to the outlier bin.
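A minimal sketch of binning with a dedicated outlier bin follows. How outliers are detected is not specified in the text above; the fixed `[low, high)` range, the treatment of `None` as an outlier, and the function name are all assumptions made for illustration.

```python
import math

def bin_with_outlier_bin(value, low, high, width):
    """Equal-width bin index inside [low, high); everything else goes to the outlier bin."""
    if value is None or not (low <= value < high):
        return "outlier"
    return math.floor((value - low) / width)

print(bin_with_outlier_bin(25, 0, 100, 10))   # 2
print(bin_with_outlier_bin(1e9, 0, 100, 10))  # outlier
```

Keeping extreme values in their own bin prevents them from stretching the bin range or landing in an arbitrary edge bin.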
Optionally, in the method, in step (B), the feature pool model is trained based on a logistic regression algorithm.
Optionally, in the method, the performance of a feature pool model includes the AUC (area under the ROC curve) of the feature pool model.
Optionally, in the method, the original test data set consists of the acquired historical data records, where, in step (B), the acquired historical data records are divided into multiple groups of historical data records so as to train each feature pool model step by step, and step (B) further includes: using the feature pool model trained on the current group of historical data records to make predictions for the next group of historical data records so as to obtain a group AUC corresponding to the next group, and combining the group AUCs to obtain the AUC of the feature pool model, where, after the group AUC corresponding to the next group of historical data records is obtained, the feature pool model trained on the current group continues to be trained using the next group of historical data records.
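The group-by-group procedure above can be sketched as follows. `CountModel` is a hypothetical toy model that can be updated incrementally, standing in for the feature pool model, and combining the group AUCs by taking their mean is an assumption, since the text does not say how they are combined.

```python
from collections import defaultdict

def auc(labels, scores):
    """Chance that a random positive outscores a random negative; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

class CountModel:
    """Toy incremental model: predicts the positive rate seen so far for a feature value."""
    def __init__(self, feature):
        self.feature = feature
        self.stats = defaultdict(lambda: [0, 0])  # value -> [positives, count]

    def update(self, rows, labels):
        for row, y in zip(rows, labels):
            entry = self.stats[row[self.feature]]
            entry[0] += y
            entry[1] += 1

    def predict(self, row):
        positives, count = self.stats.get(row[self.feature], (0, 0))
        return positives / count if count else 0.5

def progressive_auc(model, groups):
    """Score each group before training on it; return the mean group AUC (assumed combination)."""
    group_aucs = []
    for index, (rows, labels) in enumerate(groups):
        if index > 0:  # the first group is used for training only
            group_aucs.append(auc(labels, [model.predict(r) for r in rows]))
        model.update(rows, labels)
    return sum(group_aucs) / len(group_aucs)

# Hypothetical data: three groups in which "a" determines the label.
groups = [([{"a": 0}, {"a": 1}], [0, 1]) for _ in range(3)]
print(progressive_auc(CountModel("a"), groups))  # 1.0
```

Because every record is scored before the model has been trained on it, the same historical data records can serve as both training data and test data.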
Optionally, in the method, in step (B), when the feature pool model trained on the current group of historical data records is used to make predictions for the next group of historical data records, and the next group includes missing historical data records that lack the attribute information needed to generate at least some of the features on which the feature pool model is based, the group AUC corresponding to the next group is obtained by one of the following: computing the group AUC using only the prediction results of the historical data records in the next group other than the missing historical data records; computing the group AUC using the prediction results of all historical data records in the next group, where the prediction result of each missing historical data record is set to a default value that is determined from the value range of the prediction results or from the label distribution of the acquired historical data records; or multiplying the AUC computed from the prediction results of the historical data records in the next group other than the missing historical data records by the proportion of those other records in the next group.
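The three ways of handling missing historical data records can be sketched side by side. The function name, the strategy labels, and the default of 0.5 (the midpoint of a [0, 1] prediction range) are assumptions; a missing record is represented here by a score of `None`.

```python
def auc(labels, scores):
    """Chance that a random positive outscores a random negative; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def group_auc(labels, scores, strategy="skip", default=0.5):
    """`scores[i]` is None when record i lacks the attributes the model needs."""
    if strategy == "default":  # predict a fixed default for missing records
        filled = [default if s is None else s for s in scores]
        return auc(labels, filled)
    kept = [(y, s) for y, s in zip(labels, scores) if s is not None]
    kept_auc = auc([y for y, _ in kept], [s for _, s in kept])
    if strategy == "skip":     # ignore missing records entirely
        return kept_auc
    if strategy == "scale":    # shrink the AUC by the share of usable records
        return kept_auc * len(kept) / len(labels)
    raise ValueError(f"unknown strategy: {strategy}")

labels = [0, 1, 0, 1]
scores = [0.1, 0.9, None, None]
print(group_auc(labels, scores, "skip"))     # 1.0
print(group_auc(labels, scores, "default"))  # 0.875
print(group_auc(labels, scores, "scale"))    # 0.5
```

The "skip" strategy ignores coverage, whereas "default" and "scale" both lower the score of a model whose features are often unavailable.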
Optionally, in the method, in step (B), when the feature pool model is trained based on a logistic regression algorithm, the regularization term set for non-continuous features differs from the regularization term set for continuous features.
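One way to give different features different regularization terms is a per-feature L2 penalty, sketched below. The text does not specify the form of the regularization, so the L2 penalty, the penalty values, the omission of a bias term, and the assignment of columns to feature types are all assumptions.

```python
import math

def train_logistic(rows, labels, penalties, lr=0.1, epochs=200):
    """Full-batch gradient descent for logistic regression with a per-feature L2 penalty."""
    weights = [0.0] * len(penalties)
    for _ in range(epochs):
        grads = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            z = sum(w * v for w, v in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))
            for j, v in enumerate(x):
                grads[j] += (p - y) * v / len(rows)
        # Each weight is shrunk by its own penalty coefficient.
        weights = [w - lr * (g + lam * w) for w, g, lam in zip(weights, grads, penalties)]
    return weights

# Hypothetical data: two identical columns, so only the penalties differ.
rows = [[1, 1], [-1, -1], [1, 1], [-1, -1]]
labels = [1, 0, 1, 0]
# Assumed setup: column 0 is penalty-free, column 1 (say, a discretized continuous feature) is penalized.
weights = train_logistic(rows, labels, penalties=[0.0, 1.0])
print(weights[0] > weights[1] > 0)  # True
```

With identical inputs, the more heavily penalized column ends up with the smaller weight, which is the effect a feature-specific regularization term is meant to have.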
Optionally, in the method, step (B) further includes: providing the user with an interface for configuring at least one of the following items of a feature pool model: the at least some features on which the feature pool model is based, the algorithm type of the feature pool model, the algorithm parameters of the feature pool model, the operation type of the discretization operation, and the operation parameters of the discretization operation; and, in step (B), each feature pool model is trained according to the items the user has configured through the interface.
Optionally, in the method, in step (B), the interface is provided to the user in response to the user's instruction to determine feature importance.
Optionally, the method further includes: (D) graphically presenting the determined importance of each feature to the user.
Optionally, in the method, in step (D), the features are presented in order of importance, and/or some of the features are highlighted, where the highlighted features include important features corresponding to high importance, unimportant features corresponding to low importance, and/or abnormal features corresponding to abnormal importance.
According to another exemplary embodiment of the present invention, a system for determining the importance of each feature of a machine learning sample is provided, including: a data record acquisition device for acquiring historical data records, where each historical data record includes a label for a machine learning problem and at least one piece of attribute information used to generate the features of a machine learning sample; a model training device for training at least one feature pool model using the acquired historical data records, where a feature pool model is a machine learning model that provides a prediction result for the machine learning problem based on at least some of the features; and an importance determination device for obtaining the performance of the at least one feature pool model and determining the importance of each feature according to the obtained performance, where the model training device trains the feature pool model by performing a discretization operation on at least one continuous feature among the at least some features.
Optionally, in the system, the importance determination device determines the importance of a feature on which a feature pool model is based according to the difference between the performance of the feature pool model on an original test data set and its performance on a transformed test data set, where the transformed test data set is obtained by replacing, in the original test data set, the values of the target feature whose importance is to be determined with one of the following: a null value, a random value, or values obtained by shuffling the order of the target feature's original values.
Optionally, in the system, the at least one feature pool model includes one all-feature model, where the all-feature model is a machine learning model that provides a prediction result for the machine learning problem based on all of the features.
Optionally, in the system, the at least one feature pool model includes multiple machine learning models that provide prediction results for the machine learning problem based on different feature groups, where the importance determination device determines the importance of each feature according to the differences between the performance of the at least one feature pool model on the original test data set.
Optionally, in the system, the at least one feature pool model includes one or more main feature pool models and at least one sub feature pool model corresponding to each main feature pool model, where a sub feature pool model is a machine learning model that provides a prediction result for the machine learning problem based on the features of its corresponding main feature pool model other than the target feature whose importance is to be determined, and where the importance determination device determines the importance of the corresponding target feature according to the difference between the performance of the main feature pool model and that of each of its sub feature pool models on the original test data set.
Optionally, in the system, the at least one feature pool model includes multiple single-feature models, where a single-feature model is a machine learning model that provides a prediction result for the machine learning problem based on a single target feature whose importance is to be determined, and where the importance determination device determines the importance of the corresponding target features according to the differences between the performance of the single-feature models on the original test data set.
Optionally, in the system, the discretization operation includes a basic binning operation and at least one additional operation.
Optionally, in the system, the at least one additional operation includes at least one of the following kinds of operation: a logarithm operation, an exponential operation, an absolute value operation, and a Gaussian transformation operation.
Optionally, in the system, the at least one additional operation includes an additional binning operation that uses the same binning method as the basic binning operation but different binning parameters; or the at least one additional operation includes an additional binning operation that uses a different binning method from the basic binning operation.
Optionally, in the system, the basic binning operation and the additional binning operation correspond respectively to equal-width binning operations of different widths or to equal-depth binning operations of different depths.
Optionally, in the system, the different widths or different depths numerically form a geometric sequence or an arithmetic sequence.
Optionally, in the system, performing the basic binning operation and/or the additional binning operation includes: additionally providing an outlier bin, so that continuous features with outlier values are assigned to the outlier bin.
Optionally, in the system, the model training device trains the feature pool model based on a logistic regression algorithm.
Optionally, in the system, the performance of a feature pool model includes the AUC of the feature pool model.
Optionally, in the system, the original test data set consists of the acquired historical data records, where the model training device divides the acquired historical data records into multiple groups of historical data records so as to train each feature pool model step by step, and the model training device also uses the feature pool model trained on the current group of historical data records to make predictions for the next group of historical data records so as to obtain a group AUC corresponding to the next group, and combines the group AUCs to obtain the AUC of the feature pool model, where, after the group AUC corresponding to the next group of historical data records is obtained, the feature pool model trained on the current group continues to be trained using the next group of historical data records.
Optionally, in the system, when the model training device uses the feature pool model trained on the current group of historical data records to make predictions for the next group of historical data records, and the next group includes missing historical data records that lack the attribute information needed to generate at least some of the features on which the feature pool model is based, the group AUC corresponding to the next group is obtained by one of the following: computing the group AUC using only the prediction results of the historical data records in the next group other than the missing historical data records; computing the group AUC using the prediction results of all historical data records in the next group, where the prediction result of each missing historical data record is set to a default value that is determined from the value range of the prediction results or from the label distribution of the acquired historical data records; or multiplying the AUC computed from the prediction results of the historical data records in the next group other than the missing historical data records by the proportion of those other records in the next group.
Optionally, in the system, when the model training device trains the feature pool model based on a logistic regression algorithm, the regularization term set for non-continuous features differs from the regularization term set for continuous features.
Optionally, the system further includes a display device, where the model training device also controls the display device to provide the user with an interface for configuring at least one of the following items of a feature pool model: the at least some features on which the feature pool model is based, the algorithm type of the feature pool model, the algorithm parameters of the feature pool model, the operation type of the discretization operation, and the operation parameters of the discretization operation; and the model training device trains each feature pool model according to the items the user has configured through the interface.
Optionally, in the system, the model training device controls the display device to provide the interface to the user in response to the user's instruction to determine feature importance.
Optionally, in the system, the display device also graphically presents the determined importance of each feature to the user.
Optionally, in the system, the display device presents the features in order of importance, and/or highlights some of the features, where the highlighted features include important features corresponding to high importance, unimportant features corresponding to low importance, and/or abnormal features corresponding to abnormal importance.
According to another exemplary embodiment of the present invention, a computing device for determining the importance of each feature of a machine learning sample is provided, including a storage unit and a processor, where the storage unit stores a set of computer-executable instructions that, when executed by the processor, perform the following steps: (A) acquiring historical data records, where each historical data record includes a label for a machine learning problem and at least one piece of attribute information used to generate the features of a machine learning sample; (B) training at least one feature pool model using the acquired historical data records, where a feature pool model is a machine learning model that provides a prediction result for the machine learning problem based on at least some of the features; and (C) obtaining the performance of the at least one feature pool model and determining the importance of each feature according to the obtained performance, where, in step (B), the feature pool model is trained by performing a discretization operation on at least one continuous feature among the at least some features.
Optionally, in the computing device, in step (C), the importance of a feature on which a feature pool model is based is determined according to the difference between the performance of the feature pool model on an original test data set and its performance on a transformed test data set, where the transformed test data set is obtained by replacing, in the original test data set, the values of the target feature whose importance is to be determined with one of the following: a null value, a random value, or values obtained by shuffling the order of the target feature's original values.
Optionally, in the computing device, the at least one feature pool model includes one all-feature model, where the all-feature model is a machine learning model that provides a prediction result for the machine learning problem based on all of the features.
Optionally, in the computing device, the at least one feature pool model includes multiple machine learning models that provide prediction results for the machine learning problem based on different feature groups, where, in step (C), the importance of each feature is determined according to the differences between the performance of the at least one feature pool model on the original test data set.
Optionally, in the computing device, the at least one feature pool model includes one or more main feature pool models and at least one sub feature pool model corresponding to each main feature pool model, where a sub feature pool model is a machine learning model that provides a prediction result for the machine learning problem based on the features of its corresponding main feature pool model other than the target feature whose importance is to be determined, and where, in step (C), the importance of the corresponding target feature is determined according to the difference between the performance of the main feature pool model and that of each of its sub feature pool models on the original test data set.
Optionally, in the computing device, the at least one feature pool model includes multiple single-feature models, where a single-feature model is a machine learning model that provides a prediction result for the machine learning problem based on a single target feature whose importance is to be determined, and where, in step (C), the importance of the corresponding target features is determined according to the differences between the performance of the single-feature models on the original test data set.
Optionally, in the computing device, the discretization operation includes a basic binning operation and at least one additional operation.
Optionally, in the computing device, the at least one additional operation includes at least one of the following kinds of operation: a logarithm operation, an exponential operation, an absolute value operation, and a Gaussian transformation operation.
Optionally, in the computing device, the at least one additional operation includes an additional binning operation that uses the same binning method as the basic binning operation but different binning parameters; or the at least one additional operation includes an additional binning operation that uses a different binning method from the basic binning operation.
Optionally, in the computing device, the basic binning operation and the additional binning operation correspond respectively to equal-width binning operations of different widths or to equal-depth binning operations of different depths.
Optionally, in the computing device, the different widths or different depths numerically form a geometric sequence or an arithmetic sequence.
Optionally, in the computing device, the step of performing the basic binning operation and/or the additional binning operation includes: additionally providing an outlier bin, so that continuous features with outlier values are assigned to the outlier bin.
Optionally, in the computing device, in step (B), the feature pool model is trained based on a logistic regression algorithm.
Optionally, in the computing device, the performance of a feature pool model includes the AUC of the feature pool model.
Alternatively, in the computing device, the original test data collection is made of the historgraphic data recording obtained, its In, in step (B), the historgraphic data recording of acquisition is divided into multigroup historgraphic data recording to train each feature step by step Pool model, also, step (B) further includes:It is directed to using the feature pool model after currently group historgraphic data recording training Next group of historgraphic data recording perform prediction with obtain it is corresponding with the next group of historgraphic data recording be grouped AUC, it is and comprehensive Each packet AUC obtains the AUC of feature pool model, wherein, obtaining divide corresponding with the next group of historgraphic data recording After group AUC, continue training by the current group historgraphic data recording training using the next group of historgraphic data recording Feature pool model afterwards.
Optionally, in the computing device, in step (B), when the feature pool model trained on the current group of historical data records is used to perform prediction on the next group of historical data records, and the next group includes missing historical data records that lack the attribute information needed to generate at least a part of the features on which the feature pool model is based, the group AUC corresponding to the next group is obtained by one of the following: calculating the group AUC using only the prediction results of the historical data records in the next group other than the missing historical data records; calculating the group AUC using the prediction results of all historical data records in the next group, wherein the prediction result of each missing historical data record is set to a default value, the default value being determined based on the value range of the prediction results or based on the label distribution of the acquired historical data records; or multiplying the AUC calculated from the prediction results of the historical data records other than the missing ones by the proportion of those other records in the next group to obtain the group AUC.
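The three ways of handling missing records can be compared in a small sketch. The function name `group_auc`, the mode names, the use of `None` to mark a record that could not be scored, and the default of 0.5 (the midpoint of a [0, 1] prediction range) are all assumptions made for illustration.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def group_auc(labels, preds, mode, default=0.5):
    """Group AUC when some predictions are missing (marked as None)."""
    kept = [(y, p) for y, p in zip(labels, preds) if p is not None]
    kept_labels = [y for y, _ in kept]
    kept_preds = [p for _, p in kept]
    if mode == "skip":
        # Option 1: use only the records that could be scored.
        return auc(kept_labels, kept_preds)
    if mode == "default":
        # Option 2: score every record, filling missing predictions with a default.
        return auc(labels, [default if p is None else p for p in preds])
    if mode == "scaled":
        # Option 3: AUC of the scored records times their share of the group.
        return auc(kept_labels, kept_preds) * len(kept) / len(labels)
    raise ValueError(f"unknown mode: {mode}")


labels = [1, 0, 1, 0]
preds = [0.9, 0.2, None, 0.6]  # the third record lacks the needed attributes
for mode in ("skip", "default", "scaled"):
    print(mode, group_auc(labels, preds, mode))
```

The second and third options penalize a model whose features are often unavailable, whereas the first judges it only on the records it can score.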
Optionally, in the computing device, in step (B), when the feature pool model is trained based on a logistic regression algorithm, the regularization term set for continuous features is different from the regularization term set for non-continuous features.
Optionally, in the computing device, step (B) further includes: providing the user with an interface for configuring at least one of the following items of the feature pool model: the at least a part of the features on which the feature pool model is based, the algorithm type of the feature pool model, the algorithm parameters of the feature pool model, the operation type of the discretization operation, and the operation parameters of the discretization operation; and, in step (B), the feature pool models are respectively trained according to the items configured by the user through the interface.
Optionally, in the computing device, in step (B), the interface is provided to the user in response to the user's instruction to determine feature importance.
Optionally, in the computing device, when the set of computer-executable instructions is executed by the processor, the following step is also performed: (D) graphically presenting the determined importance of each feature to the user.
Optionally, in the computing device, in step (D), the features are presented in order of their importance, and/or a part of the features is highlighted, wherein the part of the features includes important features corresponding to high importance, unimportant features corresponding to low importance, and/or abnormal features corresponding to abnormal importance.
In the method and system for determining the feature importance of machine learning samples according to exemplary embodiments of the present invention, the importance of each feature is determined from the effect of feature pool models that are based on at least a part of the features of the machine learning samples, wherein, when a feature pool model is trained, the continuous features among the at least a part of the features undergo discretization. In this way, the effect of a feature pool model can effectively reflect the significance of the relevant features, so that the importance of each feature can be effectively derived.
Brief description of the drawings
These and/or other aspects and advantages of the present invention will become clearer and easier to understand from the following detailed description of embodiments of the present invention taken in conjunction with the accompanying drawings, in which:
Fig. 1 shows a block diagram of a system for determining the feature importance of machine learning samples according to an exemplary embodiment of the present invention;
Fig. 2 shows a flowchart of a method for determining the feature importance of machine learning samples according to an exemplary embodiment of the present invention;
Fig. 3 shows a flowchart of a method for determining the feature importance of machine learning samples according to another exemplary embodiment of the present invention;
Fig. 4 shows an example of a feature importance display interface according to an exemplary embodiment of the present invention; and
Fig. 5 shows an example of a feature importance display interface according to another exemplary embodiment of the present invention.
Detailed description of the embodiments
In order that those skilled in the art may better understand the present invention, exemplary embodiments of the present invention are described in further detail below with reference to the accompanying drawings and specific embodiments.
In an exemplary embodiment of the present invention, feature importance is determined as follows: a feature pool model is trained based on at least a part of the features of the machine learning samples, wherein continuous features undergo discretization. On this basis, the importance of each feature is measured from the prediction effect of the feature pool model.
Here, machine learning is an inevitable product of the development of artificial intelligence research to a certain stage; it is devoted to improving the performance of a system itself through computational means, using experience. In computer systems, "experience" usually exists in the form of "data", and a machine learning algorithm can produce a "model" from data. That is, when empirical data is supplied to a machine learning algorithm, a model can be produced based on that data, and when faced with a new situation, the model gives a corresponding judgment, i.e., a prediction result. Whether a machine learning model is being trained or a trained model is being used for prediction, the data needs to be converted into machine learning samples comprising various features. Machine learning may be implemented as "supervised learning", "unsupervised learning", or "semi-supervised learning"; it should be noted that the present invention places no particular limitation on the specific machine learning algorithm. It should also be noted that other means, such as statistical algorithms, may be combined in the course of training and applying the model.
Fig. 1 shows a block diagram of a system for determining the feature importance of machine learning samples according to an exemplary embodiment of the present invention. Specifically, the feature importance determination system measures the importance of each corresponding feature using the prediction effect of feature pool models based on at least a part of the features, wherein at least a part of the original continuous features on which a feature pool model is based undergo discretization. In this way, the importance of each feature (especially of continuous features) can be determined more effectively.
The system shown in Fig. 1 may be implemented entirely in software by a computer program, by dedicated hardware devices, or by a combination of software and hardware. Accordingly, each device constituting the system shown in Fig. 1 may be a virtual module that relies solely on a computer program to implement the corresponding function, a general-purpose or dedicated device that implements the function through its hardware structure, or a processor or the like running the corresponding computer program. Using the system, the importance of each feature of the machine learning samples can be determined, and this importance information helps with model training and/or model interpretation.
As shown in Fig. 1, the data record acquisition device 100 is used to acquire historical data records, wherein each historical data record includes a label regarding the machine learning problem and at least one piece of attribute information used to generate the features of the machine learning samples.
The above historical data records may be data produced online, data generated and stored in advance, or data received from an external device through an input device or a transmission medium, for example, data that the cloud receives from a client or data that a client receives from the cloud. These data may relate to information about individuals, enterprises, or organizations, for example, identity, educational background, occupation, assets, contact details, liabilities, income, profit, and tax payment. Alternatively, these data may relate to information about business-related items, for example, the transaction amount of a contract, the parties to the transaction, the subject matter, and the place of the transaction. It should be noted that the attribute information mentioned in the exemplary embodiments of the present invention may relate to the performance or nature of any object or matter in some respect, and is not limited to defining or describing individuals, objects, organizations, units, institutions, items, events, and so on.
The data record acquisition device 100 may acquire structured or unstructured data from different sources, for example, text data or numerical data. The acquired historical data records can be used to form machine learning samples that take part in the training and/or testing of machine learning models. These data may originate from inside the entity that wishes to apply machine learning, for example, from a bank, enterprise, or school that wishes to apply machine learning; they may also originate from outside such an entity, for example, from data providers, the Internet (e.g., social networking sites), mobile operators, app operators, courier companies, credit agencies, and so on. Optionally, the internal data and external data may be used in combination to form machine learning samples that carry more information, which makes it easier to discover features of higher importance.
The above data may be input to the data record acquisition device 100 through an input device, generated automatically by the data record acquisition device 100 from existing data, or obtained by the data record acquisition device 100 from a network (e.g., from a storage medium on the network, such as a data warehouse); in addition, an intermediate data exchange device such as a server may help the data record acquisition device 100 obtain the corresponding data from an external data source. Here, the acquired data may be converted into an easily processed format by a data conversion module, such as a text analysis module, in the data record acquisition device 100. That is, the data record acquisition device 100 may be a device capable of receiving and processing data records, or merely a device that provides data records that have already been prepared. It should be noted that the data record acquisition device 100 may be configured as modules composed of software, hardware, and/or firmware, and some or all of these modules may be integrated into one or may cooperate to complete a specific function.
The model training device 200 is used to train at least one feature pool model using the acquired historical data records, wherein a feature pool model is a machine learning model that provides a prediction result for the machine learning problem based on at least a part of the features, and wherein the model training device 200 trains the feature pool model by performing a discretization operation on at least one continuous feature among the at least a part of the features.
Here, a feature pool model is designed to be based on at least a part of the features of the machine learning samples; accordingly, the model training device 200 can produce the training samples of a feature pool model from the historical data records. Specifically, assume that a historical data record has attribute information {p1, p2, …, pm} and a corresponding label (where m is a positive integer). Based on the attribute information and the label, machine learning samples corresponding to the machine learning problem can be produced, and these samples are applied to model training and/or testing for the machine learning problem. The feature part of such a machine learning sample can be expressed as {f1, f2, …, fn} (where n is a positive integer), and the exemplary embodiments of the present invention aim to determine the importance of each feature among {f1, f2, …, fn}. To this end, the model training device 200 needs to train feature pool models that provide prediction results for the machine learning problem based on at least a part of the features. Here, the model training device 200 may select at least a part of the features from {f1, f2, …, fn} as the features of the training samples of a feature pool model, and use the label of the corresponding historical data record as the label of the training sample. According to an exemplary embodiment of the present invention, some or all of the continuous features among the selected features undergo discretization.
Here, the model training device 200 may train one or more feature pool models. In one approach, the importance of the corresponding features is derived from the difference between the prediction effects of the same feature pool model (which may be based on all or a part of the features of the machine learning samples) on the original test data set and on a transformed test data set, wherein the transformed test data set is obtained by transforming the values of certain target features in the original test data set; the difference in prediction effect then reflects the predictive contribution, i.e., the importance, of the target features. Alternatively, the importance of the corresponding features is derived from the difference between the prediction effects of different feature pool models on the same test data set (i.e., the original test data set), where the different feature pool models are designed to be based on different feature combinations; the difference in prediction effect then reflects the respective predictive contributions, i.e., the importance, of the different features. In particular, a single-feature model may be trained for each feature of the machine learning samples, in which case the prediction effect of the single-feature model represents the importance of the feature on which it is based. It should be noted that these two ways of measuring feature importance may be used alone or in combination.
As described above, according to an exemplary embodiment of the present invention, when training a feature pool model, the model training device 200 may train it by performing a discretization operation on at least one continuous feature. Here, the model training device 200 may process the continuous features with any appropriate discretization method, so that a feature pool model trained on the discretized continuous features (alone or together with other features) better reflects the importance of each feature.
Here, as an example, the discretization operation may include a basic binning operation and at least one additional operation. Accordingly, when training a feature pool model, the model training device 200 may, for each of certain continuous features on which the feature pool model is based, perform the basic binning operation and the at least one additional operation respectively, so as to generate a basic binning feature and at least one additional feature corresponding to each continuous feature.
Here, among the features of the machine learning samples, there may be continuous features produced from at least a part of the attribute information of the data records. A continuous feature is the counterpart of a discrete feature (e.g., a categorical feature); its values are numerical values with a certain continuity, for example, distance, age, or amount of money. In contrast, as an example, the values of a discrete feature have no continuity; they may be, for example, unordered categories such as "from Beijing", "from Shanghai", or "from Tianjin", or "gender is male" or "gender is female".
For example, certain continuous-valued attributes in the historical data records can serve directly as the corresponding continuous features of the machine learning samples; attributes such as distance, age, and amount can be used directly as continuous features. In addition, certain attributes in the historical data records (e.g., continuous attributes and/or categorical attributes) may be processed to obtain corresponding continuous features, for example, by using the ratio of height to weight as a continuous feature.
It should be noted that, besides the continuous features that undergo the basic binning operation and the additional operation, the training samples of a feature pool model may also include other continuous features and/or discrete features of the machine learning samples, wherein the other continuous features may take part in the training of the feature pool model without undergoing a discretization operation.
It can be seen that, according to an exemplary embodiment of the present invention, for each continuous feature that undergoes the basic binning operation, at least one additional operation can also be performed, so that multiple features are obtained at the same time that characterize certain attributes of the original data record from different angles, scales, or levels.
Here, a binning operation refers to a particular way of discretizing a continuous feature: the value range of the continuous feature is divided into multiple intervals (i.e., multiple bins), and the corresponding binning feature value is determined based on the bins. Binning operations can generally be divided into supervised binning and unsupervised binning, each of which includes several specific binning modes. For example, supervised binning includes minimum-entropy binning and minimum description length binning, while unsupervised binning includes equal-width binning, equal-depth binning, and binning based on k-means clustering. Under each binning mode, corresponding binning parameters can be set, for example, width or depth. It should be noted that, according to an exemplary embodiment of the present invention, the binning operations performed by the model training device 200 are not limited in binning mode or binning parameters, and the specific representation of the resulting binning features is likewise not limited.
In addition to the basic binning operation, the model training device 200 may also perform at least one additional operation on the continuous feature. Here, an additional operation may be any function operation that produces a continuous or discrete feature; for example, it may be a logarithm operation, an exponential operation, or an absolute-value operation. In particular, an additional operation may itself be a binning operation (referred to as an "additional binning operation"), which differs from the basic binning operation in binning mode and/or binning parameters. It can be seen that the at least one additional operation may consist of operations of the same or different types, each under the same or different operation parameters (e.g., the exponent in an exponential operation, the base in a logarithm operation, the depth or width in a binning operation). Here, an additional operation may also be an expression whose main body is a logarithm, exponential, or absolute-value operation, or a combination of several operations.
In this way, the model training device 200 can convert each of at least a part of the continuous features into a basic binning feature and at least one corresponding additional feature, which improves the effectiveness of the machine learning material for the feature pool model and provides a better basis for the subsequent determination of feature importance.
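As a rough illustration of turning one continuous value into a basic binning feature plus additional features, the sketch below applies equal-width binning at several widths and one logarithm operation. The function name `multi_scale_features`, the feature names, the particular widths (a geometric sequence 2, 8, 32), and the choice of `log1p` as the extra operation are all assumptions for the example, not values taken from the text.

```python
import math


def multi_scale_features(value, lower, widths):
    """One equal-width bin label per width, plus a logarithm feature.

    The first width plays the role of the basic binning operation; the
    remaining widths are additional binning operations with different
    parameters. Assumes value >= lower so the logarithm is defined.
    """
    features = {f"bin_w{w}": math.floor((value - lower) / w) for w in widths}
    # A non-binning additional operation: log(1 + offset from the lower bound).
    features["log1p"] = math.log1p(value - lower)
    return features


# Widths forming a geometric sequence with ratio 4: 2, 8, 32.
widths = [2 * 4 ** k for k in range(3)]
feats = multi_scale_features(61.5, 0, widths)
print(feats)
```

Narrow bins keep fine detail while wide bins generalize better, so a model that sees several scales at once does not depend on a single hand-picked width.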
Next, the model training device 200 can produce training samples that include at least the generated basic binning features and the at least one additional feature, for training the corresponding feature pool model. Here, besides the basic binning features and additional features produced by the model training device 200, a training sample may include any other features, wherein the other features may be features belonging to the machine learning samples that are to be produced from the historical data records.
The model training device 200 can train the feature pool model based on the above training samples. Here, the model training device 200 may use an appropriate machine learning algorithm (e.g., logistic regression) to learn an appropriate feature pool model from the training samples.
The importance determination device 300 is used to obtain the effect of the at least one trained feature pool model and to determine the importance of each feature according to the obtained effect. Here, the importance determination device 300 may obtain the effect of a feature pool model by applying the trained model to a corresponding test data set, or may receive the effect of the feature pool model from another party connected to it.
Specifically, the performance of a feature pool model on a test set can serve as the prediction effect of that model, and this prediction effect can be used to measure the predictive power of the feature group on which the model is based. By measuring the difference in effect between different feature pool models on the original test data set, or the difference in effect of the same feature pool model on different test data sets, the importance of each feature of the machine learning samples can be derived.
Here, as an example, the effect of a feature pool model may include the AUC of the feature pool model (the area under the ROC (receiver operating characteristic) curve, Area Under ROC Curve).
For example, assume that a certain feature pool model is based on three features {f1, f3, f5} among the feature part {f1, f2, …, fn} of the machine learning samples, and that the continuous feature f1 among them has undergone discretization in the training samples of the feature pool model. The AUC of this feature pool model on the test data set then reflects the predictive power of the feature combination {f1, f3, f5}. In addition, assume that another feature pool model is based on the two features {f1, f3}, where the continuous feature f1 has likewise undergone discretization; the AUC of this model on the test data set reflects the predictive power of the feature combination {f1, f3}. On this basis, the difference between the two AUCs can be used to reflect the importance of feature f5.
As another example, assume that a certain feature pool model is based on three features {f1, f3, f5} among the feature part {f1, f2, …, fn} of the machine learning samples, and that the continuous feature f1 among them has undergone discretization in the training samples of the feature pool model. The AUC of this feature pool model on the original test data set then reflects the predictive power of the feature combination {f1, f3, f5}. Here, to determine the importance of the target feature f5, the values of feature f5 in each test sample of the original test data set can be processed to obtain a transformed test data set, and the AUC of the feature pool model on the transformed test data set is then obtained. On this basis, the difference between the two AUCs can be used to reflect the importance of the target feature f5. As an example, in the transformation, the value of feature f5 in each original test sample may be replaced with a null value, a random value, or a value obtained by shuffling the order of the original values of feature f5.
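The shuffling variant of this transformation can be sketched in a few lines. Everything named here is illustrative: `permutation_importance` and `auc` are hypothetical helpers, the test rows are synthetic, and `predict` is a stand-in for a trained feature pool model that, by construction, uses only f5.

```python
import random


def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_importance(predict, rows, labels, target, seed=0):
    """AUC on the original test set minus AUC after shuffling one feature."""
    original_auc = auc(labels, [predict(r) for r in rows])
    shuffled = [r[target] for r in rows]
    random.Random(seed).shuffle(shuffled)
    # Transformed test set: same rows, target feature values reordered.
    transformed = [dict(r, **{target: v}) for r, v in zip(rows, shuffled)]
    return original_auc - auc(labels, [predict(r) for r in transformed])


def predict(row):
    # Stand-in for a trained model; it scores a sample by f5 alone.
    return row["f5"]


rows = [{"f1": i % 3, "f3": i % 2, "f5": i} for i in range(20)]
labels = [1 if i >= 10 else 0 for i in range(20)]
for name in ("f3", "f5"):
    print(name, permutation_importance(predict, rows, labels, name))
```

Because the stand-in model ignores f3, shuffling f3 leaves every score unchanged and its importance is exactly zero, while shuffling f5 can only lower the originally perfect AUC.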
It should be understood that each of the above devices may be individually configured as software, hardware, firmware, or any combination thereof that performs a specific function. For example, these devices may correspond to dedicated integrated circuits, to pure software code, or to units or modules that combine software and hardware. In addition, one or more functions implemented by these devices may be performed collectively by components in a physical device (e.g., a processor, client, or server).
A flowchart of a method for determining the feature importance of machine learning samples according to an exemplary embodiment of the present invention is described below with reference to Fig. 2. Here, as an example, the method shown in Fig. 2 may be performed by the feature importance determination system shown in Fig. 1, may be implemented entirely in software by a computer program, or may be performed by a specially configured computing device. For convenience of description, it is assumed that the method shown in Fig. 2 is performed by the feature importance determination system shown in Fig. 1.
As shown in the figure, in step S100, historical data records are acquired by the data record acquisition device 100, wherein each historical data record includes a label regarding the machine learning problem and at least one piece of attribute information used to generate the features of the machine learning samples.
Here, a historical data record is a real record concerning the machine learning problem to be predicted, and it includes two parts: attribute information and a label. Such historical data records are used to form machine learning samples as the material for machine learning, and the exemplary embodiments of the present invention aim to determine the importance of each feature in the machine learning samples so formed.
Specifically, as an example, the data record acquisition device 100 may collect historical data manually, semi-automatically, or fully automatically, or may process the collected raw historical data so that the processed historical data records have an appropriate format or form. As an example, the data record acquisition device 100 may collect historical data in batches.
Here, the data record acquisition device 100 may receive historical data records entered manually by a user through an input device (e.g., a workstation). In addition, the data record acquisition device 100 may retrieve historical data records from a data source systematically in a fully automatic manner, for example, by using a timer mechanism implemented in software, firmware, hardware, or a combination thereof to request the data source systematically and obtain the requested historical data from the response. The data source may include one or more databases or other servers. Fully automatic data acquisition may be carried out over an internal network and/or an external network, which may include transmitting encrypted data over the Internet. Where servers, databases, networks, and so on are configured to communicate with one another, data acquisition can proceed automatically without manual intervention, although some user input operations may still be present in this mode. The semi-automatic mode lies between the manual mode and the fully automatic mode. It differs from the fully automatic mode in that a trigger mechanism activated by the user replaces, for example, the timer mechanism. In this case, a request to extract data is generated only when specific user input is received. Each time data is acquired, the captured historical data is preferably stored in non-volatile memory. As an example, a data warehouse may be used to store the raw data collected during acquisition as well as the processed data.
The historical data records acquired above may originate from the same or different data sources; that is, each historical data record may also be the result of splicing together different historical data records. For example, in addition to the information data record that a customer fills in when applying to a bank for a credit card (which includes attribute fields such as income, educational background, position, and assets), the data record acquisition device 100 may also acquire other data records of the customer at the bank, for example, loan records and daily transaction data, and these acquired data records may be spliced, together with a label indicating whether the customer is a fraudulent customer, into a complete historical data record. In addition, the data record acquisition device 100 may acquire data from other private or public sources, for example, data from data providers, data from the Internet (e.g., social networking sites), data from mobile operators, data from app operators, data from courier companies, data from credit agencies, and so on.
Optionally, the data record acquisition device 100 may store and/or process the collected data by means of a hardware cluster (such as a Hadoop cluster or a Spark cluster), for example, for storage, classification, and other offline operations. In addition, the data record acquisition device 100 may perform online stream processing on the collected data.
As an example, the data record acquisition device 100 may include a data conversion module such as a text analysis module; accordingly, in step S100, the data record acquisition device 100 may convert unstructured data such as text into structured data that is easier to use, for further processing or later reference. Text-based data may include emails, documents, web pages, graphics, spreadsheets, call center logs, transaction reports, and so on.
Next, in step S200, the model training device 200 trains at least one feature pool model using the acquired historical data records, wherein a feature pool model is a machine learning model that provides a prediction result for the machine learning problem based on at least a part of the features, and wherein the feature pool model is trained by performing a discretization operation on at least one continuous feature among the at least a part of the features.
Here, the model training device 200 may perform any appropriate discretization operation on each of the at least one continuous feature. As an example, the model training device 200 may perform a basic binning operation and at least one additional operation to produce a basic binning feature and at least one additional feature corresponding to each continuous feature; the basic binning features and additional features so produced, as the discretized features, may constitute at least a part of the features of the training samples of the feature pool model.
As described above, a continuous feature, as a feature of the machine learning samples, may be produced from at least a part of the attribute information of the historical data records. For example, continuous-valued attribute information such as the distance, age, and amount in a historical data record can serve directly as continuous features; as another example, continuous features can be obtained by further processing certain attribute information of the historical data records, for instance, by using the ratio of height to weight as a continuous feature.
After the continuous features are obtained, the model training device 200 may perform the basic binning operation on them; here, the model training device 200 may perform the basic binning operation according to various binning modes and/or binning parameters.
Take unsupervised equal-width binning as an example. Suppose the value interval of a continuous feature is [0, 100] and the corresponding binning parameter (that is, the width) is 50; then 2 bins are produced. In this case, a continuous feature with the value 61.5 corresponds to the 2nd bin; if the two bins are labeled 0 and 1, the bin corresponding to this feature is labeled 1. Alternatively, suppose the bin width is 10; then 10 bins are produced, a continuous feature with the value 61.5 corresponds to the 7th bin, and if the ten bins are labeled 0 to 9, the corresponding bin is labeled 6. Alternatively, suppose the bin width is 2; then 50 bins are produced, a continuous feature with the value 61.5 corresponds to the 31st bin, and if the fifty bins are labeled 0 to 49, the corresponding bin is labeled 30. As an example, the bin label of a specific continuous feature, and the corresponding feature value, can be determined by online calculation rather than by looking up a mapping table, which saves storage overhead.
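The online calculation mentioned above can be sketched as follows. This is a minimal illustration, not the patented implementation; the function name `equal_width_bin_label` and its parameters are assumptions introduced here.

```python
import math


def equal_width_bin_label(value, lower, width):
    """Return the 0-based label of the equal-width bin that contains `value`.

    `lower` is the lower end of the value interval and `width` is the bin width.
    The label is computed directly, so no mapping table has to be stored.
    """
    return math.floor((value - lower) / width)


# The three cases from the text: interval [0, 100], value 61.5.
for width in (50, 10, 2):
    print(width, equal_width_bin_label(61.5, 0.0, width))
```

Note that a value exactly at the upper end of the interval would receive a label one past the last bin; a real implementation would have to decide how to handle that boundary.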
After a continuous feature has been mapped to multiple bins, the corresponding feature value can be customized to any value. That is, the basic binning operation is performed to produce a multi-dimensional basic binning feature corresponding to the continuous feature. As an example, each dimension may indicate whether the continuous feature has been assigned to the corresponding bin, for example, with "1" indicating that the continuous feature has been assigned to the corresponding bin and "0" indicating that it has not. Accordingly, in the above example, assuming 10 bins are produced, the basic binning feature is a 10-dimensional feature, and the basic binning feature corresponding to a continuous feature with the value 61.5 can be expressed as [0,0,0,0,0,0,1,0,0,0]. Alternatively, each dimension may indicate the feature value of the continuous feature assigned to the corresponding bin; in the above example, the basic binning feature corresponding to a continuous feature with the value 61.5 can then be expressed as [0,0,0,0,0,0,61.5,0,0,0]. Alternatively, each dimension may indicate the average of the feature values of all continuous features assigned to the corresponding bin; or the median of those feature values; or a boundary value of those feature values, where the boundary value may be the upper or the lower boundary value.
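The first two representations can be sketched as below. The helper `encode_bin` and its `mode` argument are hypothetical names used only for this illustration.

```python
def encode_bin(value, label, num_bins, mode="indicator"):
    """Build a multi-dimensional binning feature for one continuous value.

    mode="indicator": the dimension of the assigned bin is 1, all others 0.
    mode="value":     the dimension of the assigned bin holds the raw value.
    """
    vector = [0] * num_bins
    vector[label] = 1 if mode == "indicator" else value
    return vector


# Value 61.5 falls into the bin labeled 6 when [0, 100] is split into 10 bins.
print(encode_bin(61.5, 6, 10))                # indicator form
print(encode_bin(61.5, 6, 10, mode="value"))  # raw-value form
```

The mean, median, and boundary-value variants differ only in what is written into the assigned dimension; they need per-bin statistics gathered from all samples beforehand.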
In addition, the values of the basic binning feature may be normalized to facilitate computation. Suppose the j-th value of the i-th continuous feature to be discretized is x_ij; its binning feature can then be expressed as (BinID, x'_ij), where BinID indicates the label of the bin to which the continuous feature is assigned, the label takes the values 0, 1, ..., B-1, B is the total number of bins, and x'_ij is the normalized value of x_ij. The feature (BinID, x'_ij) indicates that, in the basic binning feature, the dimension corresponding to the bin labeled BinID takes the value x'_ij, and the remaining dimensions take the value 0.
Here, x'_ij can be expressed by the following formula:

$$x'_{ij} = \frac{(x_{ij} - \min_i) \times B}{\max_i - \min_i} - \mathrm{BinID}$$

where max_i is the maximum value of the i-th continuous feature, min_i is the minimum value of the i-th continuous feature, and

$$\mathrm{BinID} = \left\lfloor \frac{(x_{ij} - \min_i) \times B}{\max_i - \min_i} \right\rfloor$$

where ⌊·⌋ is the floor (round-down) operator.
Take unsupervised equal-width binning as an example, and suppose the value interval of the continuous feature is [0, 100]. With a bin width of 50, according to the above formulas, a continuous feature with the value 61.5 corresponds to the basic binning feature (1, 0.23); with a bin width of 10, according to the above formulas, a continuous feature with the value 61.5 corresponds to the basic binning feature (6, 0.15).
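The two worked values can be checked numerically. The sketch below assumes the normalization has the form scaled = (x - min) × B / (max - min), with BinID the floor of `scaled` and x' its fractional part; this form reproduces both (1, 0.23) and (6, 0.15). The function name and the clamping of x = max into the last bin are assumptions added here.

```python
import math


def normalized_bin_feature(x, lo, hi, num_bins):
    """Return (bin_id, x_norm) for value `x` of a feature ranging over [lo, hi]."""
    scaled = (x - lo) * num_bins / (hi - lo)
    # Assumption: a value equal to `hi` is clamped into the last bin (label B-1).
    bin_id = min(math.floor(scaled), num_bins - 1)
    return bin_id, scaled - bin_id


# Width 50 on [0, 100] means B = 2; width 10 means B = 10.
print(normalized_bin_feature(61.5, 0.0, 100.0, 2))
print(normalized_bin_feature(61.5, 0.0, 100.0, 10))
```

The results equal (1, 0.23) and (6, 0.15) up to floating-point rounding.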
Here, to obtain the feature (BinID, x'_ij), in step S200 the model training device 200 may compute BinID and x'_ij for each value x_ij according to the above formulas. Alternatively, the model training device 200 may generate in advance a mapping table of the value range of each BinID, and obtain the BinID corresponding to a continuous feature by looking up that table.
In addition, as an example, before the basic binning operation is performed, noise in the historical data records may be reduced by removing outliers from the continuous features. In this way, the effectiveness of using binning features to determine feature importance can be further improved.
Specifically, an outlier bin may be additionally provided, so that continuous features with outlying values are assigned to the outlier bin. For example, for a continuous feature whose value interval is [0, 1000], a certain number of samples can be selected for pre-binning: equal-width binning is first performed with a bin width of 10, and the number of samples in each bin is then recorded; bins with few samples (for example, fewer than a threshold) can be merged into at least one outlier bin. As an example, if the bins at the two ends contain few samples, those bins can be merged into the outlier bin while the remaining bins are retained. Suppose bins 0 to 10 contain few samples; then bins 0 to 10 can be merged into the outlier bin, so that continuous features with values in [0, 100] are uniformly assigned to the outlier bin.
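A minimal sketch of the pre-binning step follows. The names `dense_bins`, `assign_bin`, and `OUTLIER_BIN`, as well as the choice to send bins never seen during pre-binning to the outlier bin, are assumptions made for illustration.

```python
from collections import Counter

OUTLIER_BIN = -1  # hypothetical label reserved for the outlier bin


def dense_bins(samples, lower, width, min_count):
    """Pre-bin the samples and return the labels of bins holding enough samples."""
    counts = Counter(int((v - lower) // width) for v in samples)
    return {label for label, n in counts.items() if n >= min_count}


def assign_bin(value, lower, width, dense):
    """Map a value to its equal-width bin, or to the outlier bin if that bin is sparse."""
    label = int((value - lower) // width)
    return label if label in dense else OUTLIER_BIN


pre_binning_samples = [5, 205, 205, 205, 215, 215, 215]  # toy data
dense = dense_bins(pre_binning_samples, lower=0, width=10, min_count=2)
print(assign_bin(207, 0, 10, dense))  # kept in its own bin
print(assign_bin(3, 0, 10, dense))    # sparse bin, merged into the outlier bin
```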
In addition to the basic binning operation described above, in step S200 the model training device 200 also performs, on each continuous feature subjected to the basic binning operation, at least one additional operation different from the basic binning operation, to obtain at least one corresponding additional feature.
Here, the additional operation may be any function operation, and such function operations may have corresponding operation parameters. The additional operation performed on a single continuous feature may be one operation or multiple operations, and the multiple operations may be operations of different types or operations of the same type with different operation parameters.
In particular, the additional operation may also be a binning operation. Here, similarly to the basic binning feature, the additional binning feature produced by the additional binning operation may also be a multi-dimensional feature, where each dimension indicates whether the continuous feature has been assigned to the corresponding bin; or each dimension indicates the feature value of the continuous feature assigned to the corresponding bin; or each dimension indicates the average of the feature values of all continuous features assigned to the corresponding bin; or each dimension indicates the median of those feature values; or each dimension indicates a boundary value of those feature values.
Specifically, the at least one additional operation may include an additional binning operation that uses the same binning mode as the basic binning operation but different binning parameters; alternatively, the at least one additional operation may include an additional binning operation that uses a binning mode different from that of the basic binning operation. The binning modes here include the various modes of supervised binning and/or unsupervised binning. For example, supervised binning includes minimum-entropy binning, minimum description length binning, and the like, while unsupervised binning includes equal-width binning, equal-depth binning, binning based on k-means clustering, and the like.
As an example, the basic binning operation and the additional binning operation may correspond to equal-width binning operations of different widths. That is, the basic binning operation and the additional binning operation use the same binning mode but different granularities of division, which allows the resulting basic binning feature and additional binning features to better capture the regularities of the original historical data records and is therefore more conducive to determining the importance of each feature. In particular, the different widths used by the basic binning operation and the additional binning operations may form a geometric sequence; for example, the basic binning operation may perform equal-width binning with a width of 2, and the additional binning operations may perform equal-width binning with widths of 4, 8, 16, and so on. Alternatively, the different widths may form an arithmetic sequence; for example, the basic binning operation may perform equal-width binning with a width of 2, and the additional binning operations may perform equal-width binning with widths of 4, 6, 8, and so on.
As another example, the basic binning operation and the additional binning operation may correspond to equal-depth binning operations of different depths. That is, the basic binning operation and the additional binning operation use the same binning mode but different granularities of division, which allows the resulting basic binning feature and additional binning features to better capture the regularities of the original historical data records and is therefore more conducive to determining the importance of each feature. In particular, the different depths used by the basic binning operation and the additional binning operations may form a geometric sequence; for example, the basic binning operation may perform equal-depth binning with a depth of 10, and the additional binning operations may perform equal-depth binning with depths of 100, 1000, 10000, and so on. Alternatively, the different depths may form an arithmetic sequence; for example, the basic binning operation may perform equal-depth binning with a depth of 10, and the additional binning operations may perform equal-depth binning with depths of 20, 30, 40, and so on.
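Binning the same feature at several granularities can be sketched as follows. The function names are illustrative, and the sketch assumes that "depth" means the number of samples placed in each bin, which is the usual meaning of equal-depth (equal-frequency) binning but is not spelled out in the text.

```python
def equal_width_labels(value, lower, widths):
    """Bin one value with several widths, e.g. the geometric sequence 2, 4, 8, 16."""
    return [int((value - lower) // w) for w in widths]


def equal_depth_labels(values, depth):
    """Give each value the bin label rank // depth, where rank is its sorted position.

    Assumption: `depth` is the number of samples per bin. Tied values may land in
    neighbouring bins in this simple version.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank // depth
    return labels


print(equal_width_labels(61.5, 0.0, [2, 4, 8, 16]))
print(equal_depth_labels([5.0, 1.0, 9.0, 3.0], depth=2))
```

Each entry of the returned list would be encoded as its own basic or additional binning feature.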
According to an exemplary embodiment of the present invention, the additional operation may also include a non-binning operation. For example, the at least one additional operation may include at least one of the following types of operations, each under the same or different operation parameters: logarithmic operation, exponential operation, absolute value operation, and Gaussian transformation operation. It should be noted that the additional operation here is not limited in operation type or operation parameters and may take any appropriate formula form; that is, the additional operation may have a simple form such as a square operation, or may have a complex operation expression. For example, for the j-th value x_ij of the i-th continuous feature, an additional operation may be performed according to the following formula to obtain the additional feature x''_ij:

$$x''_{ij} = \mathrm{sign}(x_{ij}) \times \log_2(1 + |x_{ij}|)$$

where sign is the sign function.
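The signed logarithmic transform above translates directly into code. The function name `signed_log2` is an illustrative choice.

```python
import math


def signed_log2(x):
    """Compute sign(x) * log2(1 + |x|), which compresses large magnitudes
    while keeping the sign of the original value."""
    sign = (x > 0) - (x < 0)  # evaluates to 1, 0, or -1
    return sign * math.log2(1 + abs(x))


print(signed_log2(3), signed_log2(-7), signed_log2(0))
```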
In addition to the basic binning features and additional features described above, other features included in the training samples of the feature pool model may also be produced. These features may be obtained by the model training device 200 by performing various kinds of feature processing on at least a part of the attribute information of the historical data records, such as direct extraction, discretization, field combination, extraction of partial field values, and rounding.
Next, the model training device 200 produces training samples of the feature pool model that include the above features together with the corresponding labels. According to an exemplary embodiment of the present invention, the above processing may be performed in memory under a distributed parallel computing framework, and this framework may have a distributed parameter server.
In addition, as an example, the produced training samples can be used directly in the training process of the feature pool model. Specifically, the step of producing the training samples can be regarded as part of the feature pool model training process; accordingly, the training samples need not be explicitly saved to hard disk, which can significantly improve running speed compared with traditional approaches.
Next, the model training device 200 may train the feature pool model based on the training samples. Here, the model training device 200 may use an appropriate machine learning algorithm (for example, logistic regression) to learn an appropriate feature pool model from the training samples. As an example, where the training samples of the feature pool model include both continuous features and non-continuous features, different regularization terms may be set for the continuous features and the non-continuous features respectively; that is, the regularization term set for the continuous features differs from the regularization term set for the non-continuous features.
In the above example, a feature pool model that is relatively stable and has a good prediction effect can be trained, so that the importance of each feature can subsequently be determined effectively based on the prediction effect of the feature pool model.
Specifically, in step S300, the importance determination device 300 obtains the effect of the at least one trained feature pool model, and determines the importance of each feature of the machine learning samples according to the obtained effect of the at least one feature pool model.
Here, the importance determination device 300 may obtain the effect of a feature pool model by applying the trained feature pool model to a corresponding test data set, or may receive the effect of the feature pool model from another party connected to it.
As an example, the importance determination device 300 may determine the importance of a corresponding feature on which a feature pool model is based according to the difference between the effects of that feature pool model on an original test data set and on a transformed test data set. Here, the transformed test data set is a data set obtained by replacing the values of the target feature whose importance is to be determined in the original test data set with one of the following: null values, random values, or values obtained by shuffling the order of the original values of the target feature.
Here, each feature pool model may be based on at least one feature of the machine learning samples, and the prediction effect of the feature pool model on the original test data set can be obtained accordingly. In addition, the prediction effect of the feature pool model on the transformed test data set can be obtained by transforming the values of the target feature in the original test data set. The difference between these two prediction effects can be used to measure the importance of the target feature.
As an example, the at least one feature pool model may include one all-feature model, where an all-feature model is a machine learning model that provides a prediction result for the machine learning problem based on all of the features of the machine learning samples. Specifically, suppose the model training device 200 trains one all-feature model in step S200, and this all-feature model is trained to provide a prediction result for the machine learning problem based on all features {f_1, f_2, …, f_n} of the machine learning samples. The importance determination device 300 can obtain the prediction effect of the all-feature model on the original test data set (for example, AUC_all); the original test data set here may be generated from other historical data records obtained by the data record acquisition device 100.
In this example, to determine the importance of any target feature f_i among {f_1, f_2, …, f_n} (where 1 ≤ i ≤ n), the original test data set can be processed accordingly to obtain a transformed test data set for the target feature f_i. For example, the value of feature f_i in each test sample of the original test data set is replaced with another value, such as a null value, a random value, or a value obtained by shuffling the values of feature f_i across the test samples. Accordingly, the importance determination device 300 can obtain the test effect of the all-feature model on the transformed test data set (for example, AUC_i).
After the effects of the all-feature model on the original test data set and on the transformed test data set have been obtained, the importance determination device 300 can use the difference between the two effects (that is, AUC_all - AUC_i) as a reference for measuring the importance of the target feature f_i.
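The AUC_all - AUC_i comparison can be sketched as follows. This is an illustration under stated assumptions: `model` is any callable that maps a sample (a list of feature values) to a score, the null value is taken to be 0.0, and the AUC is computed by a simple pairwise count. None of these choices is specified by the text, and the toy model and data are made up for the demonstration.

```python
import random


def auc(labels, scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly, ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def transformed_importance(model, rows, labels, idx, mode="shuffle", seed=0):
    """Return AUC_all - AUC_i for the feature at position `idx`."""
    auc_all = auc(labels, [model(r) for r in rows])
    rng = random.Random(seed)
    if mode == "null":
        column = [0.0] * len(rows)            # assumed encoding of a null value
    elif mode == "random":
        column = [rng.random() for _ in rows]
    else:
        column = [r[idx] for r in rows]
        rng.shuffle(column)                   # shuffle values across test samples
    transformed = [r[:idx] + [v] + r[idx + 1:] for r, v in zip(rows, column)]
    return auc_all - auc(labels, [model(r) for r in transformed])


def toy_model(row):
    """Hypothetical stand-in for a trained all-feature model; it only uses feature 0."""
    return row[0]


rows = [[0.9, 5.0], [0.8, 1.0], [0.2, 4.0], [0.1, 2.0]]
labels = [1, 1, 0, 0]
print(transformed_importance(toy_model, rows, labels, idx=0, mode="null"))
print(transformed_importance(toy_model, rows, labels, idx=1, mode="shuffle"))
```

Because the toy model ignores feature 1, transforming that feature leaves the AUC unchanged, whereas nulling feature 0 removes all ranking information.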
The above illustrates an example in which the original test data set is transformed, so that the importance of each feature on which a single feature pool model is based is determined using that same model. However, exemplary embodiments of the present invention are not limited to this; the number of feature pool models and the feature group on which each feature pool model is based can be designed in any appropriate way, as long as the importance of each feature can be inferred from the prediction effects of these feature pool models.
For example, the at least one feature pool model trained by the model training device 200 in step S200 may include multiple machine learning models that provide prediction results for the machine learning problem based on different feature groups. Accordingly, in step S300, the importance determination device 300 may determine the importance of each feature according to the differences between the effects of the at least one feature pool model on the original test data set.
Here, the at least one feature pool model includes one or more main feature pool models and at least one sub-feature pool model corresponding to each main feature pool model. A sub-feature pool model is a machine learning model that provides a prediction result for the machine learning problem based on the remaining features, among the features on which its corresponding main feature pool model is based, other than the target feature whose importance is to be determined. Accordingly, the importance determination device 300 may determine the importance of the corresponding target feature according to the difference between the effects of a main feature pool model and each of its corresponding sub-feature pool models on the original test data set.
As an example, the at least one feature pool model may include an all-feature model serving as the main feature pool model and at least one corresponding sub-feature pool model. Here, the all-feature model is a machine learning model that provides a prediction result for the machine learning problem based on all features of the machine learning samples, and a sub-feature pool model is a machine learning model that provides a prediction result for the machine learning problem based on the remaining features, among all the features, other than the target feature whose importance is to be determined. Accordingly, in step S300, the importance determination device 300 may determine the importance of the corresponding target feature according to the difference between the effects of the all-feature model and each sub-feature pool model on the original test data set.
Specifically, suppose the model training device 200 trains one all-feature model in step S200, and this all-feature model is trained to provide a prediction result for the machine learning problem based on all features {f_1, f_2, …, f_n} of the machine learning samples. The importance determination device 300 can obtain the prediction effect of the all-feature model on the original test data set (for example, AUC_all); the original test data set here may be generated from other historical data records obtained by the data record acquisition device 100.
In this example, to determine the importance of any target feature f_i among {f_1, f_2, …, f_n} (where 1 ≤ i ≤ n), a corresponding sub-feature pool model may additionally be trained in step S200. This sub-feature pool model is trained to provide a prediction result for the machine learning problem based on the features other than the target feature f_i, that is, {f_1, f_2, …, f_(i-1), f_(i+1), …, f_n}. Accordingly, the importance determination device 300 can obtain the prediction effect of the sub-feature pool model on the original test data set (for example, AUC_i).
After the effects of the all-feature model and of each sub-feature pool model on the original test data set have been obtained, the importance determination device 300 can use the difference between the two effects (that is, AUC_all - AUC_i) as a reference for measuring the importance of the feature f_i.
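The comparison between the all-feature model and its sub-feature pool models can be outlined as below. The callable `train_and_score` is hypothetical: it stands for "train a feature pool model on the given feature names and return its AUC on the original test data set". The stand-in `fake_train_and_score` and its numbers are invented solely so that the sketch runs; they are not results from the source.

```python
def drop_one_importance(features, train_and_score):
    """Return {feature: AUC_all - AUC_i}, where AUC_i comes from a model
    trained on every feature except the target feature."""
    auc_all = train_and_score(tuple(features))
    importance = {}
    for target in features:
        remaining = tuple(f for f in features if f != target)
        importance[target] = auc_all - train_and_score(remaining)
    return importance


def fake_train_and_score(feature_subset):
    """Made-up stand-in for training plus evaluation (illustration only)."""
    return 0.5 + 0.1 * ("income" in feature_subset) + 0.2 * ("age" in feature_subset)


result = drop_one_importance(["income", "age", "city"], fake_train_and_score)
print(result)
```

This scheme requires n + 1 training runs for n features, whereas transforming the test data set reuses a single trained model.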
It should be noted here that the above all-feature model is only an example and is not intended to limit the scope of the exemplary embodiments of the present invention. In fact, among the feature pool models there may be multiple main feature pool models, each with its own sub-feature pool models; that is, each main feature pool model may be based on at least a part of the features of the machine learning samples, and different main feature pool models may or may not share common features.
In addition, as an alternative, the at least one feature pool model trained by the model training device 200 in step S200 may include multiple single-feature models, where a single-feature model is a machine learning model that provides a prediction result for the machine learning problem based on the target feature, among the features of the machine learning samples, whose importance is to be determined. Accordingly, in step S300, the importance determination device 300 may determine the importance of the corresponding target features according to the differences between the effects of the single-feature models on the original test data set.
Specifically, suppose the model training device 200 trains multiple single-feature models in step S200, and each single-feature model is trained to provide a prediction result for the machine learning problem based on a single feature {f_i} of the machine learning samples. Here, the number of single-feature models may equal the number of features of the machine learning samples. Accordingly, the importance determination device 300 can obtain the prediction effect (for example, AUC_i) of each single-feature model on the same test data set (for example, the original test data set). Here, because the continuous features are discretized (preferably by performing the basic binning operation and the additional operation), it can be ensured that the single-feature models reflect the predictive power of each feature in a relatively stable manner. Accordingly, after the effects of all the single-feature models on the same test data set have been obtained, the importance determination device 300 can obtain the relative importance of the corresponding features based on the differences between these effects.
The method of determining feature importance according to an exemplary embodiment of the present invention has been shown above with reference to Fig. 2. However, it should be understood that the method shown in Fig. 2 is not intended to limit the specific implementation of the exemplary embodiments of the present invention, but merely provides an exemplary illustration of their basic concept. In fact, those skilled in the art can implement the exemplary embodiments of the present invention by modifying and/or elaborating the scheme shown in Fig. 2 in any appropriate way. For example, the steps in the flowchart shown in Fig. 2 are not subject to any restriction in terms of timing; for instance, step S200 and step S300 need not be performed in strict order, and, as an alternative, part of the model testing operations may be completed during training of the feature pool model in order to determine the effect of the feature pool model.
Specifically, as described above, according to an exemplary embodiment of the present invention, the at least one feature pool model trained in step S200 may include multiple machine learning models that provide prediction results for the machine learning problem based on different feature groups, and, in step S300, the importance of each feature may be determined according to the differences between the effects of the at least one feature pool model on the original test data set.
Here, the original test data set may consist of the obtained historical data records. Accordingly, in step S200, the obtained historical data records are divided into multiple groups of historical data records so as to train each feature pool model step by step, and step S200 further includes: using the feature pool model trained on the current group of historical data records to perform prediction on the next group of historical data records, so as to obtain the group AUC corresponding to that next group, and combining the group AUCs to obtain the AUC of the feature pool model. After the group AUC corresponding to the next group of historical data records has been obtained, the next group of historical data records is used to continue training the feature pool model that was trained on the current group.
Fig. 3 shows a flowchart of a method of determining the feature importance of machine learning samples according to another exemplary embodiment of the present invention. Similarly, for convenience, it is assumed that the method shown in Fig. 3 is performed by the feature importance determination system shown in Fig. 1. Also, as an example, the feature pool model here may be a machine learning model based on the logistic regression algorithm, and the effect of the feature pool model may be represented by AUC.
Referring to Fig. 3, in step S100, the data record acquisition device 100 obtains historical data records, where each historical data record includes a label for the machine learning problem and at least one piece of attribute information used to generate the features of the machine learning samples. For brevity, the details of how the data record acquisition device 100 obtains the historical data records are not repeated here.
Next, in step S210, the model training device 200 divides the obtained historical data records into multiple groups of historical data records, and these groups are used to train the feature pool models progressively, batch by batch. Optionally, the training process can be performed online, in which case the training samples of the feature pool models need not be explicitly saved to hard disk.
In step S220, the model training device 200 obtains the k-th group of historical data records as the next group of historical data records, where k is a positive integer. According to an exemplary embodiment of the present invention, because the multiple groups of historical data records are used to train each feature pool model progressively, batch by batch, it should be understood that, before the k-th group of historical data records is obtained, each feature pool model has already been trained in stages on the preceding k-1 groups of historical data records; here, a particular feature pool model among them can be denoted LR_{k-1}.
In step S230, the model training device 200 obtains the group AUC of each of the one or more trained feature pool models when tested on the k-th group of historical data records. Taking the particular feature pool model LR_{k-1} as an example, the model training device 200 uses LR_{k-1} to perform prediction on the k-th group of historical data records, so as to obtain the group AUC corresponding to the k-th group, namely AUC_k. Specifically, to use the k-th group of historical data records as a test data set, test samples need to be generated from each historical data record in the k-th group, where the feature part of a test sample is consistent with the feature part of the training samples of the feature pool model. That is, the model training device 200 may apply feature engineering similar to that used for the training samples to obtain the feature part of each test sample, while leaving out the label of the historical data record, thereby obtaining the test samples of the feature pool model. The model training device 200 then inputs the test samples into the feature pool model to obtain the corresponding prediction results. Based on these prediction results, the model training device 200 can obtain the group AUC_k of the feature pool model LR_{k-1} for the k-th group of historical data records. In a similar way, the model training device 200 can obtain the group AUCs of all previously trained feature pool models for the k-th group of historical data records, and save these group AUCs.
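The test-then-train loop of steps S220 and S230 can be sketched as below. Several points are assumptions made for the illustration: the feature pool model is a small logistic regression trained by stochastic gradient descent, the first group is used only for training because no trained model exists yet, and the group AUCs are combined by a plain average (the text says only that they are combined). All class and function names are hypothetical.

```python
import math


def auc(labels, scores):
    """Pairwise AUC; ties between a positive and a negative count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


class OnlineLogisticRegression:
    """Tiny SGD logistic regression that can keep training group by group."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, xs, ys):
        for x, y in zip(xs, ys):
            err = y - self.predict(x)
            self.b += self.lr * err
            self.w = [wi + self.lr * err * xi for wi, xi in zip(self.w, x)]


def progressive_auc(groups, n_features):
    """Test LR_{k-1} on group k, record AUC_k, then continue training on group k."""
    model = OnlineLogisticRegression(n_features)
    group_aucs = []
    for k, (xs, ys) in enumerate(groups):
        if k > 0:  # a trained model exists only after the first group
            group_aucs.append(auc(ys, [model.predict(x) for x in xs]))
        model.partial_fit(xs, ys)
    return sum(group_aucs) / len(group_aucs), group_aucs


# Toy data: the label is 1 exactly when the single feature is positive.
group = ([[-2.0], [-1.0], [1.0], [2.0]], [0, 0, 1, 1])
overall, per_group = progressive_auc([group, group, group], n_features=1)
print(overall, per_group)
```

Every record is scored before the model has been trained on it, so the combined AUC behaves like a held-out estimate without requiring a separate test data set.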
In practice, some historical data records may lack certain attribute information that is related to the features of a feature pool model. In this case, in order to better obtain the AUC of that feature pool model, the model training apparatus 200 may take corresponding measures.
Specifically, suppose the feature pool model trained on the current group of historical data records is used to perform prediction on the next group of historical data records, and the next group includes missing historical data records, that is, records lacking the attribute information needed to produce at least some of the features on which the feature pool model is based. The model training apparatus 200 may then obtain the group AUC corresponding to the next group of historical data records by one of the following:
First case: the model training apparatus 200 may calculate the group AUC using only the prediction results of the historical data records in the next group other than the missing historical data records. Specifically, assume that the k-th group includes 1000 historical data records in total, of which only 100 include all the attribute information on which the feature portion of the feature pool model is based, i.e. 900 records are missing historical data records. In this case, the model training apparatus 200 may perform prediction using only the 100 historical data records that have complete relevant attribute information, and use the AUC obtained from those prediction results as the group AUC.
Second case: the model training apparatus 200 may calculate the group AUC using the prediction results of all the historical data records in the next group, where the prediction result of each missing historical data record is set to a default value, and the default value is determined either from the value range of the prediction results or from the label distribution of the acquired historical data records. Specifically, assume that the k-th group includes 1000 historical data records in total, of which only 100 include all the attribute information on which the feature portion of the feature pool model is based, i.e. 900 records are missing historical data records. In this case, the model training apparatus 200 may input the 100 historical data records that have complete relevant attribute information into the feature pool model for prediction, and set the prediction results of the 900 missing historical data records to the default value. As an example, the default value may be determined from the value range of the prediction results: if the value range is [0, 1], the default value may be set to the midpoint 0.5. Alternatively, the default value may be determined from the label distribution of the acquired historical data records: if the 1000 historical data records in the k-th group contain 300 positive samples (i.e. records labeled 1), the default value may be set to the proportion of positive samples, i.e. 0.3. Once the prediction results for all 1000 historical data records have been obtained in this way, the model training apparatus 200 may use the AUC computed from these prediction results as the group AUC.
Third case: the model training apparatus 200 may obtain the group AUC by multiplying the AUC calculated from the prediction results of the historical data records other than the missing historical data records by the proportion of those other records within the next group. Specifically, assume that the k-th group includes 1000 historical data records in total, of which only 100 include all the attribute information on which the feature portion of the feature pool model is based, i.e. 900 records are missing historical data records. In this case, the model training apparatus 200 may input the 100 historical data records that have complete relevant attribute information into the feature pool model for prediction and obtain the corresponding AUC from the prediction results. The model training apparatus 200 may then multiply the obtained AUC by the proportion of non-missing historical data records (i.e. 0.1) to determine the final group AUC.
It should be noted that the three cases above are only exemplary ways of handling missing historical data records and are not intended to limit the exemplary embodiments of the present invention. Any approach similar or equivalent to these three may also be applied to the exemplary embodiments of the present invention.
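As a non-limiting illustration, the three cases above may be sketched as follows. This is a minimal sketch under stated assumptions: the function names `auc` and `group_auc`, the strategy names, and the convention that a missing historical data record is represented by a score of `None` are all hypothetical and do not appear in the original description. The AUC here is computed by simple pairwise comparison, which is one standard way to compute it.

```python
def auc(labels, scores):
    """Pairwise AUC: probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def group_auc(labels, scores, strategy, default=None):
    """Group AUC when some records are missing (score is None for those records).

    strategy (hypothetical names):
      "drop"    - first case: use only records that have a prediction
      "default" - second case: fill missing predictions with a default value
      "scale"   - third case: AUC on non-missing records times their proportion
    """
    kept = [(y, s) for y, s in zip(labels, scores) if s is not None]
    kept_labels = [y for y, _ in kept]
    kept_scores = [s for _, s in kept]

    if strategy == "drop":
        return auc(kept_labels, kept_scores)
    if strategy == "default":
        if default is None:
            # Assumption: fall back to the proportion of positive labels in the group.
            default = sum(labels) / len(labels)
        filled = [default if s is None else s for s in scores]
        return auc(labels, filled)
    if strategy == "scale":
        return auc(kept_labels, kept_scores) * len(kept) / len(labels)
    raise ValueError(f"unknown strategy: {strategy}")
```

Passing `default=0.5` with the `"default"` strategy corresponds to choosing the midpoint of a [0, 1] prediction range, while leaving `default` unset uses the positive-label proportion of the group.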
After the test of the feature pool models has been performed, in step S240 the model training apparatus 200 continues to train each of the one or more feature pool models trained so far, using the k-th group of historical data records that was read.
Taking the particular feature pool model LR_{k-1} above as an example, in step S240 the model training apparatus 200 continues model training using the k-th group of historical data records to obtain the updated feature pool model LR_k. Specifically, in order to use the k-th group of historical data records as a training data set, training samples need to be generated from the individual historical data records in the k-th group. That is, the model training apparatus 200 may obtain the feature portion of each training sample through the corresponding feature engineering, and use the label of the historical data record as the label of the training sample, thereby obtaining a training sample for the feature pool model. The model training apparatus 200 then continues to train the feature pool model on the resulting training samples to obtain the updated feature pool model LR_k. In a similar manner, the model training apparatus 200 can use the k-th group of historical data records to update all the feature pool models trained so far.
As can be seen, according to an exemplary embodiment of the present invention, the corresponding group AUCs can be obtained at the same time as the feature pool models are trained in stages, which makes training and testing of the models more efficient and optimizes the system as a whole. In fact, testing has shown that the AUC obtained in the above example correlates strongly with the true test AUC (on specific data sets, the correlation can exceed 0.85). Therefore, as an example, the importance of each feature of a feature pool model may be determined based on the group AUCs obtained in the manner described above.
Next, in step S250, the model training apparatus 200 determines whether the acquired k-th group of historical data records is the last of the divided groups. If it is determined in step S250 that the current k-th group is not the last group, the process returns to step S220 to acquire the next divided group, i.e. the (k+1)-th group of historical data records. Conversely, if it is determined in step S250 that the current k-th group is the last group, the process proceeds to step S310, in which the importance determining device 300 determines the importance of each feature of the machine learning sample based on the saved group AUCs of the respective feature pool models.
Specifically, in step S310, the importance determining device 300 may combine the individual group AUCs of each feature pool model to derive an AUC representing the performance of that feature pool model.
After the performance (i.e. AUC) of each feature pool model is obtained, the importance determining device 300 may treat the performance of a feature pool model as a reference for the importance of the feature group on which that feature pool model is based (that is, at least some of the features of the machine learning sample whose importance is to be determined), and may derive the importance of each target feature, or the importance ranking among the target features, by combining the performance differences between the feature pool models.
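One possible way to combine group AUCs and compare feature pool models is sketched below. It is a sketch under assumptions, not the method prescribed by the description: averaging the group AUCs is only one way of combining them, and scoring a feature by the AUC drop between a model that uses it and a model that omits it is only one way of using performance differences. The names `model_auc` and `importance_by_removal` are hypothetical.

```python
from statistics import mean


def model_auc(group_aucs):
    """Combine the group AUCs of one feature pool model (assumption: simple average)."""
    return mean(group_aucs)


def importance_by_removal(main_group_aucs, sub_group_aucs):
    """Score each target feature by the AUC lost when it is removed.

    main_group_aucs: group AUCs of a model that uses all features of interest.
    sub_group_aucs:  {feature: group AUCs of a model trained without that feature}.
    Returns (importance per feature, features sorted from most to least important).
    """
    main = model_auc(main_group_aucs)
    importance = {f: main - model_auc(a) for f, a in sub_group_aucs.items()}
    ranking = sorted(importance, key=importance.get, reverse=True)
    return importance, ranking


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    scores, order = importance_by_removal(
        [0.80, 0.82, 0.84],
        {"age": [0.70, 0.72, 0.74], "city": [0.79, 0.81, 0.83]},
    )
    print(order, scores)
```

A feature whose removal barely changes the combined AUC receives an importance near zero, while a negative value would suggest the feature hurts the model on the tested groups.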
Similarly, it should be noted that the flowchart shown in Fig. 3 is not intended to limit details of the processing such as timing, and serves only as an example to explain exemplary embodiments of the present invention. As an example, the training/testing of the individual feature pool models may be performed in parallel and/or online.
According to an exemplary embodiment of the present invention, the importance of each feature included in the machine learning samples used in machine learning can be determined effectively, which helps with model training and/or model interpretation.
Alternatively, the feature importance determination system shown in Fig. 1 may further include a display device (not shown). Accordingly, in step S200 shown in Fig. 2, the model training apparatus 200 may control the display device to provide the user with an interface for configuring at least one of the following items of a feature pool model: the at least some features on which the feature pool model is based, the algorithm type of the feature pool model, the algorithm parameters of the feature pool model, the operation type of the discretization operation, and the operation parameters of the discretization operation. In addition, in this step, the model training apparatus 200 may train each feature pool model according to the items configured by the user through the interface. Here, as an example, in step S200 the interface may be provided to the user in response to an instruction from the user to determine feature importance. For example, during training of a machine learning model, in order to determine the importance of each feature in the corresponding machine learning training samples, the user may issue an instruction during feature engineering to request the importance of each feature. To this end, according to an exemplary embodiment of the present invention, a control such as a feature importance operator may be provided to the user within another interface related to feature engineering or a modeling step. When the user clicks the control, an interface for configuring the feature pool models is displayed, in which items such as the algorithm, features, and regularization terms of a feature pool model can be set. In particular, items concerning how to discretize the continuous features of a feature pool model (for example, the various parameters of a binning operation) can also be set. For example, alternatively, the regularization terms for continuous features and for non-continuous features may be set separately, and different weights may also be set for the regularization terms corresponding to different continuous features.
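The configurable items listed above may be pictured as a simple configuration record. The following is a hypothetical sketch only: the class name `FeaturePoolConfig`, its field names, and the default values are assumptions made for illustration and are not defined in the original description.

```python
from dataclasses import dataclass, field


@dataclass
class FeaturePoolConfig:
    """Items a user might configure for one feature pool model (hypothetical names)."""

    features: list                                   # features the model is based on
    algorithm: str = "logistic_regression"           # algorithm type (assumed default)
    algorithm_params: dict = field(default_factory=dict)       # e.g. regularization weights
    discretization: str = "equal_width_binning"      # discretization operation type (assumed)
    discretization_params: dict = field(default_factory=dict)  # e.g. number of bins

    def validate(self):
        """Reject configurations that cannot describe a feature pool model."""
        if not self.features:
            raise ValueError("a feature pool model needs at least one feature")
        if not self.algorithm:
            raise ValueError("an algorithm type must be given")
        return self
```

A user interface of the kind described could populate one such record per feature pool model and pass it to the training step.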
Here, the display device may be a simple display screen, in which case the feature importance determination system may further include an input device (for example, a keyboard, mouse, microphone, or camera) that allows the user to configure items through the interface. Alternatively, the display device may be a touch display screen with a touch input function, in which case the user can complete the item configuration on the interface directly through the touch screen.
In addition, after the feature importance determination system according to an exemplary embodiment of the present invention has obtained the importance of each feature of the machine learning sample, the determined importance information of each feature may also be presented to the user graphically.
Fig. 4 shows an example of a feature importance display interface according to an exemplary embodiment of the present invention. The interface shown in Fig. 4 presents a feature importance analysis report that lists the feature importance ranking together with some additional information. As an example, when the user clicks on or hovers over the bar of a feature, sample information or attribute information about that feature may additionally be displayed.
Alternatively, the features may be displayed in order of importance, and/or some of the features may be highlighted, where the highlighted features include important features corresponding to high importance, unimportant features corresponding to low importance, and/or abnormal features corresponding to abnormal importance.
Fig. 5 shows an example of a feature importance display interface according to another exemplary embodiment of the present invention. In the interface shown in Fig. 5, the features of the machine learning sample are not only displayed in order of importance, but the abnormal features corresponding to abnormal importance are also highlighted. Optionally, possible reasons why a feature is abnormal are also provided, which improves the user's interactive experience.
It should be understood that, in the existing machine learning field, programmers must in most cases write code to carry out the machine learning process. Even though some software systems such as modeling platforms have been developed, they still struggle to benefit business personnel other than machine learning experts. According to an exemplary embodiment of the present invention, however, the importance of each feature in the machine learning samples can be determined effectively and automatically, which lowers the threshold for applying machine learning. In addition, according to an exemplary embodiment of the present invention, the results of determining feature importance and/or the related settings for how it is determined can be presented to the user in a friendly, interactive way, which further improves the usability of the machine learning platform. Accordingly, users with stronger machine learning skills can conveniently set and/or adjust the details of the determination process, while ordinary users can see at a glance the important features, unimportant features, and/or abnormal features among the machine learning samples.
It should be noted that the feature importance determination system according to an exemplary embodiment of the present invention may rely entirely on the running of a computer program to realize the corresponding functions. That is, each device corresponds to a step in the functional architecture of the computer program, so that the whole system is invoked through a dedicated software package (for example, a lib library) to realize the corresponding functions.
On the other hand, each device in the feature importance determination system may also be implemented by hardware, software, firmware, middleware, microcode, or any combination thereof. When implemented in software, firmware, middleware, or microcode, the program code or code segments for performing the corresponding operations may be stored in a computer-readable medium such as a storage medium, so that a processor can perform the corresponding operations by reading and executing the program code or code segments.
Here, an exemplary embodiment of the present invention may also be implemented as a computing device that includes a storage unit and a processor. The storage unit stores a set of computer-executable instructions which, when executed by the processor, performs the feature importance determination method described above.
Specifically, the computing device may be deployed on a server or a client, or on a node device in a distributed network environment. In addition, the computing device may be a PC, a tablet device, a personal digital assistant, a smartphone, a web application, or any other device capable of executing the above set of instructions.
Here, the computing device need not be a single computing device; it may be any collection of devices or circuits that can execute the above instructions (or set of instructions) individually or jointly. The computing device may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (for example, via wireless transmission).
In the computing device, the processor may include a central processing unit (CPU), a graphics processing unit (GPU), a programmable logic device, a dedicated processor system, a microcontroller, or a microprocessor. By way of example and not limitation, the processor may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and the like.
Some of the operations described in the above feature importance determination method may be implemented in software and some in hardware; these operations may also be implemented by a combination of software and hardware.
The processor may execute instructions or code stored in one of the storage units, where the storage unit may also store data. Instructions and data may also be sent and received over a network via a network interface device, which may employ any known transmission protocol.
The storage unit may be integrated with the processor, for example with RAM or flash memory arranged within an integrated circuit microprocessor or the like. In addition, the storage unit may comprise a separate device, such as an external disk drive, a storage array, or any other storage device usable by a database system. The storage unit and the processor may be operatively coupled, or may communicate with each other, for example through an I/O port or a network connection, so that the processor can read files stored in the storage unit.
In addition, the computing device may further include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, or touch input device). All components of the computing device may be connected to one another via a bus and/or a network.
The operations involved in the above feature importance determination method may be described as various interconnected or coupled functional blocks or functional diagrams. However, these functional blocks or functional diagrams may equally be integrated into a single logic device or operate according to imprecise boundaries.
Specifically, as described above, a computing device for determining the importance of each feature of a machine learning sample according to an exemplary embodiment of the present invention may include a storage unit and a processor. The storage unit stores a set of computer-executable instructions which, when executed by the processor, performs the following steps: (A) obtaining historical data records, where each historical data record includes a label regarding the machine learning problem and at least one piece of attribute information used to generate the features of the machine learning sample; (B) training at least one feature pool model using the obtained historical data records, where a feature pool model is a machine learning model that provides a prediction result regarding the machine learning problem based on at least some of the features; and (C) obtaining the effect of the at least one feature pool model, and determining the importance of each feature according to the obtained effect of the at least one feature pool model, where, in step (B), the feature pool model is trained by performing a discretization operation on at least one continuous feature among the at least some features.
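The discretization of a continuous feature mentioned in step (B) may be illustrated with a binning operation. The sketch below assumes equal-width binning over a known value range and allows several bin counts to be applied to the same feature; both the binning scheme and the function names `equal_width_bin` and `discretize` are assumptions chosen for illustration, since the passage above does not fix a particular discretization operation.

```python
def equal_width_bin(value, low, high, n_bins):
    """Map a continuous value in [low, high] to a bin index in 0..n_bins-1."""
    if n_bins < 1 or high <= low:
        raise ValueError("need n_bins >= 1 and high > low")
    if value <= low:
        return 0            # values at or below the range go to the first bin
    if value >= high:
        return n_bins - 1   # values at or above the range go to the last bin
    return int((value - low) / (high - low) * n_bins)


def discretize(value, low, high, bin_counts=(10,)):
    """Apply the same binning scheme with several bin counts (coarse and fine views)."""
    return [equal_width_bin(value, low, high, n) for n in bin_counts]
```

For instance, `discretize(37.0, 0.0, 100.0, (10, 4))` places the value in one bin of a ten-bin split and one bin of a four-bin split, giving the model two discrete views of the same continuous feature.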
It should be noted that the processing details of the feature importance determination method according to an exemplary embodiment of the present invention have been described above with reference to Fig. 2 to Fig. 5, so the processing details of the computing device performing each step are not repeated here.
Exemplary embodiments of the present invention have been described above. It should be understood that the foregoing description is exemplary only and not exhaustive, and that the present invention is not limited to the disclosed exemplary embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the present invention. Therefore, the scope of protection of the present invention shall be defined by the scope of the claims.

Claims (10)

1. A method of determining the importance of each feature of a machine learning sample, comprising:
(A) obtaining historical data records, wherein each historical data record includes a label regarding a machine learning problem and at least one piece of attribute information used to generate the features of the machine learning sample;
(B) training at least one feature pool model using the obtained historical data records, wherein a feature pool model is a machine learning model that provides a prediction result regarding the machine learning problem based on at least some of the features;
(C) obtaining the effect of the at least one feature pool model, and determining the importance of each feature according to the obtained effect of the at least one feature pool model,
wherein, in step (B), the feature pool model is trained by performing a discretization operation on at least one continuous feature among the at least some features.
2. The method of claim 1, wherein, in step (C), the importance of each feature on which a feature pool model is based is determined according to the difference between the effect of the feature pool model on an original test data set and its effect on a transformed test data set,
wherein the transformed test data set is a data set obtained by replacing the values, in the original test data set, of the target feature whose importance is to be determined with one of the following: null values, random numbers, or values obtained by shuffling the order of the original values of the target feature.
3. The method of claim 1, wherein the at least one feature pool model comprises a plurality of machine learning models that provide prediction results regarding the machine learning problem based on different feature groups,
wherein, in step (C), the importance of each feature is determined according to the differences between the effects of the at least one feature pool model on an original test data set.
4. The method of claim 3, wherein the at least one feature pool model comprises one or more main feature pool models and at least one sub feature pool model corresponding to each main feature pool model, wherein a sub feature pool model is a machine learning model that provides a prediction result regarding the machine learning problem based on the features on which its corresponding main feature pool model is based, other than the target feature whose importance is to be determined,
wherein, in step (C), the importance of the corresponding target feature is determined according to the difference between the effects of a main feature pool model and each of its corresponding sub feature pool models on the original test data set.
5. The method of claim 3, wherein the at least one feature pool model comprises a plurality of single-feature models, wherein a single-feature model is a machine learning model that provides a prediction result regarding the machine learning problem based on one target feature, among the features, whose importance is to be determined,
wherein, in step (C), the importance of the corresponding target features is determined according to the differences between the effects of the single-feature models on the original test data set.
6. The method of claim 1, wherein the discretization operation comprises a basic binning operation and at least one additional operation.
7. The method of claim 6, wherein the at least one additional operation comprises an additional binning operation that uses the same binning scheme as the basic binning operation but different binning parameters; or the at least one additional operation comprises an additional binning operation that uses a binning scheme different from that of the basic binning operation.
8. The method of claim 1, wherein step (B) further comprises: providing a user with an interface for configuring at least one of the following items of a feature pool model: the at least some features on which the feature pool model is based, the algorithm type of the feature pool model, the algorithm parameters of the feature pool model, the operation type of the discretization operation, and the operation parameters of the discretization operation,
and, in step (B), each feature pool model is trained according to the items configured by the user through the interface.
9. A system for determining the importance of each feature of a machine learning sample, comprising:
a data record acquisition device for obtaining historical data records, wherein each historical data record includes a label regarding a machine learning problem and at least one piece of attribute information used to generate the features of the machine learning sample;
a model training apparatus for training at least one feature pool model using the obtained historical data records, wherein a feature pool model is a machine learning model that provides a prediction result regarding the machine learning problem based on at least some of the features;
an importance determining device for obtaining the effect of the at least one feature pool model, and determining the importance of each feature according to the obtained effect of the at least one feature pool model,
wherein the model training apparatus trains the feature pool model by performing a discretization operation on at least one continuous feature among the at least some features.
10. A computing device for determining the importance of each feature of a machine learning sample, comprising a storage unit and a processor, wherein the storage unit stores a set of computer-executable instructions which, when executed by the processor, performs the following steps:
(A) obtaining historical data records, wherein each historical data record includes a label regarding a machine learning problem and at least one piece of attribute information used to generate the features of the machine learning sample;
(B) training at least one feature pool model using the obtained historical data records, wherein a feature pool model is a machine learning model that provides a prediction result regarding the machine learning problem based on at least some of the features;
(C) obtaining the effect of the at least one feature pool model, and determining the importance of each feature according to the obtained effect of the at least one feature pool model,
wherein, in step (B), the feature pool model is trained by performing a discretization operation on at least one continuous feature among the at least some features.
CN201610935697.0A 2016-11-01 2016-11-01 Determine the method and system of the feature importance of machine learning sample Pending CN108021984A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110542599.1A CN113435602A (en) 2016-11-01 2016-11-01 Method and system for determining feature importance of machine learning sample
CN201610935697.0A CN108021984A (en) 2016-11-01 2016-11-01 Determine the method and system of the feature importance of machine learning sample

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610935697.0A CN108021984A (en) 2016-11-01 2016-11-01 Determine the method and system of the feature importance of machine learning sample

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202110542599.1A Division CN113435602A (en) 2016-11-01 2016-11-01 Method and system for determining feature importance of machine learning sample

Publications (1)

Publication Number Publication Date
CN108021984A true CN108021984A (en) 2018-05-11

Family

ID=62070586

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202110542599.1A Pending CN113435602A (en) 2016-11-01 2016-11-01 Method and system for determining feature importance of machine learning sample
CN201610935697.0A Pending CN108021984A (en) 2016-11-01 2016-11-01 Determine the method and system of the feature importance of machine learning sample

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202110542599.1A Pending CN113435602A (en) 2016-11-01 2016-11-01 Method and system for determining feature importance of machine learning sample

Country Status (1)

Country Link
CN (2) CN113435602A (en)

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109034398A (en) * 2018-08-10 2018-12-18 深圳前海微众银行股份有限公司 Feature selection approach, device and storage medium based on federation's training
CN109165683A (en) * 2018-08-10 2019-01-08 深圳前海微众银行股份有限公司 Sample predictions method, apparatus and storage medium based on federation's training
CN109360084A (en) * 2018-09-27 2019-02-19 平安科技(深圳)有限公司 Appraisal procedure and device, storage medium, the computer equipment of reference default risk
CN109408583A (en) * 2018-09-25 2019-03-01 平安科技(深圳)有限公司 Data processing method and device, computer readable storage medium, electronic equipment
CN109657285A (en) * 2018-11-27 2019-04-19 中国科学院空间应用工程与技术中心 The detection method of turbine rotor transient stress
CN109783337A (en) * 2018-12-19 2019-05-21 北京达佳互联信息技术有限公司 Model service method, system, device and computer readable storage medium
CN109784721A (en) * 2019-01-15 2019-05-21 东莞市友才网络科技有限公司 A kind of plateform system of employment data analysis and data mining analysis
CN109800048A (en) * 2019-01-22 2019-05-24 深圳魔数智擎科技有限公司 Result methods of exhibiting, computer readable storage medium and the computer equipment of model
CN110660485A (en) * 2019-08-20 2020-01-07 南京医渡云医学技术有限公司 Method and device for acquiring influence of clinical index
CN110708285A (en) * 2019-08-30 2020-01-17 中国平安人寿保险股份有限公司 Flow monitoring method, device, medium and electronic equipment
CN110717597A (en) * 2018-06-26 2020-01-21 第四范式(北京)技术有限公司 Method and device for acquiring time sequence characteristics by using machine learning model
CN110751285A (en) * 2018-07-23 2020-02-04 第四范式(北京)技术有限公司 Training method and system and prediction method and system of neural network model
CN110956272A (en) * 2019-11-01 2020-04-03 第四范式(北京)技术有限公司 Method and system for realizing data processing
CN111401475A (en) * 2020-04-15 2020-07-10 支付宝(杭州)信息技术有限公司 Method and system for generating attack sample
CN111797995A (en) * 2020-06-29 2020-10-20 第四范式(北京)技术有限公司 Method and device for generating interpretation report of model prediction sample
CN112580817A (en) * 2019-09-30 2021-03-30 脸谱公司 Managing machine learning features
CN112819034A (en) * 2021-01-12 2021-05-18 平安科技(深圳)有限公司 Data binning threshold calculation method and device, computer equipment and storage medium
WO2021139115A1 (en) * 2020-05-26 2021-07-15 平安科技(深圳)有限公司 Feature selection method, apparatus and device, and storage medium
CN113128694A (en) * 2019-12-31 2021-07-16 北京超星未来科技有限公司 Method, device and system for data acquisition and data processing in machine learning
CN117705141A (en) * 2024-02-06 2024-03-15 腾讯科技(深圳)有限公司 Yaw recognition method, yaw recognition device, computer readable medium and electronic equipment

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI806425B (en) * 2022-02-14 2023-06-21 宏碁股份有限公司 Feature selection method

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110717597A (en) * 2018-06-26 2020-01-21 第四范式(北京)技术有限公司 Method and device for acquiring time sequence characteristics by using machine learning model
CN110751285A (en) * 2018-07-23 2020-02-04 第四范式(北京)技术有限公司 Training method and system and prediction method and system of neural network model
CN110751285B (en) * 2018-07-23 2024-01-23 第四范式(北京)技术有限公司 Training method and system and prediction method and system for neural network model
CN109034398A (en) * 2018-08-10 2018-12-18 深圳前海微众银行股份有限公司 Feature selection method, device and storage medium based on federated training
CN109165683A (en) * 2018-08-10 2019-01-08 深圳前海微众银行股份有限公司 Sample prediction method, apparatus and storage medium based on federated training
CN109034398B (en) * 2018-08-10 2023-09-12 深圳前海微众银行股份有限公司 Gradient lifting tree model construction method and device based on federal training and storage medium
CN109165683B (en) * 2018-08-10 2023-09-12 深圳前海微众银行股份有限公司 Sample prediction method, device and storage medium based on federal training
CN109408583A (en) * 2018-09-25 2019-03-01 平安科技(深圳)有限公司 Data processing method and device, computer readable storage medium, electronic equipment
CN109408583B (en) * 2018-09-25 2023-04-07 平安科技(深圳)有限公司 Data processing method and device, computer readable storage medium and electronic equipment
CN109360084A (en) * 2018-09-27 2019-02-19 平安科技(深圳)有限公司 Credit default risk assessment method and device, storage medium, and computer equipment
CN109657285A (en) * 2018-11-27 2019-04-19 中国科学院空间应用工程与技术中心 The detection method of turbine rotor transient stress
CN109783337A (en) * 2018-12-19 2019-05-21 北京达佳互联信息技术有限公司 Model service method, system, device and computer readable storage medium
CN109783337B (en) * 2018-12-19 2022-08-30 北京达佳互联信息技术有限公司 Model service method, system, apparatus and computer readable storage medium
CN109784721A (en) * 2019-01-15 2019-05-21 东莞市友才网络科技有限公司 Platform system for employment data analysis and data mining analysis
CN109784721B (en) * 2019-01-15 2021-01-26 广东度才子集团有限公司 Employment data analysis and data mining analysis platform system
CN109800048A (en) * 2019-01-22 2019-05-24 深圳魔数智擎科技有限公司 Model result display method, computer readable storage medium and computer equipment
CN110660485A (en) * 2019-08-20 2020-01-07 南京医渡云医学技术有限公司 Method and device for acquiring influence of clinical index
CN110708285A (en) * 2019-08-30 2020-01-17 中国平安人寿保险股份有限公司 Flow monitoring method, device, medium and electronic equipment
CN112580817A (en) * 2019-09-30 2021-03-30 脸谱公司 Managing machine learning features
CN110956272B (en) * 2019-11-01 2023-08-08 第四范式(北京)技术有限公司 Method and system for realizing data processing
CN110956272A (en) * 2019-11-01 2020-04-03 第四范式(北京)技术有限公司 Method and system for realizing data processing
CN113128694A (en) * 2019-12-31 2021-07-16 北京超星未来科技有限公司 Method, device and system for data acquisition and data processing in machine learning
CN111401475A (en) * 2020-04-15 2020-07-10 支付宝(杭州)信息技术有限公司 Method and system for generating attack sample
WO2021139115A1 (en) * 2020-05-26 2021-07-15 平安科技(深圳)有限公司 Feature selection method, apparatus and device, and storage medium
CN111797995A (en) * 2020-06-29 2020-10-20 第四范式(北京)技术有限公司 Method and device for generating interpretation report of model prediction sample
CN111797995B (en) * 2020-06-29 2024-01-26 第四范式(北京)技术有限公司 Method and device for generating interpretation report of model prediction sample
CN112819034A (en) * 2021-01-12 2021-05-18 平安科技(深圳)有限公司 Data binning threshold calculation method and device, computer equipment and storage medium
CN117705141A (en) * 2024-02-06 2024-03-15 腾讯科技(深圳)有限公司 Yaw recognition method, yaw recognition device, computer readable medium and electronic equipment
CN117705141B (en) * 2024-02-06 2024-05-07 腾讯科技(深圳)有限公司 Yaw recognition method, yaw recognition device, computer readable medium and electronic equipment

Also Published As

Publication number Publication date
CN113435602A (en) 2021-09-24

Similar Documents

Publication Publication Date Title
CN108021984A (en) Determine the method and system of the feature importance of machine learning sample
CN107392319A (en) Method and system for generating combined features of machine learning samples
US11074511B2 (en) System and method for graph pattern analysis
CN107704871A (en) Method and system for generating combined features of machine learning samples
CN107871166A (en) Feature processing method and feature processing system for machine learning
CN107729915A (en) Method and system for determining important features of machine learning samples
CN104798043B (en) Data processing method and computer system
CN110462612A (en) Method and apparatus for machine learning using a network with software agents at the network nodes, followed by ranking of the network nodes
EP3394744A1 (en) System and method for rapid development and deployment of reusable analytic code for use in computerized data modeling and analysis
CN110309119A (en) System, method and apparatus for implementing data upload, processing and predictive query API exposure
CN105528387A (en) Segmentation discovery, evaluation and implementation platform
CN107909087A (en) Method and system for generating combined features of machine learning samples
CN107851106A (en) Automatic demand-driven resource scaling for relational database-as-a-service
CN108108820A (en) Method and system for selecting features of machine learning samples
CN107679549A (en) Method and system for generating combined features of machine learning samples
CN107578140A (en) Guide analysis system and method
CN107169574A (en) Method and system for performing prediction using nested machine learning models
CN107273979A (en) Method and system for performing machine learning prediction based on service class
Manohar et al. Utilizing big data analytics to improve education
Khan et al. Gray method for multiple attribute decision making with incomplete weight information under the pythagorean fuzzy setting
Winters Practical predictive analytics
US11295325B2 (en) Benefit surrender prediction
CN107368506A (en) Unstructured data analysis system and method
Labijak-Kowalska et al. Exact and stochastic methods for robustness analysis in the context of Imprecise Data Envelopment Analysis
CN110503306A (en) Satisfaction index visualization processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180511