CN110472802A - Data feature evaluation method, apparatus and device - Google Patents

Publication number
CN110472802A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810435231.3A
Other languages
Chinese (zh)
Other versions
CN110472802B (en)
Inventor
刘腾飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Advantageous New Technologies Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Application filed by Alibaba Group Holding Ltd
Priority to CN201810435231.3A
Publication of CN110472802A
Application granted
Publication of CN110472802B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations

Abstract

The embodiments of this specification disclose a data feature evaluation method, apparatus and device. A data sample to be evaluated is reconstructed using the characteristic value of a feature variable to generate a simulated data sample, and the data model then scores the simulated data sample. This shows how much the score of the data sample changes when the value of a given feature variable is changed, and therefore how strongly the value of that feature variable influences the score of the data sample to be evaluated. The influence of each feature variable on the score is thus expressed as a quantifiable feature contribution value, which the user of the data model can use for subsequent business decisions and processing.

Description

Data feature evaluation method, apparatus and device
Technical field
This specification relates to the field of computer technology, and in particular to a data feature evaluation method, apparatus and device.
Background
In applications of artificial intelligence, machine learning models are widely used for data classification and anomaly detection.
In this setting, a data sample generally comprises multiple feature variables, and a trained data model scores the data sample based on those feature variables. To the user of the data model, the model often behaves like a "black box": it provides detection results for different data samples and supports decisions, but why the model reaches a given conclusion, and how much each feature contributes to it, is often unclear.
A more effective data feature evaluation scheme is therefore needed.
Summary of the invention
The embodiments of this specification provide a data feature evaluation method, apparatus and device that address the following problem: providing a more effective data feature evaluation scheme.
Accordingly, an embodiment of this specification provides a data feature evaluation method, comprising:
obtaining the score given by a data model to a data sample to be evaluated, wherein the data sample to be evaluated comprises multiple feature variables and their corresponding values;
for any feature variable, replacing the value of that feature variable in the data sample to be evaluated with a previously obtained characteristic value of that feature variable, thereby generating a simulated data sample;
obtaining the score given by the data model to the simulated data sample, and calculating the feature contribution value of that feature variable in the data sample from the score of the data sample to be evaluated and the score of the simulated data sample;
evaluating the feature variables of the data sample to be evaluated according to the magnitude of their feature contribution values.
An embodiment of this specification also provides a data feature evaluation apparatus, comprising:
a scoring module, which obtains the score given by a data model to a data sample to be evaluated, wherein the data sample to be evaluated comprises multiple feature variables and their corresponding values;
a generation module, which, for any feature variable, replaces the value of that feature variable in the data sample to be evaluated with a previously obtained characteristic value of that feature variable, thereby generating a simulated data sample;
a computing module, which obtains the score given by the data model to the simulated data sample, and calculates the feature contribution value of that feature variable in the data sample from the score of the data sample to be evaluated and the score of the simulated data sample;
an evaluation module, which evaluates the feature variables of the data sample to be evaluated according to the magnitude of their feature contribution values.
Correspondingly, an embodiment of this specification also provides a data feature evaluation device, comprising:
a memory, which stores a data feature evaluation program;
a processor, which calls the data feature evaluation program in the memory and executes:
obtaining the score given by a data model to a data sample to be evaluated, wherein the data sample to be evaluated comprises multiple feature variables and their corresponding values;
for any feature variable, replacing the value of that feature variable in the data sample to be evaluated with a previously obtained characteristic value of that feature variable, thereby generating a simulated data sample;
obtaining the score given by the data model to the simulated data sample, and calculating the feature contribution value of that feature variable in the data sample from the score of the data sample to be evaluated and the score of the simulated data sample;
evaluating the feature variables of the data sample to be evaluated according to the magnitude of their feature contribution values.
Correspondingly, an embodiment of this specification also provides a non-volatile computer storage medium storing computer-executable instructions, the computer-executable instructions being configured to:
obtain the score given by a data model to a data sample to be evaluated, wherein the data sample to be evaluated comprises multiple feature variables and their corresponding values;
for any feature variable, replace the value of that feature variable in the data sample to be evaluated with a previously obtained characteristic value of that feature variable, thereby generating a simulated data sample;
obtain the score given by the data model to the simulated data sample, and calculate the feature contribution value of that feature variable in the data sample from the score of the data sample to be evaluated and the score of the simulated data sample;
evaluate the feature variables of the data sample to be evaluated according to the magnitude of their feature contribution values.
At least one of the above technical solutions adopted by the embodiments of this specification can achieve the following beneficial effects:
A data sample to be evaluated is reconstructed using the characteristic value of a feature variable to generate a simulated data sample, and the data model then scores the simulated data sample. This shows how much the score of the data sample changes when the value of a given feature variable is changed, and therefore how strongly the value of that feature variable influences the score of the data sample to be evaluated. The influence of each feature variable on the score is expressed as a quantifiable feature contribution value. The user of the data model can therefore evaluate the contribution of each feature, and key feature information explaining why each data sample received its score can be generated for reference. With this approach the model user does not need a deep understanding of the business problem, the calculation is independent of the scenario in which the data model is used, and the calculation can be fully automated without human intervention, which substantially improves efficiency.
Brief description of the drawings
Fig. 1 is a flow diagram of the data feature evaluation scheme provided by an embodiment of this specification;
Fig. 2 is a schematic diagram of constructing simulated data samples provided by an embodiment of this specification;
Fig. 3 is a schematic diagram of outputting key feature information provided by an embodiment of this specification;
Fig. 4 is a schematic block diagram of outputting key feature information provided by an embodiment of this specification;
Fig. 5 is a structural diagram of the data feature evaluation apparatus provided by an embodiment of this specification.
Detailed description
To make the purposes, technical solutions and advantages of this application clearer, the technical solutions of this application are described clearly and completely below with reference to specific embodiments and the corresponding drawings. The described embodiments are only some of the embodiments of this application, not all of them. Based on the embodiments in this specification, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the scope of protection of this application.
As noted above, machine learning is widely used in many business scenarios. For example, it can help judge whether a transaction is abnormal, whether a user operation is risky, or whether a loan applicant will repay. The usual approach is to train a data model on training samples and then use that data model to perform anomaly detection on data. The training samples may be labeled or unlabeled.
To many users of data model results, the machine learning model often behaves like a "black box": it provides detection results and supports decisions, but why the model reaches a given conclusion is often unclear. This lack of interpretability reduces the friendliness of the data model and the usability of the system. Therefore, to help business personnel interpret model results, some simple explanation is generally provided along with the model result, namely the key feature information of the model result. It indicates which factors caused the data model to produce such a score or classification, and so helps business personnel make better business decisions.
At present, key feature information is mainly generated in the following ways:
1. Generating key feature information based on training labels. A training label is the target that the machine learning model is to predict. For example, when judging whether a transaction is a fake transaction, whether the transaction is fake (the value is either yes or no) is the label of the model. When the training data have labels, the ability of different feature variables to separate the target can be calculated from the labels, or a more interpretable model can be selected. This approach depends heavily on label data and cannot be used if the real scenario has no label data.
2. Manually specifying key feature information based on business rules. For example, in a loan application scenario, business personnel know from experience that variables such as whether the user is employed, the user's annual income, and whether the user owns property often have a major influence on the final model result. Based on this business knowledge, business rules can be set manually that focus only on these important features and output the values of these important feature variables as the key feature information. This approach requires the model user to have strong domain knowledge, and it cannot be automated: moving the same model to another application scenario requires the rules to be reformulated, which is inefficient.
Based on the above, the embodiments of this specification provide a data feature evaluation scheme that generates simulated data samples by making minor modifications to the original data sample, and thereby quantifies how strongly a given feature variable influences the score produced by the data model. For any data sample, the feature variables with a large influence on that sample can then be identified accurately. The scheme requires no understanding of the application scenario and can be automated, which makes it more effective.
As shown in Fig. 1, Fig. 1 is a flow diagram of the data feature evaluation scheme provided by an embodiment of this specification, which includes the following steps:
S101: obtain the score given by a data model to a data sample to be evaluated, wherein the data sample to be evaluated comprises multiple feature variables and their corresponding values.
In a machine learning model, a data sample is generally a vector containing multiple feature variables. A certain number of training samples are used to train a model with a preset algorithm, yielding a data model whose accuracy meets expectations. The trained data model can then be used to detect unknown data samples (that is, data samples to be evaluated). The model generally derives a score from the values of the feature variables (the score may or may not be normalized, depending on actual needs), and a decision is then made based on that score.
For example, the Isolation Forest algorithm can be used to evaluate a piece of transaction data and judge whether it is abnormal. In general, the transaction data can be regarded as a vector of m feature variables, where each feature variable may describe, for instance, the transaction amount, the number of transactions, the number of transactions by the trader on that day, or the time interval since the previous transaction. Each feature variable has a corresponding value. In other words, the i-th transaction data T_i is described by m feature variables C_1, C_2, ..., C_m, that is, T_i = {C_1, C_2, ..., C_m}. Given a data model trained with the Isolation Forest algorithm, inputting T_i = {C_1, C_2, ..., C_m} makes the data model assign the transaction a score that indicates its degree of abnormality. The score can be normalized to lie between 0 and 1, with a higher score indicating a more abnormal transaction. Without an evaluation method, it may be impossible to know why the data model produced that score.
S103: for any feature variable, replace the value of that feature variable in the data sample to be evaluated with a previously obtained characteristic value of that feature variable, thereby generating a simulated data sample.
In unsupervised anomaly detection algorithms, the following rules generally hold: a) abnormal data make up a small share of all data; b) abnormal data differ from most other data.
To evaluate the influence of each feature variable on the score of an abnormal data sample, the basic idea is to keep the values of the other feature variables unchanged and replace the value of one feature variable in the data sample to be evaluated with the characteristic value of that feature variable. The characteristic value of a feature variable is obtained in advance. In general it is the most common or most representative value among all data samples, and is broadly representative of the data as a whole. It can be determined from experience or computed statistically from the training samples (the data samples used to train the data model).
It is easy to see that if the feature variables of a data sample take their representative characteristic values, the sample is unlikely to be abnormal, and the score the data model gives it should fall within the normal range (for example, under the Isolation Forest algorithm the score of such a sample should approach zero).
Therefore, the characteristic value can be used to replace the corresponding value in the data sample to be evaluated, producing a simulated data sample. For example, given transaction data T_i = {C_1, C_2, ..., C_m}, if the characteristic value of the j-th feature variable C_j is C_j', then C_j' can replace the original value of C_j to generate the simulated data sample T_ij = {C_1, C_2, ..., C_j', ..., C_m}. Apart from the value of the j-th feature being C_j', T_ij keeps all other feature information identical to the original transaction data. Because T_i has m feature variables, m simulated data samples can be generated. Each simulated data sample differs from the original transaction data T_i in the value of exactly one feature variable; all other values remain unchanged.
As shown in Fig. 2, Fig. 2 is a schematic diagram of constructing simulated data samples provided by an embodiment of this specification. The original transaction data sample to be evaluated contains four feature variables: buyer gender, transaction amount, number of transactions by the buyer that day, and the interval since the buyer's previous transaction. Their characteristic values are 0 (female), 75 (average transaction amount), 1.2 (average number of daily transactions per buyer) and 22 (average transaction interval per buyer). For the data to be evaluated T_i = {1, 1000, 20, 2}, four corresponding simulated data samples can be constructed: T_i1 = {0, 1000, 20, 2}, T_i2 = {1, 75, 20, 2}, T_i3 = {1, 1000, 1.2, 2} and T_i4 = {1, 1000, 20, 22}.
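The construction of the simulated data samples can be sketched in a few lines of Python. This is a minimal illustration rather than the patent's implementation: the function name build_simulated_samples is an assumption, and only the sample values and characteristic values are taken from the Fig. 2 example.

```python
from typing import List, Sequence


def build_simulated_samples(sample: Sequence[float],
                            characteristic_values: Sequence[float]) -> List[List[float]]:
    """Return one simulated sample per feature variable.

    The j-th simulated sample equals `sample`, except that position j
    holds the characteristic value of the j-th feature variable.
    """
    if len(sample) != len(characteristic_values):
        raise ValueError("sample and characteristic_values must have equal length")
    simulated = []
    for j, representative in enumerate(characteristic_values):
        replaced = list(sample)          # copy, so the original sample is untouched
        replaced[j] = representative     # change exactly one feature variable
        simulated.append(replaced)
    return simulated


# Values from the Fig. 2 example: gender, amount, daily count, interval.
t_i = [1, 1000, 20, 2]
char_values = [0, 75, 1.2, 22]
simulated_samples = build_simulated_samples(t_i, char_values)
```

Each entry of simulated_samples differs from t_i in exactly one position, matching T_i1 to T_i4 above.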
S105: obtain the score given by the data model to the simulated data sample, and calculate the feature contribution value of that feature variable in the data sample from the score of the data sample to be evaluated and the score of the simulated data sample.
After the m simulated data samples are obtained, the same data model used before scores each of them. Continuing the example above, denote the data model's score for the original transaction data T_i as S_i, and its scores for T_i1, T_i2, ..., T_im as S_i1, S_i2, ..., S_im (that is, the score of T_ij is S_ij). Once all scores are obtained, the contribution of each feature variable to the score of the original data can be calculated. Denote the feature contribution value of the j-th feature variable of the i-th transaction data T_i as V_ij. V_ij measures the difference between the data model's score for the simulated data sample, in which the value of the j-th feature variable of T_i has been replaced by the characteristic value of that feature variable, and its score for the original data. V_ij may be an absolute contribution value or a relative contribution value, and the calculation method can be adjusted to actual needs. For example, the absolute contribution value is V_ij = |S_i - S_ij|.
Performing this calculation for each feature variable yields m quantified feature contribution values V_ij, one for each feature variable.
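As a rough sketch of this step, the code below computes the absolute contribution V_ij = |S_i - S_ij| for every feature variable. The scorer toy_score is a hypothetical stand-in for a trained data model (it is not Isolation Forest), and CENTRE and SCALE are illustrative assumptions; any callable that maps a sample to a numeric score could be passed instead.

```python
from typing import Callable, List, Sequence

Scorer = Callable[[Sequence[float]], float]


def feature_contributions(score: Scorer,
                          sample: Sequence[float],
                          characteristic_values: Sequence[float]) -> List[float]:
    """Absolute contribution |S_i - S_ij| of each feature variable."""
    original_score = score(sample)                    # S_i
    contributions = []
    for j, representative in enumerate(characteristic_values):
        simulated = list(sample)
        simulated[j] = representative                 # simulated sample T_ij
        contributions.append(abs(original_score - score(simulated)))
    return contributions


# Hypothetical scorer: grows towards 1 as the sample moves away from CENTRE.
CENTRE = [0, 75, 1.2, 22]
SCALE = [1, 1000, 20, 20]


def toy_score(sample: Sequence[float]) -> float:
    deviation = sum(abs(v - c) / s for v, c, s in zip(sample, CENTRE, SCALE))
    return deviation / (1 + deviation)


v_i = feature_contributions(toy_score, [1, 1000, 20, 2], CENTRE)
```

Passing the model as a callable keeps the calculation independent of the algorithm behind the score, which is the property the scheme relies on.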
S107: evaluate the feature variables of the data sample to be evaluated according to the magnitude of their feature contribution values.
From the above it follows that, for an m-dimensional data sample, it is precisely because the values of its feature variables deviate from their characteristic values that the data model gives it an abnormal score (in Isolation Forest this shows up as a high score). Hence the larger the feature contribution value V_ij, the greater the influence of the j-th variable of the data sample to be evaluated on the score under that data model. An abnormal data sample is identified as abnormal precisely because the values of some of its feature variables lie too far from their characteristic values. The scheme can therefore effectively find the feature variables with abnormal values, so that for any unknown data, and especially for data to be evaluated that the data model has classified as abnormal, the feature variables in the data can be effectively evaluated.
In the above scheme, a data sample to be evaluated is reconstructed using the characteristic value of a feature variable to generate a simulated data sample, and the data model then scores the simulated data sample. This shows how much the score of the data sample changes when the value of a given feature variable is changed, and therefore how strongly the value of that feature variable influences the score of the data sample to be evaluated. The influence of each feature variable on the score is expressed as a quantifiable feature contribution value, so the user of the data model can evaluate the contribution of each feature. With this approach the model user does not need a deep understanding of the business problem, the calculation is independent of the scenario in which the data model is used, and the calculation can be fully automated without human intervention, which substantially improves efficiency.
In practical applications, the characteristic value of a feature variable in S103 can generally be determined in advance by the user of the data model based on experience, or obtained statistically from the training samples of the data model. The specific procedure is as follows:
Obtain all training samples of the data model; determine the value of each training sample under the feature variable; calculate the characteristic value of the feature variable from the values of all training samples under that feature variable.
That is, each characteristic value should be obtained statistically from the values of the corresponding feature variable in the training samples, because the data model used for scoring is trained on those samples and the values of each feature variable in the training samples strongly influence the model's scores.
In one specific embodiment, the characteristic value can be obtained as follows: generate a statistic from the values of all training samples under the feature variable, and take that statistic as the characteristic value of the feature variable, where the statistic includes at least one of the median, the mode or the mean. The median is the value at the midpoint of the sorted values, and the mode is the most frequent value. It is easy to see that the median, mode and mean are each representative of the values of a set of data samples to some degree; which one to use can be decided according to actual needs.
For example, for a discrete feature variable (such as gender or education level), the mode can generally be chosen as the characteristic value. For instance, if the j-th feature variable C_j in transaction data T_i takes one of the values 0, 1 or 2, and 1 is the most frequent value among the samples, then the characteristic value C_j' is 1.
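A minimal sketch of deriving a characteristic value from training values, using Python's standard statistics module. The function name characteristic_value, the kind parameter, and the sample numbers are illustrative assumptions, not part of the source.

```python
import statistics
from typing import Sequence


def characteristic_value(values: Sequence[float], kind: str = "median") -> float:
    """Summarise the training values of one feature variable with a single statistic."""
    if kind == "median":
        return statistics.median(values)
    if kind == "mode":
        return statistics.mode(values)
    if kind == "mean":
        return statistics.mean(values)
    raise ValueError(f"unsupported statistic: {kind}")


# Discrete feature variable taking the values 0, 1 or 2: the mode is a natural choice.
discrete_values = [0, 1, 1, 2, 1, 0, 1]
discrete_characteristic = characteristic_value(discrete_values, "mode")

# Continuous feature variable with an outlier: the median is less affected than the mean.
amounts = [10, 60, 75, 80, 1000]
amount_characteristic = characteristic_value(amounts, "median")
```

Here discrete_characteristic is 1, as in the example above, and amount_characteristic is 75 even though one training value is much larger than the rest.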
Further, a representative subset of the training samples can stand in for all training samples, and a statistic computed on that subset can then serve as the characteristic value. The subset of training samples can be selected as follows:
Select a subset of training samples from all training samples according to the value of each training sample under the feature variable; generate the statistic from the values of that subset under the feature variable.
For example, suppose the values of a feature variable C_j across all training samples span the interval [0, 100]. Based on experience, the middle 20% of that interval, namely [40, 60], can be taken as the representative interval, and all training samples whose C_j value falls within it are treated as the representative subset. The characteristic value is then determined from statistics of this subset. Using a subset in place of all training samples reduces the amount of computation needed to determine the characteristic value and improves efficiency.
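The selection of a representative middle interval might look like the following sketch. The function name middle_interval_subset and the training values are hypothetical; the 20% fraction mirrors the [40, 60] example above.

```python
import statistics
from typing import List, Sequence


def middle_interval_subset(values: Sequence[float], fraction: float = 0.2) -> List[float]:
    """Keep the values that fall in the central `fraction` of the observed range."""
    low, high = min(values), max(values)
    centre = (low + high) / 2
    half_width = (high - low) * fraction / 2
    lower, upper = centre - half_width, centre + half_width
    return [v for v in values if lower <= v <= upper]


# Hypothetical training values of C_j spanning [0, 100].
c_j_values = [0, 10, 39, 40, 45, 55, 60, 61, 100]
representative = middle_interval_subset(c_j_values)      # values inside [40, 60]
c_j_characteristic = statistics.median(representative)   # statistic on the subset only
```

Only the four values inside [40, 60] enter the statistic, so the later calculation works on a smaller set than the full training data.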
Further, in the above scheme the subset of training samples can also be selected by binning, as follows:
Obtain the minimum and maximum of the values of all training samples under the feature variable to determine the value range; divide the value range into equal-width bins of a fixed length, generating several bin intervals; determine the number of values each bin interval contains; take the training samples in the bin interval with the most values as the subset of training samples.
That is, a feature variable with continuous values (for example, the transaction amount) can be discretized. First determine the value range of the training samples, then divide it into several equal-width bin intervals (the bin width can be chosen according to actual needs). The interval containing the most values is taken as the interval that best represents the feature variable, and the characteristic value of the feature variable is then computed from the values in that interval.
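The binning procedure can be sketched as below. The function name densest_bin_values, the bin width of 100 and the transaction amounts are illustrative assumptions; ties between equally populated bins are broken in favour of the lower bin, a choice the source does not specify.

```python
import statistics
from typing import Dict, List, Sequence


def densest_bin_values(values: Sequence[float], bin_width: float) -> List[float]:
    """Split the value range into equal-width bins and return the values of the fullest bin."""
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    low = min(values)
    bins: Dict[int, List[float]] = {}
    for v in values:
        index = int((v - low) // bin_width)   # bin k covers [low + k*w, low + (k+1)*w)
        bins.setdefault(index, []).append(v)
    # max() returns the first maximum, so iterating sorted keys prefers the lower bin on ties.
    fullest = max(sorted(bins), key=lambda k: len(bins[k]))
    return bins[fullest]


# Hypothetical transaction amounts from the training samples.
amounts = [5, 60, 70, 75, 80, 95, 300, 1000]
dense_values = densest_bin_values(amounts, bin_width=100)
amount_characteristic = statistics.median(dense_values)
```

With these numbers the first bin holds six of the eight amounts, so the characteristic value is computed from that bin alone and the two large outliers are ignored.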
In practical applications, the feature contribution value V_ij can generally be calculated in one of the following two ways:
First, determine the absolute value of the difference between the score of the data sample to be evaluated and the score of the simulated data sample, and take that absolute value as the feature contribution value of the feature variable: V_ij = |S_i - S_ij|. A V_ij obtained this way can be called an absolute feature contribution value.
Second, take the quotient of that absolute difference and the score of the data sample to be evaluated as the feature contribution value of the feature variable: V_ij = |S_i - S_ij| / S_i. A feature contribution value obtained this way can be called a relative feature contribution value.
In addition, the absolute feature contribution value can be processed further, for example by squaring it, taking its root, multiplying it by a scaling factor, or normalizing it. These choices can be made according to actual needs and do not limit this scheme.
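The two formulas, plus one possible normalization, are sketched below. The function names are assumptions, and dividing by the sum of contributions is only one of several normalizations the text allows; the zero-score guard is also an added assumption, since the source does not discuss S_i = 0.

```python
from typing import List, Sequence


def absolute_contribution(s_i: float, s_ij: float) -> float:
    """V_ij = |S_i - S_ij|."""
    return abs(s_i - s_ij)


def relative_contribution(s_i: float, s_ij: float) -> float:
    """V_ij = |S_i - S_ij| / S_i; undefined when the original score is zero."""
    if s_i == 0:
        raise ValueError("relative contribution is undefined for S_i == 0")
    return abs(s_i - s_ij) / s_i


def normalise(contributions: Sequence[float]) -> List[float]:
    """Scale contributions so that they sum to 1 (one possible post-processing step)."""
    total = sum(contributions)
    if total == 0:
        return [0.0 for _ in contributions]
    return [c / total for c in contributions]


absolute = absolute_contribution(0.75, 0.5)
relative = relative_contribution(0.75, 0.5)
shares = normalise([1, 1, 2])
```

The relative form makes contributions comparable across samples with very different original scores, while the absolute form keeps the units of the model score.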
In practical applications, the model user may not want to see the evaluation of every feature variable in an abnormal data sample, and may only want to know which feature variables caused the data to be abnormal. Accordingly, in S107, evaluating the feature variables of the data sample to be evaluated according to the magnitude of their feature contribution values comprises:
Sorting the feature variables of the data sample to be evaluated by the magnitude of their feature contribution values to generate a ranking; taking a specified number of feature variables from the top of the ranking and determining them to be the key feature variables that influence the score of the data sample to be evaluated.
Specifically, the feature contribution values are sorted in descending order, and the first n feature variables (n can be set freely as needed) are determined to be the key feature variables with the greatest influence on the score of the data sample to be evaluated.
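Ranking and truncation can be sketched as follows. The function name top_key_features and the contribution numbers are made up for illustration.

```python
from typing import List, Sequence


def top_key_features(names: Sequence[str],
                     contributions: Sequence[float],
                     n: int) -> List[str]:
    """Return the names of the n feature variables with the largest contribution values."""
    if len(names) != len(contributions):
        raise ValueError("names and contributions must have equal length")
    ranked = sorted(zip(names, contributions), key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:n]]


# Hypothetical contribution values for four feature variables.
feature_names = ["gender", "amount", "daily_count", "interval"]
v_i = [0.02, 0.31, 0.18, 0.07]
key_features = top_key_features(feature_names, v_i, n=2)
```

With these values the two key feature variables are "amount" and "daily_count".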
Further, in practical applications, when the data model scores or classifies a data sample to be evaluated, the values of the key feature variables already determined can also be retrieved to generate corresponding key feature information, which is output together with the data model's result. This is done as follows: for each key feature variable, obtain its value in the data sample to be evaluated; generate key feature information containing all key feature variables and their values, so that the user can carry out business processing based on the key feature information.
As shown in Fig. 3, Fig. 3 is a schematic diagram of outputting key feature information provided by an embodiment of this specification. Suppose the data model classifies a data sample to be evaluated T_i as abnormal, and T_i contains ten feature variables C_1 to C_10. Using the above scheme, it is determined that the three feature variables of T_i with the largest feature contribution values are C_1, C_2 and C_3, with values a, b and c respectively. Then, while the data model outputs the detection result "abnormal", the key feature information shown in Fig. 3 can also be output: infocode(T_i) = {C_1 = a; C_2 = b; C_3 = c}. The model user then knows that the values of these three key feature variables caused the data model to classify the transaction as abnormal, and can make business decisions with a clearer understanding. Overall, as shown in Fig. 4, which is a schematic block diagram of outputting key feature information provided by an embodiment of this specification, the whole process comprises four parts: data input, data reconstruction, calculation of V_ij, and output of the infocode.
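A sketch of assembling the infocode output is given below. The dictionary layout, the function names and the contribution numbers are assumptions; the source only specifies that the output pairs each key feature variable with its value.

```python
from typing import Dict


def key_feature_info(sample: Dict[str, object],
                     contributions: Dict[str, float],
                     n: int = 3) -> Dict[str, object]:
    """Map the n feature variables with the largest contributions to their values in the sample."""
    ranked = sorted(contributions, key=contributions.get, reverse=True)
    return {name: sample[name] for name in ranked[:n]}


def format_infocode(info: Dict[str, object]) -> str:
    """Render key feature information in the {C1=a;C2=b;C3=c} style of Fig. 3."""
    return "{" + ";".join(f"{name}={value}" for name, value in info.items()) + "}"


# Hypothetical sample and contribution values.
sample = {"C1": "a", "C2": "b", "C3": "c", "C4": "d"}
contributions = {"C1": 0.30, "C2": 0.22, "C3": 0.15, "C4": 0.01}
infocode = format_infocode(key_feature_info(sample, contributions))
```

The resulting string, {C1=a;C2=b;C3=c}, could be returned alongside the model's "abnormal" result.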
It should be noted that the above scheme has generally been illustrated with abnormal data, but in practical applications it can be used to assess the features of arbitrary data. The algorithm used by the data model is likewise not limited to the Isolation Forest algorithm; it is only necessary that the algorithm detect data by quantitatively scoring the data on the basis of the values of its features.
In addition, when constructing a simulated data sample, the above scheme changes only one characteristic variable and keeps the other values unchanged. It is also possible, however, to change the values of a combination of multiple characteristic variables simultaneously while keeping the other values unchanged. The resulting feature contribution value then measures the influence of that combination of characteristic variables on the score of the data. In this mode, the combination of multiple characteristic variables can be regarded as one compound characteristic variable.
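The compound-variable mode can be illustrated with a short Python sketch. The function name `compound_contribution`, the dict-based samples and the scoring function `interaction_score` are assumptions made for illustration only; the scoring function is deliberately chosen so that two variables interact, which is the case where replacing them together differs from replacing each one alone.

```python
from typing import Callable, Dict, Iterable

Sample = Dict[str, float]


def compound_contribution(score: Callable[[Sample], float],
                          sample: Sample,
                          characteristic: Sample,
                          group: Iterable[str]) -> float:
    """Contribution of a group of variables replaced simultaneously."""
    simulated = dict(sample)
    for name in group:                      # replace every variable in the group
        simulated[name] = characteristic[name]
    return abs(score(sample) - score(simulated))


def interaction_score(s: Sample) -> float:
    # Hypothetical model in which C1 and C2 interact multiplicatively.
    return s["C1"] * s["C2"] + s["C3"]


sample = {"C1": 3.0, "C2": 4.0, "C3": 5.0}
characteristic = {"C1": 1.0, "C2": 1.0, "C3": 5.0}

only_c1 = compound_contribution(interaction_score, sample, characteristic, ["C1"])
only_c2 = compound_contribution(interaction_score, sample, characteristic, ["C2"])
both = compound_contribution(interaction_score, sample, characteristic, ["C1", "C2"])
print(only_c1, only_c2, both)
```

In this toy case the joint contribution is not the sum of the two single-variable contributions, which is why the combination is worth treating as a variable in its own right.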
Based on the same idea, the present invention also provides a data characteristic assessment device. Fig. 5 is a schematic structural diagram of the data characteristic assessment device provided by an embodiment of this specification. As shown in Fig. 5, the device comprises:
a scoring module 501, which obtains the score given by the data model to the data sample to be assessed, wherein the data sample to be assessed contains multiple characteristic variables and their corresponding values;
a generation module 503, which, for any characteristic variable, replaces the value corresponding to that characteristic variable in the data sample to be assessed with the previously obtained characteristic value of the characteristic variable, generating a simulated data sample;
a computing module 505, which obtains the score given by the data model to the simulated data sample and calculates, from the score of the data sample to be assessed and the score of the simulated data sample, the feature contribution value of the characteristic variable in the data sample;
an evaluation module 507, which assesses the characteristic variables of the data sample to be assessed according to the magnitudes of their feature contribution values.
Further, the device also includes a characteristic value acquisition module 509, which obtains all training samples of the data model; determines the value of each training sample under the characteristic variable; and calculates the characteristic value of the characteristic variable from those values.
Further, the characteristic value acquisition module 509 generates a statistical value from the values of all training samples under the characteristic variable, and determines the statistical value to be the characteristic value of the characteristic variable, wherein the statistical value includes at least one of the median, the mode or the mean.
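As an illustration of the statistic-based characteristic value, the following Python sketch computes one value per characteristic variable from a set of training samples using the standard `statistics` module. The function name `characteristic_values`, the `statistic` parameter and the sample data are assumptions introduced here, not part of this specification.

```python
import statistics
from typing import Dict, List

Sample = Dict[str, float]

# The three statistics named in the text, mapped to standard-library functions.
STATISTICS = {
    "median": statistics.median,
    "mode": statistics.mode,
    "mean": statistics.mean,
}


def characteristic_values(training_samples: List[Sample],
                          statistic: str = "median") -> Sample:
    """Return one characteristic value per characteristic variable."""
    compute = STATISTICS[statistic]
    names = training_samples[0].keys()
    return {name: compute([s[name] for s in training_samples]) for name in names}


# Hypothetical training samples with two characteristic variables.
training = [
    {"C1": 1, "C2": 10},
    {"C1": 2, "C2": 10},
    {"C1": 2, "C2": 40},
    {"C1": 7, "C2": 20},
]

print(characteristic_values(training, "median"))
print(characteristic_values(training, "mode"))
print(characteristic_values(training, "mean"))
```

The median and mode are less sensitive to outliers than the mean, which may matter when the training set itself contains abnormal samples.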
Further, the characteristic value acquisition module 509 selects a subset of training samples from all the training samples according to the values of all training samples under the characteristic variable, and generates the statistical value from the values of that subset of training samples under the characteristic variable.
Further, the characteristic value acquisition module 509 obtains the minimum and maximum of the values of all training samples under the characteristic variable to determine the value interval; performs equal-width binning on the value interval according to a fixed bin length, generating several binned value intervals; determines the number of values contained in each binned value interval; and determines the training samples corresponding to the binned value interval with the largest number of values to be the subset of training samples.
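The equal-width binning step can be sketched as below. This is a minimal illustration, assuming a plain list of values for a single characteristic variable; the function name `densest_bin_values`, the handling of the maximum value (placed in the last bin) and the tie-breaking rule (the first of several equally full bins wins) are choices made for this sketch and are not specified in the text.

```python
import math
import statistics
from typing import List


def densest_bin_values(values: List[float], bin_width: float) -> List[float]:
    """Return the values that fall in the most populated equal-width bin."""
    low, high = min(values), max(values)            # value interval
    n_bins = max(1, math.ceil((high - low) / bin_width))
    bins: List[List[float]] = [[] for _ in range(n_bins)]
    for v in values:
        # Clamp the index so that the maximum value lands in the last bin.
        idx = min(int((v - low) / bin_width), n_bins - 1)
        bins[idx].append(v)
    return max(bins, key=len)                       # bin with the most values


# Hypothetical values of one characteristic variable over the training set.
values = [1, 2, 2, 3, 3, 3, 4, 9, 15, 20]
subset = densest_bin_values(values, bin_width=5)
print(subset)
print(statistics.median(subset))
```

Computing the statistic over the densest bin only keeps sparse extreme values, such as 15 and 20 here, from pulling the characteristic value away from where most training samples lie.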
Further, the computing module 505 determines the absolute value of the difference between the score of the data sample to be assessed and the score of the simulated data sample, and either determines the absolute value of the difference to be the feature contribution value of the characteristic variable, or determines the quotient of the absolute value of the difference and the score of the data sample to be assessed to be the feature contribution value of the characteristic variable.
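The two alternatives for the feature contribution value can be written out directly. In the sketch below the function names are assumptions, and the explicit check for a zero score is an addition made here to keep the quotient well defined; the text itself does not say how that case is handled.

```python
def absolute_contribution(original_score: float, simulated_score: float) -> float:
    """Absolute difference between the two scores."""
    return abs(original_score - simulated_score)


def relative_contribution(original_score: float, simulated_score: float) -> float:
    """Absolute difference divided by the score of the sample to be assessed."""
    if original_score == 0:
        # Assumption of this sketch: refuse to divide by a zero score.
        raise ValueError("score of the sample to be assessed is zero")
    return abs(original_score - simulated_score) / original_score


print(absolute_contribution(8.0, 6.0))
print(relative_contribution(8.0, 6.0))
```

The relative form expresses the change as a fraction of the original score, which can make contribution values more comparable across samples whose scores differ in scale.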
Further, the evaluation module 507 sorts the characteristic variables in the data sample to be assessed according to the magnitudes of their feature contribution values, generating a ranking result; and, starting from the top of the ranking result, takes a specified number of characteristic variables and determines them to be the key characteristic variables that influence the score of the data sample to be assessed.
Further, the device also includes an information generating module 511, which, for any key characteristic variable, obtains its corresponding value in the data sample to be assessed, and generates key feature information containing all key characteristic variables and their corresponding values, so that the user can carry out business processing according to the key feature information.
Correspondingly, an embodiment of the present application also provides a data characteristic assessment equipment, comprising:
a memory, which stores a data characteristic assessment program;
a processor, which calls the data characteristic assessment program in the memory and executes:
obtaining the score given by the data model to the data sample to be assessed, wherein the data sample to be assessed contains multiple characteristic variables and their corresponding values;
for any characteristic variable, replacing the value corresponding to that characteristic variable in the data sample to be assessed with the previously obtained characteristic value of the characteristic variable, generating a simulated data sample;
obtaining the score given by the data model to the simulated data sample, and calculating, from the score of the data sample to be assessed and the score of the simulated data sample, the feature contribution value of the characteristic variable in the data sample;
assessing the characteristic variables of the data sample to be assessed according to the magnitudes of their feature contribution values.
Based on the same inventive idea, an embodiment of the present application also provides a corresponding non-volatile computer storage medium storing computer-executable instructions, the computer-executable instructions being configured to:
obtain the score given by the data model to the data sample to be assessed, wherein the data sample to be assessed contains multiple characteristic variables and their corresponding values;
for any characteristic variable, replace the value corresponding to that characteristic variable in the data sample to be assessed with the previously obtained characteristic value of the characteristic variable, generating a simulated data sample;
obtain the score given by the data model to the simulated data sample, and calculate, from the score of the data sample to be assessed and the score of the simulated data sample, the feature contribution value of the characteristic variable in the data sample;
assess the characteristic variables of the data sample to be assessed according to the magnitudes of their feature contribution values.
The embodiments in this specification are described in a progressive manner. For parts that are the same or similar between embodiments, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the device, equipment and medium embodiments are substantially similar to the method embodiment, they are described relatively briefly; for the relevant details, refer to the corresponding parts of the method embodiment, which are not repeated here.
Specific embodiments of this specification have been described above. Other embodiments are within the scope of the appended claims. In some cases, the actions, steps or modules recited in the claims can be performed in an order different from that in the embodiments and still achieve the desired result. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired result. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
In the 1990s, an improvement to a technology could be clearly distinguished as either a hardware improvement (for example, an improvement to a circuit structure such as a diode, a transistor or a switch) or a software improvement (an improvement to a method flow). With the development of technology, however, many of today's improvements to method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be implemented with a hardware entity module. For example, a programmable logic device (PLD), such as a field programmable gate array (FPGA), is an integrated circuit whose logic function is determined by the user's programming of the device. A designer programs to "integrate" a digital system onto a single PLD, without asking a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Moreover, instead of making integrated circuit chips by hand, this programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must also be written in a specific programming language, called a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM and RHDL (Ruby Hardware Description Language); the most commonly used at present are VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog. Those skilled in the art should also understand that a hardware circuit implementing a logical method flow can be readily obtained simply by lightly logic-programming the method flow in one of the above hardware description languages and programming it into an integrated circuit.
A controller can be implemented in any suitable manner. For example, the controller can take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, an application specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20 and Silicon Labs C8051F320. A memory controller can also be implemented as part of the control logic of a memory. Those skilled in the art also know that, in addition to implementing a controller purely as computer-readable program code, it is entirely possible to logic-program the method steps so that the controller implements the same functions in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such a controller can therefore be regarded as a hardware component, and the devices included in it for implementing various functions can also be regarded as structures within the hardware component. The devices for implementing various functions can even be regarded both as software modules implementing a method and as structures within a hardware component.
The systems, devices, modules or units described in the above embodiments can be implemented by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer. Specifically, the computer may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an e-mail device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above device is described as being divided into various units by function. Of course, when implementing the embodiments of this specification, the functions of the units can be implemented in one or more pieces of software and/or hardware.
Those skilled in the art should understand that embodiments of the present invention can be provided as a method, a system or a computer program product. Accordingly, the present invention can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical memory and the like) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, equipment (system) and computer program product according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a device for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions can also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, the instruction device implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is executed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), an input/output interface, a network interface and memory.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information can be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise" and any other variants are intended to cover non-exclusive inclusion, so that a process, method, commodity or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, commodity or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of additional identical elements in the process, method, commodity or device that includes the element.
Those skilled in the art will understand that one or more embodiments of this specification can be provided as a method, a system or a computer program product. Accordingly, the embodiments of this specification can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the embodiments of this specification can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical memory and the like) containing computer-usable program code.
The embodiments of this specification can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures and the like that perform particular tasks or implement particular abstract data types. The embodiments of this specification can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules can be located in both local and remote computer storage media, including storage devices.

Claims (17)

1. A data characteristic evaluation method, comprising:
obtaining a score given by a data model to a data sample to be assessed, wherein the data sample to be assessed contains multiple characteristic variables and their corresponding values;
for any characteristic variable, replacing the value corresponding to the characteristic variable in the data sample to be assessed with a previously obtained characteristic value of the characteristic variable, to generate a simulated data sample;
obtaining a score given by the data model to the simulated data sample, and calculating, according to the score of the data sample to be assessed and the score of the simulated data sample, a feature contribution value of the characteristic variable in the data sample;
assessing the characteristic variables of the data sample to be assessed according to the magnitudes of the feature contribution values of the characteristic variables.
2. The method according to claim 1, wherein the characteristic value of the characteristic variable is obtained in advance as follows:
obtaining all training samples of the data model;
determining the value of each of the training samples under the characteristic variable;
calculating the characteristic value of the characteristic variable according to the values of all the training samples under the characteristic variable.
3. The method according to claim 2, wherein calculating the characteristic value of the characteristic variable according to the values of all the training samples under the characteristic variable comprises:
generating a statistical value according to the values of all the training samples under the characteristic variable;
determining the statistical value to be the characteristic value of the characteristic variable, wherein the statistical value includes at least one of a median, a mode or a mean.
4. The method according to claim 3, wherein generating a statistical value according to the values of all the training samples under the characteristic variable comprises:
selecting a subset of training samples from all the training samples according to the values of all the training samples under the characteristic variable;
generating the statistical value according to the values of the subset of training samples under the characteristic variable.
5. The method according to claim 4, wherein selecting a subset of training samples from all the training samples according to the values of all the training samples under the characteristic variable comprises:
obtaining the minimum and maximum of the values of all the training samples under the characteristic variable to determine a value interval;
performing equal-width binning on the value interval according to a fixed bin length, to generate several binned value intervals;
determining the number of values contained in each binned value interval;
determining the training samples corresponding to the binned value interval with the largest number of values to be the subset of training samples.
6. The method according to claim 1, wherein calculating the feature contribution value of the characteristic variable in the data sample according to the score of the data sample to be assessed and the score of the simulated data sample comprises:
determining the absolute value of the difference between the score of the data sample to be assessed and the score of the simulated data sample;
determining the absolute value of the difference to be the feature contribution value of the characteristic variable, or determining the quotient of the absolute value of the difference and the score of the data sample to be assessed to be the feature contribution value of the characteristic variable.
7. The method according to claim 1, wherein assessing the characteristic variables of the data sample to be assessed according to the magnitudes of the feature contribution values of the characteristic variables comprises:
sorting the characteristic variables in the data sample to be assessed according to the magnitudes of their feature contribution values, to generate a ranking result;
starting from the top of the ranking result, taking a specified number of characteristic variables and determining them to be key characteristic variables that influence the score of the data sample to be assessed.
8. The method according to claim 7, further comprising:
for any key characteristic variable, obtaining its corresponding value in the data sample to be assessed;
generating key feature information containing all the key characteristic variables and their corresponding values, so that a user carries out business processing according to the key feature information.
9. A data characteristic assessment device, comprising:
a scoring module, which obtains a score given by a data model to a data sample to be assessed, wherein the data sample to be assessed contains multiple characteristic variables and their corresponding values;
a generation module, which, for any characteristic variable, replaces the value corresponding to the characteristic variable in the data sample to be assessed with a previously obtained characteristic value of the characteristic variable, to generate a simulated data sample;
a computing module, which obtains a score given by the data model to the simulated data sample, and calculates, according to the score of the data sample to be assessed and the score of the simulated data sample, a feature contribution value of the characteristic variable in the data sample;
an evaluation module, which assesses the characteristic variables of the data sample to be assessed according to the magnitudes of the feature contribution values of the characteristic variables.
10. The device according to claim 9, further comprising a characteristic value acquisition module, which obtains all training samples of the data model; determines the value of each of the training samples under the characteristic variable; and calculates the characteristic value of the characteristic variable according to the values of all the training samples under the characteristic variable.
11. The device according to claim 10, wherein the characteristic value acquisition module generates a statistical value according to the values of all the training samples under the characteristic variable, and determines the statistical value to be the characteristic value of the characteristic variable, wherein the statistical value includes at least one of a median, a mode or a mean.
12. The device according to claim 11, wherein the characteristic value acquisition module selects a subset of training samples from all the training samples according to the values of all the training samples under the characteristic variable, and generates the statistical value according to the values of the subset of training samples under the characteristic variable.
13. The device according to claim 12, wherein the characteristic value acquisition module obtains the minimum and maximum of the values of all the training samples under the characteristic variable to determine a value interval; performs equal-width binning on the value interval according to a fixed bin length, to generate several binned value intervals; determines the number of values contained in each binned value interval; and determines the training samples corresponding to the binned value interval with the largest number of values to be the subset of training samples.
14. The device according to claim 9, wherein the computing module determines the absolute value of the difference between the score of the data sample to be assessed and the score of the simulated data sample, and either determines the absolute value of the difference to be the feature contribution value of the characteristic variable, or determines the quotient of the absolute value of the difference and the score of the data sample to be assessed to be the feature contribution value of the characteristic variable.
15. The device according to claim 9, wherein the evaluation module sorts the characteristic variables in the data sample to be assessed according to the magnitudes of their feature contribution values, to generate a ranking result; and, starting from the top of the ranking result, takes a specified number of characteristic variables and determines them to be key characteristic variables that influence the score of the data sample to be assessed.
16. The device according to claim 15, further comprising an information generating module, which, for any key characteristic variable, obtains its corresponding value in the data sample to be assessed, and generates key feature information containing all the key characteristic variables and their corresponding values, so that a user carries out business processing according to the key feature information.
17. A data characteristic assessment equipment, comprising:
a memory, which stores a data characteristic assessment program;
a processor, which calls the data characteristic assessment program in the memory and executes:
obtaining a score given by a data model to a data sample to be assessed, wherein the data sample to be assessed contains multiple characteristic variables and their corresponding values;
for any characteristic variable, replacing the value corresponding to the characteristic variable in the data sample to be assessed with a previously obtained characteristic value of the characteristic variable, to generate a simulated data sample;
obtaining a score given by the data model to the simulated data sample, and calculating, according to the score of the data sample to be assessed and the score of the simulated data sample, a feature contribution value of the characteristic variable in the data sample;
assessing the characteristic variables of the data sample to be assessed according to the magnitudes of the feature contribution values of the characteristic variables.
CN201810435231.3A 2018-05-09 2018-05-09 Data characteristic evaluation method, device and equipment Active CN110472802B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810435231.3A CN110472802B (en) 2018-05-09 2018-05-09 Data characteristic evaluation method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810435231.3A CN110472802B (en) 2018-05-09 2018-05-09 Data characteristic evaluation method, device and equipment

Publications (2)

Publication Number Publication Date
CN110472802A true CN110472802A (en) 2019-11-19
CN110472802B CN110472802B (en) 2023-12-01

Family

ID=68503326

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810435231.3A Active CN110472802B (en) 2018-05-09 2018-05-09 Data characteristic evaluation method, device and equipment

Country Status (1)

Country Link
CN (1) CN110472802B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111340102A (en) * 2020-02-24 2020-06-26 支付宝(杭州)信息技术有限公司 Method and apparatus for evaluating model interpretation tools
CN111815435A (en) * 2020-07-14 2020-10-23 深圳市卡牛科技有限公司 Visualization method, device, equipment and storage medium for group risk characteristics
CN111898765A (en) * 2020-07-29 2020-11-06 深圳前海微众银行股份有限公司 Feature binning method, device, equipment and readable storage medium
CN113052325A (en) * 2021-03-25 2021-06-29 北京百度网讯科技有限公司 Method, device, equipment, storage medium and program product for optimizing online model
CN113408582A (en) * 2021-05-17 2021-09-17 支付宝(杭州)信息技术有限公司 Training method and device of feature evaluation model
CN113570236A (en) * 2021-07-23 2021-10-29 中信银行股份有限公司 Scoring card reason code technology operation method and device, terminal equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107633265A (en) * 2017-09-04 2018-01-26 深圳市华傲数据技术有限公司 For optimizing the data processing method and device of credit evaluation model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
陆辉;: "电子档案大数据的可视化组织和分析", 科技通报, no. 12, pages 183 - 186 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111340102A (en) * 2020-02-24 2020-06-26 支付宝(杭州)信息技术有限公司 Method and apparatus for evaluating model interpretation tools
CN111815435A (en) * 2020-07-14 2020-10-23 深圳市卡牛科技有限公司 Visualization method, device, equipment and storage medium for group risk characteristics
CN111898765A (en) * 2020-07-29 2020-11-06 深圳前海微众银行股份有限公司 Feature binning method, device, equipment and readable storage medium
CN113052325A (en) * 2021-03-25 2021-06-29 北京百度网讯科技有限公司 Method, device, equipment, storage medium and program product for optimizing online model
CN113408582A (en) * 2021-05-17 2021-09-17 支付宝(杭州)信息技术有限公司 Training method and device of feature evaluation model
CN113408582B (en) * 2021-05-17 2023-08-29 支付宝(杭州)信息技术有限公司 Training method and device for feature evaluation model
CN113570236A (en) * 2021-07-23 2021-10-29 中信银行股份有限公司 Scoring card reason code technology operation method and device, terminal equipment and storage medium

Also Published As

Publication number Publication date
CN110472802B (en) 2023-12-01

Similar Documents

Publication Publication Date Title
CN110472802A (en) Data characteristic evaluation method, device and equipment
CN109242135B (en) Model operation method, device and business server
CN108734479A (en) Data processing method, device, equipment and server for insurance fraud identification
CN108334647A (en) Data processing method, device, equipment and server for insurance fraud identification
CN108596410B (en) Automatic wind control event processing method and device
CN108921569A (en) Method and device for determining customer complaint type
CN109242519B (en) Abnormal behavior identification method, device and equipment
CN108960719A (en) Selection method and apparatus and computer readable storage medium
CN108921566A (en) Wash sale identification method and device based on graph structure model
US11699106B2 (en) Categorical feature enhancement mechanism for gradient boosting decision tree
CN110119860A (en) Junk account detection method, device and equipment
CN110046633A (en) Data quality detection method and device
CN111494964B (en) Virtual article recommendation method, model training method, device and storage medium
CN109034534A (en) Model score interpretation method, device and equipment
CN108346107A (en) Social content risk identification method, device and equipment
CN109711424A (en) Behavior rule acquisition method, device and equipment based on decision tree
US20190251609A1 (en) Commodity demand prediction system, commodity demand prediction method, and commodity demand prediction program
CN109614414A (en) Method and device for determining user information
CN110378400A (en) Model training method and device for image recognition
CN110110035A (en) Data processing method and device and computer readable storage medium
CN114490786B (en) Data sorting method and device
CN112036737A (en) Method and device for calculating regional electric quantity deviation
CN110033092A (en) Data label generation, model training, event recognition method and device
CN109492401A (en) Content vector risk detection method, device, equipment and medium
CN110033117A (en) Model calibration method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20201029

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant before: Advanced innovation technology Co.,Ltd.

Effective date of registration: 20201029

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Advanced innovation technology Co.,Ltd.

Address before: P.O. Box 847, 4th Floor, Capital Building, Grand Cayman, Cayman Islands

Applicant before: Alibaba Group Holding Ltd.

GR01 Patent grant