CN110472802B

CN110472802B - Data characteristic evaluation method, device and equipment

Info

Publication number: CN110472802B
Application number: CN201810435231.3A
Authority: CN
Inventors: 刘腾飞
Original assignee: Advanced New Technologies Co Ltd
Current assignee: Advanced New Technologies Co Ltd; Advantageous New Technologies Co Ltd
Priority date: 2018-05-09
Filing date: 2018-05-09
Publication date: 2023-12-01
Anticipated expiration: 2038-05-09
Also published as: CN110472802A

Abstract

The embodiments of the present specification disclose a data feature evaluation method, apparatus and device. The feature value of a feature variable is used to reconstruct a data sample to be evaluated and generate a simulated data sample, and a data model is used to score the simulated data sample. This determines how much a change in the value of a given feature variable changes the score of the data sample, so the influence of that value on the score of the data sample to be evaluated can be learned from the change in score. In other words, the influence of each feature variable on the score is reflected by a quantifiable feature contribution value, and a user of the data model can make subsequent business decisions and carry out processing according to the feature contribution values.

Description

Data characteristic evaluation method, device and equipment

Technical Field

The present disclosure relates to the field of computer technologies, and in particular, to a method, an apparatus, and a device for evaluating data features.

Background

In the application of artificial intelligence techniques, machine learning models are widely used for data classification and anomaly detection.

In such applications, a data sample typically contains a plurality of feature variables, and the trained data model scores the data sample based on those feature variables. In this process, the data model often resembles a "black box" to its user: although it can give detection results for different data samples and aid decision making, it is often unclear why the data model reaches a given conclusion or what role each feature plays in it.

Based on this, a more efficient data feature evaluation scheme is needed.

Disclosure of Invention

The embodiments of the present specification provide a data feature evaluation method, apparatus and device, which are intended to solve the following problem: providing a more efficient data feature evaluation scheme.

Based on this, the embodiment of the present specification provides a data feature evaluation method, including:

obtaining a score of a data model for a data sample to be evaluated, wherein the data sample to be evaluated comprises a plurality of characteristic variables and corresponding values;

for any feature variable, replacing the value corresponding to the feature variable in the data sample to be evaluated with a feature value of the feature variable obtained in advance, to generate a simulated data sample;

obtaining the scores of the data model for the simulated data samples, and calculating the feature contribution values of the feature variables in the data sample according to the score of the data sample to be evaluated and the scores of the simulated data samples;

and evaluating the characteristic variable of the data sample to be evaluated according to the characteristic contribution value of the characteristic variable.

Meanwhile, an embodiment of the present disclosure further provides a data feature evaluation apparatus, including:

The scoring module is used for obtaining the score of the data model on the data sample to be evaluated, wherein the data sample to be evaluated comprises a plurality of characteristic variables and corresponding values;

the generation module is used for replacing, for any feature variable, the value corresponding to the feature variable in the data sample to be evaluated with a feature value of the feature variable obtained in advance, to generate a simulated data sample;

the calculation module is used for obtaining the scores of the data model for the simulated data samples and calculating the feature contribution values of the feature variables in the data sample according to the score of the data sample to be evaluated and the scores of the simulated data samples;

and the evaluation module evaluates the characteristic variables of the data sample to be evaluated according to the characteristic contribution values of the characteristic variables.

Correspondingly, the embodiment of the specification also provides a data characteristic evaluation device, which comprises:

a memory storing a data feature evaluation program;

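and a processor which calls the data feature evaluation program in the memory and executes the steps of the data feature evaluation method described above.

Correspondingly, embodiments of the present specification also provide a non-volatile computer storage medium storing computer-executable instructions arranged to carry out the steps of the data feature evaluation method described above.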

At least one of the technical solutions adopted in the embodiments of the present specification can achieve the following beneficial effects:

The feature value of a feature variable is used to reconstruct the data sample to be evaluated and generate a simulated data sample, and the data model is then used to score the simulated data sample. This determines how much the score of the data sample changes when the value of a given feature variable changes, and the influence of that value on the score of the data sample to be evaluated can be learned from the change in score. In other words, the influence of each feature variable on the score is reflected by a quantifiable feature contribution value. A user of the data model can thus evaluate the size of each feature's contribution and generate key feature information as a reference for why each data sample receives its score. The model user does not need in-depth knowledge of the business problem, the calculation is independent of the scenario in which the data model is used, and the calculation can be fully automated without human intervention, which improves efficiency.

Drawings

FIG. 1 is a schematic flow chart of a data feature evaluation scheme according to an embodiment of the present disclosure;

FIG. 2 is a schematic diagram of a constructed analog data sample according to an embodiment of the present disclosure;

FIG. 3 is a schematic diagram of output key feature information according to an embodiment of the present disclosure;

FIG. 4 is a schematic block diagram of output key feature information provided by an embodiment of the present disclosure;

FIG. 5 is a schematic structural diagram of a data feature evaluation device according to an embodiment of the present disclosure.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be clearly and completely described below with reference to specific embodiments of the present application and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, embodiments of the application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present application based on the embodiments herein.

As described above, machine learning is widely used in various business scenarios. For example, it can help determine whether a transaction is abnormal, whether a certain operation of a user is risky, whether a user applying for a loan will repay it, and so on. The usual approach is to train a data model on training samples and then use the data model to detect abnormal data. The training samples may be labeled or unlabeled.

In this process, the machine learning model often resembles a "black box" to many users of its results: although it gives detection results and aids decision making, it is often unclear why the model reaches a given conclusion. This lack of interpretability makes the data model less user-friendly and the system harder to use. Therefore, to explain a model result to business personnel, a simple explanation is given alongside the result; that is, key feature information is used to explain which factors caused the data model to score or classify the sample as it did, which helps business personnel make better business decisions.

At present, key feature information is mainly generated in the following ways:

1. Generating key feature information based on training labels. A training label is the target that the machine learning model is to predict; for example, when determining whether a transaction is fraudulent, whether the transaction is fraudulent (a yes/no value) is the model's label. When training labels exist in the training data, the ability of different feature variables to discriminate the target can be calculated from the labels, or a model with relatively strong interpretability can be selected. This approach relies heavily on label data and cannot be used when no labels are available in the real scenario.

2. Manually formulating key feature information based on business rules. For example, in a loan application scenario, business personnel know from experience that variables such as whether the user is employed, the user's annual income, and whether the user owns property often have an important influence on the final model result. Based on such business knowledge, rules can be set manually so that only the important feature variables are considered, and the values of these variables are output as key feature information. This approach requires the model user to have strong business knowledge and cannot be automated: whenever the same model is moved to a different application scenario, the rules have to be formulated again, so efficiency is low.

Based on the above, the embodiments of the present specification provide a data feature evaluation scheme that slightly modifies an original data sample to generate simulated data samples and thereby quantitatively measures how much a given feature variable influences the scoring result of the data model. For any data sample, the feature variables with a large influence on its score can thus be identified without knowledge of the application scenario, and the process can be automated, which makes it more efficient.

As shown in fig. 1, fig. 1 is a flow chart of a data feature evaluation scheme provided in an embodiment of the present disclosure, including the following steps:

S101, obtaining the score of a data model for a data sample to be evaluated, wherein the data sample to be evaluated comprises a plurality of feature variables and corresponding values.

In a machine learning model, a data sample is typically a vector containing a plurality of feature variables. A data model whose accuracy or precision meets expectations is obtained by training on a certain number of training samples with a preset algorithm. The trained data model can then be used to detect unknown data samples (that is, data samples to be evaluated). Generally, a score is obtained from the values of the feature variables (whether the score is normalized can be decided according to actual requirements), and a determination is then made according to the score.

For example, the Isolation Forest algorithm is used to evaluate transaction data and determine whether it is anomalous. Generally, a piece of transaction data may be regarded as a vector of $m$ feature variables, where each feature variable may be information such as the transaction amount, the transaction frequency, the buyer's number of transactions, or the time interval since the immediately preceding transaction; that is, each feature variable has a corresponding value. In other words, the $i$-th transaction $T_i$ is described by $m$ feature variables $C_1, C_2, \dots, C_m$, namely $T_i = \{C_1, C_2, \dots, C_m\}$. Given a data model trained with the Isolation Forest algorithm and the input $T_i = \{C_1, C_2, \dots, C_m\}$, the data model assigns the transaction a score that indicates its degree of abnormality. The score may be normalized to between 0 and 1, with a higher score indicating a more abnormal transaction. Without further evaluation, however, it may not be known why the data model gives that score.

S103, for any feature variable, replacing the value corresponding to the feature variable in the data sample to be evaluated with a feature value of the feature variable obtained in advance, to generate a simulated data sample.

Unsupervised anomaly detection algorithms generally rest on the following regularities: a) abnormal data makes up a small proportion of all data; b) abnormal data differs from most other data.

Therefore, to evaluate the influence of each feature variable in abnormal data on the scoring result, the basic idea is to keep the values of the other feature variables unchanged and replace the value of one feature variable in the data sample to be evaluated with the feature value of that feature variable. The feature value of a feature variable is generally obtained in advance; it is the most common or most representative value among all data samples and is generally representative of the data as a whole. It can be set empirically or obtained by statistics on the training samples (that is, the data samples used to train the data model).

It is easy to see that if the values of all feature variables of a data sample are representative feature values, the sample is unlikely to be abnormal, and the data model should score it in the normal range (for example, in the Isolation Forest algorithm, the score of the data sample should approach zero).

Thus, the feature value can be used to replace the corresponding value in the data sample to be evaluated to obtain a simulated data sample. For example, given transaction data $T_i = \{C_1, C_2, \dots, C_m\}$ and the $j$-th feature variable $C_j$ in $T_i$ with feature value $C_j'$, the original value of $C_j$ can be replaced with $C_j'$ to generate the simulated data sample $T_{ij} = \{C_1, C_2, \dots, C_j', \dots, C_m\}$. In $T_{ij}$, the value of the $j$-th feature is $C_j'$, and all other feature information remains the same as in the original transaction data. Since $T_i$ has $m$ feature variables, $m$ simulated data samples can be generated. Each simulated data sample differs from the original transaction data $T_i$ in the value of one feature variable, and all other values remain unchanged.

As shown in FIG. 2, which is a schematic diagram of constructing simulated data samples according to an embodiment of the present specification, the original transaction data sample to be evaluated contains four feature variables: the buyer's gender, the transaction amount, the buyer's number of transactions on the same day, and the interval since the buyer's last transaction. Their feature values are 0 (representing female), 75 (the average transaction amount), 1.2 (the average number of transactions per buyer on the same day), and 22 (the average transaction interval per buyer), respectively. The data to be evaluated is $T_i = \{1, 1000, 20, 2\}$, from which four corresponding simulated data samples can be constructed: $T_{i1} = \{0, 1000, 20, 2\}$, $T_{i2} = \{1, 75, 20, 2\}$, $T_{i3} = \{1, 1000, 1.2, 2\}$, $T_{i4} = \{1, 1000, 20, 22\}$.
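The construction in FIG. 2 can be sketched in Python as follows. This is a minimal sketch rather than the implementation of the embodiment; the function name make_simulated_samples is hypothetical and chosen only for illustration, and the numbers are those of the FIG. 2 example.

```python
from typing import List, Sequence


def make_simulated_samples(
    sample: Sequence[float], feature_values: Sequence[float]
) -> List[List[float]]:
    """Return one simulated sample per feature variable.

    The j-th simulated sample equals the original sample except that the
    j-th value is replaced by the feature value of that variable.
    """
    if len(sample) != len(feature_values):
        raise ValueError("sample and feature_values must have the same length")
    simulated = []
    for j, representative in enumerate(feature_values):
        variant = list(sample)       # copy, so the original sample is untouched
        variant[j] = representative  # replace only the j-th value
        simulated.append(variant)
    return simulated


# Values from the FIG. 2 example: gender, amount, same-day count, interval.
t_i = [1, 1000, 20, 2]
feature_values = [0, 75, 1.2, 22]
simulated = make_simulated_samples(t_i, feature_values)
for j, t_ij in enumerate(simulated, start=1):
    print(f"T_i{j} = {t_ij}")
```

Each of the four resulting lists differs from t_i in one position only, which matches the four simulated samples listed above.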

S105, obtaining the scores of the data model for the simulated data samples, and calculating the feature contribution values of the feature variables in the data sample according to the score of the data sample to be evaluated and the scores of the simulated data samples.

After the $m$ simulated data samples are obtained, each of them can be scored with the same data model. Continuing the previous example, denote the data model's score for the original transaction data $T_i$ as $S_i$, and its scores for $T_{i1}, T_{i2}, \dots, T_{im}$ as $S_{i1}, S_{i2}, \dots, S_{im}$ respectively (that is, the score of $T_{ij}$ is $S_{ij}$). Once all scores are obtained, the contribution of each feature variable to the score of the original data can be calculated. Denote the feature contribution value of the $j$-th feature variable of the $i$-th transaction $T_i$ as $V_{ij}$; $V_{ij}$ measures the difference between the data model's score for the simulated data sample and its score for the original data after the value of the $j$-th feature variable in $T_i$ is replaced by that variable's feature value. The way $V_{ij}$ is calculated can be adjusted to actual needs; for example, the absolute contribution value is $V_{ij} = |S_i - S_{ij}|$.

In this way, a quantified feature contribution value $V_{ij}$ can be calculated for each of the $m$ feature variables.
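The scoring loop of S105 can be illustrated with the following minimal Python sketch. It assumes the data model is available as a callable that maps a sample to a score. The function toy_score and the SCALES constants are hypothetical stand-ins invented for this illustration; they are not Isolation Forest and not part of the embodiment.

```python
from typing import Callable, List, Sequence


def absolute_contributions(
    sample: Sequence[float],
    feature_values: Sequence[float],
    score: Callable[[Sequence[float]], float],
) -> List[float]:
    """Return V_ij = |S_i - S_ij| for every feature variable j of one sample."""
    s_i = score(sample)  # score of the original sample
    contributions = []
    for j, representative in enumerate(feature_values):
        simulated = list(sample)
        simulated[j] = representative  # build the simulated sample T_ij
        s_ij = score(simulated)        # score of the simulated sample
        contributions.append(abs(s_i - s_ij))
    return contributions


# Hypothetical stand-in for a trained data model (NOT Isolation Forest):
# each feature adds d / (1 + d), where d is its scaled distance from the
# representative value; the average therefore lies in [0, 1).
CENTERS = [0, 75, 1.2, 22]  # feature values from the FIG. 2 example
SCALES = [1, 100, 5, 10]    # assumed scales, chosen only for illustration


def toy_score(sample: Sequence[float]) -> float:
    total = 0.0
    for x, center, scale in zip(sample, CENTERS, SCALES):
        d = abs(x - center) / scale
        total += d / (1.0 + d)
    return total / len(sample)


v_i = absolute_contributions([1, 1000, 20, 2], CENTERS, toy_score)
largest = v_i.index(max(v_i))  # index of the feature with the largest V_ij
```

With this toy model, the transaction amount (index 1) receives the largest contribution value. A real data model would be substituted for toy_score without changing absolute_contributions.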

S107, evaluating the feature variables of the data sample to be evaluated according to the feature contribution values of the feature variables.

Based on the above, for an $m$-dimensional data sample, it is the deviation of the values of its feature variables from the feature values that causes the data model to give an abnormal score (a higher score in Isolation Forest). The larger the feature contribution value $V_{ij}$, the greater the influence of the $j$-th feature variable in the data sample to be evaluated on the score. Abnormal data is considered abnormal because the values of some feature variables deviate too far from their feature values. Under this scheme, the feature variables with abnormal values can therefore be found effectively, and the feature variables can be evaluated for any unknown data, in particular data to be evaluated that the data model has identified as abnormal.

In this scheme, the feature value of a feature variable is used to reconstruct the data sample to be evaluated and generate a simulated data sample, and the data model is then used to score the simulated data sample. This determines how much the score of the data sample changes when the value of the feature variable changes, and the influence of that value on the score of the data sample to be evaluated can be learned from the change in score; that is, the influence of each feature variable on the score is reflected by a quantifiable feature contribution value. A user of the data model can thus evaluate the size of each feature's contribution. The model user does not need in-depth knowledge of the business problem, the calculation is independent of the scenario in which the data model is used, and the calculation can be fully automated without human intervention, which improves efficiency.

In practical applications, the feature values of the feature variables in S103 may be determined in advance by the user of the data model according to experience, or may be obtained by statistics on the training samples of the data model, specifically as follows:

acquiring all training samples of the data model; determining the value of each training sample under the feature variable; and calculating the feature value of the feature variable from those values.

That is, each feature value should be obtained by statistics on the values of the feature variable in the training samples: because the data model used for scoring is trained on the training samples, the values of the feature variable in the training samples strongly influence the scoring of the data model.

As a specific embodiment, the statistics may be obtained as follows: generating a statistical value from the values of all training samples under the feature variable, and determining the statistical value as the feature value of the feature variable, where the statistical value includes at least one of a median, a mode, or an average. The median is the value at the midpoint of the sorted sequence of values, and the mode is the value that occurs most often. The median, mode, and average are each representative, to some degree, of the values of a feature variable; which one to use can be decided according to actual requirements.

For example, for a discrete feature variable (e.g., gender, education level), the mode may generally be selected as the feature value. For instance, if the $j$-th feature variable $C_j$ of the transaction data $T_i$ takes one of the values 0, 1, or 2, and 1 is the value that occurs most often in the samples, then the feature value $C_j'$ is 1.
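The choice among median, mode, and average can be sketched with Python's standard statistics module. The helper name feature_value and the sample columns below are hypothetical and serve only as an illustration.

```python
import statistics
from typing import Sequence


def feature_value(values: Sequence[float], statistic: str = "median") -> float:
    """Compute a representative value of one feature variable."""
    if statistic == "median":
        return statistics.median(values)
    if statistic == "mode":
        return statistics.mode(values)  # most frequent value
    if statistic == "mean":
        return statistics.mean(values)
    raise ValueError(f"unknown statistic: {statistic}")


# Made-up training columns, for illustration only.
discrete_column = [0, 1, 1, 2, 1, 0]    # a discrete variable with values 0, 1, 2
amount_column = [50, 60, 75, 80, 2000]  # a continuous variable with an outlier

mode_value = feature_value(discrete_column, "mode")
median_value = feature_value(amount_column, "median")
mean_value = feature_value(amount_column, "mean")
```

In this made-up column the single large amount pulls the mean far above the median, which is one reason the statistic is left to be chosen according to actual requirements.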

Further, a representative part of the training samples can be used to stand for all training samples, and the statistics of that part are then used as the feature value. The partial training samples can be selected as follows:

selecting partial training samples from all the training samples according to the values of all the training samples under the feature variable; and generating a statistical value from the values of the partial training samples under the feature variable.

For example, suppose that for a feature variable $C_j$ the values in all training samples lie in the interval [0, 100]. The middle 20% of this interval, namely [40, 60], can be set empirically as the representative interval, and every training sample whose value of $C_j$ falls within this interval is taken as one of the representative partial training samples. The feature value is then determined from the statistics of these partial training samples. Using part of the training samples in place of all of them reduces the amount of calculation needed to determine the feature value and improves calculation efficiency.
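The middle-interval selection can be sketched as follows. This is a minimal sketch under the assumption that the representative interval is centered on the midpoint of the observed range; the function name middle_interval_subset and the sample values are hypothetical.

```python
import statistics
from typing import List, Sequence


def middle_interval_subset(values: Sequence[float], fraction: float = 0.2) -> List[float]:
    """Keep the values that fall in the middle `fraction` of the value range."""
    lo, hi = min(values), max(values)
    center = (lo + hi) / 2
    half_width = (hi - lo) * fraction / 2
    lower, upper = center - half_width, center + half_width
    return [v for v in values if lower <= v <= upper]


# Made-up values of one feature variable; the range is [0, 100], so the
# middle 20% is the interval [40, 60].
values = [0, 12, 41, 45, 50, 55, 58, 59, 73, 100]
subset = middle_interval_subset(values)
feature_value = statistics.median(subset)  # statistic over the subset only
```

Here six of the ten values fall inside [40, 60], and the feature value is computed from those six alone.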

Furthermore, in the above scheme, the partial training samples may also be selected by binning, specifically as follows:

acquiring the minimum and maximum of the values of all training samples under the feature variable, and determining the value interval; dividing the value interval into equal-width bins of a fixed length to generate a plurality of bin intervals; determining the number of values contained in each bin interval; and determining the training samples corresponding to the bin interval containing the most values as the partial training samples.

That is, a feature variable whose values are continuous (for example, the transaction amount) may be discretized. The value interval of the training samples is determined first and divided into several equal-width bin intervals (the bin width can be chosen according to actual needs). The bin interval containing the most values is then selected as the interval that best represents the feature variable, and statistics on the values within it yield the feature value of the feature variable.
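The binning steps can be sketched as follows. This is a minimal sketch: the function name densest_bin_subset, the bin width of 100, and the amounts are assumptions made for illustration, and ties between equally full bins are resolved in favor of the lowest bin, which the source does not specify.

```python
import math
import statistics
from typing import List, Sequence


def densest_bin_subset(values: Sequence[float], bin_width: float) -> List[float]:
    """Split [min, max] into equal-width bins and return the fullest bin."""
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    lo, hi = min(values), max(values)
    n_bins = max(1, math.ceil((hi - lo) / bin_width))
    bins: List[List[float]] = [[] for _ in range(n_bins)]
    for v in values:
        # The maximum value is clamped into the last bin.
        index = min(int((v - lo) // bin_width), n_bins - 1)
        bins[index].append(v)
    # max() returns the first bin among ties, i.e. the lowest interval.
    return max(bins, key=len)


# Made-up transaction amounts, for illustration only.
amounts = [5, 62, 70, 75, 78, 81, 90, 150, 400, 1000]
subset = densest_bin_subset(amounts, bin_width=100)
feature_value = statistics.median(subset)
```

Seven of the ten amounts land in the first bin, so the feature value is taken from that bin and is not affected by the few large amounts.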

In practical applications, the feature contribution value $V_{ij}$ can generally be calculated in one of two ways:

First, determining the absolute value of the difference between the score of the data sample to be evaluated and the score of the simulated data sample, and taking this absolute value as the feature contribution value of the feature variable, i.e. $V_{ij} = |S_i - S_{ij}|$. The $V_{ij}$ obtained in this way may be referred to as the absolute feature contribution value.

Second, taking the quotient of the absolute value of the difference and the score of the data sample to be evaluated as the feature contribution value of the feature variable, i.e. $V_{ij} = |S_i - S_{ij}| / S_i$. The feature contribution value obtained in this way may be referred to as the relative feature contribution value.

In addition, the absolute feature contribution value can be further processed, for example by squaring, multiplying by a scaling factor, or normalizing. The processing can be chosen according to actual needs and does not limit the scheme.
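The two formulas and one possible normalization can be written as small Python functions. This is a sketch only: the scores are made-up numbers, and dividing by the sum of the contributions is just one assumed form of normalization, since the source does not fix a particular one.

```python
from typing import List


def absolute_contribution(s_i: float, s_ij: float) -> float:
    """V_ij = |S_i - S_ij|."""
    return abs(s_i - s_ij)


def relative_contribution(s_i: float, s_ij: float) -> float:
    """V_ij = |S_i - S_ij| / S_i, defined only for a non-zero S_i."""
    if s_i == 0:
        raise ValueError("relative contribution is undefined when S_i is 0")
    return abs(s_i - s_ij) / s_i


def normalize(contributions: List[float]) -> List[float]:
    """Scale contributions so they sum to 1 (an assumed normalization)."""
    total = sum(contributions)
    if total == 0:
        return [0.0 for _ in contributions]
    return [c / total for c in contributions]


# Made-up scores, for illustration only.
s_i = 0.8                     # score of the sample to be evaluated
s_ij = [0.75, 0.4, 0.6, 0.7]  # scores of the four simulated samples

absolute = [absolute_contribution(s_i, s) for s in s_ij]
relative = [relative_contribution(s_i, s) for s in s_ij]
shares = normalize(absolute)
```

The absolute and relative values rank the feature variables of a single sample identically, because they differ only by the constant factor $1 / S_i$. The relative form is mainly useful when contributions are compared across samples with different scores.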

In practical applications, the model user may not want to see the evaluation of every feature variable in a piece of abnormal data and may only want to know which feature variables caused the abnormality. Accordingly, evaluating the feature variables of the data sample to be evaluated according to their feature contribution values in S107 includes:

sorting the feature variables in the data sample to be evaluated according to the size of each feature contribution value to generate a sorting result; and taking a specified number of feature variables from the front of the sorting result and determining them as the key feature variables affecting the score of the data sample to be evaluated.

Specifically, the feature contribution values are sorted from largest to smallest, and the first n feature variables (n can be set freely as needed) are determined to be the key feature variables with the greatest influence on the score of the data sample to be evaluated.
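The sorting step can be sketched as follows. The function name top_key_features and the contribution values are hypothetical and used only for illustration.

```python
from typing import Dict, List


def top_key_features(contributions: Dict[str, float], n: int) -> List[str]:
    """Return the names of the n feature variables with the largest V_ij."""
    ranked = sorted(contributions.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:n]]


# Made-up feature contribution values, for illustration only.
contributions = {"C1": 0.05, "C2": 0.40, "C3": 0.20, "C4": 0.10}
key_features = top_key_features(contributions, n=2)
```

Because Python's sort is stable, feature variables with equal contribution values keep their original order in the result.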

Further, in practical applications, when the data model scores or classifies the data sample to be evaluated, the values of the determined key feature variables can also be obtained, and the corresponding key feature information can be generated and output together with the scoring result of the data model, as follows: for any key feature variable, acquiring the value corresponding to that key feature variable in the data sample to be evaluated; and generating key feature information containing all key feature variables and their corresponding values, so that a user can carry out business processing according to the key feature information.

As shown in FIG. 3, which is a schematic diagram of output key feature information provided in an embodiment of the present specification, suppose the data model determines that a piece of data to be evaluated, $T_i$, is abnormal, and that $T_i$ contains ten feature variables $C_1$ to $C_{10}$. With the above scheme, the three feature variables with the largest feature contribution values for $T_i$ are determined to be $C_1$, $C_2$, and $C_3$, whose values are a, b, and c respectively. When the data model outputs the detection result "abnormal", the key feature information shown in FIG. 3 can therefore be output as well: $\text{info code}(T_i) = \{C_1 = a;\ C_2 = b;\ C_3 = c\}$. The user of the model then knows that the data model classified this particular transaction as abnormal because of the values of these three key feature variables, and can make business decisions with a clearer basis. FIG. 4 is a schematic block diagram of outputting key feature information provided in an embodiment of the present specification; as shown in FIG. 4, the whole process consists of four parts: data input, data reconstruction, calculation of $V_{ij}$, and output of the info code.
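Assembling the info code from the key feature variables can be sketched as follows. The function name build_info_code and the string layout are assumptions modeled on the FIG. 3 example; the source does not prescribe a concrete output format.

```python
from typing import Dict, List


def build_info_code(sample: Dict[str, object], key_features: List[str]) -> str:
    """Format the key feature variables and their values as one string."""
    parts = [f"{name}={sample[name]}" for name in key_features]
    return "{" + "; ".join(parts) + "}"


# Placeholder values a, b, c as in the FIG. 3 example; C4 is not a key feature.
sample = {"C1": "a", "C2": "b", "C3": "c", "C4": "d"}
info_code = build_info_code(sample, ["C1", "C2", "C3"])
print(info_code)
```

The string can then be returned alongside the detection result, so that the model user sees the score and the values behind it together.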

It should be noted that the above scheme is illustrated mainly with abnormal data, but in practical applications it may be used to evaluate the features of any data. The algorithm adopted by the data model is not limited to the Isolation Forest algorithm; any algorithm can be used as long as it detects data on the basis of a quantified score computed from the values of the data features.

In addition, the above scheme changes only one feature variable and leaves the other values unchanged when a simulated data sample is constructed, but the values of a combination of several feature variables can also be changed at the same time while the other values are kept unchanged. The resulting feature contribution value then measures the influence of that combination of feature variables on the data score. In this case, a combination of several feature variables may be regarded as one composite feature variable.
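The composite-variable extension can be sketched by replacing every combination of a given size at once. This is a minimal sketch: the function name combination_contributions, the WEIGHTS constants, and toy_score are hypothetical stand-ins for a real data model, invented for this illustration.

```python
from itertools import combinations
from typing import Callable, Dict, Sequence, Tuple


def combination_contributions(
    sample: Sequence[float],
    feature_values: Sequence[float],
    score: Callable[[Sequence[float]], float],
    size: int = 2,
) -> Dict[Tuple[int, ...], float]:
    """Return |S_i - S_combo| for every combination of `size` feature indices."""
    s_i = score(sample)
    result = {}
    for combo in combinations(range(len(sample)), size):
        simulated = list(sample)
        for j in combo:
            simulated[j] = feature_values[j]  # replace all variables in the combo
        result[combo] = abs(s_i - score(simulated))
    return result


# Hypothetical stand-in model: a feature adds its weight to the score
# whenever its value differs from the representative value.
FEATURE_VALUES = [0, 75, 1.2, 22]
WEIGHTS = [0.1, 0.4, 0.3, 0.2]  # assumed weights, for illustration only


def toy_score(sample: Sequence[float]) -> float:
    total = 0.0
    for x, representative, weight in zip(sample, FEATURE_VALUES, WEIGHTS):
        if x != representative:
            total += weight
    return total


pairs = combination_contributions([1, 1000, 20, 2], FEATURE_VALUES, toy_score)
strongest_pair = max(pairs, key=pairs.get)
```

The number of combinations grows quickly with the combination size, so in practice the size would usually be kept small.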

Based on the same concept, the present invention further provides a data feature evaluation device. As shown in FIG. 5, which is a schematic structural diagram of the data feature evaluation device provided in an embodiment of the present specification, the device includes:

the scoring module 501 is used for obtaining the score of the data model to the data sample to be evaluated, wherein the data sample to be evaluated comprises a plurality of characteristic variables and corresponding values;

a generating module 503, configured to replace, for any feature variable, the value corresponding to the feature variable in the data sample to be evaluated with a feature value of the feature variable obtained in advance, to generate a simulated data sample;

the calculation module 505 acquires the scores of the data model for the simulated data samples, and calculates the feature contribution values of the feature variables in the data sample according to the score of the data sample to be evaluated and the scores of the simulated data samples;

and the evaluation module 507 evaluates the characteristic variables of the data sample to be evaluated according to the characteristic contribution values of the characteristic variables.

Further, the illustrated apparatus further includes a feature value obtaining module 509, configured to obtain a total training sample of the data model; determining respective corresponding values of the whole training samples under the characteristic variables; and calculating and generating the characteristic value of the characteristic variable according to the respective corresponding value of the whole training samples under the characteristic variable.

Further, the feature value obtaining module 509 generates a statistical value according to the values corresponding to the all training samples under the feature variables; and determining the statistical value as a characteristic value of the characteristic variable, wherein the statistical value comprises at least one of a median, a mode or an average.

Further, the feature value obtaining module 509 selects a subset of the training samples from all the training samples according to their values under the feature variable, and generates the statistical value from the values of that subset under the feature variable.

Further, the feature value obtaining module 509 obtains the minimum and maximum of the values of all the training samples under the feature variable and thereby determines a value interval; divides the value interval into equal-width bins of a fixed length to generate a plurality of bin intervals; determines the number of values contained in each bin interval; and determines the training samples falling in the bin interval containing the most values as the subset of training samples.
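The equal-width binning described above might be sketched as follows. This is an illustrative reading rather than a definitive implementation: the function name `densest_bin_values`, the rule that ties go to the lowest bin, and the placement of the maximum value in the last bin are assumptions that the specification does not state.

```python
import math
import statistics
from collections import Counter
from typing import List, Sequence


def densest_bin_values(values: Sequence[float], bin_width: float) -> List[float]:
    """Return the values that fall in the most populated equal-width bin."""
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    lo, hi = min(values), max(values)               # value interval [lo, hi]
    n_bins = max(1, math.ceil((hi - lo) / bin_width))

    def bin_index(v: float) -> int:
        # Clamp so that the maximum value lands in the last bin (assumption).
        return min(int((v - lo) // bin_width), n_bins - 1)

    counts = Counter(bin_index(v) for v in values)  # number of values per bin
    # Most populated bin; ties are resolved toward the lowest bin (assumption).
    best = max(counts, key=lambda i: (counts[i], -i))
    return [v for v in values if bin_index(v) == best]


amounts = [1, 2, 3, 4, 5, 50, 100]
subset = densest_bin_values(amounts, bin_width=10)
binned_feature_value = statistics.median(subset)
```

Computing the statistic only over the densest bin keeps rare extreme values (here 50 and 100) from influencing the feature value.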

Further, the calculation module 505 determines the absolute value of the difference between the score of the data sample to be evaluated and the score of the simulated data sample, and determines either that absolute value, or the quotient of that absolute value and the score of the data sample to be evaluated, as the feature contribution value of the feature variable.
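The two alternatives can be written as |s - s'| and |s - s'| / s, where s is the score of the data sample to be evaluated and s' is the score of the simulated data sample. A minimal Python sketch follows; the function name `feature_contribution`, the `relative` flag, and the explicit check for a zero score are assumptions added for illustration.

```python
def feature_contribution(
    original_score: float, simulated_score: float, relative: bool = False
) -> float:
    """Contribution of one feature variable, from the two scores.

    relative=False: absolute value of the score difference.
    relative=True:  that absolute value divided by the original score.
    """
    difference = abs(original_score - simulated_score)
    if not relative:
        return difference
    if original_score == 0:
        # The quotient is undefined here; how to handle this case is an assumption.
        raise ValueError("relative contribution is undefined for a zero score")
    return difference / original_score


absolute_contribution = feature_contribution(0.9, 0.3)
relative_contribution = feature_contribution(0.9, 0.3, relative=True)
```

The relative form expresses the change as a fraction of the original score, which can make contribution values comparable across samples whose scores differ widely.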

Further, the evaluation module 507 sorts the feature variables in the data sample to be evaluated by the magnitude of their feature contribution values to generate a sorting result, takes a specified number of feature variables from the front of the sorting result, and determines them as the key feature variables affecting the score of the data sample to be evaluated.
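A possible sketch of this sorting and selection step is given below. The function name `key_feature_variables`, the descending sort order, and the example contribution values are assumptions for illustration; the specification only requires that a specified number of feature variables be taken from the front of the sorting result.

```python
from typing import Dict, List


def key_feature_variables(contributions: Dict[str, float], top_n: int) -> List[str]:
    """Return the names of the top_n feature variables by contribution value."""
    # Sort from largest to smallest contribution (assumed ordering).
    ranked = sorted(contributions.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:top_n]]


# Hypothetical contribution values for three feature variables.
contributions = {"amount": 0.6, "frequency": 0.1, "interval": 0.3}
key_names = key_feature_variables(contributions, top_n=2)
```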

Further, the apparatus further comprises an information generating module 511, configured to obtain, for any key feature variable, the value corresponding to that key feature variable in the data sample to be evaluated, and to generate key feature information containing all the key feature variables and their corresponding values, so that a user can carry out business processing according to the key feature information.
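For illustration, the key feature information could be assembled as a list of name and value pairs. The function name `key_feature_info` and the record layout are hypothetical; the specification does not prescribe a format.

```python
from typing import Dict, List, Sequence


def key_feature_info(
    sample: Dict[str, float], key_names: Sequence[str]
) -> List[Dict[str, object]]:
    """Pair each key feature variable with its value in the evaluated sample."""
    return [{"feature": name, "value": sample[name]} for name in key_names]


evaluated_sample = {"amount": 1000.0, "frequency": 2.0, "interval": 5.0}
info = key_feature_info(evaluated_sample, ["amount", "interval"])
```

The order of `key_names` is preserved, so if the names arrive sorted by contribution the resulting information lists the most influential feature variable first.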

Correspondingly, an embodiment of the present application further provides a data feature evaluation apparatus, comprising:

a memory storing a data feature evaluation program;

Based on the same inventive concept, the embodiments of the present application further provide a corresponding nonvolatile computer storage medium, storing computer executable instructions, where the computer executable instructions are configured to:

In this specification, the embodiments are described in a progressive manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the apparatus, device, and medium embodiments are substantially similar to the method embodiments, they are described relatively briefly; for the relevant parts, refer to the description of the method embodiments, which is not repeated here.

The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps or modules recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

In the 1990s, an improvement to a technology could be clearly distinguished as either a hardware improvement (for example, an improvement to a circuit structure such as a diode, transistor, or switch) or a software improvement (an improvement to a method flow). With the development of technology, however, many improvements to method flows today can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be implemented by a hardware entity module. For example, a programmable logic device (PLD), such as a field programmable gate array (FPGA), is an integrated circuit whose logic function is determined by the user's programming of the device. A designer programs a digital system to "integrate" it onto a PLD, without requiring a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually fabricating integrated circuit chips, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must also be written in a specific programming language, called a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing a logic method flow can readily be obtained merely by lightly logic-programming the method flow in one of the hardware description languages above and programming it into an integrated circuit.

The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of the memory. Those skilled in the art will also appreciate that, in addition to implementing the controller purely as computer-readable program code, it is entirely possible to logically program the method steps so that the controller achieves the same functionality in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the means included in it for performing various functions may also be regarded as structures within the hardware component. Alternatively, the means for performing the various functions may be regarded both as software modules implementing the method and as structures within the hardware component.

The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functions of the units may be implemented in the same one or more pieces of software and/or hardware when implementing the embodiments of the present specification.

It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.

It will be apparent to one of ordinary skill in the art that one or more embodiments in the present description may be provided as a method, system, or computer program product. Accordingly, embodiments of the present specification may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Moreover, embodiments of the present description may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

Embodiments of the present description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. Embodiments of the present description may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are connected through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including memory storage devices.

Claims

1. A data feature evaluation method for transaction data analysis, comprising:

obtaining a score of a data model for a data sample to be evaluated, wherein the data sample to be evaluated comprises a plurality of characteristic variables and corresponding values; the data sample to be evaluated comprises transaction data, and the characteristic variable comprises at least one of transaction amount information, transaction frequency information, buyer and seller daily transaction number information and time interval information of a previous transaction; the scores of the data samples to be evaluated are used for representing the abnormal degree of the transaction corresponding to the transaction data;

for any characteristic variable, replacing the value corresponding to the characteristic variable in the data sample to be evaluated with a characteristic value of the characteristic variable obtained in advance, to generate a simulated data sample; wherein the characteristic value of the characteristic variable is determined from a statistical value generated from the values of all training samples of the data model under the characteristic variable, and the statistical value comprises at least one of a median, a mode, or an average;

obtaining the score of the data model for the simulated data sample, and calculating the characteristic contribution value of the characteristic variable in the data sample to be evaluated according to the score of the data sample to be evaluated and the score of the simulated data sample; wherein the score of the simulated data sample represents the degree of abnormality of the transaction corresponding to the simulated data sample, and the characteristic contribution value of the characteristic variable represents the difference between the score of the data model for the simulated data sample and its score for the transaction data;

evaluating the characteristic variable of the data sample to be evaluated according to the characteristic contribution value of the characteristic variable;

wherein evaluating the characteristic variable of the data sample to be evaluated comprises:

evaluating the degree of influence of the characteristic variable of the data sample to be evaluated on the scoring result of the data model.

2. The method of claim 1, wherein the characteristic values of the characteristic variables are obtained in advance by:

acquiring all training samples of the data model;

determining the value of each of the training samples under the characteristic variable;

and calculating the characteristic value of the characteristic variable from the values of all the training samples under the characteristic variable.

3. The method of claim 2, wherein calculating the characteristic value of the characteristic variable from the values of all the training samples under the characteristic variable comprises:

generating statistical values according to the values corresponding to the training samples under the characteristic variables;

and determining the statistical value as a characteristic value of the characteristic variable, wherein the statistical value comprises at least one of a median, a mode or an average.

4. The method of claim 3, wherein generating the statistical value from the values of all the training samples under the characteristic variable comprises:

selecting a subset of training samples from all the training samples according to the values of the training samples under the characteristic variable;

and generating the statistical value from the values of the subset of training samples under the characteristic variable.

5. The method of claim 4, wherein selecting the subset of training samples from all the training samples according to the values of the training samples under the characteristic variable comprises:

acquiring the minimum and maximum of the values of all the training samples under the characteristic variable, and determining a value interval;

dividing the value interval into equal-width bins of a fixed length to generate a plurality of bin intervals;

determining the number of values contained in each bin interval;

and determining the training samples corresponding to the bin interval containing the most values as the subset of training samples.

6. The method of claim 1, wherein calculating the feature contribution value of the feature variable in the data sample according to the score of the data sample to be evaluated and the score of the simulated data sample comprises:

determining an absolute value of a difference between the score of the data sample to be evaluated and the score of the simulated data sample;

and determining the absolute value of the difference to be the characteristic contribution value of the characteristic variable, or determining the quotient of the absolute value of the difference and the score of the data sample to be evaluated to be the characteristic contribution value of the characteristic variable.

7. The method of claim 1, wherein evaluating the feature variable of the data sample to be evaluated according to the magnitude of the feature contribution value of the feature variable, comprises:

sorting the feature variables in the data sample to be evaluated according to the size of each feature contribution value to generate a sorting result;

and taking a specified number of characteristic variables from the forefront of the sorting results, and determining the characteristic variables as key characteristic variables affecting the grading of the data sample to be evaluated.

8. The method of claim 7, further comprising:

for any key characteristic variable, acquiring a value corresponding to the key characteristic variable in a data sample to be evaluated;

generating key feature information containing all key feature variables and corresponding values so that a user can conduct business processing according to the key feature information.

9. A data characteristic evaluation device for transaction data analysis, comprising:

the scoring module is used for obtaining the score of the data model on the data sample to be evaluated, wherein the data sample to be evaluated comprises a plurality of characteristic variables and corresponding values; the data sample to be evaluated comprises transaction data, and the characteristic variable comprises at least one of transaction amount information, transaction frequency information, buyer and seller daily transaction number information and time interval information of a previous transaction; the scores of the data samples to be evaluated are used for representing the abnormal degree of the transaction corresponding to the transaction data;

the generation module is used for, for any characteristic variable, replacing the value corresponding to the characteristic variable in the data sample to be evaluated with a characteristic value of the characteristic variable obtained in advance, to generate a simulated data sample; wherein the characteristic value of the characteristic variable is determined from a statistical value generated from the values of all training samples of the data model under the characteristic variable, and the statistical value comprises at least one of a median, a mode, or an average;

the calculation module is used for obtaining the score of the data model for the simulated data sample, and calculating the characteristic contribution value of the characteristic variable in the data sample to be evaluated according to the score of the data sample to be evaluated and the score of the simulated data sample; wherein the score of the simulated data sample represents the degree of abnormality of the transaction corresponding to the simulated data sample, and the characteristic contribution value of the characteristic variable represents the difference between the score of the data model for the simulated data sample and its score for the transaction data;

the evaluation module is used for evaluating the characteristic variable of the data sample to be evaluated according to the characteristic contribution value of the characteristic variable.

10. The apparatus of claim 9, further comprising a feature value acquisition module configured to: acquire all training samples of the data model; determine the value of each of the training samples under the characteristic variable; and calculate the characteristic value of the characteristic variable from the values of all the training samples under the characteristic variable.

11. The apparatus of claim 10, wherein the feature value acquisition module generates a statistical value from the values of all the training samples under the characteristic variable, and determines the statistical value as the characteristic value of the characteristic variable, wherein the statistical value comprises at least one of a median, a mode, or an average.

12. The apparatus of claim 11, wherein the feature value acquisition module selects a subset of training samples from all the training samples according to the values of the training samples under the characteristic variable, and generates the statistical value from the values of the subset of training samples under the characteristic variable.

13. The apparatus of claim 12, wherein the feature value acquisition module acquires the minimum and maximum of the values of all the training samples under the characteristic variable, and determines a value interval; divides the value interval into equal-width bins of a fixed length to generate a plurality of bin intervals; determines the number of values contained in each bin interval; and determines the training samples corresponding to the bin interval containing the most values as the subset of training samples.

14. The apparatus of claim 9, wherein the calculation module determines the absolute value of the difference between the score of the data sample to be evaluated and the score of the simulated data sample, and determines either that absolute value, or the quotient of that absolute value and the score of the data sample to be evaluated, as the characteristic contribution value of the characteristic variable.

15. The apparatus of claim 9, wherein the evaluation module ranks feature variables in the data sample to be evaluated according to the magnitude of each feature contribution value, and generates a ranking result; and taking a specified number of characteristic variables from the forefront of the sorting results, and determining the characteristic variables as key characteristic variables affecting the grading of the data sample to be evaluated.

16. The apparatus of claim 15, further comprising an information generation module that obtains, for any key feature variable, a value corresponding to the key feature variable in the data sample to be evaluated; generating key feature information containing all key feature variables and corresponding values so that a user can conduct business processing according to the key feature information.

17. A data characteristic evaluation device for transaction data analysis, comprising:

a memory storing a data feature evaluation program;