CN113657510A - Method and device for determining data sample with marked value - Google Patents

Method and device for determining data sample with marked value

Info

Publication number
CN113657510A
Authority
CN
China
Prior art keywords
data sample
value
determining
sample
unlabeled
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110953954.4A
Other languages
Chinese (zh)
Inventor
纪忠光
凌芳觉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202110953954.4A priority Critical patent/CN113657510A/en
Publication of CN113657510A publication Critical patent/CN113657510A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The specification provides a method for determining data samples with labeling value. The method first obtains an unlabeled data sample set and then estimates, from the distribution of the feature vector values of the data samples in the unlabeled data sample set, a probability density function for any feature vector value, so that the sample concentration at the space point where any feature vector value lies can be determined from the probability density function. When determining the labeling value degree of the data samples in the unlabeled data sample set, the obtained probability density function is used to determine the data sample concentration at the space point where the feature vector value of each data sample lies: the smaller the concentration, the fewer the data samples corresponding to that feature vector value compared with other feature vector values, the greater the contribution to balancing the data sample distribution, and therefore the more valuable the sample is for constructing the prediction model.

Description

Method and device for determining data sample with marked value
Technical Field
One or more embodiments of the present disclosure relate to the field of machine learning, and in particular, to a method and an apparatus for determining a data sample with labeled value.
Background
In machine learning, it is generally necessary to manually add labels to data samples to obtain training samples (i.e., manual labeling), and then construct a prediction model using the training samples. Active learning arises in order to reduce the cost of manual labeling.
In active learning, a small number of data samples are first extracted and manually labeled to serve as training samples, and a prediction model is built from this small labeled training set. After the prediction model is built, it is used to predict the unlabeled data samples; a query function then evaluates the uncertainty of the prediction result for each unlabeled data sample and screens out the data samples worth labeling: the higher the uncertainty of the prediction result for an unlabeled data sample, the more valuable it is for constructing the prediction model and thus the more worth labeling. After labels are manually added to the screened unlabeled data samples, these samples are added to the training set, the prediction model is trained again with the expanded training set, and this process repeats until a stopping condition is reached.
Under balanced data samples, the prediction results of the successively constructed prediction models are relatively accurate, and the query function can effectively reduce the cost of manual labeling. In practical applications, however, the data samples may be unbalanced. For example, in the field of financial anti-fraud, most of the obtained transaction samples have no fraud problem and only a small number do; in this case the prediction results of the prediction model may be skewed, and the above query function may fail to effectively reduce the cost of manual labeling.
Disclosure of Invention
In view of the above, one or more embodiments of the present disclosure provide a method for determining a valued data sample.
To achieve the above object, one or more embodiments of the present disclosure provide the following technical solutions:
according to a first aspect of one or more embodiments herein, there is provided a method for determining a valued data sample, comprising:
acquiring an unlabeled data sample set, wherein any unlabeled data sample i includes a feature vector value x_i;
determining a probability density function f(x) for the feature vector value x according to the distribution of {x_i | i = 1, 2, 3, …, n}, where n is the number of samples in the unlabeled data sample set;
for any unlabeled data sample i, determining the feature vector value x_i of the sample and calculating f(x_i) using f(x); determining the labeling value degree of the data sample according to f(x_i), and if the labeling value degree is greater than a preset value, determining that the data sample has labeling value; wherein the smaller f(x_i) is, the higher the labeling value degree of the unlabeled data sample i.
According to a second aspect of one or more embodiments of the present specification, there is provided a prediction model construction method including:
acquiring an unlabeled data sample set and a labeled data sample set; circularly executing the following steps until a preset condition is met:
taking the current labeled data sample as a training sample to construct a prediction model;
determining whether any unmarked data sample in the current unmarked data sample set has a marked value or not by using the method for determining the data sample with the marked value, so as to obtain a data sample set with the marked value;
adding a label to a sample in the marked value data sample set, and moving the marked value data sample set from the unmarked data sample set to the marked data sample set;
and after the cycle execution is finished, taking the current prediction model as a prediction model after the training is finished.
According to a third aspect of one or more embodiments herein, there is provided a method of determining a value-tagged transaction sample, comprising:
acquiring an unlabeled transaction sample set, wherein any unlabeled transaction sample j includes a transaction feature vector value x_j;
determining a probability density function f(x) for the feature vector value x according to the distribution of {x_j | j = 1, 2, 3, …, m}, where m is the number of samples in the unlabeled transaction sample set;
for any unlabeled transaction sample j, determining the feature vector value x_j of the transaction sample and calculating f(x_j) using f(x); determining the labeling value degree of the transaction sample according to f(x_j), and if the labeling value degree is greater than a preset value, determining that the transaction sample has labeling value; wherein the smaller f(x_j) is, the higher the labeling value degree of the unlabeled transaction sample j.
According to a fourth aspect of one or more embodiments herein, there is provided an apparatus for determining a valued data sample, comprising:
the collection acquisition module is used for acquiring an unlabeled data sample set, wherein any unlabeled data sample i includes a feature vector value x_i;
the probability density function determination module is used for determining a probability density function f(x) for the feature vector value x according to the distribution of {x_i | i = 1, 2, 3, …, n}, where n is the number of samples in the unlabeled data sample set;
the labeling value determination module is used for determining, for any unlabeled data sample i, the feature vector value x_i of the sample and calculating f(x_i) using f(x); and for determining the labeling value degree of the data sample according to f(x_i), and if the labeling value degree is greater than a preset value, determining that the data sample has labeling value; wherein the smaller f(x_i) is, the higher the labeling value degree of the unlabeled data sample i.
According to a fifth aspect of one or more embodiments of the present specification, there is provided a prediction model construction apparatus including:
the collection acquisition module is used for acquiring an unlabeled data sample collection and an labeled data sample collection;
the cyclic execution module is used for cyclically executing the following units until a preset condition is met:
the model construction unit is used for constructing a prediction model by taking the currently labeled data sample as a training sample;
the marked value data sample set determining unit determines whether any unmarked data sample in the current unmarked data sample set has marked value by using the marked value data sample determining method to obtain a marked value data sample set;
the label adding unit is used for adding labels to the samples in the marked value data sample set and moving the marked value data sample set from the unmarked data sample set to the marked data sample set;
and the prediction model determining module is used for taking the current prediction model as the trained prediction model after the execution of the circular execution module is finished.
According to a sixth aspect of one or more embodiments herein, there is provided an apparatus for determining a value-bearing transaction sample, comprising:
the collection acquisition module is used for acquiring an unlabeled transaction sample collection; any unlabeled transaction sample j includes a transaction characteristic vector value xj
A probability density function determination module for determining a probability density function based on { x }jDetermining a probability density function f (x) for the feature vector value x, where i is 1, 2, 3 … … m, m is the number of samples of the unlabeled trading sample set;
a marking value degree determining module used for determining the characteristic vector value x of any unmarked transaction sample jjCalculating f (x) using f (x)j) (ii) a According to f (x)j) Determining the marked value degree of the transaction sample, and if the marked value degree is greater than a preset value, determining that the transaction sample has marked value; wherein, f (x)j) The smaller the value of the unmarked transaction sample j.
In one or more embodiments of this specification, an unlabeled data sample set is obtained, and a probability density function for any feature vector value is then estimated from the distribution of the feature vector values of the data samples in the unlabeled data sample set, so that the sample concentration at the space point where any feature vector value lies can be determined from the probability density function. When determining the labeling value degree of the data samples in the unlabeled data sample set, the obtained probability density function is used to determine the data sample concentration at the space point where the feature vector value of each data sample lies: the smaller the concentration, the fewer the data samples corresponding to that feature vector value compared with other feature vector values, the greater the contribution to balancing the data sample distribution, and therefore the more valuable the sample is for constructing the prediction model.
In one or more embodiments of this specification, the probability density function is used to select samples whose feature vector values correspond to few existing samples, so that the distribution of the data samples in the training sample set used for constructing the prediction model tends to be balanced and the construction of the prediction model is accelerated, effectively reducing the labeling cost of the data samples in scenarios where the data samples are unbalanced.
Drawings
Fig. 1 is a basic flow diagram of an active learning method according to an exemplary embodiment.
Fig. 2 is a diagram illustrating the relationship between entropy and prediction result probability shown in this specification.
Fig. 3 is a schematic diagram illustrating inaccuracy of prediction results due to imbalance of data samples in the present specification.
FIG. 4 is a flowchart illustrating a method for determining a worth-tagged data sample according to an exemplary embodiment.
FIG. 5 is a diagram of a set of unlabeled data samples for a method of determining labeled valuable data samples according to an exemplary embodiment.
FIG. 6 is a flow chart of a method for determining valued labeled data samples based on uncertainty and probability density as provided by an exemplary embodiment.
FIG. 7 is a flowchart of a method for determining a valued data sample based on feature dissimilarity and probability density, according to an exemplary embodiment.
FIG. 8 is a flow chart of a method for determining worth labeled data samples based on uncertainty, feature variance, and probability density as provided by an exemplary embodiment.
FIG. 9 is a flowchart illustrating a method for constructing a predictive model according to an exemplary embodiment.
FIG. 10 is a flow chart illustrating a method for determining a sample of trades with annotated value according to an exemplary embodiment.
FIG. 11 is an apparatus diagram of an apparatus for determining worth-tagged data samples according to an exemplary embodiment.
Fig. 12 is an apparatus diagram of a prediction model building apparatus according to an exemplary embodiment.
FIG. 13 is an apparatus diagram of an apparatus for determining a sample of trades with marked value according to an exemplary embodiment.
Fig. 14 is a schematic structural diagram of an apparatus according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with one or more embodiments of the present specification. Rather, they are merely examples of apparatus and methods consistent with certain aspects of one or more embodiments of the specification, as detailed in the claims which follow.
It should be noted that: in other embodiments, the steps of the corresponding methods are not necessarily performed in the order shown and described herein. In some other embodiments, the method may include more or fewer steps than those described herein. Moreover, a single step described in this specification may be broken down into multiple steps for description in other embodiments; multiple steps described in this specification may be combined into a single step in other embodiments.
In many application scenarios, such as financial anti-fraud, image recognition, and speech recognition, a large number of data samples are obtained from currently or historically acquired data, and the acquired data samples are then labeled for model construction and learning. Labeling data samples at scale consumes considerable labor cost; to reduce this cost, active learning was proposed, i.e., obtaining a prediction model with the expected prediction effect while labeling as few data samples as possible.
Fig. 1 shows the general flow of active learning. A small number of samples are first extracted from the unlabeled data sample set and labeled, yielding a labeled data sample set; then:
1. Take the current labeled data sample set as training samples and construct a prediction model.
2. Predict the data samples in the current unlabeled data sample set with the prediction model obtained in step 1, obtaining the prediction result of each data sample in the unlabeled data sample set.
3. Evaluate the prediction results obtained in step 2 with a query function, obtaining a set of data samples with labeling value.
4. Label the data samples in the set obtained in step 3.
5. Move the set of valuable samples from the unlabeled data sample set to the labeled sample set.
6. Repeat the above steps until the stop condition is met.
The query function in step 3 generally selects the data sample having the most gain for constructing the prediction model, i.e., the data sample having the most marked value, from the unlabeled data sample set according to the prediction result in step 2, thereby reducing the cost of labeling the data sample.
In the related art, when the query function selects the most valuable data samples from the unlabeled data sample set, it is generally based on an uncertainty criterion: for any unlabeled data sample, the more uncertain the prediction result given by the prediction model, the more valuable the sample is for constructing the prediction model. Under balanced data samples, the training samples used to construct the prediction model are balanced, so the prediction results of the prediction model are relatively reliable; the set of valuable data samples selected by the uncertainty-based query function is then a good reference, and the labeling cost of the data samples can be effectively reduced.
However, in practical applications there are many scenarios in which the acquired data samples are unbalanced. For example, in the field of financial anti-fraud, most transaction data is normal and secure and only a small part involves fraud, so the transaction samples obtained from current or historical transaction data are mostly positive transaction samples (i.e., transaction samples without a fraud problem) and only a few are negative transaction samples (i.e., transaction samples with a fraud problem).
In such sample-imbalanced application scenarios, the unbalanced distribution of the data samples greatly reduces the reliability of the prediction model first constructed from a small number of labeled data samples, and the labeling value of the sample set selected by the uncertainty-based query function is likewise greatly reduced. Although a prediction model with the expected effect may be obtained after many iterations, this deviates from the purpose of active learning and does not effectively reduce the labeling cost of the data samples.
There are many query functions based on the uncertainty criterion, such as least confidence, margin sampling, and entropy. Taking a binary classification model with an entropy-based query function as an example, let x denote the feature vector value of any data sample, P(x) the prediction probability of the model, and E(x) the entropy. Fig. 2 shows the relationship between P(x) and E(x), where p in the figure denotes P(x). When P(x) equals 0.5, the model predicts that the data sample belongs to one class with probability 0.5 and, correspondingly, to the other class with probability 0.5; the model is then maximally uncertain about the data sample, and the entropy is largest.
Now assume the sample distribution is unbalanced, as shown in fig. 3: the left peak (the higher peak in the figure) is the true distribution of the number of data samples of the positive class, and the right peak (the lower peak) is the true distribution of the number of data samples of the negative class. Because of the unbalanced distribution, the samples for which the model's prediction probability is 0.5 lie on the side of the positive class, which has more samples, whereas the data samples that are truly hard to distinguish between positive and negative are those for which the current model's prediction probability is about 0.6 (the point in the figure where the positive and negative sample distributions overlap most). That is, the samples the model finds hard to separate are those with a prediction probability of about 0.6, so the data samples selected based on the uncertainty criterion are suboptimal.
Based on this, in order to make the distribution of the training samples used for constructing the model tend to be balanced, an unlabeled data sample set is obtained first, and a probability density function for any feature vector value is then estimated from the distribution of the feature vector values of the data samples in the unlabeled data sample set, so that the sample concentration at the space point where any feature vector value lies can be determined from the probability density function. When determining the labeling value degree of the data samples in the unlabeled data sample set, the obtained probability density function is used to determine the data sample concentration at the space point where the feature vector value of each data sample lies: the smaller the concentration, the fewer the data samples corresponding to that feature vector value compared with other feature vector values, the greater the contribution to balancing the data sample distribution, and therefore the more valuable the sample is for constructing the prediction model.
In one or more embodiments of this specification, the probability density function is used to select samples whose feature vector values correspond to few existing samples, so that the distribution of the data samples in the training sample set used for constructing the prediction model tends to be balanced and the construction of the prediction model is accelerated, effectively reducing the labeling cost of the data samples in scenarios where the data samples are unbalanced.
The method, apparatus, and the like provided in the present specification will be described in detail below.
This specification provides a method for determining data samples with labeling value, which can be applied to reducing the labeling cost of data samples through active learning in any application scenario where the data samples are unbalanced; a model construction method that uses this determination method when determining the set of valuable data samples; and a method for determining transaction samples with labeling value in a financial anti-fraud scenario, which is a practical application of the determination method in the financial anti-fraud field.
For convenience, in the following description U denotes the unlabeled data sample set, L denotes the labeled data sample set, and V denotes the set of valuable data samples selected from U.
As shown in fig. 4, a flow chart of a method for determining a data sample with a labeling value shown in this specification is schematically illustrated, and the method includes the following steps:
Step 102, acquiring an unlabeled data sample set.
Any unlabeled data sample i includes a feature vector value x_i. As shown in fig. 5, the unlabeled data sample set includes n data samples: data sample 1 includes feature vector value x_1, data sample 2 includes feature vector value x_2, and so on.
Step 104, determining a probability density function f(x) for the feature vector value x according to the distribution of {x_i | i = 1, 2, 3, …, n}, where n is the number of samples in the unlabeled data sample set.
The probability density function for any feature vector value x is estimated using the feature vector values {x_i | i = 1, 2, 3, …, n} of the current unlabeled data sample set. Either a parametric or a non-parametric estimation method may be used; which one depends on the actual application scenario. If the feature vector values in U are observed to conform to a particular distribution, a parametric method is used: for example, if the feature vector values in U are observed to roughly conform to a normal distribution, the parameters of that normal distribution are determined from the feature vector values and the probability density function is thus estimated. If it is difficult to determine by observation what distribution the feature vector values in U follow, a non-parametric method, such as kernel density estimation, may be used.
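For illustration only, a minimal Python sketch of the parametric route follows, under the assumption that the feature vector values in U roughly conform to a multivariate normal distribution; the placeholder data and variable names are hypothetical and not part of this specification.

```python
import numpy as np
from scipy.stats import multivariate_normal

# X: n x d array; each row is the feature vector value x_i of one sample in U.
X = np.random.randn(1000, 2)  # random placeholder data, for illustration only

# Parameter estimation: fit the mean vector and covariance matrix of the
# assumed normal distribution from the feature vector values in U.
mu = X.mean(axis=0)
sigma = np.cov(X, rowvar=False)

f = multivariate_normal(mean=mu, cov=sigma)
f_x = f.pdf(X)  # f(x_i) for every unlabeled data sample i
```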
Taking the non-parametric kernel density estimation method as an example, assume the set of feature vector values of U is X = {x_i | i = 1, 2, 3, …, n} and that each feature vector is d-dimensional; then the kernel density estimate at any feature vector value x is:
f(x) = (1 / (n·h^d)) · Σ_{i=1}^{n} K((x - x_i) / h)
where f(x) is the density function of a d-dimensional random variable, h is the bandwidth, and K(·) is a kernel function defined over the d-dimensional space, i.e. K: R^d → R+, with K(x) ≥ 0 and ∫ K(x) dx = 1.
This can be rewritten in the general form:
f(x) = (1 / n) · Σ_{i=1}^{n} K_H(x - x_i), where K_H(x) = |H|^(-1/2) · K(H^(-1/2) · x)
where n is the number of data samples participating in the probability density estimation (i.e., the number of samples in U), H is a d × d symmetric positive-definite bandwidth matrix, which plays the role of the window size in the estimation and is related to the dimension of the feature vector values of the data samples, and K(·) is a kernel function, which may be a Gaussian kernel, a polynomial kernel, or the like.
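A minimal non-parametric sketch in the same spirit is given below, using SciPy's Gaussian kernel density estimator; its scalar bandwidth (Scott's rule by default) stands in for the bandwidth matrix H above, and the placeholder data is hypothetical.

```python
import numpy as np
from scipy.stats import gaussian_kde

# X: n x d array of feature vector values from the unlabeled set U
X = np.random.randn(1000, 2)  # random placeholder data, for illustration only

# gaussian_kde expects data of shape (d, n); the kernel is Gaussian and the
# bandwidth is selected automatically (Scott's rule).
kde = gaussian_kde(X.T)

f_x = kde(X.T)  # f(x_i) for every unlabeled data sample i
```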
Step 106, for any unlabeled data sample i, determining the feature vector value x_i of the sample and calculating f(x_i) using f(x); determining the labeling value degree of the data sample according to f(x_i), and if the labeling value degree is greater than a preset value, determining that the data sample has labeling value.
For any data sample i in U, f(x_i) can be calculated. It represents the sample concentration at the space point where the feature vector value x_i of the data sample lies, and thus reflects the number of data samples in U corresponding to that feature vector value. The larger f(x_i) is, the greater the concentration of data samples at that feature vector value and the more data samples in U correspond to it, so the labeling value of data sample i is not very high (because there may already be many labeled data samples corresponding to that feature vector value); the smaller f(x_i) is, the smaller the concentration of data samples at that feature vector value and the fewer data samples in U correspond to it, so the labeling value of data sample i is relatively higher (because there may be few labeled data samples corresponding to that feature vector value).
For any data sample i in U, the labeling value degree of the data sample can be obtained from f(x_i); for example, the labeling value degree may be set as y = 1 - f(x_i). Then, for any data sample, the larger y is, the higher the labeling value degree of the data sample.
The preset value used when judging whether the labeling value degree is greater than the preset value is determined by the actual situation. Taking the labeling value degree y = 1 - f(x_i) as an example, if the f(x_i) values obtained from the feature vector values in U are such that the value range of y is (0.6, 1), then setting the preset value to 0.6 is obviously unreasonable, since every sample would exceed it; a reasonable preset value needs to be set with the value range of the probability density function, among other things, as a reference.
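As a sketch of this selection step, the snippet below computes y = 1 - f(x_i) and keeps the samples above a preset value chosen relative to the observed range of y, as discussed above; the percentile-based choice is merely one illustrative possibility, not prescribed by this specification.

```python
import numpy as np

# f_x: array of f(x_i) values for the samples in U (e.g. from the sketch above)
y = 1.0 - f_x  # labeling value degree; note a density can exceed 1, so y may
               # need rescaling to a convenient range in practice

# Set the preset value with reference to the observed range of y; here the
# 10% of samples with the highest labeling value degree are kept.
preset = np.percentile(y, 90)
V_idx = np.where(y > preset)[0]  # indices of samples deemed worth labeling
```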
It should be noted that the probability density estimation function of U is obtained in this specification so that the probability density function reflects the distribution of the data samples over the feature vector values, i.e., the sample concentration at each feature vector value. Samples corresponding to feature vector values with few samples can then be selected, so that the numbers of data samples corresponding to the various feature vector values tend to become balanced (i.e., comparable).
Therefore, based on this idea, in one or more embodiments the probability density estimation may even be dispensed with: only the distribution of the feature vector values in U is obtained, and the feature vector values with few data samples are selected from that distribution, so that the numbers of data samples corresponding to the feature vector values in the L used for constructing the model tend to be balanced, improving the efficiency and accuracy of model construction.
In one or more embodiments of this specification, after determining whether each data sample in U has labeling value, the data samples with labeling value are collected into a set of valuable data samples, as shown in fig. 5, which illustrates extracting V from U based on the above method.
In practical applications, when determining whether each data sample in U has a labeling value, not only the number distribution of each feature vector value in U (i.e., the balance of the data samples) but also the uncertainty of the prediction result of each data sample in U or the difference between each data sample in U and each data sample in L may be considered.
Thus, in one or more embodiments of this specification, in step 106, determining the labeling value degree of the data sample according to f(x_i) may further be:
acquiring a constructed prediction model, wherein the prediction model is obtained by training with the labeled data sample set;
for any unlabeled data sample i, inputting x_i into the prediction model, and determining the uncertainty of the prediction model's prediction result for the data sample according to the output of the prediction model; and determining the labeling value degree of the data sample according to f(x_i) and the uncertainty of the data sample.
As shown in fig. 6, a prediction model constructed from L is first obtained; then, for any data sample i in U, the prediction model is used to predict the data sample, yielding its prediction result; the uncertainty of data sample i is obtained from the prediction result, f(x_i) of data sample i is obtained using the probability density function f(x), and the labeling value degree of data sample i is then obtained from its uncertainty together with f(x_i).
There are many methods for obtaining the uncertainty of a data sample based on the prediction result, such as those based on confidence, on committees, on posterior probability, and on entropy, as mentioned above. Taking the entropy-based uncertainty determination method as an example, the entropy of the prediction model for any single class of prediction result is:
I(x) = -p(x)·log(p(x))
Thus, the entropy of the prediction result of the prediction model for any data sample is:
E(x) = -Σ_c p_c(x)·log(p_c(x))
where x represents the feature vector value of the data sample, c indexes the prediction classes of the prediction model, and
Σ_c p_c(x) = 1.
Taking binary classification as an example, p(x)_1 + p(x)_2 = 1, where subscript 1 denotes one class and subscript 2 the other; then E(x) = -[p(x)·log(p(x)) + (1 - p(x))·log(1 - p(x))], where p(x) denotes the prediction probability of either class in the binary classification.
When determining the labeling value degree of any data sample in U, a calculation of the form q(x) = α·E(x) + (1 - α)·(1 - f(x)) may be used, where α is adjusted according to actual needs and takes values in (0, 1); the larger α is, the more the uncertainty of the data sample is weighted when determining the labeling value of any data sample in U, and the smaller α is, the more the distribution balance of the data samples is weighted.
Here E(x) and f(x) are assumed to be of the same order of magnitude, so no order-of-magnitude adjustment is performed. If E(x) and f(x) are not of the same order of magnitude (for example, E(x) ranging over (50, 100) while f(x) ranges over (0, 1)), then in order to correct the weights of E(x) and f(x), they may be mapped to the same order of magnitude, so that the contribution of each criterion to determining whether a data sample has labeling value is measured by α.
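A minimal sketch of this combined score follows, assuming a binary prediction model and assuming E(x) and f(x) are already on a comparable order of magnitude; the function names are illustrative, not part of this specification.

```python
import numpy as np

def entropy(p):
    # E(x) = -[p*log(p) + (1-p)*log(1-p)] for a binary prediction model
    p = np.clip(p, 1e-12, 1.0 - 1e-12)  # guard against log(0)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

def labeling_value_degree(p, f_x, alpha=0.5):
    # q(x) = alpha*E(x) + (1-alpha)*(1-f(x)); alpha in (0, 1)
    return alpha * entropy(p) + (1.0 - alpha) * (1.0 - f_x)

# p:   prediction probabilities of the model for the samples in U
# f_x: density values f(x_i) from the probability density function
# q = labeling_value_degree(p, f_x, alpha=0.7)  # larger alpha weighs uncertainty more
```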
In addition, in practical applications the query function may also be based on a diversity criterion, that is, the data samples with the least information redundancy (the least feature overlap) relative to the data samples in L are selected from U. Thus, in step 106, determining the labeling value degree of the data sample according to f(x_i) may further be:
acquiring a labeled data sample set;
for any unlabeled data sample i, obtaining the feature difference degree between the data sample and the labeled data sample set according to x_i and the feature vector values of the labeled data samples in the labeled data sample set; and determining the labeling value degree of the data sample according to f(x_i) and the feature difference degree of the data sample.
As shown in fig. 7, L is obtained first; then, for any data sample i in U, the feature similarity between data sample i and each data sample in L is compared to obtain the feature difference degree between data sample i and L, and the labeling value degree of the data sample is obtained from that feature difference degree together with f(x_i). When calculating the feature similarity, a clustering algorithm may be used to perform cluster analysis on the feature vector value of data sample i together with the feature vector values of the data samples in L, yielding the feature similarity or feature difference degree between data sample i and each data sample in L.
Assuming the feature difference degree is denoted d(x), the labeling value degree can be determined in the same manner as with the uncertainty criterion above, e.g. q(x) = β·d(x) + (1 - β)·(1 - f(x)), where β takes values in (0, 1). If d(x) and f(x) are not of the same order of magnitude (for example, d(x) ranging over (50, 100) while f(x) ranges over (0, 1)), then in order to correct their weights, d(x) and f(x) may be mapped to the same order of magnitude, so that the contribution of each criterion to determining whether a data sample has labeling value is measured by β.
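The sketch below illustrates one possible realization, with the minimum Euclidean distance to the labeled set standing in for the feature difference degree d(x); this distance-based choice (rather than the cluster analysis mentioned above) and the function names are assumptions of the illustration.

```python
import numpy as np

def feature_difference(X_u, X_l):
    # d(x): minimum Euclidean distance from each unlabeled sample (rows of
    # X_u) to any labeled sample (rows of X_l)
    dists = np.linalg.norm(X_u[:, None, :] - X_l[None, :, :], axis=2)
    return dists.min(axis=1)  # shape: (number of samples in U,)

def labeling_value_degree_div(d_x, f_x, beta=0.5):
    # q(x) = beta*d(x) + (1-beta)*(1-f(x)); map d(x) to the same order of
    # magnitude as f(x) first if necessary
    return beta * d_x + (1.0 - beta) * (1.0 - f_x)
```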
As shown in fig. 8, when determining the labeling value of any data sample i in U, the uncertainty of the prediction result for data sample i, the feature difference degree between data sample i and the data samples in L, and the sample concentration f(x_i) of data sample i can all be considered at the same time; for the specific implementation, refer to the descriptions of fig. 6 and fig. 7 above, which are not repeated here.
It can be seen that the manner shown in this specification of measuring the labeling value based on the balance of the data samples can be combined with manners of measuring the labeling value based on other aspects. It is not limited to the above uncertainty-based and diversity-based criteria and is adjusted according to the actual application scenario (i.e., q(x) = (1 - f(x)) + Δ, where Δ represents the other influencing factors). For example, if in a given application scenario the distribution balance and the diversity of the data samples have a great influence on the construction of the model, the manner of determining the labeling value of any data sample in U shown in fig. 7 may be selected.
The above describes the method for determining data samples with labeling value, which is applicable to any application scenario where the data samples are unevenly distributed. Based on it, this specification provides a model construction method, i.e., an active learning method, which uses the above determination method when determining V; it is described in detail below.
The present specification provides a model construction method, including:
acquiring an unlabeled data sample set and an labeled data sample set; circularly executing the following steps until a preset condition is met:
step 1, constructing a prediction model by taking a current labeled data sample as a training sample;
step 2, determining whether any unlabeled data sample in the current unlabeled data sample set has labeling value by using the above method for determining data samples with labeling value, to obtain a set of data samples with labeling value;
step 3, adding labels to the samples in the marked value data sample set, and moving the marked value data sample set from the unmarked data sample set to the marked data sample set;
and after the cycle execution is finished, taking the current prediction model as a prediction model after the training is finished.
The preset condition is adjusted according to actual needs. For example, if the prediction accuracy actually required of the prediction model is above ninety percent, the loop stop condition can be set as determining whether the prediction accuracy of the prediction model reaches ninety percent. If, when the distribution of the data samples in the L used for constructing the model tends to be balanced, the prediction results of the prediction model constructed from L are considered relatively accurate, then the preset condition can be set as determining whether the distribution of the feature vector values of the data samples in the L used for constructing the model reaches a certain degree of balance.
The timing for determining whether the preset condition is met depends on the preset condition and the actual situation, and the determination may be performed once after the execution of each of the steps 1, 2, and 3 is completed, or may be performed after one of the steps. For example, if the preset condition is that the prediction accuracy of the prediction model meets the requirement, after the prediction model is built (after step 1 is completed) each time, the prediction accuracy of the prediction model is tested to determine whether the preset condition is met, and if the preset condition is that the cycle execution frequency reaches the preset frequency, it may be determined whether the cycle execution frequency meets the preset condition after each cycle is completed.
In one or more embodiments of the present disclosure, to ensure that the tag added to V is accurate (or reliable), the selected V is generally manually tagged, and certainly, on the premise that the tag added to V is accurate, the tag may also be added to V in other manners, which is not limited in this description.
As shown in fig. 9, which is a schematic flowchart of one execution of the decision logic of this specification, namely judging whether the preset condition is met after each cycle is completed, the flow includes the following steps:
step 202, obtaining an unlabeled data sample set U and a labeled data sample set L.
Step 204, constructing a prediction model by taking the currently labeled data samples as training samples.
Step 206, determining whether any unmarked data sample in the current unmarked data sample set has a marked value by using any one of the methods for determining the data sample with the marked value, so as to obtain a data sample set with the marked value.
And step 208, adding labels to the samples in the marked value data sample set, and moving the marked value data sample set from the unmarked data sample set to the marked data sample set.
And step 210, judging whether a preset condition is met. If yes, directly jumping to step 212 to finish, otherwise, jumping to step 204, and executing steps 204, 206, 208 and 210 again.
And step 212, taking the current prediction model as a trained prediction model.
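To make the loop concrete, a minimal Python sketch under stated assumptions follows: the model choice (logistic regression), the fixed round count as the preset condition, and label_fn (standing in for the manual labeling of step 208) are all assumptions of the illustration, not prescribed by this specification.

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.linear_model import LogisticRegression

def build_model_actively(X_l, y_l, X_u, label_fn, n_rounds=10, quota=50):
    model = LogisticRegression()
    for _ in range(n_rounds):                # preset condition: fixed rounds
        model.fit(X_l, y_l)                  # step 204: build the model from L
        if len(X_u) <= quota:
            break
        f_x = gaussian_kde(X_u.T)(X_u.T)     # step 206: f(x_i) over current U
        picked = np.argsort(1.0 - f_x)[-quota:]  # most valuable samples V
        X_l = np.vstack([X_l, X_u[picked]])  # step 208: move V from U to L
        y_l = np.concatenate([y_l, label_fn(X_u[picked])])
        X_u = np.delete(X_u, picked, axis=0)
    return model                             # step 212: the trained model
```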
The above model construction method is based on the active learning method, and its basic flow is similar to that shown in fig. 1; in the query-function part, however, instead of using the uncertainty criterion or the diversity criterion alone, it is based on sample balance. That is, when screening V from U, any of the above methods for determining data samples with labeling value is used. Thus, when constructing a prediction model in a scenario where the data samples are unevenly distributed, V may be selected from U based on the balance of the data samples alone (i.e., determining the labeling value degree based on f(x) as described above), based on both the balance and the uncertainty of the data samples (i.e., based on f(x) and E(x) as described above), or based on both the balance and the diversity of the data samples (i.e., based on f(x) and d(x) as described above).
It should be noted that in different application scenarios, V selected from U in different manners yields different gains for the construction of the prediction model. The manner shown in this specification of selecting V from U based on data sample balance is mainly aimed at scenarios with a high requirement on data sample balance: the more the prediction model depends on the balance of the training samples, the greater the gain that V selected in this manner brings to the construction of the prediction model.
The above describes the prediction model construction method. Based on the above method for determining data samples with labeling value, this specification provides a concrete implementation applied to the financial anti-fraud field. In this field, most of the obtained transaction samples are transaction samples without fraud problems or risks and only a few are transaction samples with fraud problems or risks, so when a transaction risk prediction model is constructed from the obtained transaction samples, the prediction results of the prediction model are skewed, as shown in fig. 3. The above method for determining data samples with labeling value can therefore be used to make the training samples for constructing the transaction risk model tend to be balanced. As shown in fig. 10, the flow of the method for determining transaction samples with labeling value includes the following steps:
Step 302, acquiring an unlabeled transaction sample set.
Any unlabeled transaction sample j includes a transaction feature vector value x_j.
Step 304, determining a probability density function f(x) for the feature vector value x according to the distribution of {x_j | j = 1, 2, 3, …, m}, where m is the number of samples in the unlabeled transaction sample set.
Step 306, for any unlabeled transaction sample j, determining the feature vector value x_j of the transaction sample and calculating f(x_j) using f(x); determining the labeling value degree of the transaction sample according to f(x_j), and if the labeling value degree is greater than a preset value, determining that the transaction sample has labeling value.
The smaller f(x_j) is, the higher the labeling value degree of the unlabeled transaction sample j.
Of course, for the above method for determining valuable data samples applied to the financial anti-fraud field, the feature vector value corresponds to the transaction feature vector value of this field and the data sample corresponds to the transaction sample of this field; for related points, refer to the above description of the method for determining valuable data samples, which is not detailed again here.
The above is a detailed description of the method provided in the present specification, and the present specification also provides a device, an apparatus, and a computer-readable storage medium corresponding to the above method, and the device, the apparatus, and the computer-readable storage medium are described in detail below.
The present specification also provides a device for determining a valuable data sample, as shown in fig. 11, including:
a set obtaining module 1102, configured to obtain an unlabeled data sample set; any unlabeled data sample i includes a feature vector value xi
Probability density function determination module 1104, based on { x }iDetermining a probability density function f (x) for the feature vector value x, where i is 1, 2, 3 … … n, n is the distribution of the number of samples of the unlabeled set of data samples };
a labeling value determining module 1106, configured to determine, for any unlabeled data sample i, a feature vector value x of the sampleiCalculating f (x) using f (x)i) (ii) a According to f (x)i) Determining the labeling value degree of the data sample, and if the labeling value degree is greater than a preset value, determining that the data sample has the labeling value; wherein, f (x)i) The smaller the annotated value of the unlabeled data sample i.
The annotation value determination module 1106 may be further configured to:
acquiring a constructed prediction model, wherein the prediction model is obtained by utilizing a labeled data sample set for training;
for any unlabeled data sample i, inputting x_i into the prediction model, and determining the uncertainty of the prediction model's prediction result for the data sample according to the output of the prediction model; and determining the labeling value degree of the data sample according to f(x_i) and the uncertainty of the data sample.
Or for:
acquiring a marked data sample set;
for any unlabeled data sample i, obtaining the feature difference degree between the data sample and the labeled data sample set according to x_i and the feature vector values of the labeled data samples in the labeled data sample set; and determining the labeling value degree of the data sample according to f(x_i) and the feature difference degree of the data sample.
The probability density function determination module 1104 may also be configured to:
determining, according to the distribution of {x_i | i = 1, 2, 3, …, n}, where n is the number of samples in the unlabeled data sample set, a probability density function f(x) for the feature vector value x using a kernel density estimation method.
The present specification also provides a prediction model construction apparatus, as shown in fig. 12, including:
a set obtaining module 1202, configured to obtain an unlabeled data sample set and a labeled data sample set;
a loop execution module 1204, configured to loop the following units until a preset condition is met:
a model building unit 1214, configured to build a prediction model by using the currently labeled data sample as a training sample;
a valuable data sample set determining unit 1224, configured to determine whether any unlabeled data sample in the current unlabeled data sample set has labeling value by the method according to any of claims 1 to 4, to obtain a set of data samples with labeling value;
a label adding unit 1234, configured to add a label to a sample in the marked value data sample set, and move the marked value data sample set from the unmarked data sample set to the marked data sample set;
and the prediction model determining module 1206 is configured to take the current prediction model as the trained prediction model after the execution of the loop execution module is completed.
The present specification also provides a device for determining a transaction sample with a marked value, as shown in fig. 13, comprising:
a set obtaining module 1302, configured to obtain an unlabeled transaction sample set, wherein any unlabeled transaction sample j includes a transaction feature vector value x_j;
a probability density function determination module 1304, configured to determine a probability density function f(x) for the feature vector value x according to the distribution of {x_j | j = 1, 2, 3, …, m}, where m is the number of samples in the unlabeled transaction sample set;
a labeling value degree determination module 1306, configured to determine, for any unlabeled transaction sample j, the feature vector value x_j of the transaction sample and calculate f(x_j) using f(x); and to determine the labeling value degree of the transaction sample according to f(x_j), and if the labeling value degree is greater than a preset value, determine that the transaction sample has labeling value; wherein the smaller f(x_j) is, the higher the labeling value degree of the unlabeled transaction sample j.
The apparatuses, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or implemented by a product with certain functions. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email messaging device, game console, tablet computer, wearable device, or a combination of any of these devices.
This specification also provides an electronic device, including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor implements the method as described in any above by executing the executable instructions.
In a typical configuration, a computer includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
FIG. 14 is a schematic block diagram of an electronic device in accordance with an exemplary embodiment. Referring to FIG. 14, at the hardware level, the device includes a processor 1402, an internal bus 1404, a network interface 1406, a memory 1408, and a non-volatile storage 1410, although other hardware required for service may be included. One or more embodiments of the present description can be implemented in software, such as by processor 1402 reading corresponding computer programs from non-volatile storage 1410 into memory 1408 and then running. Of course, besides software implementation, the one or more embodiments in this specification do not exclude other implementations, such as logic devices or combinations of software and hardware, and so on, that is, the execution subject of the following processing flow is not limited to each logic unit, and may also be hardware or logic devices.
Any of the above devices can be applied to the electronic device shown in fig. 14.
The present specification also provides a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, carry out the steps of the method as any one of the above.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage, quantum memory, graphene-based storage media or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The terminology used in the description of the one or more embodiments is for the purpose of describing particular embodiments only and is not intended to limit this specification. As used in one or more embodiments of the present specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in one or more embodiments of the present specification to describe various information, such information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of one or more embodiments herein. The word "if" as used herein may, depending on the context, be interpreted as "upon," "when," or "in response to determining."
The above description covers only preferred embodiments of the one or more embodiments of the present specification and is not intended to limit their scope; any modification, equivalent substitution, or improvement made within the spirit and principles of the one or more embodiments of the present specification shall fall within their scope of protection.

Claims (11)

1. A method of determining a data sample with labeling value, comprising:
acquiring an unlabeled data sample set, wherein any unlabeled data sample i includes a feature vector value x_i;
determining a probability density function f(x) for a feature vector value x according to the distribution of {x_i}, i = 1, 2, 3, …, n, where n is the number of samples in the unlabeled data sample set;
for any unlabeled data sample i, determining the feature vector value x_i of the sample and calculating f(x_i) using f(x); determining the labeling value degree of the data sample according to f(x_i), and if the labeling value degree is greater than a preset value, determining that the data sample has labeling value; wherein the smaller f(x_i) is, the greater the labeling value degree of the unlabeled data sample i.
2. The method of claim 1, wherein determining the labeling value degree of the data sample according to f(x_i) comprises:
acquiring a constructed prediction model, wherein the prediction model is obtained by training with a labeled data sample set;
for any unlabeled data sample i, inputting x_i into the prediction model, and determining the uncertainty of the prediction result of the prediction model for the data sample according to the output of the prediction model; and determining the labeling value degree of the data sample according to f(x_i) and the uncertainty of the data sample.
3. The method of claim 1, wherein determining the labeling value degree of the data sample according to f(x_i) comprises:
acquiring a labeled data sample set;
for any unlabeled data sample i, obtaining the feature difference degree between the data sample and the labeled data sample set according to x_i and the feature vector value of each labeled data sample in the labeled data sample set; and determining the labeling value degree of the data sample according to f(x_i) and the feature difference degree of the data sample.
4. The method of claim 1, wherein determining the probability density function f(x) for the feature vector value x comprises:
determining the probability density function f(x) for the feature vector value x using a kernel density function.
5. A predictive model construction method, comprising:
acquiring an unlabeled data sample set and a labeled data sample set; cyclically executing the following steps until a preset condition is met:
constructing a prediction model by taking the current labeled data samples as training samples;
determining, by the method of any one of claims 1 to 4, whether each unlabeled data sample in the current unlabeled data sample set has labeling value, to obtain a set of data samples with labeling value;
adding labels to the samples in the set of data samples with labeling value, and moving those samples from the unlabeled data sample set to the labeled data sample set;
and after the loop ends, taking the current prediction model as the trained prediction model.
6. A method of determining a transaction sample with labeling value, comprising:
acquiring an unlabeled transaction sample set, wherein any unlabeled transaction sample j includes a transaction feature vector value x_j;
determining a probability density function f(x) for a feature vector value x according to the distribution of {x_j}, j = 1, 2, 3, …, m, where m is the number of samples in the unlabeled transaction sample set;
for any unlabeled transaction sample j, determining the feature vector value x_j of the transaction sample and calculating f(x_j) using f(x); determining the labeling value degree of the transaction sample according to f(x_j), and if the labeling value degree is greater than a preset value, determining that the transaction sample has labeling value; wherein the smaller f(x_j) is, the greater the labeling value degree of the unlabeled transaction sample j.
7. An apparatus for determining a data sample with labeling value, comprising:
a set acquisition module, configured to acquire an unlabeled data sample set, wherein any unlabeled data sample i includes a feature vector value x_i;
a probability density function determination module, configured to determine a probability density function f(x) for a feature vector value x according to the distribution of {x_i}, i = 1, 2, 3, …, n, where n is the number of samples in the unlabeled data sample set;
a labeling value degree determination module, configured to, for any unlabeled data sample i, determine the feature vector value x_i of the sample and calculate f(x_i) using f(x); determine the labeling value degree of the data sample according to f(x_i), and if the labeling value degree is greater than a preset value, determine that the data sample has labeling value; wherein the smaller f(x_i) is, the greater the labeling value degree of the unlabeled data sample i.
8. A predictive model building apparatus, comprising:
a set acquisition module, configured to acquire an unlabeled data sample set and a labeled data sample set;
a cyclic execution module, configured to cyclically execute the following units until a preset condition is met:
a model construction unit, configured to construct a prediction model by taking the current labeled data samples as training samples;
a labeling-value data sample set determination unit, configured to determine, by the apparatus of claim 7, whether each unlabeled data sample in the current unlabeled data sample set has labeling value, to obtain a set of data samples with labeling value;
a label adding unit, configured to add labels to the samples in the set of data samples with labeling value, and to move those samples from the unlabeled data sample set to the labeled data sample set;
and a prediction model determination module, configured to take the current prediction model as the trained prediction model after the cyclic execution module finishes executing.
9. An apparatus for determining a transaction sample with labeling value, comprising:
a set acquisition module, configured to acquire an unlabeled transaction sample set, wherein any unlabeled transaction sample j includes a transaction feature vector value x_j;
a probability density function determination module, configured to determine a probability density function f(x) for a feature vector value x according to the distribution of {x_j}, j = 1, 2, 3, …, m, where m is the number of samples in the unlabeled transaction sample set;
a labeling value degree determination module, configured to, for any unlabeled transaction sample j, determine the feature vector value x_j of the transaction sample and calculate f(x_j) using f(x); determine the labeling value degree of the transaction sample according to f(x_j), and if the labeling value degree is greater than a preset value, determine that the transaction sample has labeling value; wherein the smaller f(x_j) is, the greater the labeling value degree of the unlabeled transaction sample j.
10. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor implements the method of any one of claims 1-6 by executing the executable instructions.
11. A computer readable storage medium having stored thereon computer instructions which, when executed by a processor, carry out the steps of the method according to any one of claims 1 to 6.
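
The sketches below are illustrative only and form no part of the claims. The density-based scoring of claims 1 and 4 can be realized with a kernel density estimate over the unlabeled feature vectors; the following minimal Python sketch assumes scikit-learn's KernelDensity, and the Gaussian kernel, the bandwidth, and the names X_unlabeled and preset_value are illustrative choices, not prescribed by the claims.

    import numpy as np
    from sklearn.neighbors import KernelDensity

    def labeling_value_degrees(X_unlabeled, bandwidth=1.0):
        # Estimate f(x) from the distribution of the unlabeled feature
        # vectors {x_i} (claims 1 and 4), here via a Gaussian kernel.
        kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth)
        kde.fit(X_unlabeled)
        density = np.exp(kde.score_samples(X_unlabeled))  # f(x_i) per sample i
        # The smaller f(x_i), the greater the labeling value degree; any
        # monotonically decreasing mapping works, negation being the simplest.
        return -density

    def has_labeling_value(degrees, preset_value):
        # A sample has labeling value if its degree exceeds the preset value.
        return degrees > preset_value

For example, has_labeling_value(labeling_value_degrees(X), preset_value=-0.05) flags the low-density samples; both the bandwidth and the threshold would be tuned per application.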
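Claim 2 combines f(x_i) with the prediction model's uncertainty but leaves both the uncertainty measure and the combination rule open. A hedged sketch, assuming a scikit-learn-style classifier exposing predict_proba and using predictive entropy with a simple ratio as the combination (both illustrative assumptions):

    import numpy as np

    def prediction_uncertainty(model, X):
        # Entropy of the predicted class probabilities: one possible
        # measure of how uncertain the model is about each sample.
        proba = model.predict_proba(X)
        return -np.sum(proba * np.log(proba + 1e-12), axis=1)

    def combined_degree_uncertainty(density, uncertainty):
        # Illustrative combination: low density f(x_i) and high
        # uncertainty both raise the labeling value degree.
        return uncertainty / (density + 1e-12)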
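Claim 3 instead combines f(x_i) with a "feature difference degree" between the sample and the labeled set; the measure itself is not fixed by the claim. A minimal sketch using the mean Euclidean distance to the labeled feature vectors (an assumed choice):

    import numpy as np

    def feature_difference_degree(x_i, X_labeled):
        # Mean Euclidean distance from x_i to every labeled feature vector;
        # larger values mean the sample is less like what is already labeled.
        return np.linalg.norm(X_labeled - x_i, axis=1).mean()

    def combined_degree_difference(density_i, difference_i):
        # Illustrative combination: low density f(x_i) and high feature
        # difference both raise the labeling value degree.
        return difference_i / (density_i + 1e-12)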
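The loop of claim 5 is, in effect, a pool-based active-learning round: train, score, label the valuable samples, move them to the labeled pool, repeat. A sketch under the assumption that train, score_labeling_value, request_labels, and stop_condition are supplied by the caller (all names hypothetical):

    def build_prediction_model(X_labeled, y_labeled, X_unlabeled,
                               train, score_labeling_value,
                               request_labels, stop_condition,
                               preset_value):
        # Train on the current labeled pool (claim 5, first step).
        model = train(X_labeled, y_labeled)
        while not stop_condition(model, X_labeled, X_unlabeled):
            # Score every unlabeled sample and keep those whose labeling
            # value degree exceeds the preset value (claims 1 to 4).
            degrees = score_labeling_value(model, X_labeled, X_unlabeled)
            valued = [i for i, d in enumerate(degrees) if d > preset_value]
            if not valued:
                break  # nothing left worth labeling
            # Label the valuable samples and move them between pools.
            y_new = request_labels([X_unlabeled[i] for i in valued])
            X_labeled = X_labeled + [X_unlabeled[i] for i in valued]
            y_labeled = y_labeled + y_new
            kept = set(range(len(X_unlabeled))) - set(valued)
            X_unlabeled = [X_unlabeled[i] for i in sorted(kept)]
            # Retrain on the enlarged labeled pool.
            model = train(X_labeled, y_labeled)
        return model  # the trained prediction model after the loop ends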
CN202110953954.4A 2021-08-19 2021-08-19 Method and device for determining data sample with marked value Pending CN113657510A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110953954.4A CN113657510A (en) 2021-08-19 2021-08-19 Method and device for determining data sample with marked value

Publications (1)

Publication Number Publication Date
CN113657510A (en) 2021-11-16

Family

ID=78481283

Country Status (1)

Country Link
CN (1) CN113657510A (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111096736A (en) * 2018-10-26 2020-05-05 深圳市理邦精密仪器股份有限公司 Electrocardiogram classification method, device and system based on active learning
CN111476256A (en) * 2019-01-24 2020-07-31 北京京东尚科信息技术有限公司 Model training method and device based on semi-supervised learning and electronic equipment
CN111310846A (en) * 2020-02-28 2020-06-19 平安科技(深圳)有限公司 Method, device, storage medium and server for selecting sample image

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116821647A (en) * 2023-08-25 2023-09-29 中国电子科技集团公司第十五研究所 Optimization method, device and equipment for data annotation based on sample deviation evaluation
CN116821647B (en) * 2023-08-25 2023-12-05 中国电子科技集团公司第十五研究所 Optimization method, device and equipment for data annotation based on sample deviation evaluation
CN116881724A (en) * 2023-09-07 2023-10-13 中国电子科技集团公司第十五研究所 Sample labeling method, device and equipment
CN116881724B (en) * 2023-09-07 2023-12-19 中国电子科技集团公司第十五研究所 Sample labeling method, device and equipment

Similar Documents

Publication Publication Date Title
EP3637295B1 (en) Risky address identification method and apparatus, and electronic device
US8108324B2 (en) Forward feature selection for support vector machines
CN112732871B (en) Multi-label classification method for acquiring client intention labels through robot induction
CN108959474B (en) Entity relation extraction method
US10528889B2 (en) Stereoscopic learning for classification
CN113657510A (en) Method and device for determining data sample with marked value
US20200387783A1 (en) Fast Nearest Neighbor Search for Output Generation of Convolutional Neural Networks
US11694111B2 (en) Learning device and learning method
US20200065707A1 (en) Learning device and learning method
US11645573B2 (en) Learning device and learning method
US20200143284A1 (en) Learning device and learning method
CN113221065A (en) Data density estimation and regression method, corresponding device, electronic device, and medium
US11593673B2 (en) Systems and methods for identifying influential training data points
US20210073591A1 (en) Robustness estimation method, data processing method, and information processing apparatus
CN113469111A (en) Image key point detection method and system, electronic device and storage medium
CN117423344A (en) Voiceprint recognition method and device based on neural network
US11599824B2 (en) Learning device and learning method
US20200050970A1 (en) Learning device and learning method
CN116958809A (en) Remote sensing small sample target detection method for feature library migration
CN114254588B (en) Data tag processing method and device
CN113010687B (en) Exercise label prediction method and device, storage medium and computer equipment
CN115393914A (en) Multitask model training method, device, equipment and storage medium
CN112884028A (en) System resource adjusting method, device and equipment
CN112861974A (en) Text classification method and device, electronic equipment and storage medium
CN111159397A (en) Text classification method and device and server

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination