CN111882063A

CN111882063A - Data annotation request method, device, equipment and storage medium suitable for low budget

Info

Publication number: CN111882063A
Application number: CN202010767850.XA
Authority: CN
Inventors: 赵曦滨; 万海; 张豪; 黄潇
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2020-08-03
Filing date: 2020-08-03
Publication date: 2020-11-03
Anticipated expiration: 2040-08-03
Also published as: CN111882063B

Abstract

This specification discloses a data annotation request method, device, equipment and storage medium suitable for low budget. The method combines the model's confidence value for a sample prediction with the model's historical detection performance on that class of sample, and adds a budget parameter to the factors that influence whether the real label of the sample data is requested. With the limited budget allocated reasonably, the model is thus more inclined to request real labels for samples of the class on which it has made more mistakes. This addresses two problems of the prior art, namely the selection of the request vector and the neglect of the budget's influence, and copes better with unbalanced data distribution. By monitoring the remaining environmental budget, the method dynamically adjusts how actively the model requests labels, which improves the training effect of the model.

Description

Data annotation request method, device, equipment and storage medium suitable for low budget

Technical Field

The invention relates to the field of machine learning, in particular to a data annotation request method, device, equipment and storage medium suitable for low budget.

Background

Online learning algorithms can process, online, the massive data generated at every moment. A traditional online learning framework, however, requires the model to be trained on every piece of data, whereas in practice the real label of a sample generally cannot be obtained for free. In anomaly detection tasks in real scenarios, the budget for labeling training data is usually limited, and when an unbalanced data stream is processed, whether the limited labeling budget can be adequately allocated to the rare abnormal samples is a decisive factor in the anomaly detection performance of an algorithm. The limited labeling budget is a practical problem that must be faced, and the active learning mechanism is one way of coping with it: the real label of a data item is selectively and actively requested for training according to the model's confidence in its prediction for the sample, so as to save the limited labeling budget. The aim is to label the training data that the model needs most and thereby preserve the training effect of the model as far as possible under a limited budget.

In the prior art, an asymmetric label request mechanism can cope better with unbalanced data distribution, but, like a cost-sensitive mechanism, this form of active learning faces the problem of selecting the label request vector. In addition, existing active learning techniques do not consider the influence of the budget: they decide whether to request the real label of a data item for training only from the model's perspective, according to the prediction confidence. The model therefore makes the same labeling requests under different budget conditions, which is clearly inadequate.

In summary, there is an urgent need for a data annotation request method suitable for low budget that adds a budget factor to the active learning mechanism so as to improve the anomaly detection performance of the model.

Disclosure of Invention

The present specification provides a data annotation request method, device, equipment and storage medium suitable for low budget, which overcome at least one technical problem in the prior art.

According to a first aspect of embodiments of the present specification, there is provided a data annotation request method suitable for low budget, including: obtaining a current budget parameter according to a pre-obtained budget proportion value, a count value of requested real labels, a count value of the number of processed samples and a preset budget parameter iterative formula; each time sample data to be detected is received, obtaining a prediction result for the sample data through a pre-trained anomaly detection model, and calculating a corresponding confidence value according to the prediction result, wherein the prediction result is either a normal sample or an abnormal sample; according to the class of the prediction result, obtaining a request factor for the sample data from the current maximum norm (modulus length) of the samples, the current budget parameter and the count value of the number of mistakes made on samples of that class, and obtaining a probability factor from the request factor and the confidence value; performing one Bernoulli trial with the probability factor as the success probability to obtain a test result; if the test result is a success and the proportion value of the consumed budget is not greater than the budget proportion value, requesting the real label of the sample data, wherein the proportion value of the consumed budget is the ratio of the count value of real labels requested by the anomaly detection model to the count value of the number of samples processed by the anomaly detection model; if the real label of the sample data is requested, increasing the count value of requested real labels by one; if the prediction result is inconsistent with the real label, increasing by one the mistake count value of the class corresponding to the prediction result; increasing the count value of the number of processed samples by one; and updating the current budget parameter according to the current count value of requested real labels, the current count value of the number of processed samples and the preset budget parameter iterative formula.

Preferably, in the step of obtaining the current budget parameter according to the pre-obtained budget proportion value, the count value of requested real labels, the count value of the number of processed samples and the preset budget parameter iterative formula, the budget parameter iterative formula has the following form:

where β is the current budget parameter, β₀ is the pre-obtained budget proportion value, QN is the count value of requested real labels, and TN is the count value of the number of processed samples.

Preferably, the step of obtaining a request factor for the sample data according to the class of the prediction result, the current maximum norm of the samples, the current budget parameter and the count value of the number of mistakes made on samples of that class, and obtaining a probability factor from the request factor and the confidence value, includes:

if the prediction result of the sample data is a normal sample, the request factor of the sample is obtained according to the following formula,

where b₊ is the request factor for a normal sample, β is the current budget parameter, K⁺ is the count value of the number of mistakes made on samples predicted as normal, K⁻ is the count value of the number of mistakes made on samples predicted as abnormal, and X′ is the current maximum norm (modulus length) of the samples;

and the probability factor is obtained according to

where b is the probability factor, b₊ is the request factor of the sample, and p_t is the confidence value of the prediction result for the sample data;

if the prediction result of the sample data is an abnormal sample, the request factor of the sample is obtained according to the following formula,

where b₋ is the request factor for an abnormal sample, β is the current budget parameter, K⁺ is the count value of the number of mistakes made on samples predicted as normal, K⁻ is the count value of the number of mistakes made on samples predicted as abnormal, and X′ is the current maximum norm (modulus length) of the samples;

and the probability factor is obtained according to

where b is the probability factor, b₋ is the request factor of the sample, and p_t is the confidence value of the prediction result for the sample data.

Preferably, the method further includes: updating the parameters of the anomaly detection model according to the real label of the sample data obtained by the request.

According to a second aspect of the embodiments of the present specification, there is provided a data annotation request device suitable for low budget, including: a budget parameter obtaining module, a confidence obtaining module, a probability obtaining module, a test module, a label request module, a label counting module, a mistake counting module, a sample counting module and a budget parameter updating module, wherein the budget parameter obtaining module is configured to obtain a current budget parameter according to a pre-obtained budget proportion value, a count value of requested real labels, a count value of the number of processed samples and a preset budget parameter iterative formula; the confidence obtaining module is configured to, each time sample data to be detected is received, obtain a prediction result for the sample data through a pre-trained anomaly detection model and calculate a corresponding confidence value according to the prediction result, wherein the prediction result is either a normal sample or an abnormal sample; the probability obtaining module is configured to obtain, according to the class of the prediction result, a request factor for the sample data from the current maximum norm (modulus length) of the samples, the current budget parameter and the count value of the number of mistakes made on samples of that class, and to obtain a probability factor from the request factor and the confidence value; the test module is configured to perform one Bernoulli trial with the probability factor as the success probability to obtain a test result; the label request module is configured to request the real label of the sample data if the test result is a success and the proportion value of the consumed budget is not greater than the budget proportion value, wherein the proportion value of the consumed budget is the ratio of the count value of real labels requested by the anomaly detection model to the count value of the number of samples processed by the anomaly detection model; the label counting module is configured to increase the count value of requested real labels by one if the real label of the sample data is requested; the mistake counting module is configured to increase by one the mistake count value of the class corresponding to the prediction result if the prediction result is inconsistent with the real label; the sample counting module is configured to increase the count value of the number of processed samples by one after one piece of sample data to be detected has been received and processed; and the budget parameter updating module is configured to update the current budget parameter according to the current count value of requested real labels, the current count value of the number of processed samples and the preset budget parameter iterative formula.

Preferably, the budget parameter iterative formula in the budget parameter obtaining module includes the following form:

where β is the current budget parameter, β₀ is the pre-obtained budget proportion value, QN is the count value of requested real labels, and TN is the count value of the number of processed samples.

Preferably, the probability obtaining module is configured to: if the prediction result of the sample data is a normal sample, obtain the request factor of the sample according to the following formula,

and obtain the probability factor according to

and, if the prediction result of the sample data is an abnormal sample, obtain the request factor of the sample according to the following formula,

where b₋ is the request factor for an abnormal sample, β is the current budget parameter, K⁺ is the count value of the number of mistakes made on samples predicted as normal, K⁻ is the count value of the number of mistakes made on samples predicted as abnormal, and X′ is the current maximum norm (modulus length) of the samples;

and obtain the probability factor according to

Optionally, the device further includes a label application module, wherein the label application module is configured to update the parameters of the anomaly detection model according to the real label of the sample data obtained by the request.

According to a third aspect of embodiments herein, there is provided a computing device comprising a memory, a processor and computer instructions stored on the memory and executable on the processor, the processor implementing the steps of the data annotation request method when executing the instructions.

According to a fourth aspect of embodiments herein, there is provided a storage medium storing computer instructions which, when executed by a processor, implement the steps of the data annotation request method.

The beneficial effects of the embodiment of the specification are as follows:

The specification provides a data annotation request method, device, equipment and storage medium suitable for low budget. First, the method combines the model's confidence value for a sample prediction with the model's historical detection performance on that class of sample (namely, the number of mistakes made) to generate a request factor, which adaptively adjusts how strongly the model is biased toward requesting annotation of normal or abnormal samples. The model is therefore more inclined to request real labels for the class of sample on which it has made more mistakes, which solves the prior-art problem of selecting the request vector and copes better with unbalanced data distribution. Second, the consumed budget is monitored by counting the number of processed samples and the number of requested real labels, a budget parameter is obtained according to the pre-obtained budget proportion value, and the budget parameter is added to the request factor. By monitoring the remaining environmental budget, the method can thus dynamically adjust how actively the model requests labels, so that the limited budget is allocated reasonably and the training effect of the model is preserved as far as possible.

The innovation points of the embodiments of the specification include:

1. In this embodiment, the historical number of mistakes made by the model on each class of sample is monitored, and the model's confidence in a sample prediction is combined with the model's historical detection performance on that class of sample, so as to adjust how strongly the model is biased toward requesting annotation of normal or abnormal samples. The model thus tends to request real labels for the class of sample with the larger historical mistake count, which solves the problem of selecting the request vector and copes better with unbalanced data distribution. This is one of the innovation points of the embodiments of the specification.

2. In this embodiment, the consumed budget is monitored by counting the number of processed samples and the number of requested real labels, a budget parameter is obtained according to the pre-obtained budget proportion value, and the budget parameter is added to the request factor. By monitoring the remaining environmental budget, the method can dynamically adjust how actively the model requests labels, so that the limited budget is allocated reasonably and the training effect of the model is preserved as far as possible. This is one of the innovation points of the embodiments of the specification.

Drawings

In order to illustrate more clearly the embodiments of the present disclosure or the technical solutions in the prior art, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present disclosure; those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flowchart of a data annotation request method suitable for low budget according to an embodiment of the present disclosure;

FIG. 2 is a schematic structural diagram of a data annotation request device suitable for low budget according to an embodiment of the present disclosure;

FIG. 3 is a schematic structural diagram of a computing device according to an embodiment of the present disclosure.

Detailed Description

The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of the present invention.

It should be noted that the terms "including" and "having" and any variations thereof in the embodiments of the present specification and the drawings are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.

In anomaly detection tasks, the budget for labeling training data is usually limited. When an unbalanced data stream is processed, the real label of a data item is selectively and actively requested for training according to the model's confidence in its prediction for the sample. The aim is to save the limited labeling budget by labeling the training data that the model needs most, and to preserve the training effect of the model as far as possible under a limited budget.

In the prior art, when an asymmetric label request mechanism is used to handle unbalanced data distribution, the selection of the request vector depends on prior industry knowledge and cannot follow the dynamic changes of the data distribution in stream processing. In this specification, therefore, the model's confidence in a sample prediction and the detection performance observed in the model's past predictions jointly influence the model's willingness to request real labels. Label requests are thus adjusted dynamically according to the data processing results, and the model tends to request real labels for the classes on which it has made more mistakes, so that it learns the data distribution better, makes fewer mistakes and trains more effectively. Meanwhile, in order to introduce the budget condition into the request mechanism, a budget parameter is added to the request factor: the budget currently available to the model is monitored, and the willingness to request real labels is adjusted accordingly.

The embodiments of the specification disclose a data annotation request method, device, equipment and storage medium suitable for low budget, which are described in detail below.

Example one

FIG. 1 is a flowchart of a data annotation request method suitable for low budget according to an embodiment of the present disclosure. As shown in FIG. 1, a data annotation request method suitable for low budget is provided, which includes:

110. Obtaining the current budget parameter according to a pre-obtained budget proportion value, a count value of requested real labels, a count value of the number of processed samples and a preset budget parameter iterative formula.

In a specific embodiment, in the step of obtaining the current budget parameter according to the pre-obtained budget proportion value, the count value of requested real labels, the count value of the number of processed samples and the preset budget parameter iterative formula, the budget parameter iterative formula has the following form:

β₀ is the proportion of the environmental budget relative to all samples to be processed. In practice, a stream processing model must analyze massive data generated continuously online over a long period, and it is generally unreasonable to require the model to process nearly unlimited data with a relatively fixed budget. Giving the environmental budget as a proportion, namely the ratio of the budget given to the model to the number of samples to be processed, is therefore clearly a more reasonable approach.

The budget parameter iterative formula contains an expression for the proportion of the budget that the model has already consumed and an expression for the proportion of the budget that remains. The current budget parameter β increases as the remaining budget increases, which makes the model more inclined to request the real label of a sample; conversely, as the remaining budget decreases, the model becomes more inclined not to request the real label of the data, so as to save the data annotation budget. In the formula, β takes values in the range 0 < β < 2β₀, and when

β = β₀, so that the model can still adapt to a severe budget environment.

120. Each time sample data to be detected is received, obtaining a prediction result for the sample data through a pre-trained anomaly detection model, and calculating a corresponding confidence value according to the prediction result, wherein the prediction result is either a normal sample or an abnormal sample.

130. According to the class of the prediction result, obtaining a request factor for the sample data from the current maximum norm (modulus length) of the samples, the current budget parameter and the count value of the number of mistakes made on samples of that class, and obtaining a probability factor from the request factor and the confidence value.

In a specific embodiment, this step includes the following. If the prediction result of the sample data is a normal sample, the request factor of the sample is obtained according to the following formula,

where b₊ is the request factor for a normal sample, β is the current budget parameter, K⁺ is the count value of the number of mistakes made on samples predicted as normal, K⁻ is the count value of the number of mistakes made on samples predicted as abnormal, and X′ is the current maximum norm (modulus length) of the samples.

The request factor incorporates the historical number of mistakes made on each sample class, so that the model is more inclined to request the real labels of samples of the class on which more mistakes have been made.

The probability factor is then obtained according to

where b is the probability factor, b₊ is the request factor of the sample, and p_t is the confidence value of the prediction result for the sample data.

The probability factor combines the historical mistake count of the sample class, carried by the request factor, with the confidence value of the prediction result for the sample data. The model is therefore more inclined to request the real label of a sample that is predicted with low confidence and belongs to a class with many mistakes, which makes the label request mechanism more reasonable.

If the prediction result of the sample data is an abnormal sample, the request factor of the sample is obtained according to the following formula,

where b₋ is the request factor for an abnormal sample, β is the current budget parameter, K⁺ is the count value of the number of mistakes made on samples predicted as normal, K⁻ is the count value of the number of mistakes made on samples predicted as abnormal, and X′ is the current maximum norm (modulus length) of the samples.

The probability factor is then obtained according to

Sample data predicted as abnormal is processed in the same way as sample data predicted as normal. Because the historical mistake counts differ between the classes and affect the value of the probability factor, data of different classes is given request willingness of different weights when labels are requested, so that the model can fit data with an unbalanced distribution more fully.

140. Performing one Bernoulli trial with the probability factor as the success probability to obtain a test result.

The larger the value of the probability factor is, the more the test result tends to be successful, so that the true label of the sample data is more likely to be requested.
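As a minimal sketch of step 140, assuming the probability factor has already been computed, a single Bernoulli trial can be drawn with Python's standard `random` module. The function name `bernoulli_trial` and the clamping of values outside [0, 1] are illustrative assumptions; the specification does not describe either.

```python
import random


def bernoulli_trial(prob_factor: float, rng: random.Random) -> bool:
    """Draw one Bernoulli trial with success probability `prob_factor`.

    Assumption: values outside [0, 1] are clamped; the specification
    does not say how such values would be handled.
    """
    p = min(1.0, max(0.0, prob_factor))
    # rng.random() is uniform on [0.0, 1.0), so the comparison succeeds
    # with probability p (never when p == 0, always when p == 1).
    return rng.random() < p


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
trials = 10_000
successes = sum(bernoulli_trial(0.8, rng) for _ in range(trials))
print(successes / trials)  # close to 0.8 for a large number of trials
```

A larger probability factor makes success, and therefore a label request, more frequent over many samples, which is the behavior described above.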

150. If the test result is a success and the proportion value of the consumed budget is not greater than the budget proportion value, requesting the real label of the sample data, wherein the proportion value of the consumed budget is the ratio of the count value of real labels requested by the anomaly detection model to the count value of the number of samples processed by the anomaly detection model.

When the test result is a success, the real label of the sample data is requested only if the proportion of the budget consumed so far is not greater than the budget proportion value. The prediction confidence and the per-class historical mistake counts establish that requesting the label is necessary, while the budget condition establishes that it is feasible. The available budget is thus fully used, the budget is saved to a large extent, and budget economy is improved.
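The condition in step 150 can be sketched as a small predicate. This is an illustrative sketch only: the function name `should_request_label` is hypothetical, and treating the consumed proportion as 0 before any sample has been processed (TN = 0) is an assumption, since the text does not cover that case.

```python
def should_request_label(trial_success: bool, qn: int, tn: int, beta0: float) -> bool:
    """Step 150: request the real label only if the Bernoulli trial
    succeeded and the consumed budget proportion QN/TN is not greater
    than the budget proportion value beta0."""
    # Assumption: with no processed samples yet, nothing has been consumed.
    consumed = qn / tn if tn > 0 else 0.0
    return trial_success and consumed <= beta0


# 4 labels requested over 100 processed samples, budget proportion 0.05
print(should_request_label(True, 4, 100, 0.05))   # True: 0.04 <= 0.05
# 6 labels requested over 100 processed samples exceeds the same budget
print(should_request_label(True, 6, 100, 0.05))   # False: 0.06 > 0.05
# A failed trial never leads to a request, whatever the budget
print(should_request_label(False, 0, 100, 0.05))  # False
```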

160. If the real label of the sample data is requested, increasing the count value of requested real labels by one.

The model needs to track the number of real labels that have been requested, so that this count can be used in the computation when the next data item is received.

170. If the prediction result is inconsistent with the real label, increasing by one the mistake count value of the class corresponding to the prediction result.

After the real label has been requested, it is compared with the prediction result to determine whether the classification made by the model is correct. The mistake count on each class indicates how well the model has learned the data of that class; this gives the historical detection performance, and the model becomes more inclined to request real labels for the class on which it tends to make mistakes.

180. Increasing the count value of the number of processed samples by one.

The model needs to track the number of processed samples, so that the budget parameter can be updated for the next moment and label requests can be made in real time according to the current budget.

190. Updating the current budget parameter according to the current count value of requested real labels, the current count value of the number of processed samples and the preset budget parameter iterative formula.

Because the current budget parameter is updated in this way, whether to make a label request is decided according to the current environmental budget for each piece of sample data to be detected that is received.
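The bookkeeping in steps 160 to 190 can be sketched as a small state object. This is a hedged illustration, not the patented implementation: the class name `RequestState`, its field names and the string class labels are hypothetical. The budget parameter iterative formula is not reproduced in this text, so it is passed in as a callable `budget_formula(beta0, qn, tn)`; the stand-in used in the usage lines simply returns β₀ and is NOT the formula of the specification.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

NORMAL, ABNORMAL = "normal", "abnormal"  # hypothetical class labels


@dataclass
class RequestState:
    beta0: float  # pre-obtained budget proportion value
    # Placeholder for the budget parameter iterative formula (assumption:
    # it depends only on beta0, QN and TN, as the text describes).
    budget_formula: Callable[[float, int, int], float]
    qn: int = 0   # QN: count of requested real labels
    tn: int = 0   # TN: count of processed samples
    mistakes: Dict[str, int] = field(
        default_factory=lambda: {NORMAL: 0, ABNORMAL: 0}
    )
    beta: float = 0.0  # current budget parameter, set in __post_init__

    def __post_init__(self) -> None:
        # Step 110: initial budget parameter from the current counts.
        self.beta = self.budget_formula(self.beta0, self.qn, self.tn)

    def record(self, predicted: str, true_label: Optional[str]) -> None:
        """Apply steps 160-190 for one sample.

        `true_label` is None when no real label was requested.
        """
        if true_label is not None:
            self.qn += 1                        # step 160
            if true_label != predicted:
                self.mistakes[predicted] += 1   # step 170: keyed by prediction
        self.tn += 1                            # step 180
        self.beta = self.budget_formula(self.beta0, self.qn, self.tn)  # step 190


# Stand-in formula for demonstration only; not the patent's formula.
state = RequestState(beta0=0.1, budget_formula=lambda beta0, qn, tn: beta0)
state.record(predicted=NORMAL, true_label=ABNORMAL)  # requested, and wrong
state.record(predicted=ABNORMAL, true_label=None)    # not requested
print(state.qn, state.tn, state.mistakes)
```

Keeping the formula behind a callable lets the counters and the update order be exercised independently of the exact expression for β.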

In a specific embodiment, the method further includes: updating the parameters of the anomaly detection model according to the real label of the sample data obtained by the request.

The model is trained with each real label obtained by request, which improves the model's ability to fit the data and yields more accurate prediction results.

This embodiment provides a data annotation request method suitable for low budget. The method combines the model's confidence in a sample prediction with the model's historical detection performance on that class of sample, so as to adaptively adjust how strongly the model is biased toward requesting labels for normal or abnormal samples. It also introduces a budget monitoring mechanism, so that how actively the model requests labels can be adjusted dynamically by monitoring the remaining environmental budget.

Example two

FIG. 2 is a schematic structural diagram of a data annotation request device suitable for low budget according to an embodiment of the present disclosure. As shown in FIG. 2, there is provided a data annotation request device 200 suitable for low budget, including: a budget parameter obtaining module 210, a confidence obtaining module 220, a probability obtaining module 230, a test module 240, a label request module 250, a label counting module 260, a mistake counting module 270, a sample counting module 280, and a budget parameter updating module 290, wherein

The budget parameter obtaining module 210 is configured to obtain a current budget parameter according to a budget ratio value obtained in advance, a count value of a requested real label, a count value of a number of processed samples, and a preset budget parameter iterative formula.

In a specific embodiment, the budget parameter iterative formula in the budget parameter obtaining module includes the following form:

The confidence obtaining module 220 is configured to, each time sample data to be detected is received, obtain a prediction result for the sample data through a pre-trained anomaly detection model, and calculate a corresponding confidence value according to the prediction result, where the prediction result is either a normal sample or an abnormal sample.

The probability obtaining module 230 is configured to obtain, according to the class of the prediction result, a request factor for the sample data from the current maximum norm (modulus length) of the samples, the current budget parameter and the count value of the number of mistakes made on samples of that class, and to obtain a probability factor from the request factor and the confidence value.

In one embodiment, the probability obtaining module is configured to: if the prediction result of the sample data is a normal sample, obtain the request factor of the sample according to the following formula,

where b₊ is the request factor for a normal sample, β is the current budget parameter, K⁺ is the count value of the number of mistakes made on samples predicted as normal, K⁻ is the count value of the number of mistakes made on samples predicted as abnormal, and X′ is the current maximum norm (modulus length) of the samples;

and obtain the probability factor according to

and, if the prediction result of the sample data is an abnormal sample, obtain the request factor and the probability factor in the corresponding manner described in the method embodiment, the probability factor being obtained according to


The test module 240 is configured to perform one Bernoulli trial with the probability factor as the success probability to obtain a test result.

The label requesting module 250 is configured to request the real label of the sample data if the test result is successful and the consumed budget ratio value is not greater than the budget ratio value, where the consumed budget ratio value is obtained by comparing the count value of the real label requested by the anomaly detection model with the count value of the number of samples processed by the anomaly detection model.

The tag counting module 260 is configured to increase the count value of the number of the requested real tags by one if the real tags of the sample data are requested.

The error count module 270 is configured to increase the error count value of the sample corresponding to the prediction result by one if the prediction result is inconsistent with the real label.

The sample counting module 280 is configured to increment the count value of the number of processed samples by one after receiving and processing one sample data to be detected.

The budget parameter updating module 290 is configured to update the current budget parameter according to the count value of the currently requested real tag, the count value of the currently processed sample number, and a preset budget parameter iterative formula.

In a specific implementation, the device further includes a tag application module 295, which is configured to update the parameters of the anomaly detection model according to the real tag of the sample data obtained by the request.

In the present embodiment, a data annotation request device 200 that is adaptive to a low budget is provided, which can realize the functions of the data annotation request method that is adaptive to a low budget, and the corresponding implementation steps and effects can be referred to in the method section.
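The gating and counting behaviour of the testing module 240, the label requesting module 250 and the counting modules 260 to 280 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the class name `LabelBudgetState`, its field names and the string class labels are hypothetical, and the probability factor is taken as an input because its equations are not reproduced in this text.

```python
import random
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LabelBudgetState:
    """Hypothetical container for the counters kept by modules 260 to 280."""
    budget_ratio: float   # pre-obtained budget proportion value
    requested: int = 0    # count of real labels requested so far
    processed: int = 0    # count of samples processed so far
    mistakes: dict = field(
        default_factory=lambda: {"normal": 0, "abnormal": 0}
    )                     # mistake counts per predicted class

    def consumed_ratio(self) -> float:
        # Requested labels divided by processed samples (0 before any sample).
        return self.requested / self.processed if self.processed else 0.0

    def try_request(self, prob_factor: float, rng: random.Random) -> bool:
        # One Bernoulli trial with success probability prob_factor; the label
        # is requested only if the consumed ratio does not exceed the budget.
        success = rng.random() < prob_factor
        if success and self.consumed_ratio() <= self.budget_ratio:
            self.requested += 1
            return True
        return False

    def record(self, predicted: str, true_label: Optional[str]) -> None:
        # true_label is None when no real label was requested for the sample.
        if true_label is not None and true_label != predicted:
            self.mistakes[predicted] += 1
        self.processed += 1
```

In this sketch a caller would invoke `try_request` once per incoming sample and then `record` with the prediction and, if a label was obtained, the real label, so that the consumed budget ratio is always computed from up-to-date counters.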

Example three

Fig. 3 is a schematic structural diagram of a computing device according to an embodiment of the present disclosure. As shown in fig. 3, a computing device 300 is provided comprising a memory 310, a processor 320, and computer instructions stored on the memory 310 and executable on the processor 320, the processor 320 implementing the steps of the method when executing the instructions.

Embodiments of the present description provide a storage medium storing computer instructions that, when executed by a processor, implement the steps of the described method.

Example four

In one embodiment, the data annotation request method suitable for low budget is applied to a financial anti-fraud task, where reasonably allocating the budget for real tag requests improves the monitoring of financial transaction data. Requesting the real tag for transaction data classification comprises the following steps:

410. Obtain the current budget parameter according to a pre-obtained budget proportion value, the count value of the requested real labels, the count value of the number of processed transaction data and a preset budget parameter iterative formula.

420. Each time one piece of transaction data is received, obtain a prediction result of the transaction data through a pre-trained anomaly detection model, and calculate a corresponding confidence value according to the prediction result, wherein the prediction result is either normal transaction data or abnormal transaction data. A prediction of abnormal transaction data means that abnormal behavior such as fraud or illegal operation may exist in the transaction concerned. The confidence value of the prediction result represents the distance of the data from the classification plane: the closer the data lies to the plane, the lower the model's confidence in its classification of the data.

430. Obtain a request factor of the transaction data according to the classification of the prediction result, the current maximum module length among all received data, the current budget parameter and the count value of the number of mistakes made on the predicted class, and obtain a probability factor according to the request factor and the confidence value. For the specific steps of obtaining the probability factor, refer to the method in the first embodiment.

440. Perform one Bernoulli random test with the probability factor as the success probability to obtain a test result.

450. If the test result is successful and the consumed budget proportion value is not greater than the budget proportion value, request the real label of the transaction data, wherein the consumed budget proportion value is obtained by comparing the count value of the real labels requested by the anomaly detection model with the count value of the number of transaction data processed by the anomaly detection model.

460. If the real label of the transaction data is requested, increase the count value of the number of requested real labels by one.

470. If the prediction result is inconsistent with the real label, increase the mistake count value of the class corresponding to the prediction result by one.

480. Increase the count value of the number of processed transaction data by one.

490. Update the current budget parameter according to the count value of the currently requested real labels, the count value of the number of currently processed transaction data and the preset budget parameter iterative formula.
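The confidence value described in step 420 and the maximum module length used in step 430 can be illustrated with a short Python sketch. It assumes a linear decision function, which the text does not specify; the function names, the sign convention (positive score means abnormal) and the list-based vectors are all hypothetical choices made for illustration.

```python
import math


def predict_with_confidence(weights, bias, x):
    """Return (label, confidence) for an assumed linear decision function."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    norm_w = math.sqrt(sum(w * w for w in weights))
    # Confidence is the distance of x from the classification plane:
    # the closer to the plane, the lower the confidence.
    confidence = abs(score) / norm_w
    # Assumed convention: a positive score is predicted as abnormal.
    label = "abnormal" if score > 0 else "normal"
    return label, confidence


def update_max_norm(current_max, x):
    """Track the largest module length (Euclidean norm) seen so far."""
    return max(current_max, math.sqrt(sum(xi * xi for xi in x)))
```

A transaction close to the plane yields a small confidence value and, according to the description, should therefore be a stronger candidate for a real label request.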

The transaction data are classified by the anomaly detection model, and the eagerness with which the real label of a transaction is requested is determined by the model's confidence value for the prediction, the currently available budget and the number of mistakes the model has made on the predicted class. Therefore, for transaction data on which the model's confidence is low and whose predicted class has accumulated more mistakes, the method is more inclined, subject to the current budget, to request the real label and so determine whether the model's prediction is correct. In the financial security detection task, the anomaly detection model monitors continuously generated transaction data; allocating the limited budget reasonably improves the model's ability to detect illegal transaction data, so that the most reliable monitoring effect achievable is obtained with limited resources.
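The qualitative behaviour described here, where a lower confidence value or a larger request factor makes a label request more likely, can be illustrated in Python. The equations of this specification are not reproduced in the text, so the form b / (b + |p|) used below is only an assumption borrowed from conventional margin-based selective sampling, and the function name `probability_factor` is hypothetical. The request factor is passed in as a plain number because its dependence on the budget parameter and the mistake counts is likewise not reproduced.

```python
def probability_factor(request_factor: float, confidence: float) -> float:
    """Assumed form b / (b + |p|); not taken from the specification."""
    if request_factor <= 0:
        raise ValueError("request_factor must be positive in this sketch")
    return request_factor / (request_factor + abs(confidence))


# Lower confidence gives a higher request probability for the same factor.
low_confidence = probability_factor(1.0, 0.1)
high_confidence = probability_factor(1.0, 2.0)

# A larger request factor gives a higher probability at the same confidence.
small_factor = probability_factor(0.5, 1.0)
large_factor = probability_factor(3.0, 1.0)
```

Under this assumed form the result always lies in (0, 1], so it can be used directly as the success probability of the Bernoulli random test.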

To sum up, embodiments of the present specification provide a data annotation request method, apparatus, device and storage medium suitable for low budget. The data annotation request method copes better with unbalanced data distributions, solves the prior-art problem of selecting the tag request vector, and incorporates the influence of the budget into the online learning mechanism. From the perspective of the model, it overcomes the prior-art problem that a model issues the same data annotation requests regardless of the budget conditions, and it adaptively adjusts the model's bias toward requesting annotations for normal or abnormal samples, thereby allocating the limited budget reasonably.

Those of ordinary skill in the art will understand that: the figures are merely schematic representations of one embodiment, and the blocks or flow diagrams in the figures are not necessarily required to practice the present invention.

Those of ordinary skill in the art will understand that: modules in the devices in the embodiments may be distributed in the devices in the embodiments according to the description of the embodiments, or may be located in one or more devices different from the embodiments with corresponding changes. The modules of the above embodiments may be combined into one module, or further split into multiple sub-modules.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A data annotation request method adaptive to low budget is characterized by comprising the following steps:

obtaining a current budget parameter according to a pre-obtained budget proportion value, a count value of a requested real label, a count value of the number of processed samples and a preset budget parameter iterative formula;

each time a piece of sample data to be detected is received, obtaining a prediction result of the sample data through a pre-trained anomaly detection model, and calculating a corresponding confidence value according to the prediction result, wherein the prediction result is either a normal sample or an abnormal sample;

obtaining a request factor of the sample data according to the classification of the prediction result, the current maximum module length of the sample, the current budget parameter and the count value of the number of mistakes made on the classified sample, and obtaining a probability factor according to the request factor and the confidence value;

taking the probability factor as success probability, and performing a Bernoulli random test to obtain a test result;

if the test result is successful and the proportion value of the consumed budget is not greater than the budget proportion value, requesting a real label of the sample data, wherein the proportion value of the consumed budget is obtained by comparing the count value of the real label requested by the anomaly detection model with the count value of the number of samples processed by the anomaly detection model;

if the real label of the sample data is requested, increasing the count value of the number of the requested real labels by one;

if the prediction result is inconsistent with the real label, increasing a mistake count value of the sample corresponding to the prediction result by one;

increasing the count value of the number of the processed samples by one;

and updating the current budget parameters according to the count value of the real label requested currently, the count value of the number of samples processed currently and a preset budget parameter iterative formula.

2. The method according to claim 1, wherein in the step of obtaining the current budget parameter according to the pre-obtained budget ratio value, the count value of the requested real tags, the count value of the number of processed samples, and a preset budget parameter iterative formula, the budget parameter iterative formula comprises the following form:

3. The method of claim 1, wherein the step of obtaining a request factor for the sample data according to the classification of the prediction result and the current maximum module length of the sample, the current budget parameter and the count value of the number of mistakes made on the classified sample, and obtaining a probability factor according to the request factor and the confidence value comprises:

and according to

the probability factor is obtained;

and according to

the probability factor is obtained.

4. The method of claim 1, further comprising:

updating the parameters of the anomaly detection model according to the real label of the sample data obtained by the request.

5. A data annotation request device that accommodates low budgets, comprising: a budget parameter obtaining module, a confidence obtaining module, a probability obtaining module, a test module, a label request module, a label counting module, a mistake counting module, a sample counting module and a budget parameter updating module, wherein

the budget parameter obtaining module is configured to obtain a current budget parameter according to a budget proportion value obtained in advance, a count value of a requested real label, a count value of the number of processed samples and a preset budget parameter iterative formula;

the self-confidence level obtaining module is configured to obtain a prediction result of sample data to be detected through a pre-trained anomaly detection model every time the sample data to be detected is received, and calculate a corresponding self-confidence level value according to the prediction result, wherein the prediction result is either a normal sample or an abnormal sample;

the probability obtaining module is configured to obtain a request factor of the sample data according to the classification of the prediction result, the current maximum module length of the sample, the current budget parameter and the count value of the number of mistakes made on the classified sample, and obtain a probability factor according to the request factor and the self-confidence value;

the test module is configured to perform a Bernoulli random test once by taking the probability factor as a success probability to obtain a test result;

the label request module is configured to request a real label of the sample data if the test result is successful and the consumed budget ratio value is not greater than the budget ratio value, wherein the consumed budget ratio value is obtained by comparing the count value of the real label requested by the anomaly detection model with the count value of the number of samples processed by the anomaly detection model;

the tag counting module is configured to increase a count value of the number of the requested real tags by one if the real tags of the sample data are requested;

the mistake counting module is configured to increase a mistake counting value of the sample corresponding to the prediction result by one if the prediction result is inconsistent with the real label;

the sample counting module is configured to increase the count value of the number of processed samples by one after receiving and processing one sample data to be detected;

and the budget parameter updating module is configured to update the current budget parameter according to the count value of the currently requested real label, the count value of the currently processed sample number and a preset budget parameter iterative formula.

6. The apparatus of claim 5, wherein the iterative budget parameter formula in the budget parameter obtaining module comprises the following form:

7. The apparatus of claim 5, wherein the probability obtaining module is configured to

and according to

the probability factor is obtained,

wherein b₋ is the request factor for abnormal samples, β is the current budget parameter, K⁺ is the count value of the number of mistakes made on samples predicted as normal, K⁻ is the count value of the number of mistakes made on samples predicted as abnormal, and X′ is the current maximum module length (norm) of the samples;

and according to

the probability factor is obtained.

8. The apparatus of claim 5, further comprising a tag application module, wherein

the label application module is configured to update the parameters of the anomaly detection model according to the real label of the sample data obtained by the request.

9. A computing device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, the processor implementing the steps of the method of any of claims 1-4 when executing the instructions.

10. A storage medium storing computer instructions which, when executed by a processor, carry out the steps of the method of any one of claims 1 to 4.