CN111882063A - Data annotation request method, device, equipment and storage medium suitable for low budget - Google Patents


Info

Publication number
CN111882063A
Authority
CN
China
Prior art keywords
sample
budget
value
count value
request
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010767850.XA
Other languages
Chinese (zh)
Other versions
CN111882063B (en
Inventor
赵曦滨
万海
张豪
黄潇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202010767850.XA priority Critical patent/CN111882063B/en
Publication of CN111882063A publication Critical patent/CN111882063A/en
Application granted granted Critical
Publication of CN111882063B publication Critical patent/CN111882063B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06MCOUNTING MECHANISMS; COUNTING OF OBJECTS NOT OTHERWISE PROVIDED FOR
    • G06M1/00Design features of general application

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The specification discloses a data annotation request method, device, equipment and storage medium suitable for low budgets. The method combines the model's confidence value for a sample prediction with the model's historical detection performance on that class of samples, and adds a budget parameter to the factors that determine whether the real label of a sample is requested. As a result, while a limited budget is allocated reasonably, the model is more inclined to request real labels for the class of samples on which it has made more mistakes. This addresses the prior-art problems of selecting the request vector and ignoring the influence of the budget, copes better with imbalanced data distributions, dynamically adjusts how readily the model requests labels by monitoring the remaining environmental budget, and improves the training effect of the model.

Description

Data annotation request method, device, equipment and storage medium suitable for low budget
Technical Field
The invention relates to the field of machine learning, in particular to a data annotation request method, device, equipment and storage medium suitable for low budget.
Background
Online learning algorithms can process, online, the massive data generated at every moment. However, the traditional online learning framework requires the model to train on every piece of data, whereas in practice the real label of a sample usually cannot be obtained for free. In anomaly detection tasks in real scenarios, the budget for labeling training data is usually limited, and when an imbalanced data stream is processed, whether the limited labeling budget can be sufficiently allocated to the rare abnormal samples is a decisive factor in the anomaly detection performance of the algorithm. A limited labeling budget is therefore a practical problem that must be faced. Active learning is one way of coping with it: the real label of a sample is selectively and actively requested for training according to the model's confidence in its prediction for that sample, so that the limited labeling budget is saved. The aim is to label the training data the model needs most and thereby preserve the training effect as far as possible under a limited budget.
In the prior art, an asymmetric label request mechanism can cope better with imbalanced data distributions, but, like cost-sensitive mechanisms, this form of active learning faces the problem of selecting the label request vector. In addition, existing active learning techniques do not consider the influence of the budget: whether to request the real label of a sample for training is decided only from the model's perspective, according to the prediction confidence. The model therefore makes the same labeling requests under different budget conditions, which is clearly unsatisfactory.
In summary, there is an urgent need for a data annotation request method suitable for low budgets that adds a budget factor to the active learning mechanism in order to improve the anomaly detection performance of the model.
Disclosure of Invention
The present specification provides a data annotation request method, device, equipment and storage medium suitable for low budgets, which overcome at least one technical problem in the prior art.
According to a first aspect of embodiments of the present specification, there is provided a data annotation request method suitable for low budgets, including: obtaining a current budget parameter according to a pre-obtained budget proportion value, a count value of requested real labels, a count value of the number of processed samples, and a preset budget parameter iterative formula; each time a piece of sample data to be detected is received, obtaining a prediction result for the sample data through a pre-trained anomaly detection model and calculating a corresponding confidence value from the prediction result, wherein the prediction result is either a normal sample or an abnormal sample; according to the class of the prediction result, obtaining a request factor for the sample data from the current maximum module length of the samples, the current budget parameter, and the count values of the numbers of mistakes made on each class of samples, and obtaining a probability factor from the request factor and the confidence value; performing one Bernoulli random trial with the probability factor as the success probability to obtain a trial result; if the trial succeeds and the consumed budget proportion is not greater than the budget proportion value, requesting the real label of the sample data, wherein the consumed budget proportion is the ratio of the count value of real labels requested by the anomaly detection model to the count value of the number of samples processed by the anomaly detection model; if the real label of the sample data is requested, increasing the count value of requested real labels by one; if the prediction result is inconsistent with the real label, increasing by one the mistake count value of the class corresponding to the prediction result; increasing the count value of the number of processed samples by one; and updating the current budget parameter according to the current count value of requested real labels, the current count value of the number of processed samples, and the preset budget parameter iterative formula.
Preferably, in the step of obtaining the current budget parameter according to a pre-obtained budget proportion value, a count value of requested real labels, a count value of the number of processed samples, and a preset budget parameter iterative formula, the budget parameter iterative formula includes the following form:
[Formula image BDA0002615358150000031 not reproduced in this text: budget parameter iterative formula]
where $\beta$ is the current budget parameter, $\beta_0$ is the pre-obtained budget proportion value, $QN$ is the count value of requested real labels, and $TN$ is the count value of the number of processed samples.
Preferably, the step of obtaining a request factor for the sample data according to the class of the prediction result, from the current maximum module length of the samples, the current budget parameter and the count values of the numbers of mistakes made on each class of samples, and obtaining a probability factor from the request factor and the confidence value includes:
if the prediction result for the sample data is a normal sample, obtaining the request factor of the sample according to the following formula,
[Formula image BDA0002615358150000032 not reproduced in this text: request factor $b_+$ for a sample predicted as normal]
where $b_+$ is the request factor for normal samples, $\beta$ is the current budget parameter, $K_+$ is the count value of the number of mistakes made on samples predicted as normal, $K_-$ is the count value of the number of mistakes made on samples predicted as abnormal, and $X'$ is the current maximum module length of the samples;
and obtaining the probability factor according to
[Formula image BDA0002615358150000033 not reproduced in this text: probability factor $b$ as a function of $b_+$ and $p_t$]
where $b$ is the probability factor, $b_+$ is the request factor of the sample, and $p_t$ is the confidence value of the prediction result for the sample data;
if the prediction result for the sample data is an abnormal sample, obtaining the request factor of the sample according to the following formula,
[Formula image BDA0002615358150000034 not reproduced in this text: request factor $b_-$ for a sample predicted as abnormal]
where $b_-$ is the request factor for abnormal samples, $\beta$ is the current budget parameter, $K_+$ is the count value of the number of mistakes made on samples predicted as normal, $K_-$ is the count value of the number of mistakes made on samples predicted as abnormal, and $X'$ is the current maximum module length of the samples;
and obtaining the probability factor according to
[Formula image BDA0002615358150000041 not reproduced in this text: probability factor $b$ as a function of $b_-$ and $p_t$]
where $b$ is the probability factor, $b_-$ is the request factor of the sample, and $p_t$ is the confidence value of the prediction result for the sample data.
Preferably, the method further comprises the following steps: and updating the parameters of the abnormal detection model according to the real label of the sample data obtained by the request.
According to a second aspect of the embodiments of the present specification, there is provided a data annotation request device suitable for low budgets, including: a budget parameter obtaining module, a confidence obtaining module, a probability obtaining module, a test module, a label request module, a label counting module, a mistake counting module, a sample counting module and a budget parameter updating module, wherein the budget parameter obtaining module is configured to obtain a current budget parameter according to a pre-obtained budget proportion value, a count value of requested real labels, a count value of the number of processed samples, and a preset budget parameter iterative formula; the confidence obtaining module is configured to, each time a piece of sample data to be detected is received, obtain a prediction result for the sample data through a pre-trained anomaly detection model and calculate a corresponding confidence value from the prediction result, wherein the prediction result is either a normal sample or an abnormal sample; the probability obtaining module is configured to obtain, according to the class of the prediction result, a request factor for the sample data from the current maximum module length of the samples, the current budget parameter, and the count values of the numbers of mistakes made on each class of samples, and to obtain a probability factor from the request factor and the confidence value; the test module is configured to perform one Bernoulli random trial with the probability factor as the success probability to obtain a trial result; the label request module is configured to request the real label of the sample data if the trial succeeds and the consumed budget proportion is not greater than the budget proportion value, wherein the consumed budget proportion is the ratio of the count value of real labels requested by the anomaly detection model to the count value of the number of samples processed by the anomaly detection model; the label counting module is configured to increase the count value of requested real labels by one if the real label of the sample data is requested; the mistake counting module is configured to increase by one the mistake count value of the class corresponding to the prediction result if the prediction result is inconsistent with the real label; the sample counting module is configured to increase the count value of the number of processed samples by one after each piece of sample data to be detected has been received and processed; and the budget parameter updating module is configured to update the current budget parameter according to the current count value of requested real labels, the current count value of the number of processed samples, and the preset budget parameter iterative formula.
Preferably, the budget parameter iterative formula in the budget parameter obtaining module includes the following form:
[Formula image BDA0002615358150000051 not reproduced in this text: budget parameter iterative formula]
where $\beta$ is the current budget parameter, $\beta_0$ is the pre-obtained budget proportion value, $QN$ is the count value of requested real labels, and $TN$ is the count value of the number of processed samples.
Preferably, the probability obtaining module is configured to, if the prediction result for the sample data is a normal sample, obtain the request factor of the sample according to the following formula,
[Formula image BDA0002615358150000052 not reproduced in this text: request factor $b_+$ for a sample predicted as normal]
where $b_+$ is the request factor for normal samples, $\beta$ is the current budget parameter, $K_+$ is the count value of the number of mistakes made on samples predicted as normal, $K_-$ is the count value of the number of mistakes made on samples predicted as abnormal, and $X'$ is the current maximum module length of the samples;
and obtain the probability factor according to
[Formula image BDA0002615358150000053 not reproduced in this text: probability factor $b$ as a function of $b_+$ and $p_t$]
where $b$ is the probability factor, $b_+$ is the request factor of the sample, and $p_t$ is the confidence value of the prediction result for the sample data;
and, if the prediction result for the sample data is an abnormal sample, obtain the request factor of the sample according to the following formula,
[Formula image BDA0002615358150000054 not reproduced in this text: request factor $b_-$ for a sample predicted as abnormal]
where $b_-$ is the request factor for abnormal samples, $\beta$ is the current budget parameter, $K_+$ is the count value of the number of mistakes made on samples predicted as normal, $K_-$ is the count value of the number of mistakes made on samples predicted as abnormal, and $X'$ is the current maximum module length of the samples;
and obtain the probability factor according to
[Formula image BDA0002615358150000061 not reproduced in this text: probability factor $b$ as a function of $b_-$ and $p_t$]
where $b$ is the probability factor, $b_-$ is the request factor of the sample, and $p_t$ is the confidence value of the prediction result for the sample data.
Optionally, the system further comprises a tag application module, wherein the tag application module is configured to update parameters of the anomaly detection model according to a real tag of sample data obtained by the request.
According to a third aspect of embodiments herein, there is provided a computing device comprising a memory, a processor and computer instructions stored on the memory and executable on the processor, the processor implementing the steps of the data annotation request method when executing the instructions.
According to a fourth aspect of embodiments herein, there is provided a storage medium storing computer instructions which, when executed by a processor, implement the steps of the data annotation request method.
The beneficial effects of the embodiment of the specification are as follows:
The specification provides a data annotation request method, device, equipment and storage medium suitable for low budgets. First, the confidence value of the model's prediction for a sample is combined with the model's historical detection performance on that class of samples (that is, the number of mistakes made) to generate a request factor, which adaptively adjusts the model's bias between requesting labels for normal and abnormal samples. The model is thus more inclined to request real labels for the class on which it has made more mistakes, which addresses the prior-art problem of selecting the request vector and copes better with imbalanced data distributions. Second, the consumed budget is monitored by counting the number of processed samples and the number of requested real labels; a budget parameter is obtained from the pre-obtained budget proportion value and added to the request factor, so that how readily the model requests labels is adjusted dynamically according to the remaining environmental budget. The limited budget is thereby allocated reasonably and the training effect of the model is preserved as far as possible.
The innovation points of the embodiment of the specification comprise:
1. In this embodiment, the historical number of mistakes made by the model on each class of samples is monitored, and the model's confidence in its prediction for a sample is combined with its historical detection performance on that class, thereby adjusting the model's bias between requesting labels for normal and abnormal samples. The model tends to request real labels in the class with more historical mistakes, which addresses the problem of selecting the request vector and copes better with imbalanced data distributions. This is one of the innovation points of the embodiments of the present specification.
2. In this embodiment, the consumed budget is monitored by counting the number of processed samples and the number of requested real labels, a budget parameter is obtained from the pre-obtained budget proportion value, and the budget parameter is added to the request factor. How readily the model requests labels can therefore be adjusted dynamically according to the remaining environmental budget, so that the limited budget is allocated reasonably and the training effect of the model is preserved as far as possible. This is one of the innovation points of the embodiments of the present specification.
Drawings
In order to illustrate the embodiments of the present disclosure or the technical solutions in the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present disclosure; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a data annotation request method suitable for low budgets according to an embodiment of the present disclosure;
FIG. 2 is a schematic structural diagram of a data annotation request device suitable for low budgets according to an embodiment of the present disclosure;
FIG. 3 is a schematic structural diagram of a computing device according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of the present invention.
It should be noted that the terms "including" and "having" and any variations thereof in the embodiments of the present specification and the drawings are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
In anomaly detection tasks, the budget for labeling training data is usually limited. When an imbalanced data stream is processed, the real label of a sample is selectively and actively requested for training according to the model's confidence in its prediction, with the aim of saving the limited labeling budget by labeling the training data the model needs most and preserving the training effect as far as possible under a limited budget.
In the prior art, when an asymmetric label request mechanism is used to handle imbalanced data distributions, the choice of request vector depends on prior industry knowledge and cannot follow the dynamic changes of the data distribution in a stream processing setting. Therefore, the model's confidence in its prediction for a sample is combined with its historical detection performance to influence its willingness to request real labels. Label requests are adjusted dynamically according to the data processing results, and the model tends to request real labels for the class on which it has made more mistakes, so that it learns the data distribution better, makes fewer mistakes, and trains more effectively. In addition, to bring the budget into the request mechanism, a budget parameter is added to the request factor: the budget currently available to the model is monitored, and the willingness to request real labels is adjusted accordingly.
The embodiment of the specification discloses a method, a device, equipment and a storage medium for data annotation request adapting to low budget, which are respectively described in detail below.
Example one
FIG. 1 is a flowchart of a data annotation request method suitable for low budgets according to an embodiment of the present disclosure. As shown in FIG. 1, the method includes:
110. Obtaining the current budget parameter according to a pre-obtained budget proportion value, a count value of requested real labels, a count value of the number of processed samples, and a preset budget parameter iterative formula.
In a specific embodiment, the budget parameter iterative formula used in this step includes the following form:
[Formula image BDA0002615358150000091 not reproduced in this text: budget parameter iterative formula]
where $\beta$ is the current budget parameter, $\beta_0$ is the pre-obtained budget proportion value, $QN$ is the count value of requested real labels, and $TN$ is the count value of the number of processed samples.
$\beta_0$ is the proportion of the environmental budget to all samples to be processed. In practice, a stream processing model must analyze, over a long period, massive data that is generated continuously online, and it is generally not expected to process a nearly unlimited amount of data with a fixed budget. Giving the environmental budget as a proportion, namely the ratio of the budget granted to the model to the number of samples to be processed, is therefore a more reasonable approach.
In the budget parameter iterative formula, $QN/TN$ is the proportion of the budget that the model has already consumed, and the expression shown in formula image BDA0002615358150000093 (not reproduced in this text) is the proportion of the budget that remains. The current budget parameter $\beta$ increases as the remaining budget proportion increases, which makes the model more inclined to request the real label of a sample; conversely, as the remaining budget proportion decreases, the model becomes more inclined not to request real labels, so as to save the data annotation budget. In the formula, $\beta$ takes values in the range $0 < \beta < 2\beta_0$, and $\beta = \beta_0$ when the condition shown in formula image BDA0002615358150000096 (not reproduced in this text) holds, so that the model can still adapt to a severe budget environment.
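The budget bookkeeping described above can be sketched as follows. Because the iterative formula itself is only available as an image, `budget_parameter` below is a hypothetical stand-in (a scaled logistic curve) chosen solely to reproduce the stated properties: it grows with the remaining budget and stays within $0 < \beta < 2\beta_0$. The function names, the `sharpness` parameter, the definition of the remaining budget as $\beta_0 - QN/TN$, and the choice that $\beta = \beta_0$ exactly when the consumed proportion equals $\beta_0$ are all assumptions, not the formula of this specification.

```python
import math


def consumed_ratio(qn: int, tn: int) -> float:
    """Proportion of the budget already used: QN / TN (0 before any sample)."""
    return qn / tn if tn > 0 else 0.0


def budget_parameter(beta0: float, qn: int, tn: int, sharpness: float = 1.0) -> float:
    """HYPOTHETICAL stand-in for the budget parameter iterative formula.

    Assumed remaining budget: beta0 - QN/TN. A logistic curve maps it into
    the open interval (0, 2 * beta0) and increases with the remaining budget.
    """
    remaining = beta0 - consumed_ratio(qn, tn)
    return 2.0 * beta0 / (1.0 + math.exp(-sharpness * remaining))


if __name__ == "__main__":
    beta0 = 0.1  # budget: labels may be requested for 10% of processed samples
    for qn, tn in [(0, 100), (10, 100), (50, 100)]:
        print(qn, tn, round(budget_parameter(beta0, qn, tn), 4))
```

Any other monotone function with the same range would serve equally well for illustration; the logistic form is used only because it is bounded on both sides.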
120. When receiving a sample data to be tested, obtaining a prediction result of the sample data through a pre-trained anomaly detection model, and calculating a corresponding confidence value according to the prediction result, wherein the prediction result comprises a normal sample and an abnormal sample.
130. According to the classification of the prediction result, a request factor of the sample data is obtained according to the current maximum module length of the sample, the current budget parameter and the count value of the number of mistakes made on the classified sample, and a probability factor is obtained according to the request factor and the self-confidence value.
In a specific embodiment, the step of obtaining a request factor for the sample data according to the class of the prediction result, from the current maximum module length of the samples, the current budget parameter and the count values of the numbers of mistakes made on each class of samples, and obtaining a probability factor from the request factor and the confidence value includes:
if the prediction result for the sample data is a normal sample, the request factor of the sample is obtained according to the following formula,
[Formula image BDA0002615358150000101 not reproduced in this text: request factor $b_+$ for a sample predicted as normal]
where $b_+$ is the request factor for normal samples, $\beta$ is the current budget parameter, $K_+$ is the count value of the number of mistakes made on samples predicted as normal, $K_-$ is the count value of the number of mistakes made on samples predicted as abnormal, and $X'$ is the current maximum module length of the samples.
The request factor incorporates the historical number of mistakes for the sample's class, so that the model is more inclined to request the real labels of samples in the class on which more mistakes have been made.
The probability factor is then obtained according to
[Formula image BDA0002615358150000102 not reproduced in this text: probability factor $b$ as a function of $b_+$ and $p_t$]
where $b$ is the probability factor, $b_+$ is the request factor of the sample, and $p_t$ is the confidence value of the prediction result for the sample data.
The probability factor combines the historical number of mistakes for the sample's class, carried by the request factor, with the confidence value of the prediction result, so that the model is more inclined to request the real labels of samples predicted with low confidence in classes with many mistakes, which makes the label request mechanism more reasonable.
If the prediction result for the sample data is an abnormal sample, the request factor of the sample is obtained according to the following formula,
[Formula image BDA0002615358150000103 not reproduced in this text: request factor $b_-$ for a sample predicted as abnormal]
where $b_-$ is the request factor for abnormal samples, $\beta$ is the current budget parameter, $K_+$ is the count value of the number of mistakes made on samples predicted as normal, $K_-$ is the count value of the number of mistakes made on samples predicted as abnormal, and $X'$ is the current maximum module length of the samples.
The probability factor is then obtained according to
[Formula image BDA0002615358150000111 not reproduced in this text: probability factor $b$ as a function of $b_-$ and $p_t$]
where $b$ is the probability factor, $b_-$ is the request factor of the sample, and $p_t$ is the confidence value of the prediction result for the sample data.
Sample data predicted as abnormal is processed in the same way as data predicted as normal. Because the historical numbers of mistakes differ between classes and affect the value of the probability factor, label requests for different classes are given different weights, so that the model can fit imbalanced data more fully.
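The qualitative behavior of the request factor and the probability factor can be illustrated with a short sketch. The formulas of this specification are only available as images, so both functions below are hypothetical stand-ins: `request_factor` scales the budget parameter by the maximum module length (taken here to mean the largest vector norm observed so far, which is an assumption) and by a smoothed share of mistakes on the predicted class, and `probability_factor` borrows the common margin-based selective sampling form $b / (b + |p_t|)$. They reproduce only the stated tendencies: more mistakes on the predicted class, a larger budget parameter, or a lower confidence all raise the request probability.

```python
def request_factor(beta: float, x_max: float, k_pred: int, k_other: int) -> float:
    """HYPOTHETICAL request factor for the predicted class.

    beta    : current budget parameter (must be > 0)
    x_max   : current maximum module length of the samples (must be > 0)
    k_pred  : mistakes counted on the predicted class (K+ or K-)
    k_other : mistakes counted on the other class
    """
    # Smoothed share of mistakes on the predicted class; the +1 / +2 terms
    # avoid a zero factor before any mistakes have been recorded.
    mistake_share = (k_pred + 1) / (k_pred + k_other + 2)
    return beta * x_max * mistake_share


def probability_factor(b: float, confidence: float) -> float:
    """HYPOTHETICAL probability factor: b / (b + |p_t|), with b > 0."""
    return b / (b + abs(confidence))


if __name__ == "__main__":
    b_plus = request_factor(beta=0.1, x_max=2.0, k_pred=9, k_other=1)
    print(probability_factor(b_plus, confidence=0.1))  # low confidence
    print(probability_factor(b_plus, confidence=0.9))  # high confidence
```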
140. Performing one Bernoulli random trial with the probability factor as the success probability to obtain a trial result.
The larger the probability factor, the more likely the trial is to succeed, and hence the more likely the real label of the sample data is to be requested.
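A Bernoulli random trial with success probability $b$ can be sketched with the Python standard library as follows. The function name `bernoulli_trial` and the use of an explicit `random.Random` instance (so that runs can be seeded and reproduced) are illustrative choices, not part of this specification.

```python
import random


def bernoulli_trial(p: float, rng: random.Random) -> bool:
    """Return True with probability p (expected to lie in [0, 1])."""
    # rng.random() is uniform on [0.0, 1.0), so p = 1 always succeeds
    # and p = 0 never succeeds.
    return rng.random() < p


if __name__ == "__main__":
    rng = random.Random(42)  # seeded only to make the demonstration repeatable
    successes = sum(bernoulli_trial(0.3, rng) for _ in range(10_000))
    print(successes / 10_000)  # close to 0.3
```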
150. If the trial succeeds and the consumed budget proportion is not greater than the budget proportion value, requesting the real label of the sample data, wherein the consumed budget proportion is the ratio of the count value of real labels requested by the anomaly detection model to the count value of the number of samples processed by the anomaly detection model.
When the trial succeeds, the real label of the sample data is requested if, at that moment, $QN/TN \le \beta_0$. The necessity of requesting the label is ensured by the prediction confidence and the per-class historical mistake counts, and the feasibility of requesting it is ensured by the budget condition, so the available budget is fully used while being conserved as far as possible, which improves budget economy.
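The two-part condition of step 150 can be written as a small predicate. This is a minimal sketch: the function name and the convention that the consumed proportion is zero before any sample has been processed are assumptions made for illustration.

```python
def should_request_label(trial_success: bool, qn: int, tn: int, beta0: float) -> bool:
    """Step 150: request the real label only if the Bernoulli trial succeeded
    AND the consumed budget proportion QN/TN does not exceed beta0."""
    consumed = qn / tn if tn > 0 else 0.0  # assumed to be 0 before any sample
    return trial_success and consumed <= beta0


if __name__ == "__main__":
    # 5 labels requested over 100 samples with a 10% budget: still affordable.
    print(should_request_label(True, qn=5, tn=100, beta0=0.1))
    # 20 labels over 100 samples exceeds the 10% budget: the request is blocked.
    print(should_request_label(True, qn=20, tn=100, beta0=0.1))
```

Note that the budget check is a hard gate: even a successful trial cannot trigger a request once the consumed proportion exceeds the budget proportion value.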
160. If the real label of the sample data is requested, increase the count value of the number of requested real labels by one.
The model needs to track the number of real labels that have been requested so that this count can be used when the next piece of data is received.
170. If the prediction result is inconsistent with the real label, increase the mistake count value of the class corresponding to the prediction result by one.
After the real label is requested, it is compared with the prediction result to determine whether the classification made by the model is correct. The mistake count on each class represents how well the model has learned the data of that class and thus captures the historical detection effect, so that the model is more inclined to request real labels for the class on which it errs more easily.
180. Increase the count value of the number of processed samples by one.
The model needs to monitor the number of processed samples so that the budget parameter can be updated at the next moment, allowing label requests to follow the current budget in real time.
190. Update the current budget parameter according to the current count value of requested real labels, the current count value of the number of processed samples, and the preset budget parameter iterative formula.
With the current budget parameter updated, whether to make a label request is decided for each received sample data to be detected according to the current environment budget.
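The bookkeeping in steps 160 to 190 can be sketched as follows. All names (`RequestState`, `finish_sample`, `update_beta`) are hypothetical. The budget parameter iterative formula is given only as a figure in the source, so the sketch does not reproduce it and instead takes it as a caller-supplied function.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

NORMAL, ABNORMAL = "normal", "abnormal"


@dataclass
class RequestState:
    beta0: float   # pre-obtained budget proportion value
    beta: float    # current budget parameter
    qn: int = 0    # count value of requested real labels
    tn: int = 0    # count value of processed samples
    # Per-class mistake counts, keyed by the predicted class.
    mistakes: Dict[str, int] = field(
        default_factory=lambda: {NORMAL: 0, ABNORMAL: 0}
    )


def finish_sample(
    state: RequestState,
    predicted: str,
    true_label: Optional[str],
    update_beta: Callable[[float, int, int], float],
) -> None:
    """Apply steps 160-190; `true_label` is None if no label was requested."""
    if true_label is not None:
        state.qn += 1                       # step 160
        if true_label != predicted:
            state.mistakes[predicted] += 1  # step 170
    state.tn += 1                           # step 180
    # Step 190: `update_beta` stands in for the iterative formula (assumption).
    state.beta = update_beta(state.beta0, state.qn, state.tn)
```

Keeping the counters in one object makes it straightforward to evaluate the consumed budget proportion `qn / tn` before the next sample arrives.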
In a specific embodiment, the method further comprises: updating the parameters of the anomaly detection model according to the real label of the sample data obtained by the request.
The real label obtained by the request is used to train the model, which improves the model's ability to fit the data and yields more accurate prediction results.
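How a requested label might be used to update the model can be sketched as below. The source does not specify the model or its update rule; this sketch assumes, purely for illustration, a linear model with a perceptron-style update and labels encoded as +1 (normal) and -1 (abnormal). The function name `update_linear_model` and the learning rate `lr` are hypothetical.

```python
from typing import List


def update_linear_model(w: List[float], x: List[float], y: int, lr: float = 1.0) -> List[float]:
    """Return updated weights after seeing sample x with real label y.

    Assumption: y is +1 for a normal sample and -1 for an abnormal sample.
    """
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    if margin > 0:
        # The sample is already classified correctly: keep the weights.
        return list(w)
    # Misclassified (or on the boundary): move the weights toward the label.
    return [wi + lr * y * xi for wi, xi in zip(w, x)]
```

Any online learner could take the place of this update; the point is only that the model is trained on the labels it chose to request.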
In this embodiment, a data annotation request method adapted to low budget is provided. It combines the model's prediction confidence on a sample with the model's historical detection effect on that class of sample, so as to adaptively adjust how strongly the model favors requesting labels for normal or abnormal samples. It also introduces a budget monitoring mechanism, so that the model's willingness to request labels is dynamically adjusted by monitoring the remaining environment budget.
Example two
Fig. 2 is a schematic structural diagram of a data annotation request device adapted to low budget according to an embodiment of the present disclosure. As shown in Fig. 2, a data annotation request device 200 adapted to low budget is provided, including: a budget parameter obtaining module 210, a confidence obtaining module 220, a probability obtaining module 230, a testing module 240, a label requesting module 250, a label counting module 260, a mistake counting module 270, a sample counting module 280, and a budget parameter updating module 290, wherein
The budget parameter obtaining module 210 is configured to obtain a current budget parameter according to a budget ratio value obtained in advance, a count value of a requested real label, a count value of a number of processed samples, and a preset budget parameter iterative formula.
In a specific embodiment, the budget parameter iterative formula in the budget parameter obtaining module includes the following form:
Figure BDA0002615358150000131
wherein β is the current budget parameter, β0 is the pre-obtained budget proportion value, QN is the count value of requested real labels, and TN is the count value of the number of processed samples.
The confidence obtaining module 220 is configured to, each time a sample data to be detected is received, obtain a prediction result of the sample data through a pre-trained anomaly detection model, and calculate a corresponding confidence value according to the prediction result, where the prediction result includes a normal sample and an abnormal sample.
The probability obtaining module 230 is configured to obtain, according to the class of the prediction result, a request factor of the sample data from the current maximum module length of the samples, the current budget parameter, and the count value of the number of mistakes made on samples of that class, and to obtain a probability factor according to the request factor and the confidence value.
In one embodiment, the probability obtaining module is configured to obtain a request factor of the sample according to the following equation if the prediction result of the sample data is a normal sample,
Figure BDA0002615358150000132
wherein b+ is the request factor for a normal sample, β is the current budget parameter, K+ is the count value of the number of mistakes made on samples predicted as normal, K- is the count value of the number of mistakes made on samples predicted as abnormal, and X' is the current maximum module length (norm) of the samples;
and according to
Figure BDA0002615358150000141
the probability factor is obtained,
wherein b is the probability factor, b+ is the request factor of the sample, and p_t is the confidence value of the prediction result of the sample data;
if the prediction result of the sample data is an abnormal sample, obtaining the request factor of the sample according to the following formula,
Figure BDA0002615358150000142
wherein b- is the request factor for an abnormal sample, β is the current budget parameter, K+ is the count value of the number of mistakes made on samples predicted as normal, K- is the count value of the number of mistakes made on samples predicted as abnormal, and X' is the current maximum module length (norm) of the samples;
and according to
Figure BDA0002615358150000143
the probability factor is obtained,
wherein b is the probability factor, b- is the request factor of the sample, and p_t is the confidence value of the prediction result of the sample data.
The testing module 240 is configured to perform a bernoulli random test with the probability factor as a success probability to obtain a test result.
The label requesting module 250 is configured to request the real label of the sample data if the test result is successful and the consumed budget ratio value is not greater than the budget ratio value, where the consumed budget ratio value is obtained by comparing the count value of the real label requested by the anomaly detection model with the count value of the number of samples processed by the anomaly detection model.
The label counting module 260 is configured to increase the count value of the number of requested real labels by one if the real label of the sample data is requested.
The mistake counting module 270 is configured to increase the mistake count value of the class corresponding to the prediction result by one if the prediction result is inconsistent with the real label.
The sample counting module 280 is configured to increment the count value of the number of processed samples by one after receiving and processing one sample data to be detected.
The budget parameter updating module 290 is configured to update the current budget parameter according to the current count value of requested real labels, the current count value of the number of processed samples, and a preset budget parameter iterative formula.
In a specific implementation manner, the device further includes a label application module 295, which is configured to update parameters of the anomaly detection model according to the real label of the sample data obtained by the request.
In the present embodiment, a data annotation request device 200 that is adaptive to a low budget is provided, which can realize the functions of the data annotation request method that is adaptive to a low budget, and the corresponding implementation steps and effects can be referred to in the method section.
Example three
Fig. 3 is a schematic structural diagram of a computing device according to an embodiment of the present disclosure. As shown in fig. 3, a computing device 300 is provided comprising a memory 310, a processor 320, and computer instructions stored on the memory 310 and executable on the processor 320, the processor 320 implementing the steps of the method when executing the instructions.
Embodiments of the present description provide a storage medium storing computer instructions that, when executed by a processor, implement the steps of the described method.
Example four
In one embodiment, the data annotation request method adapted to low budget is applied to financial anti-fraud tasks, where reasonably allocating the budget for real label requests allows financial transaction data to be monitored more effectively. Requesting real labels for the classification of transaction data comprises the following steps:
410. Obtain the current budget parameter according to a pre-obtained budget proportion value, the count value of requested real labels, the count value of the number of processed transaction data, and a preset budget parameter iterative formula.
420. Each time a piece of transaction data is received, obtain a prediction result of the transaction data through a pre-trained anomaly detection model, and calculate a corresponding confidence value according to the prediction result, wherein the prediction result comprises normal transaction data and abnormal transaction data. A prediction of abnormal transaction data means that abnormal transaction behavior such as fraud or illegal operation may exist in the transaction process related to the transaction data. The confidence value of the prediction result represents the distance of the data from the classification plane: the closer the data is to the plane, the lower the model's confidence in its classification of the data.
430. According to the class of the prediction result, obtain a request factor of the transaction data from the current maximum module length among all received data, the current budget parameter, and the count value of the number of mistakes made on samples of that class, and obtain a probability factor according to the request factor and the confidence value. Specifically, the step of obtaining the probability factor can refer to the method in the first embodiment.
440. Perform one Bernoulli random test with the probability factor as the success probability to obtain a test result.
450. If the test result is successful and the consumed budget proportion is not greater than the budget proportion value, request the real label of the transaction data, wherein the consumed budget proportion is obtained as the ratio of the count value of real labels requested by the anomaly detection model to the count value of the number of transaction data processed by the anomaly detection model.
460. If the real label of the transaction data is requested, increase the count value of the number of requested real labels by one.
470. If the prediction result is inconsistent with the real label, increase the mistake count value of the class corresponding to the prediction result by one.
480. Increase the count value of the number of processed transaction data by one.
490. Update the current budget parameter according to the current count value of requested real labels, the current count value of the number of processed transaction data, and the preset budget parameter iterative formula.
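Step 420 describes the confidence value as the distance of the data from the classification plane. A minimal sketch of that idea is given below, assuming a linear classifier whose positive side is treated as normal; the source does not fix the model type, and the function name `predict_with_confidence` and the string labels are hypothetical.

```python
import math
from typing import List, Tuple


def predict_with_confidence(w: List[float], bias: float, x: List[float]) -> Tuple[str, float]:
    """Return (predicted class, distance of x from the plane w.x + bias = 0)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + bias
    norm = math.sqrt(sum(wi * wi for wi in w))
    # Geometric distance to the plane; a small value means low confidence.
    distance = abs(score) / norm if norm > 0 else 0.0
    # Assumption: a non-negative score is read as "normal".
    label = "normal" if score >= 0 else "abnormal"
    return label, distance
```

Transaction data lying close to the plane yields a small distance, which corresponds to the low-confidence case in which a label request is more worthwhile.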
The transaction data is classified by the anomaly detection model. For each prediction result, the model's confidence value for the prediction, the currently available budget, and the number of mistakes the model has made on that class together determine how actively the real label of the transaction data is requested. Therefore, for transaction data on which the model has low confidence in its prediction and whose class has a high mistake count, the real label is more likely to be requested, subject to the current budget, to determine whether the prediction made by the model is correct. In the financial security detection task, the anomaly detection model is used to detect continuously generated transaction data; under a limited budget, allocating the budget reasonably improves the model's ability to detect illegal transaction data, so that the most reliable monitoring effect possible is obtained with limited resources.
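The overall flow of steps 410 to 490 can be sketched as a single loop over the transaction stream. This is an illustrative sketch, not the patented implementation. The request factor, probability factor, and budget parameter formulas are given only as figures in the source, so they are passed in as caller-supplied functions (`probability_factor`, `update_beta`); `predict`, `oracle`, and `run_stream` are likewise hypothetical names, and the Euclidean norm is assumed for the module length.

```python
import random
from typing import Callable, Dict, Iterable, List, Tuple


def run_stream(
    transactions: Iterable[List[float]],
    predict: Callable[[List[float]], Tuple[str, float]],
    probability_factor: Callable[[str, float, float, Dict[str, int], float], float],
    update_beta: Callable[[float, int, int], float],
    oracle: Callable[[List[float]], str],
    beta0: float,
    seed: int = 0,
) -> Tuple[int, int, Dict[str, int]]:
    """Process a stream and return (QN, TN, per-class mistake counts)."""
    rng = random.Random(seed)
    qn, tn = 0, 0
    mistakes = {"normal": 0, "abnormal": 0}
    max_norm = 0.0
    beta = update_beta(beta0, qn, tn)                        # step 410
    for x in transactions:
        predicted, confidence = predict(x)                   # step 420
        # Assumption: module length is the Euclidean norm of the data.
        max_norm = max(max_norm, sum(v * v for v in x) ** 0.5)
        prob = probability_factor(predicted, confidence, beta, mistakes, max_norm)  # step 430
        success = rng.random() < prob                        # step 440
        consumed = qn / tn if tn > 0 else 0.0
        if success and consumed <= beta0:                    # step 450
            true_label = oracle(x)
            qn += 1                                          # step 460
            if true_label != predicted:
                mistakes[predicted] += 1                     # step 470
        tn += 1                                              # step 480
        beta = update_beta(beta0, qn, tn)                    # step 490
    return qn, tn, mistakes
```

Passing the formulas in as functions keeps the loop independent of their exact form, so the figures' formulas can be plugged in without changing the control flow.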
To sum up, embodiments of the present specification provide a data annotation request method, an apparatus, a device, and a storage medium adapted to low budget. The data annotation request method copes better with unbalanced data distributions and solves the prior-art problem of selecting a label request vector. By incorporating the influence of the budget into the online learning mechanism, it overcomes, from the perspective of the model, the problem that prior-art models issue the same data annotation requests regardless of budget conditions, and it can adaptively adjust how strongly the model favors requesting annotations for normal or abnormal samples, thereby allocating the limited budget reasonably.
Those of ordinary skill in the art will understand that: the figures are merely schematic representations of one embodiment, and the blocks or flow diagrams in the figures are not necessarily required to practice the present invention.
Those of ordinary skill in the art will understand that: modules in the devices in the embodiments may be distributed in the devices in the embodiments according to the description of the embodiments, or may be located in one or more devices different from the embodiments with corresponding changes. The modules of the above embodiments may be combined into one module, or further split into multiple sub-modules.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A data annotation request method adaptive to low budget is characterized by comprising the following steps:
obtaining a current budget parameter according to a pre-obtained budget proportion value, a count value of a requested real label, a count value of the number of processed samples and a preset budget parameter iterative formula;
when receiving a sample data to be detected, obtaining a prediction result of the sample data through a pre-trained anomaly detection model, and calculating a corresponding confidence value according to the prediction result, wherein the prediction result comprises a normal sample and an abnormal sample;
according to the classification of the prediction result, obtaining a request factor of the sample data according to the current maximum module length of the sample, the current budget parameter and the count value of the number of mistakes made on the classified sample, and obtaining a probability factor according to the request factor and the confidence value;
taking the probability factor as success probability, and performing a Bernoulli random test to obtain a test result;
if the test result is successful and the proportion value of the consumed budget is not greater than the budget proportion value, requesting a real label of the sample data, wherein the proportion value of the consumed budget is obtained by comparing the count value of the real label requested by the anomaly detection model with the count value of the number of samples processed by the anomaly detection model;
if the real label of the sample data is requested, increasing the count value of the number of the requested real labels by one;
if the prediction result is inconsistent with the real label, increasing a mistake count value of the sample corresponding to the prediction result by one;
increasing the count value of the number of the processed samples by one;
and updating the current budget parameters according to the count value of the real label requested currently, the count value of the number of samples processed currently and a preset budget parameter iterative formula.
2. The method according to claim 1, wherein in the step of obtaining the current budget parameter according to the pre-obtained budget ratio value, the count value of the requested real tags, the count value of the number of processed samples, and a preset budget parameter iterative formula, the budget parameter iterative formula comprises the following form:
Figure FDA0002615358140000021
wherein β is the current budget parameter, β0 is the pre-obtained budget proportion value, QN is the count value of requested real labels, and TN is the count value of the number of processed samples.
3. The method of claim 1, wherein the step of obtaining a request factor for the sample data according to the classification of the prediction result and the current maximum module length of the sample, the current budget parameter and the count value of the number of mistakes made on the classified sample, and obtaining a probability factor according to the request factor and the confidence value comprises:
if the prediction result of the sample data is a normal sample, the request factor of the sample is obtained according to the following formula,
Figure FDA0002615358140000022
wherein b+ is the request factor for a normal sample, β is the current budget parameter, K+ is the count value of the number of mistakes made on samples predicted as normal, K- is the count value of the number of mistakes made on samples predicted as abnormal, and X' is the current maximum module length (norm) of the samples;
and according to
Figure FDA0002615358140000023
the probability factor is obtained,
wherein b is the probability factor, b+ is the request factor of the sample, and p_t is the confidence value of the prediction result of the sample data;
if the prediction result of the sample data is an abnormal sample, obtaining the request factor of the sample according to the following formula,
Figure FDA0002615358140000031
wherein b- is the request factor for an abnormal sample, β is the current budget parameter, K+ is the count value of the number of mistakes made on samples predicted as normal, K- is the count value of the number of mistakes made on samples predicted as abnormal, and X' is the current maximum module length (norm) of the samples;
and according to
Figure FDA0002615358140000032
the probability factor is obtained,
wherein b is the probability factor, b- is the request factor of the sample, and p_t is the confidence value of the prediction result of the sample data.
4. The method of claim 1, further comprising:
updating the parameters of the anomaly detection model according to the real label of the sample data obtained by the request.
5. A data annotation request device that accommodates low budgets, comprising: a budget parameter obtaining module, a confidence obtaining module, a probability obtaining module, a test module, a label request module, a label counting module, a mistake counting module, a sample counting module and a budget parameter updating module, wherein
The budget parameter obtaining module is configured to obtain a current budget parameter according to a budget proportion value obtained in advance, a count value of a requested real label, a count value of the number of processed samples and a preset budget parameter iterative formula;
the confidence obtaining module is configured to obtain a prediction result of sample data to be detected through a pre-trained anomaly detection model every time the sample data to be detected is received, and calculate a corresponding confidence value according to the prediction result, wherein the prediction result comprises a normal sample and an abnormal sample;
the probability obtaining module is configured to obtain a request factor of the sample data according to the classification of the prediction result, the current maximum module length of the sample, the current budget parameter and the count value of the number of mistakes made on the classified sample, and obtain a probability factor according to the request factor and the confidence value;
the test module is configured to perform a Bernoulli random test once by taking the probability factor as a success probability to obtain a test result;
the label request module is configured to request a real label of the sample data if the test result is successful and the consumed budget ratio value is not greater than the budget ratio value, wherein the consumed budget ratio value is obtained by comparing the count value of the real label requested by the anomaly detection model with the count value of the number of samples processed by the anomaly detection model;
the label counting module is configured to increase a count value of the number of the requested real labels by one if the real label of the sample data is requested;
the mistake counting module is configured to increase a mistake counting value of the sample corresponding to the prediction result by one if the prediction result is inconsistent with the real label;
the sample counting module is configured to increase the count value of the number of processed samples by one after receiving and processing one sample data to be detected;
and the budget parameter updating module is configured to update the current budget parameter according to the count value of the currently requested real label, the count value of the currently processed sample number and a preset budget parameter iterative formula.
6. The apparatus of claim 5, wherein the iterative budget parameter formula in the budget parameter obtaining module comprises the following form:
Figure FDA0002615358140000041
wherein β is the current budget parameter, β0 is the pre-obtained budget proportion value, QN is the count value of requested real labels, and TN is the count value of the number of processed samples.
7. The apparatus of claim 5, wherein the probability obtaining module is configured to
If the prediction result of the sample data is a normal sample, the request factor of the sample is obtained according to the following formula,
Figure FDA0002615358140000051
wherein b+ is the request factor for a normal sample, β is the current budget parameter, K+ is the count value of the number of mistakes made on samples predicted as normal, K- is the count value of the number of mistakes made on samples predicted as abnormal, and X' is the current maximum module length (norm) of the samples;
and according to
Figure FDA0002615358140000052
the probability factor is obtained,
wherein b is the probability factor, b+ is the request factor of the sample, and p_t is the confidence value of the prediction result of the sample data;
if the prediction result of the sample data is an abnormal sample, obtaining the request factor of the sample according to the following formula,
Figure FDA0002615358140000053
wherein b- is the request factor for an abnormal sample, β is the current budget parameter, K+ is the count value of the number of mistakes made on samples predicted as normal, K- is the count value of the number of mistakes made on samples predicted as abnormal, and X' is the current maximum module length (norm) of the samples;
and according to
Figure FDA0002615358140000054
the probability factor is obtained,
wherein b is the probability factor, b- is the request factor of the sample, and p_t is the confidence value of the prediction result of the sample data.
8. The apparatus of claim 5, further comprising a label application module, wherein
the label application module is configured to update the parameters of the anomaly detection model according to the real label of the sample data obtained by the request.
9. A computing device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, the processor implementing the steps of the method of any of claims 1-4 when executing the instructions.
10. A storage medium storing computer instructions which, when executed by a processor, carry out the steps of the method of any one of claims 1 to 4.
CN202010767850.XA 2020-08-03 2020-08-03 Data annotation request method, device, equipment and storage medium suitable for low budget Active CN111882063B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010767850.XA CN111882063B (en) 2020-08-03 2020-08-03 Data annotation request method, device, equipment and storage medium suitable for low budget

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010767850.XA CN111882063B (en) 2020-08-03 2020-08-03 Data annotation request method, device, equipment and storage medium suitable for low budget

Publications (2)

Publication Number Publication Date
CN111882063A true CN111882063A (en) 2020-11-03
CN111882063B CN111882063B (en) 2022-12-02

Family

ID=73205665

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010767850.XA Active CN111882063B (en) 2020-08-03 2020-08-03 Data annotation request method, device, equipment and storage medium suitable for low budget

Country Status (1)

Country Link
CN (1) CN111882063B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109800799A (en) * 2018-12-31 2019-05-24 华南理工大学 A kind of online Active Learning Method suitable for no label unbalanced data stream
CN111143517A (en) * 2019-12-30 2020-05-12 浙江阿尔法人力资源有限公司 Method, device, equipment and storage medium for predicting human-selected label
CN111460150A (en) * 2020-03-27 2020-07-28 北京松果电子有限公司 Training method, classification method and device of classification model and storage medium

Also Published As

Publication number Publication date
CN111882063B (en) 2022-12-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant