CN110610193A - Method and device for processing labeled data - Google Patents

Info

Publication number
CN110610193A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910739900.0A
Other languages
Chinese (zh)
Inventor
刘逸哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dazhu (hangzhou) Technology Co Ltd
Original Assignee
Dazhu (hangzhou) Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dazhu (hangzhou) Technology Co Ltd filed Critical Dazhu (hangzhou) Technology Co Ltd
Priority to CN201910739900.0A priority Critical patent/CN110610193A/en
Publication of CN110610193A publication Critical patent/CN110610193A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a method and device for processing annotation data, a computer device and a computer storage medium. It relates to the technical field of data labeling, and can adjust the distribution of the data to be labeled and improve the effect of the labeled data on model training. The method comprises the following steps: acquiring sample data of each category randomly extracted from machine data as data to be labeled; while the data to be labeled is being labeled on the labeling platform, counting the labeled data under each category and judging whether the labeled data under each category reaches the training standard preset for the classification prediction model; if so, inputting the labeled data of the categories that reach the training standard into a network model as training data for training to obtain the classification prediction model; and updating the data to be labeled according to the prediction probability of the classification prediction model on the test data.

Description

Method and device for processing labeled data
Technical Field
The present invention relates to the field of data annotation technologies, and in particular, to a method and an apparatus for processing annotated data, a computer device, and a computer storage medium.
Background
In recent years, with the continuous development of computer and internet technologies, various intelligent applications are developed, and tools such as big data and artificial intelligence are gradually applied to practice. Natural language processing is one direction of artificial intelligence, enabling computers to read human languages and understand the content, thought and emotion expressed in the languages.
The mainstream techniques of natural language processing are based on statistical machine learning and depend mainly on two things: statistical models and optimization algorithms for the different tasks, and corresponding large-scale corpora. Natural language processing therefore needs a large amount of labeled supervised data. Because the amount of data to be labeled is very large, several people must label in parallel, and labeling efficiency and accuracy are very important for large-scale algorithm modeling. In many tasks the labeling scheme of the corpus is hard to get right: if the categories are too coarse, the language cannot be described comprehensively and in detail, and if they are too fine, labeling efficiency drops.
Existing labeling systems and methods support real-time labeling by multiple users, but the data to be labeled is imported in one batch and handed to the annotators, and its distribution is unknown before labeling. The extracted data to be labeled is therefore often unbalanced: some important data that needs labeling is not extracted, while other data is extracted in excess and labeled repeatedly. The resulting uneven distribution of the data to be labeled impairs the effect of the labeled data on model training.
Disclosure of Invention
In view of this, the present invention provides a method and an apparatus for processing annotation data, a computer device, and a computer storage medium, and mainly aims to adjust distribution of data to be annotated and improve an effect of the annotation data on model training.
According to an aspect of the present invention, there is provided a method for processing annotation data, the method comprising:
acquiring sample data of each category randomly extracted from machine data as data to be labeled;
in the process of labeling the data to be labeled based on the labeling platform, counting the labeled data under each category, and judging whether the labeled data under each category respectively reach the training standard preset for the classification prediction model;
if so, inputting the labeled data meeting the training standard category as training data into a network model for training to obtain a classification prediction model;
and updating the data to be labeled according to the prediction probability of the classification prediction model on the test data.
Further, the counting the labeled data under each category and judging whether the labeled data under each category respectively reaches the training standard preset for the classification prediction model includes:
counting the number of data corresponding to the labeled data under each category, and taking the number of data reaching a target threshold value preset for a classification prediction model as a training standard;
and judging whether the number of the data corresponding to the labeled data under each category reaches a target threshold value preset for the classification prediction model.
Further, before the data labeled under the training standard class is input into the network model as training data for training to obtain a classification prediction model, the method further includes:
and deleting redundant sample data in the labeled data meeting the training standard category so as to ensure that the number distribution of the data corresponding to the labeled data under each category is the same.
Further, after the statistics of the labeled data under each category and the judgment of whether the labeled data under each category respectively reaches the training standard preset for the classification prediction model, the method further includes:
if not, screening, from the machine data, sample data of the categories that do not reach the training standard, and labeling that sample data so that the labeled data under those categories reaches the training standard of the category model.
Further, the screening, from the machine data, of sample data of the categories that do not reach the training standard includes:
setting a regular-expression matching rule by collecting keywords of the categories that do not reach the training standard;
and screening sample data of the categories that do not reach the training standard from the machine data based on the regular-expression matching rule.
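The keyword-based screening described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the keyword list and the sample records are hypothetical, and joining the keywords into one case-insensitive alternation is an assumed way of "setting a regular-expression matching rule".

```python
import re

def build_keyword_rule(keywords):
    """Compile one regular expression that matches any of the given keywords."""
    if not keywords:
        raise ValueError("at least one keyword is required")
    # re.escape keeps keywords that contain special characters literal.
    pattern = "|".join(re.escape(k) for k in keywords)
    return re.compile(pattern, re.IGNORECASE)

def screen_samples(machine_data, rule):
    """Return the records whose text matches the keyword rule."""
    return [text for text in machine_data if rule.search(text)]

# Hypothetical keywords for an under-labeled "sports" category.
sports_rule = build_keyword_rule(["football", "marathon", "league"])
machine_data = [
    "Quarterly bank earnings beat forecasts",
    "Local marathon draws record crowd",
    "Premier League fixtures announced",
]
sports_candidates = screen_samples(machine_data, sports_rule)
```

The matched records are only candidates for the category; they would still go to the labeling platform to be labeled.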
Further, the updating the data to be labeled according to the prediction probability of the classification prediction model on the test data includes:
running the classification prediction model on unlabeled test data to obtain the prediction probability of the test data;
and extracting target test data with the prediction probability within a preset range, and updating the data to be labeled.
Further, the extracting target test data with the prediction probability within a preset range, and the updating the data to be labeled includes:
sorting the target test data by prediction probability from small to large to obtain a labeling order of the target test data;
and adding the target test data to the data to be labeled, and adjusting the labeling order of the data to be labeled.
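A minimal sketch of the ordering step, under stated assumptions: the function name and item identifiers are hypothetical, the preset range is treated as inclusive, and placing the target test data ahead of the items already waiting is only one possible way of "adjusting the labeling order".

```python
def update_annotation_queue(queue, predictions, low, high):
    """Add uncertain test items to the labeling queue, least confident first.

    queue:       list of item ids already waiting to be labeled
    predictions: dict mapping test item id -> prediction probability
    low, high:   bounds of the preset probability range (assumed inclusive)
    """
    targets = [item for item, p in predictions.items() if low <= p <= high]
    # Sort from small to large prediction probability.
    targets.sort(key=lambda item: predictions[item])
    target_set = set(targets)
    # Assumption: target test data is labeled before the existing queue.
    remaining = [item for item in queue if item not in target_set]
    return targets + remaining

queue = ["a", "b"]
predictions = {"t1": 0.58, "t2": 0.51, "t3": 0.90}
new_queue = update_annotation_queue(queue, predictions, low=0.5, high=0.6)
```

Here `t3` stays out of the queue because its prediction probability lies outside the preset range.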
According to another aspect of the present invention, there is provided an apparatus for processing annotation data, the apparatus comprising:
the acquisition unit is used for acquiring sample data under each category randomly extracted from the machine data as data to be marked;
the judging unit is used for counting the labeled data under each category and judging whether the labeled data under each category respectively reach the training standard which is set for the classification prediction model in advance in the process of labeling the data to be labeled based on the labeling platform;
the training unit is used for inputting, if the labeled data under a category reaches the training standard of the corresponding category model, the labeled data of that category into the network model as training data for training to obtain a classification prediction model;
and the updating unit is used for updating the data to be labeled according to the prediction probability of the classification prediction model on the test data.
Further, the judging unit includes:
the statistical module is used for counting the number of data corresponding to the labeled data under each category and taking the number of data reaching a target threshold value preset for the classification prediction model as a training standard;
and the judging module is used for judging whether the number of the data corresponding to the labeled data under each category reaches a target threshold value preset for the classification prediction model.
Further, the apparatus further comprises:
and the deleting unit is used for deleting redundant sample data in the labeled data reaching the training standard category before inputting the labeled data reaching the training standard category as training data into the network model for training to obtain the classification prediction model, so that the number distribution of the data corresponding to the labeled data under each category is the same.
Further, the apparatus further comprises:
and the screening unit is used for screening sample data which does not reach the training standard category from the machine data and labeling the sample data which does not reach the training standard category so as to enable the data labeled under the category to reach the training standard of the category model after counting the labeled data under the category and judging whether the labeled data under the category respectively reach the training standard which is set for the classification prediction model in advance.
Further, the screening unit includes:
the setting module is used for setting a regular-expression matching rule by collecting keywords of the categories that do not reach the training standard;
and the screening module is used for screening sample data of the categories that do not reach the training standard from the machine data based on the regular-expression matching rule.
Further, the update unit includes:
the test module is used for running the classification prediction model on unlabeled test data to obtain the prediction probability of the test data;
and the extraction module is used for extracting the target test data with the prediction probability within a preset range and updating the data to be labeled.
Further, the extraction module is specifically configured to sort the target test data according to the prediction probability from small to large to obtain a labeling order of the target test data;
the extraction module is specifically configured to update the target test data to the data to be labeled, and adjust a labeling sequence of the data to be labeled.
According to yet another aspect of the present invention, there is provided a computer apparatus comprising a memory storing a computer program and a processor implementing the steps of the method of processing annotation data when the processor executes the computer program.
According to yet another aspect of the present invention, there is provided a computer storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of a method of processing annotation data.
By means of the above technical solution, the labeled data under each category is counted while the data to be labeled is being labeled on the labeling platform, and the labeled data of the categories that reach the training standard of the corresponding category model is input, as training data, into the network model of the corresponding category for training, which helps ensure the accuracy of model training. Because the prediction probability obtained during model testing may be unsatisfactory, the data to be labeled is further updated according to the prediction probability of the classification prediction model on the test data, so that the data to be labeled is updated adaptively. Compared with the prior-art approach of processing annotation data only for hot spots with the same name, the embodiment of the invention uses the labeling results counted in real time to adaptively adjust the labeled data until it reaches the training standard of the corresponding category model, which improves the model training effect, and uses the prediction probability to update the data to be labeled during model prediction, which improves the prediction effect of the model.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flow chart of a method for processing annotation data according to an embodiment of the present invention;
FIG. 2 is a flow chart of another method for processing annotation data according to the embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating another processing procedure of annotation data according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a processing apparatus for annotation data according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of another processing apparatus for annotation data according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The embodiment of the invention provides a method for processing annotation data, which can update data to be annotated based on an annotation result of real-time statistics and improve the training effect of a model, and as shown in fig. 1, the method comprises the following steps:
101. Acquiring sample data of each category randomly extracted from the machine data as data to be labeled.
Machine data is the large volume of structured and unstructured data generated by devices or programs in servers, storage, the internet, and the internet of things. Compared with traditional database data, machine data is characterized by large volume, fast growth, high complexity and diverse types. To ensure the diversity of the data, the machine data should include data generated in the operation of various industries, such as finance, sports, tourism, and education.
The machine data is typically unlabeled and may consist of photographs, videos, news articles, tweets, and the like. Machine learning generally uses a large number of data samples for training and testing in order to build a better and more accurate learning model, so that the model can recognize or predict data as a human would. At the initial stage of building the learning model, nothing is known about the machine data, so in the embodiment of the invention samples of each category are randomly extracted from the machine data as the starting data to be labeled, and the data is then labeled for use in subsequent model training.
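The random extraction of starting data can be sketched as below. The function name, the batch size and the record strings are hypothetical; since no category information is available at this stage, the sketch simply samples uniformly from the pool, which is an assumption about how "random extraction" is carried out.

```python
import random

def draw_initial_batch(machine_data, batch_size, seed=None):
    """Randomly pick records from unlabeled machine data as the first labeling batch."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    k = min(batch_size, len(machine_data))  # never ask for more than the pool holds
    return rng.sample(machine_data, k)

pool = [f"record-{i}" for i in range(10)]
batch = draw_initial_batch(pool, 3, seed=0)
```

Sampling without replacement avoids handing the same record to annotators twice in the first batch.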
102. In the process of labeling the data to be labeled based on the labeling platform, counting the labeled data under each category, and judging whether the labeled data under each category respectively reaches the training standard preset for the classification prediction model.
The labeling platform is a privately deployed artificial intelligence platform integrating data labeling and task management, and can provide services such as data planning and data labeling. After the data to be labeled enters the labeling platform, a large-scale manual labeling task is carried out. The number of annotators can be allocated according to the volume of data to be labeled: more annotators when the volume is large, fewer otherwise.
For example, during labeling, the amount of labeled data in the finance category may be large while the amount in the education and music categories is small. The categories with little labeled data then have little training data when the classification prediction model is subsequently trained, which affects the training effect of the model for those categories.
Therefore, in order to ensure the accuracy of the prediction result of the classification prediction model, before the training of the classification prediction model, the labeled data under each category are counted in real time, and whether the labeled data under each category respectively reach the training standard preset for the classification prediction model is judged. The statistical content is not limited here, and may be a data amount of the labeled data or a distribution of the labeled data, and the training criterion set for the classification prediction model may set a fixed data amount for each labeled category, or may set a distribution of the labeled data for each category, and is not limited here. It can be understood that, as long as the labeled data under each category respectively reaches the training standard, the training effect of the classification prediction model will not be greatly affected.
103. If so, inputting the labeled data meeting the training standard category as training data into the network model for training to obtain a classification prediction model.
The labeled data meeting the training requirement of the classification prediction model can be input into a network model as training data for training, the network model can extract data features based on the labeled training data, and a classification prediction model of a mapping relation between input and output is constructed, so that the category of the unlabeled data can be output when the constructed classification prediction model is used for predicting the unlabeled data.
It should be noted that, in the embodiment of the present invention, the network model is not limited; a neural network model, a logistic regression (LR) model, or a support vector machine model may be used, as long as the network model can achieve the classification training effect.
In general, a network model has multiple layers, and classification prediction can be realized by a structure of convolutional layers, pooling layers, fully connected layers and a classification layer. The convolutional layers correspond to the hidden layers of a neural network model; there may be several of them, and they extract deeper characteristic parameters of the labeled data under each category. To reduce the number of parameters and the amount of computation, a pooling layer is often inserted at intervals between consecutive convolutional layers. The fully connected layer is similar to the convolutional layer, whose neurons are connected to a local region of the previous layer's output; to avoid outputting too many feature vectors, two fully connected layers can be provided, which integrate the characteristic parameters output after the training data has passed through the convolutional layers.
104. Updating the data to be labeled according to the prediction probability of the classification prediction model on the test data.
After the classification prediction model has been trained, the training effect is usually verified by running the model on unlabeled test data to obtain prediction probabilities. The prediction probability is equivalent to the prediction confidence of the model and reflects how accurate the model's prediction for the test data is: the higher the prediction probability, the higher the prediction accuracy for that test data, and the lower the prediction probability, the lower the accuracy.
For the embodiment of the invention, test data with a low prediction probability is predicted with low accuracy, which indicates that the features of this part of the test data are difficult for the network model to learn. Such data needs to be fed back to the labeling platform to update the data to be labeled, so that the network model can learn from it and its learning effect improves.
For example, suppose the prediction probabilities of the test data predicted as the sports category are distributed over [0.4-0.9]. Test data with a prediction probability of 0.6-0.9 has a relatively accurate prediction result that deviates little from the actual category, so the network model does not need to learn it further. Test data with a prediction probability of 0.5-0.6 has an unsatisfactory prediction result that deviates more from the actual category; labeling this part of the data lets the network model adapt to it better, which improves the model training effect.
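The selection in this example can be sketched as follows. The names and the per-category probability dictionaries are hypothetical, the prediction probability is taken to be the probability of the predicted category, and the range 0.5-0.6 is treated as half-open, which the text does not specify.

```python
def split_by_confidence(predictions, low=0.5, high=0.6):
    """Separate test items to send back for labeling from items that need no further learning.

    predictions: dict mapping item id -> {category: probability}
    Returns (to_label, other), each a list of (item id, predicted category, probability).
    """
    to_label, other = [], []
    for item_id, class_probs in predictions.items():
        # The predicted category is the one with the highest probability.
        category = max(class_probs, key=class_probs.get)
        confidence = class_probs[category]
        record = (item_id, category, confidence)
        # Assumption: the preset range is low <= p < high.
        if low <= confidence < high:
            to_label.append(record)
        else:
            other.append(record)
    return to_label, other

predictions = {
    "doc1": {"sports": 0.55, "finance": 0.45},
    "doc2": {"sports": 0.72, "finance": 0.28},
}
to_label, other = split_by_confidence(predictions)
```

With these inputs `doc1` is routed back to the labeling platform and `doc2` is left alone.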
According to the embodiment of the invention, the labeling results are counted, excess similar data is deleted adaptively, and missing or sparse data of some categories is supplemented automatically, which can greatly improve labeling efficiency. The labeling results are also fed into the modeling system, so that data can be extracted automatically through the model; compared with random extraction, extracting the data to be labeled according to rules can further improve the accuracy of model training.
The invention provides a method for processing annotation data in which the labeled data under each category is counted while the data to be labeled is being labeled on the labeling platform, and the labeled data of the categories that reach the training standard of the corresponding category model is input, as training data, into the network model of the corresponding category for training, which helps ensure the accuracy of model training. Because the prediction probability obtained during model testing may be unsatisfactory, the data to be labeled is updated according to the prediction probability of the classification prediction model, so that the data to be labeled is updated adaptively. Compared with the prior-art approach of processing annotation data only for hot spots with the same name, the embodiment of the invention uses the labeling results counted in real time to adaptively adjust the labeled data until it reaches the training standard of the corresponding category model, which improves the model training effect, and uses the prediction probability to update the data to be labeled during model prediction, which improves the prediction effect of the model.
The embodiment of the present invention provides another method for processing annotation data, which can automatically update data to be annotated based on an annotation result of real-time statistics, and improve a training effect of a network model, as shown in fig. 2, the method includes:
201. Acquiring sample data of each category randomly extracted from the machine data as data to be labeled.
It is understood that sources of machine data may include, but are not limited to, application logs, the internet of things, GPS positioning, and the like, and that machine data can be applied in different industries. For example, machine data in the financial field may be used for transaction anti-fraud: financial data such as consumption data, consumption time and merchant number may be combined with information from other sources to determine how likely each transaction is to be fraudulent.
For the embodiment of the invention, machine learning needs a large amount of machine data, and supervised machine learning additionally needs labeled data as prior experience, so data of each category is randomly extracted from the machine data as the initial data to be labeled.
It can be understood that, in order to facilitate data annotation by the annotation platform, before data annotation is performed on data to be annotated, data cleaning may be performed on the data to be annotated, for example, data format is structured, and operations such as deleting useless redundant data are performed, so that the data to be annotated meets the annotation requirement of the annotation platform.
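A minimal sketch of such a cleaning pass is given below. The concrete rules (collapsing whitespace, dropping empty records, treating case-insensitive duplicates as redundant) and the output record layout are assumptions chosen for illustration; the text does not say which cleaning operations the labeling platform requires.

```python
def clean_records(raw_records):
    """Normalize raw text records and drop empty or duplicate ones."""
    cleaned = []
    seen = set()
    for raw in raw_records:
        text = " ".join(raw.split())  # collapse runs of whitespace
        if not text:
            continue  # skip records that are empty after trimming
        key = text.lower()
        if key in seen:
            continue  # assumed notion of "redundant": same text ignoring case
        seen.add(key)
        # Hypothetical structured format expected by the labeling platform.
        cleaned.append({"id": len(cleaned), "text": text})
    return cleaned

records = clean_records(
    ["  Bank  raises rates ", "", "bank raises rates", "Marathon results"]
)
```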
202. In the process of labeling the data to be labeled based on the labeling platform, counting the number of data corresponding to the labeled data under each category, and taking the number of data reaching a target threshold value preset for a classification prediction model as a training standard.
For the embodiment of the invention, to ensure the quality of data labeling, the labeling platform can provide labeling samples, labeling templates and the like as references. Data whose category is uncertain can either be discarded or set aside to be labeled together later. The labels themselves may take forms including, but not limited to, text, numbers and codes; for example, an industry name may be used directly, or a code may be assigned. A labeling tool may of course also be used to classify, draw boxes on, annotate or mark the data to be labeled, which is not limited herein.
While the data to be labeled is being labeled on the labeling platform, the amount of data under each category is not known in advance. As labeling proceeds, labeled data accumulates under each category, and counting the number of pieces of labeled data shows how labeling is progressing for each category. For example, suppose the target categories are finance, tourism and sports, and counting yields 10000 pieces of labeled data under finance, 5000 under tourism and 1000 under sports; this shows that the labeled data is unevenly distributed across categories.
For the embodiment of the present invention, the same target threshold may be set as the training standard for the labeled data under every category, for example 10000 pieces of labeled data for each of the three categories above. Alternatively, a different target threshold may be set for each category based on the unique identifier of the category, for example 10000 pieces for the "finance" category, 10000 pieces for the "tourism" category and 9000 pieces for the "sports" category. The way the training standard is set is not limited here.
203. Judging whether the number of pieces of labeled data under each category reaches the target threshold preset for the classification prediction model; if yes, go to step 204a, otherwise go to step 204b.
In the training of the classification prediction model, to ensure the prediction effect of the model, model training can be performed for a category only after its labeled data reaches the training standard, the training standard being that the number of pieces of labeled data reaches the target threshold preset for the classification prediction model. If the training standard is set to 10000 pieces of labeled data for every category, then in the example of step 202 only the labeled data under the finance category meets the training standard and can be input into the network model as training data, whereas the labeled data under the tourism and sports categories does not meet the training standard and cannot be input into the network model for training.
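The counting and threshold check of steps 202 and 203 can be sketched with the figures from the example. The function name and the return shape are hypothetical; thresholds are passed per category, so a uniform standard and a per-category standard are handled the same way.

```python
from collections import Counter

def check_training_standard(labels, thresholds):
    """Compare per-category label counts with per-category target thresholds.

    labels:     iterable with the category of each labeled record
    thresholds: dict mapping category -> required number of labeled records
    Returns (ready, shortfall): categories that reach the standard, and the
    number of records still missing for the categories that do not.
    """
    counts = Counter(labels)
    ready, shortfall = [], {}
    for category, target in thresholds.items():
        have = counts[category]  # Counter yields 0 for unseen categories
        if have >= target:
            ready.append(category)
        else:
            shortfall[category] = target - have
    return ready, shortfall

# Counts from the example: finance 10000, tourism 5000, sports 1000.
labels = ["finance"] * 10000 + ["tourism"] * 5000 + ["sports"] * 1000
uniform = {"finance": 10000, "tourism": 10000, "sports": 10000}
ready, shortfall = check_training_standard(labels, uniform)
```

The `shortfall` values indicate how much more data of each category would have to be screened and labeled before that category can be trained.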
204a. If so, inputting the labeled data under the categories that meet the training standard into the network model as training data for training, to obtain a classification prediction model.
It can be understood that, when the labeled data under the categories that meet the training standard is input into the network model as training data, the labeled data should be kept balanced across categories. For a category that has already reached the training standard, data of that category labeled subsequently may be masked or deleted on the labeling platform, so that the number of pieces of labeled data is the same under each category. For example, if the training standard of the finance category model is ten thousand pieces of labeled data, then once the labeled data of the finance category reaches ten thousand pieces, any finance data beyond that number produced while labeling the data to be labeled is redundant, and the redundant sample data under the category that has reached the training standard may be deleted.
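A minimal sketch of deleting the redundant sample data is given below. It assumes that labeled pieces are held in per-category lists in labeling order and that the earliest-labeled pieces are the ones kept; the text does not say which pieces are treated as redundant, so that choice is an assumption, as is the name `trim_to_standard`.

```python
def trim_to_standard(records_by_category, threshold):
    """Keep at most `threshold` labeled pieces per category and drop the surplus.

    Assumption: each list is in labeling order, so the earliest pieces are kept.
    """
    return {cat: recs[:threshold] for cat, recs in records_by_category.items()}

records = {
    "finance": ["f1", "f2", "f3", "f4"],  # one piece over a threshold of 3
    "tourism": ["t1", "t2"],              # still below the threshold
}
balanced = trim_to_standard(records, threshold=3)
print({cat: len(recs) for cat, recs in balanced.items()})
```

Trimming only caps the categories that have too much data; categories that are still short must be topped up by the screening described in step 204b.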
Here, the network model may be a convolutional neural network model with a multilayer structure, and the process of training the classification prediction model with it may specifically include: extracting features of the labeled data under the categories that meet the training standard through the convolution layer of the convolutional neural network model, to obtain feature parameters of the data under each category; performing dimensionality reduction on the feature parameters through the pooling layer of the convolutional neural network model, to obtain dimension-reduced feature parameters of the data under each category; aggregating the dimension-reduced feature parameters through the fully connected layer of the convolutional neural network model, to obtain weight values of the data on the different categories; and generating, through the classification layer of the convolutional neural network model, a mapping between data features and the categories according to those weight values, thereby constructing the classification prediction model.
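To make the layer sequence concrete, the following pure-Python sketch runs a single forward pass through a convolution layer, a pooling layer, a fully connected layer and a softmax classification layer. It is only an illustration: the input vector, kernel, weights and all function names are made up, a one-dimensional convolution with ReLU is assumed, and training (weight updates) is omitted entirely.

```python
import math

def conv1d(x, kernel):
    """Convolution layer: valid 1-D cross-correlation followed by ReLU."""
    k = len(kernel)
    return [max(0.0, sum(x[i + j] * kernel[j] for j in range(k)))
            for i in range(len(x) - k + 1)]

def max_pool(x, size=2):
    """Pooling layer: non-overlapping max pooling, which reduces the dimension."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]

def dense(x, weights, bias):
    """Fully connected layer: one score per category."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def softmax(scores):
    """Classification layer: turn scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, kernel, weights, bias):
    return softmax(dense(max_pool(conv1d(x, kernel)), weights, bias))

x = [0.2, 0.9, 0.1, 0.4, 0.7, 0.3]               # hypothetical feature vector
kernel = [1.0, -1.0, 0.5]                        # hypothetical convolution kernel
weights = [[0.5, -0.2], [-0.3, 0.8], [0.1, 0.1]] # 3 categories x 2 pooled features
bias = [0.0, 0.0, 0.0]
probs = forward(x, kernel, weights, bias)
print(probs)
```

In practice a deep-learning framework would supply these layers together with the training procedure; the sketch only shows how data flows from features to per-category probabilities.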
204b. Corresponding to step 204a, if not, screening from the machine data sample data of the categories that do not meet the training standard, and labeling that sample data, so that the labeled data under those categories reaches the training standard of the category model.
For a category that does not meet the training standard, a certain number of pieces of labeled data are still needed. Sample data of that category can be further screened from the machine data and sent to the labeling platform for labeling, which adjusts the category distribution of the labeled data and increases the number of pieces of labeled data under that category. For example, if there is little labeled data under the "tourism" category, more tourism data can be extracted from the machine data.
Specifically, a regular matching rule (that is, a regular-expression rule) may be set by collecting keywords of the category that does not meet the training standard. For example, if the category that does not meet the training standard is the sports category, sports-related keywords such as ball, running and exercise may be collected from platforms such as sports websites. Sample data of that category is then screened from the machine data based on the regular matching rule, which uses the keywords of the category as pattern strings and matches the sample data in the machine data that contains them, thereby increasing the number of pieces of labeled data under that category.
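The keyword-based screening described above can be sketched with Python's standard `re` module. The helper names `build_rule` and `screen`, the sample texts, and the choice of case-insensitive substring matching are assumptions made for illustration.

```python
import re

def build_rule(keywords):
    """Compile one alternation pattern from a category's keywords."""
    return re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)

def screen(machine_data, rule):
    """Return the pieces of machine data that contain at least one keyword."""
    return [text for text in machine_data if rule.search(text)]

sports_rule = build_rule(["ball", "running", "exercise"])
machine_data = [
    "The football final kicks off tonight",
    "Central bank holds rates steady",
    "Running shoes reviewed for beginners",
    "Five island resorts for winter",
]
candidates = screen(machine_data, sports_rule)
print(candidates)
```

Plain substring matching is deliberately loose: a keyword such as "ball" would also match unrelated words like "ballot". That is acceptable here because the screened data is still sent to the labeling platform, where the final category is assigned.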
205a. Running the classification prediction model on unlabeled test data to obtain the prediction probability of the test data.
Because the category of unlabeled test data is unknown, the classification prediction model can only output the prediction probability of the test data for each category; in general, the category with the highest prediction probability is taken as the classification prediction result for the test data. Therefore, the higher the prediction probability of the test data, the more accurate the predicted category is likely to be.
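Taking the category with the highest prediction probability as the result can be sketched as below. The dictionary of per-category probabilities and the function name `classify` are illustrative assumptions.

```python
def classify(probabilities):
    """Return (category, probability) for the most likely category."""
    category = max(probabilities, key=probabilities.get)
    return category, probabilities[category]

# Hypothetical model output for one piece of unlabeled test data.
output = {"finance": 0.7, "tourism": 0.2, "sports": 0.1}
label, confidence = classify(output)
print(label, confidence)
```

The returned probability is the value the later steps treat as the prediction probability of the test data.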
206a. Extracting target test data whose prediction probability is within a preset range, and updating the data to be labeled.
The target test data whose prediction probability is within the preset range is usually the test data whose prediction result has low accuracy; of course, to further ensure the model training effect, the data to be labeled may also be updated based on all of the test data. Preferably, the prediction probability is greater than 1/N, where N is the number of categories; for example, when there are 2 categories, test data with a prediction probability of 0.5 may be screened out. The highest prediction probability over N categories can never be lower than 1/N, so values close to 1/N mark the least certain predictions.
When updating the data to be labeled, test data with a lower prediction probability usually needs to be learned first and is therefore labeled first. The target test data may be sorted by prediction probability in ascending order to obtain its labeling order, which ensures that test data with larger prediction error is labeled first; the target test data is then added to the data to be labeled, and the labeling order of the data to be labeled is adjusted accordingly.
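Step 206a can be sketched as a filter on the prediction probability followed by an ascending sort. The upper bound of 0.6, the item identifiers and the function name `select_for_labeling` are assumptions chosen for illustration; the text only states that the range is preset and suggests 1/N as a lower bound.

```python
def select_for_labeling(predictions, low, high):
    """Pick test items whose top-category probability lies in [low, high],
    ordered from least to most confident."""
    picked = [(item, p) for item, p in predictions if low <= p <= high]
    return sorted(picked, key=lambda pair: pair[1])

n_categories = 3
# Hypothetical (item id, highest prediction probability) pairs.
predictions = [("doc-a", 0.95), ("doc-b", 0.40), ("doc-c", 0.55), ("doc-d", 0.36)]
queue = select_for_labeling(predictions, low=1 / n_categories, high=0.6)
print(queue)
```

The resulting queue puts the least confident predictions at the front, so they reach the labeling platform first.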
The specific process of labeling data may be as shown in fig. 3. First, machine data is randomly extracted as the data to be labeled, and the data to be labeled is labeled on the labeling platform. The labeling results are counted in real time to determine whether the labeled data under each category reaches the model training standard. If so, a classification prediction model is trained; otherwise, data to be labeled of the categories that do not reach the model training standard is extracted based on the regular matching rule and returned to the labeling platform for labeling, so that the labeled data output under each category is balanced. The classification prediction model is trained while labeling proceeds. After the classification prediction model is trained, it is tested with unlabeled data; during testing, target test data whose prediction probability is within the preset range is extracted and returned to the labeling platform for labeling.
It should be noted that the whole process of handling the labeled data is a continuous cycle. In the model training stage, the labeled data under each category can be adaptively adjusted to keep the labeled data output by the system balanced across categories; in the model testing stage, data with a poor prediction effect can be returned to the labeling platform for labeling, so that the model keeps adjusting its feature parameters during training and the prediction effect of the classification prediction model is continuously optimized.
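One pass of this cycle can be sketched as a single function. Everything here is hypothetical scaffolding: `train` and `predict` stand in for the real model code and are replaced by stubs, and the probability range is an arbitrary choice for the example.

```python
def run_round(counts, thresholds, train, predict, test_pool, low, high):
    """One pass of the cycle: find ready categories, train on them, and
    queue uncertain test items for labeling (least confident first)."""
    ready = [c for c, t in thresholds.items() if counts.get(c, 0) >= t]
    lacking = [c for c in thresholds if c not in ready]
    relabel = []
    if ready:
        model = train(ready)
        scored = sorted(((predict(model, x), x) for x in test_pool),
                        key=lambda s: s[0])
        relabel = [x for p, x in scored if low <= p <= high]
    return ready, lacking, relabel

# Stubs standing in for real training and prediction.
def train(categories):
    return {"categories": categories}

def predict(model, item):
    return {"doc-a": 0.9, "doc-b": 0.4, "doc-c": 0.55}[item]

ready, lacking, relabel = run_round(
    counts={"finance": 10000, "tourism": 5000},
    thresholds={"finance": 10000, "tourism": 10000},
    train=train, predict=predict,
    test_pool=["doc-a", "doc-b", "doc-c"], low=0.3, high=0.6,
)
print(ready, lacking, relabel)
```

The categories in `lacking` would be handed to the keyword screening of step 204b, and the items in `relabel` would be returned to the labeling platform, after which the round is repeated.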
Further, as a specific implementation of the method shown in fig. 1, an embodiment of the present invention provides a device for processing annotation data, where as shown in fig. 4, the device includes: an acquisition unit 31, a judgment unit 32, a training unit 33, and an update unit 34.
The acquiring unit 31 may be configured to acquire sample data in each category randomly extracted from machine data, as data to be labeled;
the judging unit 32 may be configured to, in a process of labeling the data to be labeled based on the labeling platform, count the labeled data under each category, and judge whether the labeled data under each category respectively reach a training standard preset for the classification prediction model;
the training unit 33 may be configured to, if the data labeled under the category meets the training standard of the corresponding category model, input the data labeled under the category meeting the training standard as training data into the network model for training, so as to obtain a classification prediction model;
and the updating unit 34 is configured to update the data to be labeled according to the prediction probability of the classification prediction model on the test data.
The invention provides a device for processing labeled data, which counts the labeled data under each category while the data to be labeled is being labeled on a labeling platform, and inputs the labeled data that meets the training standard of the corresponding category model into the network model of the corresponding category as training data, thereby ensuring the accuracy of model training. Further, when the prediction probability obtained during model testing is not ideal, the device updates the data to be labeled according to the prediction probability of the classification prediction model, so that the data to be labeled is updated adaptively. Compared with the prior-art manner of processing labeled data only for hot spots with the same name, the embodiment of the invention uses the labeling results counted in real time to adaptively adjust the labeled data until it reaches the training standard of the corresponding category model, which improves the model training effect, and updates the data to be labeled during model prediction in combination with the prediction probability, which improves the prediction effect of the model.
As a further explanation of the device for processing labeled data shown in fig. 4, fig. 5 is a schematic structural diagram of another device for processing labeled data according to an embodiment of the present invention. As shown in fig. 5, the judging unit 32 includes:
the counting module 321 may be configured to count the number of pieces of data corresponding to the labeled data under each category, and take the number of pieces of data reaching a target threshold preset for the classification prediction model as the training standard;
the judging module 322 may be configured to judge whether the number of pieces of data corresponding to the labeled data under each category reaches the target threshold preset for the classification prediction model.
Further, the apparatus further comprises:
the deleting unit 35 may be configured to delete redundant sample data in the labeled data reaching the training standard category before the labeled data reaching the training standard category is input to the network model as training data for training to obtain the classification prediction model, so that the number distribution of the data corresponding to the labeled data under each category is the same.
Further, the apparatus further comprises:
the screening unit 36 may be configured to, after counting the data labeled under each category and determining whether the data labeled under each category respectively meets the training standard set for the classification prediction model in advance, if the data labeled under each category does not meet the training standard of the corresponding category model, screen sample data that does not meet the training standard from the machine data, and label the sample data that does not meet the training standard, so that the data labeled under each category meets the training standard of the category model.
Further, the screening unit 36 includes:
the setting module 361 can be used for setting a regular matching rule by collecting keywords which do not reach the training standard category;
the screening module 362 may be configured to screen, from the machine data, sample data that does not meet the category of the training standard based on the regular matching rule.
Further, the updating unit 34 includes:
the test module 341 is configured to run the classification prediction model on the unlabeled test data to obtain the prediction probability of the test data;
the extracting module 342 may be configured to extract target test data with a prediction probability within a preset range, and update the data to be labeled.
Further, the extracting module 342 may be specifically configured to sort the target test data according to the prediction probability from small to large to obtain a labeling order of the target test data;
the extracting module 342 may be further configured to update the target test data into the data to be labeled, and adjust a labeling sequence of the data to be labeled.
It should be noted that other corresponding descriptions of the functional units related to the processing apparatus for annotation data provided in this embodiment may refer to the corresponding descriptions in fig. 1 and fig. 2, and are not described herein again.
Based on the above-mentioned methods as shown in fig. 1 and fig. 2, correspondingly, the present embodiment further provides a storage medium, on which a computer program is stored, and the program is executed by a processor to implement the above-mentioned processing method of the annotation data as shown in fig. 1 and fig. 2.
Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) and includes several instructions for enabling a computer device (which may be a personal computer, a server, a network device, etc.) to execute the method according to the implementation scenarios of the present application.
Based on the methods shown in fig. 1 and fig. 2 and the virtual device embodiments shown in fig. 4 and fig. 5, in order to achieve the above object, an embodiment of the present application further provides a computer device, which may specifically be a personal computer, a server, a network device or the like. The physical device includes a storage medium and a processor: the storage medium is configured to store a computer program, and the processor is configured to execute the computer program to implement the above-mentioned processing method of the annotation data shown in fig. 1 and fig. 2.
Optionally, the computer device may also include a user interface, a network interface, a camera, radio frequency (RF) circuitry, sensors, audio circuitry, a Wi-Fi module, and so forth. The user interface may include a display screen and an input unit such as a keyboard, and may optionally also include a USB interface, a card reader interface, and the like. The network interface may optionally include a standard wired interface and a wireless interface (e.g., a Bluetooth interface or a Wi-Fi interface).
It will be understood by those skilled in the art that the structure of the physical device for processing annotation data provided in this embodiment does not limit the physical device, which may include more or fewer components, combine some components, or arrange the components differently.
The storage medium may further include an operating system and a network communication module. The operating system is a program that manages the hardware and software resources of the computer device described above and supports the operation of the information processing program and other software and/or programs. The network communication module is used to implement communication among the components within the storage medium, as well as communication with other hardware and software in the physical device.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by software plus a necessary general hardware platform, and can also be implemented by hardware. By applying the technical solution of the present application, compared with the prior art, the labeled data is adaptively adjusted, using the labeling results counted in real time, until it reaches the training standard of the corresponding category model, which improves the model training effect; and the data to be labeled is updated during model prediction in combination with the prediction probability, which improves the prediction effect of the model.
Those skilled in the art will appreciate that the figures are merely schematic representations of one preferred implementation scenario and that the blocks or flow diagrams in the figures are not necessarily required to practice the present application. Those skilled in the art will appreciate that the modules in the devices in the implementation scenario may be distributed in the devices in the implementation scenario according to the description of the implementation scenario, or may be located in one or more devices different from the present implementation scenario with corresponding changes. The modules of the implementation scenario may be combined into one module, or may be further split into a plurality of sub-modules.
The above application serial numbers are for description purposes only and do not represent the superiority or inferiority of the implementation scenarios. The above disclosure is only a few specific implementation scenarios of the present application, but the present application is not limited thereto, and any variations that can be made by those skilled in the art are intended to fall within the scope of the present application.

Claims (10)

1. A method for processing annotation data, the method comprising:
acquiring sample data of each category randomly extracted from machine data as data to be labeled;
in the process of labeling the data to be labeled based on the labeling platform, counting the labeled data under each category, and judging whether the labeled data under each category respectively reach the training standard preset for the classification prediction model;
if so, inputting the labeled data meeting the training standard category as training data into a network model for training to obtain a classification prediction model;
and updating the data to be labeled according to the prediction probability of the classification prediction model on the test data.
2. The method according to claim 1, wherein the counting labeled data under each category and determining whether the labeled data under each category respectively meet a training standard preset for a classification prediction model comprises:
counting the number of data corresponding to the labeled data under each category, and taking the number of data reaching a target threshold value preset for a classification prediction model as a training standard;
and judging whether the number of the data corresponding to the labeled data under each category reaches a target threshold value preset for the classification prediction model.
3. The method according to claim 2, wherein before the data labeled under the training standard class is input into the network model as training data for training, and a classification prediction model is obtained, the method further comprises:
and deleting redundant sample data in the labeled data meeting the training standard category so as to ensure that the number distribution of the data corresponding to the labeled data under each category is the same.
4. The method according to claim 1, wherein after the counting the labeled data under each category and determining whether the labeled data under each category respectively reach the training criteria set for the classification prediction model in advance, the method further comprises:
if not, screening sample data which does not meet the training standard category from the machine data, and labeling the sample data which does not meet the training standard category so as to enable the labeled data under the category to meet the training standard of the category model.
5. The method of claim 4, wherein the screening of machine data for sample data that does not meet the category of training criteria comprises:
setting a regular matching rule by collecting keywords which do not reach the training standard category;
and screening sample data which does not reach the training standard category from the machine data based on the regular matching rule.
6. The method according to claim 1, wherein the updating the data to be labeled according to the prediction probability of the classification prediction model on the test data comprises:
running the classification prediction model on unlabeled test data to obtain the prediction probability of the test data;
and extracting target test data with the prediction probability within a preset range, and updating the data to be labeled.
7. The method according to claim 6, wherein the extracting of the target test data with the prediction probability within the preset range and the updating of the data to be labeled comprise:
sequencing the target test data according to the prediction probability from small to large to obtain a labeling sequence of the target test data;
and updating the target test data into the data to be labeled, and adjusting the labeling sequence of the data to be labeled.
8. An apparatus for processing annotation data, the apparatus comprising:
the acquisition unit is used for acquiring sample data under each category randomly extracted from the machine data as data to be labeled;
the judging unit is used for counting the labeled data under each category and judging whether the labeled data under each category respectively reach the training standard which is set for the classification prediction model in advance in the process of labeling the data to be labeled based on the labeling platform;
the training unit is used for inputting the labeled data meeting the training standard category into the network model as training data for training to obtain a classification prediction model if the labeled data meeting the training standard category meets the training standard of the corresponding category model;
and the updating unit is used for updating the data to be labeled according to the prediction probability of the classification prediction model on the test data.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer storage medium on which a computer program is stored, characterized in that the computer program, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN201910739900.0A 2019-08-12 2019-08-12 Method and device for processing labeled data Pending CN110610193A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910739900.0A CN110610193A (en) 2019-08-12 2019-08-12 Method and device for processing labeled data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910739900.0A CN110610193A (en) 2019-08-12 2019-08-12 Method and device for processing labeled data

Publications (1)

Publication Number Publication Date
CN110610193A true CN110610193A (en) 2019-12-24

Family

ID=68889899

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910739900.0A Pending CN110610193A (en) 2019-08-12 2019-08-12 Method and device for processing labeled data

Country Status (1)

Country Link
CN (1) CN110610193A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111274391A (en) * 2020-01-15 2020-06-12 北京百度网讯科技有限公司 SPO extraction method and device, electronic equipment and storage medium
CN111401980A (en) * 2020-02-19 2020-07-10 北京值得买科技股份有限公司 Method and device for improving sample sequencing diversity
CN111581615A (en) * 2020-05-08 2020-08-25 南京大创师智能科技有限公司 Method and system for providing artificial intelligence platform for individuals
CN111583199A (en) * 2020-04-24 2020-08-25 上海联影智能医疗科技有限公司 Sample image annotation method and device, computer equipment and storage medium
CN111858341A (en) * 2020-07-23 2020-10-30 深圳慕智科技有限公司 Test data measurement method based on neuron coverage
CN111898661A (en) * 2020-07-17 2020-11-06 交控科技股份有限公司 Method and device for monitoring working state of turnout switch machine
CN112000808A (en) * 2020-09-29 2020-11-27 迪爱斯信息技术股份有限公司 Data processing method and device and readable storage medium
CN112446441A (en) * 2021-02-01 2021-03-05 北京世纪好未来教育科技有限公司 Model training data screening method, device, equipment and storage medium
CN112905835A (en) * 2021-02-26 2021-06-04 成都潜在人工智能科技有限公司 Multi-mode music title generation method and device and storage medium
CN113627509A (en) * 2021-08-04 2021-11-09 口碑(上海)信息技术有限公司 Data classification method and device, computer equipment and computer readable storage medium
CN114973056A (en) * 2022-03-28 2022-08-30 华中农业大学 Information density-based fast video image segmentation and annotation method

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111274391A (en) * 2020-01-15 2020-06-12 北京百度网讯科技有限公司 SPO extraction method and device, electronic equipment and storage medium
US20210216819A1 (en) * 2020-01-15 2021-07-15 Beijing Baidu Netcom Science And Technology Co., Ltd. Method, electronic device, and storage medium for extracting spo triples
CN111274391B (en) * 2020-01-15 2023-09-01 北京百度网讯科技有限公司 SPO extraction method and device, electronic equipment and storage medium
CN111401980A (en) * 2020-02-19 2020-07-10 北京值得买科技股份有限公司 Method and device for improving sample sequencing diversity
CN111583199B (en) * 2020-04-24 2023-05-26 上海联影智能医疗科技有限公司 Sample image labeling method, device, computer equipment and storage medium
CN111583199A (en) * 2020-04-24 2020-08-25 上海联影智能医疗科技有限公司 Sample image annotation method and device, computer equipment and storage medium
CN111581615A (en) * 2020-05-08 2020-08-25 南京大创师智能科技有限公司 Method and system for providing artificial intelligence platform for individuals
CN111898661A (en) * 2020-07-17 2020-11-06 交控科技股份有限公司 Method and device for monitoring working state of turnout switch machine
CN111858341A (en) * 2020-07-23 2020-10-30 深圳慕智科技有限公司 Test data measurement method based on neuron coverage
CN112000808A (en) * 2020-09-29 2020-11-27 迪爱斯信息技术股份有限公司 Data processing method and device and readable storage medium
CN112000808B (en) * 2020-09-29 2024-04-16 迪爱斯信息技术股份有限公司 Data processing method and device and readable storage medium
CN112446441A (en) * 2021-02-01 2021-03-05 北京世纪好未来教育科技有限公司 Model training data screening method, device, equipment and storage medium
CN112905835A (en) * 2021-02-26 2021-06-04 成都潜在人工智能科技有限公司 Multi-mode music title generation method and device and storage medium
CN112905835B (en) * 2021-02-26 2022-11-11 成都潜在人工智能科技有限公司 Multi-mode music title generation method and device and storage medium
CN113627509A (en) * 2021-08-04 2021-11-09 口碑(上海)信息技术有限公司 Data classification method and device, computer equipment and computer readable storage medium
CN113627509B (en) * 2021-08-04 2024-05-10 口碑(上海)信息技术有限公司 Data classification method, device, computer equipment and computer readable storage medium
CN114973056A (en) * 2022-03-28 2022-08-30 华中农业大学 Information density-based fast video image segmentation and annotation method

Similar Documents

Publication Publication Date Title
CN110610193A (en) Method and device for processing labeled data
CN112632385B (en) Course recommendation method, course recommendation device, computer equipment and medium
CN108960409B (en) Method and device for generating annotation data and computer-readable storage medium
CN108960719B (en) Method and device for selecting products and computer readable storage medium
CN110163647B (en) Data processing method and device
CN112494952B (en) Target game user detection method, device and equipment
CN109902708A (en) A kind of recommended models training method and relevant apparatus
CN109299258A (en) A kind of public sentiment event detecting method, device and equipment
CN111177569A (en) Recommendation processing method, device and equipment based on artificial intelligence
CN108197668A (en) The method for building up and cloud system of model data collection
CN109711424B (en) Behavior rule acquisition method, device and equipment based on decision tree
CN105069470A (en) Classification model training method and device
CN107545038B (en) Text classification method and equipment
CN112908436B (en) Clinical test data structuring method, clinical test recommending method and device
CN109993057A (en) Method for recognizing semantics, device, equipment and computer readable storage medium
CN107818491A (en) Electronic installation, Products Show method and storage medium based on user's Internet data
CN109902823B (en) Model training method and device based on generation countermeasure network
CN110647995A (en) Rule training method, device, equipment and storage medium
CN107368526A (en) A kind of data processing method and device
CN106203103A (en) The method for detecting virus of file and device
CN108345979B (en) Service testing method and device
CN110929169A (en) Position recommendation method based on improved Canopy clustering collaborative filtering algorithm
CN110020147A (en) Model generates, method for distinguishing, system, equipment and storage medium are known in comment
CN112308603A (en) Similarity expansion-based rapid store site selection method and device and storage medium
CN114048294B (en) Similar population extension model training method, similar population extension method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191224
