CN114330570A - Open set data labeling method, device, equipment, storage medium and program product - Google Patents

Open set data labeling method, device, equipment, storage medium and program product

Info

Publication number
CN114330570A
CN114330570A
Authority
CN
China
Prior art keywords: sample, samples, known class, class, predicted
Prior art date
Legal status
Pending
Application number
CN202111660140.8A
Other languages
Chinese (zh)
Inventor
赵珣
宁鲲鹏
李昱
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202111660140.8A
Publication of CN114330570A

Landscapes

  • Image Analysis (AREA)

Abstract

The application discloses a method, an apparatus, a device, a storage medium, and a program product for labeling open set data, relating to the technical field of artificial intelligence. The method comprises the following steps: acquiring open set data comprising a plurality of samples; identifying N predicted known class samples from the open set data through a recognizer; and selecting, from the N predicted known class samples, the predicted known class samples whose reliability satisfies a condition as training samples for a classifier. The method performs primary recognition of the samples in the open set data through the recognizer to obtain predicted known class samples, and then performs secondary screening of the predicted known class samples by reliability to obtain training samples for training the classifier. By combining primary recognition by the recognizer with secondary screening based on reliability, the accuracy of the known class samples selected from the open set data is fully ensured.

Description

Open set data labeling method, device, equipment, storage medium and program product
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a device, a storage medium, and a program product for tagging open-set data.
Background
When target data in a data set is to be used for subsequent training or analysis, the target data first needs to be labeled and selected from the data set.
For a data set containing only known class samples, the classes of the samples in the data set can be obtained by identifying them with a recognizer, and the samples belonging to the target classes can then be selected for subsequent training or analysis.
However, for a data set containing samples of known classes and samples of unknown classes, the above method cannot accurately distinguish the samples of known classes from the samples of unknown classes.
Disclosure of Invention
The embodiment of the application provides an open set data labeling method, device, equipment, storage medium and program product. The technical scheme is as follows:
according to an aspect of an embodiment of the present application, there is provided a method for annotating open set data, the method including:
acquiring open set data comprising a plurality of samples, the plurality of samples comprising at least one known class sample and at least one unknown class sample; the known class samples refer to samples of which real classes belong to K known classes, the unknown class samples refer to samples of which real classes do not belong to the K known classes, and K is a positive integer;
identifying N predicted known category samples from the open set data through an identifier; wherein the identifier is configured to identify the sample as the known class sample or the unknown class sample, the predicted known class sample refers to the sample identified by the identifier as the known class sample, and N is a positive integer;
selecting a prediction known class sample with the reliability meeting the condition from the N prediction known class samples as a training sample of the classifier; wherein the classifier is configured to classify the K known classes.
According to an aspect of an embodiment of the present application, there is provided an apparatus for labeling open set data, the apparatus including:
a sample acquisition module for acquiring open set data comprising a plurality of samples, the plurality of samples including at least one known class sample and at least one unknown class sample; the known class samples refer to samples of which real classes belong to K known classes, the unknown class samples refer to samples of which real classes do not belong to the K known classes, and K is a positive integer;
the sample identification module is used for identifying N predicted known class samples from the open set data through an identifier; wherein the identifier is configured to identify the sample as the known class sample or the unknown class sample, the predicted known class sample refers to the sample identified by the identifier as the known class sample, and N is a positive integer;
the sample selection module is used for selecting a prediction known class sample with the reliability meeting the condition from the N prediction known class samples as a training sample of the classifier; wherein the classifier is configured to classify the K known classes.
According to an aspect of the embodiments of the present application, there is provided a computer device, including a processor and a memory, where at least one instruction, at least one program, a code set, or a set of instructions is stored in the memory, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by the processor to implement the method for tagging open set data.
According to an aspect of the embodiments of the present application, there is provided a computer-readable storage medium having at least one instruction, at least one program, a set of codes, or a set of instructions stored therein, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by a processor to implement the method for annotating open set data described above.
According to an aspect of the embodiments of the present application, there is provided a computer program product, the computer program product includes computer instructions, the computer instructions are stored in a computer-readable storage medium, and a processor reads and executes the computer instructions from the computer-readable storage medium, so as to implement the method for annotating open set data.
The technical scheme provided by the embodiment of the application can have the following beneficial effects:
and carrying out primary identification on the samples in the open set data through an identifier to obtain a predicted known class sample, and then carrying out secondary screening on the predicted known class sample through the credibility to obtain a training sample for training the classifier. The method solves the problem that the common labeling method cannot be applied to the open set data, and provides the method for identifying and labeling the known class samples from the open set data. In addition, the method utilizes the recognizer to perform primary recognition, and then performs secondary screening based on the credibility, so that the accuracy of the known class samples selected from the open set data is fully ensured.
Drawings
FIG. 1 is a schematic illustration of an environment for implementing an embodiment provided by an embodiment of the present application;
FIG. 2 is a flow chart of a method for annotating open-set data according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a method for annotating open-set data according to an embodiment of the present application;
FIG. 4 is a flow chart of a method for annotating open-set data according to another embodiment of the present application;
FIG. 5 is a graph of experimental results provided by one embodiment of the present application;
FIG. 6 is a graph of experimental results provided by another embodiment of the present application;
FIG. 7 is a graph of experimental results provided by another embodiment of the present application;
FIG. 8 is a block diagram of an apparatus for annotating open-set data provided in an embodiment of the present application;
FIG. 9 is a block diagram of an apparatus for annotating open-set data provided in accordance with another embodiment of the present application;
fig. 10 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine Learning (ML) is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specially studies how a computer can simulate or realize human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
The present application relates to the field of machine learning of artificial intelligence technology, and the following describes the technical solution of the present application with several embodiments.
Before describing the embodiments of the present application, some terms referred to in the present application will be explained.
1. Active Learning (AL): a model training approach that attempts to train a better-performing model by selectively labeling less data.
2. Open-set data: is a set of samples that contains samples of known classes and samples of unknown classes.
3. Known class sample: a sample whose category is one of the at least one category that the recognizer can identify.
4. Unknown class sample: a sample whose category is not any of the at least one category that the recognizer can identify.
5. Open-set Annotation (OSA for short): refers to labeling the category of the sample in the open set data.
6. Open-set Recognition (OSR): refers to the identification of classes for samples in the open set of data.
7. Gaussian Mixture Model (GMM): a probabilistic model that represents an overall distribution as containing a plurality of sub-distributions. A Gaussian mixture model can be regarded as a combination of multiple single Gaussian models, with the assignment to the sub-models being a latent variable of the mixture model.
Refer to fig. 1, which illustrates a schematic diagram of an environment for implementing an embodiment of the present application. The embodiment implementation environment can be implemented as an annotation system for open-set data. The embodiment implementation environment may include a model training apparatus 10 and a model using apparatus 20.
The model training device 10 may be an electronic device such as a computer, a server, an intelligent robot, or some other electronic device with greater computing power. The model training device 10 comprises a recognizer 30 and a classifier 40: the recognizer 30 is used for labeling the open set data to obtain labeled known class samples, and the classifier 40 is trained with the labeled known class samples. In the embodiment of the present application, labeling the open set data refers to performing category labeling on the samples in the open set data; the labeling results are used to obtain the training samples of the classifier 40. The recognizer 30 in the model training device 10 is trained in an active learning manner: the recognizer 30 labels samples in the open set data, labeled known class samples are obtained through manual review, and the recognizer 30 is trained with the labeled known class samples; the trained recognizer is obtained through multiple rounds of training. The classifier 40 is trained in a supervised learning manner: the samples in the open set data are labeled by the recognizer 30 to obtain labeled known class samples, and the classifier 40 is trained with the labeled known class samples.
The trained classifier 40 can be deployed in the model using apparatus 20 for class recognition of the target sample. The model using device 20 may be a terminal device such as a mobile phone, a computer, a smart television, a multimedia playing device, a wearable device, a medical device, or a server, which is not limited in this application.
In some embodiments, as shown in fig. 1, the recognizer 30 obtains the known class samples 32 by labeling and extracting from the open set data 31, where the known class samples 32 are the pictures marked with a bold frame in the open set data 31. Class labeling of the known class samples 32 yields training samples, which are then used to train the classifier 40.
In the following method embodiments, the using process and the training process of the recognizer 30, and the training process of the classifier 40 will be described in detail.
Referring to fig. 2, a flowchart of an annotation method for open-set data according to an embodiment of the present application is shown. The execution subject of the method may be the model training apparatus 10 shown in fig. 1, and the steps may be executed by the model training apparatus 10. The method can comprise at least one of the following steps (210-230):
Step 210, open set data comprising a plurality of samples is obtained.
The open set data comprises a plurality of samples, the plurality of samples comprise at least one known class sample and at least one unknown class sample, the known class sample refers to a sample of which a real class belongs to K known classes, the unknown class sample refers to a sample of which the real class does not belong to the K known classes, and K is a positive integer.
Illustratively, take the samples in the open set data to be picture samples, and suppose the K known categories include 3 known categories: football, basketball, and volleyball. The known class samples in the collected data are then the samples whose category is football, basketball, or volleyball, while the unknown class samples are the samples whose category is none of football, basketball, and volleyball; picture samples of categories such as swimming, house, and rainbow belong to the unknown class samples.
It should be noted that, the model training apparatus acquires the open set data including a plurality of samples, each sample in the open set data is not labeled with a known or unknown label, and the known class sample and the unknown class sample need to be distinguished from the open set data by the method provided in this embodiment.
Step 220, identifying N predicted known class samples from the open set data through the identifier, where N is a positive integer.
Optionally, the identifier is used to identify the sample as a known class sample or an unknown class sample. The recognizer may be a neural network model, and in the embodiment of the present application, the network structure of the recognizer is not limited. In some embodiments, the identifier is used to identify known class samples and unknown class samples. In other embodiments, the identifier is configured to identify known class samples corresponding to the K known classes, and unknown class samples.
A predicted known class sample refers to a sample that the recognizer recognizes as a known class sample. That is, a predicted known class sample is a sample that the recognizer selects from the open set data and considers to be a known class sample. However, since the recognition result of the recognizer is not necessarily accurate, a predicted known class sample may be a true known class sample but may also be an unknown class sample.
Optionally, the recognizer obtains predicted known class samples and predicted unknown class samples, where these are the prediction results obtained by the recognizer, and the true classes of the samples may differ from the prediction results. For example, suppose the known classes set by the model training device are basketball and football, and the real class of a certain sample is football. If the recognizer recognizes the sample and obtains the predicted class football, it marks the sample as a known class sample, and the prediction is correct. Optionally, if the recognizer obtains the predicted class basketball, it still marks the sample as a known class sample, but the prediction is incorrect; the solution for this case is given in the following embodiments. Optionally, if the recognizer obtains the predicted class volleyball, which is not one of the known classes, it marks the sample as an unknown class sample, and the prediction is also incorrect. Similarly, the recognizer may also identify an unknown class sample as a known class sample, which is not described in detail herein. In summary, the recognizer can make recognition errors, and therefore it needs to be trained for recognition accuracy; a specific training method is described in the following embodiments.
Optionally, the recognizer derives predicted known class samples and predicted unknown class samples from the open set data by prediction; the number of predicted known class samples plus the number of predicted unknown class samples equals the number of samples in the open set data. A predicted unknown class sample is a sample that the recognizer identifies as an unknown class sample. Since the predicted unknown class samples are used neither in the training of the recognizer nor in the training of the classifier, only the predicted known class samples need to be selected for the subsequent process.
In some embodiments, the predicted known class samples from the recognizer are sampled to obtain N predicted known class samples. For example, 1000000 samples are included in the open-set data, 100000 samples of the predicted known class and 900000 samples of the predicted unknown class are obtained through prediction by the recognizer, and then 1000 samples of the predicted known class are obtained through random sampling from the 100000 samples of the predicted known class, and the next operation is performed. In this example, the N prediction known class samples in step 220 may be understood as the above 1000 prediction known class samples.
In some embodiments, as shown in FIG. 3, FIG. 3 illustrates a schematic diagram of the recognizer recognizing samples in the open set data. Optionally, fig. 3 is composed of the recognition process of the recognizer 30, the training process of the recognizer 30, and the training process of the classifier 40. The recognition process of the recognizer 30 is used to label and select known class samples from the open set data for the training of the recognizer 30 and the classifier 40. In fig. 3, the open set data is input into the recognizer 30, and N predicted known class samples are obtained.
Step 230, selecting a prediction known class sample with the reliability meeting the condition from the N prediction known class samples as a training sample of the classifier; wherein the classifier is configured to classify the K known classes.
The reliability condition is used to perform a secondary prediction on the predicted known class samples identified by the recognizer: through secondary screening, the unknown class samples among the predicted known class samples are distinguished as far as possible, yielding the real known class samples among them, thereby improving the accuracy of the finally selected known class samples. Optionally, as shown in fig. 3, a Gaussian mixture model (GMM) performs the secondary screening on the predicted known class samples predicted by the recognizer 30 to obtain the screened predicted known class samples. Optionally, the screened predicted known class samples are used as training samples for training the classifier.
According to this method, the samples in the open set data are first subjected to primary recognition by the recognizer to obtain predicted known class samples, and the predicted known class samples are then subjected to secondary screening by reliability to obtain training samples for training the classifier. This solves the problem that common labeling methods cannot be applied to open set data, and provides a method for identifying and labeling known class samples from open set data. In addition, because primary recognition by the recognizer is followed by secondary screening based on reliability, the accuracy of the known class samples selected from the open set data is fully ensured.
Referring to fig. 4, a flowchart of an annotation method for open-set data according to another embodiment of the present application is shown. The execution subject of the method may be the model training apparatus 10 shown in fig. 1, and the steps may be executed by the model training apparatus 10. The method may comprise at least one of the following steps (410-470):
step 410, acquiring open set data comprising a plurality of samples, wherein the plurality of samples comprise at least one known class sample and at least one unknown class sample; the known class samples refer to samples of which the real class belongs to K known classes, the unknown class samples refer to samples of which the real class does not belong to the K known classes, and K is a positive integer.
For the description of step 410, please refer to the above embodiments, which are not described herein.
Step 420, for a target sample in the open set data, the target sample is input to the identifier.
The target sample may be any one of the open set data. The target sample may be a known class sample or an unknown class sample.
And inputting the target sample into a recognizer, and predicting the category of the target sample by the recognizer. As shown in fig. 3, the samples in the open set data are input to the identifier 30, and the identifier 30 performs the class prediction on the open set data.
Step 430, obtaining K+1 activation values corresponding to the target sample through the identifier; the K+1 activation values correspond one-to-one to K+1 classes, and the K+1 classes comprise an unknown class and the K known classes.
And the recognizer obtains K +1 activation values corresponding to the target sample according to the K +1 categories. Wherein the K +1 classes include K known classes and 1 unknown class. Optionally, the K known classes are the first K classes, and the unknown class is the K +1 th class. The recognizer respectively recognizes each category of the target sample to obtain an activation value corresponding to each category. For example, if the model training device sets the value of K to 2, and the 2 known categories include football and basketball, the recognizer recognizes the category of the target sample, and may obtain activation values corresponding to 3 categories: the activation values corresponding to the football, the basketball and the unknown category respectively, for example, the activation value corresponding to the football is 1, the activation value corresponding to the basketball is 0.3, and the activation value corresponding to the unknown category is 0.2.
Optionally, for any one target class in the K+1 classes, the activation value corresponding to the target class is used to characterize the possibility that the target sample belongs to the target class. Optionally, the larger the activation value corresponding to the target class, the higher the possibility that the target sample belongs to the target class. For example, if the activation value corresponding to football is 1, that corresponding to basketball is 0.3, and that corresponding to the unknown category is 0.2, the activation value corresponding to football is the largest, so the possibility that the target sample belongs to football is the greatest. Accordingly, the recognizer may consider the target sample to be a predicted known class sample whose class is football. For another example, if the activation values for the target sample are 0 for football, 0.3 for basketball, and 0.9 for the unknown category, the recognizer predicts that the target sample is most likely of the unknown class and considers it a predicted unknown class sample.
Step 440, if the K +1 activation values meet the condition, determining that the target sample is a known type sample.
The condition is a judgment basis for determining whether the sample is a known type sample or an unknown type sample according to the activation value corresponding to the sample.
Optionally, step 440 comprises: determining the maximum activation value of the K +1 activation values; and if the class corresponding to the maximum activation value belongs to K known classes, determining that the target sample is a known class sample.
The maximum activation value refers to the largest of the K+1 activation values. The category of the sample is determined as the category corresponding to the maximum activation value. If the category corresponding to the maximum activation value is any one of the K known categories, the target sample is a known class sample; if the category corresponding to the maximum activation value is the (K+1)-th category, i.e., the unknown category, the target sample is an unknown class sample. For example, suppose the activation values of the target sample are 1 for football, 0.3 for basketball, and 0.2 for the unknown category. The maximum activation value of the target sample is then 1, and the corresponding category is football; since football is one of the K known categories, the target sample is a known class sample.
The category of the target sample is determined by comparing the activation values corresponding to each category of the target sample, so that a more accurate category is obtained for the target sample, providing a basis for calculating the probability value of the target sample.
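As an illustration of steps 420-440, a minimal sketch follows; it assumes the K known classes occupy the first K output indices of the recognizer and the unknown class occupies the last index (the function and variable names are illustrative, not from the patent):

```python
import numpy as np

def is_known_class(activations: np.ndarray, num_known: int) -> bool:
    """Decide known vs. unknown from the K+1 activation values of one sample.

    activations: shape (num_known + 1,); the last entry is assumed to be
    the activation value of the (K+1)-th, unknown, class.
    """
    predicted = int(np.argmax(activations))  # class with the maximum activation value
    return predicted < num_known             # indices 0..K-1 are the K known classes

# Example from the text (K = 2): football = 1.0, basketball = 0.3, unknown = 0.2
print(is_known_class(np.array([1.0, 0.3, 0.2]), num_known=2))  # True, predicted class: football
```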
Step 450, selecting N samples determined to be known class samples from the open set data to obtain N predicted known class samples.
And selecting N predicted known class samples from the predicted known class samples and the predicted unknown class samples predicted by the recognizer. Optionally, the N predicted known class samples are all or part of the predicted known class samples obtained by the identifier through prediction.
Step 460, using the GMM to model the maximum activation value distribution corresponding to the N predicted known class samples, to obtain probability values corresponding to the N predicted known class samples, respectively.
The probability values are obtained by processing the maximum activation values of the predicted known class samples with a Gaussian Mixture Model (GMM). A probability value characterizes the reliability that a predicted known class sample belongs to a known class, that is, the confidence of the recognizer in its prediction result for the target sample; the higher the probability value corresponding to a target sample, the higher the reliability of its prediction result, that is, the more accurate the prediction result. For example, if there are two samples A and B, the probability value of sample A is 0.8 and that of sample B is 0.7, then the confidence of sample A is higher than that of sample B, and the prediction result of sample A is more accurate than that of sample B.
Optionally, step 460 comprises the following steps (1-3):
1. for a target known class of the K known classes, M predicted known class samples belonging to the target known class are selected from the N predicted known class samples, M being a positive integer less than or equal to N.
The M predicted known class samples are obtained by randomly sampling the N predicted known class samples predicted by the recognizer. Optionally, the first M samples among the predicted known class samples predicted by the recognizer are selected as the M predicted known class samples. Optionally, prediction by the recognizer is stopped once M predicted known class samples have been obtained. The present application does not limit the manner in which the M predicted known class samples are obtained.
In some embodiments, as shown in fig. 3, the predicted known class samples in fig. 3 are the above M predicted known class samples, and are obtained from the N predicted known class samples obtained by the identifier 30 by means of random sampling.
2. And for each of the M predicted known class samples, obtaining an activation value corresponding to the target known class from the K +1 activation values corresponding to the predicted known class samples to obtain M activation values.
The activation values of the M predicted known class samples are determined: for any one of the M predicted known class samples, the maximum activation value is selected from the activation values of all its classes, and the class of that predicted known class sample is determined as the class corresponding to the maximum activation value. According to this method, M activation values corresponding to the M predicted known class samples are obtained.
3. And modeling the M activation values by using the GMM to obtain probability values corresponding to the M predicted known class samples respectively.
And the probability value corresponding to the ith prediction known class sample in the M prediction known class samples is used for representing the credibility of the ith prediction known class sample belonging to the target known class, and i is a positive integer less than or equal to M.
And modeling the maximum activation values of the M predicted known class samples based on a Gaussian Mixture Model (GMM) to obtain probability values corresponding to the M predicted known class samples. Any probability value is used to characterize the credibility of the corresponding predicted known class sample, that is, the credibility of the predicted known class sample is indeed the known class sample.
The probability values of the samples of a category are determined by the Gaussian mixture model from the activation values of the samples of the same category, laying a foundation for obtaining, according to the probability values, the predicted known class samples that satisfy the target condition.
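A minimal sketch of this per-class screening follows, assuming scikit-learn's GaussianMixture with two components, where the component with the larger mean is taken to represent true known class samples; the names are illustrative assumptions, not part of the patent:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def known_class_confidence(mavs: np.ndarray) -> np.ndarray:
    """Model the maximum activation values (mav) of the M predicted known
    class samples of one target known class with a 2-component GMM, and
    return each sample's posterior probability of the higher-mean
    component, used as its probability value (confidence)."""
    mavs = mavs.reshape(-1, 1)                    # GMM expects a 2-D array
    gmm = GaussianMixture(n_components=2, random_state=0).fit(mavs)
    known = int(np.argmax(gmm.means_))            # component with the larger mean activation
    return gmm.predict_proba(mavs)[:, known]      # posterior of the "known" component

# Toy usage: higher mav -> higher confidence of truly being a known class sample
print(known_class_confidence(np.array([0.95, 0.9, 0.88, 0.3, 0.25])).round(2))
```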
Step 470, obtaining training samples of the classifier based on the predicted known class samples, among the N predicted known class samples, whose probability values meet the target condition.
The target condition is used to further filter the predicted known class samples, yielding the predicted known class samples whose probability values satisfy the target condition. Optionally, the target condition may be a threshold, with the predicted known class samples whose probability values are greater than or equal to the threshold selected as training samples of the classifier. For example, if the threshold is set to 0.7 and there are three predicted known class samples with probability values, sample A with 0.8, sample B with 0.7, and sample C with 0.6, then samples A and B are selected as training samples of the classifier. Optionally, the target condition may be a nominal number of training samples; for example, if the nominal number is 100, the 100 samples with the highest probability values are selected as training samples.
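The two target conditions described above, a probability threshold and a nominal number of training samples, could be implemented as follows (a sketch with illustrative names):

```python
import numpy as np

def select_training_samples(probs, tau=0.7, top_b=None):
    """Return indices of predicted known class samples whose probability
    values satisfy the target condition: either probability >= tau, or,
    if top_b is given, the top_b highest probability values."""
    probs = np.asarray(probs)
    if top_b is not None:
        return np.argsort(probs)[::-1][:top_b]  # indices of the b largest values
    return np.where(probs >= tau)[0]            # indices meeting the threshold

probs = [0.8, 0.7, 0.6]                          # samples A, B, C from the example
print(select_training_samples(probs, tau=0.7))   # [0 1] -> samples A and B are kept
```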
Optionally, step 470 includes the following steps (471-472):
step 471, obtaining a labeling category corresponding to the prediction known category sample with the probability value meeting the target condition, wherein the labeling category is one of the K +1 categories.
Optionally, before the training samples are obtained, the predicted known class samples whose probability values satisfy the target condition are labeled with a labeling category, where the labeling category is any one of the K+1 categories. Optionally, the labeling may be performed manually on the predicted known class samples whose probability values satisfy the target condition. Optionally, the labeling is performed in an active learning manner, yielding labeled known class samples and labeled unknown class samples, where a labeled known class sample is a known class sample with a label and a labeled unknown class sample is an unknown class sample with a label.
Optionally, after step 471, the following steps (A-B) are also included:
A. Calculating the training loss of the recognizer based on the labeled class corresponding to the predicted known class sample and the K+1 activation values corresponding to the predicted known class sample obtained by the recognizer, wherein the labeled class is one of the K+1 classes.
For any one of the labeled known class samples and labeled unknown class samples, the activation values corresponding to its K+1 classes are determined based on its labeling category. Then, the labeling activation values of the labeled sample corresponding to the predicted known class sample are compared with the activation values of the predicted known class sample to calculate the training loss of the recognizer, and the recognizer is trained based on the training loss.
Optionally, the calculating the training loss of the recognizer based on the labeled class corresponding to the predicted known class sample and the K +1 activation values corresponding to the predicted known class sample obtained by the recognizer comprises the following steps (a-b):
a. determining K +1 labeling activation values corresponding to the samples of the predicted known category based on the labeling categories corresponding to the samples of the predicted known category; the labeling activation value corresponding to the labeling category is a first numerical value, the labeling activation values corresponding to the other categories except the labeling category in the K +1 categories are second numerical values, and the first numerical value and the second numerical value are different;
optionally, the labeling activation value of the labeled known class sample or the labeled unknown class sample is determined according to the class of the labeled known class sample or the labeled unknown class sample corresponding to the predicted known class sample. The labeling activation value corresponding to the type of the labeled known type sample or the labeled unknown type sample is a first numerical value, and the labeling activation values of other types are second numerical values. For example, if the category of a sample is labeled as football, the activation value of the football label is 1 (first numerical value), and the activation values of the other categories are 0 (second numerical value). Optionally, if a certain sample is an unknown labeling category, the labeling activation value corresponding to the unknown labeling category is 1, and the labeling activation values corresponding to other categories are 0. The present application does not limit the values of the first and second numerical values.
b. Calculating the training loss of the recognizer based on K +1 marked activation values corresponding to the predicted known class samples, K +1 activation values corresponding to the predicted known class samples obtained by the recognizer and the temperature coefficient; wherein the temperature coefficient is used to adjust the sharpness of the distribution of the activation values.
The training loss of the recognizer is calculated based on the obtained first and second numerical values and the activation values of each class of the predicted known class sample: the training loss is obtained from the difference between the first numerical value, the second numerical value, and the activation values of each class of the corresponding predicted known class sample. Optionally, the training loss may also be adjusted by adjusting the temperature coefficient of the loss function, i.e., the coefficient in the loss function, to obtain sharper activation values and thus a training loss better suited for training the recognizer; the loss function is the function used to calculate the training loss of the recognizer. For example, by reducing the temperature coefficient of the loss function, a more sharply distributed loss function can be obtained, where sharpness refers to the size of the difference between the maximum and minimum values of the function: the greater this difference, the sharper the loss function and the better the training effect of the resulting training loss.
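Assuming the first numerical value is 1 and the second is 0 (i.e., standard one-hot labeling targets), the temperature-adjusted training loss reduces to cross-entropy over scaled activation values; a minimal PyTorch sketch with illustrative names:

```python
import torch
import torch.nn.functional as F

def recognizer_loss(activations: torch.Tensor, labels: torch.Tensor,
                    temperature: float = 0.5) -> torch.Tensor:
    """Cross-entropy over K+1 classes with temperature coefficient T.

    activations: (batch, K+1) activation values from the recognizer.
    labels: (batch,) labeled classes; the one-hot targets implied by the
    first/second numerical values (1 and 0) are what cross_entropy uses.
    A smaller T makes the softmax distribution sharper.
    """
    return F.cross_entropy(activations / temperature, labels)

logits = torch.tensor([[2.0, 0.3, 0.2]])  # football, basketball, unknown
label = torch.tensor([0])                 # labeling category: football
print(recognizer_loss(logits, label, temperature=0.5))
```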
In some embodiments, as shown in FIG. 3, the recognizer 30 is trained with the temperature coefficient 330, the labeled known class samples, and the labeled unknown class samples.
The classes of the predicted known class samples are labeled through manual labeling to obtain labeled known class samples and labeled unknown class samples, and the labeling activation values of these samples are set via the first and second numerical values, so that the trained recognizer can better recognize the classes of samples.
B. Adjusting parameters of the recognizer based on training loss of the recognizer to obtain an updated recognizer; wherein the updated identifier is used to identify new predicted known class samples from the open set data.
Optionally, after the adjustment, the activation values obtained by the recognizer for a known class sample can approximate the first numerical value and the second numerical value corresponding to the same predicted known class sample.
By training the recognizer multiple times as above, a recognizer trained over multiple rounds is obtained. Compared with the original recognizer, the recognizer trained multiple times produces more accurate recognition results, reducing the recognition errors that would otherwise appear in subsequent manual labeling.
As shown in fig. 3, the labeling activation values of the labeled known class samples and labeled unknown class samples in fig. 3 are the first numerical value and the second numerical value.
Step 472, using the prediction known class sample with the labeled class belonging to the K known classes as a training sample of the classifier.
And taking the obtained sample labeled with the known class as a training sample of the classifier.
Optionally, the model training device further needs to pre-train the recognizer, and the specific steps are as follows: acquiring a pre-training data set of the recognizer, wherein the pre-training data set comprises at least one pre-training sample with a class label, and the class label is one of an unknown class and K known classes; training the recognizer by adopting a pre-training data set to obtain a recognizer which is pre-trained; wherein the pre-trained recognizer is used to predict known class samples from the open set data.
A pre-training data set is acquired, in which the samples carry class labels; a class label is either one of the K known classes or the unknown class. For example, the samples in the pre-training data set are all samples whose class label is football, basketball, volleyball, or unknown. The recognizer is trained with the pre-training data set: the training loss is obtained from the predicted category that the recognizer outputs for a sample in the pre-training data set and the actual category of that sample, and the recognizer is pre-trained according to the training loss to obtain a pre-trained recognizer.
Optionally, for the labeled known class samples, a K-class classifier is trained by minimizing the standard cross-entropy loss, where the number of classes the classifier can recognize is the same as the number of known classes the recognizer can recognize. Optionally, the classifier may recognize fewer classes than the number of known classes the recognizer can recognize. The number of classes that the classifier can recognize is not limited in this application. The specific formula is as follows:
$$\min_{\theta_C} \frac{1}{n_L} \sum_{(x_i, y_i) \in D_L} \ell_{CE}\big(f(x_i; \theta_C), y_i\big)$$

where $(x_i, y_i) \in D_L$, $D_L$ is the set of labeled known class samples, $\ell_{CE}$ is the cross-entropy loss, $\theta_C$ is the training parameter of the K-class classifier, and $n_L$ is the size of the current set $D_L$.
Through the pre-training of the recognizer, the recognizer is enabled to have recognition capability preliminarily, the category can be recognized preliminarily, and the training process of the recognizer is accelerated.
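A minimal sketch of such a pre-training loop under the assumptions above (batches of (sample, class-label) pairs with labels in 0..K, where index K denotes the unknown class; all names are illustrative):

```python
import torch
from torch import nn

def pretrain_recognizer(recognizer: nn.Module, loader, epochs: int = 10, lr: float = 1e-3):
    """Pre-train the recognizer on a class-labeled data set: the training
    loss is computed from the recognizer's predicted category and the
    sample's actual category, and the parameters are updated accordingly."""
    opt = torch.optim.SGD(recognizer.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:                    # y in 0..K, with K = unknown class
            opt.zero_grad()
            loss = loss_fn(recognizer(x), y)   # predicted vs. actual category
            loss.backward()
            opt.step()
    return recognizer
```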
The embodiment identifies the category of the sample in the open-set data through the identifier, and then selects a part of the predicted known category samples obtained by the identifier in a random sampling mode. And then calculating to obtain a probability value of the predicted known class sample through the activation value and the Gaussian mixture model, selecting the predicted known class sample meeting the condition according to the probability value, and labeling the predicted known class sample in a manual review mode to obtain a labeled known class sample and a labeled unknown class sample. And finally, inputting the samples marked with known classes into a classifier model for training a classifier, and inputting the samples marked with unknown classes into a recognizer for training the recognizer. On one hand, more accurate labeled known class samples and labeled unknown class samples are obtained through manual review, so that training of the recognizer and the classifier model is more effective; on the other hand, the probability value is calculated through the Gaussian mixture model and the activation value of the predicted known class sample, and the predicted known class sample corresponding to the probability value meeting the condition is selected for manual examination, so that the manpower and time required by manual examination are reduced, and the training efficiency of the annotation model of the open set data is improved.
The training process of the recognizer comprises the following specific steps:
the identifier may identify K known classes and may also identify a K +1 th unknown class, and for a sample with a labeled known class and a sample with a labeled unknown class obtained by manual labeling, taking a sample X as an example, the sample X is encoded by using one-hot (a way that one-hot encoding is performed by encoding N states by using an N-bit state register), so as to obtain a corresponding activation value. The activation value of the category corresponding to the sample X is 1, and the activation values of the other categories are labeled as 0. Optimizing the training recognizer through a cross entropy loss function with a temperature coefficient T, wherein the formula is as follows:
$$L_D(X, C) = -\log \hat{p}_C, \qquad \hat{p}_C = \frac{\exp(a_C)}{\sum_{j=1}^{K+1} \exp(a_j)}$$

where $L_D$ is the cross-entropy loss function, $L_D(X, C)$ is the cross-entropy loss of sample X for class C, $a_C$ is the activation value of sample X for class C, and $\hat{p}_C$ is the probability distribution used by the cross-entropy loss function. From the above formula, a known class sample has larger activation values on the first K known classes and a smaller activation value on the (K+1)-th unknown class, while an unknown class sample shows the opposite. Meanwhile, the temperature coefficient T of the cross-entropy loss function can be reduced to widen the value range of the probability distribution $\hat{p}_C$, making $\hat{p}_C$ sharper. The specific formula is as follows:

$$\hat{p}_C = \frac{\exp(a_C / T)}{\sum_{j=1}^{K+1} \exp(a_j / T)}$$

With the reduction of the temperature coefficient T, the value range of $\hat{p}_C$ over the activation values becomes larger, and known class samples and unknown class samples are easier to distinguish.
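A quick numeric illustration of this sharpening effect (the activation values are illustrative):

```python
import numpy as np

def softmax(a, T):
    e = np.exp(np.asarray(a) / T)   # divide activation values by the temperature coefficient
    return e / e.sum()

a = [1.0, 0.3, 0.2]                  # K+1 = 3 activation values
print(softmax(a, T=1.0).round(3))    # ~[0.514 0.255 0.231]
print(softmax(a, T=0.5).round(3))    # ~[0.690 0.170 0.139] -- larger max-min gap, sharper
```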
The following describes the calculation process in which M predicted known class samples are obtained by randomly sampling the N predicted known class samples, a Gaussian mixture model is fitted, and the predicted known class samples whose probability values satisfy the target condition are extracted from the M predicted known class samples. The specific steps are as follows:
in the training process of the recognizer, we find that whether the samples in the open set data are known class samples or unknown class samples can be judged through the activation values, that is, the maximum activation value of the unknown class samples is significantly different from the average activation value of the known class samples. For each sample class C, setting the maximum activation value may be defined as follows:
$$mav_C = \max_{j \in \{1, \dots, K+1\}} a_j$$

where $mav_C$ is the maximum activation value for sample class C, $\max$ denotes taking the maximum of the activation values, and $a_j$ is the activation value corresponding to the j-th class of a labeled known class sample.
The predicted known class samples are input into the Gaussian mixture model, the Expectation-Maximization (EM) algorithm is used to fit the Gaussian mixture model to the $mav_C$ values, and the corresponding probability values are calculated:

$$W_C = \mathrm{GMM}(mav_C; \theta_D)$$

where $W_C$ contains the probability values corresponding to the samples of class C, and $\theta_D$ denotes the parameters of the Gaussian mixture model. For each unlabeled sample $x_i$ of class C, its known-class probability value $w_i$ is the posterior probability $P(g \mid mav_i)$, where g is the Gaussian component with the larger activation values. The probabilities of the various categories are then combined and sorted:
$$W = \mathrm{sort}(W_1 \cup W_2 \cup \dots \cup W_K)$$

where $\mathrm{sort}$ ranks the probability values in parentheses.
Then, the top b samples with the highest probability values are selected and labeled by the annotator. Optionally, a threshold is set to construct the query set to be labeled by the annotator, with the following formula:

$$X_{query} = \{\, x_i \mid w_i \geq \tau \,\}$$

where $X_{query}$ is the above-mentioned query set and $\tau$ is the set threshold. After the labels of the samples in the query set are annotated, the labeled known class samples and labeled unknown class samples are updated for training the recognizer.
In some embodiments, samples are selected using the LFOSA active learning method of this scheme and 5 other methods: Random, Uncertainty, OpenMax, Coreset, and BALD, and the methods are compared from three angles: recall, precision, and target model performance improvement. The public data sets adopted are CIFAR10, CIFAR100, and Tiny-ImageNet. As shown in fig. 5, 6 and 7, fig. 5 illustrates the recall performance of sample sampling for a number of different active learning methods; fig. 6 illustrates the precision performance of sample sampling for the different active learning methods; fig. 7 illustrates the model performance improvement of the sampling strategies of the different active learning methods. The experiments in fig. 5, 6 and 7 were all performed 9 times, and 5 different sets of experimental results were obtained.
In some embodiments, as shown in fig. 5, curve 51 in fig. 5 is the sample-sampling recall performance of the LFOSA active learning method of the present application. The abscissa of each graph is the number of training rounds, and the ordinate is the recall performance of sample sampling. It can be seen that the LFOSA active learning method of the present application outperforms the other methods regardless of the number of training rounds.
Also, in some embodiments, as shown in fig. 6, curve 61 in fig. 6 is the sample-sampling precision performance of the LFOSA active learning method of the present application. The abscissa of each graph is the number of training rounds, and the ordinate is the precision performance of sample sampling. It can be seen that the LFOSA active learning method of the present application outperforms the other methods regardless of the number of training rounds.
Also, in some embodiments, as shown in fig. 7, curve 71 in fig. 7 is the model performance improvement of the sampling strategy of the LFOSA active learning method of the present application. The abscissa of each graph is the number of training rounds, and the ordinate is the model performance improvement of the sampling strategy. It can be seen that, during the first 4 rounds of training, the LFOSA active learning method of the present application is comparable to the other methods in terms of the model performance improvement of the sampling strategy, but after 4 rounds, its advantage over the other methods becomes more pronounced.
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Referring to fig. 8, a block diagram of an apparatus for annotating open-set data according to an embodiment of the present application is shown. The device has the function of realizing the marking method of the open set data, and the function can be realized by hardware or by hardware executing corresponding software. The device may be the model training apparatus described above, or may be provided in the model training apparatus. The apparatus 800 may include: a sample acquisition module 810, a sample identification module 820, and a sample selection module 830.
A sample acquiring module 810, configured to acquire open-set data including a plurality of samples, where the plurality of samples include at least one known class sample and at least one unknown class sample; the known class samples refer to samples of which real classes belong to K known classes, the unknown class samples refer to samples of which real classes do not belong to the K known classes, and K is a positive integer.
A sample identification module 820, configured to identify, by an identifier, N predicted known class samples from the open set data; wherein the identifier is configured to identify the sample as the known class sample or the unknown class sample, the predicted known class sample refers to the sample identified by the identifier as the known class sample, and N is a positive integer.
A sample selection module 830, configured to select, from the N predicted known class samples, a predicted known class sample with a confidence level meeting a condition as a training sample of a classifier; wherein the classifier is configured to classify the K known classes.
In an exemplary embodiment, as shown in fig. 9, the sample identification module 820 may include: a sample input unit 821, an activation value acquisition unit 822, a category determination unit 833, and a predicted sample determination unit 834.
A sample input unit 821, configured to input, to the identifier, a target sample in the open-set data.
An activation value obtaining unit 822, configured to obtain, through the identifier, K +1 activation values corresponding to the target sample; the K +1 activation values correspond to K +1 categories one by one, and the K +1 categories comprise unknown categories and the K known categories.
A category determining unit 833, configured to determine that the target sample is the known category sample if the K +1 activation values meet a condition.
A prediction sample determining unit 834, configured to select N samples determined as the known class samples from the open set data, so as to obtain the N predicted known class samples.
In an exemplary embodiment, the category determining unit 833 is configured to:
determining a maximum activation value of the K+1 activation values;
and if the class corresponding to the maximum activation value belongs to the K known classes, determining that the target sample is the known class sample.
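For illustration only (and not as part of the claimed embodiments), this decision rule can be sketched in Python as follows; the function name and the convention that the last of the K+1 activation values corresponds to the unknown class are assumptions:

```python
import numpy as np

def is_predicted_known(activations: np.ndarray, k: int) -> bool:
    """Return True if the sample should be treated as a known class sample.

    `activations` holds the K+1 activation values for one sample; by the
    convention assumed here, indices 0..k-1 are the K known classes and
    index k is the unknown class.
    """
    # Find the maximum activation value among the K+1 values, then check
    # whether the class it corresponds to belongs to the K known classes.
    return int(np.argmax(activations)) < k
```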
In an exemplary embodiment, as shown in fig. 9, the sample selection module 830 may include: a probability value acquisition unit 831 and a training sample acquisition unit 832.
A probability value obtaining unit 831, configured to model, by using a Gaussian mixture model (GMM), the maximum activation value distribution corresponding to the N predicted known class samples, so as to obtain probability values respectively corresponding to the N predicted known class samples; wherein the probability value is used to characterize the confidence level that the predicted known class sample belongs to a known class.
A training sample obtaining unit 832, configured to obtain a training sample of the classifier based on the predicted known class sample with the probability value satisfying a target condition from the N predicted known class samples.
In some embodiments, the probability value obtaining unit 831 is configured to:
for a target known class of the K known classes, selecting M predicted known class samples belonging to the target known class from the N predicted known class samples, M being a positive integer less than or equal to N;
for each predicted known class sample in the M predicted known class samples, acquiring an activation value corresponding to the target known class from the K+1 activation values corresponding to the predicted known class sample, to obtain M activation values;
modeling the M activation values by using the GMM to obtain probability values corresponding to the M predicted known class samples respectively; wherein a probability value corresponding to the ith predicted known class sample in the M predicted known class samples is used to characterize the confidence level that the ith predicted known class sample belongs to the target known class, and i is a positive integer less than or equal to M.
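A minimal sketch of this per-class modeling step follows, assuming scikit-learn's GaussianMixture and a two-component mixture (the number of components is not fixed by this embodiment; all names are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def per_class_confidence(activation_values: np.ndarray) -> np.ndarray:
    """Fit a GMM to the M activation values of one target known class and
    return, for each sample, the posterior probability of the component
    with the larger mean, used here as the confidence level that the
    sample truly belongs to the target known class."""
    values = activation_values.reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(values)
    # The component with the larger mean is taken to represent samples
    # whose activation values are high, i.e. the more credible ones.
    high = int(np.argmax(gmm.means_.ravel()))
    return gmm.predict_proba(values)[:, high]
```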
In some embodiments, the training sample acquisition unit 832 is configured to:
acquiring a labeling category corresponding to a prediction known category sample with the probability value meeting the target condition, wherein the labeling category is one of the K+1 categories;
and taking a prediction known class sample of which the labeled class belongs to the K known classes as a training sample of the classifier.
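Putting the two preceding steps together, the secondary screening could look like the following sketch; the confidence threshold of 0.9 and the label convention (0..K-1 for known classes, K for the unknown class) are assumptions, not values fixed by this embodiment:

```python
def select_training_samples(samples, probs, labels, k, threshold=0.9):
    """Keep predicted known class samples whose probability value meets
    the target condition and whose labeled category is one of the K known
    classes (labels 0..k-1); label k denotes the unknown class."""
    return [s for s, p, y in zip(samples, probs, labels)
            if p >= threshold and y < k]
```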
In some embodiments, the training sample acquiring unit 832 is further configured to:
calculating the training loss of the recognizer based on the labeling class corresponding to the prediction known class sample and the K+1 activation values corresponding to the prediction known class sample obtained by the recognizer; wherein the prediction category is one of the K+1 categories;
adjusting parameters of the recognizer based on the training loss of the recognizer to obtain an updated recognizer; wherein the updated identifier is to identify a new predicted known class sample from the open set data.
In some embodiments, the training sample acquisition unit 832 is configured to:
determining K+1 labeling activation values corresponding to the prediction known class samples based on the labeling classes corresponding to the prediction known class samples; the labeling activation value corresponding to the labeling category is a first numerical value, the labeling activation values corresponding to the other categories except the labeling category in the K+1 categories are second numerical values, and the first numerical value and the second numerical value are different;
calculating the training loss of the recognizer based on the K+1 labeled activation values corresponding to the predicted known class samples, the K+1 activation values corresponding to the predicted known class samples obtained by the recognizer, and a temperature coefficient; wherein the temperature coefficient is used to adjust the sharpness of the activation value distribution.
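One plausible reading of this loss, sketched below in PyTorch, is a temperature-scaled cross entropy in which the K+1 labeled activation values form a one-hot target (first numerical value 1, second numerical value 0); the exact loss form and the default temperature are assumptions rather than limitations of this embodiment:

```python
import torch
import torch.nn.functional as F

def recognizer_loss(activations: torch.Tensor,
                    labels: torch.Tensor,
                    temperature: float = 0.5) -> torch.Tensor:
    """Cross entropy over K+1 activation values with temperature scaling.

    `activations`: (batch, K+1) raw activation values from the recognizer.
    `labels`     : (batch,) index of the labeled category in [0, K]; this
                   is equivalent to a one-hot target whose labeled
                   activation values are 1 (labeled class) and 0 (others).
    Dividing by a temperature below 1 sharpens the softmax distribution;
    a temperature above 1 flattens it.
    """
    return F.cross_entropy(activations / temperature, labels)
```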
In some embodiments, the pre-training process of the recognizer is as follows:
acquiring a pre-training data set of the recognizer, wherein the pre-training data set comprises at least one pre-training sample with a class label, and the class label is one of the unknown class and the K known classes;
training the recognizer by adopting the pre-training data set to obtain a recognizer which is pre-trained;
wherein the pre-trained recognizer is configured to recognize the predicted known class samples from the open set data.
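A minimal pre-training loop consistent with this description might look as follows; the recognizer model, data loader, optimizer choice, and hyperparameters are placeholders (assumptions):

```python
import torch
import torch.nn.functional as F

def pretrain_recognizer(recognizer, loader, epochs=10, lr=1e-3,
                        temperature=0.5):
    """Pre-train the recognizer on pre-training samples whose class
    labels span the K+1 categories (the unknown class plus the K known
    classes), so that it can recognize predicted known class samples."""
    optimizer = torch.optim.SGD(recognizer.parameters(), lr=lr)
    for _ in range(epochs):
        for samples, labels in loader:         # labels in [0, K]
            activations = recognizer(samples)  # shape: (batch, K+1)
            loss = F.cross_entropy(activations / temperature, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return recognizer
```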
According to the method, the sample in the open set data is primarily identified through the identifier to obtain the predicted known class sample, and then the predicted known class sample is secondarily screened through the credibility to obtain the training sample for training the classifier. The method solves the problem that the common labeling method cannot be applied to the open set data, and provides the method for identifying and labeling the known class samples from the open set data. In addition, the method utilizes the recognizer to perform primary recognition, and then performs secondary screening based on the credibility, so that the accuracy of the known class samples selected from the open set data is fully ensured.
Referring to fig. 10, a schematic structural diagram of a computer device according to an embodiment of the present application is shown. The computer device may be any electronic device with data computing, processing, and storage capabilities, such as a mobile phone, a tablet computer, a PC (Personal Computer), or a server. The computer device may be implemented as the model training device for implementing the method for labeling open-set data provided in the above embodiments. Specifically:
the computer apparatus 1100 includes a Central Processing Unit (e.g., a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), an FPGA (Field Programmable Gate Array), etc.) 1001, a system Memory 1004 including a RAM (Random-Access Memory) 1002 and a ROM (Read-Only Memory) 1003, and a system bus 1005 connecting the system Memory 1004 and the Central Processing Unit 1001. The computer device 1000 also includes a basic Input/Output System (I/O System) 1006 for facilitating information transfer between the various devices within the server, and a mass storage device 10010 for storing an operating System 1013, application programs 1014, and other program modules 1015.
The basic input/output system 1006 includes a display 1008 for displaying information and an input device 1009, such as a mouse, keyboard, etc., for user input of information. The display 1008 and the input device 1009 are connected to the central processing unit 1001 through an input/output controller 1010 connected to the system bus 1005. The basic input/output system 1006 may also include an input/output controller 1010 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input-output controller 1010 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 1007 is connected to the central processing unit 1001 through a mass storage controller (not shown) connected to the system bus 1005. The mass storage device 1007 and its associated computer-readable media provide non-volatile storage for the computer device 1000. That is, the mass storage device 1007 may include a computer-readable medium (not shown) such as a hard disk or a CD-ROM (Compact Disc Read-Only Memory) drive.
Without loss of generality, the computer-readable media may comprise computer storage media and communication media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include RAM, ROM, EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory or other solid-state memory technology, CD-ROM, DVD (Digital Video Disc) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage, or other magnetic storage devices. Of course, those skilled in the art will appreciate that the computer storage media are not limited to the foregoing. The system memory 1004 and the mass storage device 1007 described above may be collectively referred to as memory.
According to embodiments of the present application, the computer device 1000 may also operate by being connected, through a network such as the Internet, to a remote computer on the network. That is, the computer device 1000 may be connected to the network 1012 through the network interface unit 1011 connected to the system bus 1005, or may be connected to other types of networks or remote computer systems (not shown) using the network interface unit 1011.
The memory also stores at least one instruction, at least one program, a code set, or an instruction set, which is configured to be executed by one or more processors to implement the above method for labeling open-set data.
In an exemplary embodiment, a computer readable storage medium is also provided, in which at least one instruction, at least one program, a set of codes, or a set of instructions is stored, which when executed by a processor of a computer device, implements the method for annotating open set data provided by the above embodiments.
Optionally, the computer-readable storage medium may include: a ROM (Read-Only Memory), a RAM (Random-Access Memory), an SSD (Solid State Drive), an optical disc, or the like. The random-access memory may include a ReRAM (Resistive Random-Access Memory) and a DRAM (Dynamic Random-Access Memory).
In an exemplary embodiment, a computer program product or a computer program is also provided, the computer program product or the computer program comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, so that the computer device executes the method for labeling the open set data.
It should be understood that reference to "a plurality" herein means two or more. "And/or" describes the association relationship of the associated objects, indicating that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. In addition, the step numbers described herein only exemplarily show one possible execution order of the steps. In some other embodiments, the steps may also be executed out of the numbered order; for example, two steps with different numbers may be executed simultaneously, or two steps with different numbers may be executed in an order reverse to that shown in the figure, which is not limited by the embodiments of the present application.
The above description is only exemplary of the present application and is not intended to limit the present application, and any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (13)

1. A method for annotating open-set data, the method comprising:
acquiring open set data comprising a plurality of samples, the plurality of samples comprising at least one known class sample and at least one unknown class sample; the known class samples refer to samples of which real classes belong to K known classes, the unknown class samples refer to samples of which real classes do not belong to the K known classes, and K is a positive integer;
identifying N predicted known category samples from the open set data through an identifier; wherein the identifier is configured to identify the sample as the known class sample or the unknown class sample, the predicted known class sample refers to the sample identified by the identifier as the known class sample, and N is a positive integer;
selecting a prediction known class sample with the reliability meeting the condition from the N prediction known class samples as a training sample of the classifier; wherein the classifier is configured to classify the K known classes.
2. The method of claim 1, wherein the identifying by the identifier N predicted known class samples from the open set data comprises:
for a target sample in the open set data, inputting the target sample to the identifier;
obtaining K+1 activation values corresponding to the target sample through the identifier; the K+1 activation values correspond to K+1 categories one by one, and the K+1 categories comprise an unknown category and the K known categories;
if the K+1 activation values meet the condition, determining that the target sample is the known class sample;
and selecting N samples determined as the known class samples from the open set data to obtain the N predicted known class samples.
3. The method of claim 2, wherein the determining that the target sample is the known class sample if the K+1 activation values meet the condition comprises:
determining a maximum activation value of the K+1 activation values;
and if the class corresponding to the maximum activation value belongs to the K known classes, determining that the target sample is the known class sample.
4. The method according to claim 2, wherein the selecting, from the N predicted known class samples, a predicted known class sample with a confidence level satisfying a condition as a training sample of a classifier comprises:
modeling the maximum activation value distribution corresponding to the N predicted known class samples by using a Gaussian Mixture Model (GMM) to obtain probability values corresponding to the N predicted known class samples respectively; wherein the probability value is used for representing the credibility of the prediction known class sample belonging to the known class;
and obtaining a training sample of the classifier based on the prediction known class sample with the probability value meeting the target condition from the N prediction known class samples.
5. The method according to claim 4, wherein the modeling the maximum activation value distribution corresponding to the N predicted known class samples using the GMM to obtain probability values corresponding to the N predicted known class samples respectively comprises:
for a target known class of the K known classes, selecting M predicted known class samples belonging to the target known class from the N predicted known class samples, M being a positive integer less than or equal to N;
for each predicted known class sample in the M predicted known class samples, acquiring an activation value corresponding to the target known class from the K+1 activation values corresponding to the predicted known class sample, to obtain M activation values;
modeling the M activation values by using the GMM to obtain probability values corresponding to the M predicted known class samples respectively; wherein a probability value corresponding to the ith predicted known class sample in the M predicted known class samples is used for representing the credibility that the ith predicted known class sample belongs to the target known class, and i is a positive integer less than or equal to M.
6. The method of claim 4, wherein the deriving the training samples of the classifier based on the predicted known class samples with the probability values satisfying a target condition from the N predicted known class samples comprises:
acquiring a labeling category corresponding to a prediction known category sample with the probability value meeting the target condition, wherein the labeling category is one of the K+1 categories;
and taking a prediction known class sample of which the labeled class belongs to the K known classes as a training sample of the classifier.
7. The method of claim 6, wherein after the acquiring the labeling category corresponding to the prediction known category sample with the probability value meeting the target condition, the method further comprises:
calculating the training loss of the recognizer based on the labeling class corresponding to the prediction known class sample and the K+1 activation values corresponding to the prediction known class sample obtained by the recognizer; wherein the prediction category is one of the K+1 categories;
adjusting parameters of the recognizer based on the training loss of the recognizer to obtain an updated recognizer; wherein the updated identifier is to identify a new predicted known class sample from the open set data.
8. The method according to claim 7, wherein the calculating the training loss of the recognizer based on the labeling class corresponding to the prediction known class sample and the K+1 activation values corresponding to the prediction known class sample obtained by the recognizer comprises:
determining K+1 labeling activation values corresponding to the prediction known class samples based on the labeling classes corresponding to the prediction known class samples; the labeling activation value corresponding to the labeling category is a first numerical value, the labeling activation values corresponding to the other categories except the labeling category in the K+1 categories are second numerical values, and the first numerical value and the second numerical value are different;
calculating the training loss of the recognizer based on the K+1 labeled activation values corresponding to the predicted known class samples, the K+1 activation values corresponding to the predicted known class samples obtained by the recognizer, and a temperature coefficient; wherein the temperature coefficient is used to adjust the sharpness of the activation value distribution.
9. The method according to any one of claims 1 to 8, wherein the pre-training process of the recognizer is as follows:
acquiring a pre-training data set of the recognizer, wherein the pre-training data set comprises at least one pre-training sample with a class label, and the class label is one of the unknown class and the K known classes;
training the recognizer by adopting the pre-training data set to obtain a recognizer which is pre-trained;
wherein the pre-trained recognizer is configured to recognize the predicted known class samples from the open set data.
10. An apparatus for annotating open-set data, the apparatus comprising:
a sample acquisition module for acquiring open set data comprising a plurality of samples, the plurality of samples including at least one known class sample and at least one unknown class sample; the known class samples refer to samples of which real classes belong to K known classes, the unknown class samples refer to samples of which real classes do not belong to the K known classes, and K is a positive integer;
the sample identification module is used for identifying N predicted known class samples from the open set data through an identifier; wherein the identifier is configured to identify the sample as the known class sample or the unknown class sample, the predicted known class sample refers to the sample identified by the identifier as the known class sample, and N is a positive integer;
the sample selection module is used for selecting a prediction known class sample with the reliability meeting the condition from the N prediction known class samples as a training sample of the classifier; wherein the classifier is configured to classify the K known classes.
11. A computer device comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, set of codes, or set of instructions, which is loaded and executed by the processor to implement the method of any one of claims 1 to 9.
12. A computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement the method according to any one of claims 1 to 9.
13. A computer program product comprising computer instructions stored in a computer readable storage medium, from which a processor reads and executes the computer instructions to implement the method of any one of claims 1 to 9.