CN112131415B - Method and device for improving data acquisition quality based on deep learning - Google Patents


Info

Publication number
CN112131415B
CN112131415B
Authority
CN
China
Prior art keywords
confidence coefficient
label
data
error
confidence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010987992.7A
Other languages
Chinese (zh)
Other versions
CN112131415A (en)
Inventor
秦浩达
Current Assignee
Beijing Moviebook Science And Technology Co ltd
Original Assignee
Beijing Moviebook Science And Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Moviebook Science And Technology Co ltd
Priority to CN202010987992.7A
Publication of CN112131415A
Application granted
Publication of CN112131415B

Classifications

    • G06F16/53 — Querying (information retrieval of still image data)
    • G06F16/5866 — Retrieval characterised by using manually generated metadata, e.g. tags, keywords, comments
    • G06N3/08 — Learning methods for neural networks
    • G06N3/084 — Backpropagation, e.g. using gradient descent


Abstract

The application discloses a method and a device for improving data acquisition quality based on deep learning, relating to the field of data acquisition. The method comprises the following steps: setting multiple types of labels, each type of label corresponding to several standard pictures; inputting all pictures corresponding to the labels into a neural network and obtaining a trained model through error-based adjustment; when collecting data, searching for data by inputting the labels, feeding the results into the trained model to obtain a confidence list for each label, and determining a confidence threshold; after data collection is completed, calculating the confidence of the input label through the trained model, and if the confidence is higher than the confidence threshold, keeping the collected data as the data of that label. The device comprises: a setting module, a training module, an acquisition module and an adjustment module. The application improves labeling efficiency and reduces the workload of labeling personnel.

Description

Method and device for improving data acquisition quality based on deep learning
Technical Field
The application relates to the field of data acquisition, in particular to a method and a device for improving data acquisition quality based on deep learning.
Background
It is well known that a large amount of data is required to support the training of neural network models. The recognition accuracy of a model depends not only on the quality of the algorithm but also on the accuracy of the data. If a picture of a horse is placed under a "sheep" label, the horse is very likely to be misidentified as a sheep after model training is completed. Some internet companies invest a great deal of money and manpower in data preparation, and because people perceive things differently, pictures that sit close to a label boundary receive different labeling results. How to free as much manpower as possible from tedious labeling work has long been a difficulty in data labeling.
In the early days of artificial intelligence, when a new business required data, the first source was an internet search engine. A phenomenon is commonly observed when using a search engine: after a keyword query is entered, only the top-ranked pictures meet the requirement. Because of the search engine's algorithm and the picture labels, only a very small number of top-ranked pictures meet the training requirement, and part of the large number of unusable pictures is actually data needed under other labels. The recognition capability of a data acquisition model is therefore not reliable, the recognition effect of each label needs to be checked manually, and misrecognition can still occur with small probability.
Disclosure of Invention
The present application aims to overcome or at least partially solve or alleviate the above-mentioned problems.
According to one aspect of the present application, there is provided a method for improving data acquisition quality based on deep learning, comprising:
Setting a plurality of types of labels, wherein each type of label corresponds to a plurality of standard pictures;
inputting all pictures corresponding to the multiple types of labels into a neural network, and obtaining scores through forward propagation;
inputting the score into an error function to calculate an error, determining a gradient vector through back propagation, adjusting the weights of each layer of neurons in the neural network via the gradient vector so that the error tends to zero or converges, and repeating the steps of calculating the error and adjusting the weights until the number of adjustments reaches a specified count or the average error loss no longer decreases, so as to obtain a trained model;
when collecting data, searching for data by inputting labels, placing the retrieved pictures of the same label under the same folder, inputting the pictures under all label folders into the trained model to obtain a confidence list for each label, determining the confidence median of each label according to the confidence at which misrecognition begins in its confidence list, summing the confidence medians of all labels and taking the mean, and setting the mean as the confidence threshold;
after data collection is completed, the confidence of the input label is calculated through the trained model, and if the confidence is higher than the confidence threshold, the collected data is used as the data of the label.
Optionally, inputting all pictures corresponding to the multiple types of labels into a neural network, and obtaining a score through forward propagation, including:
inputting all pictures corresponding to the multiple types of labels into a neural network, performing weighted accumulation on the input values of each neuron and passing the result through an activation function to obtain the neuron's output value, and obtaining a score through forward propagation.
Optionally, inputting the score into an error function to calculate an error includes:
inputting the score into an error function and comparing it with an expected value to obtain an error; if a plurality of errors are obtained, summing them to obtain the total error.
Optionally, determining the confidence median of each label according to the confidence of the misrecognized label in the confidence list includes:
checking the confidence list of each label, finding the confidence value at which the label starts to be misrecognized, and determining that confidence value as the confidence median of the label.
Optionally, the method further comprises:
when data verification is performed, raising the confidence threshold by a specified percentage, and deleting collected data that does not meet the raised confidence threshold.
According to another aspect of the present application, there is provided an apparatus for improving data acquisition quality based on deep learning, comprising:
The setting module is configured to set a plurality of types of labels, and each type of label corresponds to a plurality of standard pictures;
The training module is configured to input all pictures corresponding to the multiple types of labels into a neural network and obtain scores through forward propagation; and is further configured to input the score into an error function to calculate an error, determine a gradient vector through back propagation, adjust the weights of each layer of neurons in the neural network via the gradient vector so that the error tends to zero or converges, and repeat the steps of calculating the error and adjusting the weights until the number of adjustments reaches a specified count or the average error loss no longer decreases, so as to obtain a trained model;
The acquisition module is configured to search for data by inputting labels, place the retrieved pictures of the same label under the same folder, input the pictures under all label folders into the trained model to obtain a confidence list for each label, determine the confidence median of each label according to the confidence at which misrecognition begins in its confidence list, sum the confidence medians of all labels and take the mean, and set the mean as the confidence threshold;
The adjustment module is configured to, after data collection is completed, calculate the confidence of the input label through the trained model, and if the confidence is higher than the confidence threshold, take the collected data as the data of the label.
Optionally, the training module is specifically configured to:
And inputting all pictures corresponding to the multi-type labels into a neural network, carrying out weighted accumulation on the input value of each neuron, inputting an activation function as the output value of the neuron, and obtaining a score through forward propagation.
Optionally, the training module is specifically configured to:
inputting the score into an error function and comparing it with an expected value to obtain an error; if a plurality of errors are obtained, summing them to obtain the total error.
Optionally, the acquisition module is specifically configured to:
checking the confidence list of each label, finding the confidence value at which the label starts to be misrecognized, and determining that confidence value as the confidence median of the label.
Optionally, the apparatus further comprises:
A verification module configured to raise the confidence threshold by a specified percentage when data verification is performed, and to delete the data if the collected data does not meet the raised confidence threshold.
According to yet another aspect of the present application there is provided a computing device comprising a memory, a processor and a computer program stored in the memory and executable by the processor, wherein the processor implements the method as described above when executing the computer program.
According to a further aspect of the present application there is provided a computer readable storage medium, preferably a non-volatile readable storage medium, having stored therein a computer program which when executed by a processor implements a method as described above.
According to yet another aspect of the present application, there is provided a computer program product comprising computer readable code which, when executed by a computer device, causes the computer device to perform the above-described method.
According to the technical scheme of the application, a trained model is obtained by setting multiple types of labels and training a neural network on several standard pictures with error-based adjustment. When collecting data, pictures are retrieved by inputting label search terms, pictures of the same label are placed under the same folder, and a confidence list and a confidence threshold are computed with the trained model. After data collection is completed, the computed confidence is compared with the confidence threshold to further screen the data. This greatly improves labeling efficiency and reduces the workload of labeling personnel; in addition, the per-label results reflect how accurately the trained model handles each label, allowing algorithm engineers to optimize the model in a targeted manner.
The above, as well as additional objectives, advantages, and features of the present application will become apparent to those skilled in the art from the following detailed description of a specific embodiment of the present application when read in conjunction with the accompanying drawings.
Drawings
Some specific embodiments of the application will be described in detail hereinafter by way of example and not by way of limitation with reference to the accompanying drawings. The same reference numbers will be used throughout the drawings to refer to the same or like parts or portions. It will be appreciated by those skilled in the art that the drawings are not necessarily drawn to scale. In the accompanying drawings:
FIG. 1 is a flow chart of a method for improving data acquisition quality based on deep learning in accordance with one embodiment of the present application;
FIG. 2 is a flow chart of a method for improving data acquisition quality based on deep learning in accordance with another embodiment of the present application;
FIG. 3 is a flow chart of determining a confidence threshold according to another embodiment of the present application;
FIG. 4 is a block diagram of an apparatus for improving data acquisition quality based on deep learning in accordance with another embodiment of the present application;
FIG. 5 is a block diagram of a computing device according to another embodiment of the application;
fig. 6 is a block diagram of a computer-readable storage medium according to another embodiment of the present application.
Detailed Description
FIG. 1 is a flow chart of a method for improving data acquisition quality based on deep learning in accordance with one embodiment of the present application. Referring to fig. 1, the method includes:
101: setting a plurality of types of labels, wherein each type of label corresponds to a plurality of standard pictures;
In this embodiment, the number of labels may be set as required; the specific number is not limited. A standard picture is one that typically represents a certain label and covers as many angles of that label as possible, which improves model accuracy. Each label may correspond to, for example, 3-5 standard pictures; again, the specific number is not limited.
102: Inputting all pictures corresponding to the multiple types of labels into a neural network, and obtaining scores through forward propagation;
Various types of neural networks exist and may be selected as needed; this embodiment is not limited to a specific one.
103: The score is input into an error function to calculate an error, a gradient vector is determined through back propagation, the weights of each layer of neurons in the neural network are adjusted via the gradient vector so that the error tends to zero or converges, and the steps of calculating the error and adjusting the weights are repeated until the number of adjustments reaches a specified count or the average error loss no longer decreases, yielding a trained model;
The error function (loss function) can prevent overfitting through a regularization penalty. The recognition quality of the model is judged by the calculated error: the error tending to zero or converging means the loss value is as small as possible and the model trains better. When determining the gradient vector, derivatives are taken backward through the error function and each activation function in the network, which ensures the error is minimized. Specifically, the adjustment that drives the error toward zero or convergence may be implemented with the SGD algorithm, which is not described in detail here.
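Step 103 above can be illustrated with a minimal pure-Python sketch: a forward pass produces a score, an error function compares it with the expected value, back propagation yields the gradient, and the weights are adjusted until a step limit is reached or the average loss no longer decreases. The single sigmoid neuron, the squared-error function, and all names here are illustrative assumptions; the embodiment does not prescribe a specific network, activation, or error function.

```python
import math

def train(samples, lr=0.5, max_steps=2000, tol=1e-9):
    """Toy SGD loop mirroring step 103 (hypothetical single neuron):
    forward pass -> error -> gradient via back propagation -> weight
    update, stopping at the step limit or when the average error loss
    no longer decreases."""
    n = len(samples[0][0])
    w = [0.0] * n
    b = 0.0
    prev_loss = float("inf")
    loss = prev_loss
    for step in range(max_steps):
        grad_w = [0.0] * n
        grad_b = 0.0
        loss = 0.0
        for x, target in samples:
            # forward propagation: weighted accumulation, then sigmoid activation
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            score = 1.0 / (1.0 + math.exp(-z))
            # error function: squared difference from the expected value
            loss += (score - target) ** 2
            # back propagation: chain rule through the activation function
            d = 2.0 * (score - target) * score * (1.0 - score)
            for i, xi in enumerate(x):
                grad_w[i] += d * xi
            grad_b += d
        loss /= len(samples)
        if prev_loss - loss < tol:  # average loss no longer decreases: stop
            break
        prev_loss = loss
        # adjust the weights along the negative gradient vector
        w = [wi - lr * gi / len(samples) for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b / len(samples)
    return w, b, loss
```

On a small linearly separable sample set, the loop drives the loss toward zero and stops early once the improvement per step falls below `tol`.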
104: When data are collected, data are searched by inputting labels, the retrieved pictures of the same label are placed under the same folder, the pictures under all label folders are input into the trained model to obtain a confidence list for each label, the confidence median of each label is determined according to the confidence at which misrecognition begins in its confidence list, the confidence medians of all labels are summed and averaged, and the mean is set as the confidence threshold;
105: after data collection is completed, the confidence of the input label is calculated through the trained model, and if the confidence is higher than the confidence threshold, the collected data is used as the data of the label.
In this embodiment, optionally, inputting all the pictures corresponding to the multiple types of labels into the neural network, and obtaining the score through forward propagation includes:
All pictures corresponding to the multiple types of labels are input into the neural network; the input values of each neuron are weighted and accumulated and then passed through an activation function to give the neuron's output value, and the score is obtained through forward propagation.
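The forward pass just described (weighted accumulation of each neuron's inputs, then an activation function giving the neuron's output) can be sketched as follows. The sigmoid activation and the function names are assumptions, since the embodiment leaves the activation unspecified.

```python
import math

def neuron_output(inputs, weights, bias):
    """One neuron as described above (illustrative): weighted
    accumulation of the inputs, then an activation function."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias  # weighted accumulation
    return 1.0 / (1.0 + math.exp(-z))                       # sigmoid activation

def forward(layers, x):
    """Forward propagation; each layer is a list of (weights, bias)
    pairs, and the final layer's outputs are the scores."""
    for layer in layers:
        x = [neuron_output(x, w, b) for w, b in layer]
    return x
```

For example, a two-input network with a two-neuron hidden layer and a single output neuron produces one score strictly between 0 and 1.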
In this embodiment, optionally, inputting the score into an error function to calculate the error includes:
The score is input into an error function and compared with an expected value to obtain an error; if a plurality of errors are obtained, they are summed to give the total error.
In this embodiment, optionally, determining the confidence median of each label according to the confidence of the misrecognized label in the confidence list includes:
The confidence list of each label is checked to find the confidence value at which the label starts to be misrecognized, and that confidence value is determined as the confidence median of the label.
In this embodiment, optionally, the method further includes:
When data verification is performed, the confidence threshold is raised by a specified percentage, and collected data that does not meet the raised threshold is deleted.
According to the method provided by this embodiment, a trained model is obtained by setting multiple types of labels and training a neural network on several standard pictures with error-based adjustment. When collecting data, pictures are retrieved by inputting label search terms, pictures of the same label are placed under the same folder, and a confidence list and a confidence threshold are computed with the trained model. After data collection is completed, the computed confidence is compared with the confidence threshold to further screen the data. This greatly improves labeling efficiency and reduces the workload of labeling personnel; in addition, the per-label results reflect how accurately the trained model handles each label, allowing algorithm engineers to optimize the model in a targeted manner.
FIG. 2 is a flow chart of a method for improving data acquisition quality based on deep learning in accordance with another embodiment of the present application. Referring to fig. 2, the method includes:
201: setting a plurality of types of labels, wherein each type of label corresponds to a plurality of standard pictures;
202: inputting all pictures corresponding to the multiple types of labels into a neural network, carrying out weighted accumulation on the input value of each neuron, inputting an activation function as the output value of the neuron, and obtaining a score through forward propagation;
203: inputting the score into an error function and comparing it with an expected value to obtain an error (if a plurality of errors are obtained, summing them); determining a gradient vector through back propagation; adjusting the weights of each layer of neurons in the neural network via the gradient vector so that the error tends to zero or converges; and repeating the steps of calculating the error and adjusting the weights until the number of adjustments reaches a specified count or the average error loss no longer decreases, yielding a trained model;
204: when collecting data, searching for data by inputting labels, placing the retrieved pictures of the same label under the same folder, and inputting the pictures under all label folders into the trained model to obtain a confidence list for each label; checking the confidence list of each label, finding the confidence value at which misrecognition begins, and determining that value as the label's confidence median; summing the confidence medians of all labels, taking the mean, and setting the mean as the confidence threshold;
The confidence lists obtained from the trained model may be as shown in Table 1 below.
TABLE 1
Directory name | File name | Label           | Confidence
Character one  | 001       | Character one   | 0.8
Character one  | 002       | Character one   | 0.5
Character one  | 003       | Character one   | 0.4
Character one  | 004       | Character three | 0.7
Character two  | 001       | Character two   | 0.7
Character two  | 002       | Character two   | 0.7
Character two  | 003       | Character two   | 0.4
For example, the confidence list obtained for the label "character one" contains 3 confidence values: 0.8, 0.5 and 0.4. Misrecognition begins at the confidence value 0.5; that is, only the picture corresponding to file 001 in the folder named "character one" is character one, while the pictures corresponding to files 002 and 003 are not. The confidence median of the label "character one" is therefore 0.5.
For another example, the confidence list obtained for the label "character two" contains 3 confidence values: 0.7, 0.7 and 0.4. Misrecognition begins at the confidence value 0.4; that is, the pictures corresponding to files 001 and 002 in the folder named "character two" are character two, while the picture corresponding to file 003 is not. The confidence median of the label "character two" is therefore 0.4.
In addition, in the folder named "character one" in Table 1, the picture corresponding to file 004 is character three. This usually happens because the search engine, when searching data for the label "character one", matches the picture description rather than the picture content. For instance, if a picture is described as "character one and character three acting in a play together", the search engine extracts both "character one" and "character three" as keywords; if the picture actually contains only character three, a picture without character one is retrieved and placed under the "character one" folder. In this case, the picture can be moved to the correct folder through the subsequent adjustment step. If the confidence median determined for the label "character three" is 0.6, then file 004 (confidence 0.7) is correctly identified.
In this embodiment, the confidence median serves as a classification boundary: data with confidence higher than the median very likely belongs to the label, while data with confidence lower than or equal to the median does not.
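The confidence medians and the threshold computed in step 204 can be sketched from the Table 1 example. The function names, and the fallback for a label whose list contains no misrecognized picture, are assumptions not stated in the embodiment.

```python
def confidence_median(confidences, correct):
    """Per step 204 (sketch): scan a label's confidence list together
    with whether each picture was correctly identified, and return the
    first confidence at which misrecognition begins."""
    for conf, ok in zip(confidences, correct):
        if not ok:
            return conf
    # assumption: if nothing was misrecognized, fall back to the lowest value
    return min(confidences)

def confidence_threshold(labels):
    """Sum the per-label confidence medians and take the mean; the mean
    is set as the confidence threshold."""
    medians = [confidence_median(c, ok) for c, ok in labels]
    return sum(medians) / len(medians)
```

With the Table 1 values, "character one" yields a median of 0.5 and "character two" yields 0.4, so the threshold for these two labels would be their mean, 0.45.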
205: After data collection is completed, the confidence of the input label is calculated through the trained model, and if the confidence is higher than the confidence threshold, the collected data is used as the data of the label;
In this step, if the calculated confidence is lower than or equal to the confidence threshold, the collected data is removed from the data corresponding to the label.
206: When data verification is performed, the confidence threshold is raised by a specified percentage, and collected data that does not meet the raised threshold is deleted.
The specified percentage value may be set as required, for example, 10% or 15%, etc., and is not particularly limited.
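Steps 205-206 (keep data whose confidence exceeds the threshold, then verify against a threshold raised by a specified percentage and delete what falls short) can be sketched together; the 10% default and all names here are illustrative assumptions.

```python
def screen(items, threshold, raise_pct=0.10):
    """Sketch of steps 205-206: `items` is a list of (name, confidence)
    pairs produced by the trained model for one label. Keep an item as
    the label's data only if its confidence exceeds the threshold; then,
    during verification, raise the threshold by `raise_pct` and delete
    items that do not meet the raised threshold."""
    kept = [(name, conf) for name, conf in items if conf > threshold]
    raised = threshold * (1.0 + raise_pct)          # e.g. 10% higher
    return [(name, conf) for name, conf in kept if conf > raised]
```

For instance, with a threshold of 0.45 raised by 10% to 0.495, an item at confidence 0.46 survives the initial screening but is deleted during verification.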
FIG. 3 is a flow chart of determining a confidence threshold according to another embodiment of the application. Referring to FIG. 3, during data collection, search terms for 3 labels are input, the retrieved pictures of each label are placed under the same folder, the pictures under the 3 label folders are input into the trained model to obtain 3 confidence lists, a confidence median is determined from each list, the 3 medians are summed and averaged, and the mean is set as the confidence threshold.
According to the method provided by this embodiment, a trained model is obtained by setting multiple types of labels and training a neural network on several standard pictures with error-based adjustment. When collecting data, pictures are retrieved by inputting label search terms, pictures of the same label are placed under the same folder, and a confidence list and a confidence threshold are computed with the trained model. After data collection is completed, the computed confidence is compared with the confidence threshold to further screen the data. This greatly improves labeling efficiency and reduces the workload of labeling personnel; in addition, the per-label results reflect how accurately the trained model handles each label, allowing algorithm engineers to optimize the model in a targeted manner.
Fig. 4 is a block diagram of an apparatus for improving data acquisition quality based on deep learning according to another embodiment of the present application. Referring to fig. 4, the apparatus includes:
a setting module 401 configured to set a plurality of types of tags, each type of tag corresponding to a plurality of standard pictures;
A training module 402 configured to input all pictures corresponding to the multiple types of labels into the neural network and obtain scores through forward propagation; and further configured to input the score into an error function to calculate an error, determine a gradient vector through back propagation, adjust the weights of each layer of neurons in the neural network via the gradient vector so that the error tends to zero or converges, and repeat the steps of calculating the error and adjusting the weights until the number of adjustments reaches a specified count or the average error loss no longer decreases, so as to obtain a trained model;
A collecting module 403 configured to, when collecting data, search for data by inputting labels, place the retrieved pictures of the same label under the same folder, input the pictures under all label folders into the trained model to obtain a confidence list for each label, determine the confidence median of each label according to the confidence at which misrecognition begins in its confidence list, sum the confidence medians of all labels and take the mean, and set the mean as the confidence threshold;
An adjustment module 404 configured to, after data collection is completed, calculate the confidence of the input label through the trained model, and if the confidence is higher than the confidence threshold, take the collected data as the data of the label.
In this embodiment, optionally, the training module is specifically configured to:
All pictures corresponding to the multiple types of labels are input into the neural network; the input values of each neuron are weighted and accumulated and then passed through an activation function to give the neuron's output value, and the score is obtained through forward propagation.
In this embodiment, optionally, the training module is specifically configured to:
The score is input into an error function and compared with an expected value to obtain an error; if a plurality of errors are obtained, they are summed to give the total error.
In this embodiment, optionally, the above-mentioned acquisition module is specifically configured to:
The confidence list of each label is examined to find the confidence at which the label first begins to be misrecognized, and that confidence is determined as the confidence median of the label.
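This rule can be sketched as a scan of the label's confidence list from high to low for the first misrecognized entry; the `(confidence, correctly_recognized)` tuple layout is an assumption for illustration:

```python
def confidence_median(confidences):
    """Walk a label's confidence list from high to low and return the
    confidence at which misrecognition first appears; that value is
    taken as the label's 'confidence median'. Entries are
    (confidence, correctly_recognized) tuples."""
    for conf, ok in sorted(confidences, key=lambda t: t[0], reverse=True):
        if not ok:
            return conf
    return None  # the label was never misrecognized
```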
In this embodiment, optionally, the apparatus further includes:
A checking module configured to raise the confidence threshold by a specified percentage when data checking is performed, and to delete any collected data that does not meet the raised confidence threshold.
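A minimal sketch of the checking step, assuming a 10% raise and a `>=` comparison (the text specifies neither):

```python
def verify(collected, threshold, raise_pct=10):
    """Sketch of the checking module: raise the confidence threshold by
    a specified percentage, then keep only the collected items that
    still meet the raised threshold (the rest would be deleted)."""
    raised = threshold * (1 + raise_pct / 100)
    return [item for item, conf in collected if conf >= raised]
```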
The device provided in this embodiment may perform the method provided in any of the above method embodiments; the detailed procedures are described in the method embodiments and are not repeated here.
With the device provided in this embodiment, a training model is obtained by setting multiple types of labels and training a neural network on error using multiple standard pictures per label. When collecting data, pictures of each label type are retrieved by tag search, and a confidence list and a confidence threshold are computed through the training model; after collection is complete, the computed confidence is compared with the threshold to further screen the data. This greatly improves labeling efficiency and reduces the workload of annotators; moreover, the per-label results reflect how accurately the training model handles each label, allowing algorithm engineers to optimize the model in a targeted way.
The above, as well as additional objectives, advantages, and features of the present application will become apparent to those skilled in the art from the following detailed description of a specific embodiment of the present application when read in conjunction with the accompanying drawings.
Embodiments of the present application also provide a computing device. Referring to fig. 5, it comprises a memory 1120, a processor 1110, and a computer program stored in the memory 1120 and executable by the processor 1110; the computer program is stored in a space 1130 for program code in the memory 1120 and, when executed by the processor 1110, carries out any of the method steps 1131 according to the present application.
The embodiment of the application also provides a computer-readable storage medium. Referring to fig. 6, the computer-readable storage medium includes a storage unit for program code, which is provided with a program 1131' for performing the method steps according to the present application; the program is executed by a processor.
Embodiments of the present application also provide a computer program product comprising instructions. The computer program product, when run on a computer, causes the computer to perform the method steps according to the application.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another, for example by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), etc.
Those of skill would further appreciate that the various illustrative units and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate the interchangeability of hardware and software, the various illustrative units and steps have been described above generally in terms of their function. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
Those of ordinary skill in the art will appreciate that all or part of the steps in implementing the methods of the above embodiments may be completed by a program instructing a processor, where the program may be stored in a computer-readable storage medium, the storage medium being a non-transitory medium such as a random access memory, read-only memory, flash memory, hard disk, solid state disk, magnetic tape, floppy disk, optical disc, or any combination thereof.
The present application is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present application are intended to be included in the scope of the present application. Therefore, the protection scope of the present application should be subject to the protection scope of the claims.

Claims (10)

1. A method for improving data acquisition quality based on deep learning, comprising:
Setting a plurality of types of labels, wherein each type of label corresponds to a plurality of standard pictures;
inputting all pictures corresponding to the multi-type labels into a neural network, and obtaining scores through forward propagation;
inputting the score into an error function to calculate an error, determining a gradient vector through back propagation, using the gradient vector to adjust the weights of each layer of neurons in the neural network so that the error tends toward zero or converges, and repeating the steps of calculating the error and adjusting the weights until the number of adjustments reaches a specified count or the average error loss no longer decreases, thereby obtaining a training model;
When data are collected, searching data by inputting tags, placing the searched pictures of the same type of tags under the same folder, inputting the pictures under all the tag folders into the training model to obtain a confidence coefficient list of each tag, determining a confidence coefficient median of each tag according to the confidence coefficient of the misidentified tag in the confidence coefficient list, summing the confidence coefficient median of all the tags, taking a mean value, and setting the mean value as a confidence coefficient threshold;
after the data acquisition is completed, calculating the confidence coefficient of the acquired data for the input label through the training model, and, if the confidence coefficient is higher than the confidence coefficient threshold, taking the acquired data as the data of the label.
2. The method according to claim 1, wherein inputting all pictures corresponding to the multiple types of labels into a neural network, obtaining a score by forward propagation, comprises:
And inputting all pictures corresponding to the multi-type labels into a neural network, carrying out weighted accumulation on the input value of each neuron, inputting an activation function as the output value of the neuron, and obtaining a score through forward propagation.
3. The method of claim 1, wherein inputting the score into an error function to calculate an error comprises:
inputting the score into an error function and comparing the score with an expected value to obtain an error; if a plurality of errors are obtained, summing them to obtain the error.
4. The method of claim 1, wherein determining a median confidence value for each tag based on confidence levels of misidentified tags in the confidence level list comprises:
checking the confidence coefficient list of each label, finding the confidence coefficient at which the label first begins to be misrecognized, and determining that confidence coefficient as the confidence coefficient median of the label.
5. The method according to any one of claims 1-4, further comprising:
when data checking is carried out, the confidence coefficient threshold value is increased by a designated percentage, and if the collected data does not accord with the increased confidence coefficient threshold value, the data is deleted.
6. An apparatus for improving data acquisition quality based on deep learning, comprising:
The setting module is configured to set a plurality of types of labels, and each type of label corresponds to a plurality of standard pictures;
The training module is configured to input all pictures corresponding to the multiple types of labels into a neural network and obtain a score through forward propagation; the training module is further configured to input the score into an error function to calculate an error, determine a gradient vector through back propagation, use the gradient vector to adjust the weights of each layer of neurons in the neural network so that the error tends toward zero or converges, and repeat the steps of calculating the error and adjusting the weights until the number of adjustments reaches a specified count or the average error loss no longer decreases, thereby obtaining a training model;
The acquisition module is configured to, when collecting data, search for data by inputting labels, place the retrieved pictures of the same type of label under the same folder, input the pictures under all label folders into the training model to obtain a confidence coefficient list for each label, determine a confidence coefficient median for each label according to the confidence coefficients of misrecognized entries in the confidence coefficient list, sum the confidence coefficient medians of all labels and take the mean, and set the mean as a confidence coefficient threshold;
And the adjustment module is configured to calculate, after the data acquisition is completed, the confidence coefficient of the acquired data for the input label through the training model, and, if the confidence coefficient is higher than the confidence coefficient threshold, to take the acquired data as the data of the label.
7. The apparatus of claim 6, wherein the training module is specifically configured to:
And inputting all pictures corresponding to the multi-type labels into a neural network, carrying out weighted accumulation on the input value of each neuron, inputting an activation function as the output value of the neuron, and obtaining a score through forward propagation.
8. The apparatus of claim 6, wherein the training module is specifically configured to:
inputting the score into an error function and comparing the score with an expected value to obtain an error; if a plurality of errors are obtained, summing them to obtain the error.
9. The apparatus of claim 6, wherein the acquisition module is specifically configured to:
checking the confidence coefficient list of each label, finding the confidence coefficient at which the label first begins to be misrecognized, and determining that confidence coefficient as the confidence coefficient median of the label.
10. The apparatus according to any one of claims 6-9, wherein the apparatus further comprises:
A verification module configured to raise the confidence threshold by a specified percentage when data verification is performed, and to delete the data if the collected data does not meet the raised confidence threshold.
CN202010987992.7A 2020-09-18 2020-09-18 Method and device for improving data acquisition quality based on deep learning Active CN112131415B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010987992.7A CN112131415B (en) 2020-09-18 2020-09-18 Method and device for improving data acquisition quality based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010987992.7A CN112131415B (en) 2020-09-18 2020-09-18 Method and device for improving data acquisition quality based on deep learning

Publications (2)

Publication Number Publication Date
CN112131415A CN112131415A (en) 2020-12-25
CN112131415B true CN112131415B (en) 2024-05-10

Family

ID=73841503

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010987992.7A Active CN112131415B (en) 2020-09-18 2020-09-18 Method and device for improving data acquisition quality based on deep learning

Country Status (1)

Country Link
CN (1) CN112131415B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112667616B (en) * 2020-12-31 2022-07-22 杭州趣链科技有限公司 Traffic data evaluation method and system based on block chain and electronic equipment

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108009589A (en) * 2017-12-12 2018-05-08 腾讯科技(深圳)有限公司 Sample data processing method, device and computer-readable recording medium
CN109345515A (en) * 2018-09-17 2019-02-15 代黎明 Sample label confidence calculations method, apparatus, equipment and model training method
CN111125365A (en) * 2019-12-24 2020-05-08 京东数字科技控股有限公司 Address data labeling method and device, electronic equipment and storage medium
CN111291809A (en) * 2020-02-03 2020-06-16 华为技术有限公司 Processing device, method and storage medium
CN111309912A (en) * 2020-02-24 2020-06-19 深圳市华云中盛科技股份有限公司 Text classification method and device, computer equipment and storage medium
CN111344712A (en) * 2017-06-04 2020-06-26 去识别化有限公司 System and method for image de-identification
CN111353549A (en) * 2020-03-10 2020-06-30 创新奇智(重庆)科技有限公司 Image tag verification method and device, electronic device and storage medium
WO2020160664A1 (en) * 2019-02-06 2020-08-13 The University Of British Columbia Neural network image analysis

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111344712A (en) * 2017-06-04 2020-06-26 去识别化有限公司 System and method for image de-identification
CN108009589A (en) * 2017-12-12 2018-05-08 腾讯科技(深圳)有限公司 Sample data processing method, device and computer-readable recording medium
CN109345515A (en) * 2018-09-17 2019-02-15 代黎明 Sample label confidence calculations method, apparatus, equipment and model training method
WO2020160664A1 (en) * 2019-02-06 2020-08-13 The University Of British Columbia Neural network image analysis
CN111125365A (en) * 2019-12-24 2020-05-08 京东数字科技控股有限公司 Address data labeling method and device, electronic equipment and storage medium
CN111291809A (en) * 2020-02-03 2020-06-16 华为技术有限公司 Processing device, method and storage medium
CN111309912A (en) * 2020-02-24 2020-06-19 深圳市华云中盛科技股份有限公司 Text classification method and device, computer equipment and storage medium
CN111353549A (en) * 2020-03-10 2020-06-30 创新奇智(重庆)科技有限公司 Image tag verification method and device, electronic device and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Pinar Donmez 等.Efficiently learning the accuracy of labeling sources for selective sampling.《KDD '09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining》.2009,259-268. *
Research on Automatic Image Annotation Methods Based on Semi-supervised Learning; Lin Lan; China Master's Theses Full-text Database, Information Science and Technology (No. 01); I138-3982 *

Also Published As

Publication number Publication date
CN112131415A (en) 2020-12-25

Similar Documents

Publication Publication Date Title
TWI689871B (en) Gradient lifting decision tree (GBDT) model feature interpretation method and device
JP3209163B2 (en) Classifier
CN109634924B (en) File system parameter automatic tuning method and system based on machine learning
CN110852755A (en) User identity identification method and device for transaction scene
CN111723870B (en) Artificial intelligence-based data set acquisition method, apparatus, device and medium
CN111368867B (en) File classifying method and system and computer readable storage medium
CN109871891B (en) Object identification method and device and storage medium
CN110377739A (en) Text sentiment classification method, readable storage medium storing program for executing and electronic equipment
CN112528022A (en) Method for extracting characteristic words corresponding to theme categories and identifying text theme categories
CN112131415B (en) Method and device for improving data acquisition quality based on deep learning
CN114329022A (en) Method for training erotic classification model, method for detecting image and related device
CN111563361B (en) Text label extraction method and device and storage medium
CN113688263B (en) Method, computing device, and storage medium for searching for image
CN116049528A (en) Evaluation method and device of search system, electronic equipment and readable storage medium
CN113656575B (en) Training data generation method and device, electronic equipment and readable medium
CN114118305A (en) Sample screening method, device, equipment and computer medium
CN113792879A (en) Case reasoning attribute weight adjusting method based on introspection learning
CN111813975A (en) Image retrieval method and device and electronic equipment
CN115860012B (en) User intention recognition method, device, electronic equipment and medium
CN117131920B (en) Model pruning method based on network structure search
CN113139076B (en) Automatic neural network image marking method for deep feature learning multi-label
CN112069806A (en) Resume screening method and device, electronic equipment and storage medium
CN117520754B (en) Pretreatment system for model training data
CN114463330B (en) CT data collection system, method and storage medium
KR102399833B1 (en) synopsis production service providing apparatus using log line based on artificial neural network and method therefor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant