CN112131415B - Method and device for improving data acquisition quality based on deep learning - Google Patents


Info

Publication number
CN112131415B
CN112131415B
Authority
CN
China
Prior art keywords
confidence coefficient
label
data
error
confidence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010987992.7A
Other languages
Chinese (zh)
Other versions
CN112131415A (en)
Inventor
秦浩达
Current Assignee
Beijing Moviebook Science And Technology Co ltd
Original Assignee
Beijing Moviebook Science And Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Moviebook Science And Technology Co ltd
Priority to CN202010987992.7A
Publication of CN112131415A
Application granted
Publication of CN112131415B

Classifications

    • G06F16/53 — Querying (information retrieval of still image data)
    • G06F16/5866 — Retrieval characterised by using manually generated metadata, e.g. tags, keywords, comments
    • G06N3/08 — Learning methods for neural networks
    • G06N3/084 — Backpropagation, e.g. using gradient descent


Abstract

The application discloses a method and a device for improving data acquisition quality based on deep learning, relating to the field of data acquisition. The method comprises the following steps: setting multiple types of labels, each type of label corresponding to several standard pictures; inputting all pictures corresponding to the labels into a neural network and obtaining a trained model through error-based adjustment; when collecting data, searching for data by inputting the labels, feeding the results into the trained model to obtain a confidence list for each label, and determining a confidence threshold; after data collection is completed, calculating the confidence of the input label through the trained model, and if the confidence is higher than the confidence threshold, keeping the collected data as the data of that label. The device comprises: a setting module, a training module, an acquisition module and an adjustment module. The application improves labeling efficiency and reduces the workload of labeling personnel.

Description

Method and device for improving data acquisition quality based on deep learning
Technical Field
The application relates to the field of data acquisition, in particular to a method and a device for improving data acquisition quality based on deep learning.
Background
It is well known that a large amount of data is required to support the training of neural network models. The recognition accuracy of a model depends not only on the quality of the algorithm but also on the accuracy of the data. If a picture of a horse is placed under a "sheep" label, the horse is very likely to be misidentified as a sheep after model training is completed. Some internet companies invest a great deal of money and manpower in data preparation, and because people perceive things differently, pictures that sit close to a label boundary receive different labeling results. How to free as much manpower as possible from tedious labeling work has long been a difficulty in data labeling.
In the early days of artificial intelligence, when a new business required data, the first source was an internet search engine. A phenomenon is commonly observed when using a search engine: after a keyword query is entered, only the top-ranked pictures meet the requirement. Because of the search engine's algorithm and the picture labels, only a very small number of top-ranked pictures meet the training requirement, and part of the large number of unusable pictures is actually data needed under other labels. The recognition capability of a data acquisition model is therefore not reliable, the recognition effect of each label needs to be checked manually, and misrecognition can still occur with small probability.
Disclosure of Invention
The present application aims to overcome or at least partially solve or alleviate the above-mentioned problems.
According to one aspect of the present application, there is provided a method for improving data acquisition quality based on deep learning, comprising:
Setting a plurality of types of labels, wherein each type of label corresponds to a plurality of standard pictures;
inputting all pictures corresponding to the multiple types of labels into a neural network, and obtaining scores through forward propagation;
inputting the score into an error function to calculate an error, determining a gradient vector through back propagation, adjusting the weights of each layer of neurons in the neural network via the gradient vector so that the error tends to zero or converges, and repeating the steps of calculating the error and adjusting the weights until the number of adjustments reaches a specified count or the average error loss no longer decreases, so as to obtain a trained model;
when collecting data, searching for data by inputting labels, placing the retrieved pictures of the same label under the same folder, inputting the pictures under all label folders into the trained model to obtain a confidence list for each label, determining the confidence median of each label according to the confidence at which misrecognition begins in its confidence list, summing the confidence medians of all labels and taking the mean, and setting the mean as the confidence threshold;
after data collection is completed, the confidence of the input label is calculated through the trained model, and if the confidence is higher than the confidence threshold, the collected data is used as the data of the label.
Optionally, inputting all pictures corresponding to the multiple types of labels into a neural network, and obtaining a score through forward propagation, including:
inputting all pictures corresponding to the multiple types of labels into a neural network, performing weighted accumulation on the input values of each neuron and passing the result through an activation function to obtain the neuron's output value, and obtaining a score through forward propagation.
Optionally, inputting the score into an error function to calculate an error includes:
inputting the score into an error function and comparing it with an expected value to obtain an error; if a plurality of errors are obtained, summing them to obtain the total error.
Optionally, determining the confidence median of each label according to the confidence of the misrecognized label in the confidence list includes:
checking the confidence list of each label, finding the confidence value at which the label starts to be misrecognized, and determining that confidence value as the confidence median of the label.
Optionally, the method further comprises:
when data verification is performed, raising the confidence threshold by a specified percentage, and deleting collected data that does not meet the raised confidence threshold.
According to another aspect of the present application, there is provided an apparatus for improving data acquisition quality based on deep learning, comprising:
The setting module is configured to set a plurality of types of labels, and each type of label corresponds to a plurality of standard pictures;
The training module is configured to input all pictures corresponding to the multiple types of labels into a neural network and obtain scores through forward propagation; and is further configured to input the score into an error function to calculate an error, determine a gradient vector through back propagation, adjust the weights of each layer of neurons in the neural network via the gradient vector so that the error tends to zero or converges, and repeat the steps of calculating the error and adjusting the weights until the number of adjustments reaches a specified count or the average error loss no longer decreases, so as to obtain a trained model;
The acquisition module is configured to search for data by inputting labels, place the retrieved pictures of the same label under the same folder, input the pictures under all label folders into the trained model to obtain a confidence list for each label, determine the confidence median of each label according to the confidence at which misrecognition begins in its confidence list, sum the confidence medians of all labels and take the mean, and set the mean as the confidence threshold;
The adjustment module is configured to, after data collection is completed, calculate the confidence of the input label through the trained model, and if the confidence is higher than the confidence threshold, take the collected data as the data of the label.
Optionally, the training module is specifically configured to:
And inputting all pictures corresponding to the multi-type labels into a neural network, carrying out weighted accumulation on the input value of each neuron, inputting an activation function as the output value of the neuron, and obtaining a score through forward propagation.
Optionally, the training module is specifically configured to:
inputting the score into an error function and comparing it with an expected value to obtain an error; if a plurality of errors are obtained, summing them to obtain the total error.
Optionally, the acquisition module is specifically configured to:
checking the confidence list of each label, finding the confidence value at which the label starts to be misrecognized, and determining that confidence value as the confidence median of the label.
Optionally, the apparatus further comprises:
A verification module configured to raise the confidence threshold by a specified percentage when data verification is performed, and to delete the data if the collected data does not meet the raised confidence threshold.
According to yet another aspect of the present application there is provided a computing device comprising a memory, a processor and a computer program stored in the memory and executable by the processor, wherein the processor implements the method as described above when executing the computer program.
According to a further aspect of the present application there is provided a computer readable storage medium, preferably a non-volatile readable storage medium, having stored therein a computer program which when executed by a processor implements a method as described above.
According to yet another aspect of the present application, there is provided a computer program product comprising computer readable code which, when executed by a computer device, causes the computer device to perform the above-described method.
According to the technical scheme of the application, a trained model is obtained by setting multiple types of labels and training a neural network on several standard pictures with error-based adjustment. When collecting data, pictures are retrieved by inputting label search terms, pictures of the same label are placed under the same folder, and a confidence list and a confidence threshold are computed with the trained model. After data collection is completed, the computed confidence is compared with the confidence threshold to further screen the data. This greatly improves labeling efficiency and reduces the workload of labeling personnel; in addition, the per-label results reflect how accurately the trained model handles each label, allowing algorithm engineers to optimize the model in a targeted manner.
The above, as well as additional objectives, advantages, and features of the present application will become apparent to those skilled in the art from the following detailed description of a specific embodiment of the present application when read in conjunction with the accompanying drawings.
Drawings
Some specific embodiments of the application will be described in detail hereinafter by way of example and not by way of limitation with reference to the accompanying drawings. The same reference numbers will be used throughout the drawings to refer to the same or like parts or portions. It will be appreciated by those skilled in the art that the drawings are not necessarily drawn to scale. In the accompanying drawings:
FIG. 1 is a flow chart of a method for improving data acquisition quality based on deep learning in accordance with one embodiment of the present application;
FIG. 2 is a flow chart of a method for improving data acquisition quality based on deep learning in accordance with another embodiment of the present application;
FIG. 3 is a flow chart of determining a confidence threshold according to another embodiment of the present application;
FIG. 4 is a block diagram of an apparatus for improving data acquisition quality based on deep learning in accordance with another embodiment of the present application;
FIG. 5 is a block diagram of a computing device according to another embodiment of the application;
fig. 6 is a block diagram of a computer-readable storage medium according to another embodiment of the present application.
Detailed Description
FIG. 1 is a flow chart of a method for improving data acquisition quality based on deep learning in accordance with one embodiment of the present application. Referring to fig. 1, the method includes:
101: setting a plurality of types of labels, wherein each type of label corresponds to a plurality of standard pictures;
In this embodiment, the number of labels may be set as required; the specific number is not limited. A standard picture is one that typically represents a certain label and covers as many angles of that label as possible, which improves model accuracy. Each label may correspond to, for example, 3-5 standard pictures; again, the specific number is not limited.
102: Inputting all pictures corresponding to the multiple types of labels into a neural network, and obtaining scores through forward propagation;
Various types of neural networks exist and may be selected as needed; this embodiment is not limited to a specific one.
103: The score is input into an error function to calculate an error, a gradient vector is determined through back propagation, the weights of each layer of neurons in the neural network are adjusted via the gradient vector so that the error tends to zero or converges, and the steps of calculating the error and adjusting the weights are repeated until the number of adjustments reaches a specified count or the average error loss no longer decreases, yielding a trained model;
The error function (loss function) can prevent overfitting through a regularization penalty. The recognition quality of the model is judged by the calculated error: the error tending to zero or converging means the loss value is as small as possible and the model trains better. When determining the gradient vector, derivatives are taken backward through the error function and each activation function in the network, which ensures the error is minimized. Specifically, the adjustment that drives the error toward zero or convergence may be implemented with the SGD algorithm, which is not described in detail here.
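Step 103 above can be illustrated with a minimal pure-Python sketch: a forward pass produces a score, an error function compares it with the expected value, back propagation yields the gradient, and the weights are adjusted until a step limit is reached or the average loss no longer decreases. The single sigmoid neuron, the squared-error function, and all names here are illustrative assumptions; the embodiment does not prescribe a specific network, activation, or error function.

```python
import math

def train(samples, lr=0.5, max_steps=2000, tol=1e-9):
    """Toy SGD loop mirroring step 103 (hypothetical single neuron):
    forward pass -> error -> gradient via back propagation -> weight
    update, stopping at the step limit or when the average error loss
    no longer decreases."""
    n = len(samples[0][0])
    w = [0.0] * n
    b = 0.0
    prev_loss = float("inf")
    loss = prev_loss
    for step in range(max_steps):
        grad_w = [0.0] * n
        grad_b = 0.0
        loss = 0.0
        for x, target in samples:
            # forward propagation: weighted accumulation, then sigmoid activation
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            score = 1.0 / (1.0 + math.exp(-z))
            # error function: squared difference from the expected value
            loss += (score - target) ** 2
            # back propagation: chain rule through the activation function
            d = 2.0 * (score - target) * score * (1.0 - score)
            for i, xi in enumerate(x):
                grad_w[i] += d * xi
            grad_b += d
        loss /= len(samples)
        if prev_loss - loss < tol:  # average loss no longer decreases: stop
            break
        prev_loss = loss
        # adjust the weights along the negative gradient vector
        w = [wi - lr * gi / len(samples) for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b / len(samples)
    return w, b, loss
```

On a small linearly separable sample set, the loop drives the loss toward zero and stops early once the improvement per step falls below `tol`.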
104: When data are collected, data are searched by inputting labels, the retrieved pictures of the same label are placed under the same folder, the pictures under all label folders are input into the trained model to obtain a confidence list for each label, the confidence median of each label is determined according to the confidence at which misrecognition begins in its confidence list, the confidence medians of all labels are summed and averaged, and the mean is set as the confidence threshold;
105: after data collection is completed, the confidence of the input label is calculated through the trained model, and if the confidence is higher than the confidence threshold, the collected data is used as the data of the label.
In this embodiment, optionally, inputting all the pictures corresponding to the multiple types of labels into the neural network, and obtaining the score through forward propagation includes:
All pictures corresponding to the multiple types of labels are input into the neural network; the input values of each neuron are weighted and accumulated and then passed through an activation function to give the neuron's output value, and the score is obtained through forward propagation.
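The forward pass just described (weighted accumulation of each neuron's inputs, then an activation function giving the neuron's output) can be sketched as follows. The sigmoid activation and the function names are assumptions, since the embodiment leaves the activation unspecified.

```python
import math

def neuron_output(inputs, weights, bias):
    """One neuron as described above (illustrative): weighted
    accumulation of the inputs, then an activation function."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias  # weighted accumulation
    return 1.0 / (1.0 + math.exp(-z))                       # sigmoid activation

def forward(layers, x):
    """Forward propagation; each layer is a list of (weights, bias)
    pairs, and the final layer's outputs are the scores."""
    for layer in layers:
        x = [neuron_output(x, w, b) for w, b in layer]
    return x
```

For example, a two-input network with a two-neuron hidden layer and a single output neuron produces one score strictly between 0 and 1.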
In this embodiment, optionally, inputting the score into an error function to calculate the error includes:
The score is input into an error function and compared with an expected value to obtain an error; if a plurality of errors are obtained, they are summed to give the total error.
In this embodiment, optionally, determining the confidence median of each label according to the confidence of the misrecognized label in the confidence list includes:
The confidence list of each label is checked to find the confidence value at which the label starts to be misrecognized, and that confidence value is determined as the confidence median of the label.
In this embodiment, optionally, the method further includes:
When data verification is performed, the confidence threshold is raised by a specified percentage, and collected data that does not meet the raised threshold is deleted.
According to the method provided by this embodiment, a trained model is obtained by setting multiple types of labels and training a neural network on several standard pictures with error-based adjustment. When collecting data, pictures are retrieved by inputting label search terms, pictures of the same label are placed under the same folder, and a confidence list and a confidence threshold are computed with the trained model. After data collection is completed, the computed confidence is compared with the confidence threshold to further screen the data. This greatly improves labeling efficiency and reduces the workload of labeling personnel; in addition, the per-label results reflect how accurately the trained model handles each label, allowing algorithm engineers to optimize the model in a targeted manner.
FIG. 2 is a flow chart of a method for improving data acquisition quality based on deep learning in accordance with another embodiment of the present application. Referring to fig. 2, the method includes:
201: setting a plurality of types of labels, wherein each type of label corresponds to a plurality of standard pictures;
202: inputting all pictures corresponding to the multiple types of labels into a neural network, carrying out weighted accumulation on the input value of each neuron, inputting an activation function as the output value of the neuron, and obtaining a score through forward propagation;
203: inputting the score into an error function and comparing it with an expected value to obtain an error (if a plurality of errors are obtained, summing them); determining a gradient vector through back propagation; adjusting the weights of each layer of neurons in the neural network via the gradient vector so that the error tends to zero or converges; and repeating the steps of calculating the error and adjusting the weights until the number of adjustments reaches a specified count or the average error loss no longer decreases, yielding a trained model;
204: when collecting data, searching for data by inputting labels, placing the retrieved pictures of the same label under the same folder, and inputting the pictures under all label folders into the trained model to obtain a confidence list for each label; checking the confidence list of each label, finding the confidence value at which misrecognition begins, and determining that value as the label's confidence median; summing the confidence medians of all labels, taking the mean, and setting the mean as the confidence threshold;
The confidence lists obtained from the trained model may be as shown in Table 1 below.
TABLE 1
Directory name | File name | Label           | Confidence
Character one  | 001       | Character one   | 0.8
Character one  | 002       | Character one   | 0.5
Character one  | 003       | Character one   | 0.4
Character one  | 004       | Character three | 0.7
Character two  | 001       | Character two   | 0.7
Character two  | 002       | Character two   | 0.7
Character two  | 003       | Character two   | 0.4
For example, the confidence list obtained for the label "character one" contains 3 confidence values: 0.8, 0.5 and 0.4. Misrecognition begins at the confidence value 0.5; that is, only the picture corresponding to file 001 in the folder named "character one" is character one, while the pictures corresponding to files 002 and 003 are not. The confidence median of the label "character one" is therefore 0.5.
For another example, the confidence list obtained for the label "character two" contains 3 confidence values: 0.7, 0.7 and 0.4. Misrecognition begins at the confidence value 0.4; that is, the pictures corresponding to files 001 and 002 in the folder named "character two" are character two, while the picture corresponding to file 003 is not. The confidence median of the label "character two" is therefore 0.4.
In addition, in the folder named "character one" in Table 1, the picture corresponding to file 004 is character three. This usually happens because the search engine, when searching data for the label "character one", matches the picture description rather than the picture content. For instance, if a picture is described as "character one and character three acting in a play together", the search engine extracts both "character one" and "character three" as keywords; if the picture actually contains only character three, a picture without character one is retrieved and placed under the "character one" folder. In this case, the picture can be moved to the correct folder through the subsequent adjustment step. If the confidence median determined for the label "character three" is 0.6, then file 004 (confidence 0.7) is correctly identified.
In this embodiment, the confidence median serves as a classification boundary: data with confidence higher than the median very likely belongs to the label, while data with confidence lower than or equal to the median does not.
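The confidence medians and the threshold computed in step 204 can be sketched from the Table 1 example. The function names, and the fallback for a label whose list contains no misrecognized picture, are assumptions not stated in the embodiment.

```python
def confidence_median(confidences, correct):
    """Per step 204 (sketch): scan a label's confidence list together
    with whether each picture was correctly identified, and return the
    first confidence at which misrecognition begins."""
    for conf, ok in zip(confidences, correct):
        if not ok:
            return conf
    # assumption: if nothing was misrecognized, fall back to the lowest value
    return min(confidences)

def confidence_threshold(labels):
    """Sum the per-label confidence medians and take the mean; the mean
    is set as the confidence threshold."""
    medians = [confidence_median(c, ok) for c, ok in labels]
    return sum(medians) / len(medians)
```

With the Table 1 values, "character one" yields a median of 0.5 and "character two" yields 0.4, so the threshold for these two labels would be their mean, 0.45.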
205: After data collection is completed, the confidence of the input label is calculated through the trained model, and if the confidence is higher than the confidence threshold, the collected data is used as the data of the label;
In this step, if the calculated confidence is lower than or equal to the confidence threshold, the collected data is removed from the data corresponding to the label.
206: When data verification is performed, the confidence threshold is raised by a specified percentage, and collected data that does not meet the raised threshold is deleted.
The specified percentage value may be set as required, for example, 10% or 15%, etc., and is not particularly limited.
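Steps 205-206 (keep data whose confidence exceeds the threshold, then verify against a threshold raised by a specified percentage and delete what falls short) can be sketched together; the 10% default and all names here are illustrative assumptions.

```python
def screen(items, threshold, raise_pct=0.10):
    """Sketch of steps 205-206: `items` is a list of (name, confidence)
    pairs produced by the trained model for one label. Keep an item as
    the label's data only if its confidence exceeds the threshold; then,
    during verification, raise the threshold by `raise_pct` and delete
    items that do not meet the raised threshold."""
    kept = [(name, conf) for name, conf in items if conf > threshold]
    raised = threshold * (1.0 + raise_pct)          # e.g. 10% higher
    return [(name, conf) for name, conf in kept if conf > raised]
```

For instance, with a threshold of 0.45 raised by 10% to 0.495, an item at confidence 0.46 survives the initial screening but is deleted during verification.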
FIG. 3 is a flow chart of determining a confidence threshold according to another embodiment of the application. Referring to FIG. 3, during data collection, search terms for 3 labels are input, the retrieved pictures of each label are placed under the same folder, the pictures under the 3 label folders are input into the trained model to obtain 3 confidence lists, a confidence median is determined from each list, the 3 medians are summed and averaged, and the mean is set as the confidence threshold.
According to the method provided by this embodiment, a trained model is obtained by setting multiple types of labels and training a neural network on several standard pictures with error-based adjustment. When collecting data, pictures are retrieved by inputting label search terms, pictures of the same label are placed under the same folder, and a confidence list and a confidence threshold are computed with the trained model. After data collection is completed, the computed confidence is compared with the confidence threshold to further screen the data. This greatly improves labeling efficiency and reduces the workload of labeling personnel; in addition, the per-label results reflect how accurately the trained model handles each label, allowing algorithm engineers to optimize the model in a targeted manner.
Fig. 4 is a block diagram of an apparatus for improving data acquisition quality based on deep learning according to another embodiment of the present application. Referring to fig. 4, the apparatus includes:
a setting module 401 configured to set a plurality of types of tags, each type of tag corresponding to a plurality of standard pictures;
A training module 402 configured to input all pictures corresponding to the multiple types of labels into the neural network and obtain scores through forward propagation; and further configured to input the score into an error function to calculate an error, determine a gradient vector through back propagation, adjust the weights of each layer of neurons in the neural network via the gradient vector so that the error tends to zero or converges, and repeat the steps of calculating the error and adjusting the weights until the number of adjustments reaches a specified count or the average error loss no longer decreases, so as to obtain a trained model;
A collecting module 403 configured to, when collecting data, search for data by inputting labels, place the retrieved pictures of the same label under the same folder, input the pictures under all label folders into the trained model to obtain a confidence list for each label, determine the confidence median of each label according to the confidence at which misrecognition begins in its confidence list, sum the confidence medians of all labels and take the mean, and set the mean as the confidence threshold;
An adjustment module 404 configured to, after data collection is completed, calculate the confidence of the input label through the trained model, and if the confidence is higher than the confidence threshold, take the collected data as the data of the label.
In this embodiment, optionally, the training module is specifically configured to:
All pictures corresponding to the multiple types of labels are input into the neural network; the input values of each neuron are weighted and accumulated and then passed through an activation function to give the neuron's output value, and the score is obtained through forward propagation.
In this embodiment, optionally, the training module is specifically configured to:
The score is input into an error function and compared with an expected value to obtain an error; if a plurality of errors are obtained, they are summed to give the total error.
In this embodiment, optionally, the above-mentioned acquisition module is specifically configured to:
The confidence list of each label is examined to find the confidence at which the label first begins to be misrecognized, and that confidence is determined as the confidence median of the label.
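This rule can be sketched as a scan of the label's confidence list from high to low for the first misrecognized entry; the `(confidence, correctly_recognized)` tuple layout is an assumption for illustration:

```python
def confidence_median(confidences):
    """Walk a label's confidence list from high to low and return the
    confidence at which misrecognition first appears; that value is
    taken as the label's 'confidence median'. Entries are
    (confidence, correctly_recognized) tuples."""
    for conf, ok in sorted(confidences, key=lambda t: t[0], reverse=True):
        if not ok:
            return conf
    return None  # the label was never misrecognized
```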
In this embodiment, optionally, the apparatus further includes:
A checking module configured to raise the confidence threshold by a specified percentage when data checking is performed, and to delete any collected data that does not meet the raised confidence threshold.
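A minimal sketch of the checking step, assuming a 10% raise and a `>=` comparison (the text specifies neither):

```python
def verify(collected, threshold, raise_pct=10):
    """Sketch of the checking module: raise the confidence threshold by
    a specified percentage, then keep only the collected items that
    still meet the raised threshold (the rest would be deleted)."""
    raised = threshold * (1 + raise_pct / 100)
    return [item for item, conf in collected if conf >= raised]
```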
The device provided in this embodiment may perform the method provided in any of the above method embodiments; the detailed procedures are described in the method embodiments and are not repeated here.
With the device provided in this embodiment, a training model is obtained by setting multiple types of labels and training a neural network on error using multiple standard pictures per label. When collecting data, pictures of each label type are retrieved by tag search, and a confidence list and a confidence threshold are computed through the training model; after collection is complete, the computed confidence is compared with the threshold to further screen the data. This greatly improves labeling efficiency and reduces the workload of annotators; moreover, the per-label results reflect how accurately the training model handles each label, allowing algorithm engineers to optimize the model in a targeted way.
The above, as well as additional objectives, advantages, and features of the present application will become apparent to those skilled in the art from the following detailed description of a specific embodiment of the present application when read in conjunction with the accompanying drawings.
Embodiments of the present application also provide a computing device. Referring to fig. 5, it comprises a memory 1120, a processor 1110, and a computer program stored in the memory 1120 and executable by the processor 1110; the computer program is stored in a space 1130 for program code in the memory 1120 and, when executed by the processor 1110, carries out any of the method steps 1131 according to the present application.
The embodiment of the application also provides a computer-readable storage medium. Referring to fig. 6, the computer-readable storage medium includes a storage unit for program code, which is provided with a program 1131' for performing the method steps according to the present application; the program is executed by a processor.
Embodiments of the present application also provide a computer program product comprising instructions. The computer program product, when run on a computer, causes the computer to perform the method steps according to the application.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another, for example by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), etc.
Those of skill would further appreciate that the various illustrative units and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate the interchangeability of hardware and software, the various illustrative units and steps have been described above generally in terms of their function. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
Those of ordinary skill in the art will appreciate that all or part of the steps in implementing the methods of the above embodiments may be completed by a program instructing a processor, where the program may be stored in a computer-readable storage medium, the storage medium being a non-transitory medium such as a random access memory, read-only memory, flash memory, hard disk, solid state disk, magnetic tape, floppy disk, optical disc, or any combination thereof.
The present application is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present application are intended to be included in the scope of the present application. Therefore, the protection scope of the present application should be subject to the protection scope of the claims.

Claims (10)

1. A method for improving data acquisition quality based on deep learning, comprising:
Setting a plurality of types of labels, wherein each type of label corresponds to a plurality of standard pictures;
inputting all pictures corresponding to the multi-type labels into a neural network, and obtaining scores through forward propagation;
inputting the score into an error function to calculate an error, determining a gradient vector through back propagation, using the gradient vector to adjust the weights of each layer of neurons in the neural network so that the error tends toward zero or converges, and repeating the steps of calculating the error and adjusting the weights until the number of adjustments reaches a specified count or the average error loss no longer decreases, thereby obtaining a training model;
When data are collected, searching data by inputting tags, placing the searched pictures of the same type of tags under the same folder, inputting the pictures under all the tag folders into the training model to obtain a confidence coefficient list of each tag, determining a confidence coefficient median of each tag according to the confidence coefficient of the misidentified tag in the confidence coefficient list, summing the confidence coefficient median of all the tags, taking a mean value, and setting the mean value as a confidence coefficient threshold;
after the data acquisition is completed, calculating the confidence coefficient of the acquired data for the input label through the training model, and, if the confidence coefficient is higher than the confidence coefficient threshold, taking the acquired data as the data of the label.
2. The method according to claim 1, wherein inputting all pictures corresponding to the multiple types of labels into a neural network, obtaining a score by forward propagation, comprises:
And inputting all pictures corresponding to the multi-type labels into a neural network, carrying out weighted accumulation on the input value of each neuron, inputting an activation function as the output value of the neuron, and obtaining a score through forward propagation.
3. The method of claim 1, wherein inputting the score into an error function to calculate an error comprises:
inputting the score into an error function and comparing the score with an expected value to obtain an error; if a plurality of errors are obtained, summing them to obtain the error.
4. The method of claim 1, wherein determining a median confidence value for each tag based on confidence levels of misidentified tags in the confidence level list comprises:
checking the confidence coefficient list of each label, finding the confidence coefficient at which the label first begins to be misrecognized, and determining that confidence coefficient as the confidence coefficient median of the label.
5. The method according to any one of claims 1-4, further comprising:
when data checking is carried out, the confidence coefficient threshold value is increased by a designated percentage, and if the collected data does not accord with the increased confidence coefficient threshold value, the data is deleted.
6. An apparatus for improving data acquisition quality based on deep learning, comprising:
The setting module is configured to set a plurality of types of labels, and each type of label corresponds to a plurality of standard pictures;
The training module is configured to input all pictures corresponding to the multiple types of labels into a neural network and obtain a score through forward propagation; the training module is further configured to input the score into an error function to calculate an error, determine a gradient vector through back propagation, use the gradient vector to adjust the weights of each layer of neurons in the neural network so that the error tends toward zero or converges, and repeat the steps of calculating the error and adjusting the weights until the number of adjustments reaches a specified count or the average error loss no longer decreases, thereby obtaining a training model;
The acquisition module is configured to, when collecting data, search for data by inputting labels, place the retrieved pictures of the same type of label under the same folder, input the pictures under all label folders into the training model to obtain a confidence coefficient list for each label, determine a confidence coefficient median for each label according to the confidence coefficients of misrecognized entries in the confidence coefficient list, sum the confidence coefficient medians of all labels and take the mean, and set the mean as a confidence coefficient threshold;
And the adjustment module is configured to calculate, after the data acquisition is completed, the confidence coefficient of the acquired data for the input label through the training model, and, if the confidence coefficient is higher than the confidence coefficient threshold, to take the acquired data as the data of the label.
7. The apparatus of claim 6, wherein the training module is specifically configured to:
And inputting all pictures corresponding to the multi-type labels into a neural network, carrying out weighted accumulation on the input value of each neuron, inputting an activation function as the output value of the neuron, and obtaining a score through forward propagation.
8. The apparatus of claim 6, wherein the training module is specifically configured to:
inputting the score into an error function and comparing the score with an expected value to obtain an error; if a plurality of errors are obtained, summing them to obtain the error.
9. The apparatus of claim 6, wherein the acquisition module is specifically configured to:
checking the confidence coefficient list of each label, finding the confidence coefficient at which the label first begins to be misrecognized, and determining that confidence coefficient as the confidence coefficient median of the label.
10. The apparatus according to any one of claims 6-9, wherein the apparatus further comprises:
A verification module configured to raise the confidence threshold by a specified percentage when data verification is performed, and to delete the data if the collected data does not meet the raised confidence threshold.
CN202010987992.7A 2020-09-18 2020-09-18 Method and device for improving data acquisition quality based on deep learning Active CN112131415B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010987992.7A CN112131415B (en) 2020-09-18 2020-09-18 Method and device for improving data acquisition quality based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010987992.7A CN112131415B (en) 2020-09-18 2020-09-18 Method and device for improving data acquisition quality based on deep learning

Publications (2)

Publication Number Publication Date
CN112131415A CN112131415A (en) 2020-12-25
CN112131415B true CN112131415B (en) 2024-05-10

Family

ID=73841503

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010987992.7A Active CN112131415B (en) 2020-09-18 2020-09-18 Method and device for improving data acquisition quality based on deep learning

Country Status (1)

Country Link
CN (1) CN112131415B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112667616B (en) * 2020-12-31 2022-07-22 杭州趣链科技有限公司 Traffic data evaluation method and system based on block chain and electronic equipment

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108009589A (en) * 2017-12-12 2018-05-08 腾讯科技(深圳)有限公司 Sample data processing method, device and computer-readable recording medium
CN109345515A (en) * 2018-09-17 2019-02-15 代黎明 Sample label confidence calculations method, apparatus, equipment and model training method
CN111125365A (en) * 2019-12-24 2020-05-08 京东数字科技控股有限公司 Address data labeling method and device, electronic equipment and storage medium
CN111291809A (en) * 2020-02-03 2020-06-16 华为技术有限公司 Processing device, method and storage medium
CN111309912A (en) * 2020-02-24 2020-06-19 深圳市华云中盛科技股份有限公司 Text classification method and device, computer equipment and storage medium
CN111344712A (en) * 2017-06-04 2020-06-26 去识别化有限公司 System and method for image de-identification
CN111353549A (en) * 2020-03-10 2020-06-30 创新奇智(重庆)科技有限公司 Image tag verification method and device, electronic device and storage medium
WO2020160664A1 (en) * 2019-02-06 2020-08-13 The University Of British Columbia Neural network image analysis

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111344712A (en) * 2017-06-04 2020-06-26 去识别化有限公司 System and method for image de-identification
CN108009589A (en) * 2017-12-12 2018-05-08 腾讯科技(深圳)有限公司 Sample data processing method, device and computer-readable recording medium
CN109345515A (en) * 2018-09-17 2019-02-15 代黎明 Sample label confidence calculations method, apparatus, equipment and model training method
WO2020160664A1 (en) * 2019-02-06 2020-08-13 The University Of British Columbia Neural network image analysis
CN111125365A (en) * 2019-12-24 2020-05-08 京东数字科技控股有限公司 Address data labeling method and device, electronic equipment and storage medium
CN111291809A (en) * 2020-02-03 2020-06-16 华为技术有限公司 Processing device, method and storage medium
CN111309912A (en) * 2020-02-24 2020-06-19 深圳市华云中盛科技股份有限公司 Text classification method and device, computer equipment and storage medium
CN111353549A (en) * 2020-03-10 2020-06-30 创新奇智(重庆)科技有限公司 Image tag verification method and device, electronic device and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Pinar Donmez 等.Efficiently learning the accuracy of labeling sources for selective sampling.《KDD '09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining》.2009,259-268. *
Research on Automatic Image Annotation Methods Based on Semi-supervised Learning; Lin Lan; China Master's Theses Full-text Database, Information Science and Technology (No. 01); I138-3982 *

Also Published As

Publication number Publication date
CN112131415A (en) 2020-12-25

Similar Documents

Publication Publication Date Title
TWI689871B (en) Gradient lifting decision tree (GBDT) model feature interpretation method and device
JP3209163B2 (en) Classifier
CN109634924B (en) File system parameter automatic tuning method and system based on machine learning
CN110852755A (en) User identity identification method and device for transaction scene
CN111723870B (en) Artificial intelligence-based data set acquisition method, apparatus, device and medium
CN111368867B (en) File classifying method and system and computer readable storage medium
CN109871891B (en) Object identification method and device and storage medium
CN110377739A (en) Text sentiment classification method, readable storage medium storing program for executing and electronic equipment
CN112528022A (en) Method for extracting characteristic words corresponding to theme categories and identifying text theme categories
CN112131415B (en) Method and device for improving data acquisition quality based on deep learning
CN114329022A (en) Method for training erotic classification model, method for detecting image and related device
CN111563361B (en) Text label extraction method and device and storage medium
CN113688263B (en) Method, computing device, and storage medium for searching for image
CN116049528A (en) Evaluation method and device of search system, electronic equipment and readable storage medium
CN113656575B (en) Training data generation method and device, electronic equipment and readable medium
CN114118305A (en) Sample screening method, device, equipment and computer medium
CN113792879A (en) Case reasoning attribute weight adjusting method based on introspection learning
CN111813975A (en) Image retrieval method and device and electronic equipment
CN115860012B (en) User intention recognition method, device, electronic equipment and medium
CN117131920B (en) Model pruning method based on network structure search
CN113139076B (en) Automatic neural network image marking method for deep feature learning multi-label
CN112069806A (en) Resume screening method and device, electronic equipment and storage medium
CN117520754B (en) Pretreatment system for model training data
CN114463330B (en) CT data collection system, method and storage medium
KR102399833B1 (en) synopsis production service providing apparatus using log line based on artificial neural network and method therefor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant