CN110083728B - Method, device and system for optimizing automatic picture data cleaning quality - Google Patents

Info

Publication number
CN110083728B
CN110083728B CN201910267802.1A CN201910267802A CN110083728B CN 110083728 B CN110083728 B CN 110083728B CN 201910267802 A CN201910267802 A CN 201910267802A CN 110083728 B CN110083728 B CN 110083728B
Authority
CN
China
Prior art keywords
confidence
grained
picture
classifier
fine
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910267802.1A
Other languages
Chinese (zh)
Other versions
CN110083728A (en
Inventor
吴英平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai re SR Information Technology Co.,Ltd.
Original Assignee
Shanghai Re Sr Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Re Sr Information Technology Co ltd filed Critical Shanghai Re Sr Information Technology Co ltd
Priority to CN201910267802.1A priority Critical patent/CN110083728B/en
Publication of CN110083728A publication Critical patent/CN110083728A/en
Application granted granted Critical
Publication of CN110083728B publication Critical patent/CN110083728B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/55 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a method, a device and a system for optimizing automatic picture data cleaning quality. The method comprises the following steps: sequentially inputting the picture set to be cleaned into a coarse-grained binary classifier and a fine-grained binary classifier to obtain the confidence of the class prediction of each picture to be cleaned; screening out pictures needing manual cleaning based on a set confidence threshold and a first picture quantity threshold corresponding to the confidence threshold; obtaining the model accuracy of the fine-grained binary classifier based on the confidences of the class predictions of all the pictures to be manually cleaned and the feedback result of the manual cleaning; and performing model optimization of the fine-grained binary classifier, taking the model accuracy of the fine-grained binary classifier and a model optimization count threshold as optimization conditions. On the basis of the original data cleaning method, the invention can obtain very high picture cleaning quality through a small number of model iterations of the fine-grained binary classifier, and under certain conditions can even completely replace manual cleaning once the model iterations are finished.

Description

Method, device and system for optimizing automatic picture data cleaning quality
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a method, a device and a system for optimizing automatic picture data cleaning quality.
Background
With the breakthrough progress of deep learning in the field of image recognition, neural networks have become the mainstream algorithms in that field. However, a neural network is a supervised learning algorithm. Supervised learning means that the model learns from a known data set of labeled inputs and outputs, so that the model parameters of the neural network are continuously optimized and the network becomes progressively more capable. In other words, a large amount of accurately labeled picture data is required for training in order to obtain good recognition accuracy. In theory, the more data used for learning, the higher the accuracy of the model. However, this holds only when the data used for learning are all correct; if wrong data are mixed in, the accuracy of learning is significantly affected. The cleaning of massive picture data has therefore become a bottleneck restricting the development of neural network technology. The picture data cleaning method mainly used in the industry at present is still the traditional method based on manual cleaning.
Chinese invention patent application No. 2018107215159, entitled "Method and device for cleaning data", discloses the following: in the process of data cleaning, the data to be cleaned that can be determined with high probability to be correct data or wrong data are picked out first, the data in between that are difficult to confirm are then screened, and positive samples and negative samples are then picked out.
Disclosure of Invention
In view of the above problems, the invention provides a method, a device and a system for optimizing automatic picture data cleaning quality. On the basis of the original data cleaning method, very high picture cleaning quality can be obtained through a small number of model iterations of a fine-grained binary classifier, and under certain conditions manual cleaning can even be completely replaced once the model iterations are finished.
To achieve the above technical purpose and technical effects, the invention is realized by the following technical scheme:
In a first aspect, the present invention provides a method for optimizing automatic picture data cleaning quality, comprising the following steps:
acquiring a picture set to be cleaned, inputting the picture set to be cleaned into a preset coarse-grained binary classifier, and screening out a first-class picture set meeting the requirements;
inputting the first-class picture set into a preset fine-grained binary classifier to obtain the confidence of the class prediction of each picture to be cleaned;
screening out pictures needing manual cleaning based on a set confidence threshold and a first picture quantity threshold corresponding to the confidence threshold;
obtaining the model accuracy of the fine-grained binary classifier based on the confidences of the class predictions of all the pictures to be manually cleaned and the feedback result of the manual cleaning;
taking the model accuracy of the fine-grained binary classifier and a model optimization count threshold as optimization conditions, performing model optimization of the fine-grained binary classifier based on the confidences of the class predictions of all the pictures to be manually cleaned, the feedback result of the manual cleaning, and the sample pictures;
and repeating the above process until a fine-grained binary classifier meeting the requirements is obtained, and then cleaning all the pictures.
Preferably, the training process of the preset fine-grained binary classifier is as follows:
providing a test picture set, clustering the test picture set with a cluster analysis algorithm, and obtaining a positive sample test picture set and a negative sample test picture set;
and training a fine-grained binary classifier on the positive sample test picture set and the negative sample test picture set.
Preferably, screening out the pictures needing manual cleaning based on the set confidence threshold and the first picture quantity threshold corresponding to the confidence threshold specifically includes the following sub-steps:
comparing the set confidence threshold with the obtained confidence of the class prediction of each picture to be cleaned;
and when the number of pictures whose predicted confidence is smaller than the set confidence threshold is larger than a preset first picture quantity threshold, regarding those pictures as pictures needing manual cleaning.
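The two sub-steps above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `select_for_manual_cleaning` and the dictionary of picture ids are hypothetical, and the default values 0.99 and 150 are the example settings mentioned in the detailed description.

```python
from typing import Dict, List


def select_for_manual_cleaning(
    confidences: Dict[str, float],
    confidence_threshold: float = 0.99,   # example value from the description
    first_count_threshold: int = 150,     # example value from the description
) -> List[str]:
    """Return the ids of pictures that should be sent for manual cleaning.

    Pictures below the confidence threshold are handed over only when there
    are more of them than the first picture quantity threshold.
    """
    low_confidence = [
        pic_id for pic_id, conf in confidences.items()
        if conf < confidence_threshold
    ]
    if len(low_confidence) > first_count_threshold:
        return low_confidence
    # Too few low-confidence pictures: no manual cleaning is requested.
    return []
```

With a small first picture quantity threshold the behaviour is easy to check by hand: given three pictures of which two fall below 0.99, a threshold of 1 returns both, while a threshold of 2 returns nothing.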
Preferably, during model optimization of the fine-grained binary classifier, after the step of screening out the pictures needing manual cleaning based on the set confidence threshold and the first picture quantity threshold corresponding to the confidence threshold, the method further includes:
selecting, based on a set rule, pictures needing manual cleaning from each prediction confidence distribution interval.
Preferably, obtaining the model accuracy of the fine-grained binary classifier based on the confidences of the class predictions of all the pictures to be manually cleaned and the feedback result of the manual cleaning specifically includes the following sub-steps:
for the pictures screened out for manual cleaning based on the set confidence threshold and the first picture quantity threshold corresponding to the confidence threshold, judging the classification to be wrong when the confidence of the class prediction of a picture conflicts with the manual cleaning feedback result, and otherwise judging the classification to be correct;
for the pictures selected for manual cleaning from each prediction confidence distribution interval, judging the classification to be wrong when the confidence of the class prediction of a picture conflicts with the manual cleaning feedback result, and otherwise judging the classification to be correct;
and calculating the model accuracy of the fine-grained binary classifier based on the classification judgment results.
Preferably, the step of screening out the pictures needing manual cleaning based on the set confidence threshold and the first picture quantity threshold corresponding to the confidence threshold further includes:
when the number of pictures to be cleaned whose predicted confidence is larger than the set confidence threshold is smaller than a preset second picture quantity threshold, lowering the set confidence threshold, and then providing the picture data above the adjusted confidence threshold for manual cleaning.
Preferably, the model optimization of the fine-grained binary classifier specifically includes the following sub-steps:
acquiring error-prone samples, wherein the error-prone samples comprise error-prone positive samples and error-prone negative samples;
and when the model accuracy of the fine-grained binary classifier is smaller than the set accuracy threshold and the number of model optimizations is smaller than the model optimization count threshold, using the obtained error-prone positive samples and error-prone negative samples, together with the other positive samples and negative samples, as a training set to retrain the fine-grained binary classifier, thereby optimizing it, and at the same time increasing the model optimization count of the fine-grained binary classifier by one.
Preferably, the confidence of the class prediction of a picture to be manually cleaned is $Confidence_{predict}$, the feedback result of the manual cleaning is $Confidence_{groundtruth}$, and an error-prone sample is one that satisfies:
$\left| Confidence_{groundtruth} - Confidence_{predict} \right| > threshold$
where $Confidence_{predict}$ takes values in the range (0, 1), $Confidence_{groundtruth}$ takes the value 0 or 1, and $threshold$ is a preset threshold.
In a second aspect, the present invention provides a device for optimizing automatic picture data cleaning quality, comprising:
a first screening module, used for acquiring a picture set to be cleaned, inputting the picture set to be cleaned into a preset coarse-grained binary classifier, and screening out a first-class picture set meeting the requirements;
a first calculation module, used for inputting the first-class picture set into a preset fine-grained binary classifier to obtain the confidence of the class prediction of each picture to be cleaned;
a second screening module, used for screening out the pictures needing manual cleaning based on the set confidence threshold and the first picture quantity threshold corresponding to the confidence threshold;
a second calculation module, used for obtaining the model accuracy of the fine-grained binary classifier based on the confidences of the class predictions of all the pictures to be manually cleaned and the feedback result of the manual cleaning;
and an optimization module, used for performing model optimization of the fine-grained binary classifier based on the confidences of the class predictions of all the pictures to be manually cleaned, the feedback result of the manual cleaning, and the sample pictures, taking the model accuracy of the fine-grained binary classifier and the model optimization count threshold as optimization conditions.
In a third aspect, the present invention provides a system for optimizing the quality of automatic image data cleaning, including:
a processor adapted to implement instructions; and
a storage device adapted to store a plurality of instructions adapted to be loaded by a processor and to perform the steps recited in the first aspect.
Compared with the prior art, the invention has the following beneficial effects:
On the basis of the original data cleaning method, the method, device and system for optimizing automatic picture data cleaning quality can obtain very high picture cleaning quality through a small number of model iterations of the fine-grained binary classifier, and under certain conditions can even completely replace manual cleaning once the model iterations are finished.
Drawings
Fig. 1 is a flowchart illustrating a method for optimizing the cleaning quality of automatic picture data according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the scope of the invention.
The following detailed description of the principles of the invention is provided in connection with the accompanying drawings.
For a neural network to learn from labeled input and output data, so that its model parameters are continuously optimized and the network becomes progressively more capable, massive amounts of accurately labeled picture data must be provided for training before good recognition accuracy can be obtained. In theory, the more data used for learning, the higher the accuracy of the model. However, this holds only when the data used for learning are all correct; if wrong data are mixed in, the accuracy of learning is significantly affected. The cleaning of massive picture data has therefore become a bottleneck restricting the development of neural network technology. At present, the picture data cleaning method mainly used in the industry is still the traditional method based on manual cleaning. Manual screening involves a heavy workload, and because similar data are judged and classified subjectively by a person, the resulting data may contain errors that degrade the performance of the neural network model. The invention therefore provides a method, a device and a system for optimizing automatic picture data cleaning quality. On the basis of the original data cleaning method, very high picture cleaning quality can be obtained through a small number of model iterations of the fine-grained binary classifier, and under certain conditions manual cleaning can even be completely replaced once the model iterations are finished.
Example 1
As shown in Fig. 1, an embodiment of the present invention provides a method for optimizing automatic picture data cleaning quality, including the following steps:
(1) acquiring a picture set to be cleaned, inputting the picture set to be cleaned into a preset coarse-grained binary classifier, and screening out a first-class picture set meeting the requirements;
(2) inputting the first-class picture set into a preset fine-grained binary classifier to obtain the confidence of the class prediction of each picture to be cleaned; steps (1) and (2) correspond to the automatic cleaning phase in Fig. 1;
(3) screening out pictures needing manual cleaning based on a set confidence threshold and a first picture quantity threshold corresponding to the confidence threshold; step (3) corresponds to the manual cleaning phase in Fig. 1;
(4) obtaining the model accuracy of the fine-grained binary classifier based on the confidences of the class predictions of all the pictures to be manually cleaned and the feedback result of the manual cleaning;
(5) taking the model accuracy of the fine-grained binary classifier and the model optimization count threshold as optimization conditions, performing model optimization of the fine-grained binary classifier based on the confidences of the class predictions of all the pictures to be manually cleaned, the feedback result of the manual cleaning, and the sample pictures; step (5) corresponds to the model optimization phase in Fig. 1;
(6) repeating the above process until a fine-grained binary classifier meeting the requirements is obtained, and then cleaning all the pictures.
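The control flow of steps (1) to (6) can be sketched as a loop. The sketch below is an illustration under stated assumptions, not the implementation described in the patent: every callable passed in (`coarse_keep`, `fine_confidence`, `needs_manual`, `manual_label`, `retrain`) is a hypothetical stand-in for a component the text describes only in prose, and the 0.5 decision threshold used to decide whether a prediction conflicts with the manual label is an assumption.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def clean_until_good(
    pictures: Iterable[str],
    coarse_keep: Callable[[str], bool],          # step (1): coarse-grained filter
    fine_confidence: Callable[[str], float],     # step (2): fine-grained confidence
    needs_manual: Callable[[Dict[str, float]], List[str]],  # step (3)
    manual_label: Callable[[str], int],          # manual feedback, 0 or 1
    retrain: Callable[[Dict[str, float], Dict[str, int]], None],  # step (5)
    accuracy_threshold: float,
    max_optimizations: int,
    decision_threshold: float = 0.5,             # assumption, not from the source
) -> Tuple[Dict[str, float], int]:
    """Run the clean / review / optimize loop; return confidences and count."""
    first_class = [p for p in pictures if coarse_keep(p)]
    optimization_count = 0
    while True:
        confidences = {p: fine_confidence(p) for p in first_class}
        labels = {p: manual_label(p) for p in needs_manual(confidences)}
        if labels:
            # Step (4): a prediction is correct when it agrees with the label.
            correct = sum(
                (confidences[p] >= decision_threshold) == bool(labels[p])
                for p in labels
            )
            accuracy = correct / len(labels)
        else:
            accuracy = 1.0  # nothing was flagged for review
        # Step (6): stop when accurate enough or the optimization budget is spent.
        if accuracy >= accuracy_threshold or optimization_count >= max_optimizations:
            return confidences, optimization_count
        retrain(confidences, labels)
        optimization_count += 1
```

Passing the components in as callables keeps the sketch independent of any particular model library; in practice `retrain` would update the fine-grained classifier that `fine_confidence` queries.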
In a specific implementation of the embodiment of the present invention, step (1) specifically includes:
obtaining an initial test picture set by using a web crawler, and filtering the initial test picture set with the preset coarse-grained binary classifier to obtain the test picture set. To acquire the large amount of picture sample data required for training a neural network model, the most convenient approach is to use a web crawler, which can capture information meeting set conditions from the massive information on the internet. However, the picture information acquired by a web crawler is massive, and much of it is not needed. Suppose picture data of category A is requested through the web crawler: the crawling result usually contains a large amount of non-A picture data. The massive picture data obtained by the crawler is therefore initially classified by the coarse-grained binary classifier, which removes the non-A picture data and keeps the A picture data. For example, when dish pictures of tomato-fried eggs are requested through the web crawler, dish pictures that are not tomato-fried eggs are often crawled as well, and the coarse-grained binary classifier is used to retain the dish pictures of tomato-fried eggs. In this technical scheme, the massive picture data is initially classified by the coarse-grained binary classifier, which provides accurate sample picture data for the subsequent training of the fine-grained classifier.
In a specific implementation of the embodiment of the present invention, the training process of the preset fine-grained binary classifier in step (2) is as follows:
providing a test picture set, clustering the test picture set with a cluster analysis algorithm, and obtaining a positive sample test picture set and a negative sample test picture set; preferably, the cluster analysis algorithm is the K-means algorithm, and the specific clustering process is as follows:
S201, dividing the picture data set into k classes, and selecting k typical pictures from the test picture set as the initial cluster centers of the classes;
S202, calculating the distance between each picture in the test picture set and the cluster center of each class, assigning each picture to the class whose center is nearest and forming the cluster center values accordingly, which completes one iteration;
S203, repeating the iterative process of step S202 until the calculated cluster center values equal the previous center values, to obtain the cluster center of each class;
S204, calculating the distance between each picture and the cluster center of each class, forming the positive sample test picture set from the pictures with the closest distances and the negative sample test picture set from the pictures with the farthest distances, wherein the number of pictures in the positive sample test picture set is the same as that in the negative sample test picture set;
and training a fine-grained binary classifier on the positive sample test picture set and the negative sample test picture set.
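Steps S201 to S204 can be illustrated with a small pure-Python K-means. The sketch rests on several assumptions that the text does not state: each picture is represented by a numeric feature vector, distance is Euclidean, the cluster center is the mean of its members, and positive and negative samples are taken relative to one chosen class center. The names `kmeans` and `split_samples` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def _mean(points: List[Vector]) -> Tuple[float, ...]:
    """Component-wise mean of a non-empty list of vectors."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def kmeans(features: List[Vector], initial_centers: List[Vector],
           max_iter: int = 100) -> List[Tuple[float, ...]]:
    """S201-S203: iterate until the cluster centers stop changing."""
    centers = [tuple(c) for c in initial_centers]
    for _ in range(max_iter):
        clusters: List[List[Vector]] = [[] for _ in centers]
        for point in features:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(point, centers[i]))
            clusters[nearest].append(point)
        # An empty cluster keeps its previous center.
        new_centers = [_mean(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def split_samples(features: List[Vector], center: Vector,
                  n_each: int) -> Tuple[List[int], List[int]]:
    """S204: indices of the n_each closest (positive) and farthest (negative)."""
    n_each = max(0, min(n_each, len(features) // 2))
    if n_each == 0:
        return [], []
    order = sorted(range(len(features)),
                   key=lambda i: math.dist(features[i], center))
    return order[:n_each], order[-n_each:]
```

Capping `n_each` at half the number of pictures keeps the two sets disjoint and equal in size, matching the requirement that the positive and negative sets contain the same number of pictures.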
The first-class picture set (class A pictures) is input into the preset fine-grained binary classifier to obtain the confidence of the class prediction of each picture to be cleaned, for example the confidence that a class A picture is predicted as Class 1, and the confidence is sent to a data management system;
In a specific implementation of the embodiment of the present invention, screening out the pictures needing manual cleaning based on the set confidence threshold and the first picture quantity threshold corresponding to the confidence threshold specifically includes the following sub-steps:
comparing the set confidence threshold with the obtained confidence of the class prediction of each picture to be cleaned; the confidence threshold needs to be set according to the actual situation and may, for example, be set to 0.99;
when the number of pictures whose predicted confidence is smaller than the set confidence threshold is larger than a preset first picture quantity threshold, regarding those pictures as pictures needing manual cleaning; when the number of pictures whose predicted confidence is smaller than the set confidence threshold is smaller than the preset first picture quantity threshold, regarding those pictures as pictures that do not need manual cleaning; the preset picture quantity threshold also needs to be set according to the actual situation and may, for example, be set to 150 pictures;
during model optimization of the fine-grained binary classifier, after the step of screening out the pictures needing manual cleaning based on the set confidence threshold and the first picture quantity threshold corresponding to the confidence threshold, the method further includes:
selecting, based on a set rule, pictures needing manual cleaning from each prediction confidence distribution interval, wherein this step is executed only during model optimization of the fine-grained binary classifier; for example, the pictures can be selected according to the rule in Table 1:
Table 1
Prediction confidence distribution | Number of random picks
0-20% | 5
20%-40% | 10
40%-60% | 15
60%-80% | 10
80%-100% | 5
In another implementation of the embodiment of the present invention, the pictures needing manual cleaning may also be selected from the prediction confidence distribution intervals without following the rule in Table 1; the specific selection rule is determined according to actual needs.
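The per-interval selection of Table 1 amounts to stratified random sampling over the predicted confidence. The following sketch assumes half-open intervals with the top interval closed at 100%, which the table does not specify, and uses hypothetical names (`TABLE_1_RULE`, `pick_per_interval`); an interval with fewer pictures than requested simply contributes all of them.

```python
import random
from typing import Dict, List, Optional, Sequence, Tuple

# (lower bound, upper bound, number of random picks), taken from Table 1.
TABLE_1_RULE: Sequence[Tuple[float, float, int]] = (
    (0.0, 0.2, 5),
    (0.2, 0.4, 10),
    (0.4, 0.6, 15),
    (0.6, 0.8, 10),
    (0.8, 1.0, 5),
)


def pick_per_interval(
    confidences: Dict[str, float],
    rule: Sequence[Tuple[float, float, int]] = TABLE_1_RULE,
    seed: Optional[int] = None,
) -> List[str]:
    """Randomly pick picture ids from each confidence interval."""
    rng = random.Random(seed)
    picked: List[str] = []
    for index, (low, high, count) in enumerate(rule):
        is_last = index == len(rule) - 1
        # Assumption: intervals are [low, high), and the last one includes high.
        bucket = sorted(
            pic_id for pic_id, conf in confidences.items()
            if low <= conf < high or (is_last and conf == high)
        )
        picked.extend(rng.sample(bucket, min(count, len(bucket))))
    return picked
```

Under the Table 1 rule a full draw reviews 45 pictures per round, weighted toward the 40%-60% band where the classifier is least certain.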
Preferably, the step of screening out the pictures needing manual cleaning based on the set confidence threshold and the first picture quantity threshold corresponding to the confidence threshold further includes:
when the number of pictures to be cleaned whose predicted confidence is larger than the set confidence threshold is smaller than a preset second picture quantity threshold (for example, 10% of the total number of pictures crawled this time), lowering the set confidence threshold $threshold_{clean}$, and then providing the picture data above the adjusted confidence threshold for manual cleaning.
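A possible reading of this adjustment is sketched below. The text says only that the threshold is lowered; the fixed `step`, the lower bound `floor`, and the choice to keep lowering until enough pictures exceed the threshold are all assumptions made for illustration, as is the function name `lower_threshold_if_needed`.

```python
from typing import Dict, List, Tuple


def lower_threshold_if_needed(
    confidences: Dict[str, float],
    threshold_clean: float,
    second_count_threshold: int,
    step: float = 0.01,   # assumption: the source does not give a step size
    floor: float = 0.5,   # assumption: never lower the threshold below this
) -> Tuple[float, List[str]]:
    """Return the (possibly lowered) threshold and the picture ids above it."""

    def above(threshold: float) -> List[str]:
        return [p for p, c in confidences.items() if c > threshold]

    # Lower the threshold while too few pictures exceed it.
    while (len(above(threshold_clean)) < second_count_threshold
           and threshold_clean - step >= floor):
        threshold_clean = round(threshold_clean - step, 10)
    # Pictures above the adjusted threshold are provided for manual cleaning.
    return threshold_clean, above(threshold_clean)
```

Following the example in the text, `second_count_threshold` could be set to 10% of the number of pictures crawled in the current batch.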
In a specific implementation of the embodiment of the present invention, step (4), obtaining the model accuracy of the fine-grained binary classifier based on the confidences of the class predictions of all the pictures to be manually cleaned and the feedback result of the manual cleaning, specifically includes the following sub-steps:
(401) for the pictures screened out for manual cleaning based on the set confidence threshold and the first picture quantity threshold corresponding to the confidence threshold, judging the classification to be wrong when the confidence of the class prediction of a picture conflicts with the manual cleaning feedback result, and otherwise judging the classification to be correct;
(402) for the pictures selected for manual cleaning from each prediction confidence distribution interval, judging the classification to be wrong when the confidence of the class prediction of a picture conflicts with the manual cleaning feedback result, that is, when the prediction confidence obtained by the fine-grained binary classifier is inconsistent with the actual result, and otherwise judging the classification to be correct;
(403) calculating the model accuracy of the fine-grained binary classifier based on the classification judgment results.
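Sub-steps (401) to (403) reduce to counting agreements between predictions and manual labels. The text does not define precisely when a confidence "conflicts" with the manual result, so the sketch below assumes a 0.5 decision threshold: a confidence at or above it counts as a positive prediction. The function name `model_accuracy` is hypothetical.

```python
from typing import List, Tuple


def model_accuracy(
    reviewed: List[Tuple[float, int]],
    decision_threshold: float = 0.5,  # assumption: not specified in the source
) -> float:
    """Fraction of manually reviewed pictures whose prediction matches the label.

    Each item is (predicted confidence, manual label), with label 0 or 1.
    """
    if not reviewed:
        raise ValueError("no manually reviewed pictures to evaluate")
    correct = sum(
        1 for confidence, label in reviewed
        if (confidence >= decision_threshold) == bool(label)
    )
    return correct / len(reviewed)
```

For example, four reviewed pictures with confidences 0.9, 0.2, 0.8 and 0.4 and manual labels 1, 0, 0 and 1 give two agreements out of four, an accuracy of 0.5.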
In a specific implementation of the embodiment of the present invention, the model optimization of the fine-grained binary classifier in step (5) specifically includes the following sub-steps:
acquiring error-prone samples, wherein the error-prone samples comprise error-prone positive samples and error-prone negative samples; an error-prone positive sample is a picture whose prediction confidence is smaller than the set confidence threshold but which actually belongs to the class; an error-prone negative sample is a picture whose prediction confidence is larger than the set confidence threshold but which does not actually belong to the class;
and when the model accuracy of the fine-grained binary classifier is smaller than the set accuracy threshold and the number of model optimizations is smaller than the model optimization count threshold, using the obtained error-prone positive samples and error-prone negative samples, together with the other positive samples and negative samples, as a training set to retrain the fine-grained binary classifier, thereby optimizing it, and at the same time increasing the model optimization count of the fine-grained binary classifier by one.
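The gating condition and training-set assembly described above can be sketched as a single function. The `retrain` callable stands in for whatever training routine is used and is hypothetical, as are the function name `maybe_optimize` and the `(picture id, label)` sample representation.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[str, int]  # (picture id, label: 1 = positive, 0 = negative)


def maybe_optimize(
    accuracy: float,
    accuracy_threshold: float,
    optimization_count: int,
    max_optimizations: int,
    error_prone: Sequence[Sample],
    other_samples: Sequence[Sample],
    retrain: Callable[[List[Sample]], None],  # hypothetical training hook
) -> int:
    """Retrain when both conditions hold; return the updated optimization count."""
    if accuracy < accuracy_threshold and optimization_count < max_optimizations:
        # Error-prone samples are trained together with the other samples.
        training_set = list(error_prone) + list(other_samples)
        retrain(training_set)
        return optimization_count + 1
    return optimization_count
```

The count threshold bounds the number of retraining rounds, so the loop terminates even if the accuracy threshold is never reached.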
Preferably, the confidence of the class prediction of a picture to be manually cleaned is $Confidence_{predict}$, the feedback result of the manual cleaning is $Confidence_{groundtruth}$, and an error-prone sample is one that satisfies:
$\left| Confidence_{groundtruth} - Confidence_{predict} \right| > threshold$
where $Confidence_{predict}$ takes values in the range (0, 1), $Confidence_{groundtruth}$ takes the value 0 or 1, and $threshold$ is a preset threshold.
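The error-prone criterion translates directly into code. In the sketch below the function name `find_error_prone`, the tuple layout, and the default threshold of 0.5 are illustrative assumptions; the text leaves the threshold value open.

```python
from typing import List, Tuple


def find_error_prone(
    reviewed: List[Tuple[str, float, int]],
    threshold: float = 0.5,  # assumption: the source gives no concrete value
) -> Tuple[List[str], List[str]]:
    """Split reviewed pictures into error-prone positives and negatives.

    Each item is (picture id, predicted confidence in (0, 1), ground truth 0/1).
    A picture is error-prone when |ground truth - confidence| > threshold.
    """
    positives: List[str] = []
    negatives: List[str] = []
    for pic_id, confidence, ground_truth in reviewed:
        if abs(ground_truth - confidence) > threshold:
            if ground_truth == 1:
                positives.append(pic_id)   # truly in the class, scored low
            else:
                negatives.append(pic_id)   # not in the class, scored high
    return positives, negatives
```

With a threshold of 0.5, a true positive scored 0.1 has a gap of 0.9 and is flagged as an error-prone positive, while a true positive scored 0.8 has a gap of 0.2 and is not flagged.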
Example 2
Based on the same inventive concept as Example 1, an embodiment of the present invention provides a device for optimizing automatic picture data cleaning quality, including:
a first screening module, used for acquiring a picture set to be cleaned, inputting the picture set to be cleaned into a preset coarse-grained binary classifier, and screening out a first-class picture set meeting the requirements;
a first calculation module, used for inputting the first-class picture set into a preset fine-grained binary classifier to obtain the confidence of the class prediction of each picture to be cleaned;
a second screening module, used for screening out the pictures needing manual cleaning based on the set confidence threshold and the first picture quantity threshold corresponding to the confidence threshold;
a second calculation module, used for obtaining the model accuracy of the fine-grained binary classifier based on the confidences of the class predictions of all the pictures to be manually cleaned and the feedback result of the manual cleaning;
and an optimization module, used for performing model optimization of the fine-grained binary classifier based on the confidences of the class predictions of all the pictures to be manually cleaned, the feedback result of the manual cleaning, and the sample pictures, taking the model accuracy of the fine-grained binary classifier and the model optimization count threshold as optimization conditions.
The remainder is the same as in Example 1.
Example 3
Based on the same inventive concept as embodiment 1, an embodiment of the present invention provides a system for optimizing automatic picture data cleaning quality, including:
a processor adapted to implement instructions; and
a storage device adapted to store a plurality of instructions adapted to be loaded by a processor and to perform the steps described in embodiment 1.
The following takes the picture cleaning of three dishes (phoenix-tail shrimp, quick-fried squid, and onion meat slices) as an example.
The accuracy of the crawler data set is as follows:
Dish | Total number | Number of positive samples | Number of negative samples | Accuracy
Quick-fried squid | 1610 | 1080 | 530 | 67.1%
Phoenix tail shrimp | 1716 | 936 | 780 | 54.5%
Onion meat slice | 1568 | 697 | 871 | 44.5%
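The accuracy column of the crawler data set is consistent with the share of positive samples in each total, which can be checked with a short calculation. The dictionary name `CRAWLER_DATA` and the function `positive_share` are introduced here only for illustration.

```python
from typing import Dict, Tuple

# dish -> (total, positive samples, negative samples), from the table above
CRAWLER_DATA: Dict[str, Tuple[int, int, int]] = {
    "Quick-fried squid": (1610, 1080, 530),
    "Phoenix tail shrimp": (1716, 936, 780),
    "Onion meat slice": (1568, 697, 871),
}


def positive_share(total: int, positives: int) -> float:
    """Percentage of positive samples, rounded to one decimal place."""
    return round(100 * positives / total, 1)


for dish, (total, positives, negatives) in CRAWLER_DATA.items():
    # Positives and negatives should add up to the total for each dish.
    assert positives + negatives == total
    print(dish, positive_share(total, positives))
```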
The accuracy of the data set after automatic cleaning in the prior art is as follows:
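[Table images BDA0002017401140000081 and BDA0002017401140000091 not reproduced: data set accuracy after prior-art automatic cleaning]
After processing by the method of the invention, the accuracy of the data set is as follows:
[Table image BDA0002017401140000092 not reproduced: data set accuracy after processing by the method of the invention]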
Based on Tables 1-3, it can be seen that the accuracy of the data set after processing by the method of the present invention is further improved compared with that obtained without the optimization.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.
The foregoing shows and describes the general principles and broad features of the present invention and advantages thereof. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (9)

1. A method for optimizing the cleaning quality of automatic picture data is characterized by comprising the following steps:
acquiring a picture set to be cleaned, inputting the picture set to be cleaned into a preset coarse-grained secondary classifier, and screening out a first-class picture set meeting requirements;
inputting the first type of picture set into a preset fine-grained second classifier to obtain the confidence coefficient of class prediction of each picture to be cleaned;
screening out pictures needing manual cleaning based on a set confidence threshold and a first picture quantity threshold corresponding to the confidence threshold;
obtaining the model accuracy of a fine-grained secondary classifier based on the confidence degrees of the class predictions of all the pictures to be manually cleaned and the feedback result of the manual cleaning;
taking the model accuracy and the model optimization frequency threshold of the fine-grained second classifier as optimization conditions, and performing model optimization of the fine-grained second classifier based on the confidence degrees of class prediction of all the pictures to be manually cleaned, the feedback result of manual cleaning and the sample picture;
repeating the process until a fine-grained second classifier meeting the requirements is obtained, and cleaning all the pictures;
the model optimization of the fine-grained two-classifier specifically comprises the following substeps:
acquiring an error-prone sample, wherein the error-prone sample comprises an error-prone positive sample and an error-prone negative sample;
and when the model accuracy of the fine-grained two-classifier is smaller than the set accuracy threshold and the number of model optimizations is smaller than the model optimization times threshold, using the obtained error-prone positive samples and error-prone negative samples, together with the other positive samples and negative samples, as a training set to fine-tune the fine-grained two-classifier again, thereby optimizing the fine-grained two-classifier, and at the same time increasing the model optimization count of the fine-grained two-classifier by one.
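The control flow of claim 1 can be sketched as a loop that stops once the accuracy requirement is met or the optimization count reaches its threshold. This is only an illustrative reading: `predict`, `retrain` and `manual_feedback` are hypothetical callbacks, the 0.5 decision boundary and all default values are assumptions, and the "other positive and negative samples" that join the training set are left to the `retrain` callback.

```python
def optimize_classifier(predict, retrain, manual_feedback, pictures,
                        accuracy_threshold=0.95, max_rounds=3,
                        error_threshold=0.5):
    """Iteratively optimize a binary classifier (illustrative sketch).

    predict(picture)         -> confidence in [0, 1]   (hypothetical callback)
    manual_feedback(picture) -> 0 or 1                 (hypothetical callback)
    retrain(samples)         -> None; updates the model behind `predict`
    Returns (final accuracy, number of optimization rounds performed).
    """
    if not pictures:
        raise ValueError("no pictures were selected for manual cleaning")
    rounds = 0
    while True:
        confidences = [predict(p) for p in pictures]
        labels = [manual_feedback(p) for p in pictures]
        # Assumption: a confidence of at least 0.5 counts as a positive prediction.
        correct = sum((c >= 0.5) == bool(y) for c, y in zip(confidences, labels))
        accuracy = correct / len(pictures)
        # Stop when the accuracy is sufficient or the optimization budget is used up.
        if accuracy >= accuracy_threshold or rounds >= max_rounds:
            return accuracy, rounds
        # Error-prone samples: manual label and confidence are far apart.
        error_prone = [(p, y) for p, c, y in zip(pictures, confidences, labels)
                       if abs(y - c) > error_threshold]
        retrain(error_prone)
        rounds += 1  # the model optimization count is increased by one
```

The two stopping conditions mirror the optimization conditions named in the claim: the model accuracy and the model optimization times threshold.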
2. The method of claim 1, wherein the method further comprises: the training process of the preset fine-grained second classifier is as follows:
providing a test picture set, clustering the test picture set according to a clustering analysis algorithm, and acquiring a positive sample test picture set and a negative sample test picture set;
and training to obtain a fine-grained two-classifier according to the positive sample test picture set and the negative sample test picture set.
3. The method of claim 1, wherein the method further comprises: screening out the pictures needing to be manually cleaned based on the set confidence threshold value and the first picture quantity threshold value corresponding to the confidence threshold value, and specifically comprising the following substeps:
comparing the set confidence threshold with the obtained confidence of the class prediction of each picture to be cleaned;
and when the number of pictures whose predicted confidence is smaller than the set confidence threshold is larger than a preset first picture number threshold, regarding those pictures as the pictures needing to be manually cleaned.
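One possible reading of the screening step in claim 3 is sketched below. The function name and the choice to return an empty list when the count condition is not met are assumptions; the claim itself only states when the low-confidence pictures are regarded as needing manual cleaning.

```python
def select_for_manual_cleaning(confidences, confidence_threshold,
                               first_count_threshold):
    """Return indices of pictures to hand over for manual cleaning.

    Pictures with a confidence below `confidence_threshold` are flagged only
    when there are more of them than `first_count_threshold`.
    """
    low = [i for i, c in enumerate(confidences) if c < confidence_threshold]
    # Assumption: if too few pictures fall below the threshold, none are flagged.
    return low if len(low) > first_count_threshold else []


# Two pictures fall below 0.5, which exceeds a count threshold of 1.
print(select_for_manual_cleaning([0.9, 0.2, 0.4, 0.95], 0.5, 1))
```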
4. The method of claim 3, wherein the method further comprises: in the process of optimizing the model of the fine-grained two-classifier, after the step of screening out the pictures to be manually cleaned based on the set confidence threshold and the first picture quantity threshold corresponding to the confidence threshold, the method further comprises:
and based on a set rule, selecting pictures needing to be manually cleaned from each prediction confidence coefficient distribution interval.
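Claim 4 does not specify the "set rule". As one hypothetical rule, the sketch below splits the confidence range into equal-width intervals and draws a fixed number of pictures at random from each; the bin count, the per-interval quota and the seed are all assumptions made for illustration.

```python
import random


def sample_per_interval(confidences, n_bins=5, per_bin=2, seed=0):
    """Pick up to `per_bin` picture indices from each confidence interval.

    The range [0, 1] is split into `n_bins` equal-width intervals
    (an assumed rule; the patent only refers to "a set rule").
    """
    rng = random.Random(seed)
    bins = [[] for _ in range(n_bins)]
    for i, c in enumerate(confidences):
        # Clamp so that a confidence of exactly 1.0 lands in the last interval.
        k = min(int(c * n_bins), n_bins - 1)
        bins[k].append(i)
    chosen = []
    for members in bins:
        chosen.extend(rng.sample(members, min(per_bin, len(members))))
    return sorted(chosen)
```

Sampling across all intervals, rather than only below the threshold, lets the manual check also cover pictures the classifier is confident about.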
5. The method of claim 4, wherein the method further comprises the steps of: the method for obtaining the model accuracy of the fine-grained two-classifier based on the confidence degrees of the class predictions of all the pictures to be manually cleaned and the feedback result of the manual cleaning specifically comprises the following substeps:
for the pictures screened out for manual cleaning based on the set confidence threshold and the first picture quantity threshold corresponding to the confidence threshold, judging that the classification is wrong when the class prediction of a picture conflicts with the manual cleaning feedback result, and otherwise judging that the classification is correct;
for the pictures selected for manual cleaning from each prediction confidence distribution interval, judging that the classification is wrong when the class prediction of a picture conflicts with the manual cleaning feedback result, and otherwise judging that the classification is correct;
and calculating the model accuracy of the fine-grained secondary classifier based on the classification judgment result.
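The accuracy calculation of claim 5 can be illustrated as follows. The sketch assumes that a confidence is turned into a class prediction with a 0.5 decision boundary and that accuracy is the fraction of manually checked pictures whose prediction agrees with the manual label; neither detail is stated in the claim.

```python
def model_accuracy(confidences, manual_labels, decision_threshold=0.5):
    """Fraction of manually checked pictures that were classified correctly.

    A prediction conflicts with the manual feedback when the thresholded
    confidence (assumed boundary: 0.5) differs from the manual 0/1 label.
    """
    if len(confidences) != len(manual_labels):
        raise ValueError("confidences and manual_labels must have equal length")
    if not confidences:
        raise ValueError("at least one manually cleaned picture is required")
    correct = sum(
        1 for c, y in zip(confidences, manual_labels)
        if (1 if c >= decision_threshold else 0) == y
    )
    return correct / len(confidences)
```

The two groups of pictures named in the claim (threshold-screened and interval-sampled) would simply be concatenated before calling the function.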
6. The method of claim 1, wherein the method further comprises: the step of screening out the pictures which need to be manually cleaned based on the set confidence threshold and the first picture quantity threshold corresponding to the confidence threshold further comprises:
and when the number of the pictures with the predicted confidence degrees larger than the set confidence degree threshold value in the pictures to be cleaned is smaller than a preset second picture number threshold value, the set confidence degree threshold value is reduced, and then the picture data higher than the adjusted confidence degree threshold value is provided for manual cleaning.
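Claim 6 lowers the confidence threshold when too few pictures exceed it. The claim does not say by how much, so the sketch below assumes a fixed step that is applied repeatedly until enough pictures qualify or a floor is reached; `step` and `floor` are hypothetical parameters.

```python
def adjust_threshold(confidences, confidence_threshold, second_count_threshold,
                     step=0.05, floor=0.0):
    """Lower the threshold until enough pictures lie above it.

    Returns the adjusted threshold and the indices of the pictures above it,
    which are then provided for manual cleaning.
    """
    def above(t):
        return [i for i, c in enumerate(confidences) if c > t]

    threshold = confidence_threshold
    selected = above(threshold)
    while len(selected) < second_count_threshold and threshold - step >= floor:
        # Rounding keeps repeated subtraction from accumulating float noise.
        threshold = round(threshold - step, 10)
        selected = above(threshold)
    return threshold, selected
```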
7. The method of claim 1, wherein the method further comprises:
recording the confidence of the class prediction of a picture to be manually cleaned as $\mathrm{Confidence}_{predict}$ and the feedback result of the manual cleaning as $\mathrm{Confidence}_{groundtruth}$, the calculation formula of the error-prone sample is as follows:
$$\left|\mathrm{Confidence}_{groundtruth} - \mathrm{Confidence}_{predict}\right| > \mathrm{threshold}$$
wherein $\mathrm{Confidence}_{predict}$ has a value range of (0, 1), $\mathrm{Confidence}_{groundtruth}$ takes 0 or 1, and threshold is a preset threshold value.
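The error-prone criterion of claim 7 translates directly into a comparison of the manual label with the predicted confidence. In the sketch below, the function name and the default threshold of 0.5 are assumptions; splitting the result by manual label follows the claim 1 wording that error-prone samples comprise positive and negative ones.

```python
def find_error_prone(confidences, ground_truth, threshold=0.5):
    """Split error-prone samples into positive and negative index lists.

    A sample is error-prone when |ground_truth - confidence| > threshold,
    where ground_truth is the manual label (0 or 1).
    """
    positives, negatives = [], []
    for i, (p, g) in enumerate(zip(confidences, ground_truth)):
        if abs(g - p) > threshold:
            (positives if g == 1 else negatives).append(i)
    return positives, negatives


# Picture 0 is a positive predicted at 0.1; picture 2 is a negative predicted at 0.8.
print(find_error_prone([0.1, 0.9, 0.8, 0.3], [1, 1, 0, 0]))
```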
8. An apparatus for optimizing automated picture data cleaning quality, comprising:
the first screening module is used for acquiring a picture set to be cleaned, inputting the picture set to be cleaned into a preset coarse-grained second classifier, and screening a first-class picture set meeting requirements;
the first calculation module is used for inputting the first type of picture set into a preset fine-grained second classifier to obtain the confidence coefficient of class prediction of each picture to be cleaned;
the second screening module is used for screening out the pictures needing to be manually cleaned based on the set confidence threshold value and the first picture quantity threshold value corresponding to the confidence threshold value;
the second calculation module is used for obtaining the model accuracy of the fine-grained second classifier based on the confidence degrees of the class predictions of all the pictures to be manually cleaned and the feedback result of the manual cleaning;
the optimization module is used for performing model optimization on the fine-grained secondary classifier based on confidence degrees of class prediction of all pictures to be manually cleaned, feedback results of manual cleaning and sample pictures by taking model accuracy and a model optimization frequency threshold of the fine-grained secondary classifier as optimization conditions;
the model optimization of the fine-grained two-classifier specifically comprises the following substeps:
acquiring an error-prone sample, wherein the error-prone sample comprises an error-prone positive sample and an error-prone negative sample;
and when the model accuracy of the fine-grained two-classifier is smaller than the set accuracy threshold and the number of model optimizations is smaller than the model optimization times threshold, using the obtained error-prone positive samples and error-prone negative samples, together with the other positive samples and negative samples, as a training set to fine-tune the fine-grained two-classifier again, thereby optimizing the fine-grained two-classifier, and at the same time increasing the model optimization count of the fine-grained two-classifier by one.
9. A system for optimizing automatic picture data cleaning quality, characterized by comprising:
a processor adapted to implement instructions; and
a storage device adapted to store a plurality of instructions adapted to be loaded by a processor and to perform the steps of any of claims 1 to 7.
CN201910267802.1A 2019-04-03 2019-04-03 Method, device and system for optimizing automatic picture data cleaning quality Active CN110083728B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910267802.1A CN110083728B (en) 2019-04-03 2019-04-03 Method, device and system for optimizing automatic picture data cleaning quality

Publications (2)

Publication Number Publication Date
CN110083728A CN110083728A (en) 2019-08-02
CN110083728B true CN110083728B (en) 2021-08-20

Family

ID=67414238

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910267802.1A Active CN110083728B (en) 2019-04-03 2019-04-03 Method, device and system for optimizing automatic picture data cleaning quality

Country Status (1)

Country Link
CN (1) CN110083728B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111667003B (en) * 2020-06-05 2023-11-03 北京百度网讯科技有限公司 Data cleaning method, device, equipment and storage medium
CN112633320B (en) * 2020-11-26 2023-04-07 西安电子科技大学 Radar radiation source data cleaning method based on phase image coefficient and DBSCAN
CN112529851B (en) * 2020-11-27 2023-07-18 中冶赛迪信息技术(重庆)有限公司 Hydraulic pipe state determining method, system, terminal and medium
CN112418169A (en) * 2020-12-10 2021-02-26 上海芯翌智能科技有限公司 Method and equipment for processing human body attribute data
CN113344098A (en) * 2021-06-22 2021-09-03 北京三快在线科技有限公司 Model training method and device
CN114495291B (en) * 2022-04-01 2022-07-12 杭州魔点科技有限公司 Method, system, electronic device and storage medium for in vivo detection
CN118332343B (en) * 2024-06-13 2024-08-16 健数(长春)科技有限公司 Blood routine-based semi-supervised model optimized pulmonary tuberculosis disease classification method and system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107977412A (en) * 2017-11-22 2018-05-01 上海大学 It is a kind of based on iterative with interactive perceived age database cleaning method
CN108874900A (en) * 2018-05-24 2018-11-23 四川斐讯信息技术有限公司 A kind of acquisition methods and system of samples pictures data acquisition system
CN108875821A (en) * 2018-06-08 2018-11-23 Oppo广东移动通信有限公司 The training method and device of disaggregated model, mobile terminal, readable storage medium storing program for executing
CN109165665A (en) * 2018-07-06 2019-01-08 上海康斐信息技术有限公司 A kind of category analysis method and system
CN109241903A (en) * 2018-08-30 2019-01-18 平安科技(深圳)有限公司 Sample data cleaning method, device, computer equipment and storage medium
CN109241397A (en) * 2018-07-06 2019-01-18 四川斐讯信息技术有限公司 A kind of method and apparatus for cleaning data

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012008037A1 (en) * 2010-07-15 2012-01-19 富士通株式会社 Moving image decoding apparatus, moving image decoding method, moving image encoding apparatus and moving image encoding method
CN108664497B (en) * 2017-03-30 2020-11-03 大有秦鼎(北京)科技有限公司 Data matching method and device

Also Published As

Publication number Publication date
CN110083728A (en) 2019-08-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20201012

Address after: 201615 room 1001, building 21, No. 1158, Zhongxin Road, Jiuting Town, Songjiang District, Shanghai

Applicant after: Shanghai re SR Information Technology Co.,Ltd.

Address before: The new town of Pudong New Area Nanhui lake west two road 201306 Shanghai City No. 888 building C

Applicant before: Shanghai Lianyin Electronic Technology Partnership (L.P.)

GR01 Patent grant