CN111652257A

CN111652257A - Sample data cleaning method and system

Info

Publication number: CN111652257A
Application number: CN201910239563.9A
Authority: CN
Inventors: 熊杰成
Original assignee: Shanghai Re Sr Information Technology Co ltd
Current assignee: Shanghai Re Sr Information Technology Co ltd
Priority date: 2019-03-27
Filing date: 2019-03-27
Publication date: 2020-09-11

Abstract

The invention relates to the field of data processing and discloses a sample data cleaning method comprising the following steps: providing a test picture set, clustering the test picture set with a cluster analysis algorithm, and obtaining a positive sample test picture set and a negative sample test picture set; training a fine-grained binary classifier on the positive sample test picture set and the negative sample test picture set; performing class prediction on the picture data to be cleaned with the fine-grained binary classifier, and obtaining the confidence of the class prediction for each piece of picture data to be cleaned; and cleaning the sample data according to a preset confidence interval and the confidence of the class prediction for each piece of picture data to be cleaned. The invention also discloses a sample data cleaning system. With the technical scheme provided by the invention, data can be cleaned automatically and the accuracy of the original image data is improved.

Description

Sample data cleaning method and system

Technical Field

The invention relates to the field of data processing, in particular to a sample data cleaning method and system.

Background

With the breakthrough progress of deep learning in image recognition, neural networks have become the mainstream algorithm in that field. A neural network model needs no manually designed features during training and can automatically discover the features hidden in its input; at the same time, the weight sharing of the network greatly reduces the complexity of the model and the number of weights. These advantages are particularly evident when the input of the network is an image: the original image can be used directly as the input, which avoids the complex feature extraction and data reconstruction of traditional recognition algorithms.

However, the neural network is a supervised learning algorithm, and good recognition accuracy can be obtained only by training on massive picture data with accurate labels. The most convenient way to acquire the large amount of image sample data needed to train a neural network model is to collect it from the network with a web crawler, which captures information that meets set conditions from the massive information on the internet. The captured data are then screened and cleaned manually. Manual screening involves an extremely large workload, its results are highly subjective and prone to error, and training a neural network on wrong image sample data leads to wrong classification results. The cleaning of massive picture data has therefore become a bottleneck restricting the development of neural network technology. During data cleaning, the data that are very probably correct and the data that are very probably wrong are picked out first, the data in between that are hard to confirm are screened, and then the positive samples and negative samples are picked out.

Therefore, the invention provides a method for automatically cleaning sample data, which reduces the manual cleaning cost and improves the accuracy of the original image data.

Disclosure of Invention

The invention aims to provide a sample data cleaning method and system, which realize automatic data cleaning and improve the accuracy of original image data.

To achieve the above object, the present invention provides a sample data cleaning method, the method comprising: providing a test picture set, clustering the test picture set with a cluster analysis algorithm, and obtaining a positive sample test picture set and a negative sample test picture set; training a fine-grained binary classifier on the positive sample test picture set and the negative sample test picture set; performing class prediction on the picture data to be cleaned with the fine-grained binary classifier, and obtaining the confidence of the class prediction for each piece of picture data to be cleaned; and cleaning the sample data according to a preset confidence interval and the confidence of the class prediction for each piece of picture data to be cleaned. The method greatly reduces the manual workload, reduces the data screening errors caused by the subjectivity of manual screening, and improves the robustness of the neural network model.

Optionally, the step of obtaining the test picture set includes: acquiring an initial test picture set with a web crawler; and screening the initial test picture set with a preset coarse-grained binary classifier to obtain the test picture set. The coarse-grained binary classifier performs an initial classification of the massive picture data and provides accurate sample picture data for the training of the subsequent fine-grained classifier.

Optionally, in step S1 the cluster analysis algorithm is a K-means algorithm. Step S1 further includes: dividing the picture data set into k classes, and selecting k typical pictures from the test picture set as the initial cluster centers of the classes; calculating the distance between each picture in the test picture set and the initial cluster center of each class, assigning each picture to the class at minimum distance and computing the resulting cluster center value, which completes one iteration; repeating this iteration until the calculated cluster center value equals the previous center value, thereby obtaining the cluster center of each class; and calculating the distance between each picture and the cluster center of each class, forming the positive sample test picture set from the pictures with the closest distance, and forming the negative sample set from the pictures with the farthest distance. The positive sample test picture set and the negative sample test picture set contain the same number of pictures. The clustering function of the clustering algorithm thus yields a positive sample test picture set and a negative sample test picture set, which serve as the training set for the subsequent fine-grained binary classifier.

Optionally, the data to be cleaned are screened with a preset coarse-grained binary classifier to obtain the initial data to be cleaned. The coarse-grained binary classifier performs an initial classification of the massive picture data and provides accurate sample picture data for the subsequent fine-grained classifier.

Optionally, step S4 includes: setting a confidence interval; and sorting each piece of picture data to be cleaned into the corresponding confidence set according to the confidence of its class prediction and the confidence interval. The picture data in the confidence set whose confidence level reaches a preset level are taken as positive sample picture data. Because the confidence of the class prediction is calculated for the acquired picture data in real time, the picture data can be cleaned according to the confidence and more accurate picture data are obtained.

The invention also provides a sample data cleaning system, which comprises: a clustering module for providing a test picture set, clustering the test picture set with a cluster analysis algorithm, and obtaining a positive sample test picture set and a negative sample test picture set; a training module for training a fine-grained binary classifier on the positive sample test picture set and the negative sample test picture set; a classification module for performing class prediction on the picture data to be cleaned with the fine-grained binary classifier and obtaining the confidence of the class prediction for each piece of picture data to be cleaned; and a cleaning module for cleaning the sample data according to a preset confidence interval and the confidence of the class prediction for each piece of picture data to be cleaned. The system greatly reduces the manual workload, reduces the data screening errors caused by the subjectivity of manual screening, and improves the robustness of the neural network model.

Optionally, the clustering module further includes: an acquisition unit for acquiring an initial test picture set with a web crawler; and a coarse-grained binary classifier unit for screening the initial test picture set with a preset coarse-grained binary classifier to obtain the test picture set.

Optionally, the cleaning module includes: a setting unit for setting a confidence interval; a statistical unit for sorting each piece of picture data to be cleaned into the corresponding confidence set according to the confidence of its class prediction and the confidence interval; and a positive sample unit for taking the picture data in the confidence set whose confidence level reaches a preset level as positive sample picture data.

Compared with the prior art, the data cleaning method and system provided by the invention have the following beneficial effects: a positive sample test picture set and a negative sample test picture set are generated by the clustering algorithm and used to train a fine-grained binary classifier, which then classifies the data to be cleaned. The cleaning of mass data is thereby automated, the workload of manual cleaning is greatly reduced, the data screening errors caused by the subjectivity of manual screening are reduced, and the robustness of the neural network is improved.

Drawings

FIG. 1 is a schematic flow chart of a sample data cleaning method according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of a sample data cleaning system according to an embodiment of the present invention.

Detailed Description

The present invention will be described in detail below with reference to specific embodiments shown in the drawings. In the drawings, structurally identical elements are represented by like reference numerals, and structurally or functionally similar elements are represented by like reference numerals throughout the several views. The size and thickness of each component shown in the drawings are arbitrarily illustrated, and the present invention is not limited to the size and thickness of each component. The thickness of the components may be exaggerated where appropriate in the figures to improve clarity.

As shown in FIG. 1, an embodiment of the present invention provides a sample data cleaning method, including:

S1, providing a test picture set, clustering the test picture set with a cluster analysis algorithm, and obtaining a positive sample test picture set and a negative sample test picture set;

S2, training a fine-grained binary classifier on the positive sample test picture set and the negative sample test picture set;

S3, performing class prediction on the picture data to be cleaned with the fine-grained binary classifier, and obtaining the confidence of the class prediction for each piece of picture data to be cleaned;

S4, cleaning the sample data according to a preset confidence interval and the confidence of the class prediction for each piece of picture data to be cleaned.

In the prior art, data are generally screened manually in order to improve the performance of the neural network model. Manual screening involves a large workload, and because it is subjective, with the user judging and classifying the acquired data by their apparent similarity, wrong data may be passed to the neural network model and degrade its performance. In the present method, a positive sample test picture set and a negative sample test picture set are generated by a clustering algorithm, a fine-grained binary classifier is trained on them, and the picture data to be cleaned are classified by that classifier. This greatly reduces the manual workload, reduces the data screening errors caused by the subjectivity of manual screening, and improves the robustness of the neural network model.

In a specific embodiment of the present invention, the step of obtaining the test picture set includes: acquiring an initial test picture set with a web crawler, and screening the initial test picture set with a preset coarse-grained binary classifier to obtain the test picture set. The most convenient way to acquire the large amount of image sample data needed to train a neural network model is a web crawler, which captures information meeting set conditions from the massive information on the internet; however, the picture information acquired this way is massive and much of it is not needed. Suppose picture data of category A are to be obtained through the web crawler: the crawling result usually also contains a large amount of non-A picture data. The massive picture data obtained by the crawler are therefore initially classified by a coarse-grained binary classifier, which removes the non-A picture data and keeps the A picture data. For example, when pictures of the dish scrambled eggs with tomato are collected by the web crawler, pictures that do not show this dish are often crawled as well, and the coarse-grained binary classifier is used to keep only the pictures of scrambled eggs with tomato. In this technical scheme, the coarse-grained binary classifier performs an initial classification of the massive picture data and provides accurate sample picture data for the training of the subsequent fine-grained classifier.
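The coarse-grained screening step can be illustrated with a minimal sketch. The patent does not specify how the coarse-grained classifier is built, so it is represented here by an arbitrary callable; the function name `coarse_filter` and the file names are hypothetical, and the file-name check merely stands in for a real pre-trained model.

```python
from typing import Callable, Iterable, List

def coarse_filter(pictures: Iterable[str], is_category_a: Callable[[str], bool]) -> List[str]:
    """Keep only the pictures that the coarse classifier accepts as category A."""
    return [p for p in pictures if is_category_a(p)]

# Hypothetical crawl result; the lambda is a stand-in for a real classifier.
crawled = ["tomato_egg_01.jpg", "cat.jpg", "tomato_egg_02.jpg"]
test_set = coarse_filter(crawled, lambda name: name.startswith("tomato_egg"))
```

Passing the classifier in as a callable keeps the sketch independent of any particular model or framework.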

In a specific embodiment of the present invention, the cluster analysis algorithm in step S1 is a K-means algorithm. K-means is an indirect clustering algorithm based on the similarity between samples and belongs to the unsupervised learning methods. The algorithm takes k as a parameter and divides n objects into k clusters so that similarity within a cluster is high and similarity between clusters is low. Similarity is calculated with respect to the mean of the objects in a cluster, which can be seen as the center of gravity (centroid) of the cluster. The algorithm first randomly selects k objects, each representing the centroid of a cluster. Each remaining object is assigned to the most similar cluster according to its distance to the respective cluster centroids. A new centroid is then calculated for each cluster. This process is repeated until the criterion function converges. K-means is a typical dynamic clustering algorithm that modifies the result iteratively, point by point; its essential feature is that the sum of squared errors is used as the criterion function.

The invention uses the K-means cluster analysis algorithm to cluster the test picture set and obtain a positive sample test picture set and a negative sample test picture set. k typical pictures are selected manually from the test picture set, k classes are generated for the test picture set, the pictures closest to each typical picture are taken out to form the positive sample test picture set, and the pictures farthest from each typical picture are taken out to form the negative sample set.
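The assign-and-update loop described above can be sketched in plain Python. This is a minimal illustration, not the patented implementation: it assumes every picture has already been reduced to a numeric feature vector (the patent does not say how), it uses Euclidean distance, and the function name `kmeans` and the `max_iter` cap are assumptions. The initial centers are passed in so that either randomly chosen objects or hand-picked typical pictures can be used.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Assign each point to its nearest center, then move each center to its cluster mean."""
    centers = [list(c) for c in centers]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: index of the nearest center by Euclidean distance.
        labels = [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Update step: each center becomes the per-feature mean of its members.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(old)  # an empty cluster keeps its previous center
        if new_centers == centers:  # centers unchanged: converged
            break
        centers = new_centers
    return centers, labels

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, labels = kmeans(points, centers=[points[0], points[2]])
```

On this toy input the two left-hand points end up in cluster 0 and the two right-hand points in cluster 1.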

Specifically, the test picture set is taken as the set of data objects and is divided into k classes. k typical pictures are selected from the test picture set as the initial cluster centers a_1, a_2, ..., a_k of the classes. A typical picture is a picture that best fits its picture category, and k is typically chosen to be a single-digit number. For each picture in the test picture set, the distance to each class center a_1, a_2, ..., a_k is calculated. Suppose the picture is closest to a_j; the picture is then marked as class j, that is, it is classified into the nearest class. Dividing all pictures in this way by minimum distance forms an initial clustering result and completes one iteration. a_j is then recalculated over all picture data marked as class j, where a_j is the mean of each feature of all picture data marked as class j; this mean is the center value of the cluster. If the center value of a class changes after the pictures are re-divided, the distance between each picture and the center values is calculated again and the pictures are re-divided by minimum distance. The iteration is repeated until each newly calculated center value equals the previous one, that is, until a_j no longer changes, which indicates that the function has converged; the algorithm then ends and yields the center of each class.
The distance between each picture and the center of each class is then calculated; the pictures with the closest distance are taken out to form the positive sample test picture set, and the pictures with the farthest distance are taken out to form the negative sample set. The positive sample test picture set and the negative sample test picture set contain the same number of pictures.
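The selection of positive and negative samples by distance can be sketched as follows. The patent does not spell out exactly which distance is ranked, so this sketch assumes each picture is ranked by the distance to its nearest class center; the function name `select_samples` and the parameter `n` (the common size of both sets) are hypothetical, and pictures are again assumed to be numeric feature vectors.

```python
import math

def select_samples(features, centers, n):
    """Return (positive_ids, negative_ids): the n nearest and the n farthest pictures."""
    # Distance from each picture to its nearest class center (an assumption of this sketch).
    nearest = [min(math.dist(f, c) for c in centers) for f in features]
    order = sorted(range(len(features)), key=lambda i: nearest[i])
    # Equal-sized sets, as the description requires; assumes 2 * n <= len(features).
    return order[:n], order[-n:]

features = [(0.0, 0.1), (5.0, 5.0), (0.2, 0.0), (9.0, 9.5), (10.0, 10.0)]
centers = [(0.0, 0.0), (10.0, 10.0)]
positive, negative = select_samples(features, centers, n=1)
```

Here the picture lying exactly on a center becomes the positive sample, and the picture halfway between the two centers becomes the negative sample.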

Class prediction is then performed on the picture data to be cleaned with the fine-grained binary classifier, and the confidence of the class prediction is obtained for each piece of picture data to be cleaned. In an embodiment of the present invention, step S3 further includes: screening the data to be cleaned with a preset coarse-grained binary classifier to obtain the initial data to be cleaned. In this technical scheme, the coarse-grained binary classifier performs an initial classification of the massive picture data and provides accurate sample picture data for the subsequent fine-grained classifier.

Specifically, the obtained picture data to be cleaned are screened and classified by the trained fine-grained binary classifier. For example, 100,000 pieces of picture data to be cleaned are crawled from the network and input into the fine-grained binary classifier, which performs class prediction on each piece. In a preferred embodiment of the present invention, building on the above embodiment, the data to be cleaned are first screened with a preset coarse-grained binary classifier to obtain the initial data to be cleaned; the initial data to be cleaned are then input into the fine-grained binary classifier, which performs class prediction on each piece of picture data. In either case, the fine-grained binary classifier outputs, for each piece of picture data to be cleaned, a predicted class and the confidence corresponding to that class. Each confidence represents the probability of the predicted class: the higher the confidence, the more likely the picture matches the predicted class, that is, the more similar the picture is to the correct class. Because the confidence of the class prediction is calculated for the acquired picture data in real time, the picture data can be cleaned according to the confidence and more accurate picture data are obtained.
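How a binary classifier yields both a predicted class and a confidence can be shown with a small sketch. The patent does not describe the classifier's output layer, so this assumes the common case in which the model produces a single raw score (logit) that a sigmoid maps to the probability of the positive class; the function name `predict_with_confidence` is hypothetical.

```python
import math

def predict_with_confidence(logit):
    """Map a raw classifier score to (predicted class, probability of that class)."""
    p_positive = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
    if p_positive >= 0.5:
        return "positive", p_positive
    return "negative", 1.0 - p_positive

label_hi, conf_hi = predict_with_confidence(6.0)   # strongly positive score
label_lo, conf_lo = predict_with_confidence(-2.0)  # moderately negative score
```

The confidence is always the probability of whichever class was predicted, so under this convention it never falls below 0.5.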

The sample data are then cleaned according to a preset confidence interval and the confidence of the class prediction of each piece of picture data to be cleaned. Specifically, a confidence interval is set, and each piece of picture data to be cleaned is sorted into the corresponding confidence set according to the confidence of its class prediction and the confidence interval. The picture data in the confidence set whose confidence level reaches a preset level are taken as positive sample picture data. Because each piece of picture data to be cleaned is sorted into a confidence set by its confidence, the picture data in each confidence set can conveniently be screened afterwards to assemble the required sample picture data set. The confidence interval is adjustable. For example, a high confidence of 0.99 is set, and pictures to be cleaned whose confidence is higher than 0.99 are taken as positive sample picture data. Pictures below 0.99 are treated as negative sample picture data and left for manual cleaning. The higher the set confidence, the fewer pictures qualify as positive sample picture data. When the pictures that reach the high confidence interval make up less than a certain proportion of the pictures to be cleaned, the preset high confidence interval is lowered, for example to a confidence of 0.95, and the positive sample picture data and negative sample picture data are obtained in the same way, and so on. Picture data whose confidence falls within the preset confidence interval are pictures similar to the true class, that is, the pictures the user needs.
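The threshold-lowering procedure can be sketched as below. The values 0.99 and 0.95 come from the example in the description; the "certain proportion" is not given in the patent, so `min_fraction` and its default are assumptions, as is the function name `clean_by_confidence`.

```python
def clean_by_confidence(confidences, thresholds=(0.99, 0.95), min_fraction=0.2):
    """Try thresholds from strictest to loosest until enough pictures pass.

    Returns (threshold used, indices kept as positive, indices left for manual cleaning).
    """
    for t in thresholds:
        positive = [i for i, c in enumerate(confidences) if c > t]
        if len(positive) >= min_fraction * len(confidences):
            break  # enough pictures reach this confidence level
    kept = set(positive)
    review = [i for i in range(len(confidences)) if i not in kept]
    return t, positive, review

confidences = [0.995, 0.97, 0.96, 0.60, 0.991]
used, positive, review = clean_by_confidence(confidences, min_fraction=0.5)
```

With these five confidences only two pictures exceed 0.99, which is below the assumed 50% proportion, so the sketch falls back to 0.95 and keeps four pictures. If no threshold satisfies the proportion, the loosest threshold in the list is used.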

As shown in FIG. 2, an embodiment of the present invention provides a sample data cleaning system, the system comprising:

the clustering module 20, configured to provide a test picture set, cluster the test picture set with a cluster analysis algorithm, and obtain a positive sample test picture set and a negative sample test picture set;

the training module 21, configured to train a fine-grained binary classifier on the positive sample test picture set and the negative sample test picture set;

the classification module 22, configured to perform class prediction on the picture data to be cleaned with the fine-grained binary classifier and obtain the confidence of the class prediction for each piece of picture data to be cleaned;

and the cleaning module 23, configured to clean the sample data according to a preset confidence interval and the confidence of the class prediction for each piece of picture data to be cleaned.

The clustering module provides a test picture set and clusters it with a cluster analysis algorithm to obtain a positive sample test picture set and a negative sample test picture set. Specifically, the clustering module further comprises an acquisition unit and a coarse-grained binary classifier unit. The acquisition unit acquires the initial test picture set with a web crawler. The coarse-grained binary classifier unit screens the initial test picture set with a preset coarse-grained binary classifier to obtain the test picture set. The coarse-grained binary classifier performs an initial classification of the massive picture data and provides accurate sample picture data for the training of the subsequent fine-grained classifier. The cluster analysis algorithm is a K-means algorithm, with which the test picture set is clustered to obtain the positive sample test picture set and the negative sample test picture set: k typical pictures are selected manually from the test picture set, k classes are generated for the test picture set, the pictures closest to each typical picture are taken out to form the positive sample test picture set, and the pictures farthest from each typical picture are taken out to form the negative sample set.

The training module trains a fine-grained binary classifier on the positive sample test picture set and the negative sample test picture set.

The classification module performs class prediction on the picture data to be cleaned with the fine-grained binary classifier and obtains the confidence of the class prediction for each piece of picture data to be cleaned. Once the fine-grained binary classifier has been trained, all the picture data to be cleaned are input into it; the classifier predicts the class of each piece of picture data and outputs the confidence corresponding to the predicted class. Each confidence represents the probability of the predicted class: the higher the confidence, the more likely the picture matches the predicted class, that is, the more similar the picture is to the correct class.

The cleaning module cleans the sample data according to a preset confidence interval and the confidence of the class prediction of each piece of picture data to be cleaned. The cleaning module comprises a setting unit, a statistical unit and a positive sample unit. The setting unit sets a confidence interval. The statistical unit sorts each piece of picture data to be cleaned into the corresponding confidence set according to the confidence of its class prediction and the confidence interval. The positive sample unit takes the picture data in the confidence set whose confidence level reaches a preset level as positive sample picture data. Picture data whose confidence falls within the preset confidence interval are pictures similar to the true class, that is, the pictures the user needs.
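The way the four modules hand data to one another can be sketched as a small Python class. Everything here is illustrative: the class name `CleaningSystem`, the callables standing in for the clustering and training modules, and the toy data (single numbers instead of pictures) are assumptions, not part of the patent.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class CleaningSystem:
    # Clustering module 20: test set -> (positive samples, negative samples).
    cluster: Callable[[List[float]], Tuple[List[float], List[float]]]
    # Training module 21: samples -> classifier returning a confidence per picture.
    train: Callable[[List[float], List[float]], Callable[[float], float]]
    threshold: float = 0.99  # lower bound of the preset confidence interval

    def run(self, test_set: List[float], to_clean: List[float]) -> List[float]:
        positives, negatives = self.cluster(test_set)
        confidence_of = self.train(positives, negatives)
        scored = [(x, confidence_of(x)) for x in to_clean]         # classification module 22
        return [x for x, conf in scored if conf > self.threshold]  # cleaning module 23

# Toy stand-ins: the upper half of the sorted test set is "positive", and the
# classifier is fully confident at or above the midpoint of the two sample means.
def toy_cluster(xs):
    xs = sorted(xs)
    return xs[len(xs) // 2:], xs[:len(xs) // 2]

def toy_train(pos, neg):
    mid = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1.0 if x >= mid else 0.0

system = CleaningSystem(cluster=toy_cluster, train=toy_train)
kept = system.run(test_set=[1.0, 2.0, 9.0, 10.0], to_clean=[0.5, 9.5, 4.0, 7.0])
```

Injecting the clustering and training steps as callables mirrors the modular structure of FIG. 2 and lets each module be replaced independently.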

By the technical scheme, the manual workload can be greatly reduced, errors in data screening caused by subjectivity of manual screening are reduced, and the robustness of the neural network model is improved.

While the invention has been described in detail in the foregoing with reference to the drawings and examples, such illustration and description are to be considered illustrative or exemplary and not restrictive. The invention is not limited to the disclosed embodiments. In the claims, the word "comprising" does not exclude other elements or steps, and the word "a" or "an" or "a particular plurality" should be understood to mean at least one or at least a particular plurality. Any reference signs in the claims shall not be construed as limiting the scope. Other variations to the above-described embodiments can be understood and effected by those skilled in the art without inventive faculty, from a study of the drawings, the description and the appended claims, which will still fall within the scope of the invention as claimed.

Claims

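1. A sample data cleaning method, characterized in that the method comprises:

S1, providing a test picture set, clustering the test picture set according to a clustering analysis algorithm, and acquiring a positive sample test picture set and a negative sample test picture set;

S2, training to obtain a fine-grained binary classifier according to the positive sample test picture set and the negative sample test picture set;

S3, performing class prediction on the picture data to be cleaned according to the fine-grained binary classifier, and acquiring the confidence of the class prediction of each piece of picture data to be cleaned;

S4, cleaning the sample data according to a preset confidence interval and the confidence of the class prediction of each piece of picture data to be cleaned.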

2. The sample data cleaning method according to claim 1, wherein the step of acquiring the test picture set comprises:

acquiring an initial test picture set by using a web crawler;

and training the initial test picture set according to a preset coarse-grained binary classifier to obtain the test picture set.

3. The sample data cleaning method according to claim 1, wherein said step S1 further includes: the clustering analysis algorithm is a K-means algorithm.

4. The sample data cleaning method according to claim 3, wherein said step S1 further includes:

S10, dividing the picture data set into k classes, and selecting k typical pictures from the test picture set as the initial clustering centers of the classes;

S11, calculating the distance between each picture in the test picture set and the initial clustering center of each class, forming an initial clustering center value according to the minimum distance, and finishing one iteration;

S12, repeating the iterative process of step S11 until the calculated clustering center value is equal to the original center value, and obtaining the clustering center of each class;

S13, calculating the distance between each picture and the clustering center of each class, forming a positive sample test picture set from the pictures with the closest distance, and forming a negative sample set from the pictures with the farthest distance, wherein the number of pictures in the positive sample test picture set is consistent with that in the negative sample test picture set.

5. The sample data cleaning method according to claim 1, wherein said step S3 further includes: training the data to be cleaned according to a preset coarse-grained binary classifier to obtain initial data to be cleaned.

6. The method for cleaning sample data according to claim 1, wherein said step S4 includes:

setting a confidence interval;

and classifying the image data to be cleaned into a corresponding confidence set according to the confidence of the class prediction of each image data to be cleaned and the confidence interval.

7. The sample data cleaning method according to claim 6, wherein said step S4 further includes:

and acquiring picture data in the confidence coefficient set with the confidence coefficient level reaching a preset level as positive sample picture data.

8. A sample data cleaning system, the system comprising:

the clustering module is used for providing a test picture set, clustering the test picture set according to a clustering analysis algorithm and acquiring a positive sample test picture set and a negative sample test picture set;

the training module is used for training to obtain a fine-grained binary classifier according to the positive sample test picture set and the negative sample test picture set;

the classification module is used for carrying out class prediction on the picture data to be cleaned according to the fine-grained binary classifier, and obtaining the confidence of the class prediction of each piece of picture data to be cleaned;

and the cleaning module is used for cleaning the sample data according to a preset confidence interval and the confidence of the class prediction of each piece of picture data to be cleaned.

9. The sample data cleaning system of claim 8, wherein the clustering module further comprises:

the acquisition unit is used for acquiring an initial test picture set by using a web crawler;

and the coarse-grained secondary classifier unit is used for training the initial test picture set according to a preset coarse-grained secondary classifier to obtain the test picture set.

10. The sample data cleaning system of claim 8, wherein the cleaning module comprises:

the setting unit is used for setting a confidence interval;

the statistical unit is used for classifying the image data to be cleaned into a corresponding confidence coefficient set according to the confidence coefficient of the class prediction of each image data to be cleaned and the confidence coefficient interval;

and the positive sample unit is used for acquiring the picture data in the confidence coefficient set with the confidence coefficient level reaching a preset level as the positive sample picture data.