CN111652257A - Sample data cleaning method and system - Google Patents

Sample data cleaning method and system

Publication number
CN111652257A
Authority
CN
China
Prior art keywords
data
cleaned
picture
test picture
picture set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910239563.9A
Other languages
Chinese (zh)
Inventor
熊杰成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Re Sr Information Technology Co ltd
Original Assignee
Shanghai Re Sr Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Re Sr Information Technology Co ltd filed Critical Shanghai Re Sr Information Technology Co ltd
Priority to CN201910239563.9A priority Critical patent/CN111652257A/en
Publication of CN111652257A publication Critical patent/CN111652257A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes

Abstract

The invention relates to the field of data processing and discloses a sample data cleaning method comprising the following steps: providing a test picture set, clustering the test picture set with a cluster analysis algorithm, and obtaining a positive sample test picture set and a negative sample test picture set; training a fine-grained binary classifier on the positive sample test picture set and the negative sample test picture set; performing class prediction on the picture data to be cleaned with the fine-grained binary classifier, and obtaining the confidence of the class prediction for each piece of picture data to be cleaned; and cleaning the sample data according to a preset confidence interval and the confidence of the class prediction for each piece of picture data to be cleaned. The invention also discloses a sample data cleaning system. The technical solution provided by the invention cleans data automatically and improves the accuracy of the original image data.

Description

Sample data cleaning method and system
Technical Field
The invention relates to the field of data processing, in particular to a sample data cleaning method and system.
Background
With the breakthrough progress of deep learning in image recognition, the neural network has become the mainstream algorithm in that field. A neural network model needs no manually labeled features during training and can automatically discover the features hidden in the input variables. In addition, weight sharing in the network greatly reduces the complexity of the model and the number of weights. These advantages are particularly clear when the input of the network is an image: the original image can be used directly as the input, which avoids the complex feature extraction and data reconstruction of traditional recognition algorithms.
However, a neural network is a supervised learning algorithm, and good recognition accuracy can be obtained only by training on massive amounts of accurately labeled picture data. The most convenient way to acquire the large amount of image sample data needed to train a neural network model is through the network, using a web crawler, which captures information meeting set conditions from the massive information on the internet. The crawled data is then screened and cleaned manually. This involves an extremely large workload, the screening results are highly subjective and prone to error, and training a neural network on wrong image sample data leads to wrong classification results. Cleaning massive picture data has therefore become a bottleneck restricting the development of neural network technology. In the data cleaning process, the data that is very likely correct and the data that is very likely wrong are first picked out of the data to be cleaned, the data in between that is difficult to confirm is then screened, and the positive and negative samples are finally picked out.
Therefore, the invention provides a method for automatically cleaning sample data, which reduces the manual cleaning cost and improves the accuracy of the original image data.
Disclosure of Invention
The invention aims to provide a sample data cleaning method and system, which realize automatic data cleaning and improve the accuracy of original image data.
To achieve the above object, the present invention provides a sample data cleaning method, the method comprising: providing a test picture set, clustering the test picture set with a cluster analysis algorithm, and obtaining a positive sample test picture set and a negative sample test picture set; training a fine-grained binary classifier on the positive sample test picture set and the negative sample test picture set; performing class prediction on the picture data to be cleaned with the fine-grained binary classifier, and obtaining the confidence of the class prediction for each piece of picture data to be cleaned; and cleaning the sample data according to a preset confidence threshold interval and the confidence of the class prediction for each piece of picture data to be cleaned. The method greatly reduces the manual workload, reduces the data screening errors caused by the subjectivity of manual screening, and improves the robustness of the neural network model.
Optionally, the step of obtaining the test picture set includes: acquiring an initial test picture set by using a web crawler; and training the initial test picture set according to a preset coarse-grained binary classifier to obtain the test picture set. Initially classifying the massive picture data with the coarse-grained binary classifier provides accurate sample picture data for the training of the subsequent fine-grained classifier.
Optionally, in step S1 the cluster analysis algorithm is a K-means algorithm. Step S1 further includes: dividing the picture data set into k classes, and selecting k typical pictures from the test picture set as the initial cluster centers of the classes; calculating the distance between each picture in the test picture set and the initial cluster center of each class, and forming an initial cluster center value according to the minimum distance, which completes one iteration; repeating this iterative process until the calculated cluster center value equals the previous center value, which yields the cluster center of each class; and calculating the distance between each picture and the cluster center of each class, forming a positive sample test picture set from the pictures with the closest distance and a negative sample set from the pictures with the farthest distance. The number of pictures in the positive sample test picture set is consistent with that in the negative sample test picture set. The clustering algorithm thus yields a positive sample test picture set and a negative sample test picture set, which serve as the training set for the subsequent fine-grained binary classifier.
Optionally, the data to be cleaned is trained according to a preset coarse-grained binary classifier to obtain initial data to be cleaned. Initially classifying the massive picture data with the coarse-grained binary classifier provides accurate sample picture data for the training of the subsequent fine-grained classifier.
Optionally, step S4 includes: setting a confidence interval; and placing each piece of picture data to be cleaned into the corresponding confidence set according to the confidence of its class prediction and the confidence interval. The picture data in the confidence set whose confidence level reaches a preset level is taken as positive sample picture data. Because the confidence of the class prediction is calculated in real time for the acquired picture data to be cleaned, the picture data can be cleaned according to that confidence and more accurate picture data can be obtained.
The invention also provides a sample data cleaning system, which comprises: a clustering module, used for providing a test picture set, clustering the test picture set with a cluster analysis algorithm, and obtaining a positive sample test picture set and a negative sample test picture set; a training module, used for training a fine-grained binary classifier on the positive sample test picture set and the negative sample test picture set; a classification module, used for performing class prediction on the picture data to be cleaned with the fine-grained binary classifier and obtaining the confidence of the class prediction for each piece of picture data to be cleaned; and a cleaning module, used for cleaning the sample data according to a preset confidence interval and the confidence of the class prediction for each piece of picture data to be cleaned. The system greatly reduces the manual workload, reduces the data screening errors caused by the subjectivity of manual screening, and improves the robustness of the neural network model.
Optionally, the clustering module further includes: an acquisition unit, used for acquiring an initial test picture set by using a web crawler; and a coarse-grained binary classifier unit, used for training the initial test picture set according to a preset coarse-grained binary classifier to obtain the test picture set.
Optionally, the cleaning module includes: a setting unit, used for setting a confidence interval; a statistical unit, used for placing each piece of picture data to be cleaned into the corresponding confidence set according to the confidence of its class prediction and the confidence interval; and a positive sample unit, used for taking the picture data in the confidence set whose confidence level reaches a preset level as positive sample picture data.
Compared with the prior art, the data cleaning method and system provided by the invention have the following beneficial effects: a positive sample test picture set and a negative sample test picture set are generated by the clustering algorithm, and the data is classified by the fine-grained binary classifier, which completes the cleaning of mass data. The workload of manual cleaning is greatly reduced, the cleaning is automated, the data screening errors caused by the subjectivity of manual screening are reduced, and the robustness of the neural network is improved.
Drawings
FIG. 1 is a schematic flow chart of a sample data cleaning method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a sample data cleaning system according to an embodiment of the present invention.
Detailed Description
The present invention will be described in detail below with reference to specific embodiments shown in the drawings. In the drawings, structurally identical elements are represented by like reference numerals, and structurally or functionally similar elements are represented by like reference numerals throughout the several views. The size and thickness of each component shown in the drawings are arbitrarily illustrated, and the present invention is not limited to the size and thickness of each component. The thickness of the components may be exaggerated where appropriate in the figures to improve clarity.
As shown in fig. 1, an embodiment of the present invention provides a sample data cleaning method, including:
S1, providing a test picture set, clustering the test picture set with a cluster analysis algorithm, and obtaining a positive sample test picture set and a negative sample test picture set;
S2, training a fine-grained binary classifier on the positive sample test picture set and the negative sample test picture set;
S3, performing class prediction on the picture data to be cleaned with the fine-grained binary classifier, and obtaining the confidence of the class prediction for each piece of picture data to be cleaned;
and S4, cleaning the sample data according to a preset confidence interval and the confidence of the class prediction for each piece of picture data to be cleaned.
In the prior art, data is generally screened manually in order to improve the performance of the neural network model. Manual screening involves a large workload, and because it is subjective, the user judges and classifies the acquired data by perceived similarity, so wrong data may enter the training set and degrade the performance of the neural network model. In the present method, a positive sample test picture set and a negative sample test picture set are generated by a clustering algorithm, a fine-grained binary classifier is trained on the two sets, and the picture data to be cleaned is classified with the fine-grained binary classifier. This greatly reduces the manual workload, reduces the data screening errors caused by the subjectivity of manual screening, and improves the robustness of the neural network model.
In a specific embodiment of the present invention, the step of obtaining the test picture set includes: acquiring an initial test picture set by using a web crawler, and training the initial test picture set according to a preset coarse-grained binary classifier to obtain the test picture set. The most convenient way to acquire the large amount of image sample data needed to train a neural network model is a web crawler, which captures information meeting set conditions from the massive information on the internet. However, the picture information acquired by a web crawler is massive, and much of it is not needed. Suppose picture data of category A is requested through the web crawler: the crawling result often contains a large amount of non-A picture data. The massive picture data obtained by the crawler is therefore initially classified by a coarse-grained binary classifier, which removes the non-A picture data and keeps the A picture data. For example, when pictures of the dish tomato-fried eggs are requested through the web crawler, pictures of dishes other than tomato-fried eggs are often crawled as well, and the coarse-grained binary classifier is used to keep the pictures of tomato-fried eggs. In this technical solution, the massive picture data is initially classified by the coarse-grained binary classifier, which provides accurate sample picture data for the training of the subsequent fine-grained classifier.
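The coarse pre-filtering step can be sketched as follows. This is only an illustration under stated assumptions: the source does not specify how the coarse-grained classifier is built, so `toy_coarse_classifier` is a hypothetical stand-in that inspects tags rather than pixels, and the record layout (`id`, `tags`) is invented for the example.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def coarse_filter(items: Iterable[T], is_target: Callable[[T], bool]) -> List[T]:
    """Keep only the items the coarse-grained binary classifier accepts as category A."""
    return [item for item in items if is_target(item)]


def toy_coarse_classifier(record: dict) -> bool:
    """Hypothetical stand-in for a trained coarse classifier (assumption, not from the source)."""
    return {"tomato", "egg"} <= record["tags"]


# Hypothetical crawl result: category A is "tomato-fried eggs".
crawled = [
    {"id": "img_001", "tags": {"tomato", "egg"}},
    {"id": "img_002", "tags": {"tomato", "soup"}},
    {"id": "img_003", "tags": {"egg", "tomato", "rice"}},
]

# Non-A pictures are dropped; what remains plays the role of the test picture set.
test_picture_set = coarse_filter(crawled, toy_coarse_classifier)
kept_ids = [record["id"] for record in test_picture_set]
```

Passing the classifier as a callable keeps the filter independent of whichever model is actually used.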
In a specific embodiment of the present invention, the cluster analysis algorithm in step S1 is a K-means algorithm. The K-means cluster analysis algorithm is an indirect clustering algorithm based on the similarity between samples, and belongs to the unsupervised learning methods. The algorithm takes k as a parameter and divides n objects into k clusters, so that similarity within a cluster is high and similarity between clusters is low. Similarity is calculated with respect to the mean of the objects in a cluster (regarded as the center of gravity of the cluster). The algorithm first randomly selects k objects, each representing the centroid of a cluster. Each remaining object is assigned to the most similar cluster according to its distance from the respective cluster centroids. Then a new centroid is calculated for each cluster. This process is repeated until the criterion function converges. K-means is a typical dynamic clustering algorithm that modifies its result point by point over the iterations, and its essential feature is that the sum of squared errors is used as the criterion function.
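For reference, the sum-of-squared-errors criterion mentioned above is usually written as below. The notation is an assumption introduced here for illustration: $C_j$ denotes the set of feature vectors assigned to cluster $j$, $a_j$ its center, and $x$ a picture's feature vector.

```latex
J = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - a_j \rVert^{2},
\qquad
a_j = \frac{1}{\lvert C_j \rvert} \sum_{x \in C_j} x
```

Each assignment step and each center update can only keep $J$ the same or lower it, which is why the iteration stops changing after finitely many rounds.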
The invention uses the K-means cluster analysis algorithm to cluster the test picture set and obtain a positive sample test picture set and a negative sample test picture set. k typical pictures are manually selected from the test picture set, and k classes are generated for the test picture set. The pictures closest to each typical picture are taken out to form the positive sample test picture set, and the pictures farthest from each typical picture are taken out to form the negative sample set.
Specifically, the test picture set is taken as the set of data objects and divided into k classes. k typical pictures are selected from the test picture set as the initial cluster centers a_1, a_2, ..., a_k of the classes. k is typically a single-digit number, and a typical picture is the picture that best fits its picture category. The distance between each picture in the test picture set and each of the class centers a_1, a_2, ..., a_k is calculated, and each picture is assigned to the class whose center is nearest: if a_j is the nearest center, the picture is marked as class j. This forms an initial clustering result and completes one iteration. The center value a_j of each class is then recalculated as the mean of each feature over all picture data marked as class j. If the center value of a class changes after the re-division, the distance between each picture and the center values is calculated again, the pictures are re-divided according to the minimum distance, and the mean of each changed class is recomputed. The iteration is repeated until each newly calculated center value equals the previous one, that is, until a_j no longer changes. This indicates that the function has converged; the algorithm ends and yields the center of each class.
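The iteration described above can be sketched in a few lines. This is a minimal illustration, not the patented implementation: it assumes each picture has already been reduced to a numeric feature vector (the source does not say how), uses Euclidean distance as the distance measure, and the names `k_means`, `features`, and `initial_centers` are chosen here for the example.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_center(x: Vector, centers: List[List[float]]) -> int:
    """Index j of the center a_j closest to x (Euclidean distance is an assumption)."""
    return min(range(len(centers)), key=lambda j: math.dist(x, centers[j]))


def mean(points: List[Vector]) -> List[float]:
    """Per-feature mean of a non-empty list of vectors."""
    return [sum(column) / len(points) for column in zip(*points)]


def k_means(features: List[Vector], initial_centers: List[Vector],
            max_iter: int = 100) -> Tuple[List[List[float]], List[int]]:
    """Iterate assignment and center update until no center a_j changes."""
    centers = [list(map(float, c)) for c in initial_centers]
    labels: List[int] = []
    for _ in range(max_iter):
        # Mark each picture with the class j of its nearest center.
        labels = [nearest_center(x, centers) for x in features]
        # Recompute each a_j as the mean of the pictures marked as class j.
        new_centers = []
        for j, old in enumerate(centers):
            members = [x for x, label in zip(features, labels) if label == j]
            new_centers.append(mean(members) if members else old)
        if new_centers == centers:  # converged: centers equal the previous ones
            break
        centers = new_centers
    return centers, labels


# Hypothetical 2-D features; the first and third pictures act as the "typical" pictures.
features = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centers, labels = k_means(features, initial_centers=[features[0], features[2]])
```

Seeding with hand-picked typical pictures, rather than random objects, follows the embodiment and makes the result reproducible.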
The distance between each picture and the center of each class is then calculated. The pictures with the closest distance are taken out to form the positive sample test picture set, and the pictures with the farthest distance are taken out to form the negative sample set. The number of pictures in the positive sample test picture set is consistent with that in the negative sample test picture set.
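One way to read this selection step is sketched below. Several details are assumptions made for the example, since the source leaves them open: the distance is taken to the nearest class center, it is Euclidean, and the size of each set (`n_per_set`) is a free parameter.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def split_by_distance(features: List[Vector], centers: List[Vector],
                      n_per_set: int) -> Tuple[List[int], List[int]]:
    """Return indices of the n closest (positive) and n farthest (negative) pictures."""
    if n_per_set <= 0 or 2 * n_per_set > len(features):
        raise ValueError("n_per_set must be positive and at most half the picture count")

    def distance_to_nearest_center(x: Vector) -> float:
        return min(math.dist(x, c) for c in centers)

    # Sort picture indices from closest to farthest from their nearest center.
    order = sorted(range(len(features)),
                   key=lambda i: distance_to_nearest_center(features[i]))
    positives = order[:n_per_set]   # closest pictures
    negatives = order[-n_per_set:]  # farthest pictures, same count
    return positives, negatives


# Hypothetical features and class centers.
features = [(0.0, 1.0), (0.0, 0.0), (10.0, 11.0), (10.0, 14.0), (4.0, 5.0)]
centers = [(0.0, 1.0), (10.0, 11.0)]
positive_idx, negative_idx = split_by_distance(features, centers, n_per_set=2)
```

Taking the same number of pictures from both ends keeps the two training sets balanced, as the embodiment requires.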
Class prediction is performed on the picture data to be cleaned with the fine-grained binary classifier, and the confidence of the class prediction for each piece of picture data to be cleaned is obtained. In an embodiment of the present invention, step S3 further includes: training the data to be cleaned according to a preset coarse-grained binary classifier to obtain initial data to be cleaned. In this technical solution, the massive picture data is initially classified by the coarse-grained binary classifier, which provides accurate sample picture data for the training of the subsequent fine-grained classifier.
Specifically, the acquired picture data to be cleaned is screened and classified by the trained fine-grained binary classifier. For example, 100,000 pieces of picture data to be cleaned are crawled from the network and input into the fine-grained binary classifier, which performs class prediction on each piece. In a preferred embodiment of the present invention, on the basis of the above embodiment, the data to be cleaned is first trained with a preset coarse-grained binary classifier to obtain the initial data to be cleaned. The initial data to be cleaned is then input into the fine-grained binary classifier, which performs class prediction on each piece of picture data. Once the fine-grained binary classifier has been trained, all the picture data to be cleaned is input into it, class prediction is performed on each piece, and the confidence corresponding to the predicted class is obtained. Each confidence represents the probability of the predicted class for that piece of picture data: the higher the confidence, the more likely the picture matches the predicted class, that is, the more similar the picture is to the correct class. Because the confidence of the class prediction is calculated in real time for the acquired picture data to be cleaned, the picture data can be cleaned according to that confidence and more accurate picture data can be obtained.
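A common way to obtain such a confidence is to normalize the classifier's two raw output scores with a softmax and take the larger probability. The sketch below assumes that setup; the source only states that the confidence represents the probability of the predicted class, so the softmax, the score order `[negative, positive]`, and the function name are illustrative assumptions.

```python
import math
from typing import Sequence, Tuple


def predict_with_confidence(scores: Sequence[float]) -> Tuple[int, float]:
    """Return (predicted_class, confidence) from raw scores, e.g. [negative, positive]."""
    # Softmax with the maximum subtracted for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    predicted = max(range(len(probs)), key=probs.__getitem__)
    return predicted, probs[predicted]


# Hypothetical raw scores for one picture to be cleaned.
label, confidence = predict_with_confidence([-2.0, 3.0])
```

With two classes the confidence always lies between 0.5 and 1, so thresholds such as 0.99 select only pictures the classifier is very sure about.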
The sample data is cleaned according to a preset confidence interval and the confidence of the class prediction for each piece of picture data to be cleaned. Specifically, a confidence interval is set, and each piece of picture data to be cleaned is placed into the corresponding confidence set according to the confidence of its class prediction and the confidence interval. The picture data in the confidence set whose confidence level reaches a preset level is taken as positive sample picture data. Because each piece of picture data to be cleaned is placed into a confidence set according to its confidence, the picture data in each set can be conveniently screened afterwards to obtain the required sample picture data. The confidence interval is adjustable. For example, a high confidence level of 0.99 is set, and pictures in the picture data to be cleaned whose confidence is higher than 0.99 are taken as positive sample picture data. Pictures below 0.99 are taken as negative sample picture data and passed to manual cleaning. With a high confidence setting, less positive sample picture data is obtained, and less negative sample picture data is included. When the pictures to be cleaned do not yield at least a certain proportion of pictures in the high confidence interval, the preset high confidence level is lowered, for example to 0.95, and positive and negative sample picture data are obtained in the same way, and so on. Picture data whose confidence reaches the preset confidence interval are pictures similar to the true category, which are the pictures the user requires.
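The threshold-lowering procedure can be sketched as follows. The levels 0.99 and 0.95 come from the example above; the required proportion `min_positive_ratio` and the function name are assumptions, since the source only speaks of "a certain proportion".

```python
from typing import List, Optional, Sequence, Tuple


def clean_by_confidence(confidences: Sequence[float],
                        thresholds: Sequence[float] = (0.99, 0.95),
                        min_positive_ratio: float = 0.5
                        ) -> Tuple[Optional[float], List[int], List[int]]:
    """Try thresholds from strictest to loosest.

    Returns (threshold used, indices kept as positive samples,
    indices left for manual cleaning).
    """
    chosen: Optional[float] = None
    positives: List[int] = []
    for threshold in thresholds:
        chosen = threshold
        positives = [i for i, c in enumerate(confidences) if c > threshold]
        # Stop lowering once enough pictures clear the threshold.
        if len(positives) >= min_positive_ratio * len(confidences):
            break
    kept = set(positives)
    manual_review = [i for i in range(len(confidences)) if i not in kept]
    return chosen, positives, manual_review


# Hypothetical confidences for six pictures to be cleaned.
confidences = [0.999, 0.97, 0.96, 0.50, 0.995, 0.20]
threshold, positive_idx, review_idx = clean_by_confidence(confidences)
```

In this run only two of six pictures exceed 0.99, which is below the assumed proportion, so the threshold drops to 0.95 and four pictures are kept.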
As shown in FIG. 2, an embodiment of the present invention provides a sample data cleaning system, the system comprising:
the clustering module 20, configured to provide a test picture set, cluster the test picture set with a cluster analysis algorithm, and obtain a positive sample test picture set and a negative sample test picture set;
the training module 21, configured to train a fine-grained binary classifier on the positive sample test picture set and the negative sample test picture set;
the classification module 22, configured to perform class prediction on the picture data to be cleaned with the fine-grained binary classifier, and obtain the confidence of the class prediction for each piece of picture data to be cleaned;
and the cleaning module 23, configured to clean the sample data according to a preset confidence interval and the confidence of the class prediction for each piece of picture data to be cleaned.
The clustering module provides a test picture set and clusters it with a cluster analysis algorithm to obtain a positive sample test picture set and a negative sample test picture set. Specifically, the clustering module further comprises an acquisition unit and a coarse-grained binary classifier unit. The acquisition unit acquires the initial test picture set by using a web crawler. The coarse-grained binary classifier unit trains the initial test picture set according to a preset coarse-grained binary classifier to obtain the test picture set. Initially classifying the massive picture data with the coarse-grained binary classifier provides accurate sample picture data for the training of the subsequent fine-grained classifier. The cluster analysis algorithm is a K-means algorithm: the test picture set is clustered with the K-means algorithm to obtain a positive sample test picture set and a negative sample test picture set. k typical pictures are manually selected from the test picture set, and k classes are generated for the test picture set. The pictures closest to each typical picture are taken out to form the positive sample test picture set, and the pictures farthest from each typical picture are taken out to form the negative sample set.
The training module trains a fine-grained binary classifier on the positive sample test picture set and the negative sample test picture set.
The classification module performs class prediction on the picture data to be cleaned with the fine-grained binary classifier and obtains the confidence of the class prediction for each piece of picture data to be cleaned. Once the fine-grained binary classifier has been trained, all the picture data to be cleaned is input into it, class prediction is performed on each piece, and the confidence corresponding to the predicted class is obtained. Each confidence represents the probability of the predicted class for that piece of picture data: the higher the confidence, the more likely the picture matches the predicted class, that is, the more similar the picture is to the correct class.
The cleaning module cleans the sample data according to a preset confidence interval and the confidence of the class prediction for each piece of picture data to be cleaned. The cleaning module comprises a setting unit, a statistical unit, and a positive sample unit. The setting unit sets a confidence interval. The statistical unit places each piece of picture data to be cleaned into the corresponding confidence set according to the confidence of its class prediction and the confidence interval. The positive sample unit takes the picture data in the confidence set whose confidence level reaches a preset level as positive sample picture data. Picture data whose confidence reaches the preset confidence interval are pictures similar to the true category, which are the pictures the user requires.
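How the four modules fit together can be sketched as a small composition. Everything below is a hypothetical skeleton: the class name, the callable signatures, and the toy stand-ins for clustering and training are assumptions made so that the data flow between modules 20 to 23 can be run end to end; none of it is taken from the source.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Picture = Sequence[float]
Classifier = Callable[[Picture], float]  # confidence that a picture is a positive sample


@dataclass
class SampleCleaningSystem:
    cluster: Callable[[List[Picture]], Tuple[List[Picture], List[Picture]]]  # module 20
    train: Callable[[List[Picture], List[Picture]], Classifier]              # module 21
    threshold: float = 0.99                                                  # used by module 23

    def run(self, test_set: List[Picture], to_clean: List[Picture]) -> List[Picture]:
        positives, negatives = self.cluster(test_set)
        classifier = self.train(positives, negatives)
        confidences = [classifier(p) for p in to_clean]  # module 22
        # Module 23: keep pictures whose confidence exceeds the threshold.
        return [p for p, c in zip(to_clean, confidences) if c > self.threshold]


def toy_cluster(test_set: List[Picture]) -> Tuple[List[Picture], List[Picture]]:
    """Toy stand-in: lower half by first feature is positive, upper half negative."""
    ordered = sorted(test_set, key=lambda p: p[0])
    half = len(ordered) // 2
    return ordered[:half], ordered[-half:]


def toy_train(positives: List[Picture], negatives: List[Picture]) -> Classifier:
    """Toy stand-in: confidence grows as a picture nears the positive mean."""
    pos_mean = sum(p[0] for p in positives) / len(positives)
    neg_mean = sum(p[0] for p in negatives) / len(negatives)

    def classifier(picture: Picture) -> float:
        d_pos = abs(picture[0] - pos_mean)
        d_neg = abs(picture[0] - neg_mean)
        return d_neg / (d_pos + d_neg) if d_pos + d_neg else 0.5

    return classifier


system = SampleCleaningSystem(cluster=toy_cluster, train=toy_train)
kept = system.run(test_set=[[0.0], [1.0], [9.0], [10.0]],
                  to_clean=[[0.5], [5.0], [9.0]])
```

Injecting the clustering and training steps as callables mirrors the modular structure of FIG. 2 and lets each module be replaced independently.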
By the technical scheme, the manual workload can be greatly reduced, errors in data screening caused by subjectivity of manual screening are reduced, and the robustness of the neural network model is improved.
While the invention has been described in detail in the foregoing with reference to the drawings and examples, such illustration and description are to be considered illustrative or exemplary and not restrictive. The invention is not limited to the disclosed embodiments. In the claims, the word "comprising" does not exclude other elements or steps, and the word "a" or "an" or "a particular plurality" should be understood to mean at least one or at least a particular plurality. Any reference signs in the claims shall not be construed as limiting the scope. Other variations to the above-described embodiments can be understood and effected by those skilled in the art without inventive faculty, from a study of the drawings, the description and the appended claims, which will still fall within the scope of the invention as claimed.

Claims (10)

1. A sample data cleaning method, characterized in that the method comprises:
S1, providing a test picture set, clustering the test picture set according to a clustering analysis algorithm, and acquiring a positive sample test picture set and a negative sample test picture set;
S2, training to obtain a fine-grained binary classifier according to the positive sample test picture set and the negative sample test picture set;
S3, performing class prediction on the picture data to be cleaned according to the fine-grained binary classifier, and acquiring the confidence of the class prediction of each piece of picture data to be cleaned;
and S4, cleaning the sample data according to a preset confidence interval and the confidence of the class prediction of each piece of picture data to be cleaned.
2. The sample data cleaning method according to claim 1, wherein the step of acquiring the test image set comprises:
acquiring an initial test picture set by using a web crawler;
and training the initial test picture set according to a preset coarse-grained secondary classifier to obtain the test picture set.
3. The sample data cleaning method according to claim 1, wherein, in step S1, the clustering analysis algorithm is a K-means algorithm.
4. The sample data cleaning method according to claim 3, wherein said step S1 further includes:
S10, dividing the picture data set into k types, and selecting k typical pictures from the test picture set as the initial clustering centers of the respective types;
S11, calculating the distance between each picture in the test picture set and the initial clustering center of each type, forming an initial clustering center value according to the minimum distance, and finishing one iteration;
S12, repeating the iterative process of step S11 until the calculated clustering center value is equal to the previous center value, thereby obtaining the clustering center of each type;
S13, calculating the distance between each picture and the clustering center of each type, forming the positive sample test picture set from the pictures with the closest distance and the negative sample test picture set from the pictures with the farthest distance, wherein the number of pictures in the positive sample test picture set is the same as the number of pictures in the negative sample test picture set.
5. The sample data cleaning method according to claim 1, wherein said step S3 further includes: training the data to be cleaned according to a preset coarse-grained secondary classifier to obtain initial data to be cleaned.
6. The method for cleaning sample data according to claim 1, wherein said step S4 includes:
setting a confidence interval;
and classifying the picture data to be cleaned into a corresponding confidence set according to the confidence interval and the confidence of the category prediction for each piece of picture data to be cleaned.
7. The sample data cleaning method according to claim 6, wherein said step S4 further includes:
acquiring, as positive sample picture data, the picture data in the confidence sets whose confidence level reaches a preset level.
8. A sample data cleaning system, the system comprising:
the clustering module is used for providing a test picture set, clustering the test picture set according to a clustering analysis algorithm and acquiring a positive sample test picture set and a negative sample test picture set;
the training module is used for training a fine-grained secondary classifier on the positive sample test picture set and the negative sample test picture set;
the classification module is used for performing category prediction on the picture data to be cleaned with the fine-grained secondary classifier, and obtaining the confidence of the category prediction for each piece of picture data to be cleaned;
and the cleaning module is used for cleaning the sample data according to a preset confidence interval and the confidence of the class prediction of each piece of picture data to be cleaned.
9. The sample data cleaning system of claim 8, wherein the clustering module further comprises:
the acquisition unit is used for acquiring an initial test picture set by using a web crawler;
and the coarse-grained secondary classifier unit is used for training the initial test picture set according to a preset coarse-grained secondary classifier to obtain the test picture set.
10. The sample data cleaning system of claim 8, wherein the cleaning module comprises:
the setting unit is used for setting a confidence interval;
the statistical unit is used for classifying the picture data to be cleaned into a corresponding confidence set according to the confidence interval and the confidence of the category prediction for each piece of picture data to be cleaned;
and the positive sample unit is used for acquiring, as the positive sample picture data, the picture data in the confidence sets whose confidence level reaches a preset level.
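Steps S10 to S13 of claim 4 can be sketched in plain Python as follows. This is a hedged illustration under stated assumptions, not the claimed implementation: each picture is assumed to be already represented by a numeric feature vector, distance is assumed to be Euclidean, and S13 is read as "rank every picture by the distance to its nearest clustering center, then take the `n` closest as positives and the `n` farthest as negatives". The names `euclidean`, `kmeans` and `split_samples` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, init_centers, max_iter=100):
    """S10-S12: start from the chosen centers and iterate until they stop moving."""
    centers = [list(c) for c in init_centers]
    for _ in range(max_iter):
        # S11: assign every picture to its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda j: euclidean(p, centers[j]))
            groups[nearest].append(p)
        # Recompute each center as the mean of its group (unchanged if the group is empty).
        new_centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centers)
        ]
        # S12: stop once the new centers equal the previous ones.
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def split_samples(points, centers, n):
    """S13 (one reading): n closest pictures are positives, n farthest are negatives."""
    n = min(n, len(points) // 2)  # keep the two sets equal in size and disjoint
    ranked = sorted(
        range(len(points)),
        key=lambda i: min(euclidean(points[i], c) for c in centers),
    )
    return ranked[:n], ranked[len(ranked) - n:]


# Hypothetical 2-D feature vectors for six test pictures, with k = 2.
features = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [5, 5]]
centers = kmeans(features, init_centers=[features[0], features[3]])
positive_idx, negative_idx = split_samples(features, centers, n=2)
print(centers, positive_idx, negative_idx)
```

The two returned index lists always have the same length, which mirrors the requirement that the positive and negative sample test picture sets contain the same number of pictures.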
CN201910239563.9A 2019-03-27 2019-03-27 Sample data cleaning method and system Pending CN111652257A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910239563.9A CN111652257A (en) 2019-03-27 2019-03-27 Sample data cleaning method and system

Publications (1)

Publication Number Publication Date
CN111652257A (en) 2020-09-11

Family

ID=72344378

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910239563.9A Pending CN111652257A (en) 2019-03-27 2019-03-27 Sample data cleaning method and system

Country Status (1)

Country Link
CN (1) CN111652257A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112348107A (en) * 2020-11-17 2021-02-09 百度(中国)有限公司 Image data cleaning method and apparatus, electronic device, and medium
CN112906488A (en) * 2021-01-26 2021-06-04 广东电网有限责任公司 Security protection video quality evaluation system based on artificial intelligence
CN113158889A (en) * 2021-04-15 2021-07-23 上海芯翌智能科技有限公司 Data cleaning and training method and device, computer readable storage medium and terminal
CN113822130A (en) * 2021-07-05 2021-12-21 腾讯科技(深圳)有限公司 Model training method, scene recognition method, computing device, and medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104239896A (en) * 2014-09-04 2014-12-24 四川省绵阳西南自动化研究所 Method for classifying crowd density degrees in video image
CN106096561A (en) * 2016-06-16 2016-11-09 重庆邮电大学 Infrared pedestrian detection method based on image block degree of depth learning characteristic
CN108874900A (en) * 2018-05-24 2018-11-23 四川斐讯信息技术有限公司 A kind of acquisition methods and system of samples pictures data acquisition system

Similar Documents

Publication Publication Date Title
CN111652257A (en) Sample data cleaning method and system
CN110334765B (en) Remote sensing image classification method based on attention mechanism multi-scale deep learning
CN110210486B (en) Sketch annotation information-based generation countermeasure transfer learning method
CN103559504B (en) Image target category identification method and device
CN107633255A (en) A kind of rock lithology automatic recognition classification method under deep learning pattern
CN110929679B (en) GAN-based unsupervised self-adaptive pedestrian re-identification method
CN109919252B (en) Method for generating classifier by using few labeled images
WO2022062419A1 (en) Target re-identification method and system based on non-supervised pyramid similarity learning
CN111914902B (en) Traditional Chinese medicine identification and surface defect detection method based on deep neural network
CN109598307B (en) Data screening method and device, server and storage medium
CN111488911B (en) Image entity extraction method based on Mask R-CNN and GAN
CN110110845B (en) Learning method based on parallel multi-level width neural network
CN107392251B (en) Method for improving target detection network performance by using classified pictures
CN110688888B (en) Pedestrian attribute identification method and system based on deep learning
CN113221956B (en) Target identification method and device based on improved multi-scale depth model
CN112668698A (en) Neural network training method and system
CN112766218A (en) Cross-domain pedestrian re-identification method and device based on asymmetric joint teaching network
CN111371611B (en) Weighted network community discovery method and device based on deep learning
CN111652259B (en) Method and system for cleaning data
CN109829887B (en) Image quality evaluation method based on deep neural network
CN111652264A (en) Negative migration sample screening method based on maximum mean difference
CN116665039A (en) Small sample target identification method based on two-stage causal intervention
CN111160077A (en) Large-scale dynamic face clustering method
CN115063630A (en) Application of decoupling migration-based federated learning method in computer vision
CN110427973B (en) Classification method for ambiguity-oriented annotation samples

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination