CN117636042A - Data rechecking method, device and computer readable storage medium - Google Patents

Data rechecking method, device and computer readable storage medium

Info

Publication number
CN117636042A
Authority
CN
China
Prior art keywords
data
subset
convex hull
images
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311656863.XA
Other languages
Chinese (zh)
Inventor
陈志强
张丽
李强
孙运达
傅罡
陈祥凤
张进
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Nuctech Co Ltd
Original Assignee
Tsinghua University
Nuctech Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University and Nuctech Co Ltd
Priority to CN202311656863.XA
Publication of CN117636042A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The present disclosure relates to a data review method, apparatus, and computer-readable storage medium, and relates to the field of data processing. The data rechecking method comprises the following steps: obtaining a data set to be rechecked, wherein the data set to be rechecked comprises: a plurality of images and labeling information thereof; dividing a data set to be rechecked into a plurality of subsets according to the characteristic data of each image; determining convex hulls of each subset and data on the convex hulls according to the characteristic data of the images in each subset; and determining whether the labeling information of the plurality of images is accurate according to the data on the convex hull of each subset. The data rechecking method can effectively improve the efficiency of finding error marked data.

Description

Data rechecking method, device and computer readable storage medium
Technical Field
The present disclosure relates to the field of data processing, and in particular, to a data review method, apparatus, and computer readable storage medium.
Background
Data annotation is a cornerstone of the field of artificial intelligence. Artificial intelligence enables various applications, such as license plate recognition and face recognition, by learning from annotated data. The accuracy of the annotated data therefore directly affects the accuracy of artificial intelligence applications.
At present, annotated data is reviewed mainly by manual inspection, which is time-consuming, labor-intensive, and inefficient.
Disclosure of Invention
One technical problem to be solved by the present disclosure is how to improve the efficiency of finding incorrectly labeled data.
According to some embodiments of the present disclosure, there is provided a data review method including:
obtaining a data set to be rechecked, wherein the data set to be rechecked comprises: a plurality of images and labeling information thereof; dividing a data set to be rechecked into a plurality of subsets according to the characteristic data of each image; determining convex hulls of each subset and data on the convex hulls according to the characteristic data of the images in each subset; and determining whether the labeling information of the plurality of images is accurate according to the data on the convex hull of each subset.
In some embodiments, determining whether the annotation information for the plurality of images is accurate based on the data on the convex hull for each subset comprises: determining an anomaly degree of the data on the convex hull of each subset; selecting review data from the data on the convex hull of each subset according to the anomaly degree; and determining whether the labeling information of the plurality of images is accurate according to the review data.
In some embodiments, determining whether the annotation information for the plurality of images is accurate based on the review data comprises: in the case that no error data exists in the review data, determining that the labeling information of the plurality of images is accurate.
In some embodiments, determining whether the annotation information for the plurality of images is accurate based on the review data comprises: in the case that error data appears in the review data, re-determining the subset corresponding to the error data as the data set to be rechecked, and re-executing the steps of dividing the data set to be rechecked into a plurality of subsets according to the feature data of each image, determining the convex hull of each subset and the data on the convex hull according to the feature data of the images in each subset, and determining whether the labeling information of the plurality of images is accurate according to the data on the convex hull of each subset, until the labeling information of the plurality of images is determined to be accurate.
In some embodiments, determining the anomaly degree of the data on the convex hull of each subset comprises: determining the anomaly degree of the data on the convex hull of each subset according to at least one of: the distance between the data on the convex hull and the center of the subset to which the data belongs, the confidence that the data on the convex hull belongs to that subset, and the outlier degree of the data on the convex hull.
In some embodiments, the greater the distance of the data on the convex hull of each subset from the center of the subset to which it belongs, the greater the anomaly degree; the greater the confidence that the data on the convex hull belongs to its subset, the smaller the anomaly degree; the greater the outlier degree of the data on the convex hull, the greater the anomaly degree.
In some embodiments, selecting review data from the data on the convex hull for each subset based on the anomaly degree comprises: for the data on the convex hull of each subset, sorting the data on the convex hull in descending order of anomaly degree; and selecting a preset number of the top-ranked data as the review data of the subset.
In some embodiments, selecting review data from the data on the convex hull for each subset based on the anomaly degree comprises: for the data on the convex hull of each subset, selecting data with an anomaly degree higher than a preset threshold as the review data of the subset.
In some embodiments, dividing the data set to be reviewed into a plurality of subsets according to the feature data of each image comprises: determining, for the feature data of each image and each subset, a probability that the feature data belongs to the subset by each classifier of the plurality of classifiers; averaging the probabilities of the feature data belonging to the subsets determined by each classifier to obtain the average probability of the feature data belonging to the subsets; and obtaining the subset corresponding to the characteristic data according to the average probability that the characteristic data belongs to each subset.
In some embodiments, obtaining the subset corresponding to the feature data according to the average probability that the feature data belongs to each subset includes: and determining the subset corresponding to the maximum average probability as the subset corresponding to the characteristic data.
In some embodiments, the data review method further comprises: for each image, acquiring intermediate feature data of the image by each of a plurality of feature extractors; and fusing the intermediate feature data acquired by each feature extractor to obtain feature data of the image.
According to further embodiments of the present disclosure, there is provided a data review device including: the acquisition module is configured to acquire a data set to be rechecked, wherein the data set to be rechecked comprises: a plurality of images and labeling information thereof; the dividing module is configured to divide the data set to be rechecked into a plurality of subsets according to the characteristic data of each image; a first determining module configured to determine a convex hull of each subset and data on the convex hull according to the feature data of the image in each subset; and the second determining module is configured to determine whether the labeling information of the plurality of images is accurate according to the data on the convex hull of each subset.
According to still further embodiments of the present disclosure, there is provided a data review device including: a processor; and a memory coupled to the processor for storing instructions that, when executed by the processor, cause the processor to perform the data review method as described above.
According to still further embodiments of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the data review method as described above.
Aiming at the problem of low verification efficiency of the marking data, the method divides the data set to be verified into a plurality of subsets, determines the convex hull of each subset and the data on the convex hull, and determines the accuracy of marking information of a plurality of images in the data set to be verified according to the accuracy of the data on the convex hull of the subset. Because the data on the convex hull only occupies a small part of the data to be checked, the check workload is greatly reduced, and the efficiency of finding the error marked data is effectively improved.
Other features of the present disclosure and its advantages will become apparent from the following detailed description of exemplary embodiments of the disclosure, which proceeds with reference to the accompanying drawings.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present disclosure, and other drawings may be obtained according to these drawings without inventive effort to a person of ordinary skill in the art.
Fig. 1A illustrates a schematic structure diagram of a convex hull of a two-dimensional planar set in accordance with some embodiments of the present disclosure.
Fig. 1B illustrates a schematic structure diagram of a convex hull of a two-dimensional planar set in accordance with further embodiments of the present disclosure.
Fig. 2 illustrates a flow diagram of a data review method of some embodiments of the present disclosure.
Fig. 3 shows a flow diagram of a data review method of further embodiments of the present disclosure.
Fig. 4 illustrates a schematic structural diagram of a data review device of some embodiments of the present disclosure.
Fig. 5 shows a schematic structural diagram of a data review device of other embodiments of the present disclosure.
Fig. 6 shows a schematic structural diagram of a data review device of further embodiments of the present disclosure.
Detailed Description
The following description of the technical solutions in the embodiments of the present disclosure will be made clearly and completely with reference to the accompanying drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are only some embodiments of the present disclosure, not all embodiments. The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses. Based on the embodiments in this disclosure, all other embodiments that a person of ordinary skill in the art would obtain without making any inventive effort are within the scope of protection of this disclosure.
Because patterns vary continuously in feature space, erroneous samples tend to appear in the outer layers of the sample distribution in feature space. Therefore, to address the low efficiency of reviewing annotated data, the inventors found through research that the convex hull of the data set to be reviewed can be used to reduce the amount of data that must be reviewed, which effectively improves the efficiency of finding incorrectly labeled data.
Convex hulls are a concept in graphics. For a set, the intersection of all convex sets that contain the set is referred to as the convex hull of the set. Fig. 1A illustrates a schematic structure diagram of a convex hull of a two-dimensional planar set in accordance with some embodiments of the present disclosure. As shown in fig. 1A, for a set on a two-dimensional plane, its convex hull can be understood as a convex polygon formed by connecting elements of the outermost layer.
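The definition above can also be written formally. The following is a standard statement of it, where the symbols S (a set of feature vectors in d-dimensional space), x_i and λ_i are notation introduced here for illustration and do not appear in the original text:

```latex
\operatorname{conv}(S)
  = \bigcap_{\substack{C \supseteq S \\ C \text{ convex}}} C
  = \left\{ \sum_{i=1}^{k} \lambda_i x_i \;\middle|\;
      k \ge 1,\ x_i \in S,\ \lambda_i \ge 0,\ \sum_{i=1}^{k} \lambda_i = 1 \right\},
  \qquad S \subseteq \mathbb{R}^{d}.
```

The "data on the convex hull" referred to in this disclosure can then be read as the elements of S that lie on the boundary of conv(S).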
Therefore, after the convex hull of the data set to be rechecked and the data on the convex hull are determined, the accuracy of the data set to be rechecked can be determined according to the accuracy of the data on the convex hull. Meanwhile, in order to further improve the efficiency and accuracy of rechecking, the data set to be rechecked can be divided into a plurality of subsets, and the accuracy of all data in each subset is determined according to the data on the convex hull of each subset. Multiple rechecks are performed by utilizing multiple subsets, and a small amount of data is rechecked each time, so that the rechecking efficiency and accuracy are ensured simultaneously.
Fig. 1B shows a schematic structure diagram of a convex hull of a two-dimensional planar set in accordance with further embodiments of the present disclosure. Fig. 1B shows the convex hulls of the two subsets obtained by dividing the set of Fig. 1A. If the data on the convex hull of the set in Fig. 1A include erroneous data, the set is considered to need review. The set is then divided into a plurality of subsets (for example, the two subsets in Fig. 1B), the subset to which the erroneous data belongs is redetermined as the set to be reviewed, and the above operation is performed again. For a subset in which all data on the convex hull are correct, all data in the subset are considered correct, and the review of that subset is complete.
Fig. 2 illustrates a flow diagram of a data review method of some embodiments of the present disclosure. As shown in Fig. 2, the method of this embodiment includes steps S202 to S208.
In step S202, a data set to be rechecked is acquired, where the data set to be rechecked includes: a plurality of images and annotation information thereof.
The data set to be reviewed may refer to a data set whose labeling accuracy needs to be determined in any case.
In some embodiments, the data set to be reviewed may refer to a data set that has been labeled but has not yet been used by a data model in the artificial intelligence field, i.e., the data review method of the present disclosure may be used to review data that has only been labeled.
In some embodiments, the data set to be rechecked may also refer to a data set already used in the process of updating and iterating the data model, that is, by rechecking the data set by the data rechecking method disclosed by the disclosure, a more accurate data set can be obtained, so that the result of updating and iterating the data model is more accurate, and the training efficiency of the data model can be improved, thereby improving the application effect of the data model.
In some embodiments, the data review method of the present disclosure is not limited to determining whether the labeling information in the data set to be reviewed is accurate; it may also be used to find missed annotations in the data set to be reviewed. Candidate data outside the manual annotations that may have been missed are retrieved or detected by an algorithm. Most of these candidates are noise, and only a small number are actual missed annotations. The noise can therefore be regarded as "labeled" data and the potential missed annotations as "wrongly labeled" data, so that the missed annotations can be found. That is, the error data in the data set to be reviewed may refer to data whose labeling information is wrong, or to data whose labeling information is missing.
In step S204, the data set to be reviewed is divided into a plurality of subsets according to the feature data of each image.
The feature data of each image in the data set to be reviewed is acquired using a feature extractor, and the data set is then divided into a plurality of subsets according to the feature data. The number of subsets can be set according to actual conditions: the more subsets there are, the more data is reviewed in each round, but the fewer rounds of review are needed. For example, the data set may be divided into 10 subsets.
In some embodiments, the feature extractor may be a publicly known pre-trained model used without further training. It may be a self-supervised model (e.g., the momentum contrast (MoCo) model), a generative model (e.g., a Markov model), a classification model (e.g., a decision tree model), or a detection model (e.g., a two-stage detection model), etc.
In some embodiments, the publicly known pre-trained model may be further trained using the data set to be reviewed, and feature extraction is performed using the trained model, thereby obtaining more accurate feature data.
In some embodiments, the feature data of the image may be acquired by a plurality of different feature extractors, and the feature data acquired by the plurality of feature extractors may be fused, thereby obtaining more accurate feature data.
In some embodiments, for each image, obtaining intermediate feature data of the image by each of a plurality of feature extractors; and fusing the intermediate feature data acquired by each feature extractor to obtain feature data of the image.
The intermediate feature data extracted by the plurality of feature extractors are fused to obtain feature data, so that the feature data can fully combine the advantages of the plurality of feature extractors, and the accuracy of the feature data for representing the image is improved.
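The fusion step can be sketched as follows. The disclosure does not say how the intermediate feature data are fused, so this sketch assumes one simple option: each extractor's output is L2-normalized and the results are concatenated. The names `l2_normalize` and `fuse_features`, and the representation of an extractor as a plain callable, are hypothetical and chosen only for illustration.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def l2_normalize(vec: Sequence[float]) -> Vector:
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        return [0.0 for _ in vec]
    return [v / norm for v in vec]


def fuse_features(
    image: object,
    extractors: Sequence[Callable[[object], Sequence[float]]],
) -> Vector:
    """Assumed fusion rule: concatenate the normalized output of every extractor."""
    fused: Vector = []
    for extract in extractors:
        fused.extend(l2_normalize(extract(image)))
    return fused
```

Normalizing before concatenation keeps one extractor with large-magnitude outputs from dominating the distances used later for clustering; other fusion rules (weighted sums, learned projections) would fit the same description.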
Although the subsequent step divides the data set to be reviewed into a plurality of subsets based on the feature data, the choice of feature extractor and extraction method does not affect the validity of the data review method of the present disclosure. The more accurate the feature data is, the smaller the number of interactions required by the data review method of the present disclosure will be.
After the feature data of each image in the data set to be rechecked is acquired, the feature data set of the data set to be rechecked is acquired. The feature data set is partitioned into a plurality of subsets using a classifier.
In some embodiments, the feature dataset is partitioned into a plurality of subsets using a known clustering algorithm (e.g., a K-means algorithm, a mean shift algorithm, etc.) as a classifier.
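As an illustration of dividing the feature data set with K-means, here is a minimal pure-Python sketch. It assumes Euclidean distance and, to stay deterministic, takes the first k points as initial centers; both choices, and the function name `kmeans`, are assumptions made for this example rather than details from the disclosure.

```python
import math
from typing import List, Optional, Sequence, Tuple


def kmeans(
    points: Sequence[Sequence[float]], k: int, iterations: int = 50
) -> Tuple[List[int], List[List[float]]]:
    """Return (subset index per point, subset centers)."""
    # Assumption: the first k points serve as the initial centers.
    centers = [list(p) for p in points[:k]]
    labels: Optional[List[int]] = None
    for _ in range(iterations):
        # Assignment step: each point joins the subset with the nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
    return (labels if labels is not None else []), centers
```

For example, `kmeans([(0, 0), (10, 10), (0, 1), (10, 11)], k=2)` places the first and third points in one subset and the second and fourth in the other.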
In some embodiments, dividing the data set to be reviewed into a plurality of subsets according to the feature data of each image comprises: determining, for the feature data of each image and each subset, a probability that the feature data belongs to the subset by each classifier of the plurality of classifiers; averaging the probabilities of the feature data belonging to the subsets determined by each classifier to obtain the average probability of the feature data belonging to the subsets; and obtaining the subset corresponding to the characteristic data according to the average probability that the characteristic data belongs to each subset.
Classifying the feature data set with a plurality of classifiers and fusing their classification results makes the classification result more accurate.
In some embodiments, obtaining the subset corresponding to the feature data according to the average probability that the feature data belongs to each subset includes: determining the subset corresponding to the maximum average probability as the subset corresponding to the feature data.
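The averaging and maximum-probability rule described above can be sketched as follows. The function name `assign_subset` and the input layout (one probability list per classifier, indexed by subset) are hypothetical.

```python
from typing import List, Sequence, Tuple


def assign_subset(
    per_classifier_probs: Sequence[Sequence[float]],
) -> Tuple[int, List[float]]:
    """Average the per-subset probabilities over all classifiers and pick the
    subset with the largest average.

    per_classifier_probs[i][j] is the probability, according to classifier i,
    that the feature data belongs to subset j.
    """
    n_classifiers = len(per_classifier_probs)
    n_subsets = len(per_classifier_probs[0])
    average = [
        sum(probs[j] for probs in per_classifier_probs) / n_classifiers
        for j in range(n_subsets)
    ]
    best = max(range(n_subsets), key=lambda j: average[j])
    return best, average
```

With two classifiers reporting `[0.6, 0.3, 0.1]` and `[0.2, 0.7, 0.1]`, the averages are roughly 0.4, 0.5 and 0.1, so the feature data is assigned to the subset with index 1.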
In some embodiments, obtaining the subset corresponding to the feature data according to the average probability that the feature data belongs to each subset includes: selecting the subsets whose average probability is greater than a preset threshold, and determining the subset corresponding to the feature data from among them in combination with historical data analysis.
In step S206, the convex hull of each subset and the data on the convex hull are determined according to the feature data of the image in each subset.
The convex hull of each subset and the data on the convex hull are determined using a known convex hull algorithm (e.g., the gift-wrapping (Jarvis march) algorithm, the Graham scan algorithm, or Andrew's algorithm), and the accuracy of the data in each subset is then judged from the small amount of data on its convex hull. The amount of data to be reviewed is thereby greatly reduced, which improves the efficiency of finding incorrectly labeled data.
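As an illustration of one of the algorithms named above, the following is a sketch of Andrew's monotone chain algorithm for points in the plane, matching the two-dimensional setting of Fig. 1A. Real feature data is usually higher-dimensional and would need a different algorithm; the restriction to 2D and the names `cross` and `convex_hull` are assumptions of this example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: Sequence[Point]) -> List[Point]:
    """Return the hull vertices in counter-clockwise order.

    Points lying strictly inside the hull, or in the middle of a hull edge,
    are not returned.
    """
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    lower: List[Point] = []
    for p in pts:
        # Drop the last vertex while it would make a clockwise or straight turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each chain is the first point of the other chain.
    return lower[:-1] + upper[:-1]
```

For the six points `(0,0), (2,0), (2,2), (0,2), (1,1), (1,0)`, only the four corners are returned, so a reviewer would inspect four items instead of six.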
In step S208, it is determined whether the labeling information of the plurality of images is accurate according to the data on the convex hull of each subset.
In some embodiments, determining whether the annotation information for the plurality of images is accurate based on the data on the convex hull for each subset comprises: determining an anomaly degree of the data on the convex hull of each subset; selecting review data from the data on the convex hull of each subset according to the anomaly degree; and determining whether the labeling information of the plurality of images is accurate according to the review data.
The anomaly degree of the data on the convex hull may be measured by a variety of indicators. In some embodiments, determining the anomaly degree of the data on the convex hull of each subset includes: determining the anomaly degree of the data on the convex hull of each subset according to at least one of: the distance between the data on the convex hull and the center of the subset to which the data belongs, the confidence that the data on the convex hull belongs to that subset, and the outlier degree of the data on the convex hull.
In some embodiments, the greater the distance of the data on the convex hull of each subset from the center of the subset to which it belongs, the greater the anomaly degree; the greater the confidence that the data on the convex hull belongs to its subset, the smaller the anomaly degree; the greater the outlier degree of the data on the convex hull, the greater the anomaly degree.
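The disclosure states only the direction in which each indicator moves the anomaly degree, not a formula. One possible scoring rule that respects those directions is a weighted sum, sketched below. The function name `anomaly_degree`, the weights, and the assumption that confidence lies in [0, 1] are all hypothetical.

```python
import math
from typing import Sequence, Tuple


def anomaly_degree(
    point: Sequence[float],
    center: Sequence[float],
    confidence: float,
    outlier_degree: float,
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> float:
    """Assumed scoring rule: a weighted sum that grows with distance and
    outlier degree and shrinks as the classifier confidence grows."""
    w_dist, w_conf, w_out = weights
    distance = math.dist(point, center)  # Euclidean distance to the subset center
    return w_dist * distance + w_conf * (1.0 - confidence) + w_out * outlier_degree
```

In practice the three indicators live on different scales, so the weights (or a prior normalization of each indicator) would need to be chosen for the data at hand.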
In some embodiments, when a classifier is used to divide the data set to be rechecked into a plurality of subsets, a classifier capable of outputting a confidence is selected, so that the confidence can serve as an indicator of the anomaly degree of the data. Because the classification result is reused as an anomaly indicator, the computational cost is reduced and the review efficiency is improved.
In some embodiments, the outlier degree of the data on the convex hull is determined by an outlier detection method. For example, the outlier degree of the data on the convex hull can be determined using a one-class support vector machine (One-Class SVM) algorithm.
In some embodiments, selecting review data from the data on the convex hull for each subset based on the anomaly degree comprises: for the data on the convex hull of each subset, sorting the data on the convex hull in descending order of anomaly degree; and selecting a preset number of the top-ranked data as the review data of the subset.
In some embodiments, selecting review data from the data on the convex hull for each subset based on the anomaly degree comprises: for the data on the convex hull of each subset, selecting data with an anomaly degree higher than a preset threshold as the review data of the subset.
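The two selection rules above can be sketched as follows. Hull data is represented here as `(item_id, anomaly_degree)` pairs; that representation and the function names `select_top_k` and `select_above_threshold` are hypothetical.

```python
from typing import Hashable, List, Sequence, Tuple

ScoredItem = Tuple[Hashable, float]  # (item identifier, anomaly degree)


def select_top_k(hull_items: Sequence[ScoredItem], k: int) -> List[Hashable]:
    """Sort hull data by anomaly degree, highest first, and keep the first k."""
    ranked = sorted(hull_items, key=lambda item: item[1], reverse=True)
    return [item_id for item_id, _ in ranked[:k]]


def select_above_threshold(
    hull_items: Sequence[ScoredItem], threshold: float
) -> List[Hashable]:
    """Keep every hull item whose anomaly degree exceeds the threshold."""
    return [item_id for item_id, score in hull_items if score > threshold]
```

The fixed-count rule bounds the reviewer's workload per subset, while the threshold rule lets the workload follow how suspicious the hull data actually looks.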
Data with higher anomaly degree is selected from the data on the convex hull for rechecking, namely, the data with higher possibility of error is checked preferentially, so that the error data can be found out more quickly, the rechecking times and quantity can be reduced, and the rechecking efficiency can be improved.
In some embodiments, determining whether the annotation information for the plurality of images is accurate based on the review data comprises: in the case that no error data exists in the review data, determining that the labeling information of the plurality of images is accurate.
In some embodiments, determining whether the annotation information for the plurality of images is accurate based on the review data comprises: in the case that error data appears in the review data, re-determining the subset corresponding to the error data as the data set to be rechecked, and re-executing the steps of dividing the data set to be rechecked into a plurality of subsets according to the feature data of each image, determining the convex hull of each subset and the data on the convex hull according to the feature data of the images in each subset, and determining whether the labeling information of the plurality of images is accurate according to the data on the convex hull of each subset, until the labeling information of the plurality of images is determined to be accurate.
Since the review data comes from the data on the convex hull of each subset, when error data appears in the review data, the subset to which the error data belongs needs to be reviewed again. That is, that subset is redetermined as the data set to be rechecked, and the above steps are executed again until no more error data is found, i.e., until the labeling information of all the data is accurate.
The error data comes from the review data, and the review data comes from the data on the convex hulls of the subsets. Therefore, in the early stage of the data review method, several data sets to be rechecked may be determined from the subsets to which the error data belong; in later stages the number of data sets to be rechecked tends to decrease, until no further data set to be rechecked is generated.
Therefore, in the early stage, a plurality of data sets to be rechecked need to be reviewed. In some embodiments, these data sets may be pushed for review randomly, in batches, or all at once, or they may be pushed in a predetermined order. For example, the data sets with larger divergence among the current data sets to be rechecked may be pushed first, which raises the probability of finding error data as early as possible. In some embodiments, the average of the distances between each data item in a subset and the center of the subset may be determined, and this average distance is taken as the divergence of the subset.
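The divergence measure and the ordering it induces can be sketched as follows. The sketch assumes that the center of a subset is the mean of its feature vectors and that distance is Euclidean; the disclosure fixes neither. The names `subset_divergence` and `order_by_divergence` are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def subset_divergence(points: Sequence[Sequence[float]]) -> float:
    """Average distance from each point of a subset to the subset center."""
    # Assumption: the center is the coordinate-wise mean of the subset.
    center = [sum(col) / len(points) for col in zip(*points)]
    return sum(math.dist(p, center) for p in points) / len(points)


def order_by_divergence(subsets: Dict[str, Sequence[Sequence[float]]]) -> List[str]:
    """Return subset names ordered from most to least divergent."""
    return sorted(
        subsets, key=lambda name: subset_divergence(subsets[name]), reverse=True
    )
```

A subset whose two points are 10 units apart has divergence 5.0 and is therefore pushed for review before a subset whose two points are 2 units apart (divergence 1.0).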
Fig. 3 shows a flow diagram of a data review method of further embodiments of the present disclosure. As shown in Fig. 3, the data review method includes steps S302 to S314.
In step S302, a data set to be rechecked is acquired.
In step S304, feature extraction is performed on each image in the data set to be checked, and feature data of each image is obtained.
In step S306, feature clustering is performed according to the feature data of each image to divide the data set to be reviewed into a plurality of subsets.
In step S308, the convex hull of each subset and the data on the convex hull are found.
In step S310, the data on the convex hull is sorted based on the anomaly degree, and a preset number of data is selected as the review data, or data with the anomaly degree higher than a preset threshold value is selected as the review data.
In step S312, the review data is reviewed. A subset to which error data in the review data belongs is a problem subset; the problem subset is redetermined as the data set to be rechecked, and the process returns to step S306.
In step S314, the review data is reviewed. A subset whose review data are all correct is a problem-free subset; all data in the problem-free subset are considered correct, and the review of that subset is complete.
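The loop formed by steps S306 to S314 can be sketched as a small driver function. The clustering, convex hull, and manual check are passed in as callables, so the sketch shows only the control flow. It also assumes that error data, once found, is set aside for correction before its subset is re-queued, which guarantees termination; the disclosure does not spell this out. All names (`review_dataset`, `split`, `hull_of`, `is_label_correct`) are hypothetical.

```python
from typing import Callable, Hashable, List, Sequence, Set


def review_dataset(
    dataset: Sequence[Hashable],
    split: Callable[[List[Hashable]], List[List[Hashable]]],
    hull_of: Callable[[List[Hashable]], List[Hashable]],
    is_label_correct: Callable[[Hashable], bool],
) -> Set[Hashable]:
    """Return the items found to be mislabeled.

    split            divides a data set into subsets (cf. S306)
    hull_of          returns the review data on a subset's convex hull (cf. S308-S310)
    is_label_correct stands in for the manual check of one item (cf. S312/S314)
    """
    errors: Set[Hashable] = set()
    pending: List[List[Hashable]] = [list(dataset)]
    while pending:
        current = pending.pop()
        for subset in split(current):
            bad = [x for x in hull_of(subset) if not is_label_correct(x)]
            if not bad:
                continue  # problem-free subset: all of its data is accepted
            errors.update(bad)
            # Assumption: found errors are set aside for correction, and the
            # rest of the problem subset becomes a new data set to be rechecked.
            remaining = [x for x in subset if x not in bad]
            if remaining:
                pending.append(remaining)
    return errors
```

Each re-queued subset is strictly smaller than before, so the loop always ends. As in the method itself, an error that never reaches the hull of a problem subset is not inspected, which is the trade-off accepted in exchange for the reduced workload.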
In the above embodiment, the data set to be checked is divided into a plurality of subsets, the convex hull of each subset and the data on the convex hull are determined, and the accuracy of the labeling information of the plurality of images in the data set to be checked is determined according to the accuracy of the data on the convex hull of the subset. Because the data on the convex hull only occupies a small part of the data to be checked, the check workload is greatly reduced, and the efficiency of finding the error marked data is effectively improved.
The data review method of the present disclosure uses the small amount of data on the convex hull of a set to represent all the data in the set, which significantly reduces the review workload. At the same time, the set corresponding to the error data is redetermined as the data set to be rechecked, and operations such as dividing it into a plurality of subsets and determining the convex hull of each subset and the data on the convex hull are executed again, until no more error data is found. The whole data set to be rechecked is thus divided repeatedly from coarse to fine, so that the review becomes progressively finer and the error data is checked more thoroughly.
Fig. 4 illustrates a schematic structural diagram of a data review device of some embodiments of the present disclosure. As shown in fig. 4, the data review device 40 includes:
an obtaining module 410 configured to obtain a data set to be rechecked, where the data set to be rechecked includes: a plurality of images and labeling information thereof;
a dividing module 420 configured to divide the data set to be reviewed into a plurality of subsets according to the feature data of each image;
a first determining module 430 configured to determine a convex hull for each subset and data on the convex hull according to the feature data of the images in each subset;
a second determining module 440 is configured to determine whether the labeling information of the plurality of images is accurate based on the data on the convex hull of each subset.
In some embodiments, the second determining module 440 is configured to determine the anomaly degree of the data on the convex hull of each subset; select review data from the data on the convex hull of each subset according to the anomaly degree; and determine whether the labeling information of the plurality of images is accurate according to the review data.
In some embodiments, the second determining module 440 is configured to determine that the labeling information of the plurality of images is accurate in the case that no error data exists in the review data.
In some embodiments, the second determining module 440 is configured to, in the case that error data occurs in the review data, re-determine the subset corresponding to the error data as the data set to be rechecked, and re-execute the following operations until the labeling information of the plurality of images is determined to be accurate: dividing the data set to be rechecked into a plurality of subsets according to the feature data of each image; determining the convex hull of each subset and the data on the convex hull according to the feature data of the images in each subset; and determining whether the labeling information of the plurality of images is accurate according to the data on the convex hull of each subset.
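The iterative re-review described here can be sketched as a loop over progressively smaller subsets. The sketch below is a toy under stated assumptions: features are one-dimensional (so the convex hull of a subset is just its minimum and maximum), subsets are formed by halving the sorted data, and error data found on a hull is treated as corrected and set aside before the subset is divided again. None of these choices come from the disclosure, and all function names are hypothetical.

```python
def review_until_clean(dataset, split, hull_items, is_error, max_rounds=10):
    """Re-divide any subset whose hull data contains errors until none are found."""
    pending, errors = [list(dataset)], []
    for _ in range(max_rounds):
        next_pending = []
        for data in pending:
            for subset in split(data):
                bad = [x for x in hull_items(subset) if is_error(x)]
                if bad:
                    errors.extend(bad)
                    # Assumption: found errors are corrected and set aside,
                    # and the rest of the subset becomes a new set to recheck.
                    rest = [x for x in subset if x not in bad]
                    if rest:
                        next_pending.append(rest)
        if not next_pending:
            break
        pending = next_pending
    return errors


def split_halves(items):
    """Toy divider: sort by feature value and cut into two halves."""
    items = sorted(items)
    mid = len(items) // 2
    return [items] if mid == 0 else [items[:mid], items[mid:]]


def endpoints(subset):
    """Convex hull of 1-D data: its minimum and maximum (deduplicated)."""
    return list(dict.fromkeys([min(subset), max(subset)]))


# Each item is (feature_value, label_is_correct); the values are hypothetical.
data = [(0, True), (1, True), (2, True), (3, True),
        (4, True), (5, False), (6, True), (7, False)]
errors = review_until_clean(data, split_halves, endpoints, lambda x: not x[1])
```

In the first round only the item at 7 lies on a hull and is flagged; re-dividing its subset exposes the item at 5 in the second round, which illustrates the coarse-to-fine behavior.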
In some embodiments, the second determining module 440 is configured to determine the anomaly degree of the data on the convex hull of each subset based on at least one of a distance of the data on the convex hull of each subset from a center of the subset to which it belongs, a confidence that the data on the convex hull belongs to the subset to which it belongs, and an outlier degree of the data on the convex hull.
In some embodiments, the greater the distance of the data on the convex hull of each subset from the center of the subset to which it belongs, the greater the anomaly degree; the greater the confidence that the data on the convex hull belongs to the subset to which it belongs, the smaller the anomaly degree; and the greater the outlier degree of the data on the convex hull, the greater the anomaly degree.
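One way to realize these three monotonic relationships is a weighted sum, sketched below. The linear form, the weights, and the assumption that confidence lies in [0, 1] are illustrative choices; the disclosure only states the direction in which each factor moves the anomaly degree, not how the factors are combined.

```python
import math


def anomaly_degree(point, center, confidence, outlier_degree,
                   weights=(1.0, 1.0, 1.0)):
    """Hypothetical score: grows with distance and outlier degree, shrinks with confidence."""
    w_dist, w_conf, w_out = weights
    distance = math.dist(point, center)  # Euclidean distance to the subset center
    return w_dist * distance + w_conf * (1.0 - confidence) + w_out * outlier_degree


score = anomaly_degree((3.0, 4.0), (0.0, 0.0), confidence=0.5, outlier_degree=0.0)
```

In practice the three terms would likely need to be normalized to comparable ranges before weighting; that step is omitted here.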
In some embodiments, the second determining module 440 is configured to, for the data on the convex hull of each subset, sort the data in descending order of anomaly degree and select a preset number of the data as the review data of the subset.
In some embodiments, the second determining module 440 is configured to select, for the data on the convex hull of each subset, data having an anomaly degree above a preset threshold as the review data of the subset.
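The two selection strategies can be sketched side by side. The sketch assumes each hull item arrives as an `(item_id, anomaly_degree)` pair and that "a preset number" means the highest-ranked items after the descending sort; the function names are hypothetical.

```python
def select_top_k(scored, k):
    """Sort by anomaly degree in descending order and keep the first k items."""
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


def select_above_threshold(scored, threshold):
    """Keep every item whose anomaly degree is strictly above the threshold."""
    return [item for item in scored if item[1] > threshold]


# Hypothetical hull items of one subset with precomputed anomaly degrees.
scored = [("img_a", 0.2), ("img_b", 0.9), ("img_c", 0.5), ("img_d", 0.7)]
top_two = select_top_k(scored, 2)
above = select_above_threshold(scored, 0.6)
```

A fixed count bounds the review workload per subset, while a threshold lets the workload vary with how suspicious the hull data looks.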
In some embodiments, the dividing module 420 is configured to, for each subset and the feature data of each image, determine a probability that the feature data belongs to the subset by each classifier of a plurality of classifiers; average the probabilities determined by the classifiers to obtain an average probability that the feature data belongs to the subset; and obtain the subset corresponding to the feature data according to the average probability that the feature data belongs to each subset.
In some embodiments, the dividing module 420 is configured to determine the subset corresponding to the maximum average probability as the subset corresponding to the feature data.
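The averaging and maximum-selection steps can be sketched as follows. The input layout (one probability vector per classifier, one entry per subset) and the name `assign_subset` are assumptions made for illustration; the classifiers themselves are not modeled.

```python
def assign_subset(per_classifier_probs):
    """Average the per-classifier probabilities and pick the most probable subset."""
    n_classifiers = len(per_classifier_probs)
    n_subsets = len(per_classifier_probs[0])
    avg = [
        sum(probs[j] for probs in per_classifier_probs) / n_classifiers
        for j in range(n_subsets)
    ]
    best = max(range(n_subsets), key=avg.__getitem__)
    return best, avg


# Hypothetical outputs of three classifiers for one image over two subsets.
best, avg = assign_subset([[0.6, 0.4], [0.2, 0.8], [0.5, 0.5]])
```

Here the first classifier alone would choose subset 0, but the averaged probabilities favor subset 1, which shows how the ensemble can override a single classifier.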
In some embodiments, the data review device 40 is further configured to, for each image, acquire intermediate feature data of the image by each feature extractor of a plurality of feature extractors, and fuse the intermediate feature data acquired by the feature extractors to obtain the feature data of the image.
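A minimal sketch of this fusion step is given below. It assumes fusion by simple concatenation, which is only one possible choice since the disclosure does not specify the fusion operation, and it uses two toy extractors over a flat list of pixel values as stand-ins for real feature extractors.

```python
def fuse_features(image, extractors):
    """Concatenate the intermediate features produced by every extractor."""
    fused = []
    for extract in extractors:
        fused.extend(extract(image))
    return fused


def mean_extractor(pixels):
    """Toy extractor: mean intensity."""
    return [sum(pixels) / len(pixels)]


def range_extractor(pixels):
    """Toy extractor: minimum and maximum intensity."""
    return [min(pixels), max(pixels)]


feature = fuse_features([0, 2, 4], [mean_extractor, range_extractor])
```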
In the above embodiments, the data set to be rechecked is divided into a plurality of subsets, the convex hull of each subset and the data on the convex hull are determined, and the accuracy of the labeling information of the plurality of images in the data set to be rechecked is determined according to the accuracy of the data on the convex hull of each subset. Because the data on the convex hulls accounts for only a small part of the data to be rechecked, the rechecking workload is greatly reduced and the efficiency of finding mislabeled data is effectively improved.
The data rechecking method of the present disclosure uses a small amount of data on the convex hull of a set to represent all data in the set, which significantly reduces the rechecking workload. In addition, the set corresponding to the error data is re-determined as the data set to be rechecked, and operations such as dividing it into a plurality of subsets, determining the convex hull of each subset, and determining the data on the convex hull are re-executed until no error data is found. The whole data set to be rechecked is thus divided repeatedly from coarse to fine, so that the review becomes progressively finer and error data is checked more comprehensively.
The data review means in embodiments of the present disclosure may each be implemented by various computing devices or computer systems, as described below in connection with fig. 5 and 6.
Fig. 5 shows a schematic structural diagram of a data review device of other embodiments of the present disclosure. As shown in fig. 5, the apparatus 50 of this embodiment includes: a memory 510 and a processor 520 coupled to the memory 510, the processor 520 being configured to perform the data review method in any of the embodiments of the present disclosure based on instructions stored in the memory 510.
The memory 510 may include, for example, a system memory, a fixed nonvolatile storage medium, and the like. The system memory stores, for example, an operating system, application programs, a boot loader, a database, and other programs.
Fig. 6 shows a schematic structural diagram of a data review device of further embodiments of the present disclosure. As shown in fig. 6, the apparatus 60 of this embodiment includes a memory 610 and a processor 620, which are similar to the memory 510 and the processor 520, respectively. The apparatus 60 may also include an input/output interface 630, a network interface 640, a storage interface 650, and the like. The interfaces 630, 640, 650, the memory 610, and the processor 620 may be connected by, for example, a bus 660. The input/output interface 630 provides a connection interface for input/output devices such as a display, a mouse, a keyboard, and a touch screen. The network interface 640 provides a connection interface for various networking devices and may, for example, be connected to a database server or a cloud storage server. The storage interface 650 provides a connection interface for external storage devices such as SD cards and USB flash drives.
Embodiments of the present disclosure also provide a computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements any one of the foregoing data review methods.
It will be appreciated by those skilled in the art that embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing describes only preferred embodiments of the present disclosure and is not intended to limit the disclosure. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present disclosure shall fall within its scope of protection.

Claims (14)

1. A data review method, comprising:
obtaining a data set to be rechecked, wherein the data set to be rechecked comprises: a plurality of images and labeling information thereof;
dividing the data set to be rechecked into a plurality of subsets according to the characteristic data of each image;
determining convex hulls of the subsets and data on the convex hulls according to the characteristic data of the images in the subsets;
and determining whether the labeling information of the plurality of images is accurate or not according to the data on the convex hull of each subset.
2. The data review method of claim 1 wherein the determining whether the labeling information of the plurality of images is accurate based on the data on the convex hull of each subset comprises:
determining the degree of abnormality of the data on the convex hull of each subset;
selecting recheck data from the data on the convex hull of each subset according to the abnormality degree;
and determining whether the labeling information of the plurality of images is accurate or not according to the rechecking data.
3. The data review method of claim 2 wherein the determining whether the annotation information for the plurality of images is accurate based on the review data comprises:
determining that the labeling information of the plurality of images is accurate in the case that no error data exists in the review data.
4. The data review method of claim 2 wherein the determining whether the annotation information for the plurality of images is accurate based on the review data comprises:
in the case that error data occurs in the review data, re-determining the subset corresponding to the error data as the data set to be rechecked, and re-executing the following steps until the labeling information of the plurality of images is determined to be accurate: dividing the data set to be rechecked into a plurality of subsets according to the feature data of each image; determining the convex hull of each subset and the data on the convex hull according to the feature data of the images in each subset; and determining whether the labeling information of the plurality of images is accurate according to the data on the convex hull of each subset.
5. The data review method of claim 2 wherein the determining the degree of anomaly of the data on the convex hull of each subset comprises:
determining the degree of anomaly of the data on the convex hull of each subset according to at least one of: the distance between the data on the convex hull of each subset and the center of the subset to which the data belongs, the confidence that the data on the convex hull belongs to the subset to which it belongs, and the outlier degree of the data on the convex hull.
6. The data review method of claim 5 wherein,
the greater the distance between the data on the convex hull of each subset and the center of the subset to which the data belongs, the greater the degree of anomaly;
the greater the confidence that the data on the convex hull belongs to the subset to which it belongs, the smaller the degree of anomaly;
the greater the outlier degree of the data on the convex hull, the greater the degree of anomaly.
7. The data review method of claim 2 wherein the selecting review data from the data on the convex hull of each subset according to the degree of anomaly comprises:
for data on the convex hull of each subset,
sorting the data on the convex hull according to the descending order of the degree of abnormality;
and selecting the data of the preset quantity as the rechecking data of the subset.
8. The data review method of claim 2 wherein the selecting review data from the data on the convex hull of each subset according to the degree of anomaly comprises:
for the data on the convex hull of each subset,
selecting data with a degree of anomaly higher than a preset threshold as the review data of the subset.
9. The data review method of claim 1 wherein the dividing the data set to be reviewed into a plurality of subsets according to the feature data of each image comprises:
determining, for each subset and feature data of each image, a probability that the feature data belongs to the subset by each classifier of a plurality of classifiers;
averaging the probabilities of the feature data belonging to the subsets determined by each classifier to obtain average probabilities of the feature data belonging to the subsets;
and obtaining the subset corresponding to the characteristic data according to the average probability that the characteristic data belongs to each subset.
10. The data review method of claim 9 wherein the deriving the subset corresponding to the feature data from the average probability that the feature data belongs to each subset comprises:
determining the subset corresponding to the maximum average probability as the subset corresponding to the feature data.
11. The data review method of claim 1, further comprising:
for each image,
acquiring intermediate feature data of the image by each feature extractor of a plurality of feature extractors;
and fusing the intermediate feature data acquired by each feature extractor to obtain feature data of the image.
12. A data review device, comprising:
an acquisition module configured to acquire a data set to be rechecked, wherein the data set to be rechecked comprises: a plurality of images and labeling information thereof;
a dividing module configured to divide the data set to be rechecked into a plurality of subsets according to the feature data of each image;
a first determining module configured to determine a convex hull of each subset and data on the convex hull according to the feature data of the images in each subset;
and a second determining module configured to determine whether the labeling information of the plurality of images is accurate according to the data on the convex hull of each subset.
13. A data review device, comprising:
a processor; and
a memory coupled to the processor for storing instructions that, when executed by the processor, cause the processor to perform the data review method of any one of claims 1 to 11.
14. A computer readable storage medium having stored thereon a computer program, wherein the program when executed by a processor implements the steps of the data review method of any of claims 1 to 11.
CN202311656863.XA 2023-12-05 2023-12-05 Data rechecking method, device and computer readable storage medium Pending CN117636042A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311656863.XA CN117636042A (en) 2023-12-05 2023-12-05 Data rechecking method, device and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN117636042A (en) 2024-03-01

Family

ID=90035399

Country Status (1)

Country Link
CN (1) CN117636042A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination