CN111966674A - Method and device for judging qualification of labeled data and electronic equipment - Google Patents


Info

Publication number
CN111966674A
Authority
CN
China
Prior art keywords
data
target object
labeling
annotation
labeling data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010868165.6A
Other languages
Chinese (zh)
Other versions
CN111966674B (en)
Inventor
李果
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Cloud Network Technology Co Ltd
Original Assignee
Beijing Kingsoft Cloud Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingsoft Cloud Network Technology Co Ltd
Priority to CN202010868165.6A
Publication of CN111966674A
Application granted
Publication of CN111966674B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G06F16/217 Database tuning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)
  • User Interface Of Digital Computer (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention provides a method and a device for judging the qualification of labeling data, and electronic equipment. The method comprises the following steps: acquiring the labeling data of a target object; determining whether the labeling data of the target object meets a specified condition, where the specified condition includes a condition related to the labeling data of a comparison object; and if the labeling data of the target object meets the specified condition, determining that the labeling data of the target object is qualified. The method determines whether the labeling data of the target object meets the specified condition based on the labeling data of the comparison object, and thereby determines whether the labeling data of the target object is qualified.

Description

Method and device for judging qualification of labeled data and electronic equipment
Technical Field
The invention relates to the technical field of data processing, in particular to a method and a device for judging the qualification of labeled data and electronic equipment.
Background
Quality evaluation labels for an image or a video are assigned by an annotator according to his or her own visual perception, and for the same image, the quality evaluation labels given by different annotators may differ. Quality evaluation labeling is therefore strongly subjective and uncertain, and to ensure its accuracy it is necessary to judge whether an annotator's labels are qualified.
In the related art, part of data is usually extracted from the labeled data of the labeling staff, and then the accuracy of the extracted part of data is manually judged to determine whether the label of the labeling staff is qualified, but the judgment precision of the manual judgment mode is low, and a large amount of labor cost is required.
Disclosure of Invention
The invention aims to provide a method and a device for judging the qualification of labeling data and electronic equipment, so as to improve the judgment precision of the qualification of labeling and reduce the judgment cost.
In a first aspect, an embodiment of the present invention provides a method for determining eligibility of annotation data, where the method includes: acquiring the labeling data of a target object; determining whether the labeling data of the target object meets a specified condition; wherein the specified condition comprises a condition related to the labeling data of a comparison object; the labeling data of the comparison object and the labeling data of the target object are labeling data produced for the same data set; and if the labeling data of the target object meets the specified condition, determining that the labeling data of the target object is qualified.
In an alternative embodiment, the above-mentioned specified conditions include: a first correlation coefficient between the labeling data of the target object and the labeling mean value is greater than or equal to a first preset threshold value; the labeling mean value is the mean value of the labeling data of the target object and the labeling data of the comparison object; and/or a second correlation coefficient between the labeling data of the target object and the labeling data of the comparison object is greater than or equal to a second preset threshold value.
In an optional embodiment, the data set includes a plurality of data to be processed; the labeling data is the labeling result of each piece of data to be processed in the plurality of pieces of data to be processed; the first correlation coefficient is determined by: for each piece of data to be processed, calculating the mean value of the labeling result of the target object for the current data to be processed and the labeling result of the comparison object for the current data to be processed, so that the labeling mean value comprises a mean value corresponding to each piece of data to be processed; arranging the mean values corresponding to each piece of data to be processed in the labeling mean value to obtain a first sequence; arranging the labeling data of the target object into a second sequence according to the order of the data to be processed in the first sequence; and calculating the SROCC correlation coefficient between the first sequence and the second sequence, and determining the SROCC correlation coefficient as the first correlation coefficient.
In an alternative embodiment, the second correlation coefficient is determined by: calculating the Manhattan distance between the labeling data of the target object and the labeling data of the comparison object; and determining the Manhattan distance as the second correlation coefficient.
In an alternative embodiment, there are a plurality of comparison objects; the step of calculating the Manhattan distance between the labeling data of the target object and the labeling data of the comparison object includes: calculating the Manhattan distance between the labeling data of the target object and the labeling data of each comparison object to obtain a plurality of Manhattan distances; the above-mentioned specified conditions include: each of the plurality of Manhattan distances is greater than or equal to the second preset threshold.
In an alternative embodiment, the above-mentioned specified conditions further include: the common option proportion of the target object is smaller than or equal to a third preset threshold; wherein, the common option proportion is determined by the marking data of the target object.
In an optional embodiment, the annotation data includes one of a plurality of preset annotation options; the common option ratio is determined by: counting options in the labeling data of the target object to obtain the use times of each option; and determining the quotient of the maximum value of the using times and the total quantity of the labeling data as a common option proportion.
In an optional embodiment, the annotation data includes a labeling time; the above-mentioned specified conditions further include: the average labeling time of the target object is greater than or equal to a fourth preset threshold value; or, among the specified data, that is, the labeling data of the target object whose labeling time is less than a fifth preset threshold, the average labeling time is greater than or equal to a sixth preset threshold.
In a second aspect, an embodiment of the present invention provides an apparatus for determining eligibility of annotation data, where the apparatus includes: a data acquisition module, used for acquiring the labeling data of a target object; a condition judgment module, used for determining whether the labeling data of the target object meets a specified condition; wherein the specified condition includes a condition related to the labeling data of a comparison object; the labeling data of the comparison object and the labeling data of the target object are labeling data produced for the same data set; and a qualification determining module, used for determining that the labeling data of the target object is qualified if the labeling data of the target object meets the specified condition.
In a third aspect, an embodiment of the present invention provides an electronic device, which includes a processor and a memory, where the memory stores machine executable instructions capable of being executed by the processor, and the processor executes the machine executable instructions to implement the method for determining eligibility of annotation data according to any one of the foregoing embodiments.
In a fourth aspect, embodiments of the present invention provide a machine-readable storage medium storing machine-executable instructions, which when invoked and executed by a processor, cause the processor to implement the method for qualification of annotation data as described in any one of the preceding embodiments.
The embodiment of the invention has the following beneficial effects:
According to the method, the device and the electronic equipment for judging the qualification of labeling data provided by the embodiment of the invention, the labeling data of a target object is first obtained; it is then determined whether the labeling data of the target object meets specified conditions; the specified conditions comprise conditions related to the labeling data of a comparison object, and the labeling data of the comparison object and the labeling data of the target object are labeling data produced for the same data set; if the labeling data of the target object meets the specified conditions, the labeling data of the target object is determined to be qualified, and otherwise it is determined to be unqualified. The method determines whether the labeling data of the target object meets the specified conditions based on the labeling data of the comparison object, and thereby determines whether the labeling data of the target object is qualified.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the invention as set forth above.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flowchart of a method for determining eligibility of annotation data according to an embodiment of the present invention;
FIG. 2 is a flowchart of another method for determining eligibility of annotation data according to an embodiment of the present invention;
FIG. 3 is a flowchart of another method for determining eligibility of annotation data according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an apparatus for determining eligibility of annotation data according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Data annotation generally refers to the process of manually describing and marking data according to some rule. Typical types of data annotation include image annotation, voice annotation, text annotation, video annotation and the like, and the basic forms of annotation include bounding boxes, 3D (3-Dimension) bounding boxes, text transcription, image point marking, target object contour lines and the like.
An image or a video may undergo multiple processing stages before it is presented to the viewer, and each stage may introduce distortions that reduce its final display effect, e.g., distortions at capture, compression, transmission, etc. Therefore, in order to improve the presentation effect, quality evaluation needs to be performed on the image or the video before it is presented to the viewer, that is, an annotator labels the image quality of the image or the video. Generally, quality evaluation can be divided into no-reference quality evaluation and full-reference quality evaluation: no-reference quality evaluation refers to the evaluation an annotator makes of the image quality perceived when viewing a single image or video, and full-reference quality evaluation refers to the evaluation an annotator makes of the perceived quality difference when viewing a pair of images or a pair of videos. Quality evaluation labels are thus assigned by annotators according to their own visual perception, and the labels given by different annotators for the same image may differ, so quality evaluation labeling is strongly subjective and uncertain. To ensure its accuracy, it is necessary to judge whether an annotator's labels are qualified.
In the related technology, usually, a small part of data is extracted from the labeling data of a labeling person, and then the accuracy of the extracted part of data is judged manually to determine whether the labeling of the labeling person is qualified, but the method only extracts the small part of data, which is difficult to represent all the labels of the labeling person accurately, and the judgment precision of the manual judgment method is low, and a large amount of labor cost is required.
Based on the above problems, embodiments of the present invention provide a method and an apparatus for determining eligibility of annotation data, and an electronic device, where the technique may be applied in any scene of determining eligibility of annotation data, especially in a scene of determining eligibility of an annotator for image quality annotation. To facilitate understanding of the embodiment, first, a detailed description is given of a method for determining eligibility of label data disclosed in the embodiment of the present invention, where the method is applied to an electronic device, and as shown in fig. 1, the method includes the following steps:
Step S102, obtaining the labeling data of the target object.
The target object may be an annotator whose labeling data needs to be judged for qualification, and all the labeling data of the annotator need to be acquired before the judgment is made. In a specific implementation, the labeling data of the target object is stored in a preset storage list, that is, each label the target object assigns to a piece of data can be stored in the preset storage list in real time, and the labeling data may be labeling data of an image, a video, a voice clip or a text.
The preset storage list stores label data of a plurality of objects, and the preset storage list may be an excel table, where each column in the excel table corresponds to label data of one object separately, or each row in the excel table corresponds to label data of one object separately. In specific implementation, when the label data of the target object is obtained, a column or a row corresponding to the target object may be determined from the excel table, and the data of the row or the column is extracted, so that the label data of the target object can be obtained.
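The extraction described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the storage list is CSV text rather than an Excel file, that each column holds one annotator's labels, and that the names `STORAGE_CSV`, `load_annotations`, `item` and `annotator_a` are hypothetical.

```python
import csv
import io

# Hypothetical storage list: one row per item to be labeled,
# one column per annotator (an assumed layout, stored as CSV for simplicity).
STORAGE_CSV = """item,annotator_a,annotator_b,annotator_c
img_001,4,5,4
img_002,2,2,3
img_003,5,4,5
"""


def load_annotations(csv_text):
    """Return {annotator: [score, ...]} with scores kept in row (item) order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = {name: [] for name in reader.fieldnames if name != "item"}
    for row in reader:
        for name in columns:
            columns[name].append(float(row[name]))
    return columns


annotations = load_annotations(STORAGE_CSV)
# The column of the target object is its labeling data.
target_scores = annotations["annotator_a"]
print(target_scores)
```

Keeping every annotator's scores in the same row order means that position i in any two lists refers to the same piece of data to be processed, which the later comparisons rely on.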
Step S104, determining whether the labeling data of the target object meets the specified conditions; wherein the specified conditions include a condition related to the labeling data of the comparison object; the labeling data of the comparison object and the labeling data of the target object are labeling data produced for the same data set.
The comparison object may be an object that has labeled the same data set as the target object; for example, the comparison object may be any object in the preset storage list other than the target object. When each column in the preset storage list represents the labeling data of one object, the labeling data in each row are the labeling results of the different objects for the same piece of data to be processed.
The specified conditions include a condition related to the labeling data of the comparison object, and may further include other preset conditions, which may be set according to user requirements, for example, that the labeling data of the target object is greater than or equal to a specified value; no specific limitation is imposed here. In a specific implementation, the condition related to the labeling data of the comparison object may be: the labeling data of the target object is not similar to the labeling data of the comparison object, or the number of items in the labeling data of the target object that are identical or similar to the labeling data of the comparison object is less than or equal to a threshold value.
Step S106, if the marking data of the target object meets the specified conditions, the marking data of the target object is determined to be qualified.
When the labeling data of the target object meets specified conditions, determining that the labeling data of the target object is qualified; and when the labeling data of the target object does not meet the specified condition, determining that the labeling data of the target object is unqualified. In some embodiments, if the specified condition is multiple, the annotation data of the target object is determined to be disqualified as long as the annotation data of the target object does not satisfy any of the specified conditions.
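The decision rule for multiple specified conditions can be sketched as below. The condition names and the `is_qualified` helper are hypothetical and only illustrate that a single failed condition is enough to mark the labeling data as unqualified.

```python
def is_qualified(condition_results):
    """Labeling data is qualified only if every specified condition holds."""
    return all(condition_results.values())


# Hypothetical outcome of two specified conditions for one target object.
results = {"correlation_with_mean": True, "distance_from_others": False}
failed = [name for name, ok in results.items() if not ok]
print(is_qualified(results), failed)
```

Recording which conditions failed, rather than only the final verdict, is a design choice of this sketch; it makes it easier to tell a low-accuracy annotator from one whose labels are suspiciously close to another annotator's.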
In the method for judging the qualification of labeling data provided by the embodiment of the invention, the labeling data of a target object is first obtained; it is then determined whether the labeling data of the target object meets specified conditions; the specified conditions comprise conditions related to the labeling data of a comparison object, and the labeling data of the comparison object and the labeling data of the target object are labeling data produced for the same data set; if the labeling data of the target object meets the specified conditions, the labeling data of the target object is determined to be qualified, and otherwise it is determined to be unqualified. The method determines whether the labeling data of the target object meets the specified conditions based on the labeling data of the comparison object, and thereby determines whether the labeling data of the target object is qualified.
The embodiment of the invention also provides another method for judging the qualification of the marked data, which is realized on the basis of the method of the embodiment; the method mainly describes a specific process for determining whether the annotation data of the target object meets the specified conditions (realized by the following steps S204-S208); as shown in fig. 2, the method comprises the following specific steps:
Step S202, obtaining the labeling data of the target object.
Step S204, calculating the mean value of the labeling data of the target object and the labeling data of the comparison object to obtain the labeling mean value.
The labeling data of the comparison object can be obtained from the preset storage list. In a specific implementation, the target object and the comparison object each have multiple pieces of labeling data, that is, the data set labeled by both the target object and the comparison object contains multiple pieces of data to be processed, and each piece of labeling data is the labeling result for one piece of data to be processed. The labeling mean value therefore includes, for each piece of data to be processed, the mean of the labeling result of the target object and the labeling result of the comparison object for that piece of data.
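The per-item labeling mean can be sketched as follows. The function name `labeling_mean` and the sample scores are hypothetical; the sketch assumes numeric labels and that all lists are aligned so that index i refers to the same piece of data to be processed.

```python
def labeling_mean(target, comparisons):
    """Per-item mean over the target's score and each comparison object's score."""
    n_annotators = 1 + len(comparisons)
    means = []
    for i, target_score in enumerate(target):
        total = target_score + sum(scores[i] for scores in comparisons)
        means.append(total / n_annotators)
    return means


target = [4, 2, 5]                    # labeling data of the target object
comparisons = [[5, 2, 4], [3, 2, 3]]  # labeling data of two comparison objects
print(labeling_mean(target, comparisons))  # [4.0, 2.0, 4.0]
```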
Step S206, determining a first correlation coefficient between the labeling data of the target object and the labeling mean value according to the labeling data of the target object and the labeling mean value.
The similarity between the labeling data of the target object and the labeling mean value may be calculated and taken as the first correlation coefficient, for example, a difference or a mean square error between the labeling data of the target object and the labeling mean value. In a specific implementation, the first correlation coefficient can be calculated through the following steps 10-13:
Step 10, for each piece of data to be processed, calculating the mean value of the labeling result of the target object for the current data to be processed and the labeling result of the comparison object for the current data to be processed; the labeling mean value comprises the mean value corresponding to each piece of data to be processed.
Step 11, arranging the mean values corresponding to each piece of data to be processed in the labeling mean value to obtain a first sequence.
Step 12, arranging the labeling data of the target object into a second sequence according to the order of the data to be processed in the first sequence.
Step 13, calculating the SROCC correlation coefficient between the first sequence and the second sequence, and determining the SROCC correlation coefficient as the first correlation coefficient.
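Steps 11 and 12 can be sketched as follows. The text does not state the sort direction, so ascending order is an assumption of this sketch, and `build_sequences` is a hypothetical helper name.

```python
def build_sequences(means, target):
    """Sort items by their mean label and reorder the target's labels to match."""
    # Assumption: ascending order; the source only says the means are "arranged".
    order = sorted(range(len(means)), key=lambda i: means[i])
    first = [means[i] for i in order]    # first sequence: sorted means
    second = [target[i] for i in order]  # second sequence: target's labels, same item order
    return first, second


means = [4.0, 2.0, 4.5, 3.0]
target = [4, 1, 5, 2]
first, second = build_sequences(means, target)
print(first, second)
```

Because both sequences are reordered with the same permutation, the pairing between a mean and the target's label for the same item is preserved, so the rank correlation computed in step 13 does not depend on the sort direction chosen.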
SROCC (Spearman's Rank Correlation Coefficient) is commonly used to measure the correlation between two sequences: an SROCC of plus or minus 1 means that the two sequences are perfectly monotonically correlated, and an SROCC of 0 means that the two sequences are uncorrelated. In a specific implementation, assuming that the data set includes n pieces of data to be processed, the SROCC correlation coefficient $\rho$ between the first sequence X and the second sequence Y may be obtained by the following equation:
$$\rho = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^{2}\,\sum_{i=1}^{n}(y_i-\bar{y})^{2}}}$$
where $x_i$ represents the i-th annotation data in the first sequence X, $\bar{x}$ represents the mean of all annotation data in the first sequence X, $y_i$ represents the i-th annotation data in the second sequence Y, and $\bar{y}$ represents the mean of all annotation data in the second sequence Y.
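A rank correlation of this kind can be sketched in plain Python as follows. This is an illustrative sketch rather than the patent's implementation: it converts each sequence to ranks (tied values share the average rank, a common convention that the text does not specify) and then applies the correlation formula to those ranks. The names `average_ranks` and `srocc` are hypothetical, and returning 0.0 for a constant sequence is an assumption.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def srocc(x, y):
    """Spearman rank correlation: the correlation formula applied to ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant sequence as uncorrelated
    return cov / sqrt(var_x * var_y)


first_sequence = [2.0, 3.0, 4.0, 4.5]  # labeling means, sorted
second_sequence = [1, 2, 5, 4]         # target's labels in the same item order
rho = srocc(first_sequence, second_sequence)
print(rho)
```

If SciPy is available, `scipy.stats.spearmanr` computes the same statistic and could replace the hand-written helper.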
Step S208, judging whether the first correlation coefficient is greater than or equal to a first preset threshold value; if so, step S210 is executed; otherwise, step S212 is executed.
Step S210, determining that the annotation data of the target object is qualified.
Step S212, determining that the labeling data of the target object is unqualified.
In a specific implementation, the first correlation coefficient is greater than or equal to a first preset threshold, which corresponds to the above-mentioned specified condition. The first preset threshold is set according to user requirements, and the first preset threshold may be set to 0.4, 0.3, and the like. And when the first correlation coefficient is smaller than a first preset threshold value, considering that the accuracy of the target object is low, and determining that the marking data of the target object is unqualified.
In some embodiments, in order to ensure the accuracy of the qualification judgment of the annotation data, the following steps 20 to 21 may be further performed when the first correlation coefficient is greater than or equal to the first preset threshold:
Step 20, determining a second correlation coefficient between the labeling data of the target object and the labeling data of the comparison object.
In a specific implementation, the Manhattan distance between the labeling data of the target object and the labeling data of the comparison object can be calculated and determined as the second correlation coefficient. The Manhattan distance between two points in a rectangular coordinate system is the sum of the absolute differences of their coordinates along each axis, so the Manhattan distance d between the labeling data of the target object and the labeling data of the comparison object can be calculated by the following equation:
$$d = \sum_{i=1}^{n}\left|x_i - z_i\right|$$
where $x_i$ represents the i-th annotation data of the target object, $z_i$ represents the i-th annotation data of the comparison object, and n represents the total amount of data to be processed in the data set labeled by both the target object and the comparison object.
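The Manhattan distance between two annotators' labels can be sketched as follows; `manhattan_distance` and the sample scores are hypothetical, and the labels are assumed to be numeric and aligned item by item.

```python
def manhattan_distance(x, z):
    """Sum of absolute per-item differences between two annotators' labels."""
    if len(x) != len(z):
        raise ValueError("both annotators must label the same data set")
    return sum(abs(a - b) for a, b in zip(x, z))


target = [4, 2, 5]
comparison = [5, 2, 3]
print(manhattan_distance(target, comparison))  # 3
```

A distance of 0 means the two annotators gave identical labels for every item, which is why a small distance is treated as a warning sign rather than as agreement.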
Step 21, judging whether the second correlation coefficient between the labeling data of the target object and the labeling data of the comparison object is greater than or equal to a second preset threshold value; if so, determining that the labeling data of the target object is qualified; otherwise, determining that the labeling data of the target object is unqualified.
In a specific implementation, the second correlation coefficient being greater than or equal to the second preset threshold is one of the above-mentioned specified conditions. The second preset threshold is set according to user requirements; when the second correlation coefficient is smaller than the second preset threshold, the target object is suspected of plagiarism, and the labeling data of the target object is determined to be unqualified.
In a specific implementation, there can be multiple comparison objects; the Manhattan distance between the labeling data of the target object and the labeling data of the comparison object is then calculated in the following way: calculating the Manhattan distance between the labeling data of the target object and the labeling data of each comparison object to obtain a plurality of Manhattan distances; the above-mentioned specified conditions include: each of the plurality of Manhattan distances is greater than or equal to the second preset threshold; that is, if any of the plurality of Manhattan distances is smaller than the second preset threshold, it is determined that the labeling data of the target object is unqualified.
In some embodiments, the second correlation coefficient being greater than or equal to the second preset threshold may be used alone as the specified condition to determine the eligibility of the annotation data, that is, steps S204-S212 are replaced with steps 20-21.
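The check against several comparison objects can be sketched as follows. The helper name `passes_distance_check`, the annotator names and the threshold value 2 are hypothetical; the sketch only illustrates that one distance below the threshold is enough to fail the condition.

```python
def passes_distance_check(target, comparisons, threshold):
    """Return (ok, too_close): ok is False if any comparison object is too similar."""
    distances = {
        name: sum(abs(a - b) for a, b in zip(target, scores))
        for name, scores in comparisons.items()
    }
    too_close = [name for name, d in distances.items() if d < threshold]
    return not too_close, too_close


target = [4, 2, 5, 3]
comparisons = {
    "annotator_b": [5, 2, 3, 3],  # distance 3
    "annotator_c": [4, 2, 5, 4],  # distance 1, nearly identical labels
}
ok, too_close = passes_distance_check(target, comparisons, threshold=2)
print(ok, too_close)
```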
The method for judging the qualification of the labeled data comprises the steps of firstly, acquiring labeled data of a target object; further calculating the mean value of the labeling data of the target object and the labeling data of the comparison object to obtain a labeling mean value; determining a first correlation coefficient of the labeling data and the labeling mean value of the target object according to the labeling data and the labeling mean value of the target object; and if the first correlation coefficient is larger than or equal to a first preset threshold value, determining that the labeling data of the target object is qualified. According to the method, the electronic equipment can automatically judge the qualification of the marked data according to the specified conditions, manual judgment is not needed, the judgment precision is improved, and the labor cost in judgment is reduced.
The embodiment of the invention also provides another method for judging the qualification of the marked data, which is realized on the basis of the method of the embodiment; the method mainly describes a specific process for determining whether the annotation data of the target object meets the specified conditions (realized by the following steps S304-S314); as shown in fig. 3, the method comprises the following specific steps:
Step S302, obtaining the labeling data of the target object.
Step S304, judging whether a first correlation coefficient between the labeling data of the target object and the labeling mean value is greater than or equal to a first preset threshold value; if yes, go to step S306; otherwise, step S314 is executed.
Step S306, judging whether the second correlation coefficient between the labeling data of the target object and the labeling data of the comparison object is greater than or equal to a second preset threshold value; if yes, go to step S308; otherwise, step S314 is executed.
The specific implementation of the steps S304-S306 can refer to the steps S204-S212 and the steps 20-21, which are not described again.
Step S308, judging whether the proportion of the common options of the target object is less than or equal to a third preset threshold value; if so, step S310 is performed, otherwise, step S314 is performed.
The common option proportion is determined from the labeling data of the target object, and the condition that the common option proportion of the target object is smaller than or equal to a third preset threshold value is one of the specified conditions. In a specific implementation, each piece of labeling data is one of a plurality of preset labeling options; the common option proportion is determined as follows: counting the options in the labeling data of the target object to obtain the number of times each option is used; and determining the quotient of the maximum number of uses and the total quantity of the labeling data as the common option proportion.
The preset labeling options may be represented by A, B, C, D, E, etc., or may also be represented by 1, 2, 3, 4, 5, etc., and the meaning represented by each option may be determined according to the labeled data, for example, when labeling the quality of an image, the above 1, 2, 3, 4, 5 may represent different quality levels, and when evaluating the quality of an image, the target object only needs to select one option, that is, the selected option is the labeled data of the image.
For example, if the option most frequently selected by the target object during labeling is A, the number of times the target object selects A is N_A, and the total number of selections is N_total (which corresponds to the total number of the labeling data), the common option proportion is defined as r = N_A / N_total. If the common option proportion r is greater than a third preset threshold, for example, 80%, the information amount of the labeling data of the target object is considered to be too small, and the labeling data of the target object is determined to be unqualified.
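As a non-limiting illustration, the common option proportion r may be computed in Python as follows. The function name `common_option_ratio` and the sample labels are assumptions for this sketch; the 80% threshold is the example value given above.

```python
from collections import Counter
from typing import Hashable, Sequence


def common_option_ratio(labels: Sequence[Hashable]) -> float:
    """Usage count of the most frequently selected option divided by the total."""
    if not labels:
        raise ValueError("no labeling data to evaluate")
    counts = Counter(labels)  # number of times each option is used
    return max(counts.values()) / len(labels)


# Hypothetical labeling data: option A selected 9 times out of 10.
labels = ["A"] * 9 + ["B"]
r = common_option_ratio(labels)

third_threshold = 0.8  # example value of the third preset threshold
qualified = r <= third_threshold
```

Here r is 0.9, which exceeds the example threshold, so the labeling data in this sketch would be treated as carrying too little information.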
Step S310, judging whether the average value of the labeling time corresponding to the labeling data of the target object is greater than or equal to a fourth preset threshold value or not; if yes, go to step S312; otherwise, step S314 is executed.
When the labeling data includes a labeling time, the condition that the average labeling time corresponding to the labeling data of the target object is greater than or equal to a fourth preset threshold is one of the specified conditions. The fourth preset threshold can be set according to user requirements; if the average labeling time of the target object is smaller than the fourth preset threshold, the target object is considered not to have labeled carefully, and the labeling data of the target object is determined to be unqualified.
In a specific implementation, the specified condition in step S310 may be replaced by the following specified condition: among the specified data, that is, the labeling data of the target object whose labeling time is smaller than a fifth preset threshold, the average labeling time is greater than or equal to a sixth preset threshold. For example, the labeling time T of the target object is recorded for each piece of data to be processed, and after the labeling data whose labeling time is greater than or equal to the fifth preset threshold (e.g., 30 seconds) are eliminated, N' pieces of labeling data remain; the average labeling time Tmean of the N' pieces of labeling data is calculated, and if the Tmean of the target object is smaller than the sixth preset threshold, for example, 5 seconds, the target object is considered not to have observed the data carefully, and the labeling data is determined to be unqualified.
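As a non-limiting illustration, both variants of the labeling time condition may be sketched in Python as follows. The function names, the sample times, and the choice to raise an error when no labeling data remains after elimination are assumptions for this sketch; the values 30 seconds and 5 seconds are the example thresholds given above.

```python
from statistics import mean
from typing import Sequence


def passes_time_check(times: Sequence[float], fourth_threshold: float) -> bool:
    """Plain variant: the average labeling time must reach the fourth threshold."""
    return mean(times) >= fourth_threshold


def passes_filtered_time_check(
    times: Sequence[float],
    fifth_threshold: float = 30.0,
    sixth_threshold: float = 5.0,
) -> bool:
    """Filtered variant: drop times >= fifth threshold, then average the rest."""
    kept = [t for t in times if t < fifth_threshold]  # the N' remaining items
    if not kept:
        # Assumption: the embodiment does not say what to do when N' is zero.
        raise ValueError("no labeling data left after elimination")
    return mean(kept) >= sixth_threshold


# Hypothetical labeling times in seconds; one item took two minutes.
times = [3, 4, 2, 120, 5]

plain_ok = passes_time_check(times, fourth_threshold=5.0)
filtered_ok = passes_filtered_time_check(times)
```

In this sample the single 120-second item lifts the plain average to 26.8 seconds, while the filtered average of the remaining four items is 3.5 seconds, so only the filtered variant marks the data as unqualified.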
Step S312, determining that the annotation data of the target object is qualified.
Step S314, determining that the labeling data of the target object is not qualified.
In a specific implementation, the specified conditions may include the following four conditions:
First, a first correlation coefficient between the labeling data of the target object and the labeling mean value is greater than or equal to a first preset threshold.
Second, a second correlation coefficient between the labeling data of the target object and the labeling data of the comparison object is greater than or equal to a second preset threshold.
Thirdly, the proportion of the common options of the target object is less than or equal to a third preset threshold value.
Fourthly, the average labeling time of the target object is greater than or equal to a fourth preset threshold value; or, among the specified data, that is, the labeling data of the target object whose labeling time is less than a fifth preset threshold, the average labeling time is greater than or equal to a sixth preset threshold.
In practical applications, when the eligibility of the labeling data of the target object is determined, one or more of the four conditions may be selected for determination; when a plurality of specified conditions are selected, the labeling data of the target object is qualified only if it satisfies all of the selected conditions.
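As a non-limiting illustration, selecting one or more of the four conditions and requiring all selected conditions to hold may be sketched in Python as follows. The function name `is_qualified`, the condition names, the metric values, and every threshold value are hypothetical and serve only to show the combination logic.

```python
from typing import Callable, Dict, Iterable

Check = Callable[[], bool]


def is_qualified(checks: Dict[str, Check], selected: Iterable[str]) -> bool:
    """Qualified only if every selected condition is satisfied."""
    names = list(selected)
    if not names:
        raise ValueError("select at least one specified condition")
    # all() stops at the first failed condition, like the branch to step S314.
    return all(checks[name]() for name in names)


# Hypothetical, precomputed metrics for one target object.
metrics = {
    "srocc": 0.91,         # first correlation coefficient
    "min_distance": 12,    # smallest Manhattan distance to a comparison object
    "common_ratio": 0.85,  # common option proportion
    "mean_time": 7.2,      # average labeling time in seconds
}

# Assumed thresholds; the embodiment leaves them to user requirements.
checks: Dict[str, Check] = {
    "accuracy": lambda: metrics["srocc"] >= 0.8,
    "similarity": lambda: metrics["min_distance"] >= 5,
    "information": lambda: metrics["common_ratio"] <= 0.8,
    "time": lambda: metrics["mean_time"] >= 5.0,
}

all_four = is_qualified(checks, checks)               # uses every condition
two_only = is_qualified(checks, ["accuracy", "time"])  # uses a chosen subset
```

With these sample values the target object fails when all four conditions are selected, because the common option proportion exceeds its threshold, but passes when only the first and fourth conditions are selected.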
According to the above method for judging the qualification of labeling data, the labeling data of each target object can be analyzed in four respects: correlation with the labeling mean value, correlation with the labeling data of the comparison object, common option proportion, and labeling time. The method thus uses several statistical algorithms to automatically and comprehensively judge whether the labeling data of the target object is qualified in terms of accuracy, similarity, information quantity, and labeling time; because the judgment takes multiple dimensions of the data into account, the accuracy of judging the qualification of the labeling data can be improved.
Corresponding to the method for judging the eligibility of the labeled data, an embodiment of the present invention further provides a device for judging the eligibility of the labeled data, as shown in fig. 4, where the device includes:
A data acquisition module 40, configured to acquire the labeling data of the target object.
A condition judgment module 41, configured to determine whether the labeling data of the target object meets a specified condition; wherein the specified condition includes a condition related to the labeling data of a comparison object; the labeling data of the comparison object and the labeling data of the target object are labeling data labeled for the same data set.
And the eligibility determination module 42 is configured to determine that the annotation data of the target object is eligible if the annotation data of the target object meets a specified condition.
The device for judging the qualification of the labeled data first acquires the labeling data of a target object, and then determines whether the labeling data of the target object meets specified conditions; the specified conditions include conditions related to the labeling data of the comparison object, and the labeling data of the comparison object and the labeling data of the target object are labeling data labeled for the same data set. If the labeling data of the target object meets the specified conditions, the labeling data of the target object is determined to be qualified; otherwise, it is determined to be unqualified. The device thus determines, based on the labeling data of the comparison object, whether the labeling data of the target object meets the specified conditions, and thereby whether the labeling data of the target object is qualified.
Specifically, the above-mentioned specified conditions include: a first correlation coefficient between the labeling data of the target object and the labeling mean value is greater than or equal to a first preset threshold value; the labeling mean value is the mean value of the labeling data of the target object and the labeling data of the comparison object; and/or the second correlation coefficient between the labeling data of the target object and the labeling data of the comparison object is greater than or equal to a second preset threshold value.
In a specific implementation, the data set includes a plurality of data to be processed; the labeling data are the labeling results of each piece of data to be processed in the plurality of pieces of data to be processed. The apparatus further includes a first correlation coefficient determining module, configured to: calculate, for each piece of data to be processed, the mean value of the labeling result of the target object for the current data to be processed and the labeling result of the comparison object for the current data to be processed, where the labeling mean value comprises the mean value corresponding to each piece of data to be processed; arrange the mean values corresponding to each piece of data to be processed in the labeling mean value to obtain a first sequence; arrange the labeling data of the target object into a second sequence according to the order of the data to be processed in the first sequence; and calculate the SROCC correlation coefficient between the first sequence and the second sequence, and determine the SROCC correlation coefficient as the first correlation coefficient.
Further, the apparatus further includes a second correlation coefficient determining module, configured to: calculating the Manhattan distance between the labeling data of the target object and the labeling data of the comparison object; the manhattan distance is determined as a second correlation coefficient.
In a specific implementation, there are multiple comparison objects; the second correlation coefficient determining module is further configured to: calculate the Manhattan distance between the labeling data of the target object and the labeling data of each comparison object to obtain a plurality of Manhattan distances; the above-mentioned specified conditions include: each of the plurality of Manhattan distances is greater than or equal to the second preset threshold.
Further, the above-mentioned specified conditions further include: the common option proportion of the target object is smaller than or equal to a third preset threshold; wherein, the common option proportion is determined by the marking data of the target object.
In a specific implementation, the annotation data comprises one of a plurality of preset annotation options; the apparatus further comprises a ratio determination module configured to: counting options in the labeling data of the target object to obtain the use times of each option; and determining the quotient of the maximum value of the using times and the total quantity of the labeling data as a common option proportion.
Specifically, the labeling data includes a labeling time; the above-mentioned specified conditions further include: the average labeling time of the target object is greater than or equal to a fourth preset threshold value; or, among the specified data, that is, the labeling data of the target object whose labeling time is less than a fifth preset threshold, the average labeling time is greater than or equal to a sixth preset threshold.
The implementation principle and the technical effect of the device for judging the qualification of the label data provided by the embodiment of the invention are the same as those of the embodiment of the method for judging the qualification of the label data, and for the sake of brief description, corresponding contents in the embodiment of the method can be referred to where the embodiment of the device is not mentioned.
An embodiment of the present invention further provides an electronic device, as shown in fig. 5, the electronic device includes a processor 101 and a memory 100, where the memory 100 stores machine executable instructions that can be executed by the processor 101, and the processor executes the machine executable instructions to implement the method for determining eligibility of the labeled data.
Further, the electronic device shown in fig. 5 further includes a bus 102 and a communication interface 103, and the processor 101, the communication interface 103, and the memory 100 are connected through the bus 102.
The Memory 100 may include a high-speed Random Access Memory (RAM) and may further include a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. The communication connection between the network element of the system and at least one other network element is realized through at least one communication interface 103 (which may be wired or wireless), and the internet, a wide area network, a local network, a metropolitan area network, and the like can be used. The bus 102 may be an ISA bus, PCI bus, EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in FIG. 5, but this does not indicate only one bus or one type of bus.
The processor 101 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 101. The Processor 101 may be a general-purpose Processor, and includes a Central Processing Unit (CPU), a Network Processor (NP), and the like; the device can also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, a discrete Gate or transistor logic device, or a discrete hardware component. The various methods, steps and logic blocks disclosed in the embodiments of the present invention may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in ram, flash memory, rom, prom, or eprom, registers, etc. storage media as is well known in the art. The storage medium is located in the memory 100, and the processor 101 reads the information in the memory 100, and completes the steps of the method of the foregoing embodiment in combination with the hardware thereof.
The embodiment of the present invention further provides a machine-readable storage medium, where the machine-readable storage medium stores machine-executable instructions, and when the machine-executable instructions are called and executed by a processor, the machine-executable instructions cause the processor to implement the method for determining eligibility of the labeled data.
The computer program product of the method, the apparatus, and the electronic device for determining eligibility of annotation data provided in the embodiments of the present invention includes a computer-readable storage medium storing program code, where instructions included in the program code may be used to execute the method described in the foregoing method embodiments; for the specific implementation, reference may be made to the method embodiments, which will not be described herein again.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, an electronic device, or a network device) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Finally, it should be noted that: the above-mentioned embodiments are only specific embodiments of the present invention, which are used for illustrating the technical solutions of the present invention and not for limiting the same, and the protection scope of the present invention is not limited thereto, although the present invention is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive the technical solutions described in the foregoing embodiments or equivalent substitutes for some technical features within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they should be construed as being included therein. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (11)

1. A method for judging the qualification of labeled data is characterized by comprising the following steps:
acquiring the labeling data of a target object;
determining whether the labeling data of the target object meets a specified condition; wherein the specified condition includes a condition related to the labeling data of a comparison object; the labeling data of the comparison object and the labeling data of the target object are: labeling data labeled for the same data set;
and if the labeling data of the target object meets the specified condition, determining that the labeling data of the target object is qualified.
2. The method of claim 1, wherein the specified condition comprises:
a first correlation coefficient between the labeling data of the target object and the labeling mean value is greater than or equal to a first preset threshold value; wherein the labeled mean is: the mean value of the labeling data of the target object and the labeling data of the comparison object;
and/or a second correlation coefficient between the labeling data of the target object and the labeling data of the comparison object is greater than or equal to a second preset threshold value.
3. The method of claim 2, wherein the data set includes a plurality of data to be processed; the labeled data is as follows: labeling results of each piece of data to be processed in the plurality of pieces of data to be processed; the first correlation coefficient is determined by:
calculating the average value of the labeling result of the target object to the current data to be processed and the labeling result of the contrast object to the current data to be processed aiming at each data to be processed; the marked mean value comprises a mean value corresponding to each piece of data to be processed;
arranging the mean values corresponding to each piece of data to be processed in the marked mean values to obtain a first sequence;
arranging the marking data of the target object into a second sequence according to the sequence of the data to be processed in the first sequence;
and calculating the SROCC correlation coefficient between the first sequence and the second sequence, and determining the SROCC correlation coefficient as the first correlation coefficient.
4. The method of claim 2, wherein the second correlation coefficient is determined by:
calculating the Manhattan distance between the labeling data of the target object and the labeling data of the comparison object; determining the manhattan distance as the second correlation coefficient.
5. The method of claim 4, wherein there are a plurality of the comparison objects;
the step of calculating the Manhattan distance between the labeling data of the target object and the labeling data of the comparison object includes:
calculating the Manhattan distance between the labeling data of the target object and the labeling data of each comparison object to obtain a plurality of Manhattan distances;
the specified conditions include: each of the manhattan distances in the plurality of manhattan distances is greater than or equal to the second preset threshold.
6. The method of claim 1, wherein the specified conditions further comprise:
the common option proportion of the target object is smaller than or equal to a third preset threshold; wherein the common option proportion is determined by the labeling data of the target object.
7. The method of claim 6, wherein the annotation data comprises one of a plurality of preset annotation options; the common option ratio is determined by the following method:
counting options in the labeling data of the target object to obtain the use times of each option;
and determining the quotient of the maximum value of the using times and the total quantity of the labeling data as the common option proportion.
8. The method of any of claims 1-7, wherein the annotation data comprises an annotation time;
the specified conditions further include: the average labeling time of the target object is greater than or equal to a fourth preset threshold value;
or, in the specified data of which the marking time of the marking data of the target object is less than a fifth preset threshold, the average value of the marking time of the specified data is greater than or equal to a sixth preset threshold.
9. An apparatus for judging the eligibility of label data, comprising:
the data acquisition module is used for acquiring the marking data of the target object;
the condition judgment module is used for determining whether the labeling data of the target object meets a specified condition; wherein the specified condition includes a condition related to the labeling data of a comparison object; the labeling data of the comparison object and the labeling data of the target object are: labeling data labeled for the same data set;
and the eligibility determination module is used for determining that the labeling data of the target object is qualified if the labeling data of the target object meets the specified condition.
10. An electronic device comprising a processor and a memory, the memory storing machine executable instructions executable by the processor, the processor executing the machine executable instructions to implement the method of eligibility determination of annotation data according to any one of claims 1 to 8.
11. A machine-readable storage medium having stored thereon machine-executable instructions which, when invoked and executed by a processor, cause the processor to carry out the method of eligibility determination of annotation data of any one of claims 1 to 8.
CN202010868165.6A 2020-08-25 2020-08-25 Method and device for judging eligibility of annotation data and electronic equipment Active CN111966674B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010868165.6A CN111966674B (en) 2020-08-25 2020-08-25 Method and device for judging eligibility of annotation data and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010868165.6A CN111966674B (en) 2020-08-25 2020-08-25 Method and device for judging eligibility of annotation data and electronic equipment

Publications (2)

Publication Number Publication Date
CN111966674A true CN111966674A (en) 2020-11-20
CN111966674B CN111966674B (en) 2024-03-15

Family

ID=73390973

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010868165.6A Active CN111966674B (en) 2020-08-25 2020-08-25 Method and device for judging eligibility of annotation data and electronic equipment

Country Status (1)

Country Link
CN (1) CN111966674B (en)



Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112988727A (en) * 2021-03-25 2021-06-18 北京百度网讯科技有限公司 Data annotation method, device, equipment, storage medium and computer program product
US11604766B2 (en) 2021-03-25 2023-03-14 Beijing Baidu Netcom Science And Technology Co., Ltd. Method, apparatus, device, storage medium and computer program product for labeling data

Also Published As

Publication number Publication date
CN111966674B (en) 2024-03-15


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant