CN111143333A - Method, device and equipment for processing labeled data and computer readable storage medium

Info

Publication number: CN111143333A
Application number: CN201811313048.2A
Authority: CN
Inventors: 黄铭哲; 颜钦钦; 高良才; 汤帜
Original assignee: Pku Founder Information Industry Group Co ltd; Peking University; Peking University Founder Group Co Ltd
Current assignee: Peking University
Priority date: 2018-11-06
Filing date: 2018-11-06
Publication date: 2020-05-12
Anticipated expiration: 2038-11-06
Also published as: CN111143333B

Abstract

The embodiments of the invention provide a method, an apparatus, and a device for processing labeled data, and a computer-readable storage medium. The method obtains at least one group of labeled data whose labeling-area similarity is greater than a preset threshold, each group being a group of data to be cleaned; determines a new labeling area and a new labeling category for each group of data to be cleaned; and cleans each group according to its new labeling area and new labeling category. Repeated data and difference data in the labeled data can thus be identified and cleaned automatically, which improves the effectiveness of the labeled data.

Description

Method, device and equipment for processing labeled data and computer readable storage medium

Technical Field

The embodiment of the invention relates to the technical field of digital publishing, in particular to a method, a device and equipment for processing labeled data and a computer readable storage medium.

Background

As research on digital information resources deepens and deep learning is applied ever more widely in the field of digital publishing, data labeling has become a very important task. The accuracy and efficiency of page object labeling have also become factors that limit the performance of deep learning models.

At present, a page labeling system provides a page labeling function and stores the labeled data that users produce for page objects in a database. However, on some complex pages, page objects overlap, nest, or contain one another, so the labeled data produced by labeling the same page multiple times contains repeated data and difference data; for example, the position and size of the labeling area, the labeling category, and the like may deviate. In addition, when multiple users label the same page, different users label the same page object differently, so the labeled data also contains a large amount of repeated data and difference data.

Disclosure of Invention

The embodiment of the invention provides a method, a device and equipment for processing labeled data and a computer readable storage medium, which are used for solving the problem that a large amount of difference data and repeated data exist in labeled data of a page labeling system.

One aspect of the embodiments of the present invention is to provide a method for processing annotation data, including:

acquiring at least one group of labeled data of which the similarity of the labeled areas is greater than a preset threshold, wherein each group of labeled data is a group of data to be cleaned;

determining a new labeling area and a new labeling category of each group of data to be cleaned;

and cleaning each group of data to be cleaned according to the new labeling area and the new labeling category of each group of data to be cleaned.

Another aspect of the embodiments of the present invention is to provide an annotation data processing apparatus, including:

a to-be-cleaned data acquisition module, configured to acquire at least one group of labeled data whose labeling-area similarity is greater than a preset threshold, wherein each group of labeled data is a group of data to be cleaned;

a determining module, configured to determine a new labeling area and a new labeling category of each group of data to be cleaned;

and a cleaning processing module, configured to clean each group of data to be cleaned according to its new labeling area and new labeling category.

Another aspect of the embodiments of the present invention is to provide an annotation data processing device, including:

a memory, a processor, and a computer program stored on the memory and executable on the processor,

when the processor runs the computer program, the annotated data processing method described in any one of the above is realized.

It is another aspect of an embodiment of the present invention to provide a computer-readable storage medium, storing a computer program,

the computer program realizes the above-mentioned annotation data processing method when being executed by a processor.

According to the method, apparatus, and device for processing labeled data and the computer-readable storage medium provided by the embodiments of the invention, at least one group of labeled data whose labeling-area similarity is greater than a preset threshold is obtained, each group being a group of data to be cleaned; a new labeling area and a new labeling category are determined for each group of data to be cleaned; and each group is cleaned according to its new labeling area and new labeling category. Repeated data and difference data in the labeled data can thus be identified and cleaned automatically, which improves the effectiveness of the labeled data.

Drawings

FIG. 1 is a flowchart of a method for processing annotation data according to a first embodiment of the present invention;

FIG. 2 is a flowchart of an annotation data processing method according to a second embodiment of the present invention;

FIG. 3 is a flowchart of another annotation data processing method according to the second embodiment of the present invention;

FIG. 4 is a schematic structural diagram of an annotation data processing apparatus according to a third embodiment of the present invention;

FIG. 5 is a schematic structural diagram of an annotation data processing device according to a fifth embodiment of the present invention.

With the above figures, certain embodiments of the invention have been illustrated and described in more detail below. The drawings and written description are not intended to limit the scope of the inventive concepts in any way, but rather to illustrate the inventive concepts to those skilled in the art by reference to specific embodiments.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with embodiments of the invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of embodiments of the invention, as detailed in the following claims.

The terms "first", "second", etc. referred to in the embodiments of the present invention are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. In the description of the following examples, "plurality" means two or more unless specifically limited otherwise.

The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present invention will be described below with reference to the accompanying drawings.

Example one

FIG. 1 is a flowchart of a method for processing annotation data according to a first embodiment of the present invention. This embodiment provides a method for processing labeled data that addresses the large amount of difference data and repeated data in the labeled data of a page labeling system. The method in this embodiment is applied to a labeled data processing device, which may be a mobile terminal such as a smartphone or a tablet computer, or a server device. In other embodiments, the method may also be applied to other devices; this embodiment uses the labeled data processing device as an illustrative example. As shown in FIG. 1, the method comprises the following specific steps:

Step S101, obtaining at least one group of labeled data whose labeling-area similarity is greater than a preset threshold, wherein each group of labeled data is a group of data to be cleaned.

The labeling data at least comprises identification information, labeling categories, labeling areas and the like. The annotation data may also include annotation page object information and other information. The information included in the annotation data can be different for different page annotation systems.

The identification information of the annotation data may be information such as a number for uniquely identifying one piece of annotation data, and may be used as an index of the annotation data.

The labeling category indicates the type of the labeled page object, which may include a header, a footer, a title, a body paragraph, a formula, a table, and the like; the type of the page object is not specifically limited in this embodiment.

In practical application, each time a user labels a page object in the page labeling system, the system generates a corresponding piece of labeled data with unique identification information and stores it in a database.

On some complex pages, page objects overlap, nest, or contain one another, so the labeled data produced by labeling the same page multiple times contains repeated data and difference data; for example, the position and size of the labeling area, the labeling category, and the like may deviate. In addition, when multiple users label the same page, different users label the same page object differently, so the labeled data also contains a large amount of repeated data and difference data.

In this embodiment, the similarity of the labeling areas is calculated for any two pieces of labeled data in the database, yielding one or more groups of labeled data whose labeling-area similarity is greater than the preset threshold. Each group obtained is either repeated data, whose labeling areas are highly similar and whose labeling categories are the same, or difference data, whose labeling areas are highly similar but whose labeling categories differ; in both cases it is data to be cleaned.

Specifically, one possible implementation of this step is as follows:

calculating the similarity of the labeling areas of any two pieces of labeled data; and determining at least one group of data to be cleaned according to these similarities, wherein each group of data to be cleaned includes at least two pieces of labeled data, and the similarity of the labeling areas of any two pieces of labeled data within a group is greater than the preset threshold.

Optionally, the similarity between the two labeled regions may be equal to the ratio of the area of the overlapped part of the two labeled regions to the total area of the regions covered by the two labeled regions. The area covered by the two labeling areas is the union of the two labeling areas.

The preset threshold is a value greater than 0 and less than 1, and the preset threshold may be set by a technician according to actual needs, which is not specifically limited in this embodiment.
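The similarity measure and the grouping condition described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: it assumes that labeling areas are axis-aligned rectangles given as (left, top, right, bottom), and it uses a simple greedy grouping strategy. The names `region_similarity` and `group_for_cleaning` are hypothetical.

```python
from typing import List, Tuple

# Assumed box format: (left, top, right, bottom) in page coordinates.
Box = Tuple[float, float, float, float]


def region_similarity(a: Box, b: Box) -> float:
    """Overlap area divided by the area of the union of the two regions."""
    overlap_w = min(a[2], b[2]) - max(a[0], b[0])
    overlap_h = min(a[3], b[3]) - max(a[1], b[1])
    if overlap_w <= 0 or overlap_h <= 0:
        return 0.0  # the regions do not overlap
    overlap = overlap_w * overlap_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return overlap / (area_a + area_b - overlap)


def group_for_cleaning(boxes: List[Box], threshold: float) -> List[List[int]]:
    """Greedy grouping (an assumption): a box joins a group only if its
    similarity to every box already in that group exceeds the threshold."""
    groups: List[List[int]] = []
    for i, box in enumerate(boxes):
        for group in groups:
            if all(region_similarity(box, boxes[j]) > threshold for j in group):
                group.append(i)
                break
        else:
            groups.append([i])
    # Only groups with at least two pieces of labeled data need cleaning.
    return [g for g in groups if len(g) >= 2]
```

With a threshold of 0.8, two boxes that differ by one row out of ten (similarity 0.9) would fall into one group, while a box elsewhere on the page would be left alone.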

Step S102, determining a new labeling area and a new labeling category of each group of data to be cleaned.

After one or more groups of data to be cleaned are determined, new labeling areas and labeling categories of each group of data to be cleaned are respectively determined.

For any group of data to be cleaned, if the labeling types of all the labeled data in the group of data to be cleaned are the same, the group of data to be cleaned is the repeated data. In this step, for a group of repeating data, only the new labeling area of the group of repeating data needs to be determined again, and the original labeling category is retained.

For any group of data to be cleaned, if the group contains labeled data with different labeling categories, the group is difference data. In this step, both a new labeling area and a new labeling category need to be determined for a group of difference data.
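The distinction between repeated data and difference data reduces to checking whether a group contains more than one labeling category. A minimal sketch, in which the function name `group_kind` and the string labels are hypothetical:

```python
from typing import List


def group_kind(categories: List[str]) -> str:
    """Return "repeated" if all labeled data in the group share one
    category, otherwise "difference"."""
    return "repeated" if len(set(categories)) == 1 else "difference"
```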

Step S103, cleaning each group of data to be cleaned according to the new labeling area and the new labeling category of each group of data to be cleaned.

And for any group of data to be cleaned, cleaning the group of data to be cleaned in the database after determining the new labeling area and the new labeling category of the group of data to be cleaned.

Specifically, a new piece of labeled data corresponding to the group of data to be cleaned may be added to the database, the labeled area and the labeled category of the new piece of labeled data are the new labeled area and the new labeled category of the group of data to be cleaned, respectively, and the other information of the new piece of labeled data may be obtained by merging and sorting the other information of all labeled data in the group of data to be cleaned; and deleting the group of data to be cleaned in the database, so that the repeated data and the differential data in the database can be cleaned.
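The add-then-delete cleaning operation can be illustrated with an in-memory dictionary standing in for the database. This is only a sketch under stated assumptions: the record fields (`area`, `category`, `user`), the id scheme, and the way the other information is merged (collecting the labeling users) are all hypothetical and are not specified by the embodiment.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]


def clean_group(db: Dict[int, dict], group_ids: List[int],
                new_area: Box, new_category: str) -> int:
    """Replace a group of labeled data with one merged record; return its id."""
    new_id = max(db) + 1  # hypothetical id scheme: next unused integer
    # Merge the "other information" of the group; here, only the users.
    users = sorted({db[i]["user"] for i in group_ids})
    for i in group_ids:
        del db[i]  # delete the group of data to be cleaned
    db[new_id] = {"area": new_area, "category": new_category, "users": users}
    return new_id
```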

This embodiment obtains at least one group of labeled data whose labeling-area similarity is greater than a preset threshold, each group being a group of data to be cleaned; determines a new labeling area and a new labeling category for each group of data to be cleaned; and cleans each group according to its new labeling area and new labeling category. Repeated data and difference data in the labeled data can thus be identified and cleaned automatically, which improves the effectiveness of the labeled data.

Example two

FIG. 2 is a flowchart of a labeled data processing method according to a second embodiment of the present invention; FIG. 3 is a flowchart of another annotation data processing method according to the second embodiment of the invention. On the basis of the first embodiment, in this embodiment, as shown in fig. 2, determining a new labeled area of each set of data to be cleaned specifically includes the following steps:

Step S201, dividing the coverage area of all the labeled data of the group of data to be cleaned into a plurality of grids.

In this embodiment, in order to determine the new labeled area of the set of data to be cleaned, the coverage area of all labeled data of the set of data to be cleaned may be divided into a plurality of grids.

In addition, because the labeling areas of all labeled data in the group of data to be cleaned are highly similar, all labeled data in the group lie on the same page, so the entire page may instead be divided into evenly distributed grids; for example, a page may be divided into 1000 x 2000 grids.

Step S202, calculating the number of times the group of data to be cleaned labels each grid.

In this embodiment, the labeling count of each grid is calculated according to the grids covered by the labeling area of each piece of labeled data in the group of data to be cleaned.

Specifically, after the page corresponding to the group of data to be cleaned is divided into grids, the labeling count of each grid is initialized to 0. Each piece of labeled data in the group is then processed in turn, and the labeling count of every grid covered by its labeling area is incremented by 1. After all labeled data in the group has been processed, the labeling count of each grid for the group is obtained.

Optionally, when each piece of labeled data in the set of data to be cleaned is subjected to statistical processing, the mesh covered by the labeled area of the labeled data may be determined according to the projection coordinate of the labeled area of the labeled data projected onto the page.

Alternatively, a data structure such as a two-dimensional vector may be used to record the labeling times of all grids.
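The counting procedure of step S202 can be sketched with a list of lists as the two-dimensional structure. The sketch assumes that each labeling area has already been projected onto grid coordinates as a rectangle (left, top, right, bottom) with exclusive right and bottom edges; the function name `count_labels` is hypothetical.

```python
from typing import List, Tuple

# Assumed format: (left, top, right, bottom) in grid cells,
# with right and bottom exclusive.
Box = Tuple[int, int, int, int]


def count_labels(boxes: List[Box], cols: int, rows: int) -> List[List[int]]:
    """Return counts[row][col], the number of labeling areas covering each grid."""
    counts = [[0] * cols for _ in range(rows)]  # every count starts at 0
    for left, top, right, bottom in boxes:
        # Clamp to the page so out-of-range boxes do not raise errors.
        for y in range(max(top, 0), min(bottom, rows)):
            for x in range(max(left, 0), min(right, cols)):
                counts[y][x] += 1
    return counts
```

For two 2 x 2 boxes offset by one cell on a 3 x 3 page, only the centre grid is counted twice.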

Step S203, determining a connected region in the coverage area whose labeling count is greater than a count threshold as the new labeling area of the group of data to be cleaned.

In this embodiment, after the labeling count of each grid is obtained, the maximum labeling count over all grids can be determined by comparing the labeling counts of the grids, and the count threshold is determined from this maximum and a preset threshold.

Optionally, the count threshold may be equal to the product of the maximum labeling count and the preset threshold.

In addition, the count threshold may be set by a technician according to actual needs, which is not specifically limited in this embodiment.

After the count threshold is determined, the region formed by the grids whose labeling count is greater than the count threshold is determined among all grids corresponding to the group of data to be cleaned, and a connected region within it is taken as the new labeling area of the group.

Optionally, if the region formed by the grids whose labeling count is greater than the count threshold includes a plurality of connected regions, the connected region with the largest area is taken as the new labeling area of the group of data to be cleaned.
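Step S203 can be sketched as a threshold followed by a search for the largest connected region. The sketch below takes the count threshold as the product of the maximum labeling count and a preset threshold, and it assumes 4-connectivity between grids, which the embodiment does not specify. The function name `new_label_region` is hypothetical.

```python
from collections import deque
from typing import List, Set, Tuple

Cell = Tuple[int, int]  # (row, col)


def new_label_region(counts: List[List[int]], preset_threshold: float) -> Set[Cell]:
    """Return the largest connected set of grids whose labeling count
    exceeds max_count * preset_threshold."""
    rows, cols = len(counts), len(counts[0])
    count_threshold = max(max(row) for row in counts) * preset_threshold
    keep = {(r, c) for r in range(rows) for c in range(cols)
            if counts[r][c] > count_threshold}
    best: Set[Cell] = set()
    seen: Set[Cell] = set()
    for start in sorted(keep):
        if start in seen:
            continue
        # Breadth-first search over 4-connected neighbours (an assumption).
        region, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            r, c = queue.popleft()
            for nb in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                if nb in keep and nb not in seen:
                    seen.add(nb)
                    region.add(nb)
                    queue.append(nb)
        if len(region) > len(best):
            best = region
    return best
```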

This embodiment divides the coverage area of all labeled data of the group of data to be cleaned into a plurality of grids, calculates the number of times the group labels each grid, and determines a connected region in the coverage area whose labeling count is greater than the count threshold as the new labeling area of the group. The area labeled most often by the labeled data in the group is thereby used as the new labeling area, which improves the accuracy of the new labeling area.

In another implementation manner of this embodiment, determining a new label category of a group of data to be cleaned includes the following two cases:

The first case: if the labeling categories of all labeled data in the group of data to be cleaned are the same, that labeling category is taken as the new labeling category.

In the first case, the labeling areas of all labeled data in the group are highly similar and the labeling categories are the same, which indicates that the group is repeated data; only the new labeling area of the group needs to be determined, and the original labeling category is retained.

The second case: if the group of data to be cleaned contains labeled data with different labeling categories, a composite labeling probability score is calculated for each labeling category corresponding to the group, and the labeling category with the highest composite score is taken as the new labeling category of the group.

In the second case, the labeling areas of all labeled data in the group are highly similar but the labeling categories differ, which indicates that the group is difference data; both the new labeling area and the new labeling category of the group need to be determined.

As shown in fig. 3, the specific steps of determining a new label category of a group of data to be cleaned are as follows:

Step S301, calculating, through a classification model, the probability of each labeling category for the new labeling area of the group of data to be cleaned.

In this embodiment, a preset classification model may be used to classify the new labeling area, so as to obtain the probability that the labeled data corresponding to the new labeling area belongs to each labeling category.

The preset classification model may be a classifier for identifying the category of the page object in a designated area. The position information of the new labeling area is input into the classification model, which calculates and outputs the probability that the page object in the new labeling area belongs to each labeling category.

Step S302, calculating the occurrence probability of each labeling category corresponding to the group of data to be cleaned.

In this embodiment, the occurrence probability of any labeling category corresponding to the group of data to be cleaned may be calculated using formula two:

p_h(c) = n_c / N    formula two

where c represents a labeling category, p_h(c) represents the occurrence probability of labeling category c in the group of data to be cleaned, n_c represents the number of times the group labels the page object as category c, that is, the number of pieces of labeled data in the group whose labeling category is c, and N represents the total number of times the group labels the page object.

The annotation category c can be any annotation category, for example, a certain annotation category in header, footer, title, text paragraph, formula, table, and picture.

In addition, the occurrence probabilities of all labeling categories corresponding to the group of data to be cleaned sum to 1, which can be expressed by formula three:

∑_{c∈Category} p_h(c) = 1    formula three

where c represents a labeling category, p_h(c) represents the occurrence probability of labeling category c in the group of data to be cleaned, and Category is the set of all labeling categories corresponding to the group of data to be cleaned.
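The occurrence probability is the relative frequency of each labeling category within the group, so the probabilities sum to 1 by construction. A minimal sketch, in which the function name `occurrence_probability` is hypothetical:

```python
from collections import Counter
from typing import Dict, List


def occurrence_probability(categories: List[str]) -> Dict[str, float]:
    """Map each labeling category c to n_c / N for one group of data to be cleaned."""
    total = len(categories)  # N: total number of times the page object is labeled
    return {c: n / total for c, n in Counter(categories).items()}
```

For a group labeled "table" three times and "formula" once, the occurrence probabilities would be 0.75 and 0.25.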

Step S303, calculating the composite labeling probability score of each labeling category according to the probability of each labeling category for the new labeling area and the occurrence probability of each labeling category corresponding to the group of data to be cleaned.

Specifically, the composite labeling probability score of each labeling category is calculated using formula one:

p_o(c) = w_h × p_h(c) + w_m × p_m(c)    formula one

where c represents a labeling category, p_o(c) represents the composite labeling probability score of labeling category c, p_m(c) represents the probability that the new labeling area corresponds to labeling category c, p_h(c) represents the occurrence probability of labeling category c in the group of data to be cleaned, and w_h and w_m are weight coefficients.

In this embodiment, w_h is the weight coefficient for the manual labeling of the page object category, as represented by the group of data to be cleaned, and w_m is the weight coefficient for the classification model's labeling of the page object category. w_h and w_m may be set by a technician according to actual needs and are not specifically limited in this embodiment.

Step S304, taking the labeling category with the highest composite labeling probability score as the new labeling category of the group of data to be cleaned.

After the composite labeling probability score of each labeling category is calculated, the scores are compared, and the labeling category with the highest score is determined as the new labeling category of the group of data to be cleaned.
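Steps S303 and S304 amount to a weighted sum followed by an arg-max. In the sketch below, the classifier output `p_m` is supplied as a plain dictionary because the classification model itself is not specified here; the weights in the usage note are arbitrary illustrative values, and the alphabetical tie-break is an assumption. The function name `new_category` is hypothetical.

```python
from typing import Dict


def new_category(p_h: Dict[str, float], p_m: Dict[str, float],
                 w_h: float, w_m: float) -> str:
    """Return the category c maximizing w_h * p_h(c) + w_m * p_m(c)."""
    categories = set(p_h) | set(p_m)
    # A category missing from one source contributes probability 0 there.
    scores = {c: w_h * p_h.get(c, 0.0) + w_m * p_m.get(c, 0.0)
              for c in categories}
    # Sorting first makes ties resolve alphabetically (an assumption).
    return max(sorted(scores), key=scores.get)
```

With p_h = {table: 0.75, formula: 0.25} and p_m = {table: 0.1, formula: 0.8, title: 0.1}, equal weights of 0.5 would favour "formula" (0.525 against 0.425), whereas w_h = 0.8 and w_m = 0.2 would favour "table" (0.62 against 0.36).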

This embodiment provides a specific implementation for determining the new labeling category of difference data: the probability of each labeling category for the new labeling area of the group of data to be cleaned is calculated through a classification model; the occurrence probability of each labeling category corresponding to the group is calculated; the composite labeling probability score of each labeling category is calculated from these two probabilities; and the labeling category with the highest composite score is taken as the new labeling category of the group, which improves the accuracy of the new labeling category.

Example three

Fig. 4 is a schematic structural diagram of an annotation data processing apparatus according to a third embodiment of the present invention. The annotated data processing apparatus provided in the embodiment of the present invention may execute the processing procedure provided in the embodiment of the annotated data processing method. As shown in fig. 4, the apparatus 40 includes: a data to be cleaned acquisition module 401, a determination module 402 and a cleaning processing module 403.

Specifically, the to-be-cleaned data obtaining module 401 is configured to obtain at least one set of labeled data with the labeled area similarity greater than a preset threshold, where each set of labeled data is a set of to-be-cleaned data.

The determining module 402 is configured to determine a new labeled area and a new labeled category of each set of data to be cleaned.

The cleaning processing module 403 is configured to perform cleaning processing on each set of data to be cleaned according to the new labeled area and the new labeled category of each set of data to be cleaned.

The apparatus provided in the embodiment of the present invention may be specifically configured to execute the method embodiment provided in the first embodiment, and specific functions are not described herein again.

Example four

On the basis of the third embodiment, in this embodiment, the determining module is further configured to:

divide the coverage area of all labeled data of the group of data to be cleaned into a plurality of grids; calculate the number of times the group labels each grid; and determine a connected region in the coverage area whose labeling count is greater than a count threshold as the new labeling area of the group.

Optionally, the determining module is further configured to:

determine the maximum among the labeling counts of all grids; and determine the count threshold as the product of the maximum and a preset threshold.

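Optionally, the to-be-cleaned data acquisition module is further configured to:

calculate the similarity of the labeling areas of any two pieces of labeled data; and determine at least one group of data to be cleaned according to these similarities, wherein each group includes at least two pieces of labeled data, and the similarity of the labeling areas of any two pieces of labeled data within a group is greater than the preset threshold.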

Optionally, the determining module is further configured to:

if the group of data to be cleaned contains labeled data with different labeling categories, calculate a composite labeling probability score for each labeling category corresponding to the group; and take the labeling category with the highest composite score as the new labeling category of the group.

Optionally, the determining module is further configured to:

calculating the probability of each labeling category corresponding to the new labeling area through a classification model according to the new labeling area of the group of data to be cleaned; calculating the occurrence probability of each label category corresponding to the group of data to be cleaned; and calculating the marking probability comprehensive score corresponding to each marking category by adopting the following formula:

p_o(c)＝w_h×p_h(c)+w_m×p_m(c)

Optionally, the determining module is further configured to:

and if the labeling categories of all the labeling data in the group of data to be cleaned are the same, taking the labeling category of the labeling data in the group of data to be cleaned as a new labeling category.

The apparatus provided in the embodiment of the present invention may be specifically configured to execute the method embodiment provided in the second embodiment, and specific functions are not described herein again.

This embodiment divides the coverage area of all labeled data of the group of data to be cleaned into a plurality of grids, calculates the number of times the group labels each grid, and determines a connected region in the coverage area whose labeling count is greater than the count threshold as the new labeling area of the group. The area labeled most often by the labeled data in the group is thereby used as the new labeling area, which improves the accuracy of the new labeling area. Further, the probability of each labeling category for the new labeling area is calculated through a classification model; the occurrence probability of each labeling category corresponding to the group is calculated; the composite labeling probability score of each labeling category is calculated from these two probabilities; and the labeling category with the highest composite score is taken as the new labeling category of the group, which improves the accuracy of the new labeling category.

Example five

FIG. 5 is a schematic structural diagram of an annotation data processing device according to a fifth embodiment of the present invention. As shown in FIG. 5, the annotation data processing device 50 includes: a processor 501, a memory 502, and a computer program stored on the memory 502 and executable by the processor 501.

The processor 501, when executing the computer program stored on the memory 502, implements the annotation data processing method provided by any of the method embodiments described above.

In addition, an embodiment of the present invention further provides a computer-readable storage medium, in which a computer program is stored, and when the computer program is executed by a processor, the method for processing annotation data provided in any of the above method embodiments is implemented.

In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units is only a logical division, and other divisions are possible in practice: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or in another form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.

The integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions that enable a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

It is obvious to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to perform all or part of the above described functions. For the specific working process of the device described above, reference may be made to the corresponding process in the foregoing method embodiment, which is not described herein again.

Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.

It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims

1. A method for processing annotation data, comprising:

2. The method according to claim 1, wherein the obtaining at least one group of labeled data with labeled region similarity greater than a preset threshold, each group of labeled data being a group of data to be cleaned, comprises:

calculating the similarity of the labeling areas of any two labeling data;

determining at least one group of data to be cleaned according to the similarity of the labeling areas of any two labeling data;

each group of data to be cleaned comprises at least two marking data, and the similarity of the marking areas of any two marking data in each group of data to be cleaned is greater than the preset threshold value.
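As a hedged illustration of the grouping in claim 2, the sketch below assumes that each labeling area is an axis-aligned rectangle `(x0, y0, x1, y1)` and that similarity is measured by intersection over union (IoU); this text does not fix the similarity measure, so IoU is an assumption. A labeling datum joins a group only if its similarity to every member already in the group exceeds the threshold, which preserves the pairwise condition stated above. The greedy assignment order and the names `iou` and `groups_to_clean` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two rectangles (x0, y0, x1, y1)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0


def groups_to_clean(boxes, threshold):
    """Greedily build groups in which every pair has IoU > threshold."""
    groups = []
    for box in boxes:
        for group in groups:
            # Join only if similar to every current member of the group.
            if all(iou(box, member) > threshold for member in group):
                group.append(box)
                break
        else:
            groups.append([box])
    # A group of data to be cleaned needs at least two labeling data.
    return [g for g in groups if len(g) >= 2]


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
print(groups_to_clean(boxes, threshold=0.5))
# [[(0, 0, 10, 10), (1, 1, 11, 11)]]
```

The first two rectangles overlap with an IoU of 81/119, about 0.68, so they form one group to be cleaned; the third rectangle overlaps nothing and is left out.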

3. The method of claim 1 or 2, wherein the determining a new labeled area of a set of data to be cleaned comprises:

dividing the coverage area of all the labeled data of the group of data to be cleaned into a plurality of grids;

calculating the labeling times of the group of data to be cleaned to each grid;

and determining the connected region with the labeling times larger than the times threshold value in the coverage area as a new labeling region of the group of data to be cleaned.

4. The method of claim 3, wherein before determining the connected region in the coverage area labeled more than the threshold number of times as a new labeled region of the set of data to be cleaned, the method further comprises:

determining the maximum times of the labeling times of each grid;

and determining the times threshold value according to the product of the maximum times and the preset threshold value.
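The threshold derivation in claim 4 amounts to a single multiplication, sketched below. The sketch assumes that the per-grid labeling times are held in a two-dimensional list and that the preset threshold value is a ratio between 0 and 1; both the data layout and the names `times_threshold`, `counts` and `ratio` are hypothetical.

```python
def times_threshold(counts, ratio):
    """Maximum labeling times over all grids, multiplied by the preset ratio."""
    max_times = max(max(row) for row in counts)
    return max_times * ratio


counts = [[0, 1, 2],
          [1, 3, 2]]
print(times_threshold(counts, ratio=0.5))  # 1.5
```

Here the most heavily labeled grid was labeled 3 times, so with a ratio of 0.5 only grids labeled more than 1.5 times (that is, 2 or 3 times) are kept when forming the connected region.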

5. The method of claim 3, wherein determining a new label category for a set of data to be cleaned comprises:

if the labeling categories of any two labeling data in the group of data to be cleaned are different, calculating a labeling probability comprehensive score for each labeling category corresponding to the group of data to be cleaned;

and taking the labeling category with the maximum labeling probability comprehensive score as a new labeling category of the group of data to be cleaned.

6. The method of claim 5, wherein the calculating a label probability composite score corresponding to each label category corresponding to the set of data to be cleaned comprises:

calculating the probability of each labeling category corresponding to the new labeling area through a classification model according to the new labeling area of the group of data to be cleaned;

calculating the occurrence probability of each label category corresponding to the group of data to be cleaned;

and calculating the marking probability comprehensive score corresponding to each marking category by adopting the following formula:

p_o(c) = w_h × p_h(c) + w_m × p_m(c)

wherein c represents a label category, p_o(c) represents the label probability composite score of label category c, p_m(c) represents the probability of the new labeled region corresponding to label category c, p_h(c) represents the probability of occurrence of label category c corresponding to the group of data to be cleaned, and w_h and w_m are weight coefficients.
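The weighted score of claim 6 can be illustrated with a minimal sketch under stated assumptions: the occurrence probability p_h(c) is taken as the relative frequency of category c among the labels in the group, the classification model output p_m(c) is supplied as a plain dictionary, and both weights default to 0.5. None of these choices is fixed by the claim, and the name `new_label_category` and its parameters are hypothetical.

```python
from collections import Counter


def new_label_category(labels, model_probs, w_h=0.5, w_m=0.5):
    """Return (best category, scores) using p_o(c) = w_h*p_h(c) + w_m*p_m(c).

    labels: categories assigned by the labeling data in the group.
    model_probs: dict mapping category -> probability from a classifier.
    """
    counts = Counter(labels)
    total = len(labels)
    scores = {}
    for c in set(counts) | set(model_probs):
        p_h = counts[c] / total          # assumed: relative frequency in group
        p_m = model_probs.get(c, 0.0)    # classifier probability for the region
        scores[c] = w_h * p_h + w_m * p_m
    best = max(scores, key=scores.get)
    return best, scores


best, scores = new_label_category(
    ["table", "table", "figure"],
    {"table": 0.3, "figure": 0.7},
)
print(best)  # figure
```

In this example two of the three labeling data say "table", but the assumed classifier output favours "figure" strongly enough that its composite score (about 0.517) exceeds that of "table" (about 0.483).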

7. The method of claim 3, wherein determining a new label category for a set of data to be cleaned comprises:

8. An annotation data processing apparatus, comprising:

9. An annotation data processing apparatus, comprising:

the processor, when executing the computer program, implements the method of any of claims 1-7.

10. A computer-readable storage medium, in which a computer program is stored,

the computer program, when executed by a processor, implementing the method of any one of claims 1-7.