CN111143333B

CN111143333B - Labeling data processing method, device, equipment and computer readable storage medium

Info

Publication number: CN111143333B
Application number: CN201811313048.2A
Authority: CN
Inventors: 黄铭哲; 颜钦钦; 高良才; 汤帜
Original assignee: Pku Founder Information Industry Group Co ltd; Peking University; Peking University Founder Group Co Ltd
Current assignee: Peking University
Priority date: 2018-11-06
Filing date: 2018-11-06
Publication date: 2023-06-09
Anticipated expiration: 2038-11-06
Also published as: CN111143333A

Abstract

The embodiments of the invention provide a method, an apparatus, a device, and a computer-readable storage medium for processing annotation data. The method obtains at least one group of labeling data whose labeling-area similarity is greater than a preset threshold, each group of labeling data being a group of data to be cleaned; determines a new labeling area and a new labeling category for each group of data to be cleaned; and cleans each group of data to be cleaned according to its new labeling area and new labeling category. Repeated data and difference data in the labeling data can thus be identified and cleaned automatically, which improves the effectiveness of the labeling data.

Description

Labeling data processing method, device, equipment and computer readable storage medium

Technical Field

The embodiment of the invention relates to the technical field of digital publishing, in particular to a method, a device, equipment and a computer readable storage medium for processing annotation data.

Background

With the continuous and deep research on digital information resources and the increasingly wide application of deep learning in the field of digital publishing, data annotation becomes a very important work. The accuracy and efficiency of the page object annotation also become factors that restrict the effect of the deep learning model.

At present, a page labeling system can provide a page labeling function, record a user's labeling data for page objects, and store the labeling data in a database. However, on some complex pages, page objects overlap, are nested, or contain one another, so when the same page is labeled multiple times the labeling data contains repeated data and difference data; for example, the position and size of the labeling area, the labeling category, and so on may deviate between labels. In addition, when multiple users label the same page, different users label the same page object differently, so the labeling data contains a large amount of repeated data and difference data.

Disclosure of Invention

The embodiment of the invention provides a method, a device, equipment and a computer readable storage medium for processing marking data, which are used for solving the problem that a large amount of difference data and repeated data exist in the marking data of a page marking system.

An aspect of an embodiment of the present invention provides a method for processing annotation data, including:

acquiring at least one group of labeling data whose labeling-area similarity is greater than a preset threshold, wherein each group of labeling data is a group of data to be cleaned;

determining a new labeling area and a new labeling category of each group of data to be cleaned;

and cleaning the data to be cleaned of each group according to the new labeling area and the new labeling category of the data to be cleaned of each group.

Another aspect of an embodiment of the present invention provides a labeling data processing apparatus, including:

the data to be cleaned acquisition module is used for acquiring at least one group of marking data with the similarity of the marking area being greater than a preset threshold value, and each group of marking data is one group of data to be cleaned;

the determining module is used for determining a new labeling area and a new labeling category of each group of data to be cleaned;

and the cleaning processing module is used for cleaning each group of data to be cleaned according to the new labeling area and the new labeling category of each group of data to be cleaned.

Another aspect of an embodiment of the present invention provides an annotation data processing apparatus, including:

a memory, a processor, and a computer program stored on the memory and executable on the processor,

the processor, when running the computer program, implements the annotation data processing method of any one of the above.

It is another aspect of embodiments of the present invention to provide a computer-readable storage medium, storing a computer program,

the computer program, when executed by a processor, implements the annotation data processing method described above.

According to the method, apparatus, device, and computer-readable storage medium for processing annotation data provided by the embodiments of the invention, at least one group of labeling data whose labeling-area similarity is greater than a preset threshold is obtained, each group of labeling data being a group of data to be cleaned; a new labeling area and a new labeling category are determined for each group of data to be cleaned; and each group of data to be cleaned is cleaned according to its new labeling area and new labeling category. Repeated data and difference data in the labeling data can thus be identified and cleaned automatically, which improves the effectiveness of the labeling data.

Drawings

FIG. 1 is a flowchart of a method for processing annotation data according to a first embodiment of the present invention;

FIG. 2 is a flowchart of a method for processing annotation data according to a second embodiment of the present invention;

FIG. 3 is a flowchart of another method for processing annotation data according to a second embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a labeling data processing apparatus according to a third embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a labeling data processing device according to a fifth embodiment of the present invention.

Specific embodiments of the present invention have been shown by way of the above drawings and will be described in more detail below. The drawings and the written description are not intended to limit the scope of the inventive embodiments in any way, but rather to illustrate the inventive concepts to those skilled in the art by reference to the specific embodiments.

Detailed Description

Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with embodiments of the invention. Rather, they are merely examples of apparatus and methods consistent with aspects of embodiments of the invention as detailed in the accompanying claims.

The terms "first," "second," and the like, according to embodiments of the present invention, are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. In the following description of the embodiments, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.

The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments. Embodiments of the present invention will be described below with reference to the accompanying drawings.

Example one

FIG. 1 is a flowchart of a method for processing annotation data according to the first embodiment of the present invention. This embodiment provides a method for processing annotation data that addresses the problem that a large amount of difference data and repeated data exist in the annotation data of a page annotation system. The method in this embodiment is applied to a labeling data processing device, which may be a mobile terminal such as a smartphone or a tablet computer, or a server device; in other embodiments, the method may also be applied to other devices. This embodiment takes the labeling data processing device as an example. As shown in FIG. 1, the method specifically comprises the following steps:

step S101, at least one group of marking data with the similarity of the marking area larger than a preset threshold value is obtained, and each group of marking data is a group of data to be cleaned.

The labeling data at least comprises identification information, labeling category, labeling area and the like. The annotation data may also include other information such as annotation page object information. The information included in the annotation data may be different for different page annotation systems.

The identification information of the labeling data may be information such as a number for uniquely identifying one piece of labeling data, and may be used as an index of the labeling data.

The types of labeled page objects may include a header, a footer, a heading, a text paragraph, a formula, a table, etc.; the type of the page object is not particularly limited in this embodiment.

In practical application, after a user marks a certain page object in the page marking system once, the page marking system generates marking data with unique identification information corresponding to each marking, and stores the marking data in a database.

On some complex pages, page objects overlap, are nested, or contain one another, so when the same page is labeled multiple times the labeling data contains repeated data and difference data; for example, the position and size of the labeling area, the labeling category, and so on may deviate between labels. In addition, when multiple users label the same page, different users label the same page object differently, so the labeling data contains a large amount of repeated data and difference data.

In the embodiment, by calculating the similarity of the labeling areas of any two labeling data, one or more groups of labeling data with the similarity of the labeling areas larger than a preset threshold value in the labeling data of the database are obtained; each group of obtained labeling data is either the repeated data with large similarity of the labeling areas and the same labeling category, or the difference data with large similarity of the labeling areas and different labeling categories, and is the data to be cleaned which needs to be cleaned.

Specifically, one possible implementation of this step is as follows:

calculating the similarity of the labeling areas of any two labeling data; determining at least one group of data to be cleaned according to the similarity of the marked areas of any two marked data; each group of data to be cleaned comprises at least two labeling data, and the similarity of labeling areas of any two labeling data in each group of data to be cleaned is larger than a preset threshold value.

Optionally, the similarity of two labeling areas may be equal to the ratio of the area of their overlapping portion to the total area they cover, where the area covered by the two labeling areas is their union.

The preset threshold is a value greater than 0 and less than 1, and may be set by a technician according to actual needs, which is not specifically limited herein.
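
The similarity measure and grouping rule described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes labeling areas are axis-aligned rectangles given as (x0, y0, x1, y1), and the greedy grouping strategy and the names `region_similarity` and `group_for_cleaning` are hypothetical choices made for the example.

```python
from typing import List, Tuple

# Assumption: a labeling area is an axis-aligned rectangle (x0, y0, x1, y1).
Box = Tuple[float, float, float, float]


def region_similarity(a: Box, b: Box) -> float:
    """Overlap area divided by the area of the union of the two regions."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def group_for_cleaning(boxes: List[Box], threshold: float) -> List[List[int]]:
    """Greedy grouping (one possible strategy): a box joins the first group
    in which its similarity to every member exceeds the threshold.
    Groups with fewer than two members are not data to be cleaned."""
    groups: List[List[int]] = []
    for i, box in enumerate(boxes):
        for group in groups:
            if all(region_similarity(box, boxes[j]) > threshold for j in group):
                group.append(i)
                break
        else:
            groups.append([i])
    return [g for g in groups if len(g) >= 2]
```

Because a box only joins a group when it is similar to every existing member, any two labeling areas within a returned group satisfy the pairwise condition stated above.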

Step S102, determining a new labeling area and a new labeling category of each group of data to be cleaned.

After one or more sets of data to be cleaned are determined, a new labeling area and labeling category for each set of data to be cleaned is determined, respectively.

For any group of data to be cleaned, if the labeling categories of all labeling data in the group are the same, the group is repeated data. In this step, only a new labeling area needs to be determined for a group of repeated data; the original labeling category is retained.

For any group of data to be cleaned, if the group contains labeling data with different labeling categories, the group is difference data. In this step, both a new labeling area and a new labeling category need to be determined for a group of difference data.
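
The distinction between repeated data and difference data reduces to a check on the set of labeling categories in a group. A small sketch, with the hypothetical helper name `is_repeated_group`:

```python
from typing import List


def is_repeated_group(categories: List[str]) -> bool:
    """True when every labeling data in the group has the same category
    (repeated data); False when categories differ (difference data)."""
    return len(set(categories)) == 1
```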

Step S103, cleaning the data to be cleaned of each group according to the new labeling area and the new labeling category of the data to be cleaned of each group.

And for any group of data to be cleaned, cleaning the group of data to be cleaned in the database after determining a new labeling area and a new labeling category of the group of data to be cleaned.

Specifically, a new piece of annotation data corresponding to the group of data to be cleaned can be added into the database, the annotation region and the annotation category of the new piece of annotation data are respectively the new annotation region and the new annotation category of the group of data to be cleaned, and other information of the new piece of annotation data can be obtained by merging and sorting according to other information of all the annotation data in the group of data to be cleaned; deleting the set of data to be cleaned in the database, so that repeated data and difference data in the database can be cleaned.
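
The add-then-delete cleaning step can be sketched with an in-memory dictionary standing in for the database. The record layout, the `merged_from` field, and the id scheme are assumptions made for illustration; a real page labeling system would define its own schema.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]


def clean_group(db: Dict[int, dict], group_ids: List[int],
                new_region: Box, new_category: str) -> int:
    """Replace a group of records with one merged record and return its id."""
    new_id = max(db) + 1  # hypothetical id scheme: next unused integer
    merged_from = sorted(group_ids)  # other information merged from the group
    for record_id in group_ids:
        del db[record_id]  # delete the data to be cleaned
    db[new_id] = {
        "region": new_region,
        "category": new_category,
        "merged_from": merged_from,
    }
    return new_id
```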

In this embodiment of the invention, at least one group of labeling data whose labeling-area similarity is greater than a preset threshold is obtained, each group of labeling data being a group of data to be cleaned; a new labeling area and a new labeling category are determined for each group of data to be cleaned; and each group of data to be cleaned is cleaned according to its new labeling area and new labeling category. Repeated data and difference data in the labeling data can thus be identified and cleaned automatically, which improves the effectiveness of the labeling data.

Example two

FIG. 2 is a flowchart of a method for processing annotation data according to a second embodiment of the present invention; fig. 3 is a flowchart of another method for processing annotation data according to the second embodiment of the present invention. On the basis of the first embodiment, in this embodiment, as shown in fig. 2, a new labeling area of each group of data to be cleaned is determined, which specifically includes the following steps:

step S201, dividing the coverage area of all the labeling data of the group of data to be cleaned into a plurality of grids.

In this embodiment, in order to determine a new labeling area of the set of data to be cleaned, the coverage area of all the labeling data of the set of data to be cleaned may be divided into a plurality of grids.

In addition, because the labeling areas of all labeling data in the group of data to be cleaned are highly similar, all labeling data in the group lie on the same page, so the whole page may instead be divided into a uniform grid, e.g., a 1000 × 2000 grid.

Step S202, the labeling times of the group of data to be cleaned on each grid are calculated.

In this embodiment, the number of times of labeling each grid by the set of data to be cleaned is calculated according to the grids covered by the labeling area of each labeling data in the set of data to be cleaned.

Specifically, after the page corresponding to the group of data to be cleaned is divided into grids, initializing the labeling times of each grid to be 0; carrying out statistics processing on each marking data in the group of data to be cleaned in sequence, and adding 1 to the marking times of grids covered by the marking area of the marking data; and after the statistical processing of all the labeling data in the group of data to be cleaned is completed, the labeling times of the group of data to be cleaned on each grid are obtained.

Optionally, when each labeling data in the group of data to be cleaned is statistically processed, a grid covered by the labeling area of the labeling data may be determined according to a projection coordinate of the labeling area of the labeling data projected onto the page.

Optionally, a data structure such as a two-dimensional vector may be used to record the labeling times of all grids.
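
The counting procedure of step S202 can be sketched with a two-dimensional list. The sketch assumes each labeling area has already been projected onto the grid and is given in grid-cell coordinates (col0, row0, col1, row1) with exclusive end indices; that representation and the name `count_annotations` are assumptions, not part of the original text.

```python
from typing import List, Tuple

# Assumption: areas are given in grid cells, end-exclusive.
CellBox = Tuple[int, int, int, int]  # (col0, row0, col1, row1)


def count_annotations(boxes: List[CellBox], rows: int, cols: int) -> List[List[int]]:
    """Initialize every grid cell to 0, then add 1 to each cell covered
    by the labeling area of each labeling data in the group."""
    counts = [[0] * cols for _ in range(rows)]
    for c0, r0, c1, r1 in boxes:
        for r in range(max(r0, 0), min(r1, rows)):
            for c in range(max(c0, 0), min(c1, cols)):
                counts[r][c] += 1
    return counts
```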

Step S203, determining the connected region of the coverage area whose labeling times are greater than the frequency threshold as the new labeling area of the group of data to be cleaned.

In this embodiment, after the labeling times of each grid are obtained, the maximum of the labeling times over all grids can be determined by comparing the labeling times of the grids, and the frequency threshold is determined according to this maximum and a preset threshold.

Optionally, the frequency threshold may be equal to the product of the maximum labeling times and the preset threshold.

In addition, the frequency threshold may be set by a technician according to actual needs, and the embodiment is not specifically limited herein.

After the frequency threshold is determined, determining the area formed by grids with the marking frequency larger than the frequency threshold in all grids corresponding to the group of data to be cleaned, and taking the connected area in the area formed by grids with the marking frequency larger than the frequency threshold as a new marking area of the group of data to be cleaned.

Optionally, if the area formed by the grid with the labeling times larger than the frequency threshold includes a plurality of connected areas, the connected area with the largest area is used as a new labeling area of the group of data to be cleaned.
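
Step S203 can be sketched as a threshold followed by a search for the largest connected region. The sketch takes the frequency threshold as the product of the maximum labeling times and a preset threshold, as in the optional variant above; the use of 4-connectivity (cells sharing an edge) and the name `new_region_cells` are assumptions, since the text does not specify how connectivity is defined.

```python
from collections import deque
from typing import List, Set, Tuple

Cell = Tuple[int, int]  # (row, col)


def new_region_cells(counts: List[List[int]], preset_threshold: float) -> Set[Cell]:
    """Return the largest connected set of grid cells whose labeling times
    exceed max_count * preset_threshold."""
    rows, cols = len(counts), len(counts[0])
    max_count = max(max(row) for row in counts)
    frequency_threshold = max_count * preset_threshold
    keep = {(r, c) for r in range(rows) for c in range(cols)
            if counts[r][c] > frequency_threshold}

    best: Set[Cell] = set()
    seen: Set[Cell] = set()
    for start in sorted(keep):
        if start in seen:
            continue
        # Breadth-first search over edge-adjacent cells (assumed 4-connectivity).
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            r, c = queue.popleft()
            for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if nb in keep and nb not in seen:
                    seen.add(nb)
                    component.add(nb)
                    queue.append(nb)
        if len(component) > len(best):
            best = component
    return best
```

The returned cell set would still need to be converted back to page coordinates to form the new labeling area.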

In this embodiment of the invention, the coverage area of all labeling data of the group of data to be cleaned is divided into a plurality of grids, the labeling times of the group on each grid are calculated, and the connected region of the coverage area whose labeling times are greater than the frequency threshold is determined as the new labeling area of the group, so that the area labeled most often in the group serves as the new labeling area, which improves the accuracy of the new labeling area.

In another implementation of this embodiment, determining a new annotation class for a set of data to be cleaned includes the following two cases:

The first case is: if the labeling categories of all labeling data in the group of data to be cleaned are the same, the labeling category of the labeling data in the group is taken as the new labeling category.

In the first case, the labeling types of all the labeling data in the group of data to be cleaned are the same, that is, the similarity of the labeling areas of all the labeling data in the group of data to be cleaned is very high, and the labeling types are the same, so that the group of data to be cleaned is the repeated data, and only the new labeling areas of the group of repeated data need to be determined again, and the original labeling types are reserved.

The second case is: if the group of data to be cleaned contains labeling data with different labeling categories, the labeling probability comprehensive score of each labeling category corresponding to the group is calculated, and the labeling category with the largest labeling probability comprehensive score is taken as the new labeling category of the group of data to be cleaned.

In the second case, the group contains labeling data with different labeling categories; that is, the labeling areas of all labeling data in the group are highly similar but the labeling categories differ, so the group is difference data, and both the new labeling area and the new labeling category of the group need to be redetermined.

As shown in fig. 3, the specific steps for determining a new annotation class for a set of data to be cleaned are as follows:

step S301, calculating the probability of the new labeling area corresponding to each labeling category through a classification model according to the new labeling area of the group of data to be cleaned.

In this embodiment, a preset classification model may be used to classify the new labeling area, so as to obtain the probability that the labeling data corresponding to the new labeling area is of each labeling category.

The preset classification model may be a classifier for identifying the category of the page object in a specified area. The position information of the new labeling area is input into the classification model, which calculates and outputs the probability that the page object in the new labeling area belongs to each labeling category.

Step S302, calculating the occurrence probability of each labeling category corresponding to the group of data to be cleaned.

In this embodiment, the occurrence probability of any labeling category corresponding to the group of data to be cleaned may be calculated using formula two:

$$p_h(c) = \frac{n_c}{N} \qquad \text{(formula two)}$$

where $c$ represents a labeling category, $p_h(c)$ represents the occurrence probability of labeling category $c$ corresponding to the group of data to be cleaned, $n_c$ represents the number of times the group of data to be cleaned labels the page object as labeling category $c$, and $N$ represents the total number of times the group of data to be cleaned labels the page object.

That is, $n_c$ is the number of labeling data whose labeling category is $c$ among all labeling data of the group of data to be cleaned.

The labeling category c may be any labeling category, for example, one of a header, a footer, a heading, a text paragraph, a formula, a table, or a picture.

In addition, the sum of the occurrence probabilities of all labeling categories corresponding to the group of data to be cleaned is equal to 1, which can be expressed by the following formula three:

$$\sum_{c \in \mathrm{Category}} p_h(c) = 1 \qquad \text{(formula three)}$$

where $c$ represents a labeling category, $p_h(c)$ represents the occurrence probability of labeling category $c$ corresponding to the group of data to be cleaned, and $\mathrm{Category}$ is the set of all labeling categories corresponding to the group of data to be cleaned.
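
The occurrence probability of each labeling category is the share of labeling data in the group that carries that category. A minimal sketch, with the hypothetical function name `category_probabilities`:

```python
from collections import Counter
from typing import Dict, List


def category_probabilities(categories: List[str]) -> Dict[str, float]:
    """Map each labeling category c to n_c / N, where n_c is the number of
    labeling data with category c and N is the size of the group."""
    counts = Counter(categories)
    total = len(categories)
    return {category: n / total for category, n in counts.items()}
```

For a group labeled table, table, formula, table, this gives 0.75 for table and 0.25 for formula, and the values sum to 1.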

And step S303, calculating the labeling probability comprehensive score corresponding to each labeling category according to the probability of each labeling category corresponding to the new labeling area and the probability of each labeling category corresponding to the group of data to be cleaned.

Specifically, the labeling probability comprehensive score corresponding to each labeling category is calculated using the following formula one:

$$p_o(c) = w_h \times p_h(c) + w_m \times p_m(c) \qquad \text{(formula one)}$$

where $c$ represents a labeling category, $p_o(c)$ represents the labeling probability comprehensive score of labeling category $c$, $p_m(c)$ represents the probability that the new labeling area corresponds to labeling category $c$, $p_h(c)$ represents the occurrence probability of labeling category $c$ corresponding to the group of data to be cleaned, and $w_h$ and $w_m$ are weight coefficients.

In this embodiment, $w_h$ is the weight coefficient of the manually labeled type of the page object, as represented by the group of data to be cleaned, and $w_m$ is the weight coefficient of the labeling type assigned to the page object by the classification model. $w_h$ and $w_m$ may be set by a technician according to actual needs, which is not specifically limited in this embodiment.

And step S304, taking the labeling category with the largest labeling probability comprehensive score as a new labeling category of the group of data to be cleaned.

After the labeling probability comprehensive score corresponding to each labeling category is obtained through calculation, the labeling category with the largest labeling probability comprehensive score is determined to be the new labeling category of the group of data to be cleaned by comparing the size of the labeling probability comprehensive score corresponding to each labeling category.
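
Steps S303 and S304 amount to a weighted sum followed by an arg-max. The sketch below is illustrative only: the probability dictionaries, the weights used in the usage note, and the alphabetical tie-breaking rule are assumptions, since the text leaves the weights to the technician and does not say how ties are resolved.

```python
from typing import Dict


def pick_category(p_h: Dict[str, float], p_m: Dict[str, float],
                  w_h: float, w_m: float) -> str:
    """Return the category maximizing w_h * p_h(c) + w_m * p_m(c).
    A category missing from either input is treated as probability 0."""
    categories = set(p_h) | set(p_m)
    scores = {c: w_h * p_h.get(c, 0.0) + w_m * p_m.get(c, 0.0)
              for c in categories}
    # Sorting first makes ties resolve alphabetically (an arbitrary choice).
    return max(sorted(scores), key=scores.__getitem__)
```

With manual probabilities of 0.75 for table and 0.25 for formula, and model probabilities of 0.1 for table and 0.8 for formula, equal weights of 0.5 select formula (0.525 against 0.425), while weights of 0.8 and 0.2 select table (0.62 against 0.36).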

The embodiment of the invention provides a specific implementation mode for determining new annotation categories of difference data, wherein the probability of the new annotation areas corresponding to each annotation category is calculated through a classification model according to the new annotation areas of the group of data to be cleaned; the probability of each labeling category corresponding to the group of data to be cleaned is calculated, the labeling probability comprehensive score corresponding to each labeling category is calculated according to the probability of each labeling category corresponding to the new labeling area and the probability of each labeling category corresponding to the group of data to be cleaned, and the labeling category with the largest labeling probability comprehensive score is used as the new labeling category of the group of data to be cleaned, so that the accuracy of the new labeling category can be improved.

Example three

FIG. 4 is a schematic structural diagram of a labeling data processing apparatus according to a third embodiment of the present invention. The labeling data processing apparatus provided by this embodiment can execute the processing flow provided by the labeling data processing method embodiment. As shown in FIG. 4, the apparatus 40 includes: a to-be-cleaned data acquisition module 401, a determining module 402, and a cleaning processing module 403.

Specifically, the to-be-cleaned data obtaining module 401 is configured to obtain at least one set of labeling data with a similarity of the labeling area greater than a preset threshold, where each set of labeling data is a set of to-be-cleaned data.

The determination module 402 is configured to determine a new annotation region and a new annotation category for each set of data to be cleaned.

The cleaning processing module 403 is configured to perform cleaning processing on each set of data to be cleaned according to the new labeling area and the new labeling category of each set of data to be cleaned.

The apparatus provided in the embodiment of the present invention may be specifically used to perform the method embodiment provided in the first embodiment, and specific functions are not described herein.
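
The division into three modules can be pictured as three callables composed in sequence. The class name `AnnotationCleaner` and the callable signatures are hypothetical; the sketch only shows how modules 401, 402, and 403 might hand data to one another.

```python
from typing import Callable, List, Sequence


class AnnotationCleaner:
    """Hypothetical composition of the acquisition, determining, and
    cleaning processing modules."""

    def __init__(self,
                 acquire: Callable[[], List[Sequence[int]]],
                 determine: Callable[[Sequence[int]], tuple],
                 clean: Callable[[Sequence[int], tuple], None]) -> None:
        self.acquire = acquire      # role of module 401
        self.determine = determine  # role of module 402
        self.clean = clean          # role of module 403

    def run(self) -> int:
        """Process every group of data to be cleaned; return the group count."""
        groups = self.acquire()
        for group in groups:
            self.clean(group, self.determine(group))
        return len(groups)
```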

Example four

On the basis of the third embodiment, in this embodiment, the determining module is further configured to:

dividing the coverage area of all labeling data of the group of data to be cleaned into a plurality of grids; calculating the labeling times of the group of data to be cleaned on each grid; and determining the connected region of the coverage area whose labeling times are greater than the frequency threshold as the new labeling area of the group of data to be cleaned.

Optionally, the determining module is further configured to:

determining the maximum of the labeling times over all grids; and determining the frequency threshold as the product of this maximum and a preset threshold.

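Optionally, the to-be-cleaned data acquisition module is further configured to:

calculating the similarity of the labeling areas of any two labeling data; and determining at least one group of data to be cleaned according to the similarity of the labeling areas of any two labeling data, wherein each group of data to be cleaned comprises at least two labeling data, and the similarity of the labeling areas of any two labeling data in each group is greater than a preset threshold.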

Optionally, the determining module is further configured to:

if the group of data to be cleaned contains labeling data with different labeling categories, calculating the labeling probability comprehensive score of each labeling category corresponding to the group of data to be cleaned; and taking the labeling category with the largest labeling probability comprehensive score as the new labeling category of the group of data to be cleaned.

Optionally, the determining module is further configured to:

calculating the probability of the new labeling area corresponding to each labeling category through a classification model according to the new labeling area of the group of data to be cleaned; calculating the occurrence probability of each labeling category corresponding to the group of data to be cleaned; calculating the labeling probability comprehensive score corresponding to each labeling category by adopting the following formula:

$$p_o(c) = w_h \times p_h(c) + w_m \times p_m(c)$$

Optionally, the determining module is further configured to:

and if the labeling categories of all the labeling data in the group of data to be cleaned are the same, taking the labeling category of the labeling data in the group of data to be cleaned as a new labeling category.

The apparatus provided in the embodiment of the present invention may be specifically used to execute the method embodiment provided in the second embodiment, and specific functions are not described herein.

In this embodiment of the invention, the coverage area of all labeling data of the group of data to be cleaned is divided into a plurality of grids, the labeling times of the group on each grid are calculated, and the connected region of the coverage area whose labeling times are greater than the frequency threshold is determined as the new labeling area of the group, so that the area labeled most often in the group serves as the new labeling area, which improves the accuracy of the new labeling area. Further, the probability that the new labeling area corresponds to each labeling category is calculated through a classification model according to the new labeling area of the group; the occurrence probability of each labeling category corresponding to the group is calculated; the labeling probability comprehensive score of each labeling category is calculated from these two probabilities; and the labeling category with the largest labeling probability comprehensive score is taken as the new labeling category of the group, which improves the accuracy of the new labeling category.

Example five

Fig. 5 is a schematic structural diagram of a labeling data processing device according to a fifth embodiment of the present invention. As shown in fig. 5, the annotation data processing apparatus 50 includes: a processor 501, a memory 502, and a computer program stored on the memory 502 and executable by the processor 501.

The processor 501, when executing a computer program stored on the memory 502, implements the annotation data processing method provided by any of the method embodiments described above.

In addition, the embodiment of the invention also provides a computer readable storage medium which stores a computer program, and the computer program realizes the labeling data processing method provided by any one of the method embodiments when being executed by a processor.

In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division of the units is merely a division by logical function, and other divisions are possible in actual implementation. For instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection via some interfaces, devices, or units, and may be in electrical, mechanical, or other form.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in hardware plus software functional units.

The integrated units implemented in the form of software functional units described above may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional modules is used as an example. In practical application, the above-described functions may be allocated to different functional modules as needed, i.e. the internal structure of the apparatus may be divided into different functional modules to perform all or part of the functions described above. For the specific working process of the above-described device, reference may be made to the corresponding process in the foregoing method embodiments, which is not repeated here.

Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.

It is to be understood that the invention is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims

1. A method of annotation data processing, comprising:

according to the new labeling area and the new labeling category of each group of data to be cleaned, cleaning the data to be cleaned;

the determining of the new labeling category of each group of data to be cleaned comprises the following steps:

if the labeling categories of any two labeling data in the group of data to be cleaned are different, calculating the probability of the new labeling area corresponding to each labeling category through a classification model according to the new labeling area of the group of data to be cleaned;

calculating the occurrence probability of each labeling category corresponding to the group of data to be cleaned;

calculating the labeling probability comprehensive score corresponding to each labeling category by adopting the following formula:

$$p_o(c) = w_h \times p_h(c) + w_m \times p_m(c)$$

wherein $c$ represents a labeling category, $p_o(c)$ represents the labeling probability comprehensive score of labeling category $c$, $p_m(c)$ represents the probability of the new labeling area corresponding to labeling category $c$, $p_h(c)$ represents the occurrence probability of labeling category $c$ corresponding to the group of data to be cleaned, and $w_h$ and $w_m$ are weight coefficients;

and taking the labeling category with the largest labeling probability comprehensive score as the new labeling category of the group of data to be cleaned.
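As an illustration only, the comprehensive score above can be sketched in Python. The sketch assumes that the occurrence probability of a category is its relative frequency among the labels in the group, that the classification model output is already available as a dictionary of probabilities, and that both weight coefficients default to 0.5; none of these values is given in the text, and the function name `new_labeling_category` is hypothetical.

```python
from collections import Counter


def new_labeling_category(group_labels, model_probs, w_h=0.5, w_m=0.5):
    """Pick the category with the largest comprehensive score.

    group_labels: categories assigned by annotators within one group.
    model_probs:  category -> probability from a classification model
                  applied to the new labeling area (assumed given).
    w_h, w_m:     weight coefficients (0.5 is an assumed default).
    """
    total = len(group_labels)
    freq = Counter(group_labels)
    categories = set(freq) | set(model_probs)
    scores = {
        # occurrence probability weighted by w_h, model probability by w_m
        c: w_h * freq.get(c, 0) / total + w_m * model_probs.get(c, 0.0)
        for c in categories
    }
    # Sorting first makes the choice deterministic when scores tie.
    best = max(sorted(scores), key=scores.get)
    return best, scores


labels = ["table", "table", "figure"]
probs = {"table": 0.3, "figure": 0.6, "formula": 0.1}
best, scores = new_labeling_category(labels, probs)
print(best)  # table
```

Here "table" scores 0.5 × 2/3 + 0.5 × 0.3 ≈ 0.483 and "figure" scores 0.5 × 1/3 + 0.5 × 0.6 ≈ 0.467, so the annotators' majority outweighs the model's preference under equal weights; a larger `w_m` would reverse the outcome.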

2. The method according to claim 1, wherein the obtaining of at least one group of labeling data whose labeling-area similarity is greater than a preset threshold, each group of labeling data being a group of data to be cleaned, comprises:

calculating the similarity of the labeling areas of any two labeling data;

determining at least one group of data to be cleaned according to the similarity of the labeling areas of any two labeling data;

each group of data to be cleaned comprises at least two labeling data, and the similarity of labeling areas of any two labeling data in each group of data to be cleaned is larger than the preset threshold.
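A minimal Python sketch of this grouping step follows. It rests on assumptions not stated in this claim: the similarity of two labeling areas is taken to be the intersection-over-union (IoU) of axis-aligned rectangles `(x0, y0, x1, y1)`, and groups are built greedily, so that a labeling datum joins the first group in which its similarity to every member exceeds the threshold. The names `iou` and `groups_to_clean` are hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two rectangles (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


def groups_to_clean(boxes, threshold):
    """Greedy grouping: every pair inside a group has IoU > threshold."""
    groups = []
    for i, box in enumerate(boxes):
        for group in groups:
            if all(iou(box, boxes[j]) > threshold for j in group):
                group.append(i)
                break
        else:
            groups.append([i])  # no compatible group, start a new one
    # A group to be cleaned holds at least two labeling data.
    return [g for g in groups if len(g) >= 2]


boxes = [(0, 0, 10, 10), (1, 0, 10, 10), (50, 50, 60, 60), (0, 0, 10, 9)]
print(groups_to_clean(boxes, threshold=0.7))  # [[0, 1, 3]]
```

The isolated box at index 2 overlaps nothing and is therefore left out, since it is neither a duplicate nor a conflicting annotation. A greedy pass depends on input order; an exhaustive clique search would avoid that at higher cost.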

3. The method according to claim 1 or 2, wherein the determining of a new labeling area of a group of data to be cleaned comprises:

dividing the coverage area of all labeling data of the group of data to be cleaned into a plurality of grids;

calculating the labeling count of the group of data to be cleaned on each grid;

and determining the connected region in the coverage area whose labeling count is greater than the count threshold as the new labeling area of the group of data to be cleaned.

4. The method according to claim 3, wherein, before determining the connected region in the coverage area whose labeling count is greater than the count threshold as the new labeling area of the group of data to be cleaned, the method further comprises:

determining the maximum count among the labeling counts of the grids;

and determining the count threshold according to the product of the maximum count and the preset threshold.
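The threshold derivation can be shown with a short worked example in Python. It is a sketch under the assumption that the per-grid labeling counts are held in a list of rows and that the preset threshold is a ratio between 0 and 1; the value 0.5 and the name `count_threshold` are illustrative only.

```python
def count_threshold(counts, preset_threshold):
    """Product of the largest per-grid labeling count and the preset threshold."""
    max_count = max(max(row) for row in counts)
    return max_count * preset_threshold


counts = [
    [1, 2, 2],
    [1, 3, 3],
    [0, 1, 2],
]
threshold = count_threshold(counts, preset_threshold=0.5)  # 3 * 0.5 = 1.5
kept = [(r, c) for r, row in enumerate(counts)
        for c, n in enumerate(row) if n > threshold]
print(threshold, len(kept))  # 1.5 5
```

Tying the threshold to the maximum count makes it scale with the number of annotators who labeled the same object, so the same preset value can serve groups of different sizes.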

5. The method of claim 1, wherein determining a new annotation class for a set of data to be cleaned comprises:

6. A labeling data processing apparatus, comprising:

the cleaning processing module is used for cleaning each group of data to be cleaned according to the new labeling area and the new labeling category of each group of data to be cleaned;

the determining module is further configured to:

$$p_o(c) = w_h \times p_h(c) + w_m \times p_m(c)$$

wherein $c$ represents a labeling category, $p_o(c)$ represents the labeling probability comprehensive score of labeling category $c$, $p_m(c)$ represents the probability of the new labeling area corresponding to labeling category $c$, $p_h(c)$ represents the occurrence probability of labeling category $c$ corresponding to the group of data to be cleaned, and $w_h$ and $w_m$ are weight coefficients;

7. An annotation data processing device, comprising:

the processor, when running the computer program, implements the method according to any of claims 1-5.

8. A computer-readable storage medium, in which a computer program is stored,

the computer program implementing the method according to any of claims 1-5 when executed by a processor.