CN111143333B - Labeling data processing method, device, equipment and computer readable storage medium - Google Patents

Labeling data processing method, device, equipment and computer readable storage medium

Info

Publication number
CN111143333B
Authority
CN
China
Prior art keywords
data
labeling
cleaned
group
new
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811313048.2A
Other languages
Chinese (zh)
Other versions
CN111143333A (en
Inventor
黄铭哲
颜钦钦
高良才
汤帜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Pku Founder Information Industry Group Co ltd
Peking University
Peking University Founder Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Pku Founder Information Industry Group Co ltd, Peking University, Peking University Founder Group Co Ltd

Landscapes

  • Image Analysis (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The embodiment of the invention provides a method, a device, equipment and a computer readable storage medium for processing annotation data. The method obtains at least one group of labeling data whose labeling-area similarity is greater than a preset threshold, each group being a group of data to be cleaned; determines a new labeling area and a new labeling category for each group of data to be cleaned; and cleans each group of data to be cleaned according to its new labeling area and new labeling category. Repeated data and difference data in the labeling data can thus be identified and cleaned automatically, which improves the validity of the labeling data.

Description

Labeling data processing method, device, equipment and computer readable storage medium
Technical Field
The embodiment of the invention relates to the technical field of digital publishing, in particular to a method, a device, equipment and a computer readable storage medium for processing annotation data.
Background
With continued research on digital information resources and the increasingly wide application of deep learning in digital publishing, data annotation has become very important work. The accuracy and efficiency of page object annotation have also become factors that limit the performance of deep learning models.
At present, a page labeling system provides a page labeling function, records the labeling data a user produces for a page object, and stores the labeling data in a database. However, on some complex pages the page objects overlap, are nested, or contain one another, so labeling the same page multiple times produces repeated data and difference data; for example, the position and size of the labeling area, the labeling category, and so on may deviate between labelings. In addition, when several people label the same page, different users label the same page object differently, so the labeling data contains a large amount of repeated data and difference data.
Disclosure of Invention
The embodiment of the invention provides a method, a device, equipment and a computer readable storage medium for processing labeling data, which address the large amount of difference data and repeated data in the labeling data of a page labeling system.
An aspect of an embodiment of the present invention provides a method for processing annotation data, including:
acquiring at least one group of labeling data whose labeling-area similarity is greater than a preset threshold, wherein each group of labeling data is a group of data to be cleaned;
determining a new labeling area and a new labeling category of each group of data to be cleaned;
and cleaning the data to be cleaned of each group according to the new labeling area and the new labeling category of the data to be cleaned of each group.
Another aspect of an embodiment of the present invention provides a labeling data processing apparatus, including:
the to-be-cleaned data acquisition module is used for acquiring at least one group of labeling data whose labeling-area similarity is greater than a preset threshold, and each group of labeling data is a group of data to be cleaned;
the determining module is used for determining a new labeling area and a new labeling category of each group of data to be cleaned;
and the cleaning processing module is used for cleaning each group of data to be cleaned according to the new labeling area and the new labeling category of each group of data to be cleaned.
Another aspect of an embodiment of the present invention provides an annotation data processing apparatus, including:
a memory, a processor, and a computer program stored on the memory and executable on the processor,
the processor, when running the computer program, implements the annotation data processing method of any one of the above.
It is another aspect of embodiments of the present invention to provide a computer-readable storage medium, storing a computer program,
the computer program, when executed by a processor, implements the annotation data processing method described above.
According to the method, the device, the equipment and the computer readable storage medium for processing the annotation data, at least one group of labeling data whose labeling-area similarity is greater than a preset threshold is obtained, each group being a group of data to be cleaned; a new labeling area and a new labeling category are determined for each group of data to be cleaned; and each group of data to be cleaned is cleaned according to its new labeling area and new labeling category. Repeated data and difference data in the labeling data can thus be identified and cleaned automatically, which improves the validity of the labeling data.
Drawings
FIG. 1 is a flowchart of a method for processing annotation data according to a first embodiment of the present invention;
FIG. 2 is a flowchart of a method for processing annotation data according to a second embodiment of the present invention;
FIG. 3 is a flowchart of another method for processing annotation data according to a second embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a labeling data processing apparatus according to a third embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a labeling data processing device according to a fifth embodiment of the present invention.
Specific embodiments of the present invention have been shown by way of the above drawings and will be described in more detail below. The drawings and the written description are not intended to limit the scope of the inventive embodiments in any way, but rather to illustrate the inventive concepts to those skilled in the art by reference to the specific embodiments.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with embodiments of the invention. Rather, they are merely examples of apparatus and methods consistent with aspects of embodiments of the invention as detailed in the accompanying claims.
The terms "first," "second," and the like in the embodiments of the present invention are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. In the following description of the embodiments, "a plurality" means two or more, unless explicitly defined otherwise.
The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments. Embodiments of the present invention will be described below with reference to the accompanying drawings.
Example one
FIG. 1 is a flowchart of a method for processing annotation data according to the first embodiment of the present invention. This embodiment provides a method for processing annotation data that addresses the large amount of difference data and repeated data in the annotation data of a page labeling system. The method in this embodiment is applied to a labeling data processing device, which may be a mobile terminal such as a smart phone or a tablet computer, or a server device. In other embodiments the method may also be applied to other devices; this embodiment takes the labeling data processing device as an illustrative example. As shown in FIG. 1, the method comprises the following steps:
Step S101, at least one group of labeling data whose labeling-area similarity is greater than a preset threshold is obtained, and each group of labeling data is a group of data to be cleaned.
The labeling data at least comprises identification information, labeling category, labeling area and the like. The annotation data may also include other information such as annotation page object information. The information included in the annotation data may be different for different page annotation systems.
The identification information of the labeling data may be information such as a number for uniquely identifying one piece of labeling data, and may be used as an index of the labeling data.
The type of the labeled page object may include a header, a footer, a title, a text paragraph, a formula, a table, etc., and the type of the page object is not particularly limited in this embodiment.
In practical application, after a user marks a certain page object in the page marking system once, the page marking system generates marking data with unique identification information corresponding to each marking, and stores the marking data in a database.
On some complex pages the page objects overlap, are nested, or contain one another, so labeling the same page multiple times produces repeated data and difference data; for example, the position and size of the labeling area, the labeling category, and so on may deviate between labelings. In addition, when several people label the same page, different users label the same page object differently, so the labeling data contains a large amount of repeated data and difference data.
In this embodiment, the similarity of the labeling areas of any two pieces of labeling data is calculated, and one or more groups of labeling data whose labeling-area similarity is greater than a preset threshold are obtained from the labeling data in the database. Each obtained group is either repeated data, whose labeling areas are highly similar and whose labeling categories are the same, or difference data, whose labeling areas are highly similar but whose labeling categories differ; in both cases it is data to be cleaned.
Specifically, one possible implementation of this step is as follows:
calculating the similarity of the labeling areas of any two pieces of labeling data; determining at least one group of data to be cleaned according to the similarity of the labeling areas of any two pieces of labeling data; each group of data to be cleaned comprises at least two pieces of labeling data, and the similarity of the labeling areas of any two pieces of labeling data in a group is greater than the preset threshold.
Optionally, the similarity of two labeling areas may be equal to the ratio of the area of their overlapping portion to the total area they cover, where the total covered area is the union of the two labeling areas.
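The overlap-to-union ratio described here can be sketched as follows. This is a minimal illustration that assumes each labeling area is an axis-aligned rectangle given as (x0, y0, x1, y1); the rectangle representation and the name `region_similarity` are assumptions made for the example and are not specified in this document.

```python
from typing import Tuple

# Hypothetical representation: (x0, y0, x1, y1) with x0 <= x1 and y0 <= y1.
Box = Tuple[float, float, float, float]


def region_similarity(a: Box, b: Box) -> float:
    """Area of the overlap divided by the area of the union of two rectangles."""
    # Width and height of the overlapping rectangle (zero if there is no overlap).
    overlap_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    overlap_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    overlap = overlap_w * overlap_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - overlap

    # Two degenerate (zero-area) rectangles are treated as not similar.
    return overlap / union if union > 0 else 0.0


# Two 10 x 10 boxes shifted by 5: overlap 50, union 150.
print(region_similarity((0, 0, 10, 10), (5, 0, 15, 10)))
```

Identical rectangles give a similarity of 1 and disjoint rectangles give 0, which matches a preset threshold chosen between 0 and 1.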
The preset threshold is a value greater than 0 and less than 1, and may be set by a technician according to actual needs, which is not specifically limited herein.
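The document requires that any two pieces of labeling data in a group have a labeling-area similarity above the preset threshold, but it does not say how the groups are built. The sketch below uses a simple greedy pass as one possible way to satisfy that condition; the greedy strategy, the function name `group_for_cleaning`, and the `sim` callback are all assumptions for illustration.

```python
from typing import Callable, Hashable, List, Sequence


def group_for_cleaning(
    ids: Sequence[Hashable],
    sim: Callable[[Hashable, Hashable], float],
    threshold: float,
) -> List[List[Hashable]]:
    """Greedily build groups in which every pair has similarity > threshold."""
    groups: List[List[Hashable]] = []
    for item in ids:
        for group in groups:
            # Join the first group whose every member is similar enough.
            if all(sim(item, other) > threshold for other in group):
                group.append(item)
                break
        else:
            groups.append([item])
    # A group of data to be cleaned contains at least two pieces of labeling data.
    return [g for g in groups if len(g) >= 2]


# Usage with a hypothetical table of pairwise labeling-area similarities.
table = {frozenset("ab"): 0.9, frozenset("ac"): 0.85, frozenset("bc"): 0.8}
lookup = lambda x, y: table.get(frozenset((x, y)), 0.0)
print(group_for_cleaning(["a", "b", "c", "d"], lookup, threshold=0.7))
```

A greedy pass depends on the order of the input, so a production system might prefer a different grouping strategy; the pairwise condition on each returned group holds either way.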
Step S102, determining a new labeling area and a new labeling category of each group of data to be cleaned.
After one or more groups of data to be cleaned are determined, a new labeling area and a new labeling category are determined for each group.
For any group of data to be cleaned, if the labeling categories of all the labeling data in the group are the same, the group is repeated data. In this step, for a group of repeated data, only the new labeling area needs to be redetermined, and the original labeling category is retained.
For any group of data to be cleaned, if the group contains labeling data with different labeling categories, the group is difference data. In this step, for a group of difference data, both a new labeling area and a new labeling category need to be redetermined.
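The distinction between repeated data and difference data comes down to whether the group contains more than one labeling category. A minimal sketch, assuming the categories of a group are available as a list of strings (the function name and return values are hypothetical):

```python
from typing import Sequence


def group_kind(categories: Sequence[str]) -> str:
    """Return 'repeated' if all labeling categories match, else 'difference'."""
    if not categories:
        raise ValueError("a group of data to be cleaned must not be empty")
    return "repeated" if len(set(categories)) == 1 else "difference"


print(group_kind(["table", "table"]))             # repeated
print(group_kind(["table", "formula", "table"]))  # difference
```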
Step S103, cleaning the data to be cleaned of each group according to the new labeling area and the new labeling category of the data to be cleaned of each group.
For any group of data to be cleaned, after its new labeling area and new labeling category are determined, the group is cleaned in the database.
Specifically, a new piece of labeling data corresponding to the group may be added to the database. The labeling area and labeling category of the new piece are the new labeling area and new labeling category of the group, and its other information may be obtained by merging and sorting the other information of all the labeling data in the group. The group of data to be cleaned is then deleted from the database, so that the repeated data and difference data in the database are cleaned.
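The add-then-delete cleaning step can be sketched against an in-memory stand-in for the database. The dictionary layout, the field names `region`, `category` and `merged_from`, and the way a new identifier is chosen are assumptions made for this example only.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]


def clean_group(
    db: Dict[int, dict],
    group_ids: List[int],
    new_region: Box,
    new_category: str,
) -> int:
    """Replace a group of data to be cleaned with one merged record."""
    # Hypothetical identifier scheme: one more than the largest existing id.
    new_id = max(db) + 1
    merged = {
        "region": new_region,
        "category": new_category,
        # Stand-in for "other information" merged from the original records.
        "merged_from": sorted(group_ids),
    }
    for gid in group_ids:
        del db[gid]
    db[new_id] = merged
    return new_id


db = {
    1: {"region": (0, 0, 10, 10), "category": "table"},
    2: {"region": (1, 0, 11, 10), "category": "table"},
    3: {"region": (50, 50, 60, 60), "category": "formula"},
}
new_id = clean_group(db, [1, 2], (0, 0, 11, 10), "table")
print(new_id, sorted(db))
```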
According to the embodiment of the invention, at least one group of labeling data whose labeling-area similarity is greater than a preset threshold is obtained, each group being a group of data to be cleaned; a new labeling area and a new labeling category are determined for each group of data to be cleaned; and each group of data to be cleaned is cleaned according to its new labeling area and new labeling category. Repeated data and difference data in the labeling data can thus be identified and cleaned automatically, which improves the validity of the labeling data.
Example two
FIG. 2 is a flowchart of a method for processing annotation data according to a second embodiment of the present invention; FIG. 3 is a flowchart of another method for processing annotation data according to the second embodiment of the present invention. On the basis of the first embodiment, in this embodiment, as shown in FIG. 2, determining a new labeling area for each group of data to be cleaned specifically includes the following steps:
Step S201, dividing the coverage area of all the labeling data of the group of data to be cleaned into a plurality of grids.
In this embodiment, in order to determine a new labeling area of the set of data to be cleaned, the coverage area of all the labeling data of the set of data to be cleaned may be divided into a plurality of grids.
In addition, because the labeling areas of all the labeling data in the group are highly similar, all the labeling data in the group belongs to the same page; the whole page may therefore be divided into a uniformly distributed grid, for example a 1000 x 2000 grid.
Step S202, the labeling times of the group of data to be cleaned on each grid are calculated.
In this embodiment, the number of times of labeling each grid by the set of data to be cleaned is calculated according to the grids covered by the labeling area of each labeling data in the set of data to be cleaned.
Specifically, after the page corresponding to the group of data to be cleaned is divided into grids, the labeling count of each grid is initialized to 0. Each piece of labeling data in the group is then processed in turn, and the labeling count of every grid covered by its labeling area is increased by 1. After all the labeling data in the group has been processed, the labeling count of the group on each grid is obtained.
Optionally, when each piece of labeling data in the group is processed, the grids covered by its labeling area may be determined according to the coordinates of the labeling area projected onto the page.
Optionally, a data structure such as a two-dimensional vector may be used to record the labeling counts of all grids.
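The counting procedure of step S202 can be sketched with a two-dimensional list. The sketch assumes each labeling area has already been projected onto the grid and is given as a half-open cell range (col0, row0, col1, row1); that representation and the name `count_labels` are assumptions for illustration.

```python
from typing import List, Sequence, Tuple

# Hypothetical cell range: columns col0..col1-1 and rows row0..row1-1.
CellRange = Tuple[int, int, int, int]


def count_labels(regions: Sequence[CellRange], n_rows: int, n_cols: int) -> List[List[int]]:
    """Count, for every grid cell, how many labeling areas cover it."""
    counts = [[0] * n_cols for _ in range(n_rows)]  # initialised to 0
    for col0, row0, col1, row1 in regions:
        # Clip each range to the grid, then add 1 to every covered cell.
        for r in range(max(row0, 0), min(row1, n_rows)):
            for c in range(max(col0, 0), min(col1, n_cols)):
                counts[r][c] += 1
    return counts


for row in count_labels([(0, 0, 2, 2), (1, 1, 3, 3)], n_rows=3, n_cols=3):
    print(row)
# [1, 1, 0]
# [1, 2, 1]
# [0, 1, 1]
```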
Step S203, determining a connected region of the coverage area in which the labeling count is greater than a count threshold as the new labeling area of the group of data to be cleaned.
In this embodiment, after the labeling count of each grid is obtained, the maximum labeling count over all grids can be determined by comparing the labeling counts of the grids, and the count threshold is determined according to the maximum count and a preset threshold.
Optionally, the count threshold may be equal to the product of the maximum count and the preset threshold.
Alternatively, the count threshold may be set by a technician according to actual needs, and this embodiment is not specifically limited in this respect.
After the count threshold is determined, the area formed by the grids whose labeling count is greater than the count threshold is determined among all grids corresponding to the group of data to be cleaned, and a connected region within that area is taken as the new labeling area of the group.
Optionally, if the area formed by the grids whose labeling count is greater than the count threshold includes a plurality of connected regions, the connected region with the largest area is taken as the new labeling area of the group of data to be cleaned.
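Selecting the largest connected region above the threshold can be sketched with a breadth-first flood fill. The sketch assumes 4-connectivity between grid cells and takes the threshold as the maximum count multiplied by a ratio; the connectivity rule and the name `new_region_cells` are assumptions, since the document does not define how cells are considered connected.

```python
from collections import deque
from typing import List, Set, Tuple


def new_region_cells(counts: List[List[int]], ratio: float) -> Set[Tuple[int, int]]:
    """Return the (row, col) cells of the largest connected region whose
    labeling count exceeds max_count * ratio (4-connectivity assumed)."""
    n_rows, n_cols = len(counts), len(counts[0])
    limit = max(max(row) for row in counts) * ratio
    seen = [[False] * n_cols for _ in range(n_rows)]
    best: Set[Tuple[int, int]] = set()

    for r in range(n_rows):
        for c in range(n_cols):
            if seen[r][c] or counts[r][c] <= limit:
                continue
            # Flood-fill one connected region starting from (r, c).
            region: Set[Tuple[int, int]] = set()
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                region.add((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < n_rows and 0 <= nx < n_cols
                            and not seen[ny][nx] and counts[ny][nx] > limit):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(region) > len(best):
                best = region
    return best


counts = [[3, 3, 0, 0],
          [3, 3, 0, 3]]
print(sorted(new_region_cells(counts, ratio=0.5)))
```

With the counts above, the threshold is 1.5, so the isolated cell at row 1, column 3 forms its own region and the four-cell block in the top-left corner is selected.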
According to the embodiment of the invention, the coverage area of all the labeling data of the group of data to be cleaned is divided into a plurality of grids, the labeling count of the group on each grid is calculated, and a connected region of the coverage area in which the labeling count is greater than the count threshold is determined as the new labeling area of the group. The area that the group labels most often thus becomes the new labeling area, which improves the accuracy of the new labeling area.
In another implementation of this embodiment, determining a new annotation class for a set of data to be cleaned includes the following two cases:
The first case is: if the labeling categories of all the labeling data in the group of data to be cleaned are the same, that labeling category is taken as the new labeling category.
In the first case, the labeling areas of all the labeling data in the group are highly similar and the labeling categories are the same, so the group is repeated data; only the new labeling area needs to be redetermined, and the original labeling category is retained.
The second case is: if the group of data to be cleaned contains labeling data with different labeling categories, the labeling probability comprehensive score of each labeling category corresponding to the group is calculated, and the labeling category with the largest labeling probability comprehensive score is taken as the new labeling category of the group.
In the second case, the labeling areas of all the labeling data in the group are highly similar but the labeling categories differ, so the group is difference data; both a new labeling area and a new labeling category need to be redetermined.
As shown in fig. 3, the specific steps for determining a new annotation class for a set of data to be cleaned are as follows:
Step S301, calculating the probability of the new labeling area corresponding to each labeling category through a classification model according to the new labeling area of the group of data to be cleaned.
In this embodiment, a preset classification model may be used to classify the new labeling area, so as to obtain the probability that the labeling data corresponding to the new labeling area is of each labeling category.
The preset classification model may be a classifier for identifying the category of the page object in a specified area. The position information of the new labeling area is input into the classification model, which calculates and outputs the probability that the page object in the new labeling area belongs to each labeling category.
Step S302, calculating the occurrence probability of each labeling category corresponding to the group of data to be cleaned.
In this embodiment, the occurrence probability of any labeling category corresponding to the group of data to be cleaned may be calculated using formula two:
$$p_h(c) = \frac{n_c}{N} \qquad \text{(formula two)}$$
where $c$ represents a labeling category, $p_h(c)$ represents the occurrence probability of labeling category $c$ corresponding to the group of data to be cleaned, $n_c$ represents the number of times the group of data to be cleaned labels the page object as labeling category $c$, and $N$ represents the total number of times the group of data to be cleaned labels the page object.
Here $n_c$ is the number of pieces of labeling data whose labeling category is $c$ among all the labeling data of the group of data to be cleaned.
The labeling category $c$ may be any labeling category, for example a header, a footer, a title, a text paragraph, a formula, a table, or a picture.
In addition, the occurrence probabilities of all labeling categories corresponding to the group of data to be cleaned sum to 1, which can be expressed by formula three:
$$\sum_{c \in \mathrm{Category}} p_h(c) = 1 \qquad \text{(formula three)}$$
where $c$ represents a labeling category, $p_h(c)$ represents the occurrence probability of labeling category $c$ corresponding to the group of data to be cleaned, and $\mathrm{Category}$ is the set of all labeling categories corresponding to the group of data to be cleaned.
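Formulas two and three can be checked with a short calculation. The sketch assumes the labeling categories of a group are available as a list of strings and that the total labeling count N equals the length of that list; the function name is hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence


def occurrence_probabilities(categories: Sequence[str]) -> Dict[str, float]:
    """Occurrence probability n_c / N of each labeling category in one group."""
    total = len(categories)  # N: total number of labelings in the group
    return {c: n / total for c, n in Counter(categories).items()}


p_h = occurrence_probabilities(["table", "table", "formula", "table"])
print(p_h)                # {'table': 0.75, 'formula': 0.25}
print(sum(p_h.values()))  # 1.0, as formula three requires
```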
Step S303, calculating the labeling probability comprehensive score corresponding to each labeling category according to the probability of each labeling category corresponding to the new labeling area and the occurrence probability of each labeling category corresponding to the group of data to be cleaned.
Specifically, formula one is used to calculate the labeling probability comprehensive score corresponding to each labeling category:
$$p_o(c) = w_h \times p_h(c) + w_m \times p_m(c) \qquad \text{(formula one)}$$
where $c$ represents a labeling category, $p_o(c)$ represents the labeling probability comprehensive score of labeling category $c$, $p_m(c)$ represents the probability of the new labeling area corresponding to labeling category $c$, $p_h(c)$ represents the occurrence probability of labeling category $c$ corresponding to the group of data to be cleaned, and $w_h$ and $w_m$ are weight coefficients.
In this embodiment, $w_h$ is the weight coefficient of the manually labeled type of the page object represented by the group of data to be cleaned, and $w_m$ is the weight coefficient of the labeling type that the classification model gives for the corresponding page object. $w_h$ and $w_m$ may be set by a technician according to actual needs, and this embodiment is not specifically limited in this respect.
Step S304, taking the labeling category with the largest labeling probability comprehensive score as the new labeling category of the group of data to be cleaned.
After the labeling probability comprehensive score corresponding to each labeling category is obtained through calculation, the labeling category with the largest labeling probability comprehensive score is determined to be the new labeling category of the group of data to be cleaned by comparing the size of the labeling probability comprehensive score corresponding to each labeling category.
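Steps S303 and S304 can be sketched as a weighted sum followed by an arg-max. The dictionaries of probabilities, the example weights, the treatment of a missing category as probability 0, and the alphabetical tie-break are all assumptions made for this illustration; the document leaves the weights to the practitioner.

```python
from typing import Dict, Tuple


def pick_category(
    p_h: Dict[str, float],
    p_m: Dict[str, float],
    w_h: float,
    w_m: float,
) -> Tuple[str, Dict[str, float]]:
    """Score each category as w_h * p_h(c) + w_m * p_m(c) and return the best."""
    # Sorted order makes ties resolve alphabetically (an arbitrary choice).
    categories = sorted(set(p_h) | set(p_m))
    scores = {
        c: w_h * p_h.get(c, 0.0) + w_m * p_m.get(c, 0.0) for c in categories
    }
    best = max(categories, key=lambda c: scores[c])
    return best, scores


p_h = {"table": 0.75, "formula": 0.25}  # from the manual labels of the group
p_m = {"table": 0.4, "formula": 0.6}    # hypothetical classifier output
best, scores = pick_category(p_h, p_m, w_h=0.5, w_m=0.5)
print(best, scores)
```

With equal weights the manual labels dominate in this example and "table" is chosen; a larger weight on the classifier (for instance 0.2 and 0.8) would flip the result to "formula".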
The embodiment of the invention provides a specific implementation for determining the new labeling category of difference data. The probability of the new labeling area corresponding to each labeling category is calculated through a classification model according to the new labeling area of the group of data to be cleaned; the occurrence probability of each labeling category corresponding to the group is calculated; the labeling probability comprehensive score of each labeling category is calculated from these two probabilities; and the labeling category with the largest labeling probability comprehensive score is taken as the new labeling category of the group, which can improve the accuracy of the new labeling category.
Example three
FIG. 4 is a schematic structural diagram of a labeling data processing apparatus according to a third embodiment of the present invention. The labeling data processing apparatus provided by the embodiment of the invention can execute the processing flow provided by the labeling data processing method embodiment. As shown in FIG. 4, the apparatus 40 includes: a to-be-cleaned data acquisition module 401, a determining module 402 and a cleaning processing module 403.
Specifically, the to-be-cleaned data acquisition module 401 is configured to obtain at least one group of labeling data whose labeling-area similarity is greater than a preset threshold, where each group of labeling data is a group of data to be cleaned.
The determination module 402 is configured to determine a new annotation region and a new annotation category for each set of data to be cleaned.
The cleaning processing module 403 is configured to perform cleaning processing on each set of data to be cleaned according to the new labeling area and the new labeling category of each set of data to be cleaned.
The apparatus provided in the embodiment of the present invention may be specifically used to perform the method embodiment provided in the first embodiment, and specific functions are not described herein.
According to the embodiment of the invention, at least one group of labeling data whose labeling-area similarity is greater than a preset threshold is obtained, each group being a group of data to be cleaned; a new labeling area and a new labeling category are determined for each group of data to be cleaned; and each group of data to be cleaned is cleaned according to its new labeling area and new labeling category. Repeated data and difference data in the labeling data can thus be identified and cleaned automatically, which improves the validity of the labeling data.
Example four
On the basis of the third embodiment, in this embodiment, the determining module is further configured to:
dividing the coverage area of all the labeling data of the group of data to be cleaned into a plurality of grids; calculating the labeling count of the group of data to be cleaned on each grid; and determining a connected region of the coverage area in which the labeling count is greater than the count threshold as the new labeling area of the group of data to be cleaned.
Optionally, the determining module is further configured to:
determining the maximum number of labeling times of each grid; the frequency threshold is determined according to the product of the maximum frequency and a preset threshold.
Optionally, the data acquisition module to be cleaned is further configured to:
calculating the similarity of the labeling areas of any two labeling data; determining at least one group of data to be cleaned according to the similarity of the marked areas of any two marked data; each group of data to be cleaned comprises at least two labeling data, and the similarity of labeling areas of any two labeling data in each group of data to be cleaned is larger than a preset threshold value.
Optionally, the determining module is further configured to:
if the group of data to be cleaned contains labeling data with different labeling categories, calculating the labeling probability comprehensive score of each labeling category corresponding to the group of data to be cleaned; and taking the labeling category with the largest labeling probability comprehensive score as the new labeling category of the group of data to be cleaned.
Optionally, the determining module is further configured to:
calculating the probability of the new labeling area corresponding to each labeling category through a classification model according to the new labeling area of the group of data to be cleaned; calculating the occurrence probability of each labeling category corresponding to the group of data to be cleaned; calculating the labeling probability comprehensive score corresponding to each labeling category by adopting the following formula:
$$p_o(c) = w_h \times p_h(c) + w_m \times p_m(c)$$
where $c$ represents a labeling category, $p_o(c)$ represents the labeling probability comprehensive score of labeling category $c$, $p_m(c)$ represents the probability of the new labeling area corresponding to labeling category $c$, $p_h(c)$ represents the occurrence probability of labeling category $c$ corresponding to the group of data to be cleaned, and $w_h$ and $w_m$ are weight coefficients.
Optionally, the determining module is further configured to:
and if the labeling categories of all the labeling data in the group of data to be cleaned are the same, taking the labeling category of the labeling data in the group of data to be cleaned as a new labeling category.
The apparatus provided in this embodiment of the present invention may be specifically used to execute the method embodiment provided in the second embodiment; its specific functions are not described again herein.
According to this embodiment of the invention, the coverage area of all labeling data of a group of data to be cleaned is divided into a plurality of grids, the number of labeling times of the group on each grid is calculated, and the connected region of the coverage area whose number of labeling times is greater than the threshold is determined as the new labeling area of the group. The area labeled more often within the group thus becomes the new labeling area, which improves the accuracy of the new labeling area. Further, the probability that the new labeling area corresponds to each labeling category is calculated through a classification model according to the new labeling area of the group, and the probability of occurrence of each labeling category in the group is calculated. The labeling probability comprehensive score of each labeling category is then calculated from these two probabilities, and the labeling category with the largest labeling probability comprehensive score is taken as the new labeling category of the group, which improves the accuracy of the new labeling category.
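The grid-based determination of the new labeling area can be sketched as follows. The sketch rests on several assumptions that are not stated in the text: labeling areas are integer boxes whose coordinates are already in grid-cell units, connectivity is 4-neighbour, and when several connected regions exceed the threshold the largest one is kept. `grid_counts`, `new_region`, and `ratio` (standing in for the preset threshold) are hypothetical names, and `boxes` is assumed to be non-empty.

```python
from collections import deque
from typing import List, Set, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in grid cells, x2 and y2 exclusive
Cell = Tuple[int, int]           # (x, y) of one grid cell


def grid_counts(boxes: List[Box]) -> Tuple[List[List[int]], int, int]:
    """Count, for every cell of the joint coverage area, how many boxes cover it."""
    x0 = min(b[0] for b in boxes)
    y0 = min(b[1] for b in boxes)
    width = max(b[2] for b in boxes) - x0
    height = max(b[3] for b in boxes) - y0
    counts = [[0] * width for _ in range(height)]
    for x1, y1, x2, y2 in boxes:
        for r in range(y1 - y0, y2 - y0):
            for c in range(x1 - x0, x2 - x0):
                counts[r][c] += 1
    return counts, x0, y0


def new_region(boxes: List[Box], ratio: float) -> Set[Cell]:
    """Largest 4-connected set of cells whose count exceeds max_count * ratio."""
    counts, x0, y0 = grid_counts(boxes)
    # Frequency threshold: product of the maximum count and the preset ratio.
    threshold = max(max(row) for row in counts) * ratio
    height, width = len(counts), len(counts[0])
    seen: Set[Tuple[int, int]] = set()
    best: Set[Tuple[int, int]] = set()
    for r in range(height):
        for c in range(width):
            if counts[r][c] <= threshold or (r, c) in seen:
                continue
            # Breadth-first search over neighbouring cells above the threshold.
            component = {(r, c)}
            queue = deque([(r, c)])
            seen.add((r, c))
            while queue:
                cr, cc = queue.popleft()
                for nr, nc in ((cr + 1, cc), (cr - 1, cc), (cr, cc + 1), (cr, cc - 1)):
                    if (0 <= nr < height and 0 <= nc < width
                            and (nr, nc) not in seen
                            and counts[nr][nc] > threshold):
                        seen.add((nr, nc))
                        component.add((nr, nc))
                        queue.append((nr, nc))
            if len(component) > len(best):
                best = component
    # Convert (row, col) offsets back to (x, y) cells in the original coordinates.
    return {(c + x0, r + y0) for r, c in best}
```

For three boxes `(0, 0, 4, 4)`, `(1, 1, 5, 5)`, and `(1, 1, 4, 4)` with `ratio=0.5`, the maximum count is 3, the threshold is 1.5, and only the 3×3 block covered by all three boxes survives.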
Example five
Fig. 5 is a schematic structural diagram of a labeling data processing device according to a fifth embodiment of the present invention. As shown in Fig. 5, the labeling data processing device 50 includes: a processor 501, a memory 502, and a computer program stored on the memory 502 and executable by the processor 501.
The processor 501, when executing the computer program stored on the memory 502, implements the labeling data processing method provided by any of the method embodiments described above.
According to this embodiment of the invention, at least one group of labeling data whose labeling-area similarity is greater than the preset threshold is obtained, each group of labeling data being a group of data to be cleaned; a new labeling area and a new labeling category are determined for each group of data to be cleaned; and each group of data to be cleaned is cleaned according to its new labeling area and new labeling category. Repeated data and differing data in the labeling data can thus be identified automatically and cleaned automatically, which improves the effectiveness of the labeling data.
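This passage does not spell out what cleaning does to the individual records. One plausible reading, assumed here purely for illustration, is that each group of data to be cleaned collapses into a single record carrying the new labeling area and new labeling category, while ungrouped labeling data are kept unchanged. The record layout and the function name `clean` are hypothetical.

```python
from typing import Dict, List

# Hypothetical record layout, e.g. {"region": ..., "category": ...}.
Annotation = Dict[str, object]


def clean(annotations: List[Annotation], groups: List[List[int]],
          merged: List[Annotation]) -> List[Annotation]:
    """Replace every group of similar annotations with its single merged record.

    groups[k] holds the indices of one group of data to be cleaned, and
    merged[k] is the record built from that group's new area and new category.
    """
    grouped = {i for group in groups for i in group}
    # Annotations outside every group are kept as they are.
    kept = [a for i, a in enumerate(annotations) if i not in grouped]
    return kept + merged
```

Under this assumption the cleaned set contains no duplicated or conflicting records for the same area.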
In addition, the embodiment of the invention also provides a computer readable storage medium which stores a computer program, and the computer program realizes the labeling data processing method provided by any one of the method embodiments when being executed by a processor.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division into units is merely a division by logical function, and other divisions are possible in an actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection via some interfaces, devices, or units, and may be electrical, mechanical, or of another form.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in hardware plus software functional units.
The integrated units implemented in the form of software functional units may be stored in a computer readable storage medium. A software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
It will be apparent to those skilled in the art that, for convenience and brevity of description, the above division into functional modules is used only as an example; in practical applications, the functions described above may be allocated to different functional modules as needed, that is, the internal structure of the apparatus may be divided into different functional modules to perform all or part of the functions described above. For the specific working process of the apparatus described above, reference may be made to the corresponding process in the foregoing method embodiments, which is not described again herein.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It is to be understood that the invention is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (8)

1. A labeling data processing method, comprising:
acquiring at least one group of labeling data whose labeling-area similarity is greater than a preset threshold, wherein each group of labeling data is a group of data to be cleaned;
determining a new labeling area and a new labeling category of each group of data to be cleaned;
cleaning each group of data to be cleaned according to the new labeling area and the new labeling category of that group;
wherein determining the new labeling category of each group of data to be cleaned comprises:
if any two labeling data in the group of data to be cleaned have different labeling categories, calculating, through a classification model and according to the new labeling area of the group of data to be cleaned, the probability that the new labeling area corresponds to each labeling category;
calculating the occurrence probability of each labeling category corresponding to the group of data to be cleaned;
calculating the labeling probability comprehensive score corresponding to each labeling category by adopting the following formula:
$p_o(c) = w_h \times p_h(c) + w_m \times p_m(c)$
wherein $c$ represents a labeling category, $p_o(c)$ represents the labeling probability comprehensive score of labeling category $c$, $p_m(c)$ represents the probability that the new labeling area corresponds to labeling category $c$, $p_h(c)$ represents the probability of occurrence of labeling category $c$ in the group of data to be cleaned, and $w_h$ and $w_m$ are weight coefficients;
and taking the labeling category with the largest labeling probability comprehensive score as the new labeling category of the group of data to be cleaned.
2. The method according to claim 1, wherein acquiring at least one group of labeling data whose labeling-area similarity is greater than a preset threshold, each group of labeling data being a group of data to be cleaned, comprises:
calculating the similarity between the labeling areas of any two labeling data;
determining at least one group of data to be cleaned according to the similarity between the labeling areas of any two labeling data;
wherein each group of data to be cleaned comprises at least two labeling data, and the similarity between the labeling areas of any two labeling data in each group of data to be cleaned is greater than the preset threshold.
3. The method according to claim 1 or 2, wherein determining the new labeling area of a group of data to be cleaned comprises:
dividing the coverage area of all labeling data of the group of data to be cleaned into a plurality of grids;
calculating the number of labeling times of the group of data to be cleaned on each grid;
and determining the connected region in the coverage area whose number of labeling times is greater than a frequency threshold as the new labeling area of the group of data to be cleaned.
4. The method according to claim 3, wherein before determining the connected region in the coverage area whose number of labeling times is greater than the frequency threshold as the new labeling area of the group of data to be cleaned, the method further comprises:
determining the maximum number of labeling times among the grids;
and determining the frequency threshold according to the product of the maximum number of labeling times and the preset threshold.
5. The method according to claim 1, wherein determining the new labeling category of a group of data to be cleaned comprises:
and if the labeling categories of all the labeling data in the group of data to be cleaned are the same, taking the labeling category of the labeling data in the group of data to be cleaned as a new labeling category.
6. A labeling data processing apparatus, comprising:
a to-be-cleaned data acquisition module, configured to acquire at least one group of labeling data whose labeling-area similarity is greater than a preset threshold, each group of labeling data being a group of data to be cleaned;
a determining module, configured to determine a new labeling area and a new labeling category of each group of data to be cleaned;
a cleaning processing module, configured to clean each group of data to be cleaned according to the new labeling area and the new labeling category of that group;
wherein the determining module is further configured to:
if any two labeling data in the group of data to be cleaned have different labeling categories, calculating, through a classification model and according to the new labeling area of the group of data to be cleaned, the probability that the new labeling area corresponds to each labeling category;
calculating the occurrence probability of each labeling category corresponding to the group of data to be cleaned;
calculating the labeling probability comprehensive score corresponding to each labeling category by adopting the following formula:
$p_o(c) = w_h \times p_h(c) + w_m \times p_m(c)$
wherein $c$ represents a labeling category, $p_o(c)$ represents the labeling probability comprehensive score of labeling category $c$, $p_m(c)$ represents the probability that the new labeling area corresponds to labeling category $c$, $p_h(c)$ represents the probability of occurrence of labeling category $c$ in the group of data to be cleaned, and $w_h$ and $w_m$ are weight coefficients;
and taking the labeling category with the largest labeling probability comprehensive score as the new labeling category of the group of data to be cleaned.
7. A labeling data processing device, comprising:
a memory, a processor, and a computer program stored on the memory and executable on the processor,
wherein the processor, when running the computer program, implements the method according to any one of claims 1-5.
8. A computer-readable storage medium, in which a computer program is stored,
wherein the computer program, when executed by a processor, implements the method according to any one of claims 1-5.
CN201811313048.2A 2018-11-06 2018-11-06 Labeling data processing method, device, equipment and computer readable storage medium Active CN111143333B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811313048.2A CN111143333B (en) 2018-11-06 2018-11-06 Labeling data processing method, device, equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811313048.2A CN111143333B (en) 2018-11-06 2018-11-06 Labeling data processing method, device, equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN111143333A CN111143333A (en) 2020-05-12
CN111143333B true CN111143333B (en) 2023-06-09

Family

ID=70516499

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811313048.2A Active CN111143333B (en) 2018-11-06 2018-11-06 Labeling data processing method, device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111143333B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005346376A (en) * 2004-06-02 2005-12-15 Fuji Xerox Co Ltd Document processor, document processing method and document processing program
CN108268575A (en) * 2017-01-04 2018-07-10 阿里巴巴集团控股有限公司 Processing method, the device and system of markup information
CN108509969A (en) * 2017-09-06 2018-09-07 腾讯科技(深圳)有限公司 Data mask method and terminal

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050210038A1 (en) * 2004-03-18 2005-09-22 International Business Machines Corporation Method for providing workflow functionality and tracking in an annotation subsystem
US20100011282A1 (en) * 2008-07-11 2010-01-14 iCyte Pty Ltd. Annotation system and method
US9626348B2 (en) * 2011-03-11 2017-04-18 Microsoft Technology Licensing, Llc Aggregating document annotations
US8935265B2 (en) * 2011-08-30 2015-01-13 Abbyy Development Llc Document journaling
US10380235B2 (en) * 2015-09-01 2019-08-13 Branchfire, Inc. Method and system for annotation and connection of electronic documents
US11941344B2 (en) * 2016-09-29 2024-03-26 Dropbox, Inc. Document differences analysis and presentation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Gao Liangcai et al. A Sequence Labeling Based Approach for Character Segmentation of Historical Documents. 2018 13th IAPR International Workshop on Document Analysis Systems. 2018, entire document. *

Also Published As

Publication number Publication date
CN111143333A (en) 2020-05-12

Similar Documents

Publication Publication Date Title
CN111190939B (en) User portrait construction method and device
US10713532B2 (en) Image recognition method and apparatus
JP4545641B2 (en) Similar image retrieval method, similar image retrieval system, similar image retrieval program, and recording medium
US9235758B1 (en) Robust method to find layout similarity between two documents
CN107729935B (en) The recognition methods of similar pictures and device, server, storage medium
CN107786943B (en) User grouping method and computing device
WO2019137185A1 (en) Image screening method and apparatus, storage medium and computer device
CN111126262B (en) Video highlight detection method and device based on graphic neural network
CN109117742B (en) Gesture detection model processing method, device, equipment and storage medium
CN110414502B (en) Image processing method and device, electronic equipment and computer readable medium
CN107729416B (en) Book recommendation method and system
CN111191652A (en) Certificate image identification method and device, electronic equipment and storage medium
CN111191454A (en) Entity matching method and device
CN111860484A (en) Region labeling method, device, equipment and storage medium
CN113902856B (en) Semantic annotation method and device, electronic equipment and storage medium
CN115222443A (en) Client group division method, device, equipment and storage medium
CN104966109A (en) Medical laboratory report image classification method and apparatus
CN111143333B (en) Labeling data processing method, device, equipment and computer readable storage medium
CN110442674B (en) Label propagation clustering method, terminal equipment, storage medium and device
CN111783561A (en) Picture examination result correction method, electronic equipment and related products
CN110941638B (en) Application classification rule base construction method, application classification method and device
CN103377381A (en) Method and device for identifying content attribute of image
CN114781517A (en) Risk identification method and device and terminal equipment
CN111177450B (en) Image retrieval cloud identification method and system and computer readable storage medium
CN113139539A (en) Method and device for detecting characters of arbitrary-shaped scene with asymptotic regression boundary

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230627

Address after: 3007, Hengqin International Financial Center Building, No. 58 Huajin Street, Hengqin New District, Zhuhai City, Guangdong Province, 519030

Patentee after: New founder holdings development Co.,Ltd.

Patentee after: Peking University

Address before: 100871, Beijing, Haidian District, Cheng Fu Road, No. 298, Zhongguancun Fangzheng building, 9 floor

Patentee before: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.

Patentee before: PKU FOUNDER INFORMATION INDUSTRY GROUP CO.,LTD.

Patentee before: Peking University

TR01 Transfer of patent right

Effective date of registration: 20240327

Address after: 100871 No. 5, the Summer Palace Road, Beijing, Haidian District

Patentee after: Peking University

Country or region after: China

Address before: 3007, Hengqin International Financial Center Building, No. 58 Huajin Street, Hengqin New District, Zhuhai City, Guangdong Province, 519030

Patentee before: New founder holdings development Co.,Ltd.

Country or region before: China

Patentee before: Peking University
