CN111143333A - Method, device and equipment for processing labeled data and computer readable storage medium - Google Patents

Publication number
CN111143333A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811313048.2A
Other languages
Chinese (zh)
Other versions
CN111143333B (en
Inventor
黄铭哲
颜钦钦
高良才
汤帜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Pku Founder Information Industry Group Co ltd
Peking University
Peking University Founder Group Co Ltd

Abstract

The embodiments of the invention provide a method, a device, and equipment for processing labeled data, and a computer-readable storage medium. The method obtains at least one group of labeled data whose labeling-area similarity is greater than a preset threshold, each group being a group of data to be cleaned; determines a new labeling area and a new labeling category for each group of data to be cleaned; and cleans each group according to its new labeling area and new labeling category. Repeated data and difference data in the labeled data are thereby identified and cleaned automatically, which improves the validity of the labeled data.

Description

Method, device and equipment for processing labeled data and computer readable storage medium
Technical Field
The embodiment of the invention relates to the technical field of digital publishing, in particular to a method, a device and equipment for processing labeled data and a computer readable storage medium.
Background
With the continuous and deep research on digital information resources and the increasingly wide application of deep learning in the field of digital publishing, data marking becomes a very important task. The accuracy and efficiency of page object labeling also become factors restricting the effect of the deep learning model.
At present, a page labeling system provides a page labeling function and stores each user's labeling data for page objects in a database. On complex pages, however, page objects may overlap, nest, or contain one another, so labeling the same page several times produces repeated data and difference data; for example, the position and size of the labeling area, or the labeling category, may deviate between annotations. In addition, when multiple users label the same page, different users label the same page object differently, which also produces a large amount of repeated data and difference data.
Disclosure of Invention
The embodiment of the invention provides a method, a device and equipment for processing labeled data and a computer readable storage medium, which are used for solving the problem that a large amount of difference data and repeated data exist in labeled data of a page labeling system.
One aspect of the embodiments of the present invention is to provide a method for processing annotation data, including:
acquiring at least one group of labeled data whose labeling-area similarity is greater than a preset threshold, wherein each group of labeled data is a group of data to be cleaned;
determining a new labeling area and a new labeling category of each group of data to be cleaned; and
cleaning each group of data to be cleaned according to the new labeling area and the new labeling category of each group of data to be cleaned.
Another aspect of the embodiments of the present invention is to provide an annotation data processing apparatus, including:
a to-be-cleaned data acquisition module, configured to acquire at least one group of labeled data whose labeling-area similarity is greater than a preset threshold, wherein each group of labeled data is a group of data to be cleaned;
a determining module, configured to determine a new labeling area and a new labeling category of each group of data to be cleaned; and
a cleaning processing module, configured to clean each group of data to be cleaned according to the new labeling area and the new labeling category of each group of data to be cleaned.
Another aspect of the embodiments of the present invention is to provide annotation data processing equipment, including:
a memory, a processor, and a computer program stored on the memory and executable on the processor,
wherein the processor implements any of the annotation data processing methods described above when it runs the computer program.
Another aspect of the embodiments of the present invention is to provide a computer-readable storage medium storing a computer program,
wherein the computer program implements the annotation data processing method described above when executed by a processor.
According to the method, device, equipment, and computer-readable storage medium for processing labeled data provided by the embodiments of the invention, at least one group of labeled data whose labeling-area similarity is greater than a preset threshold is obtained, each group being a group of data to be cleaned; a new labeling area and a new labeling category are determined for each group of data to be cleaned; and each group is cleaned according to its new labeling area and new labeling category. Repeated data and difference data in the labeled data are thereby identified and cleaned automatically, which improves the validity of the labeled data.
Drawings
FIG. 1 is a flowchart of a method for processing annotated data according to an embodiment of the present invention;
FIG. 2 is a flowchart of a labeled data processing method according to a second embodiment of the present invention;
FIG. 3 is a flowchart of another annotated data processing method according to the second embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an annotation data processing apparatus according to a third embodiment of the present invention;
FIG. 5 is a schematic structural diagram of annotation data processing equipment according to a fifth embodiment of the present invention.
The above drawings show specific embodiments of the invention, which are described in more detail below. The drawings and the written description are not intended to limit the scope of the inventive concept in any way, but to illustrate the inventive concept to those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with embodiments of the invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of embodiments of the invention, as detailed in the following claims.
The terms "first", "second", etc. referred to in the embodiments of the present invention are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. In the description of the following examples, "plurality" means two or more unless specifically limited otherwise.
The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present invention will be described below with reference to the accompanying drawings.
Example one
Fig. 1 is a flowchart of the annotation data processing method according to the first embodiment of the present invention. This embodiment provides an annotation data processing method that addresses the large amount of difference data and repeated data in the labeled data of a page labeling system. The method is applied to annotation data processing equipment, which may be a mobile terminal such as a smartphone or tablet computer, or a server; in other embodiments the method may also be applied to other devices, and this embodiment uses the annotation data processing equipment as an illustrative example. As shown in fig. 1, the method comprises the following steps:
Step S101: obtain at least one group of labeled data whose labeling-area similarity is greater than a preset threshold, each group of labeled data being a group of data to be cleaned.
The labeling data at least comprises identification information, labeling categories, labeling areas and the like. The annotation data may also include annotation page object information and other information. The information included in the annotation data can be different for different page annotation systems.
The identification information of the annotation data may be information such as a number for uniquely identifying one piece of annotation data, and may be used as an index of the annotation data.
The type of the labeled page object may include a header, a footer, a title, a body paragraph, a formula, a table, and the like, and the type of the page object is not specifically limited in this embodiment.
In practical application, after a user marks a certain page object in a page marking system once, the page marking system generates corresponding marking data with unique identification information for each marking, and stores the marking data into a database.
For some complex pages, page objects have the conditions of overlapping, nesting, mutual inclusion and the like, and repeated data and differential data exist in labeled data labeled for multiple times on the same page; for example, there may be a deviation in the position and size of the label area, the label type, and the like. In addition, when multiple users repeatedly mark the same page, because different users have differences in the marked data marked on the same page object, a large amount of repeated data and difference data also exist in the marked data.
In this embodiment, the similarity of the labeling areas of every pair of labeled data in the database is calculated, and one or more groups of labeled data whose labeling-area similarity is greater than a preset threshold are obtained. Each group is either repeated data (highly similar labeling areas and the same labeling category) or difference data (highly similar labeling areas but different labeling categories), and in both cases is data to be cleaned.
Specifically, one possible implementation of this step is as follows:
Calculate the similarity of the labeling areas of every pair of labeled data, and determine at least one group of data to be cleaned from these similarities. Each group of data to be cleaned contains at least two pieces of labeled data, and the similarity of the labeling areas of any two pieces of labeled data in a group is greater than the preset threshold.
Optionally, the similarity of two labeling areas may be the ratio of the area of their overlap to the total area they cover, where the covered area is the union of the two labeling areas (that is, intersection over union).
The preset threshold is a value greater than 0 and less than 1, and the preset threshold may be set by a technician according to actual needs, which is not specifically limited in this embodiment.
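The similarity measure and grouping described above can be sketched as follows. This is a minimal illustration, assuming each labeling area is an axis-aligned rectangle `(x1, y1, x2, y2)`; the rectangle representation, the function names `region_similarity` and `group_annotations`, and the greedy grouping strategy are assumptions made for the example, not details given in the source.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def region_similarity(a: Box, b: Box) -> float:
    """Overlap area divided by the area of the union of the two boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def group_annotations(boxes: Dict[str, Box], threshold: float) -> List[List[str]]:
    """Greedy grouping (an assumption): an annotation joins a group only if its
    similarity to every current member exceeds the threshold. Groups with
    fewer than two members are not data to be cleaned and are dropped."""
    groups: List[List[str]] = []
    for ann_id, box in boxes.items():
        for group in groups:
            if all(region_similarity(box, boxes[m]) > threshold for m in group):
                group.append(ann_id)
                break
        else:
            groups.append([ann_id])
    return [g for g in groups if len(g) >= 2]


# Hypothetical annotations keyed by identification information.
boxes = {
    "a1": (0, 0, 10, 10),
    "a2": (1, 0, 10, 10),
    "a3": (50, 50, 60, 60),
}
groups = group_annotations(boxes, threshold=0.8)
```

Here `a1` and `a2` overlap heavily and form one group of data to be cleaned, while `a3` stays untouched. A greedy pass is only one way to satisfy the pairwise condition; the result can depend on the order in which annotations are visited.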
Step S102: determine a new labeling area and a new labeling category for each group of data to be cleaned.
After one or more groups of data to be cleaned are determined, a new labeling area and a new labeling category are determined for each group.
For any group of data to be cleaned, if all labeled data in the group have the same labeling category, the group is repeated data. For a group of repeated data, only the new labeling area needs to be determined; the original labeling category is retained.
If the group contains labeled data with different labeling categories, the group is difference data. For a group of difference data, both a new labeling area and a new labeling category need to be determined.
Step S103: clean each group of data to be cleaned according to its new labeling area and new labeling category.
For any group of data to be cleaned, once its new labeling area and new labeling category are determined, the group is cleaned in the database.
Specifically, a new piece of labeled data corresponding to the group may be added to the database. Its labeling area and labeling category are the group's new labeling area and new labeling category, and its other information may be obtained by merging and sorting the other information of all labeled data in the group. The group of data to be cleaned is then deleted from the database, so that the repeated data and difference data in the database are cleaned.
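The cleaning step can be sketched as follows. This is a minimal illustration that assumes the database is an in-memory dictionary keyed by identification information; the record fields (`region`, `category`, `annotator`), the function name `clean_group`, and the choice to merge only annotator names are hypothetical.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # assumed region format: (x1, y1, x2, y2)


def clean_group(db: Dict[str, dict], group_ids: List[str], new_id: str,
                new_region: Box, new_category: str) -> None:
    """Replace one group of data to be cleaned with a single merged record."""
    # Merge the "other information" of the group; here, only annotator names.
    annotators = sorted({db[i]["annotator"] for i in group_ids})
    # Delete the group of data to be cleaned from the database.
    for i in group_ids:
        del db[i]
    # Add the new labeled data with the new labeling area and category.
    db[new_id] = {
        "region": new_region,
        "category": new_category,
        "annotators": annotators,
    }


# Hypothetical database contents.
db = {
    "a1": {"region": (0, 0, 10, 10), "category": "table", "annotator": "u1"},
    "a2": {"region": (1, 0, 10, 10), "category": "table", "annotator": "u2"},
    "a3": {"region": (50, 50, 60, 60), "category": "title", "annotator": "u1"},
}
clean_group(db, ["a1", "a2"], "a4", (1, 0, 10, 10), "table")
```

After the call, the two near-duplicate records are gone and a single record `a4` carries the new labeling area and category. A real system would perform the insert and the deletes in one database transaction.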
This embodiment obtains at least one group of labeled data whose labeling-area similarity is greater than a preset threshold, each group being a group of data to be cleaned; determines a new labeling area and a new labeling category for each group of data to be cleaned; and cleans each group according to its new labeling area and new labeling category. Repeated data and difference data in the labeled data are thereby identified and cleaned automatically, which improves the validity of the labeled data.
Example two
FIG. 2 is a flowchart of an annotation data processing method according to the second embodiment of the present invention; FIG. 3 is a flowchart of another annotation data processing method according to the second embodiment of the invention. Building on the first embodiment, in this embodiment, as shown in fig. 2, determining the new labeling area of a group of data to be cleaned comprises the following steps:
Step S201: divide the coverage area of all labeled data in the group of data to be cleaned into a plurality of grids.
In this embodiment, to determine the new labeling area of the group of data to be cleaned, the coverage area of all labeled data in the group may be divided into a plurality of grids.
In addition, because the labeling areas of all labeled data in the group are highly similar, all labeled data in the group lie on the same page; the entire page may therefore be divided into evenly distributed grids, for example 1000 x 2000 grids.
Step S202: calculate the number of times the group of data to be cleaned labels each grid.
In this embodiment, the number of times each grid is labeled is calculated from the grids covered by the labeling area of each piece of labeled data in the group.
Specifically, after the page corresponding to the group of data to be cleaned is divided into grids, the labeling count of every grid is initialized to 0. Each piece of labeled data in the group is then processed in turn, and the labeling count of every grid covered by its labeling area is increased by 1. Once all labeled data in the group have been processed, the labeling count of each grid is obtained.
Optionally, when a piece of labeled data is processed, the grids covered by its labeling area may be determined from the coordinates of the labeling area projected onto the page.
Optionally, a data structure such as a two-dimensional vector may be used to record the labeling counts of all grids.
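The counting procedure above can be sketched as follows. This is a minimal illustration that assumes each labeling area has already been projected onto the grid as an end-exclusive rectangle `(col0, row0, col1, row1)`; that representation and the function name `count_labels` are assumptions for the example.

```python
from typing import List, Tuple

GridBox = Tuple[int, int, int, int]  # assumed: (col0, row0, col1, row1), end-exclusive


def count_labels(boxes: List[GridBox], n_rows: int, n_cols: int) -> List[List[int]]:
    """Return a 2D list whose entry [r][c] is the number of labeling areas
    that cover grid (r, c)."""
    # Initialize the labeling count of every grid to 0.
    counts = [[0] * n_cols for _ in range(n_rows)]
    for c0, r0, c1, r1 in boxes:
        # Add 1 to every grid covered by this labeling area (clipped to the page).
        for r in range(max(r0, 0), min(r1, n_rows)):
            for c in range(max(c0, 0), min(c1, n_cols)):
                counts[r][c] += 1
    return counts


# Two hypothetical labeling areas on a 2 x 4 grid.
counts = count_labels([(0, 0, 3, 2), (1, 0, 4, 2)], n_rows=2, n_cols=4)
```

The grids in columns 1 and 2 are covered by both labeling areas and receive a count of 2; the outer columns receive a count of 1.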
Step S203: determine a connected region of the coverage area in which the labeling count is greater than a count threshold, and take it as the new labeling area of the group of data to be cleaned.
In this embodiment, after the labeling count of each grid is obtained, the maximum labeling count over all grids is determined by comparing the counts, and the count threshold is determined from this maximum and a preset threshold.
Optionally, the count threshold may be equal to the product of the maximum labeling count and the preset threshold.
Alternatively, the count threshold may be set by a technician according to actual needs, and this embodiment is not specifically limited herein.
After the count threshold is determined, the grids whose labeling count is greater than the count threshold are identified among all grids corresponding to the group of data to be cleaned, and a connected region formed by these grids is taken as the new labeling area of the group.
Optionally, if the grids whose labeling count is greater than the count threshold form several connected regions, the connected region with the largest area is taken as the new labeling area of the group of data to be cleaned.
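Selecting the new labeling area from the grid counts can be sketched as follows. This is a minimal illustration; the use of 4-connectivity, the breadth-first search, and the conversion of the selected grids to a bounding rectangle in `bounding_box` are assumptions for the example, since the source does not specify them.

```python
from collections import deque
from typing import List, Set, Tuple

Cell = Tuple[int, int]  # (row, col)


def new_region_cells(counts: List[List[int]], preset_threshold: float) -> Set[Cell]:
    """Return the grids of the largest 4-connected region whose labeling count
    exceeds max_count * preset_threshold."""
    n_rows, n_cols = len(counts), len(counts[0])
    count_threshold = max(max(row) for row in counts) * preset_threshold
    seen: Set[Cell] = set()
    best: Set[Cell] = set()
    for r in range(n_rows):
        for c in range(n_cols):
            if (r, c) in seen or counts[r][c] <= count_threshold:
                continue
            # Breadth-first search over one connected region.
            region: Set[Cell] = {(r, c)}
            queue = deque([(r, c)])
            seen.add((r, c))
            while queue:
                cr, cc = queue.popleft()
                for nr, nc in ((cr + 1, cc), (cr - 1, cc), (cr, cc + 1), (cr, cc - 1)):
                    if (0 <= nr < n_rows and 0 <= nc < n_cols
                            and (nr, nc) not in seen
                            and counts[nr][nc] > count_threshold):
                        seen.add((nr, nc))
                        region.add((nr, nc))
                        queue.append((nr, nc))
            # Keep the connected region with the largest area.
            if len(region) > len(best):
                best = region
    return best


def bounding_box(cells: Set[Cell]) -> Tuple[int, int, int, int]:
    """End-exclusive (col0, row0, col1, row1) rectangle around non-empty cells."""
    rows = [r for r, _ in cells]
    cols = [c for _, c in cells]
    return (min(cols), min(rows), max(cols) + 1, max(rows) + 1)


counts = [[1, 2, 2, 1], [1, 2, 2, 1]]  # hypothetical labeling counts
region = new_region_cells(counts, preset_threshold=0.8)
box = bounding_box(region)
```

With a maximum count of 2 and a preset threshold of 0.8, the count threshold is 1.6, so only the grids labeled twice are kept.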
This embodiment divides the coverage area of all labeled data in the group of data to be cleaned into a plurality of grids, calculates the number of times the group labels each grid, and takes a connected region of the coverage area whose labeling count is greater than the count threshold as the new labeling area of the group. The region labeled most often by the labeled data in the group thus becomes the new labeling area, which improves the accuracy of the new labeling area.
In another implementation of this embodiment, determining the new labeling category of a group of data to be cleaned covers the following two cases:
The first case: if all labeled data in the group of data to be cleaned have the same labeling category, that labeling category is taken as the new labeling category.
In the first case, the labeling areas of all labeled data in the group are highly similar and their labeling categories are the same, which indicates that the group is repeated data. Only the new labeling area of the group needs to be determined, and the original labeling category is retained.
The second case: if the group of data to be cleaned contains labeled data with different labeling categories, a labeling probability composite score is calculated for each labeling category that occurs in the group, and the labeling category with the highest labeling probability composite score is taken as the new labeling category of the group.
In the second case, the labeling areas of all labeled data in the group are highly similar but their labeling categories differ, which indicates that the group is difference data. Both the new labeling area and the new labeling category of the group need to be determined.
As shown in fig. 3, the specific steps for determining the new labeling category of a group of data to be cleaned are as follows:
Step S301: according to the new labeling area of the group of data to be cleaned, calculate through a classification model the probability that the new labeling area corresponds to each labeling category.
In this embodiment, a preset classification model may be used to classify the new labeling area, which yields the probability that the labeled data corresponding to the new labeling area belongs to each labeling category.
The preset classification model may be a classifier that identifies the category of the page object in a specified area. The position information of the new labeling area is input into the classification model, which calculates and outputs the probability that the page object in the new labeling area belongs to each labeling category.
Step S302: calculate the occurrence probability of each labeling category in the group of data to be cleaned.
In this embodiment, the occurrence probability of any labeling category in the group of data to be cleaned may be calculated using formula two:
$$p_h(c) = \frac{n_c}{N} \qquad \text{(formula two)}$$
where $c$ denotes a labeling category, $p_h(c)$ denotes the occurrence probability of labeling category $c$ in the group of data to be cleaned, $n_c$ denotes the number of times the group labels the page object as category $c$, and $N$ denotes the total number of times the group labels the page object.
That is, $n_c$ is the number of pieces of labeled data in the group whose labeling category is $c$.
The labeling category $c$ may be any labeling category, for example one of header, footer, title, text paragraph, formula, table, and picture.
In addition, the occurrence probabilities of all labeling categories in the group sum to 1, as expressed by formula three:
$$\sum_{c \in \mathrm{Category}} p_h(c) = 1 \qquad \text{(formula three)}$$
where $c$ denotes a labeling category, $p_h(c)$ denotes the occurrence probability of labeling category $c$ in the group of data to be cleaned, and Category is the set of all labeling categories that occur in the group.
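The occurrence probability can be computed directly from the labeling categories of a group. The following is a minimal sketch; the function name `occurrence_probabilities` and the sample categories are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def occurrence_probabilities(categories: List[str]) -> Dict[str, float]:
    """Occurrence probability of each labeling category: n_c / N."""
    counts = Counter(categories)   # n_c for every category c in the group
    total = len(categories)        # N, the number of labeled data in the group
    return {c: n / total for c, n in counts.items()}


# Labeling categories of four hypothetical annotations of the same page object.
p_h = occurrence_probabilities(["table", "table", "formula", "table"])
```

Three of the four annotations say `table`, so its occurrence probability is 0.75, and the probabilities of all categories in the group sum to 1.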
Step S303: calculate the labeling probability composite score of each labeling category from the probability that the new labeling area corresponds to that category and the occurrence probability of that category in the group of data to be cleaned.
Specifically, the labeling probability composite score of each labeling category is calculated using formula one:
$$p_o(c) = w_h \times p_h(c) + w_m \times p_m(c) \qquad \text{(formula one)}$$
where $c$ denotes a labeling category, $p_o(c)$ denotes the labeling probability composite score of labeling category $c$, $p_m(c)$ denotes the probability that the new labeling area corresponds to labeling category $c$, $p_h(c)$ denotes the occurrence probability of labeling category $c$ in the group of data to be cleaned, and $w_h$ and $w_m$ are weight coefficients.
In this embodiment, $w_h$ is the weight coefficient for the labeling categories assigned manually to the page object, as represented by the group of data to be cleaned, and $w_m$ is the weight coefficient for the labeling category assigned to the page object by the classification model. $w_h$ and $w_m$ may be set by a technician according to actual needs, and this embodiment is not specifically limited herein.
Step S304: take the labeling category with the highest labeling probability composite score as the new labeling category of the group of data to be cleaned.
After the labeling probability composite score of each labeling category is calculated, the scores are compared, and the labeling category with the highest score is determined as the new labeling category of the group of data to be cleaned.
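Steps S303 and S304 can be sketched as follows. This is a minimal illustration; the function name `composite_scores`, the equal weights, and the sample probabilities are made-up inputs for the example, and the classification model that would produce `p_m` is not shown.

```python
from typing import Dict


def composite_scores(p_h: Dict[str, float], p_m: Dict[str, float],
                     w_h: float, w_m: float) -> Dict[str, float]:
    """Composite score per category: w_h * p_h(c) + w_m * p_m(c)."""
    categories = set(p_h) | set(p_m)
    # A category missing from one source contributes probability 0 there.
    return {c: w_h * p_h.get(c, 0.0) + w_m * p_m.get(c, 0.0) for c in categories}


p_h = {"table": 0.75, "formula": 0.25}  # occurrence probabilities from annotators
p_m = {"table": 0.4, "formula": 0.6}    # hypothetical classification model output
scores = composite_scores(p_h, p_m, w_h=0.5, w_m=0.5)

# The category with the highest composite score is the new labeling category.
new_category = max(scores, key=scores.get)
```

In this example the model prefers `formula`, but the annotators' strong agreement on `table` outweighs it under equal weights. Raising `w_m` relative to `w_h` would shift the decision toward the model.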
This embodiment provides a specific way to determine the new labeling category of difference data. The probability that the new labeling area corresponds to each labeling category is calculated through a classification model; the occurrence probability of each labeling category in the group of data to be cleaned is calculated; a labeling probability composite score is calculated for each labeling category from these two probabilities; and the labeling category with the highest composite score is taken as the new labeling category of the group. This improves the accuracy of the new labeling category.
Example three
Fig. 4 is a schematic structural diagram of an annotation data processing apparatus according to a third embodiment of the present invention. The annotation data processing apparatus provided in this embodiment can execute the processing flow of the annotation data processing method embodiments. As shown in fig. 4, the apparatus 40 includes a to-be-cleaned data acquisition module 401, a determining module 402, and a cleaning processing module 403.
Specifically, the to-be-cleaned data obtaining module 401 is configured to obtain at least one set of labeled data with the labeled area similarity greater than a preset threshold, where each set of labeled data is a set of to-be-cleaned data.
The determining module 402 is configured to determine a new labeled area and a new labeled category of each set of data to be cleaned.
The cleaning processing module 403 is configured to perform cleaning processing on each set of data to be cleaned according to the new labeled area and the new labeled category of each set of data to be cleaned.
The apparatus provided in the embodiment of the present invention may be specifically configured to execute the method embodiment provided in the first embodiment, and specific functions are not described herein again.
This embodiment obtains at least one group of labeled data whose labeling-area similarity is greater than a preset threshold, each group being a group of data to be cleaned; determines a new labeling area and a new labeling category for each group of data to be cleaned; and cleans each group according to its new labeling area and new labeling category. Repeated data and difference data in the labeled data are thereby identified and cleaned automatically, which improves the validity of the labeled data.
Example four
On the basis of the third embodiment, in this embodiment, the determining module is further configured to:
dividing the coverage area of all the labeled data of the group of data to be cleaned into a plurality of grids; calculating the labeling times of the group of data to be cleaned to each grid; and determining a connected region with the labeling times larger than the time threshold value in the coverage area as a new labeling region of the group of data to be cleaned.
Optionally, the determining module is further configured to:
determining the maximum times in the labeling times of each grid; and determining a time threshold value according to the product of the maximum time and a preset threshold value.
Optionally, the data acquiring module to be cleaned is further configured to:
calculating the similarity of the labeling areas of any two labeling data; determining at least one group of data to be cleaned according to the similarity of the labeling areas of any two labeling data; each group of data to be cleaned comprises at least two marking data, and the similarity of the marking areas of any two marking data in each group of data to be cleaned is greater than a preset threshold value.
Optionally, the determining module is further configured to:
if the labeling types of any two labeling data in the group of data to be cleaned are different, calculating a labeling probability comprehensive score corresponding to each labeling type corresponding to the group of data to be cleaned; and taking the labeling category with the maximum comprehensive labeling probability as a new labeling category of the group of data to be cleaned.
Optionally, the determining module is further configured to:
calculating the probability of each labeling category corresponding to the new labeling area through a classification model according to the new labeling area of the group of data to be cleaned; calculating the occurrence probability of each label category corresponding to the group of data to be cleaned; and calculating the marking probability comprehensive score corresponding to each marking category by adopting the following formula:
po(c)=wh×ph(c)+wm×pm(c)
wherein c represents a label category, po(c) Label probability composite score, p, representing label category cm(c) Indicates the probability, p, of the new label region corresponding to the label category ch(c) Representing the probability of occurrence of the label category c corresponding to the group of data to be cleaned, wh、wmAre weight coefficients.
Optionally, the determining module is further configured to:
if the labeling categories of all the labeled data in the group of data to be cleaned are the same, taking that labeling category as the new labeling category.
The apparatus provided in the embodiment of the present invention may be specifically configured to execute the method embodiment provided in the second embodiment, and specific functions are not described herein again.
The embodiment of the invention divides the coverage area of all the labeled data of the group of data to be cleaned into a plurality of grids, calculates the labeling times of the group of data to be cleaned on each grid, and determines the connected region of the coverage area whose labeling times are greater than the times threshold value as the new labeling area of the group of data to be cleaned. The area labeled most often by the labeled data in the group is thereby used as the new labeling area, which improves the accuracy of the new labeling area. Further, the probability that the new labeling area corresponds to each labeling category is calculated through a classification model, the occurrence probability of each labeling category corresponding to the group of data to be cleaned is calculated, and the labeling probability comprehensive score corresponding to each labeling category is calculated from these two probabilities. The labeling category with the maximum labeling probability comprehensive score is taken as the new labeling category of the group of data to be cleaned, which improves the accuracy of the new labeling category.
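The grid-based determination of the new labeling area summarized above can be sketched as follows. This is a minimal sketch under stated assumptions: the labeling areas are assumed to be rectangles aligned to integer grid cells, 4-connectivity is assumed, and when several connected regions exceed the threshold the largest one is chosen. The names `Box`, `Cell`, `new_labeling_area` and `ratio` (the preset threshold value) are hypothetical.

```python
from collections import deque
from typing import List, Set, Tuple

Box = Tuple[int, int, int, int]  # hypothetical: (x_min, y_min, x_max, y_max), max exclusive
Cell = Tuple[int, int]           # (x, y) index of one grid cell


def new_labeling_area(boxes: List[Box], ratio: float) -> Set[Cell]:
    """Cells of the largest connected region labeled more often than the times threshold."""
    x0 = min(b[0] for b in boxes)
    y0 = min(b[1] for b in boxes)
    cols = max(b[2] for b in boxes) - x0
    rows = max(b[3] for b in boxes) - y0
    if rows <= 0 or cols <= 0:
        return set()

    # Labeling times: how many pieces of labeled data cover each grid cell.
    counts = [[0] * cols for _ in range(rows)]
    for bx0, by0, bx1, by1 in boxes:
        for r in range(by0 - y0, by1 - y0):
            for c in range(bx0 - x0, bx1 - x0):
                counts[r][c] += 1

    # Times threshold = maximum labeling times x preset threshold value.
    threshold = max(max(row) for row in counts) * ratio

    seen: Set[Tuple[int, int]] = set()
    best: Set[Cell] = set()
    for r in range(rows):
        for c in range(cols):
            if (r, c) in seen or counts[r][c] <= threshold:
                continue
            # Breadth-first search over 4-connected cells above the threshold.
            seen.add((r, c))
            queue = deque([(r, c)])
            region: Set[Cell] = set()
            while queue:
                cr, cc = queue.popleft()
                region.add((cc + x0, cr + y0))
                for nr, nc in ((cr + 1, cc), (cr - 1, cc), (cr, cc + 1), (cr, cc - 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and (nr, nc) not in seen and counts[nr][nc] > threshold):
                        seen.add((nr, nc))
                        queue.append((nr, nc))
            if len(region) > len(best):
                best = region
    return best
```

With three overlapping rectangles and a ratio of 0.5, only the cells covered by at least two of them survive, so the result shrinks toward the area on which the annotators agree.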
EXAMPLE five
Fig. 5 is a schematic structural diagram of an annotation data processing apparatus according to a fifth embodiment of the present invention. As shown in Fig. 5, the annotation data processing apparatus 50 includes: a processor 501, a memory 502, and a computer program stored on the memory 502 and executable by the processor 501.
The processor 501, when executing the computer program stored on the memory 502, implements the annotation data processing method provided by any of the method embodiments described above.
The embodiment of the invention obtains at least one group of labeled data of which the similarity of the labeling areas is greater than a preset threshold, each group of labeled data being a group of data to be cleaned; determines a new labeling area and a new labeling category of each group of data to be cleaned; and cleans each group of data to be cleaned according to the new labeling area and the new labeling category of that group. Repeated data and differential data in the labeled data can thereby be identified and cleaned automatically, which improves the effectiveness of the labeled data.
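The overall flow summarized above can be sketched as follows. The sketch assumes that cleaning a group means replacing its pieces of labeled data with a single piece carrying the new labeling area and the new labeling category; this paragraph does not spell out the cleaning operation, so that reading is an assumption. The record layout and the names `clean_annotations`, `merge_region` and `pick_category` are hypothetical, and the two callables stand in for the determination steps.

```python
from typing import Callable, Dict, List

Annotation = Dict[str, object]  # hypothetical record: {"region": ..., "category": ...}


def clean_annotations(
    annotations: List[Annotation],
    groups: List[List[int]],
    merge_region: Callable[[List[Annotation]], object],
    pick_category: Callable[[List[Annotation], object], object],
) -> List[Annotation]:
    """Replace every group of data to be cleaned with one merged piece of labeled data."""
    grouped = {i for group in groups for i in group}
    # Labeled data outside every group is kept unchanged.
    cleaned = [a for i, a in enumerate(annotations) if i not in grouped]
    for group in groups:
        members = [annotations[i] for i in group]
        region = merge_region(members)             # new labeling area
        category = pick_category(members, region)  # new labeling category
        cleaned.append({"region": region, "category": category})
    return cleaned
```

Passing the two determination steps as callables keeps the sketch independent of how the new labeling area and the new labeling category are actually computed.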
In addition, an embodiment of the present invention further provides a computer-readable storage medium, in which a computer program is stored, and when the computer program is executed by a processor, the method for processing annotation data provided in any of the above method embodiments is implemented.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division of the units is only a logical division, and other divisions are possible in an actual implementation: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or in another form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions that enable a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
It is obvious to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to perform all or part of the above described functions. For the specific working process of the device described above, reference may be made to the corresponding process in the foregoing method embodiment, which is not described herein again.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (10)

1. A method for processing annotation data, comprising:
acquiring at least one group of labeled data of which the similarity of the labeled areas is greater than a preset threshold, wherein each group of labeled data is a group of data to be cleaned;
determining a new labeling area and a new labeling category of each group of data to be cleaned;
and cleaning each group of data to be cleaned according to the new labeling area and the new labeling category of each group of data to be cleaned.
2. The method according to claim 1, wherein the acquiring at least one group of labeled data of which the similarity of the labeled areas is greater than a preset threshold, each group of labeled data being a group of data to be cleaned, comprises:
calculating the similarity of the labeling areas of any two pieces of labeled data;
determining at least one group of data to be cleaned according to the similarity of the labeling areas of any two pieces of labeled data;
wherein each group of data to be cleaned comprises at least two pieces of labeled data, and the similarity of the labeling areas of any two pieces of labeled data in each group of data to be cleaned is greater than the preset threshold value.
3. The method of claim 1 or 2, wherein the determining a new labeling area of a group of data to be cleaned comprises:
dividing the coverage area of all the labeled data of the group of data to be cleaned into a plurality of grids;
calculating the labeling times of the group of data to be cleaned on each grid;
and determining the connected region of the coverage area whose labeling times are greater than a times threshold value as the new labeling area of the group of data to be cleaned.
4. The method of claim 3, wherein before the determining the connected region of the coverage area whose labeling times are greater than the times threshold value as the new labeling area of the group of data to be cleaned, the method further comprises:
determining the maximum times among the labeling times of the grids;
and determining the times threshold value according to the product of the maximum times and the preset threshold value.
5. The method of claim 3, wherein the determining a new labeling category of a group of data to be cleaned comprises:
if the labeling categories of any two pieces of labeled data in the group of data to be cleaned are different, calculating a labeling probability comprehensive score corresponding to each labeling category corresponding to the group of data to be cleaned;
and taking the labeling category with the maximum labeling probability comprehensive score as the new labeling category of the group of data to be cleaned.
6. The method of claim 5, wherein the calculating a label probability composite score corresponding to each label category corresponding to the set of data to be cleaned comprises:
calculating the probability of each labeling category corresponding to the new labeling area through a classification model according to the new labeling area of the group of data to be cleaned;
calculating the occurrence probability of each label category corresponding to the group of data to be cleaned;
and calculating the labeling probability comprehensive score corresponding to each labeling category by adopting the following formula:
$$p_o(c) = w_h \times p_h(c) + w_m \times p_m(c)$$
wherein $c$ represents a label category, $p_o(c)$ represents the label probability composite score of label category $c$, $p_m(c)$ represents the probability that the new labeled region corresponds to label category $c$, $p_h(c)$ represents the probability of occurrence of label category $c$ corresponding to the group of data to be cleaned, and $w_h$ and $w_m$ are weight coefficients.
7. The method of claim 3, wherein determining a new label category for a set of data to be cleaned comprises:
if the labeling categories of all the labeled data in the group of data to be cleaned are the same, taking that labeling category as the new labeling category.
8. An annotation data processing apparatus, comprising:
a to-be-cleaned data acquisition module, used for acquiring at least one group of labeled data of which the similarity of the labeled areas is greater than a preset threshold, wherein each group of labeled data is a group of data to be cleaned;
the determining module is used for determining a new labeling area and a new labeling category of each group of data to be cleaned;
and the cleaning processing module is used for cleaning each group of data to be cleaned according to the new labeling area and the new labeling category of each group of data to be cleaned.
9. An annotation data processing apparatus, comprising:
a memory, a processor, and a computer program stored on the memory and executable on the processor,
the processor, when executing the computer program, implements the method of any of claims 1-7.
10. A computer-readable storage medium, in which a computer program is stored,
the computer program, when executed by a processor, implementing the method of any one of claims 1-7.
CN201811313048.2A 2018-11-06 2018-11-06 Labeling data processing method, device, equipment and computer readable storage medium Active CN111143333B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811313048.2A CN111143333B (en) 2018-11-06 2018-11-06 Labeling data processing method, device, equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811313048.2A CN111143333B (en) 2018-11-06 2018-11-06 Labeling data processing method, device, equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN111143333A true CN111143333A (en) 2020-05-12
CN111143333B CN111143333B (en) 2023-06-09

Family

ID=70516499

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811313048.2A Active CN111143333B (en) 2018-11-06 2018-11-06 Labeling data processing method, device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111143333B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050210038A1 (en) * 2004-03-18 2005-09-22 International Business Machines Corporation Method for providing workflow functionality and tracking in an annotation subsystem
JP2005346376A (en) * 2004-06-02 2005-12-15 Fuji Xerox Co Ltd Document processor, document processing method and document processing program
US20100011282A1 (en) * 2008-07-11 2010-01-14 iCyte Pty Ltd. Annotation system and method
US20120233150A1 (en) * 2011-03-11 2012-09-13 Microsoft Corporation Aggregating document annotations
US20130054636A1 (en) * 2011-08-30 2013-02-28 Ding-Yuan Tang Document Journaling
US20170060829A1 (en) * 2015-09-01 2017-03-02 Branchfire, Inc. Method and system for annotation and connection of electronic documents
US20180089155A1 (en) * 2016-09-29 2018-03-29 Dropbox, Inc. Document differences analysis and presentation
CN108268575A (en) * 2017-01-04 2018-07-10 阿里巴巴集团控股有限公司 Processing method, the device and system of markup information
CN108509969A (en) * 2017-09-06 2018-09-07 腾讯科技(深圳)有限公司 Data mask method and terminal

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
GAO LIANGCAI ET AL.: "A Sequence Labeling Based Approach for Character Segmentation of Historical Documents" *

Also Published As

Publication number Publication date
CN111143333B (en) 2023-06-09

Similar Documents

Publication Publication Date Title
CN108280477B (en) Method and apparatus for clustering images
JP4545641B2 (en) Similar image retrieval method, similar image retrieval system, similar image retrieval program, and recording medium
CN108108821A (en) Model training method and device
CN107729935B (en) The recognition methods of similar pictures and device, server, storage medium
CN107729416B (en) Book recommendation method and system
CN111639970A (en) Method for determining price of article based on image recognition and related equipment
CN103699691A (en) Method for generating image fingerprint and method for searching similar image based on same
CN111507285A (en) Face attribute recognition method and device, computer equipment and storage medium
CN112070550A (en) Keyword determination method, device and equipment based on search platform and storage medium
US9104946B2 (en) Systems and methods for comparing images
CN111860484A (en) Region labeling method, device, equipment and storage medium
CN105260458A (en) Video recommendation method for display apparatus and display apparatus
CN113902856B (en) Semantic annotation method and device, electronic equipment and storage medium
US10210281B2 (en) Method and system for obtaining knowledge point implicit relationship
CN111191454A (en) Entity matching method and device
CN114168768A (en) Image retrieval method and related equipment
CN111027551B (en) Image processing method, apparatus and medium
CN110427496B (en) Knowledge graph expansion method and device for text processing
CN102760127A (en) Method, device and equipment for determining resource type based on extended text information
CN108830302B (en) Image classification method, training method, classification prediction method and related device
CN111143333B (en) Labeling data processing method, device, equipment and computer readable storage medium
CN103377381A (en) Method and device for identifying content attribute of image
CN113282781B (en) Image retrieval method and device
CN104850600A (en) Method and device for searching images containing faces
CN110674388A (en) Mapping method and device for push item, storage medium and terminal equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230627

Address after: 3007, Hengqin International Financial Center Building, No. 58 Huajin Street, Hengqin New District, Zhuhai City, Guangdong Province, 519030

Patentee after: New founder holdings development Co.,Ltd.

Patentee after: Peking University

Address before: 100871, Beijing, Haidian District, Cheng Fu Road, No. 298, Zhongguancun Fangzheng building, 9 floor

Patentee before: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.

Patentee before: PKU FOUNDER INFORMATION INDUSTRY GROUP CO.,LTD.

Patentee before: Peking University

TR01 Transfer of patent right

Effective date of registration: 20240327

Address after: 100871 No. 5, the Summer Palace Road, Beijing, Haidian District

Patentee after: Peking University

Country or region after: China

Address before: 3007, Hengqin International Financial Center Building, No. 58 Huajin Street, Hengqin New District, Zhuhai City, Guangdong Province, 519030

Patentee before: New founder holdings development Co.,Ltd.

Country or region before: China

Patentee before: Peking University