CN112052646A - Text data labeling method

Info

Publication number: CN112052646A
Application number: CN202010881236.6A
Authority: CN
Inventors: 江灏; 汤智; 曾东
Original assignee: Anhui Jurong Science And Technology Information Consulting Co ltd
Current assignee: Anhui Jurong Science And Technology Information Consulting Co ltd
Priority date: 2020-08-27
Filing date: 2020-08-27
Publication date: 2020-12-08
Anticipated expiration: 2040-08-27
Also published as: CN112052646B

Abstract

The invention discloses a text data labeling method comprising the following steps: extracting text information, in which a text search range, text labeling data and a text labeling standard are determined; and segmenting and numbering the text information, in which the text labeling data are segmented and the segments are numbered in segmentation order. Before labeling, the text labeling data or the data in the text search range are segmented, the segments are distributed to different labeling platforms for simultaneous labeling, and the labeled results are then collected. This increases labeling speed, shortens labeling time and reduces the load on each labeling platform. Because the text labeling data are split and recombined before being issued to different labeling platforms, each platform receives only incomplete data, which reduces the risk of data leakage and improves the security of text data labeling.

Description

Text data labeling method

Technical Field

The invention relates to the field of data management, in particular to a text data labeling method.

Background

With the rapid development of society and the continuous improvement of living standards, information has become an important part of every industry. When designing a product, people usually run a duplicate check on the text they have written, and the related content found needs to be labeled. Various text data labeling methods have been devised to make this labeling more convenient.

Existing text data labeling methods have certain disadvantages in use. First, they generally hand the text to a labeling platform, which compares it directly against all of the data. As the database grows, the time required for labeling keeps increasing, labeling cannot be completed in time, and the requirements on labeling platform equipment become ever higher. Second, all files are gathered in one place, so file leakage occurs easily and security is poor. A text data labeling method is therefore proposed.

Disclosure of Invention

The invention mainly aims to provide a text data labeling method which can effectively solve the problems in the background technology.

In order to achieve the purpose, the invention adopts the technical scheme that:

a text data labeling method comprises the following steps:

(1) extracting text information: determining a text search range, text labeling data and a text labeling standard;

(2) text information segmentation and numbering: segmenting the text labeling data, and numbering the segmented data according to a segmentation sequence;

(3) text information publishing: the numbered data, the text search range and the text labeling standard are issued to different labeling platforms;

(4) text information labeling: each labeling platform receives the data, searches for the numbered data within the text search range according to the text labeling standard, and summarizes the results;

(5) retrieving the labeling information: all data found by all labeling platforms are collected;

(6) text sequencing and combination: the collected data are sorted by number to obtain the complete labeled content.

Preferably, in step (1) the text search range is delimited according to the subject or type of the text labeling data, the text labeling data are the labeling data to be displayed, and the text labeling standard is divided into grades according to the similarity of the text labeling data.

Preferably, in step (2) the original of the text labeling data is retained before the file is divided, and the text labeling data are divided and numbered as follows:

first, the text is divided into parts according to how closely the text content is related, and the parts are numbered a first time in division order;

second, the contents of the parts are randomly integrated so that the integrated groups hold equal amounts of data, and the integrated data are numbered a second time;

third, the first numbering and the second numbering are summarized together.

Preferably, the files in the text search range may instead be divided in step (2): the total file is divided directly into portions holding equal amounts of data, and the portions are numbered in division order.
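As an illustration of this alternative, the following minimal Python sketch splits a list of search-range documents into near-equal portions and numbers them in division order. The function name `split_equally`, the use of document count as the measure of "equal data", and the rule for spreading a remainder over the first portions are assumptions; the source does not specify them.

```python
def split_equally(items, portion_count):
    """Divide items into portion_count consecutive portions of near-equal size.

    Returns {portion number: items}, numbered from 1 in division order.
    """
    if portion_count < 1:
        raise ValueError("portion_count must be at least 1")
    base, extra = divmod(len(items), portion_count)
    portions = {}
    start = 0
    for number in range(1, portion_count + 1):
        # The first `extra` portions take one additional item each.
        size = base + (1 if number <= extra else 0)
        portions[number] = items[start:start + size]
        start += size
    return portions
```

Because the portions are consecutive slices, concatenating them by ascending number reproduces the original search range.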

Preferably, in step (3) the numbered data, the text labeling data and the text labeling standard may instead be issued to different labeling platforms, and the numbers of the numbered data issued to each platform are recorded.

Preferably, after the files are issued in step (3), the numbers are checked against each platform to determine whether the files were issued completely; when a number is inaccurate or missing, the files are re-issued to the platform where the error occurred.

Preferably, after the files are recalled in step (5), their state is checked by comparing the numbers to determine whether each file has been labeled, whether it has been labeled completely, and whether any file is missing; when the state of a file is abnormal, the file is submitted to the platform again.

Preferably, in step (6) the file obtained by sorting the collected data by number is compared with the original of the text labeling data retained before segmentation to determine whether the ordering is wrong; when it is wrong, the labeled data are re-sorted according to the retained original content.

Compared with the prior art, the text data labeling method has the following beneficial effects:

1. Before labeling, the text labeling data or the data in the text search range are segmented, the segments are sent to different labeling platforms for simultaneous labeling, and the labeled results are then collected. This increases labeling speed, shortens labeling time and reduces the load on each labeling platform.

2. The text labeling data are split and recombined before being issued to different labeling platforms, so each platform receives only incomplete data. This reduces the risk of data leakage and improves the security of text data labeling.

Drawings

Fig. 1 is a flowchart of a text data labeling method according to the present invention.

Detailed Description

To make the technical means, inventive features, objectives and effects of the invention easy to understand, the invention is further described below with reference to specific embodiments.

Example 1

A text data labeling method comprises the following steps:

(1) Extracting text information: a text search range, text labeling data and a text labeling standard are determined;

the text search range is delimited according to the subject or type of the text labeling data, the text labeling data being the labeling data to be displayed; the text labeling standard is divided into grades according to the similarity of the text labeling data, for example into 5 grades, each marked with a different color.
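The five-grade standard could be represented as in the following minimal Python sketch. The source states only that there are, for example, 5 grades marked with different colors; the equal-width similarity bands, the specific colors and the names `GRADE_COLORS`, `similarity_grade` and `grade_color` are assumptions made for illustration.

```python
# Hypothetical colors, one per grade, from lowest to highest similarity.
GRADE_COLORS = ["green", "blue", "yellow", "orange", "red"]

def similarity_grade(similarity):
    """Map a similarity in [0.0, 1.0] to a grade from 1 (lowest) to 5 (highest)."""
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must be between 0.0 and 1.0")
    # Five equal-width bands (an assumption); 1.0 is folded into the top band.
    return min(int(similarity * 5), 4) + 1

def grade_color(similarity):
    """Return the marking color for the grade that the similarity falls into."""
    return GRADE_COLORS[similarity_grade(similarity) - 1]
```

A platform would call `grade_color` on each match it finds and mark the matched text in the returned color.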

(2) Text information segmentation and numbering: the text labeling data are segmented, and the segments are numbered in segmentation order;

the original of the text labeling data is retained before segmentation, and the text labeling data are segmented and numbered as follows:

first, the text is divided into parts according to how closely the text content is related, and the parts are numbered a first time in division order, for example as parts 1, 2, 3, 4, 5, 6, 7, 8 and 9;

second, the contents of the parts are randomly integrated so that the integrated groups hold equal amounts of data, and the integrated data are numbered a second time, for example as 14, 235, 68 and 79;

third, the first numbering and the second numbering are summarized, and the summarized numbers are recorded.
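The two numbering passes can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the helper names `number_parts` and `integrate_randomly` are hypothetical, "equal data" is approximated by greedily balancing character counts, and second numbers are written with a hyphen (for example `1-4` rather than `14`) so that they stay unambiguous when there are more than nine parts.

```python
import random

def number_parts(parts):
    """First numbering: number the parts 1..n in segmentation order."""
    return {number: text for number, text in enumerate(parts, start=1)}

def integrate_randomly(first, group_count, seed=None):
    """Second numbering: randomly combine parts into groups of similar size.

    Returns a record mapping each second number (for example "1-4") to the
    first numbers it contains, so the original order can be restored later.
    """
    if not 1 <= group_count <= len(first):
        raise ValueError("group_count must be between 1 and the number of parts")
    rng = random.Random(seed)
    numbers = list(first)
    rng.shuffle(numbers)
    groups = [[] for _ in range(group_count)]
    sizes = [0] * group_count
    for number in numbers:
        # Greedy balancing: add the part to the group holding the least text.
        target = sizes.index(min(sizes))
        groups[target].append(number)
        sizes[target] += len(first[number])
    record = {}
    for members in groups:
        if members:
            members.sort()
            record["-".join(str(n) for n in members)] = members
    return record
```

The returned record plays the role of the summarized numbers: it is what later allows the labeled groups to be split back into parts and put in their original order.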

(3) Text information publishing: the numbered data, the text search range and the text labeling standard are issued to different labeling platforms;

after the files are issued, the numbers are checked against each platform to determine whether the files were issued completely; when a number is inaccurate or missing, the files are re-issued to the platform where the error occurred.
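The number check after issuing might look like the following Python sketch. It assumes that the issuing side records a set of numbers per platform and that each platform reports back the numbers it received; the function name `find_failed_dispatches` and the shape of both dictionaries are hypothetical.

```python
def find_failed_dispatches(issued, acknowledged):
    """Compare the numbers recorded at issue time with those each platform reports.

    Both arguments map a platform name to a set of segment numbers. The result
    maps every platform whose numbers do not match to the missing and
    unexpected numbers, so the issuer knows where to re-issue.
    """
    failures = {}
    for platform, sent in issued.items():
        received = acknowledged.get(platform, set())
        if received != sent:
            failures[platform] = {
                "missing": sent - received,
                "unexpected": received - sent,
            }
    return failures
```

An empty result means every platform confirmed exactly the numbers it was sent; otherwise the files are issued again to each platform listed.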

(4) Text information labeling: each labeling platform receives the data, searches for the numbered data within the text search range according to the text labeling standard, summarizes the results, and marks the extracted features at the corresponding positions of the issued text labeling data according to the text labeling standard;
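The source does not say how a platform measures similarity. As one possible sketch, the Python below compares each numbered segment with every document in the search range using the standard-library `difflib.SequenceMatcher` and summarizes the closest match per number. The function names, the 0.6 threshold and the whole-document comparison are assumptions, not the patented method.

```python
from difflib import SequenceMatcher

def best_match(segment, search_range):
    """Return (document id, similarity) of the closest document, or (None, 0.0)."""
    best_id, best_ratio = None, 0.0
    for doc_id, doc_text in search_range.items():
        ratio = SequenceMatcher(None, segment, doc_text).ratio()
        if ratio > best_ratio:
            best_id, best_ratio = doc_id, ratio
    return best_id, best_ratio

def label_segments(numbered, search_range, threshold=0.6):
    """Search every numbered segment and summarize the results by number."""
    summary = {}
    for number, segment in numbered.items():
        doc_id, ratio = best_match(segment, search_range)
        summary[number] = {
            # Below the (hypothetical) threshold the segment is left unmatched.
            "match": doc_id if ratio >= threshold else None,
            "similarity": ratio,
        }
    return summary
```

Comparing a short segment with a whole document understates similarity when the document is much longer, so a practical platform would more likely compare against passages of similar length.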

(5) Retrieving the labeling information: all data found by all labeling platforms are collected;

after the files are recalled, their state is checked by comparing the numbers to determine whether each file has been labeled, whether it has been labeled completely, and whether any file is missing; when the state of a file is abnormal, the file is submitted to the platform again.
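The state check on recalled files can be sketched as below. The `RecalledFile` record with its `labeled` and `complete` flags is a hypothetical representation of what a platform returns; the source names only the three conditions to check (not labeled, not completely labeled, missing).

```python
from dataclasses import dataclass

@dataclass
class RecalledFile:
    number: int      # number assigned when the file was issued
    labeled: bool    # the platform applied labeling to this file
    complete: bool   # the platform reports that labeling finished

def files_to_resubmit(expected_numbers, recalled):
    """Return {number: reason} for every file whose state is abnormal."""
    by_number = {item.number: item for item in recalled}
    abnormal = {}
    for number in expected_numbers:
        item = by_number.get(number)
        if item is None:
            abnormal[number] = "missing"
        elif not item.labeled:
            abnormal[number] = "not labeled"
        elif not item.complete:
            abnormal[number] = "incompletely labeled"
    return abnormal
```

Every number in the result would be submitted to a platform again before the sorting step begins.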

(6) Text sequencing and combination: the collected data are sorted by number to obtain the complete labeled content;

the file obtained by sorting the collected data by number is compared with the original of the text labeling data retained before segmentation to determine whether the ordering is wrong; when it is wrong, the labeled data are re-sorted according to the retained original content.
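A minimal Python sketch of the sorting and verification step follows. It assumes that each labeled part is a `(text, labels)` pair keyed by its first number and that the retained original is available as a list of part texts in order; the function name `order_and_verify` and these data shapes are hypothetical.

```python
def order_and_verify(labeled, original_parts):
    """Sort labeled parts by number and check the order against the original.

    labeled maps a part number to a (text, labels) pair. If the sorted texts
    do not reproduce original_parts, the parts are re-ordered by looking each
    original text up among the labeled parts.
    """
    ordered = [labeled[number] for number in sorted(labeled)]
    if [text for text, _ in ordered] == original_parts:
        return ordered
    # Ordering is wrong: fall back to the retained original as the reference.
    # Parts with identical text cannot be told apart by this lookup, and a
    # part absent from the labeled data raises KeyError.
    by_text = {text: (text, labels) for text, labels in labeled.values()}
    return [by_text[part] for part in original_parts]
```

The fallback relies on the retained original alone, which is why the original is kept before segmentation.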

Example 2

A text data labeling method comprises the following steps:

(2) Text information segmentation and numbering: the files in the text search range are divided, the total file being divided directly into portions holding equal amounts of data; the portions are numbered in division order, and a record of the text search range as it was before division is retained.

(3) Text information publishing: the numbered data, the text labeling data and the text labeling standard are issued to different labeling platforms, and the numbers of the numbered data issued to each platform are recorded;

after the files are issued, the numbers are checked against each platform to determine whether the files in the text search range were issued completely; when a number is missing, the files are re-issued to the platform where the error occurred.

(4) Text information labeling: each labeling platform receives the data, searches for the text labeling data within the numbered data according to the text labeling standard, summarizes the results, and marks the extracted features at the corresponding positions of the issued text labeling data according to the text labeling standard;

(6) Text sequencing and combination: all labeled texts are integrated to obtain the complete labeled content.

In Example 1, the text is segmented before labeling, and each labeling platform searches for, extracts and labels its segments against the complete text search range. In Example 2, the text search range is segmented, and each labeling platform searches for, extracts and labels the text labeling data within its portion of the search range. Example 1 offers higher file security than Example 2, while Example 2 avoids the multiple numbering passes of Example 1 and is easier to integrate.

In summary, in the text data labeling method provided by the invention, the text labeling data or the data in the text search range are segmented before labeling, the segments are sent to different labeling platforms for simultaneous labeling, and the labeled results are then collected. This increases labeling speed, shortens labeling time and reduces the load on each labeling platform. Because the text labeling data are split and recombined before being issued to different labeling platforms, each platform receives only incomplete data, which reduces the risk of data leakage and improves the security of text data labeling.

The foregoing shows and describes the general principles and broad features of the present invention and its advantages. Those skilled in the art will understand that the invention is not limited to the embodiments described above; the embodiments and the description merely illustrate the principle of the invention, and various changes and modifications may be made without departing from its spirit and scope, all of which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and their equivalents.

Claims

1. A text data labeling method is characterized in that: the method comprises the following steps:

2. The method for labeling text data according to claim 1, wherein: in step (1), the text search range is delimited according to the subject or type of the text labeling data, the text labeling data are the labeling data to be displayed, and the text labeling standard is divided into grades according to the similarity of the text labeling data.

3. The method for labeling text data according to claim 1, wherein: in step (2), the original of the text labeling data is retained before the file is segmented, and the text labeling data are segmented and numbered as follows:

4. The method for labeling text data according to claim 1, wherein: in step (2), the files in the text search range may instead be divided, the total file being divided directly into portions holding equal amounts of data, and the portions being numbered in division order.

5. The method of claim 4, wherein: in step (3), the numbered data, the text labeling data and the text labeling standard may instead be issued to different labeling platforms, and the numbers of the numbered data issued to each platform are recorded.

6. The method of claim 5, wherein: in step (3), after the files are issued, the numbers are checked against each platform to determine whether the files were issued completely, and when a number is inaccurate or missing, the files are re-issued to the platform where the error occurred.

7. The method for labeling text data according to claim 1, wherein: in step (5), after the files are recalled, their state is checked by comparing the numbers to determine whether each file has been labeled, whether it has been labeled completely, and whether any file is missing, and when the state of a file is abnormal, the file is submitted to the platform again.

8. The method for labeling text data according to claim 1, wherein: in step (6), the file obtained by sorting the collected data by number is compared with the original of the text labeling data retained before segmentation to determine whether the ordering is wrong, and when it is wrong, the labeled data are re-sorted according to the retained original content.