CN112052646B - Text data labeling method

Info

Publication number: CN112052646B
Application number: CN202010881236.6A
Authority: CN
Inventors: 江灏; 汤智; 曾东
Original assignee: Anhui Jurong Science And Technology Information Consulting Co ltd
Current assignee: Anhui Jurong Science And Technology Information Consulting Co ltd
Priority date: 2020-08-27
Filing date: 2020-08-27
Publication date: 2024-03-29
Anticipated expiration: 2040-08-27
Also published as: CN112052646A

Abstract

The invention discloses a text data labeling method comprising the following steps. Text information extraction: determining a text search range, the text labeling data, and a text labeling standard. Text information segmentation and numbering: segmenting the text labeling data and numbering the segments in segmentation order. In this method, the text labeling data or the data in the text search range are segmented before labeling, issued to different labeling platforms for simultaneous labeling, and the labeled results are then collected. This increases labeling speed, shortens labeling time, and reduces the load on each labeling platform. Because the text labeling data are segmented and then recombined before being issued, each labeling platform receives only incomplete data, which reduces the risk of data leakage and improves the security of text data labeling.

Description

Text data labeling method

Technical Field

The invention relates to the field of data management, in particular to a text data labeling method.

Background

With the rapid development of society and the continuous improvement of living standards, information has become an important part of every industry. When designing products, people usually check the text they have drafted and need to label the relevant content, and many text data labeling methods have been devised to make this easier.

Existing text data labeling methods have certain shortcomings in use. First, they generally hand the text to a single labeling platform, which compares it directly against all of the data; the larger the database, the longer labeling takes and the higher the demands on the labeling platform's equipment, so labeling cannot be completed in a timely manner. Second, all files are gathered in one place, so file leakage occurs easily and security is poor.

Disclosure of Invention

The main purpose of the invention is to provide a text data labeling method that can effectively solve the problems described in the Background.

In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:

A text data labeling method comprises the following steps:

(1) Text information extraction: determining a text search range, the text labeling data, and a text labeling standard;

(2) Text information segmentation and numbering: segmenting the text labeling data and numbering the segments in segmentation order;

(3) Text information issuance: issuing the numbered data, the text search range, and the text labeling standard to different labeling platforms;

(4) Text information labeling: each labeling platform receives the data, searches for the numbered data within the text search range according to the text labeling standard, and then collects the results;

(5) Labeled information recall: collecting all of the data found by all of the labeling platforms;

(6) Text ordering and combination: sorting the collected data by number to obtain the complete labeled content.
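The six steps above can be sketched end to end as follows. This is only an illustrative sketch, not the patented implementation: the labeling platforms are simulated by a local function, "labeling" is reduced to listing which search-range entries occur in a segment, and the names label_on_platform and label_text are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def label_on_platform(job):
    """Step (4): stand-in for one labeling platform (hypothetical)."""
    number, segment, search_range = job
    # "Labeling" here only lists the search-range entries found in the segment.
    hits = [term for term in search_range if term in segment]
    return number, segment, hits


def label_text(segments, search_range, workers=3):
    # Step (2): number the segments in segmentation order, starting at 1.
    numbered = list(enumerate(segments, start=1))
    # Step (3): build one job per numbered segment for issuance.
    jobs = [(number, segment, search_range) for number, segment in numbered]
    # Steps (4) and (5): platforms work concurrently, then results are collected.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(label_on_platform, jobs))
    # Step (6): sort by number to restore the original order.
    results.sort(key=lambda item: item[0])
    return results


if __name__ == "__main__":
    for row in label_text(["alpha beta", "gamma", "beta delta"], ["beta", "delta"]):
        print(row)
```

Carrying the number alongside each segment is what lets the final sort rebuild the document regardless of the order in which platforms finish.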

Preferably, in step (1), the text search range is defined according to the subject or type of the text labeling data, that is, the labeling data to be displayed, and the text labeling standard is divided into different grades according to the degree of similarity of the text labeling data.

Preferably, in step (2), the original of the text labeling data is retained before the file is segmented, and the text labeling data are segmented and numbered as follows:

(1) dividing the text into parts according to how closely the content is related, and numbering the parts for the first time in segmentation order;

(2) randomly combining the parts so that the combined groups hold equal amounts of data, and numbering the combined groups for the second time;

(3) collecting the first-numbering content and the second-numbering content.

Preferably, in step (2), the files in the text search range may also be segmented: the whole set of files is divided directly into parts of equal data size, and the parts are numbered in segmentation order.
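One possible reading of "parts of equal data size" is a split by item count in which part sizes differ by at most one. The sketch below assumes the search range is available as a sequence (a list of records or a string); split_equal is a hypothetical helper name.

```python
def split_equal(corpus, n_parts):
    """Split a sequence into n_parts contiguous parts, numbered from 1.

    Part sizes differ by at most one item; earlier parts take the remainder.
    """
    size, extra = divmod(len(corpus), n_parts)
    parts, start = {}, 0
    for number in range(1, n_parts + 1):
        end = start + size + (1 if number <= extra else 0)
        parts[number] = corpus[start:end]
        start = end
    return parts


if __name__ == "__main__":
    print(split_equal(list(range(10)), 3))
```

Keeping the parts contiguous means concatenating them in number order reproduces the original search range.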

Preferably, in step (3), the numbered data, the text labeling data, and the text labeling standard may also be issued to different labeling platforms, and the numbers of the numbered data issued to each platform are recorded.

Preferably, in step (3), after the files are issued, the numbers are checked with each platform to determine whether the files were issued completely; when a number is inaccurate or missing, the files are reissued to the platform where the issuance failed.
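The number check after issuance could be sketched as a comparison between the numbers recorded at issuance and the numbers each platform reports back. The dictionary layout and the name check_issue are assumptions made for illustration; the text does not specify how platforms acknowledge receipt.

```python
def check_issue(issued, acknowledged):
    """Compare issued numbers with the numbers each platform reports.

    issued / acknowledged: {platform name: iterable of segment numbers}.
    Returns only the platforms whose numbers are missing or inaccurate.
    """
    problems = {}
    for platform, numbers in issued.items():
        sent = set(numbers)
        got = set(acknowledged.get(platform, ()))
        if sent != got:
            problems[platform] = {
                "missing": sorted(sent - got),      # issued but not received
                "unexpected": sorted(got - sent),   # received but never issued
            }
    return problems


if __name__ == "__main__":
    issued = {"A": [1, 4], "B": [2, 3, 5]}
    acknowledged = {"A": [1, 4], "B": [2, 5, 9]}
    print(check_issue(issued, acknowledged))
```

Any platform that appears in the result would be a candidate for reissuance.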

Preferably, in step (5), the file state is checked after the files are recalled: the numbers are compared to verify whether each file has been labeled, whether it has been labeled completely, and whether any file is missing; when the file state is abnormal, the file is requested from the platform again.
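The state check after recall might be sketched as below. The record layout (a "labels" field that is None when nothing was labeled, and a "complete" flag) is a hypothetical convention chosen for this example, as are the function names.

```python
def file_state(expected_numbers, recalled):
    """Classify each expected segment as missing, unlabeled, incomplete, or ok.

    recalled: {number: {"labels": list or None, "complete": bool}} (assumed layout).
    """
    states = {}
    for number in expected_numbers:
        item = recalled.get(number)
        if item is None:
            states[number] = "missing"
        elif item.get("labels") is None:
            states[number] = "unlabeled"
        elif not item.get("complete", False):
            states[number] = "incomplete"
        else:
            states[number] = "ok"
    return states


def numbers_to_request_again(states):
    """Numbers whose state is abnormal and should be requested again."""
    return [number for number, state in states.items() if state != "ok"]


if __name__ == "__main__":
    recalled = {
        1: {"labels": ["x"], "complete": True},
        2: {"labels": None, "complete": False},
        3: {"labels": ["y"], "complete": False},
    }
    states = file_state([1, 2, 3, 4], recalled)
    print(states, numbers_to_request_again(states))
```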

Preferably, in step (6), the file obtained by sorting the collected data by number is compared with the original of the text labeling data retained before segmentation to determine whether the ordering is wrong; when it is wrong, the labeled data are reordered according to the content of the retained original.

Compared with the prior art, the text data labeling method of the invention has the following beneficial effects:

1. The text labeling data or the data in the text search range are segmented before labeling, issued to different labeling platforms for simultaneous labeling, and the labeled results are then collected, which increases labeling speed, shortens labeling time, and reduces the load on each labeling platform;

2. The text labeling data are segmented and then recombined before being issued to the different labeling platforms, so each platform receives only incomplete data, which reduces the risk of data leakage and improves the security of text data labeling.

Drawings

Fig. 1 is a flowchart of a text data labeling method according to the present invention.

Detailed Description

To make the technical means, inventive features, objectives, and effects of the invention easy to understand, the invention is further described below with reference to specific embodiments.

A text data labeling method comprises the following steps:

When the text search range is drawn up, it is defined according to the subject or type of the text labeling data, that is, the labeling data to be displayed. The text labeling standard is divided into different grades according to the degree of similarity of the text labeling data; for example, it is divided into 5 grades by similarity, and each grade is marked in a different color.
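A five-grade, color-coded standard could be sketched as follows. The text does not say how similarity is measured or which colors are used, so this example assumes similarity is computed between a passage of the labeling data and a candidate from the search range using Python's difflib.SequenceMatcher, with equal-width grade bands and an arbitrary palette.

```python
from difflib import SequenceMatcher

# Assumed palette: index 0 is grade 1 (least similar), index 4 is grade 5.
GRADE_COLORS = ["red", "orange", "yellow", "green", "blue"]


def similarity_grade(passage, candidate):
    """Map a similarity ratio in [0, 1] onto grades 1..5 with a color each."""
    levels = len(GRADE_COLORS)
    ratio = SequenceMatcher(None, passage, candidate).ratio()
    # Equal-width bands; a ratio of exactly 1.0 falls into the top grade.
    grade = min(int(ratio * levels), levels - 1) + 1
    return grade, GRADE_COLORS[grade - 1]


if __name__ == "__main__":
    print(similarity_grade("abc", "abc"))
    print(similarity_grade("abc", "xyz"))
```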

The original of the text labeling data is retained before the file is segmented, and the text labeling data are segmented and numbered as follows:

(1) dividing the text into parts according to how closely the content is related, and numbering the parts for the first time in segmentation order, for example as numbers 1, 2, 3, 4, 5, 6, 7, 8 and 9;

(2) randomly combining the parts so that the combined groups hold equal amounts of data, and then numbering the combined groups for the second time, for example as 14, 235, 68 and 79;

(3) collecting the first-numbering content and the second-numbering content, and recording the collected numbers.
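The two rounds of numbering can be sketched as below. The example reads the second-round numbers in the text (14, 235, 68, 79) as the concatenation of the first-round numbers in each group, which is an interpretation rather than something the text states. It also balances groups by part count, not by byte size. The name two_level_numbering and the seed parameter are hypothetical.

```python
import random


def two_level_numbering(parts, n_groups, seed=None):
    """Number parts 1..n, then randomly combine them into balanced groups.

    Returns (first, record):
      first  maps first-round number -> part content
      record maps second-round number -> list of first-round numbers
    """
    rng = random.Random(seed)
    first = dict(enumerate(parts, start=1))          # first numbering
    order = list(first)
    rng.shuffle(order)                               # random combination
    # Striding through the shuffled order keeps group sizes within one of each other.
    groups = [sorted(order[i::n_groups]) for i in range(n_groups)]
    # Second numbering: concatenated first-round numbers, e.g. [1, 4] -> "14".
    # Note: this key is only unambiguous while there are at most 9 parts.
    record = {"".join(map(str, g)): g for g in groups if g}
    return first, record


if __name__ == "__main__":
    first, record = two_level_numbering(list("abcdefghi"), 4, seed=0)
    print(record)
```

The record of both numberings is what allows the issuer, and only the issuer, to map a returned group back to positions in the original text.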

After the files are issued, the numbers are checked with each platform to determine whether the files were issued completely; when a number is inaccurate or missing, the files are reissued to the platform where the issuance failed.

(4) Text information labeling: each labeling platform receives the data, searches for the numbered data within the text search range according to the text labeling standard, collects the results, and marks the extracted features at the corresponding positions in the issued text labeling data according to the text labeling standard;

The file state is checked after the files are recalled: the numbers are compared to verify whether each file has been labeled, whether it has been labeled completely, and whether any file is missing; when the file state is abnormal, the file is requested from the platform again.

(6) Text ordering and combination: sorting the collected data by number to obtain the complete labeled content;

The file obtained by sorting the collected data by number is compared with the original of the text labeling data retained before segmentation to determine whether the ordering is wrong; when it is wrong, the labeled data are reordered according to the content of the retained original.
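The ordering check against the retained original might look like the sketch below. It assumes each recalled item still carries the unlabeled segment text and that segments are distinct substrings of the original, so their position can be found by a plain search; verify_and_reorder is a hypothetical name.

```python
def verify_and_reorder(original, labeled):
    """Sort labeled segments by number, then verify against the original.

    labeled: list of (number, segment_text, labels) tuples.
    If the number order does not reproduce the original, fall back to
    ordering segments by where their text occurs in the original.
    """
    ordered = sorted(labeled, key=lambda item: item[0])
    if "".join(segment for _, segment, _ in ordered) == original:
        return ordered
    # Ordering is wrong: reorder by position in the retained original.
    return sorted(labeled, key=lambda item: original.find(item[1]))


if __name__ == "__main__":
    wrong_numbers = [(2, "ab", []), (1, "cd", ["x"]), (3, "ef", [])]
    print(verify_and_reorder("abcdef", wrong_numbers))
```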

Examples

A text data labeling method comprises the following steps:

(2) Text information segmentation and numbering: segmenting the files in the text search range by dividing the whole set of files directly into parts of equal data size, numbering the parts in segmentation order, and keeping a record of the text search range as it was before segmentation.

(3) Text information issuance: the numbered data, the text labeling data, and the text labeling standard are issued to different labeling platforms, and the numbers of the numbered data issued to each platform are recorded;

After the files are issued, the numbers are checked with each platform to determine whether the files in the text search range were issued completely; when a number is missing, the files are reissued to the platform where the issuance failed.

(4) Text information labeling: each labeling platform receives the data, searches the numbered data for the text labeling data according to the text labeling standard, collects the results, and marks the extracted features at the corresponding positions in the issued text labeling data according to the text labeling standard;

(6) Text ordering and combination: combining all of the labeled texts to obtain the complete labeled content.
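In this embodiment each platform holds the full text labeling data but only one numbered part of the search range. A rough sketch under simplifying assumptions: the search range is a list of terms, matching is exact substring matching rather than graded similarity, and a label is a (start, end, term, part number) tuple. The function names are hypothetical.

```python
def label_against_part(text, part_number, range_part):
    """One platform: mark every occurrence of its search-range terms in text."""
    spans = []
    for term in range_part:
        if not term:
            continue  # skip empty entries
        start = text.find(term)
        while start != -1:
            spans.append((start, start + len(term), term, part_number))
            start = text.find(term, start + 1)
    return spans


def merge_labels(span_lists):
    """Combine the labels from all platforms, ordered by text position."""
    return sorted({span for spans in span_lists for span in spans})


if __name__ == "__main__":
    text = "the cat sat on the mat"
    parts = {1: ["cat", "dog"], 2: ["the"]}
    results = [label_against_part(text, n, p) for n, p in parts.items()]
    print(merge_labels(results))
```

Because every label already carries its position in the shared text, combining the results needs no renumbering, which matches the convenience this embodiment claims.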

Embodiment 1 segments the text labeling data, and each labeling platform searches, extracts, and labels its segment against the complete text search range. Embodiment 2 segments the text search range, and each labeling platform searches, extracts, and labels the text labeling data against its part of the search range. Compared with Embodiment 2, Embodiment 1 offers higher file security; compared with Embodiment 1, Embodiment 2 does not require multiple rounds of numbering and is easier to combine.

In summary, in this text data labeling method the text labeling data or the data in the text search range are segmented before labeling, issued to different labeling platforms for simultaneous labeling, and the labeled results are then collected, which increases labeling speed, shortens labeling time, and reduces the load on each labeling platform. Because the text labeling data are segmented and then recombined before being issued, each labeling platform receives only incomplete data, which reduces the risk of data leakage and improves the security of text data labeling.

The foregoing has shown and described the basic principles, main features, and advantages of the present invention. Those skilled in the art will understand that the invention is not limited to the embodiments described above, which merely illustrate its principles, and that various changes and modifications may be made without departing from the spirit and scope of the invention. The scope of the invention is defined by the appended claims and their equivalents.

Claims

1. A text data labeling method, characterized in that the method comprises the following steps:

in step (1), the text search range is defined according to the subject or type of the text labeling data, that is, the labeling data to be displayed, and the text labeling standard is divided into different grades according to the degree of similarity of the text labeling data;

in step (2), the original of the text labeling data is retained before the file is segmented, and the text labeling data are segmented and numbered as follows:

(2) randomly combining the parts so that the combined groups hold equal amounts of data, and numbering the combined groups for the second time;

(3) collecting the first-numbering content and the second-numbering content;

in step (2), the files in the text search range are also segmented: the whole set of files is divided directly into parts of equal data size, and the parts are numbered in segmentation order.

2. The text data labeling method of claim 1, wherein: in step (3), the numbered data, the text labeling data, and the text labeling standard are also issued to different labeling platforms, and the numbers of the numbered data issued to each platform are recorded.

3. The text data labeling method of claim 2, wherein: in step (3), after the files are issued, the numbers are checked with each platform to determine whether the files were issued completely, and when a number is inaccurate or missing, the files are reissued to the platform where the issuance failed.

4. The text data labeling method of claim 1, wherein: in step (5), the file state is checked after the files are recalled, the numbers being compared to verify whether each file has been labeled, whether it has been labeled completely, and whether any file is missing, and when the file state is abnormal, the file is requested from the platform again.

5. The text data labeling method of claim 1, wherein: in step (6), the file obtained by sorting the collected data by number is compared with the original of the text labeling data retained before segmentation to determine whether the ordering is wrong, and when it is wrong, the labeled data are reordered according to the content of the retained original.