CN112052646A - Text data labeling method - Google Patents
- Publication number
- CN112052646A
- Authority
- CN
- China
- Prior art keywords
- text
- data
- labeling
- file
- marking
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/103—Formatting, i.e. changing of presentation of documents
- G06F40/117—Tagging; Marking up; Designating a block; Setting of attributes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/189—Automatic justification
Abstract
The invention discloses a text data labeling method comprising the following steps: text information extraction, in which a text search range, text labeling data and a text labeling standard are determined; and text information segmentation and numbering, in which the text labeling data is segmented and the segmented data is numbered in segmentation order. Before labeling, the text labeling data or the data in the text search range is segmented, the segments are distributed to different labeling platforms for simultaneous labeling, and the labeled results are finally collected. This increases the labeling speed, shortens the labeling time and reduces the load on each labeling platform. Because the text labeling data is segmented and recombined before being issued to different labeling platforms, the data received by each platform is incomplete, which reduces the risk of data leakage and improves the security of text data labeling.
Description
Technical Field
The invention relates to the field of data management, in particular to a text data labeling method.
Background
With the rapid development of society and the continuous improvement of living standards, information has become an important part of every industry. When designing a product, people usually run a duplicate check on the text they have written, during which the relevant content needs to be labeled, and text data labeling methods have been devised to make this labeling convenient.
Existing text data labeling methods have certain disadvantages in use. First, they generally hand the text to a labeling platform to be compared directly against all of the data; as the data in the database keeps growing, the time required for labeling keeps increasing, labeling cannot be completed in a timely manner, and the demands on the labeling platform's equipment grow higher and higher. Second, all files are gathered in one place, so file leakage can easily occur and security is low. A text data labeling method is therefore proposed.
Disclosure of Invention
The main purpose of the invention is to provide a text data labeling method that effectively solves the problems described in the background.
To achieve this purpose, the invention adopts the following technical scheme:
A text data labeling method comprises the following steps:
(1) text information extraction: determining a text search range, text labeling data and a text labeling standard;
(2) text information segmentation and numbering: segmenting the text labeling data, and numbering the segmented data in segmentation order;
(3) text information publishing: issuing the numbered data, the text search range and the text labeling standard to different labeling platforms;
(4) text information labeling: each labeling platform receives the data, searches the text search range for the numbered data according to the text labeling standard, and summarizes the results;
(5) labeling information retrieval: collecting all the data found by all the labeling platforms;
(6) text sorting and combination: sorting the collected data by number to obtain the complete labeled content.
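The six steps can be sketched as a small pipeline. This is only an illustrative sketch under stated assumptions, not the patented implementation: the labeling platforms are simulated by worker threads, the text search range is a plain list of strings, and the text labeling standard is a caller-supplied predicate. The names `split_and_number`, `label_on_platform` and `label_text` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_and_number(text, n_parts):
    """Step (2): cut the text into n_parts pieces, numbered in split order."""
    size = -(-len(text) // n_parts)  # ceiling division
    return [(i + 1, text[i * size:(i + 1) * size]) for i in range(n_parts)]


def label_on_platform(numbered_segment, search_range, standard):
    """Step (4): stand-in for one labeling platform (an assumption)."""
    number, segment = numbered_segment
    # Keep every search-range entry that the standard accepts for this segment.
    hits = [entry for entry in search_range if standard(segment, entry)]
    return number, segment, hits


def label_text(text, search_range, standard, n_platforms=3):
    segments = split_and_number(text, n_platforms)             # step (2)
    with ThreadPoolExecutor(max_workers=n_platforms) as pool:  # steps (3)-(4)
        futures = [pool.submit(label_on_platform, s, search_range, standard)
                   for s in segments]
        results = [f.result() for f in futures]                # step (5)
    # Step (6): sort by number; real platforms may answer in any order.
    return sorted(results, key=lambda r: r[0])
```

Each simulated platform sees only its own segment, which mirrors the point that no single platform receives the complete text.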
Preferably, in step (1), the text search range is divided according to the subject or type of the text labeling data, the text labeling data is the labeling data to be displayed, and the text labeling standard is divided into different grades according to the similarity of the text labeling data.
Preferably, in step (2), the original of the text labeling data is retained before segmentation, and the text labeling data is segmented and numbered as follows:
first, dividing the text into parts according to the degree of association of the text content, and numbering the parts a first time in division order;
second, randomly integrating the contents of the parts so that the integrated parts contain equal amounts of data, and numbering the integrated data a second time;
third, summarizing the first numbering and the second numbering.
Preferably, in step (2), the files in the text search range may also be segmented: the total file is divided directly into parts containing equal amounts of data, and the divided data is numbered in division order.
Preferably, in step (3), the numbered data, the text labeling data and the text labeling standard may also be issued to different labeling platforms, and the numbers of the numbered data issued to each platform are recorded.
Preferably, after the files are issued in step (3), the numbers are checked with each platform to judge whether the files were issued completely; when a number is inaccurate or missing, the files are re-issued to the platform where the issuing error occurred.
Preferably, after the files are recalled in step (5), the file state is judged by comparing the numbers and checking whether each file has been labeled, whether the labeling is complete and whether any file is missing; when the file state is abnormal, the file is requested from the platform again.
Preferably, in step (6), the file obtained by sorting the collected data by number is compared with the original of the text labeling data retained before segmentation to judge whether the sorting is wrong; when it is wrong, the labeled data is re-sorted according to the retained original.
Compared with the prior art, the text data labeling method has the following beneficial effects:
1. Before labeling, the text labeling data or the data in the text search range is segmented, the segments are sent to different labeling platforms for simultaneous labeling, and the labeled results are finally collected, which increases the labeling speed, shortens the labeling time and reduces the load on each labeling platform;
2. The text labeling data is segmented and recombined before being issued to different labeling platforms, so the data received by each platform is incomplete, which reduces the risk of data leakage and improves the security of text data labeling.
Drawings
Fig. 1 is a flowchart of a text data labeling method according to the present invention.
Detailed Description
To make the technical means, inventive features, purposes and effects of the invention easy to understand, the invention is further described below with reference to specific embodiments.
Example 1
A text data labeling method comprises the following steps:
(1) Text information extraction: determining a text search range, text labeling data and a text labeling standard;
when the text search range is established, it is divided according to the subject or type of the text labeling data; the text labeling data is the labeling data to be displayed; the text labeling standard is divided into different grades according to the similarity of the text labeling data, for example into 5 grades, with each grade marked in a different color.
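A graded standard of this kind could be sketched as follows. The text only says that similarity is divided into grades (5 in the example) and that each grade gets its own color; the equal-width bands over a similarity score in [0, 1], the palette, and the function names below are assumptions made purely for illustration.

```python
# Hypothetical palette: the source does not say which colors are used.
GRADE_COLORS = ["green", "blue", "yellow", "orange", "red"]


def grade_for(similarity, n_grades=5):
    """Map a similarity score in [0, 1] to a grade from 1 to n_grades.

    Equal-width bands are an assumption; the source does not define them.
    """
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must lie in [0, 1]")
    # A score of exactly 1.0 falls into the top grade.
    return min(int(similarity * n_grades), n_grades - 1) + 1


def color_for(similarity):
    """Return the marking color for the grade of this similarity score."""
    return GRADE_COLORS[grade_for(similarity) - 1]
```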
(2) Text information segmentation and numbering: segmenting the text labeling data, and numbering the segmented data in segmentation order;
the original of the text labeling data is retained before segmentation, and the text labeling data is segmented and numbered as follows:
first, dividing the text into parts according to the degree of association of the text content, and numbering the parts a first time in division order, for example as 1, 2, 3, 4, 5, 6, 7, 8 and 9;
second, randomly integrating the contents of the parts so that the integrated parts contain equal amounts of data, and numbering the integrated data a second time, for example as 14, 235, 68 and 79;
third, summarizing the first numbering and the second numbering, and recording the summarized numbers.
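The two-pass numbering can be sketched as below. The example numbers 14, 235, 68 and 79 appear to join the first-pass numbers of the parts placed in each group, so this sketch uses a tuple of first-pass numbers as the second number, which also stays unambiguous beyond nine parts. The greedy balancing rule, the `seed` parameter and the function names are assumptions, not details taken from the source.

```python
import random


def two_pass_numbering(parts, n_groups, seed=None):
    """Number parts, group them at random, and record both numberings.

    parts: text pieces in split order.
    Returns {second_number: [(first_number, piece), ...]}.
    """
    rng = random.Random(seed)
    numbered = list(enumerate(parts, start=1))  # first numbering
    rng.shuffle(numbered)                       # random integration
    groups = [[] for _ in range(n_groups)]
    sizes = [0] * n_groups
    for number, piece in numbered:
        target = sizes.index(min(sizes))        # keep data per group about equal
        groups[target].append((number, piece))
        sizes[target] += len(piece)
    record = {}
    for group in groups:
        if group:
            group.sort()                        # order pieces by first number
            record[tuple(n for n, _ in group)] = group  # second numbering
    return record


def restore(record):
    """Rebuild the original text from the recorded numbering."""
    pieces = sorted(p for group in record.values() for p in group)
    return "".join(piece for _, piece in pieces)
```

Keeping the record on the issuing side is what allows the original order to be recovered after the groups come back.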
(3) Text information publishing: issuing the numbered data, the text search range and the text labeling standard to different labeling platforms;
after the files are issued, the numbers are checked with each platform to judge whether the files were issued completely; when a number is inaccurate or missing, the files are re-issued to the platform where the issuing error occurred.
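One way to picture this check is a comparison between the numbers recorded at issue time and the numbers each platform reports back. The dictionary layout and the name `check_issue` are hypothetical; the source does not describe how platforms report what they received.

```python
def check_issue(issued, acknowledged):
    """Compare issued numbers with the numbers each platform reports.

    Both arguments map platform name -> iterable of segment numbers.
    Returns only the platforms whose numbers do not match.
    """
    problems = {}
    for platform, sent in issued.items():
        sent, got = set(sent), set(acknowledged.get(platform, ()))
        if sent != got:
            problems[platform] = {
                "missing": sorted(sent - got),     # issued but not reported
                "unexpected": sorted(got - sent),  # reported but never issued
            }
    return problems
```

Any platform that appears in the result would be a candidate for re-issuing.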
(4) Text information labeling: each labeling platform receives the data, searches the text search range for the numbered data according to the text labeling standard, summarizes the results, and marks the extracted features at the corresponding positions of the issued text labeling data according to the text labeling standard;
(5) Labeling information retrieval: collecting all the data found by all the labeling platforms;
after the files are recalled, the file state is judged by comparing the numbers and checking whether each file has been labeled, whether the labeling is complete and whether any file is missing; when the file state is abnormal, the file is requested from the platform again.
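The state check on recalled files might look like the sketch below. The result format (a `labels` list that is `None` when nothing was labeled, plus a `complete` flag) and the status names are assumptions chosen for illustration; the source only names the three conditions to check.

```python
def check_recalled(expected_numbers, recalled):
    """Classify every expected segment number after recall.

    recalled maps number -> {"labels": list or None, "complete": bool},
    a hypothetical result format.
    """
    status = {}
    for number in expected_numbers:
        result = recalled.get(number)
        if result is None:
            status[number] = "missing"      # file never came back
        elif result["labels"] is None:
            status[number] = "unlabeled"    # came back without labels
        elif not result["complete"]:
            status[number] = "incomplete"   # labeling was cut short
        else:
            status[number] = "ok"
    return status


def numbers_to_request_again(status):
    """Numbers whose state is abnormal and must be requested again."""
    return sorted(n for n, s in status.items() if s != "ok")
```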
(6) Text sorting and combination: sorting the collected data by number to obtain the complete labeled content;
the file obtained by sorting the collected data by number is compared with the original of the text labeling data retained before segmentation to judge whether the sorting is wrong; when it is wrong, the labeled data is re-sorted according to the retained original.
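The comparison against the retained original can be sketched in a few lines. The sketch assumes each recalled segment still carries its unmodified source text, and it re-sorts by the position at which each segment occurs in the original; both function names are hypothetical.

```python
def is_order_correct(segments, original):
    """True when the segments, joined in their current order, equal the original."""
    return "".join(segments) == original


def reorder_by_original(segments, original):
    """Re-sort segments by where each one occurs in the retained original.

    Identical segments share a position, and a segment that is absent
    from the original sorts first (str.find returns -1).
    """
    return sorted(segments, key=original.find)
```

For instance, segments that come back as "brown ", "the ", "fox", "quick " fail the check against "the quick brown fox" and are put back in order by `reorder_by_original`.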
Example 2
A text data labeling method comprises the following steps:
(1) extracting text information: determining a text search range, text labeling data and a text labeling standard;
when the text search range is established, it is divided according to the subject or type of the text labeling data; the text labeling data is the labeling data to be displayed; the text labeling standard is divided into different grades according to the similarity of the text labeling data, for example into 5 grades, with each grade marked in a different color.
(2) Text information segmentation and numbering: segmenting the files in the text search range, dividing the total file directly into parts containing equal amounts of data, numbering the divided data in division order, and keeping a record of the text search range as it was before segmentation.
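Dividing the total file into equal parts can be sketched as follows. The sketch treats the search-range data as one sequence (a string or a list of records) and lets the first few parts take one extra item when the length does not divide evenly; that tie-breaking rule and the name `split_equal` are assumptions.

```python
def split_equal(data, n_parts):
    """Split data into n_parts pieces whose lengths differ by at most one.

    Returns [(number, piece), ...] numbered in division order.
    """
    base, extra = divmod(len(data), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        end = start + base + (1 if i < extra else 0)  # spread the remainder
        parts.append((i + 1, data[start:end]))
        start = end
    return parts
```

Because the numbers follow division order directly, no second numbering pass is needed in this variant.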
(3) Text information publishing: issuing the numbered data, the text labeling data and the text labeling standard to different labeling platforms, and recording the numbers of the numbered data issued to each platform;
after the files are issued, the numbers are checked with each platform to judge whether the files in the text search range were issued completely; when a number is missing, the files are re-issued to the platform where the issuing error occurred.
(4) Text information labeling: each labeling platform receives the data, searches the numbered data for the text labeling data according to the text labeling standard, summarizes the results, and marks the extracted features at the corresponding positions of the issued text labeling data according to the text labeling standard;
(5) Labeling information retrieval: collecting all the data found by all the labeling platforms;
after the files are recalled, the file state is judged by comparing the numbers and checking whether each file has been labeled, whether the labeling is complete and whether any file is missing; when the file state is abnormal, the file is requested from the platform again.
(6) Text sorting and combination: integrating all the labeled texts to obtain the complete labeled content.
In Example 1, the text labeling data is segmented, and each labeling platform searches, extracts and labels the segmented data within the complete text search range; in Example 2, the text search range is segmented, and each labeling platform searches, extracts and labels the text labeling data within its portion of the search range. Compared with Example 2, Example 1 offers higher file security; compared with Example 1, Example 2 needs no multiple numbering and is easier to integrate.
In summary, in the text data labeling method provided by the invention, the text labeling data or the data in the text search range is segmented before labeling, the segments are sent to different labeling platforms for simultaneous labeling, and the labeled results are finally collected, which increases the labeling speed, shortens the labeling time and reduces the load on each labeling platform. Because the text labeling data is segmented and recombined before being issued to different labeling platforms, the data received by each platform is incomplete, which reduces the risk of data leakage and improves the security of text data labeling.
The foregoing shows and describes the general principles, main features and advantages of the invention. Those skilled in the art will understand that the invention is not limited to the embodiments described above, which, together with the specification, only illustrate the principle of the invention; various changes and modifications may be made without departing from the spirit and scope of the invention, and all of them fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and their equivalents.
Claims (8)
1. A text data labeling method, characterized in that the method comprises the following steps:
(1) text information extraction: determining a text search range, text labeling data and a text labeling standard;
(2) text information segmentation and numbering: segmenting the text labeling data, and numbering the segmented data in segmentation order;
(3) text information publishing: issuing the numbered data, the text search range and the text labeling standard to different labeling platforms;
(4) text information labeling: each labeling platform receives the data, searches the text search range for the numbered data according to the text labeling standard, and summarizes the results;
(5) labeling information retrieval: collecting all the data found by all the labeling platforms;
(6) text sorting and combination: sorting the collected data by number to obtain the complete labeled content.
2. The method for labeling text data according to claim 1, wherein: in step (1), the text search range is divided according to the subject or type of the text labeling data, the text labeling data is the labeling data to be displayed, and the text labeling standard is divided into different grades according to the similarity of the text labeling data.
3. The method for labeling text data according to claim 1, wherein: in step (2), the original of the text labeling data is retained before segmentation, and the text labeling data is segmented and numbered as follows:
first, dividing the text into parts according to the degree of association of the text content, and numbering the parts a first time in division order;
second, randomly integrating the contents of the parts so that the integrated parts contain equal amounts of data, and numbering the integrated data a second time;
third, summarizing the first numbering and the second numbering.
4. The method for labeling text data according to claim 1, wherein: in step (2), the files in the text search range can also be segmented, the total file being divided directly into parts containing equal amounts of data, and the divided data being numbered in division order.
5. The method of claim 4, wherein: in step (3), the numbered data, the text labeling data and the text labeling standard can be issued to different labeling platforms, and the numbers of the numbered data issued to each platform are recorded.
6. The method of claim 5, wherein: after the files are issued in step (3), the numbers are checked with each platform to judge whether the files were issued completely, and when a number is inaccurate or missing, the files are re-issued to the platform where the issuing error occurred.
7. The method for labeling text data according to claim 1, wherein: after the files are recalled in step (5), the file state is judged by comparing the numbers and checking whether each file has been labeled, whether the labeling is complete and whether any file is missing, and when the file state is abnormal, the file is requested from the platform again.
8. The method for labeling text data according to claim 1, wherein: in step (6), the file obtained by sorting the collected data by number is compared with the original of the text labeling data retained before segmentation to judge whether the sorting is wrong, and when it is wrong, the labeled data is re-sorted according to the retained original.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010881236.6A CN112052646B (en) | 2020-08-27 | 2020-08-27 | Text data labeling method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010881236.6A CN112052646B (en) | 2020-08-27 | 2020-08-27 | Text data labeling method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112052646A true CN112052646A (en) | 2020-12-08 |
CN112052646B CN112052646B (en) | 2024-03-29 |
Family
ID=73600287
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010881236.6A Active CN112052646B (en) | 2020-08-27 | 2020-08-27 | Text data labeling method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112052646B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023168964A1 (en) * | 2022-03-07 | 2023-09-14 | 华为云计算技术有限公司 | Data segmentation method and related apparatus |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1367446A (en) * | 2001-01-22 | 2002-09-04 | 前程无忧网络信息技术(北京)有限公司上海分公司 | Chinese personal biographical notes information treatment system and method |
CN106021227A (en) * | 2016-05-16 | 2016-10-12 | 南京大学 | State transition and neural network-based Chinese chunk parsing method |
CN106407407A (en) * | 2016-09-22 | 2017-02-15 | 江苏通付盾科技有限公司 | A file tagging system and method |
CN106599041A (en) * | 2016-11-07 | 2017-04-26 | 中国电子科技集团公司第三十二研究所 | Text processing and retrieval system based on big data platform |
CN107908642A (en) * | 2017-09-29 | 2018-04-13 | 江苏华通晟云科技有限公司 | Industry text entities extracting method based on distributed platform |
CN108108355A (en) * | 2017-12-25 | 2018-06-01 | 北京牡丹电子集团有限责任公司数字电视技术中心 | Text emotion analysis method and system based on deep learning |
CN108664473A (en) * | 2018-05-11 | 2018-10-16 | 平安科技(深圳)有限公司 | Recognition methods, electronic device and the readable storage medium storing program for executing of text key message |
CN108920460A (en) * | 2018-06-26 | 2018-11-30 | 武大吉奥信息技术有限公司 | A kind of training method and device of the multitask deep learning model of polymorphic type Entity recognition |
CN109101538A (en) * | 2018-06-29 | 2018-12-28 | 中译语通科技股份有限公司 | A kind of entity abstracting method and system towards Chinese patent text |
CN109635280A (en) * | 2018-11-22 | 2019-04-16 | 园宝科技(武汉)有限公司 | A kind of event extraction method based on mark |
CN109697285A (en) * | 2018-12-13 | 2019-04-30 | 中南大学 | Enhance the hierarchical B iLSTM Chinese electronic health record disease code mask method of semantic expressiveness |
CN110888991A (en) * | 2019-11-28 | 2020-03-17 | 哈尔滨工程大学 | Sectional semantic annotation method in weak annotation environment |
CN111191456A (en) * | 2018-11-15 | 2020-05-22 | 零氪科技(天津)有限公司 | Method for identifying text segmentation by using sequence label |
CN111209728A (en) * | 2020-01-13 | 2020-05-29 | 深圳市企鹅网络科技有限公司 | Automatic test question labeling and inputting method |
Also Published As
Publication number | Publication date |
---|---|
CN112052646B (en) | 2024-03-29 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||