CN112052646A - Text data labeling method - Google Patents


Info

Publication number
CN112052646A
Authority
CN
China
Prior art keywords
text
data
labeling
file
marking
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010881236.6A
Other languages
Chinese (zh)
Other versions
CN112052646B (en)
Inventor
江灏
汤智
曾东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Jurong Science And Technology Information Consulting Co ltd
Original Assignee
Anhui Jurong Science And Technology Information Consulting Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Jurong Science And Technology Information Consulting Co ltd
Priority to CN202010881236.6A
Publication of CN112052646A
Application granted
Publication of CN112052646B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/103: Formatting, i.e. changing of presentation of documents
    • G06F40/117: Tagging; Marking up; Designating a block; Setting of attributes
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/189: Automatic justification

Abstract

The invention discloses a text data labeling method comprising the following steps. Text information extraction: determining a text search range, text labeling data and a text labeling standard. Text information segmentation and numbering: segmenting the text labeling data and numbering the segments in segmentation order. Before labeling, the method segments either the text labeling data or the data in the text search range, distributes the segments to different labeling platforms for simultaneous labeling, and finally collects the labeled results; this increases labeling speed, shortens labeling time and reduces the load on each labeling platform. Because the text labeling data are split and recombined before being issued to different labeling platforms, each platform receives only incomplete data, which reduces the risk of data leakage and improves the security of text data labeling.

Description

Text data labeling method
Technical Field
The invention relates to the field of data management, in particular to a text data labeling method.
Background
With the rapid development of society and continually rising living standards, information has become an important part of every industry. When designing a product, people usually run a duplicate check on the text they have written, during which the relevant content needs to be labeled, and a number of text data labeling methods have been devised to make this labeling easier.
Existing text data labeling methods have certain drawbacks in use. First, they generally hand the text to a labeling platform, which compares it directly against all available data; as databases keep growing, the time required for labeling keeps increasing, labeling cannot be completed in time, and the demands on labeling platform equipment keep rising. Second, all files are gathered in one place, so file leakage is likely and security is low. A text data labeling method is therefore proposed.
Disclosure of Invention
The main object of the invention is to provide a text data labeling method that effectively solves the problems described in the background.
To achieve this object, the invention adopts the following technical solution:
a text data labeling method comprises the following steps:
(1) text information extraction: determining a text search range, text labeling data and a text labeling standard;
(2) text information segmentation and numbering: segmenting the text labeling data and numbering the segments in segmentation order;
(3) text information publishing: issuing the numbered data, the text search range and the text labeling standard to different labeling platforms;
(4) text information labeling: each labeling platform receives the data, searches for the numbered data within the text search range according to the text labeling standard, and summarizes the results;
(5) labeling information retrieval: collecting all data found by all labeling platforms;
(6) text sorting and combination: sorting the collected data by number to obtain the complete labeled content.
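The six steps above can be sketched end to end as follows. This is only an illustrative outline, not the patented implementation: `label_segment` is a hypothetical stand-in for a labeling platform (here it merely checks whether a segment occurs verbatim in any search-range document), and the use of a thread pool to model "different labeling platforms" is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def label_segment(number, segment, search_range):
    """Hypothetical platform: flag the segment if it occurs in the search range."""
    matched = any(segment in document for document in search_range)
    return number, segment, matched


def label_text(segments, search_range, workers=3):
    # Step (2): number the segments in segmentation order, starting at 1.
    numbered = list(enumerate(segments, start=1))
    # Steps (3)-(4): issue each numbered segment to a worker and label concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(label_segment, n, s, search_range) for n, s in numbered]
        # Step (5): collect every result returned by the workers.
        results = [future.result() for future in futures]
    # Step (6): sort by number to restore the original order.
    results.sort(key=lambda item: item[0])
    return results
```

Calling `label_text(["alpha", "beta"], ["xx alpha yy"])` returns one `(number, segment, matched)` tuple per segment, ordered by number.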
Preferably, in step (1) the text search range is delimited according to the subject or type of the text labeling data, the text labeling data are the labeling data to be displayed, and the text labeling standard is divided into grades according to the degree of similarity to the text labeling data.
Preferably, in step (2) the original of the text labeling data is retained before segmentation, and the text labeling data are segmented and numbered as follows:
first, the text is divided into parts according to how closely their contents are related, and the parts are numbered a first time in division order;
second, the parts are randomly recombined so that the recombined groups hold equal amounts of data, and the recombined groups are numbered a second time;
third, the first and second numberings are recorded together.
Preferably, step (2) may instead segment the files in the text search range: the complete set of files is divided directly into portions of equal data size, and the portions are numbered in division order.
Preferably, step (3) may instead issue the numbered data, the text labeling data and the text labeling standard to different labeling platforms, recording which numbers are issued to each platform.
Preferably, after the files are issued in step (3), the numbers are checked against each platform to determine whether the files were issued completely; if a number is incorrect or missing, the files are reissued to the platform concerned.
Preferably, after the files are retrieved in step (5), their state is assessed by comparing numbers and checking whether each file has been labeled, whether the labeling is complete and whether any file is missing; if the state is abnormal, the file is requested from the platform again.
Preferably, in step (6) the file obtained by sorting the collected data by number is compared with the original of the text labeling data retained before segmentation to determine whether the order is wrong; if so, the labeled data are re-sorted according to the retained original.
Compared with the prior art, the text data labeling method has the following beneficial effects:
1. Before labeling, the text labeling data or the data in the text search range are segmented, the segments are sent to different labeling platforms for simultaneous labeling, and the labeled results are finally collected; this increases labeling speed, shortens labeling time and reduces the load on each labeling platform.
2. The text labeling data are split and recombined before being issued to different labeling platforms, so each platform receives only incomplete data, which reduces the risk of data leakage and improves the security of text data labeling.
Drawings
Fig. 1 is a flowchart of a text data labeling method according to the present invention.
Detailed Description
To make the technical means, inventive features, objects and effects of the invention easy to understand, the invention is further described below with reference to specific embodiments.
Example 1
A text data labeling method comprises the following steps:
(1) extracting text information: determining a text search range, text labeling data and a text labeling standard;
When the text search range is established, it is delimited according to the subject or type of the text labeling data, which are the labeling data to be displayed. The text labeling standard is divided into grades according to the degree of similarity to the text labeling data; for example, similarity is divided into 5 grades, and each grade is marked in a different color.
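As an illustration of the five-grade labeling standard, the sketch below maps a similarity score to a grade and a marking color. The source gives neither the grade boundaries nor the colors, so the equal-width bands over a score in [0, 1] and the color names are assumptions chosen only for the example.

```python
# Assumed colors, one per grade; the source only says each grade has its own color.
GRADE_COLORS = ["green", "blue", "yellow", "orange", "red"]


def similarity_grade(similarity: float) -> int:
    """Map a similarity score in [0, 1] to a grade from 1 to 5 (assumed equal-width bands)."""
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must lie in [0, 1]")
    # A score of exactly 1.0 would otherwise land in a sixth band, so cap at 5.
    return min(int(similarity * 5) + 1, 5)


def grade_color(similarity: float) -> str:
    """Return the marking color assigned to the grade of this similarity score."""
    return GRADE_COLORS[similarity_grade(similarity) - 1]
```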
(2) Text information segmentation and numbering: segmenting the text labeling data and numbering the segments in segmentation order;
The original of the text labeling data is retained before segmentation, and the text labeling data are segmented and numbered as follows:
first, the text is divided into parts according to how closely their contents are related, and the parts are numbered a first time in division order, for example 1, 2, 3, 4, 5, 6, 7, 8 and 9;
second, the parts are randomly recombined so that the recombined groups hold equal amounts of data, and the groups are numbered a second time, for example 14, 235, 68 and 79;
third, the first and second numberings are combined and the combined numbers are recorded.
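The two-stage numbering can be sketched as below. The reading that a second-stage number such as 235 denotes a group made of first-stage parts 2, 3 and 5 is an assumption, as is the balancing rule: parts are shuffled and each is placed in the group that currently holds the least data, which only approximates "equal" groups. The function names are hypothetical.

```python
import random


def number_parts(parts):
    """First numbering: map 1, 2, 3, ... to the parts in division order."""
    return {number: part for number, part in enumerate(parts, start=1)}


def bundle_parts(numbered, n_bundles, seed=None):
    """Second numbering: randomly regroup parts into bundles of similar total size.

    Returns a record {bundle number: [first-stage numbers]} that allows the
    original order to be restored later.
    """
    rng = random.Random(seed)
    order = list(numbered)
    rng.shuffle(order)  # random recombination of the parts
    bundles = [[] for _ in range(n_bundles)]
    sizes = [0] * n_bundles
    for number in order:
        smallest = sizes.index(min(sizes))  # bundle holding the least data so far
        bundles[smallest].append(number)
        sizes[smallest] += len(numbered[number])
    return {b: sorted(numbers) for b, numbers in enumerate(bundles, start=1)}
```

Keeping the returned record on the issuing side, and sending each platform only its bundle, is what leaves every platform with an incomplete view of the text.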
(3) Text information publishing: issuing the numbered data, the text search range and the text labeling standard to different labeling platforms;
After the files are issued, the numbers are checked against each platform to determine whether the files were issued completely; if a number is incorrect or missing, the files are reissued to the platform concerned.
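The post-issue check might look like the following sketch. It assumes each platform can report back the set of numbers it actually received; the dictionaries `issued` and `acknowledged` and the function name are hypothetical.

```python
def check_issue(issued, acknowledged):
    """Compare the numbers sent to each platform with the numbers it reports.

    issued:       {platform: set of numbers sent}
    acknowledged: {platform: set of numbers the platform says it received}
    Returns {platform: {"missing": [...], "unexpected": [...]}} for every
    platform whose numbers do not match and that therefore needs a reissue.
    """
    problems = {}
    for platform, sent in issued.items():
        received = set(acknowledged.get(platform, ()))
        missing = sorted(set(sent) - received)      # numbers that never arrived
        unexpected = sorted(received - set(sent))   # numbers that do not belong there
        if missing or unexpected:
            problems[platform] = {"missing": missing, "unexpected": unexpected}
    return problems
```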
(4) Text information labeling: each labeling platform receives the data, searches for the numbered data within the text search range according to the text labeling standard, summarizes the results, and marks the extracted features at the corresponding positions in the issued text labeling data according to the text labeling standard;
(5) Labeling information retrieval: collecting all data found by all labeling platforms;
After the files are retrieved, their state is assessed by comparing numbers and checking whether each file has been labeled, whether the labeling is complete and whether any file is missing; if the state is abnormal, the file is requested from the platform again.
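A possible shape for this state check is sketched below. The source does not say how a platform reports progress, so the `(processed, total)` pair per file is an assumption, as are the status names.

```python
def file_status(expected_numbers, recalled):
    """Classify each expected file after retrieval.

    recalled: {number: (processed, total)}, where processed is how many units
    of the file the platform reports as labeled (hypothetical reporting format).
    """
    status = {}
    for number in expected_numbers:
        if number not in recalled:
            status[number] = "missing"
            continue
        processed, total = recalled[number]
        if processed == 0:
            status[number] = "unlabeled"
        elif processed < total:
            status[number] = "incomplete"
        else:
            status[number] = "ok"
    return status


def numbers_to_request_again(status):
    """Numbers whose state is abnormal and must be requested from the platform again."""
    return [number for number, state in status.items() if state != "ok"]
```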
(6) Text sorting and combination: sorting the collected data by number to obtain the complete labeled content;
The file obtained by sorting the collected data by number is compared with the original of the text labeling data retained before segmentation to determine whether the order is wrong; if so, the labeled data are re-sorted according to the retained original.
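The sort-then-verify step can be illustrated as follows. The sketch assumes each labeled segment carries its unmodified text alongside its labels, and that the fallback re-sort locates each segment in the retained original with `str.find`; repeated identical segments would make that fallback ambiguous, so this is a simplification rather than the method the patent prescribes.

```python
def combine(labeled, original):
    """Order labeled segments by number, verifying the result against the original.

    labeled:  {number: (segment_text, labels)}
    original: the text retained before segmentation
    Returns a list of (segment_text, labels) in document order.
    """
    ordered = sorted(labeled.items())  # sort by number
    joined = "".join(text for _, (text, _) in ordered)
    if joined != original:
        # Numbering was wrong: fall back to each segment's position in the original.
        ordered = sorted(labeled.items(), key=lambda item: original.find(item[1][0]))
    return [(text, labels) for _, (text, labels) in ordered]
```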
Example 2
A text data labeling method comprises the following steps:
(1) extracting text information: determining a text search range, text labeling data and a text labeling standard;
When the text search range is established, it is delimited according to the subject or type of the text labeling data, which are the labeling data to be displayed. The text labeling standard is divided into grades according to the degree of similarity to the text labeling data; for example, similarity is divided into 5 grades, and each grade is marked in a different color.
(2) Text information segmentation and numbering: the files in the text search range are segmented; the complete set of files is divided directly into portions of equal data size, the portions are numbered in division order, and a record of the text search range as it was before segmentation is retained.
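A minimal sketch of the equal-size split used in this example is given below. It treats the search range as a list of documents and makes the portions differ by at most one document when the count does not divide evenly; measuring "equal data" by document count rather than by bytes is an assumption.

```python
def split_equal(documents, n_parts):
    """Split documents into n_parts consecutive portions of near-equal size.

    Returns {portion number: list of documents}, numbered in division order.
    """
    base, extra = divmod(len(documents), n_parts)
    portions = {}
    start = 0
    for number in range(1, n_parts + 1):
        # The first `extra` portions take one additional document each.
        size = base + (1 if number <= extra else 0)
        portions[number] = documents[start:start + size]
        start += size
    return portions
```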
(3) Text information publishing: issuing the numbered data, the text labeling data and the text labeling standard to different labeling platforms, and recording which numbers are issued to each platform;
After the files are issued, the numbers are checked against each platform to determine whether the files in the text search range were issued completely; if a number is missing, the files are reissued to the platform concerned.
(4) Text information labeling: each labeling platform receives the data, searches for the text labeling data within the numbered data according to the text labeling standard, summarizes the results, and marks the extracted features at the corresponding positions in the issued text labeling data according to the text labeling standard;
(5) Labeling information retrieval: collecting all data found by all labeling platforms;
After the files are retrieved, their state is assessed by comparing numbers and checking whether each file has been labeled, whether the labeling is complete and whether any file is missing; if the state is abnormal, the file is requested from the platform again.
(6) Text sorting and combination: merging all the labeled texts to obtain the complete labeled content.
In Example 1 the text to be labeled is segmented, and each labeling platform searches, extracts and labels its segments against the complete text search range. In Example 2 the text search range is segmented, and each labeling platform searches, extracts and labels the text labeling data against its portion of the search range. Example 1 offers higher file security than Example 2, while Example 2 needs no repeated numbering and is easier to merge.
In summary, before labeling, the method segments the text labeling data or the data in the text search range, sends the segments to different labeling platforms for simultaneous labeling, and finally collects the labeled results; this increases labeling speed, shortens labeling time and reduces the load on each labeling platform. Because the text labeling data are split and recombined before being issued to different labeling platforms, each platform receives only incomplete data, which reduces the risk of data leakage and improves the security of text data labeling.
The foregoing shows and describes the general principles, main features and advantages of the invention. Those skilled in the art will understand that the invention is not limited to the embodiments described above, which, together with the description, merely illustrate its principle; various changes and modifications may be made without departing from the spirit and scope of the invention, and all such changes fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and their equivalents.

Claims (8)

1. A text data labeling method, characterized by comprising the following steps:
(1) text information extraction: determining a text search range, text labeling data and a text labeling standard;
(2) text information segmentation and numbering: segmenting the text labeling data and numbering the segments in segmentation order;
(3) text information publishing: issuing the numbered data, the text search range and the text labeling standard to different labeling platforms;
(4) text information labeling: each labeling platform receives the data, searches for the numbered data within the text search range according to the text labeling standard, and summarizes the results;
(5) labeling information retrieval: collecting all data found by all labeling platforms;
(6) text sorting and combination: sorting the collected data by number to obtain the complete labeled content.
2. The text data labeling method according to claim 1, characterized in that: in step (1) the text search range is delimited according to the subject or type of the text labeling data, the text labeling data are the labeling data to be displayed, and the text labeling standard is divided into grades according to the degree of similarity to the text labeling data.
3. The text data labeling method according to claim 1, characterized in that: in step (2) the original of the text labeling data is retained before segmentation, and the text labeling data are segmented and numbered as follows:
first, the text is divided into parts according to how closely their contents are related, and the parts are numbered a first time in division order;
second, the parts are randomly recombined so that the recombined groups hold equal amounts of data, and the recombined groups are numbered a second time;
third, the first and second numberings are recorded together.
4. The text data labeling method according to claim 1, characterized in that: in step (2) the files in the text search range may instead be segmented, the complete set of files being divided directly into portions of equal data size and the portions being numbered in division order.
5. The text data labeling method according to claim 4, characterized in that: in step (3) the numbered data, the text labeling data and the text labeling standard may be issued to different labeling platforms, and the numbers issued to each platform are recorded.
6. The text data labeling method according to claim 5, characterized in that: after the files are issued in step (3), the numbers are checked against each platform to determine whether the files were issued completely, and if a number is incorrect or missing, the files are reissued to the platform concerned.
7. The text data labeling method according to claim 1, characterized in that: after the files are retrieved in step (5), their state is assessed by comparing numbers and checking whether each file has been labeled, whether the labeling is complete and whether any file is missing, and if the state is abnormal, the file is requested from the platform again.
8. The text data labeling method according to claim 1, characterized in that: in step (6) the file obtained by sorting the collected data by number is compared with the original of the text labeling data retained before segmentation to determine whether the order is wrong, and if so, the labeled data are re-sorted according to the retained original.
CN202010881236.6A 2020-08-27 2020-08-27 Text data labeling method Active CN112052646B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010881236.6A CN112052646B (en) 2020-08-27 2020-08-27 Text data labeling method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010881236.6A CN112052646B (en) 2020-08-27 2020-08-27 Text data labeling method

Publications (2)

Publication Number Publication Date
CN112052646A (en) 2020-12-08
CN112052646B (en) 2024-03-29

Family

ID=73600287

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010881236.6A Active CN112052646B (en) 2020-08-27 2020-08-27 Text data labeling method

Country Status (1)

Country Link
CN (1) CN112052646B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023168964A1 (en) * 2022-03-07 2023-09-14 华为云计算技术有限公司 Data segmentation method and related apparatus

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1367446A (en) * 2001-01-22 2002-09-04 前程无忧网络信息技术(北京)有限公司上海分公司 Chinese personal biographical notes information treatment system and method
CN106021227A (en) * 2016-05-16 2016-10-12 南京大学 State transition and neural network-based Chinese chunk parsing method
CN106407407A (en) * 2016-09-22 2017-02-15 江苏通付盾科技有限公司 A file tagging system and method
CN106599041A (en) * 2016-11-07 2017-04-26 中国电子科技集团公司第三十二研究所 Text processing and retrieval system based on big data platform
CN107908642A (en) * 2017-09-29 2018-04-13 江苏华通晟云科技有限公司 Industry text entities extracting method based on distributed platform
CN108108355A (en) * 2017-12-25 2018-06-01 北京牡丹电子集团有限责任公司数字电视技术中心 Text emotion analysis method and system based on deep learning
CN108664473A (en) * 2018-05-11 2018-10-16 平安科技(深圳)有限公司 Recognition methods, electronic device and the readable storage medium storing program for executing of text key message
CN108920460A (en) * 2018-06-26 2018-11-30 武大吉奥信息技术有限公司 A kind of training method and device of the multitask deep learning model of polymorphic type Entity recognition
CN109101538A (en) * 2018-06-29 2018-12-28 中译语通科技股份有限公司 A kind of entity abstracting method and system towards Chinese patent text
CN109635280A (en) * 2018-11-22 2019-04-16 园宝科技(武汉)有限公司 A kind of event extraction method based on mark
CN109697285A (en) * 2018-12-13 2019-04-30 中南大学 Enhance the hierarchical B iLSTM Chinese electronic health record disease code mask method of semantic expressiveness
CN110888991A (en) * 2019-11-28 2020-03-17 哈尔滨工程大学 Sectional semantic annotation method in weak annotation environment
CN111191456A (en) * 2018-11-15 2020-05-22 零氪科技(天津)有限公司 Method for identifying text segmentation by using sequence label
CN111209728A (en) * 2020-01-13 2020-05-29 深圳市企鹅网络科技有限公司 Automatic test question labeling and inputting method

Also Published As

Publication number Publication date
CN112052646B (en) 2024-03-29

Similar Documents

Publication Publication Date Title
US20090144277A1 (en) Electronic table of contents entry classification and labeling scheme
US10445359B2 (en) Method and system for classifying media content
CN106066866A (en) A kind of automatic abstracting method of english literature key phrase and system
CN110321420B (en) Intelligent question-answering system and method based on question generation
CN113158653B (en) Training method, application method, device and equipment for pre-training language model
CN106407182A (en) A method for automatic abstracting for electronic official documents of enterprises
CN110609998A (en) Data extraction method of electronic document information, electronic equipment and storage medium
CN110175334B (en) Text knowledge extraction system and method based on custom knowledge slot structure
CN108959566A (en) A kind of medical text based on Stacking integrated study goes privacy methods and system
CN107562843B (en) News hot phrase extraction method based on title high-frequency segmentation
CN110209828A (en) Case querying method and case inquiry unit, computer equipment and storage medium
CN112364172A (en) Method for constructing knowledge graph in government official document field
CN115618014B (en) Standard document analysis management system and method applying big data technology
CN109033225A (en) Chinese address identifying system
CN116340467B (en) Text processing method, text processing device, electronic equipment and computer readable storage medium
CN112052646A (en) Text data labeling method
CN114090736A (en) Enterprise industry identification system and method based on text similarity
US7107524B2 (en) Computer implemented example-based concept-oriented data extraction method
JP3735336B2 (en) Document summarization method and system
Natsev et al. IBM Research TRECVID-2008 Video Retrieval System.
CN113515622A (en) Classified storage system for archive data
CN105608137A (en) Method and device for extracting identity label
CN115544975B (en) Log format conversion method and device
CN111222031A (en) Website distinguishing method and system
CN111026743A (en) Rail transit engineering project structure data standardization method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant