CN112052646B - Text data labeling method - Google Patents

Publication number
CN112052646B
Authority
CN
China
Prior art keywords
text
data
labeling
file
marking
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010881236.6A
Other languages
Chinese (zh)
Other versions
CN112052646A (en)
Inventor
江灏
汤智
曾东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Jurong Science And Technology Information Consulting Co ltd
Original Assignee
Anhui Jurong Science And Technology Information Consulting Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Jurong Science And Technology Information Consulting Co ltd
Priority to CN202010881236.6A
Publication of CN112052646A
Application granted
Publication of CN112052646B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/117 Tagging; Marking up; Designating a block; Setting of attributes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/189 Automatic justification

Abstract

The invention discloses a text data labeling method comprising the following steps. Text information extraction: determining a text search range, text labeling data and a text labeling standard. Text information segmentation and numbering: segmenting the text labeling data and numbering the segments in segmentation order. In this method, the text labeling data, or the data in the text search range, is segmented before labeling and issued to different labeling platforms for simultaneous labeling, and the labeled results are finally collected. This increases labeling speed, shortens labeling time, and reduces the load on the labeling platforms. Because the text labeling data is segmented and recombined before being issued to the different labeling platforms, each platform receives only incomplete data, which reduces the risk of data leakage and improves the security of text data labeling.

Description

Text data labeling method
Technical Field
The invention relates to the field of data management, in particular to a text data labeling method.
Background
With the rapid development of society and rising living standards, information has become an important part of every industry. When designing products, people usually check the design text and need to mark the relevant content, and a number of text data labeling methods have been devised to make this marking easier.
Existing text data labeling methods have certain drawbacks in use. First, they generally hand the text to a single labeling platform, which compares it directly against all of the data; the larger the database, the longer labeling takes and the higher the demands on the platform's equipment, so labeling cannot be completed in time. Second, all of the files are gathered in one place, so file leakage occurs easily and security is low.
Disclosure of Invention
The main purpose of the invention is to provide a text data labeling method that effectively solves the problems described in the Background.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
a text data labeling method comprises the following steps:
(1) Extracting text information: determining a text search range, text labeling data and a text labeling standard;
(2) Segmenting and numbering text information: segmenting the text labeling data and numbering the segments in segmentation order;
(3) Issuing text information: issuing the numbered data, the text search range and the text labeling standard to different labeling platforms;
(4) Labeling text information: each labeling platform receives the data, searches the text search range for the numbered data according to the text labeling standard, and collects the results;
(5) Recalling labeling information: collecting all of the data found by all of the labeling platforms;
(6) Ordering and combining text: sorting the collected data by number to obtain the complete labeled content.
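The six steps above can be sketched end to end as follows. This is only an illustrative Python sketch, not the patented implementation: `split_and_number`, `label_on_platform` and `run_pipeline` are hypothetical names, splitting on full stops stands in for the patent's content-based segmentation, and the "platform" is a local function that performs an exact substring search.

```python
from typing import Callable

# A "platform" is modeled as a function: (segment, search_range) -> matching documents.
Platform = Callable[[str, list[str]], list[str]]


def split_and_number(text: str) -> dict[int, str]:
    """Step 2: segment the text and number the segments in segmentation order.

    Splitting on full stops is an assumption made for this sketch.
    """
    parts = [p.strip() for p in text.split(".") if p.strip()]
    return {number: part for number, part in enumerate(parts, start=1)}


def label_on_platform(segment: str, search_range: list[str]) -> list[str]:
    """Step 4 (hypothetical platform): return the documents that contain the segment."""
    return [doc for doc in search_range if segment in doc]


def run_pipeline(
    text: str, search_range: list[str], platforms: list[Platform]
) -> list[tuple[int, str, list[str]]]:
    numbered = split_and_number(text)  # step 2
    # Step 3: issue each numbered segment to a platform (round-robin is an assumption).
    issued = {n: platforms[(n - 1) % len(platforms)] for n in numbered}
    # Steps 4 and 5: every platform labels its segments; results are recalled by number.
    recalled = {n: issued[n](numbered[n], search_range) for n in numbered}
    # Step 6: sort by number to rebuild the complete labeled content.
    return [(n, numbered[n], recalled[n]) for n in sorted(recalled)]
```

Calling `run_pipeline(text, search_range, [label_on_platform, label_on_platform])` returns one `(number, segment, matches)` tuple per segment, in numbering order.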
Preferably, in step (1), the text search range is defined according to the subject or type of the text labeling data, that is, the data whose labels are to be displayed, and the text labeling standard is divided into different grades according to similarity to the text labeling data.
Preferably, in step (2), the original of the text labeling data is retained before segmentation, and the text labeling data is segmented and numbered as follows:
(1) dividing the text into parts according to how closely the content is related, and numbering the parts a first time in segmentation order;
(2) randomly combining the parts so that the combined groups contain equal amounts of data, and numbering the combined groups a second time;
(3) recording the first numbering and the second numbering together.
Preferably, in step (2), the files in the text search range can also be segmented: the whole set of files is divided directly into parts of equal data size, and the parts are numbered in segmentation order.
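Dividing the search range into equally sized, numbered parts can be illustrated with a short sketch. It assumes the search range is available as one string and that "equal data" means equal character counts (up to a remainder of one character); `split_equal` is a hypothetical helper name.

```python
def split_equal(data: str, parts: int) -> dict[int, str]:
    """Split `data` into `parts` consecutive chunks whose lengths differ by at most one.

    The chunks are numbered from 1 in segmentation order.
    """
    if parts <= 0:
        raise ValueError("parts must be positive")
    size, extra = divmod(len(data), parts)
    chunks: dict[int, str] = {}
    start = 0
    for number in range(1, parts + 1):
        # The first `extra` chunks take one additional character each.
        end = start + size + (1 if number <= extra else 0)
        chunks[number] = data[start:end]
        start = end
    return chunks
```

Concatenating the chunks in number order reproduces the input. One design caveat of a fixed-size split like this is that a match straddling a chunk boundary would not be found inside either chunk.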
Preferably, in step (3), the numbered data, the text labeling data and the text labeling standard can also be issued to different labeling platforms, and the numbers of the data issued to each platform are recorded.
Preferably, after the files are issued in step (3), the numbers are checked with each platform to determine whether the files were issued completely; when a number is inaccurate or missing, the files are re-issued to the platform where the error occurred.
Preferably, in step (5), the state of each file is checked after recall by comparing numbers: whether the file has been labeled, whether it has been labeled completely, and whether any file is missing. When a file's state is abnormal, the file is requested from the platform again.
Preferably, in step (6), the file obtained by sorting the collected data by number is compared with the original of the text labeling data retained before segmentation to determine whether the ordering is wrong; when it is wrong, the labeled data is reordered according to the retained original.
Compared with the prior art, the text data labeling method has the following beneficial effects:
1. The text labeling data, or the data in the text search range, is segmented before labeling and issued to different labeling platforms for simultaneous labeling, and the labeled results are finally collected. This increases labeling speed, shortens labeling time, and reduces the load on the labeling platforms;
2. The text labeling data is segmented and recombined before being issued to the different labeling platforms, so each platform receives only incomplete data, which reduces the risk of data leakage and improves the security of text data labeling.
Drawings
Fig. 1 is a flowchart of a text data labeling method according to the present invention.
Detailed Description
The invention is further described below with reference to specific embodiments, so that its technical means, inventive features, purpose and effects are easy to understand.
A text data labeling method comprises the following steps:
(1) Extracting text information: determining a text searching range, text marking data and a text marking standard;
When the text search range is defined, it is divided according to the subject or type of the text labeling data, that is, the data whose labels are to be displayed. The text labeling standard is divided into different grades according to similarity to the text labeling data; for example, it is divided into 5 grades by similarity, and each grade is marked in a different color.
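A five-grade, color-coded labeling standard could look like the sketch below. The patent does not give thresholds, colors or a similarity measure, so all three are assumptions here: the cut-offs and color names are invented for illustration, and similarity is computed with `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher

# (minimum similarity, grade, color). Thresholds and colors are hypothetical;
# the source only states that there are 5 grades, each with its own color.
GRADES = [
    (0.9, 5, "red"),
    (0.7, 4, "orange"),
    (0.5, 3, "yellow"),
    (0.3, 2, "green"),
    (0.0, 1, "blue"),
]


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] between two strings."""
    return SequenceMatcher(None, a, b).ratio()


def grade(score: float) -> tuple[int, str]:
    """Map a similarity score to its (grade, color) pair."""
    for threshold, level, color in GRADES:
        if score >= threshold:
            return level, color
    raise ValueError("score must be in [0, 1]")
```

A platform would then call `grade(similarity(segment, candidate))` and highlight the candidate in the returned color.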
(2) Text information segmentation and numbering: dividing the text labeling data, and numbering the divided data according to the dividing sequence;
The original of the text labeling data is retained before segmentation, and the text labeling data is segmented and numbered as follows:
(1) dividing the text into parts according to how closely the content is related, and numbering the parts a first time in segmentation order, for example as numbers 1, 2, 3, 4, 5, 6, 7, 8 and 9;
(2) randomly combining the parts so that the combined groups contain equal amounts of data, and numbering the combined groups a second time, for example as 14, 235, 68 and 79;
(3) recording the first numbering and the second numbering together.
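The two rounds of numbering can be sketched as follows. This is one possible reading, not the patent's own algorithm: parts are shuffled with a seeded random generator and each part is placed in the group that currently holds the least data, which only approximates "equal" groups. The second number is built from the first numbers of the group's members; the source writes these as 14, 235, 68 and 79, while the sketch inserts a hyphen (`1-4`) so that multi-digit numbers stay unambiguous. All function names are hypothetical.

```python
import random


def first_numbering(parts: list[str]) -> dict[int, str]:
    """Number the parts 1, 2, 3, ... in segmentation order."""
    return {number: part for number, part in enumerate(parts, start=1)}


def second_numbering(
    numbered: dict[int, str], groups: int, seed: int = 0
) -> dict[str, list[int]]:
    """Randomly combine parts into `groups` groups of roughly equal data size.

    Returns a record that maps each second number to the first numbers it contains.
    """
    if groups <= 0:
        raise ValueError("groups must be positive")
    rng = random.Random(seed)
    order = list(numbered)
    rng.shuffle(order)  # random combination order
    bins: list[list[int]] = [[] for _ in range(groups)]
    sizes = [0] * groups
    for number in order:
        target = sizes.index(min(sizes))  # group holding the least data so far
        bins[target].append(number)
        sizes[target] += len(numbered[number])
    # Second number: the group's first numbers, sorted and joined, e.g. [1, 4] -> "1-4".
    return {"-".join(str(n) for n in sorted(b)): sorted(b) for b in bins if b}
```

The returned dictionary plays the role of the combined record in sub-step (3): it is what allows the labeled groups to be mapped back to their original positions after recall.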
(3) Issuing text information: issuing the numbered data, the text search range and the text labeling standard to different labeling platforms;
After the files are issued, the numbers are checked with each platform to determine whether the files were issued completely; when a number is inaccurate or missing, the files are re-issued to the platform where the error occurred.
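The post-issue check can be sketched as a comparison between the numbers recorded at issue time and the numbers each platform reports back. The acknowledgement format is an assumption: the source does not say how a platform confirms receipt, so `acknowledged` is a hypothetical mapping from platform name to the set of numbers it reports.

```python
def find_misissued(
    issued: dict[str, set[int]], acknowledged: dict[str, set[int]]
) -> dict[str, set[int]]:
    """Return, per platform, the numbers that have to be issued again.

    A platform is flagged when the numbers it reports differ from the numbers
    recorded at issue time (missing or inaccurate numbers).
    """
    to_reissue: dict[str, set[int]] = {}
    for platform, sent in issued.items():
        received = acknowledged.get(platform, set())
        if received != sent:
            # Re-issue the platform's whole batch, as the text describes.
            to_reissue[platform] = set(sent)
    return to_reissue
```

An empty result means every platform confirmed exactly the numbers that were recorded for it.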
(4) Labeling text information: each labeling platform receives the data, searches the text search range for the numbered data according to the text labeling standard, collects the results, and marks the extracted features at the corresponding positions in the issued text labeling data according to the text labeling standard;
(5) Recall of the annotation information: summarizing all the data searched by all the labeling platforms;
After recall, the state of each file is checked by comparing numbers: whether the file has been labeled, whether it has been labeled completely, and whether any file is missing. When a file's state is abnormal, the file is requested from the platform again.
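The three checks after recall (labeled at all, labeled completely, not missing) can be sketched as a status report keyed by number. The record layout is hypothetical: the source does not define how completeness is measured, so the sketch assumes each recalled file carries a count of units to label and a count of units actually labeled.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    OK = "ok"
    MISSING = "missing"        # the number was issued but nothing came back
    UNLABELED = "unlabeled"    # the file came back without any labels
    INCOMPLETE = "incomplete"  # the file came back only partly labeled


@dataclass
class Recalled:
    number: int
    units_total: int    # hypothetical: units the platform was asked to label
    units_labeled: int  # hypothetical: units the platform actually labeled


def check_recall(expected: set[int], recalled: list[Recalled]) -> dict[int, Status]:
    """Compare recalled files against the expected numbers and classify each one."""
    by_number = {r.number: r for r in recalled}
    report: dict[int, Status] = {}
    for number in sorted(expected):
        record = by_number.get(number)
        if record is None:
            report[number] = Status.MISSING
        elif record.units_labeled == 0:
            report[number] = Status.UNLABELED
        elif record.units_labeled < record.units_total:
            report[number] = Status.INCOMPLETE
        else:
            report[number] = Status.OK
    return report


def numbers_to_rerequest(report: dict[int, Status]) -> list[int]:
    """Numbers whose state is abnormal and must be requested from the platform again."""
    return [number for number, status in report.items() if status is not Status.OK]
```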
(6) Ordering and combining text: sorting the collected data by number to obtain the complete labeled content;
The file obtained by sorting the collected data by number is compared with the original of the text labeling data retained before segmentation to determine whether the ordering is wrong; when it is wrong, the labeled data is reordered according to the retained original.
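The final ordering check can be sketched as follows, under two simplifying assumptions that are not in the source: each recalled piece still exposes its unlabeled text, and that text occurs exactly once in the retained original. The function names are hypothetical.

```python
def combine(pieces: dict[int, str]) -> str:
    """Concatenate the pieces in ascending number order."""
    return "".join(pieces[n] for n in sorted(pieces))


def reorder_by_original(pieces: dict[int, str], original: str) -> list[int]:
    """Order the numbers by where each piece's text appears in the retained original.

    Assumes every piece occurs in `original`; a piece that does not occur
    gets position -1 from str.find and would be sorted first.
    """
    return sorted(pieces, key=lambda n: original.find(pieces[n]))


def verified_order(pieces: dict[int, str], original: str) -> list[int]:
    """Return the number order to use for the final document."""
    if combine(pieces) == original:
        return sorted(pieces)  # numbering order already matches the original
    return reorder_by_original(pieces, original)  # ordering was wrong: fall back
```

For example, with the original `"abcdef"` and pieces `{1: "cd", 2: "ab", 3: "ef"}`, sorting by number gives `"cdabef"`, so the fallback returns the order `[2, 1, 3]`.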
Examples
A text data labeling method comprises the following steps:
(1) Extracting text information: determining a text searching range, text marking data and a text marking standard;
When the text search range is defined, it is divided according to the subject or type of the text labeling data, that is, the data whose labels are to be displayed. The text labeling standard is divided into different grades according to similarity to the text labeling data; for example, it is divided into 5 grades by similarity, and each grade is marked in a different color.
(2) Segmenting and numbering text information: segmenting the files in the text search range by dividing the whole set of files directly into parts of equal data size, numbering the parts in segmentation order, and keeping a record of the text search range as it was before segmentation.
(3) Issuing text information: the numbered data, the text labeling data and the text labeling standard are issued to different labeling platforms, and the numbers of the data issued to each platform are recorded;
After the files are issued, the numbers are checked with each platform to determine whether the files in the text search range were issued completely; when a number is missing, the files are re-issued to the platform where the error occurred.
(4) Labeling text information: each labeling platform receives the data, searches the numbered data for the text labeling data according to the text labeling standard, collects the results, and marks the extracted features at the corresponding positions in the issued text labeling data according to the text labeling standard;
(5) Recall of the annotation information: summarizing all the data searched by all the labeling platforms;
After recall, the state of each file is checked by comparing numbers: whether the file has been labeled, whether it has been labeled completely, and whether any file is missing. When a file's state is abnormal, the file is requested from the platform again.
(6) Ordering and combining text: integrating all of the labeled text to obtain the complete labeled content.
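In this second variant, step (4) has each platform look for the text labeling data inside the numbered chunk of the search range that it received. A minimal sketch, assuming exact substring matching in place of the graded similarity standard and using hypothetical function names:

```python
def find_occurrences(chunk: str, target: str) -> list[int]:
    """Return every start offset of `target` in `chunk`, including overlapping matches."""
    if not target:
        raise ValueError("target must not be empty")
    positions: list[int] = []
    start = chunk.find(target)
    while start != -1:
        positions.append(start)
        start = chunk.find(target, start + 1)
    return positions


def label_chunks(chunks: dict[int, str], target: str) -> dict[int, list[int]]:
    """Simulate all platforms: map each chunk number to the match offsets found in it."""
    return {number: find_occurrences(text, target) for number, text in chunks.items()}
```

Because the results stay keyed by chunk number, integrating them in step (6) needs no second round of numbering.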
Embodiment 1 segments the text labeling data, and the labeling platforms search for, extract and label the segmented data within the complete text search range. Embodiment 2 segments the text search range, and the labeling platforms search for, extract and label the text labeling data within the segmented text search range. Compared with Embodiment 2, Embodiment 1 offers higher file security; compared with Embodiment 1, Embodiment 2 does not require multiple rounds of numbering and is easier to integrate.
In summary, in this text data labeling method the text labeling data, or the data in the text search range, is segmented before labeling and issued to different labeling platforms for simultaneous labeling, and the labeled results are finally collected. This increases labeling speed, shortens labeling time, and reduces the load on the labeling platforms. Because the text labeling data is segmented and recombined before being issued to the different labeling platforms, each platform receives only incomplete data, which reduces the risk of data leakage and improves the security of text data labeling.
The foregoing shows and describes the basic principles, main features and advantages of the invention. Those skilled in the art will understand that the invention is not limited to the embodiments described above; the embodiments and the description merely illustrate the principles of the invention, and various changes and modifications may be made without departing from its spirit and scope. The scope of the invention is defined by the appended claims and their equivalents.

Claims (5)

1. A text data labeling method, characterized in that the method comprises the following steps:
(1) Extracting text information: determining a text search range, text labeling data and a text labeling standard;
(2) Segmenting and numbering text information: segmenting the text labeling data and numbering the segments in segmentation order;
(3) Issuing text information: issuing the numbered data, the text search range and the text labeling standard to different labeling platforms;
(4) Labeling text information: each labeling platform receives the data, searches the text search range for the numbered data according to the text labeling standard, and collects the results;
(5) Recalling labeling information: collecting all of the data found by all of the labeling platforms;
(6) Ordering and combining text: sorting the collected data by number to obtain the complete labeled content;
in step (1), the text search range is divided according to the subject or type of the text labeling data, that is, the data whose labels are to be displayed, and the text labeling standard is divided into different grades according to similarity to the text labeling data;
in step (2), the original of the text labeling data is retained before segmentation, and the text labeling data is segmented and numbered as follows:
(1) dividing the text into parts according to how closely the content is related, and numbering the parts a first time in segmentation order;
(2) randomly combining the parts so that the combined groups contain equal amounts of data, and numbering the combined groups a second time;
(3) recording the first numbering and the second numbering together;
in step (2), the files in the text search range are also segmented, the whole set of files being divided directly into parts of equal data size, and the parts are numbered in segmentation order.
2. The text data labeling method according to claim 1, characterized in that: in step (3), the numbered data, the text labeling data and the text labeling standard are also issued to different labeling platforms, and the numbers of the data issued to each platform are recorded.
3. The text data labeling method according to claim 2, characterized in that: after the files are issued in step (3), the numbers are checked with each platform to determine whether the files were issued completely, and when a number is inaccurate or missing, the files are re-issued to the platform where the error occurred.
4. The text data labeling method according to claim 1, characterized in that: in step (5), the state of each file is checked after recall by comparing numbers, checking whether the file has been labeled, whether it has been labeled completely, and whether any file is missing, and when a file's state is abnormal, the file is requested from the platform again.
5. The text data labeling method according to claim 1, characterized in that: in step (6), the file obtained by sorting the collected data by number is compared with the original of the text labeling data retained before segmentation to determine whether the ordering is wrong, and when it is wrong, the labeled data is reordered according to the retained original.
CN202010881236.6A 2020-08-27 2020-08-27 Text data labeling method Active CN112052646B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010881236.6A CN112052646B (en) 2020-08-27 2020-08-27 Text data labeling method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010881236.6A CN112052646B (en) 2020-08-27 2020-08-27 Text data labeling method

Publications (2)

Publication Number Publication Date
CN112052646A CN112052646A (en) 2020-12-08
CN112052646B CN112052646B (en) 2024-03-29

Family

ID=73600287

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010881236.6A Active CN112052646B (en) 2020-08-27 2020-08-27 Text data labeling method

Country Status (1)

Country Link
CN (1) CN112052646B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116796356A (en) * 2022-03-07 2023-09-22 华为云计算技术有限公司 Data segmentation method and related device

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1367446A (en) * 2001-01-22 2002-09-04 前程无忧网络信息技术(北京)有限公司上海分公司 Chinese personal biographical notes information treatment system and method
CN106021227A (en) * 2016-05-16 2016-10-12 南京大学 State transition and neural network-based Chinese chunk parsing method
CN106407407A (en) * 2016-09-22 2017-02-15 江苏通付盾科技有限公司 A file tagging system and method
CN106599041A (en) * 2016-11-07 2017-04-26 中国电子科技集团公司第三十二研究所 Text processing and retrieval system based on big data platform
CN107908642A (en) * 2017-09-29 2018-04-13 江苏华通晟云科技有限公司 Industry text entities extracting method based on distributed platform
CN108108355A (en) * 2017-12-25 2018-06-01 北京牡丹电子集团有限责任公司数字电视技术中心 Text emotion analysis method and system based on deep learning
CN108664473A (en) * 2018-05-11 2018-10-16 平安科技(深圳)有限公司 Recognition methods, electronic device and the readable storage medium storing program for executing of text key message
CN108920460A (en) * 2018-06-26 2018-11-30 武大吉奥信息技术有限公司 A kind of training method and device of the multitask deep learning model of polymorphic type Entity recognition
CN109101538A (en) * 2018-06-29 2018-12-28 中译语通科技股份有限公司 A kind of entity abstracting method and system towards Chinese patent text
CN109635280A (en) * 2018-11-22 2019-04-16 园宝科技(武汉)有限公司 A kind of event extraction method based on mark
CN109697285A (en) * 2018-12-13 2019-04-30 中南大学 Enhance the hierarchical BiLSTM Chinese electronic health record disease code mask method of semantic expressiveness
CN110888991A (en) * 2019-11-28 2020-03-17 哈尔滨工程大学 Sectional semantic annotation method in weak annotation environment
CN111191456A (en) * 2018-11-15 2020-05-22 零氪科技(天津)有限公司 Method for identifying text segmentation by using sequence label
CN111209728A (en) * 2020-01-13 2020-05-29 深圳市企鹅网络科技有限公司 Automatic test question labeling and inputting method

Also Published As

Publication number Publication date
CN112052646A (en) 2020-12-08

Similar Documents

Publication Publication Date Title
US8843815B2 (en) System and method for automatically extracting metadata from unstructured electronic documents
CN101770446B (en) Method and system for identifying form in layout file
CN113158653B (en) Training method, application method, device and equipment for pre-training language model
CN111177332B (en) Method and device for automatically extracting judge document case-related label and judge result
JPS62229368A (en) Document processor
CN111061742B (en) Method and device for marking data and service system thereof
EP2782023A2 (en) Method for the automated analysis of text documents
CN110705223A (en) Footnote recognition and extraction method for multi-page layout document
CN103488627B (en) Full piece patent document interpretation method and translation system
CN112052646B (en) Text data labeling method
CN101989289A (en) Data clustering method and device
CN104035993B (en) Memory search method, e-book management system, the reading system of e-book
CN112926299B (en) Text comparison method, contract review method and auditing system
CN106599048A (en) Method and device for recovering deleted records of SQLite database file
CN115080704B (en) Computer file security check method and system based on scoring mechanism
CN104064182A (en) A voice recognition system and method based on classification rules
CN104156373B (en) Coded format detection method and device
CN111209831A (en) Document table content identification method and device based on classification algorithm
CN111026743B (en) Rail transit engineering project structure data standardization method
CN115544975B (en) Log format conversion method and device
CN105608137A (en) Method and device for extracting identity label
CN109918638B (en) Network data monitoring method
CN112101007A (en) Method and system for extracting structured data from unstructured text data
CN111291535A (en) Script processing method and device, electronic equipment and computer readable storage medium
CN106294875B (en) A kind of name entity fuzzy retrieval method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant