CN109582925B - Man-machine combined corpus labeling method and system - Google Patents

Man-machine combined corpus labeling method and system

Publication number
CN109582925B
CN109582925B CN201811323385.XA CN201811323385A CN109582925B CN 109582925 B CN109582925 B CN 109582925B CN 201811323385 A CN201811323385 A CN 201811323385A CN 109582925 B CN109582925 B CN 109582925B
Authority
CN
China
Prior art keywords
corpus
data
marked
key
labeling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811323385.XA
Other languages
Chinese (zh)
Other versions
CN109582925A (en)
Inventor
张泽明
肖龙源
蔡振华
李稀敏
刘晓葳
谭玉坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Kuaishangtong Technology Corp ltd
Original Assignee
Xiamen Kuaishangtong Technology Corp ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Kuaishangtong Technology Corp ltd
Priority to CN201811323385.XA
Publication of CN109582925A
Application granted
Publication of CN109582925B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/117 Tagging; Marking up; Designating a block; Setting of attributes

Abstract

The invention discloses a human-computer combined corpus labeling method and system. The method comprises: acquiring corpus data to be labeled and carrying out manual observation; positioning key corpora of the corpus data according to positioning information input by a user; highlighting and marking the positioned key corpus to obtain a marked corpus; extracting the marked corpus from the corpus data through a screening algorithm; and labeling the corpus categories of the marked corpus to obtain a labeled corpus. Human-computer combined corpus labeling is thereby realized, which helps labeling personnel improve labeling efficiency, reduces their workload, provides a degree of interactivity, and reduces tedium.

Description

Man-machine combined corpus labeling method and system
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a human-computer combined corpus labeling method and a system applying the method.
Background
A corpus is the basic resource for corpus linguistics research and the main resource for empirical approaches to language research. Traditionally, corpora have mainly been applied to lexicography, language teaching, traditional linguistic research, and statistics-based or example-based research in natural language processing. With the development of internet big data and artificial intelligence technology, corpora have come into even wider use.
A corpus stores language material that has actually occurred in real language use, such as user messages and customer-service conversations obtained directly from web pages. A corpus is a basic resource that carries linguistic knowledge, but it is not equivalent to linguistic knowledge: raw corpus data becomes a useful resource only after processing. Such processing may include dirty data removal, semantic labeling, part-of-speech labeling, and the like, and each item of corpus data is typically labeled either manually or by machine learning.
However, large-scale data acquired in practice is often less useful than the people concerned expect, and in practice a machine cannot complete the processing and labeling of a large-scale corpus on its own; more often, a certain amount of manual work is needed to complete the labeling. This consumes human or financial resources and can even reduce the efficiency of a development team.
Therefore, if this difficulty can be reduced and human resources freed from it, the efficiency and progress of a project can be improved to some extent.
Disclosure of Invention
To solve the above problems, the invention provides a human-computer combined corpus labeling method and system that can help labeling personnel improve labeling efficiency and reduce their workload.
To achieve this purpose, the invention adopts the following technical scheme:
A human-computer combined corpus labeling method comprises the following steps:
a. obtaining corpus data to be labeled and carrying out manual observation;
b. positioning key corpora of the corpus data according to positioning information input by a user;
c. highlighting and marking the positioned key corpus to obtain a marked corpus;
d. extracting the marked corpus from the corpus data through a screening algorithm;
e. labeling the corpus category of the marked corpus to obtain a labeled corpus.
Preferably, in the step a, the corpus data is a table text; in the step b, the key corpus is located by a method of cell location, and the key corpus corresponding to the cell is obtained according to the row and column information input by the user.
Or, in the step a, the corpus data is a document text; in the step b, the key corpus is located by a line number locating method, and the key corpus corresponding to the line number is obtained according to the line number information input by the user.
Preferably, in the step b, the positioning information is input through a command window; and displaying the prompt words of the positioning information to the user in the command window.
Preferably, in the step c, the highlighting mark is to add a font color or a background color different from the original corpus data to the marked corpus.
Preferably, in the step d, the screening algorithm extracts the marked corpus from the corpus data according to a color condition.
Preferably, in the step e, the corpus categories of the marked corpus are labeled either manually or by training on the corpus categories of the marked corpus with machine learning.
Correspondingly, the invention also provides a human-computer combined corpus labeling system, which comprises:
the data acquisition module is used for acquiring corpus data to be labeled and carrying out manual observation;
the corpus positioning module is used for positioning key corpuses of the corpus data according to positioning information input by a user;
the corpus marking module is used for highlighting and marking the positioned key corpus to obtain a marked corpus;
the corpus screening module is used for extracting the marked corpus from the corpus data through a screening algorithm;
and the corpus labeling module is used for labeling the corpus categories of the marked corpus to obtain the labeled corpus.
The invention has the beneficial effects that:
(1) By combining manual observation, corpus positioning, corpus marking, corpus extraction and corpus labeling, the invention realizes human-computer combined corpus labeling, which helps labeling personnel improve labeling efficiency and reduces their workload;
(2) The corpus data of the invention is a table text or a document text, and cell positioning or line number positioning is used, so that key corpora can be quickly positioned and extracted;
(3) The key corpora are highlighted by color marking, and the marked corpus is screened and extracted according to a color condition, which is more intuitive and improves accuracy;
(4) The invention provides a command window in which the user inputs the positioning information and which displays prompts for the positioning information to the user, thereby providing a degree of interactivity and reducing tedium.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a simplified flow chart of a human-computer combined corpus labeling method according to a first embodiment of the present invention;
FIG. 2 is a schematic diagram of a command window (table text) according to the first embodiment of the present invention;
FIG. 3 is a diagram illustrating the corpus marking result (table text) according to the first embodiment of the present invention;
FIG. 4 is a diagram illustrating the screening result of the marked corpus (table text) according to the first embodiment of the present invention;
FIG. 5 is a diagram illustrating the corpus category labeling result (table text) according to the first embodiment of the present invention;
FIG. 6 is a schematic diagram of a command window (document text) according to a second embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantageous effects of the present invention more clear and obvious, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.
First embodiment (Table text)
As shown in FIG. 1, this embodiment provides a human-computer combined corpus labeling method, which comprises the following steps:
a. obtaining corpus data to be labeled and carrying out manual observation;
b. positioning key corpora of the corpus data according to positioning information input by a user;
c. highlighting and marking the positioned key corpus to obtain a marked corpus;
d. extracting the marked corpus from the corpus data through a screening algorithm;
e. labeling the corpus category of the marked corpus to obtain a labeled corpus.
In this embodiment, the corpus data is a table text, and the key corpus is positioned by a cell positioning method: the key corpus corresponding to a cell is obtained according to the row and column information input by the user.
In the step b, the positioning information is input through a command window, and the command window displays prompts for the positioning information to the user. As shown in FIG. 2, in this embodiment the row and column information is obtained by first specifying the column and then specifying one or more rows within that column. The prompts first ask the user to input the column and then ask for one or more rows, so the column does not need to be entered repeatedly, which saves operating time.
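The column-then-rows positioning described above can be illustrated with a minimal Python sketch. It is not the implementation of the embodiment: the in-memory table layout, the function name `locate_cells`, and the comma- or space-separated row format are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory table: column letter -> cell texts (row 1 is index 0).
Table = Dict[str, List[str]]


def locate_cells(table: Table, column: str, rows_input: str) -> List[Tuple[str, int, str]]:
    """Return (column, row, text) for each row number the user typed.

    `rows_input` is a comma- or space-separated string such as "2, 4".
    Row numbers are 1-based, as in a spreadsheet.
    """
    column = column.strip().upper()
    if column not in table:
        raise KeyError(f"unknown column: {column!r}")
    cells = table[column]
    located = []
    for token in rows_input.replace(",", " ").split():
        row = int(token)  # raises ValueError on non-numeric input
        if not 1 <= row <= len(cells):
            raise IndexError(f"row {row} is outside 1..{len(cells)}")
        located.append((column, row, cells[row - 1]))
    return located


# The column is entered once; several rows are then given for that column.
table = {"A": ["hello", "how much is it", "thanks", "where are you located"]}
key_corpus = locate_cells(table, "a", "2, 4")
print(key_corpus)
```

In an interactive session the two arguments would come from two successive prompts, for example via `input()`, so that the column is typed only once.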
In the step c, the highlighting adds to the marked corpus a font color or a background color different from that of the original corpus data; as shown in FIG. 3, in this embodiment the marked corpus is set in a red font.
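As a rough illustration of the highlighting step, the sketch below models each cell as a small Python object with a font color and sets the positioned cells to red. The `Cell` class, the hexadecimal color strings, and the default black color are assumptions; a real spreadsheet would be modified through a spreadsheet library or the application itself.

```python
from dataclasses import dataclass
from typing import Iterable, List

DEFAULT_COLOR = "000000"  # assumed default font color (black)
MARK_COLOR = "FF0000"     # red, as in this embodiment


@dataclass
class Cell:
    text: str
    font_color: str = DEFAULT_COLOR


def highlight(cells: List[Cell], rows: Iterable[int], color: str = MARK_COLOR) -> None:
    """Set the font color of the positioned cells (1-based rows) in place."""
    for row in rows:
        cells[row - 1].font_color = color


column = [Cell("hello"), Cell("how much is it"), Cell("thanks")]
highlight(column, [2])
print([cell.font_color for cell in column])
```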
In the step d, the screening algorithm extracts the marked corpus from the corpus data according to a color condition; as shown in FIG. 4, this embodiment uses Excel's built-in filter function to screen the marked corpus, so that the other corpora are hidden and the interface is simpler.
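Outside of a spreadsheet application, screening by a color condition amounts to a simple filter. The following sketch assumes that each row is a dictionary with a `font_color` field; the field names and the function `filter_by_color` are hypothetical and only stand in for the built-in filter used in this embodiment.

```python
from typing import Dict, List

MARK_COLOR = "FF0000"  # assumed highlight color (red)


def filter_by_color(rows: List[Dict[str, str]], color: str = MARK_COLOR) -> List[Dict[str, str]]:
    """Keep only the rows whose font color matches the highlight color."""
    wanted = color.upper()
    return [row for row in rows if row.get("font_color", "").upper() == wanted]


rows = [
    {"text": "hello", "font_color": "000000"},
    {"text": "how much is it", "font_color": "ff0000"},
    {"text": "thanks", "font_color": "000000"},
]
marked = filter_by_color(rows)
print([row["text"] for row in marked])
```

The comparison is case-insensitive so that "ff0000" and "FF0000" are treated as the same color.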
In the step e, the corpus categories of the marked corpus are labeled either manually or by training on the corpus categories of the marked corpus with machine learning; as shown in FIG. 5, a separate column records the corpus category of each marked corpus, and the other corpora are left unprocessed.
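The category-labeling step can be sketched as writing into a separate column for marked rows only. In the sketch below the `categorize` callable stands for either a manual lookup or a trained classifier; the row layout, the category name "price inquiry", and the function `label_marked` are illustrative assumptions.

```python
from typing import Callable, Dict, List, Optional

MARK_COLOR = "FF0000"  # assumed highlight color (red)

Row = Dict[str, Optional[str]]


def label_marked(rows: List[Row], categorize: Callable[[str], str]) -> List[Row]:
    """Write a category into a separate column for marked rows only."""
    for row in rows:
        if row["font_color"] == MARK_COLOR:
            row["category"] = categorize(row["text"])
    return rows


# Manual labels keyed by text; a trained classifier could be passed instead.
manual_labels = {"how much is it": "price inquiry"}

rows: List[Row] = [
    {"text": "hello", "font_color": "000000", "category": None},
    {"text": "how much is it", "font_color": "FF0000", "category": None},
]
label_marked(rows, lambda text: manual_labels.get(text, "unlabeled"))
print([row["category"] for row in rows])
```

Unmarked rows keep an empty category, matching the statement that other corpora are not processed.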
Second embodiment (document text)
The main differences between this embodiment and the first embodiment are: in this embodiment, in the step a, the corpus data is a document text; in the step b, the key corpus is located by a line number locating method, and the key corpus corresponding to the line number is obtained according to the line number information input by the user.
In addition, this embodiment provides an improved command window; as shown in FIG. 6, the command window not only displays prompts for the positioning information to the user, but also displays feedback confirming the positioning information input by the user, such as "OK", "correct", or "input error", so that the user receives timely feedback and interactivity is improved.
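Line-number positioning with confirmation feedback might look like the following sketch. The function `locate_lines`, the accepted input format, and the choice to reject the whole input when any line number is invalid are assumptions; only the feedback strings "OK" and "input error" come from the description.

```python
from typing import List, Tuple


def locate_lines(lines: List[str], user_input: str) -> Tuple[str, List[Tuple[int, str]]]:
    """Validate 1-based line numbers and return (feedback, located lines)."""
    try:
        numbers = [int(token) for token in user_input.replace(",", " ").split()]
    except ValueError:
        return "input error", []
    if not numbers or any(n < 1 or n > len(lines) for n in numbers):
        return "input error", []
    return "OK", [(n, lines[n - 1]) for n in numbers]


document = ["hello", "how much is it", "where are you located"]
print(locate_lines(document, "2 3"))
print(locate_lines(document, "9"))
```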
The remaining labeling process of this embodiment is basically similar to that of the first embodiment, and is not repeated herein.
Third embodiment (labeling System)
In addition, the invention also provides a system corresponding to the human-computer combined corpus labeling method, the system comprising:
the data acquisition module is used for acquiring the corpus data to be labeled and carrying out manual observation;
the corpus positioning module is used for positioning key corpuses of the corpus data according to positioning information input by a user;
the corpus marking module is used for highlighting and marking the positioned key corpus to obtain marked corpus;
the corpus screening module is used for extracting the marked corpus from the corpus data through a screening algorithm;
and the corpus labeling module is used for labeling the corpus categories of the marked corpus to obtain the labeled corpus.
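One possible way to picture how the five modules fit together is a single class with one method per module, as in the sketch below. The class `CorpusLabelingSystem`, the `Entry` record, and the boolean `marked` flag (standing in for the highlight color) are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Entry:
    text: str
    marked: bool = False            # stands in for the highlight color
    category: Optional[str] = None  # filled in only for marked entries


class CorpusLabelingSystem:
    """Hypothetical composition of the five modules as methods."""

    def acquire(self, texts: Iterable[str]) -> List[Entry]:
        # Data acquisition module: load the corpus data to be labeled.
        return [Entry(text) for text in texts]

    def locate(self, corpus: List[Entry], positions: Iterable[int]) -> List[Entry]:
        # Corpus positioning module: 1-based positions supplied by the user.
        return [corpus[p - 1] for p in positions]

    def mark(self, located: List[Entry]) -> None:
        # Corpus marking module: highlight the positioned key corpus.
        for entry in located:
            entry.marked = True

    def screen(self, corpus: List[Entry]) -> List[Entry]:
        # Corpus screening module: extract the marked corpus.
        return [entry for entry in corpus if entry.marked]

    def label(self, marked: List[Entry], categorize: Callable[[str], str]) -> List[Entry]:
        # Corpus labeling module: assign a category to each marked entry.
        for entry in marked:
            entry.category = categorize(entry.text)
        return marked


system = CorpusLabelingSystem()
corpus = system.acquire(["hello", "how much is it", "thanks"])
system.mark(system.locate(corpus, [2]))
labeled = system.label(system.screen(corpus), lambda text: "price inquiry")
print([(entry.text, entry.category) for entry in corpus])
```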
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. For the system embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
Also, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element. In addition, those skilled in the art will appreciate that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing associated hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.
While the foregoing specification illustrates and describes the preferred embodiments of the present invention, it is to be understood that the invention is not limited to the forms disclosed herein, and this disclosure is not to be construed as excluding other embodiments. The invention may be used in various other combinations, modifications, and environments, and may be modified within the scope of the inventive concept described herein, in light of the above teachings or the skill or knowledge of the relevant art. Modifications and variations made by those skilled in the art without departing from the spirit and scope of the invention fall within the protection of the appended claims.

Claims (8)

1. A human-computer combined corpus labeling method, characterized by comprising the following steps:
a. obtaining corpus data to be labeled and carrying out manual observation;
b. positioning key corpora of the corpus data according to positioning information input by a user;
c. highlighting and marking the positioned key corpus to obtain a marked corpus;
d. extracting the marked corpus from the corpus data through a screening algorithm;
e. labeling the corpus categories of the marked corpus to obtain a labeled corpus.
2. The human-computer combined corpus labeling method according to claim 1, wherein: in the step a, the corpus data is a table text; in the step b, the key corpus is positioned by a cell positioning method, and the key corpus corresponding to the cell is obtained according to the row and column information input by the user.
3. The human-computer combined corpus labeling method according to claim 1, wherein: in the step a, the corpus data is a document text; in the step b, the key corpus is positioned by a line number positioning method, and the key corpus corresponding to the line number is obtained according to the line number information input by the user.
4. The human-computer combined corpus labeling method according to any one of claims 1 to 3, wherein: in the step b, the positioning information is input through a command window, and prompts for the positioning information are displayed to the user in the command window.
5. The human-computer combined corpus labeling method according to any one of claims 1 to 3, wherein: in the step c, the highlighting adds to the marked corpus a font color or a background color different from that of the original corpus data.
6. The human-computer combined corpus labeling method according to claim 5, wherein: in the step d, the screening algorithm extracts the marked corpus from the corpus data according to a color condition.
7. The human-computer combined corpus labeling method according to any one of claims 1 to 3, wherein: in the step e, the corpus categories of the marked corpus are labeled either manually or by training on the corpus categories of the marked corpus with machine learning.
8. A human-computer combined corpus labeling system, characterized by comprising:
the data acquisition module is used for acquiring the corpus data to be labeled and carrying out manual observation;
the corpus positioning module is used for positioning key corpora of the corpus data according to positioning information input by a user;
the corpus marking module is used for highlighting and marking the positioned key corpus to obtain a marked corpus;
the corpus screening module is used for extracting the marked corpus from the corpus data through a screening algorithm;
and the corpus labeling module is used for labeling the corpus categories of the marked corpus to obtain the labeled corpus.
CN201811323385.XA 2018-11-08 2018-11-08 Man-machine combined corpus labeling method and system Active CN109582925B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811323385.XA CN109582925B (en) 2018-11-08 2018-11-08 Man-machine combined corpus labeling method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811323385.XA CN109582925B (en) 2018-11-08 2018-11-08 Man-machine combined corpus labeling method and system

Publications (2)

Publication Number Publication Date
CN109582925A (en) 2019-04-05
CN109582925B (en) 2023-02-14

Family

ID=65921772

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811323385.XA Active CN109582925B (en) 2018-11-08 2018-11-08 Man-machine combined corpus labeling method and system

Country Status (1)

Country Link
CN (1) CN109582925B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110347921B (en) * 2019-07-04 2022-04-19 有光创新(北京)信息技术有限公司 Label extraction method and device for multi-mode data information

Citations (6)

Publication number Priority date Publication date Assignee Title
CN102831131A (en) * 2011-06-16 2012-12-19 富士通株式会社 Method and device for establishing labeling webpage linguistic corpus
CN102982036A (en) * 2011-09-05 2013-03-20 北大方正集团有限公司 Method of corpus structuralization and device
CN106355628A (en) * 2015-07-16 2017-01-25 中国石油化工股份有限公司 Image-text knowledge point marking method and device and image-text mark correcting method and system
CN106782509A (en) * 2016-12-02 2017-05-31 乐视控股(北京)有限公司 A kind of corpus labeling method and device and terminal
CN107729921A (en) * 2017-09-20 2018-02-23 厦门快商通科技股份有限公司 A kind of machine Active Learning Method and learning system
CN108255816A (en) * 2018-03-12 2018-07-06 北京神州泰岳软件股份有限公司 A kind of name entity recognition method, apparatus and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018000272A1 (en) * 2016-06-29 2018-01-04 深圳狗尾草智能科技有限公司 Corpus generation device and method

Non-Patent Citations (1)

Title
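Annotation of corpus data in corpora (语料库中语料的标注); Dong Aihua (董爱华); Journal of Beijing Institute of Graphic Communication (《北京印刷学院学报》); 2016-05-31; pp. 67-70 *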

Also Published As

Publication number Publication date
CN109582925A (en) 2019-04-05

Similar Documents

Publication Publication Date Title
Day et al. Mixed-initiative development of language processing systems
Ciravegna et al. User-system cooperation in document annotation based on information extraction
CN111259631B (en) Referee document structuring method and referee document structuring device
CN108090400A (en) A kind of method and apparatus of image text identification
CN111428467A (en) Method, device, equipment and storage medium for generating reading comprehension question topic
CN109740159B (en) Processing method and device for named entity recognition
CN113779345B (en) Teaching material generation method and device, computer equipment and storage medium
CN110688856B (en) Referee document information extraction method
CN106682224B (en) Data entry method, system and database
CN109582925B (en) Man-machine combined corpus labeling method and system
CN107451215B (en) Feature text extraction method and device
CN116304023A (en) Method, system and storage medium for extracting bidding elements based on NLP technology
Balk et al. IMPACT: working together to address the challenges involving mass digitization of historical printed text
CN116010569A (en) Online answering method, system, electronic equipment and storage medium
CN116306506A (en) Intelligent mail template method based on content identification
CN110830851B (en) Method and device for making video file
CN113837167A (en) Text image recognition method, device, equipment and storage medium
CN113901793A (en) Event extraction method and device combining RPA and AI
CN113408290A (en) Intelligent marking method and system for Chinese text
Bosch et al. Computer-assisted transcription of a historical botanical specimen book: organization and process overview
CN112328812A (en) Domain knowledge extraction method and system based on self-adjusting parameters and electronic equipment
EP4303716A1 (en) Method for generating data input, data input system and computer program
CN117494806B (en) Relation extraction method, system and medium based on knowledge graph and large language model
CN110728116B (en) Method and device for generating video file dubbing manuscript
CN116127042A (en) Text question-answer pair extraction method and device based on multiple models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant