CN109582925B - Man-machine combined corpus labeling method and system

Info

Publication number: CN109582925B
Application number: CN201811323385.XA
Authority: CN
Inventors: 张泽明; 肖龙源; 蔡振华; 李稀敏; 刘晓葳; 谭玉坤
Original assignee: Xiamen Kuaishangtong Technology Corp ltd
Current assignee: Xiamen Kuaishangtong Technology Corp ltd
Priority date: 2018-11-08
Filing date: 2018-11-08
Publication date: 2023-02-14
Anticipated expiration: 2038-11-08
Also published as: CN109582925A

Abstract

The invention discloses a human-computer combined corpus labeling method and system. The method comprises: acquiring corpus data to be labeled and observing it manually; locating key corpora in the corpus data according to positioning information input by a user; highlighting the located key corpora to obtain a marked corpus; extracting the marked corpus from the corpus data through a screening algorithm; and labeling the corpus category of the marked corpus to obtain a labeled corpus. Human-computer combined corpus labeling is thereby realized, which helps annotators improve labeling efficiency, reduces their workload, provides a degree of interactivity, and reduces tedium.

Description

Man-machine combined corpus labeling method and system

Technical Field

The invention relates to the technical field of natural language processing, in particular to a human-computer combined corpus tagging method and a system applying the method.

Background

A corpus is the basic resource of corpus linguistics research and the main resource of empirical language research methods. Traditional corpora are mainly applied in lexicography, language teaching, traditional language research, and statistics-based or example-based research in natural language processing. With the development of internet big data and artificial intelligence technology, corpora have become even more widely used.

A corpus stores language material that has actually occurred in real language use, such as user messages and customer service conversations obtained directly from web pages. A corpus is a basic resource that carries linguistic knowledge, but it is not equivalent to linguistic knowledge: raw corpus data becomes a useful resource only after processing. Such processing may include dirty data removal, semantic labeling, part-of-speech labeling and the like, and each item of corpus data is usually labeled either manually or by machine learning.

However, large-scale data acquired in practice is often less useful than the personnel concerned expect, and in practice the processing and labeling of a large-scale corpus cannot be completed by a machine alone; for the most part, a certain amount of manual work is needed to complete the labeling. This consumes human and financial resources and can even reduce the efficiency of a development team.

Therefore, if this difficulty can be reduced and human resources freed from it, the efficiency and progress of a project can be improved to a certain extent.

Disclosure of Invention

To solve the above problems, the invention provides a human-computer combined corpus labeling method and system, which can help annotators improve labeling efficiency and reduce their workload.

In order to realize the purpose, the invention adopts the technical scheme that:

a human-computer combined corpus labeling method comprises the following steps:

a. obtaining corpus data to be labeled and carrying out manual observation;

b. positioning key corpora of the corpus data according to positioning information input by a user;

c. highlighting and marking the positioned key corpus to obtain a marked corpus;

d. extracting the marked corpus from the corpus data through a screening algorithm;

e. labeling the corpus category of the marked corpus to obtain a labeled corpus.

Preferably, in the step a, the corpus data is a table text; in the step b, the key corpus is located by a method of cell location, and the key corpus corresponding to the cell is obtained according to the row and column information input by the user.

Or, in the step a, the corpus data is a document text; in the step b, the key corpus is located by a line number locating method, and the key corpus corresponding to the line number is obtained according to the line number information input by the user.

Preferably, in the step b, the positioning information is input through a command window; and displaying the prompt words of the positioning information to the user in the command window.

Preferably, in the step c, the highlighting mark is to add a font color or a background color different from the original corpus data to the marked corpus.

Preferably, in the step d, the screening algorithm extracts the marked corpus from the corpus data according to a color condition.

Preferably, in the step e, the corpus category of the marked corpus is labeled either manually or by means of machine learning trained on the corpus categories of the marked corpus.

Correspondingly, the invention also provides a human-computer combined corpus labeling system, which comprises:

the data acquisition module is used for acquiring corpus data to be labeled and carrying out manual observation;

the corpus positioning module is used for positioning key corpuses of the corpus data according to positioning information input by a user;

the corpus marking module is used for highlighting and marking the positioned key corpus to obtain a marked corpus;

the corpus screening module is used for extracting the marked corpus from the corpus data through a screening algorithm;

and the corpus labeling module is used for labeling the corpus category of the marked corpus to obtain a labeled corpus.
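The five modules listed above can be pictured as a simple pipeline. The following Python sketch is illustrative only and is not the patented implementation: the class name `LabelingSystem`, the callable interface of each module, the dictionary layout of a corpus item, and the toy categorization rule are all assumptions made for the example.

```python
class LabelingSystem:
    """Chains five modules, each modeled as a plain callable (assumed interface)."""

    def __init__(self, acquire, locate, mark, screen, label):
        self.acquire = acquire
        self.locate = locate
        self.mark = mark
        self.screen = screen
        self.label = label

    def run(self, source, positions):
        corpus = self.acquire(source)            # data acquisition module
        keys = self.locate(corpus, positions)    # corpus positioning module
        marked = self.mark(corpus, keys)         # corpus marking module
        screened = self.screen(marked)           # corpus screening module
        return self.label(screened)              # corpus labeling module


def acquire(source):
    # Wrap each raw text in a record with an unmarked flag.
    return [{"text": text, "marked": False} for text in source]


def locate(corpus, positions):
    # Keep only the 0-based positions that exist in the corpus.
    return [p for p in positions if 0 <= p < len(corpus)]


def mark(corpus, keys):
    # Flag the located items; a real tool would change their color instead.
    for key in keys:
        corpus[key]["marked"] = True
    return corpus


def screen(corpus):
    # Extract only the marked items.
    return [item for item in corpus if item["marked"]]


def label(items):
    # Stand-in for the annotator or a trained classifier (toy rule).
    return [
        dict(item, category="question" if item["text"].endswith("?") else "other")
        for item in items
    ]
```

In this sketch the positions would come from the user's command-window input, and `label` would be replaced by manual entry or a trained model.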

The invention has the beneficial effects that:

(1) Through manual observation, corpus positioning, corpus marking, corpus extraction and corpus labeling, the invention realizes human-computer combined corpus labeling, which helps annotators improve labeling efficiency and reduces their workload;

(2) The corpus data of the invention is a table text or a document text, and cell positioning or line number positioning is adopted, so that key corpora can be quickly located and extracted;

(3) The invention highlights key corpora by color marking and screens and extracts the marked corpus according to a color condition, which is more intuitive and improves accuracy;

(4) The invention provides a command window for the user to input the positioning information and displays prompts for the positioning information to the user, thereby providing a degree of interactivity and reducing tedium.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:

FIG. 1 is a simplified flow chart of a human-computer combined corpus tagging method according to a first embodiment of the present invention;

FIG. 2 is a schematic diagram of a command window (table text) according to a first embodiment of the present invention;

FIG. 3 is a diagram illustrating the corpus marking result (table text) according to a first embodiment of the present invention;

FIG. 4 is a diagram illustrating the screening result of the marked corpus (table text) according to a first embodiment of the present invention;

FIG. 5 is a diagram illustrating the labeling result of the corpus category (table text) according to the first embodiment of the present invention;

FIG. 6 is a schematic diagram of a command window (document text) according to a second embodiment of the present invention.

Detailed Description

In order to make the technical problems, technical solutions and advantageous effects of the present invention more clear and obvious, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.

First embodiment (Table text)

As shown in FIG. 1, the human-computer combined corpus labeling method of this embodiment comprises the following steps:

a. obtaining corpus data to be labeled and carrying out manual observation;

In this embodiment, the corpus data is a table text; and the key corpus is positioned by a cell positioning method, and the key corpus corresponding to the cells is obtained according to the row and column information input by the user.

In the step b, the positioning information is input through a command window, and prompts for the positioning information are displayed to the user in the command window. As shown in FIG. 2, in this embodiment the row and column information is obtained by first specifying the column and then specifying one or more rows within that column. The prompts first ask the user to input the column information and then ask for one or more rows, so that the column information does not need to be input repeatedly and operation time is saved.
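The column-first prompting described for step b can be sketched as follows. This is a minimal illustration in Python, not the patented implementation: the function names `parse_rows`, `locate_cells` and `prompt_and_locate`, the spreadsheet-style column letter, the prompt wording, and the comma-separated row format are assumptions made for the example.

```python
def parse_rows(text):
    """Parse a comma-separated row list such as "2, 5,7" into [2, 5, 7]."""
    rows = []
    for part in text.split(","):
        part = part.strip()
        if part:
            rows.append(int(part))
    return rows


def locate_cells(table, column, rows):
    """Return {(row, column): text} for the requested cells.

    `table` maps a column letter to a list of cell texts; rows are 1-based.
    Rows that fall outside the table are skipped.
    """
    cells = table[column]
    located = {}
    for row in rows:
        if 1 <= row <= len(cells):
            located[(row, column)] = cells[row - 1]
    return located


def prompt_and_locate(table, ask=input):
    """Ask for the column once, then for one or more rows in that column."""
    column = ask("Column to mark: ").strip().upper()
    rows = parse_rows(ask("Row(s) to mark, comma-separated: "))
    return locate_cells(table, column, rows)
```

Passing `ask` as a parameter keeps the sketch testable without a real console; by default it reads from standard input.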

In the step c, the highlighting mark adds to the marked corpus a font color or a background color different from that of the original corpus data; as shown in FIG. 3, in this embodiment the marked corpus is highlighted by setting its font color to red.
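As a rough illustration of the color marking in step c, the sketch below models each cell as a small record with a font color. The `Cell` class, the hex color codes, and the function name `mark_cells` are assumptions made for the example; a real tool would set the color through a spreadsheet library or the spreadsheet application itself.

```python
from dataclasses import dataclass

DEFAULT_COLOR = "000000"  # assumed color of unmarked text (black)
MARK_COLOR = "FF0000"     # red, as in this embodiment


@dataclass
class Cell:
    text: str
    font_color: str = DEFAULT_COLOR


def mark_cells(cells, positions, color=MARK_COLOR):
    """Set the font color of the cells at the given 0-based positions."""
    for pos in positions:
        cells[pos].font_color = color
    return cells
```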

In the step d, the screening algorithm extracts the marked corpus from the corpus data according to a color condition; as shown in FIG. 4, in this embodiment the built-in filter function of Excel is used to screen the marked corpus, so that other corpora are hidden and the interface is simpler.
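The color condition of step d amounts to a filter on the marking color. The sketch below shows the idea in Python on an in-memory list; it does not reproduce Excel's filter, and the row layout, the hex color code, and the name `screen_by_color` are assumptions made for the example.

```python
MARK_COLOR = "FF0000"  # assumed hex code for the red marking color


def screen_by_color(rows, color=MARK_COLOR):
    """Return only the rows whose font color matches the color condition."""
    return [row for row in rows if row.get("font_color") == color]


rows = [
    {"text": "hello", "font_color": "000000"},
    {"text": "how much is it?", "font_color": "FF0000"},
    {"text": "thanks", "font_color": "000000"},
]
marked = screen_by_color(rows)
```

The original `rows` list is left unchanged, which mirrors a filter that hides unmarked rows rather than deleting them.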

In the step e, the corpus category of the marked corpus is labeled either manually or by means of machine learning trained on the corpus categories of the marked corpus; as shown in FIG. 5, a separate column is used to record the corpus category of the marked corpus, and other corpora are left unprocessed.
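Recording the category in a separate column, for marked rows only, might look like the following sketch. The field name `category`, the row layout, and the function name `label_marked` are hypothetical; the categories are supplied by the caller, standing in for either the annotator or a trained model.

```python
def label_marked(rows, categories, mark_color="FF0000"):
    """Add a 'category' field to marked rows; leave other rows untouched.

    `categories` maps a 0-based row index to a category string.
    """
    labeled = []
    for index, row in enumerate(rows):
        new_row = dict(row)  # copy so the input rows are not modified
        if row.get("font_color") == mark_color and index in categories:
            new_row["category"] = categories[index]
        labeled.append(new_row)
    return labeled
```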

Second embodiment (document text)

The main differences between this embodiment and the first embodiment are: in this embodiment, in the step a, the corpus data is a document text; in the step b, the key corpus is located by a line number locating method, and the key corpus corresponding to the line number is obtained according to the line number information input by the user.
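Line number positioning for document text can be sketched in a few lines of Python. The function name `locate_lines` and the choice of 1-based line numbers are assumptions made for the example.

```python
def locate_lines(document, line_numbers):
    """Return {line_number: text} for the requested 1-based line numbers.

    Line numbers outside the document are skipped.
    """
    lines = document.splitlines()
    located = {}
    for number in line_numbers:
        if 1 <= number <= len(lines):
            located[number] = lines[number - 1]
    return located
```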

In addition, the embodiment also provides an optimized command window; as shown in fig. 6, the command window not only shows the prompt words of the positioning information to the user, but also further shows feedback words for confirming the positioning information input by the user, such as "OK", "correct", or "input error", so that the user can receive feedback in time, and the interactivity is better.
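The feedback behavior of the optimized command window can be illustrated as a small validation step. In this sketch the function name `check_line_input` and the validation rule (a whole number within the document's line range) are assumptions; only the feedback words "OK" and "input error" come from the description above.

```python
def check_line_input(text, total_lines):
    """Validate a line number typed by the user.

    Returns (feedback, line_number); line_number is None on invalid input.
    """
    try:
        number = int(text.strip())
    except ValueError:
        return "input error", None
    if not 1 <= number <= total_lines:
        return "input error", None
    return "OK", number
```

A command loop could print the feedback word after each entry and prompt again whenever the line number is `None`.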

The remaining labeling process of this embodiment is basically similar to that of the first embodiment, and is not repeated herein.

Third embodiment (labeling System)

In addition, the invention also provides a system corresponding to the human-computer combined corpus labeling method, which comprises:

the data acquisition module is used for acquiring the corpus data to be labeled and carrying out manual observation;

the corpus marking module is used for highlighting and marking the positioned key corpus to obtain marked corpus;

a corpus screening module for extracting the tagged corpus from the corpus data by a screening algorithm;

It should be noted that the embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Since the system embodiment is basically similar to the method embodiment, its description is brief; for the relevant points, refer to the corresponding description of the method embodiment.

Also, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element. In addition, those skilled in the art will appreciate that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing associated hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.

While the foregoing specification illustrates and describes the preferred embodiments of the present invention, it is to be understood that the invention is not limited to the precise forms disclosed herein, which are not to be construed as excluding other embodiments; the invention may be used in various other combinations, modifications, and environments, and may be modified within the scope of the inventive concept expressed herein, through the above teachings or through the skill or knowledge of the relevant art. Modifications and variations effected by those skilled in the art without departing from the spirit and scope of the invention shall fall within the protection of the appended claims.

Claims

1. A human-computer combined corpus labeling method is characterized by comprising the following steps:

a. obtaining corpus data to be labeled and carrying out manual observation;

b. positioning key linguistic data according to positioning information input by a user;

e. labeling the corpus category of the marked corpus to obtain a labeled corpus.

2. The human-computer combined corpus labeling method according to claim 1, wherein: in the step a, the corpus data is a table text; in the step b, the key corpus is located by a cell locating method, and the key corpus corresponding to the cell is obtained according to the row and column information input by the user.

3. The human-computer combined corpus labeling method according to claim 1, wherein: in the step a, the corpus data is a document text; in the step b, the key corpus is located by a line number locating method, and the key corpus corresponding to the line number is obtained according to the line number information input by the user.

4. The human-computer combined corpus tagging method according to any one of claims 1 to 3, wherein: in the step b, the positioning information is input through a command window; and displaying the prompt words of the positioning information to the user in the command window.

5. The human-computer combined corpus tagging method according to any one of claims 1 to 3, wherein: in the step c, the highlighting mark is to add a font color or a background color different from the original corpus data to the marked corpus.

6. The human-computer combined corpus labeling method according to claim 5, wherein: in the step d, the screening algorithm extracts the marked corpus from the corpus data according to a color condition.

7. The human-computer combined corpus labeling method according to any one of claims 1 to 3, wherein: in the step e, the corpus category of the marked corpus is labeled either manually or by means of machine learning trained on the corpus categories of the marked corpus.

8. A corpus annotation system with human-computer combination is characterized by comprising: