CN113408290A - Intelligent marking method and system for Chinese text

Info

Publication number: CN113408290A
Application number: CN202110730230.3A
Authority: CN
Inventors: 辛国茂; 孙露; 吴士伟; 李钊; 卢凤; 郭梦燕; 孙浩; 陈通
Original assignee: Shandong Ecloud Information Technology Co ltd
Current assignee: Shandong Ecloud Information Technology Co ltd
Priority date: 2021-06-29
Filing date: 2021-06-29
Publication date: 2021-09-17

Abstract

The invention discloses an intelligent labeling method and system for Chinese text. The method comprises the following steps: acquiring a data set to be labeled, together with the entity and relationship labels to be applied; performing entity and relationship recognition on the data set with an entity relationship extraction model to obtain a pre-labeling result; and receiving the user's corrections to the pre-labeling result to complete the labeling. Entities and relationships are first recognized by the model to produce the pre-labeling result, and a manual labeling method based on an interactive page is then applied to that result, so that the precision of entity and relationship labeling is ensured.

Description

Intelligent marking method and system for Chinese text

Technical Field

The invention belongs to the technical field of natural language processing, and particularly relates to an intelligent marking method and system for Chinese texts.

Background

The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.

Information extraction is an important task in natural language processing. One of its basic purposes is to extract meaningful structured information from raw unstructured text for use in natural language processing applications such as intelligent question answering and retrieval, thereby giving users a more intelligent experience. Information extraction is a broad task that comprises subtasks such as named entity recognition, relation extraction, and event extraction. Because Chinese annotated corpora are scarce, information extraction on Chinese has been studied less than on English data sets. Annotated corpus data sets are labeled manually, so the data contain high-quality entity relationship triples with almost no noise. However, manual labeling is time-consuming and labor-intensive, so these data sets are generally small and cannot adequately support subsequent modeling.

Word segmentation is the first step of entity relationship extraction, and its accuracy strongly affects the efficiency and accuracy of that extraction. English words are naturally separated by spaces, so English text is easily split into words at the spaces. A Chinese sentence, by contrast, has no separators between words and consists of a continuous string of Chinese characters. Segmenting Chinese correctly according to its semantics is therefore a challenging task: once word segmentation fails, subsequent text processing suffers knock-on problems, which hinders correct understanding of the semantics.

Disclosure of Invention

To overcome the defects of the prior art, the invention provides an intelligent labeling method and system for Chinese text, in which entities and relationships are recognized by a model to obtain a pre-labeling result, and a manual labeling method based on an interactive page is then applied to the pre-labeling result, so that the precision of entity and relationship labeling is ensured.

In order to achieve the above object, one or more embodiments of the present invention provide the following technical solutions:

An intelligent labeling method for Chinese text comprises the following steps:

acquiring a data set to be labeled, and the entity and relationship labels to be applied;

performing entity and relationship recognition on the data set to be labeled based on an entity relationship extraction model to obtain a pre-labeling result;

and receiving the user's corrections to the pre-labeling result to complete the labeling.

Further, after the pre-labeling result is obtained, the pre-labeling result is displayed, which specifically comprises:

displaying the text content of the data set to be labeled together with the pre-labeling result, wherein different entity or relationship labels in the pre-labeling result are displayed distinctively according to set styles, and the style corresponding to each entity or relationship label is provided.

Further, receiving the user's corrections to the pre-labeling result comprises:

capturing mouse events on the screen, and distinctively displaying, in real time, the word that the mouse points to as the mouse moves; and when the word that the mouse points to needs to be labeled or its label needs to be modified, receiving the user's labeling operation and applying the corresponding style to the word.

Further, distinctively displaying in real time the word that the mouse points to comprises:

capturing mouse events on the screen; when the mouse moves over a word in the text of the data set to be labeled, looking up whether the word exists in the word segmentation table built into the system; if so, combining the word with the word in front of or behind it and looking up whether the combination exists in the word segmentation table; and if so, identifying the combination as one word and displaying it distinctively.
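As a non-limiting illustration, the table lookup performed while the mouse moves over the text can be sketched in Python as follows. The function name `word_under_pointer`, the set `VOCAB` (a stand-in for the built-in word segmentation table), and the greedy one-character-at-a-time growth are assumptions made for this sketch; the disclosure does not fix these details, and an actual page would run equivalent logic inside a mouse event handler.

```python
from typing import Optional, Set, Tuple


def word_under_pointer(text: str, index: int, vocab: Set[str]) -> Optional[Tuple[int, int]]:
    """Return the (start, end) span of the word containing text[index], or None.

    Hypothetical helper: `vocab` stands in for the system's built-in
    word segmentation table.
    """
    start, end = index, index + 1
    if text[start:end] not in vocab:
        return None  # the pointed-at character is not in the table
    grown = True
    while grown:
        grown = False
        # Try to combine the span with the character behind it (to the right).
        if end < len(text) and text[start:end + 1] in vocab:
            end += 1
            grown = True
        # Otherwise try the character in front of it (to the left).
        elif start > 0 and text[start - 1:end] in vocab:
            start -= 1
            grown = True
    return start, end


# Toy segmentation table (assumed content, for illustration only).
VOCAB = {"山", "东", "省", "山东", "山东省", "济", "南", "市", "济南", "济南市"}
TEXT = "山东省济南市"

span = word_under_pointer(TEXT, 1, VOCAB)  # the pointer is over "东"
if span is not None:
    print(TEXT[span[0]:span[1]])  # the word that would be highlighted
```

Because the span grows by one character per lookup, this sketch only finds a longer word when its intermediate combinations are also in the table; a real table lookup could instead try longer candidate windows directly.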

One or more embodiments provide a client, connected with a server, comprising:

the annotation task configuration module is used for configuring the data set to be annotated, the entity to be annotated, the relationship label and the required entity relationship extraction model, and generating an annotation task;

the annotation task issuing module is used for issuing the annotation task and assigning it to labeling personnel;

and the manual labeling module is used for acquiring and displaying the pre-labeling result obtained by the identification and the labeling of the entity relationship extraction model and receiving the correction of the user on the pre-labeling result.

Further, in the manual labeling module, displaying the pre-labeling result includes:

Further, in the manual annotation module, receiving the correction of the user on the pre-annotation result includes:

One or more embodiments provide a server, connected with the client, including:

the model management module is used for managing an entity relationship model, and the entity relationship model comprises a model architecture and a trained model; acquiring the configuration of the client about the entity relationship extraction model to obtain a required entity relationship extraction model;

and the intelligent labeling module is used for identifying the entity and the relationship of the data set to be labeled based on the entity relationship extraction model to obtain a pre-labeling result.

One or more embodiments provide an intelligent marking system for Chinese text, which comprises the client and the server.

The above one or more technical solutions have the following beneficial effects:

The technical solution provides a labeling method that combines intelligent labeling with manual labeling: entities and relationships are recognized by a model to obtain a pre-labeling result, and a manual labeling method based on an interactive page is then applied to that result, so that labeling precision is guaranteed.

The manual labeling method based on the interactive page identifies word segments through VUE mouse events and displays them distinctively, and uses a cascading style sheet to associate labels with styles, so that the user can flexibly label entities and relationships, which improves the user experience. Because words are highlighted in real time as the mouse moves, the user does not need to read word by word, which improves the efficiency of manual labeling; a large-scale labeled data set is thereby obtained, providing data support for improving the accuracy of subsequent modeling.

In addition, the system provided by the technical solution presets a variety of model architectures and trained models, so that a user can either retrain a model architecture by configuring model parameters or directly adopt a trained model, which gives high flexibility. Moreover, the model is updated with the manually corrected labeling results, which improves the accuracy of the model, and the quality and quantity of the corpus improve over multiple rounds of training.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention and not to limit it.

FIG. 1 is a framework diagram of the intelligent labeling system for Chinese text according to the first embodiment of the present invention;

FIG. 2 is a flowchart of the intelligent labeling method for Chinese text according to the second embodiment of the present invention.

Detailed Description

It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise; it should further be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.

The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.

Example one

The embodiment provides an intelligent marking system for Chinese texts, which comprises a server and a client.

The client is configured to include:

The annotation task configuration module is used for creating a new annotation task: entering a task name and selecting the required data source, the entity/relationship labels, and the required labeling model. It specifically comprises the following submodules:

The data source configuration submodule is used for acquiring the data file to be labeled and sending it to the server for storage. The data file format is not limited herein; data sources such as txt/csv text, SQL databases, and NoSQL databases can be added according to user requirements.

The label configuration submodule is used for setting the entity labels and relationship labels required by the labeling task; these can also be selected from the entity labels and relationship labels that already exist in the system.

The model configuration submodule is used for configuring the entity relationship extraction model, sending the configuration information to the server, and receiving the entity relationship extraction model trained by the server. Specifically, the user can configure the parameters needed for the modeling process according to the specific scenario and requirements. After successful configuration, the modeling task can be submitted to task management, through which it is monitored. Alternatively, an already constructed model can be selected, and third-party or other models can be managed by configuring the model input and output parameters.
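As a non-limiting illustration, the information gathered by the three submodules above can be held in one task configuration object. The following Python sketch is only one possible arrangement; the class names `AnnotationTask` and `ModelConfig`, every field name, the architecture name `"bert-ner"`, and the rule that exactly one of a trained model or an architecture is chosen are assumptions made for this sketch and are not specified in the disclosure.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ModelConfig:
    # Assumption: the user either picks an existing trained model
    # or picks an architecture to be trained, but not both.
    trained_model: Optional[str] = None   # name of an already constructed model
    architecture: Optional[str] = None    # name of an architecture to train
    params: Dict[str, float] = field(default_factory=dict)  # modeling parameters

    def validate(self) -> None:
        if (self.trained_model is None) == (self.architecture is None):
            raise ValueError("set exactly one of trained_model or architecture")


@dataclass
class AnnotationTask:
    name: str
    data_source: str            # e.g. a txt/csv file path or a database location
    entity_labels: List[str]
    relation_labels: List[str]
    model: ModelConfig

    def validate(self) -> None:
        if not self.name or not self.data_source:
            raise ValueError("task name and data source are required")
        if not self.entity_labels and not self.relation_labels:
            raise ValueError("at least one entity or relationship label is required")
        self.model.validate()


# Hypothetical task: all values below are placeholders.
task = AnnotationTask(
    name="demo-task",
    data_source="corpus.txt",
    entity_labels=["PERSON", "LOCATION"],
    relation_labels=["works_in"],
    model=ModelConfig(architecture="bert-ner", params={"learning_rate": 3e-5, "epochs": 3}),
)
task.validate()  # raises ValueError if the configuration is incomplete
```

Validating on the client before submission is a design choice of the sketch; the same checks could equally be performed by the server when it receives the configuration.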

The annotation task issuing module is used for issuing the annotation task after the task has been configured successfully and assigning it to labeling personnel for manual labeling.

The manual labeling module is used for manually verifying the model's labeling result on the basis of intelligent labeling. The manual verification page is implemented by means of VUE mouse events and a cascading style sheet. The module is configured to perform the following steps:

(1) Acquiring the pre-labeling result from the server, and displaying the pre-labeling result in the data set to be labeled.

Specifically, different entities or relationships are displayed distinctively according to set styles, and the style corresponding to each entity or relationship label is provided. As a specific implementation, different display styles are set for different entities or relationships, for example different highlight colors. Other ways of distinguishing them may of course be used, for example assigning different colors to different entities or relationships, which is not limited herein.

(2) Receiving the user's corrections to the pre-labeling result to complete the labeling.

Specifically, the mouse events include mouse press, mouse release, the mouse entering or leaving an element, and the like. The entity to which the labeled data belongs is changed through mouse events, so that the element takes on the corresponding style.

As a specific embodiment, the step (2) specifically includes:

(2.1) Capturing mouse events on the screen, and distinctively displaying, in real time, the word that the mouse points to as the mouse moves. Specifically, mouse events on the screen are captured; when the mouse moves over a word in the text of the data set to be labeled, the system looks up whether the word exists in its built-in word segmentation table; if so, the word in front of or behind it is combined with it and the combination is looked up in the word segmentation table; if the combination exists, it is identified as one word and displayed distinctively.

(2.2) When the word that the mouse points to needs to be labeled or its label needs to be modified, receiving the user's labeling operation and applying the corresponding style to the word. Specifically, because the pre-labeling result, which the system produces from the recognition of the machine learning model, is already displayed, the user judges, as the mouse moves over a word, whether that word needs to be labeled or its label needs to be modified; when it does, the word is selected and given the style of the label to which it belongs.

During labeling, the mouse press, move, and release events are monitored to obtain the positions at which the mouse was pressed and released and the selected character string; the selected-state style is then added to each selected character, which implements the labeling of entities and relationships. Labeling entities and relationships flexibly by mouse click gives a better user experience.
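As a non-limiting illustration, the press/release handling described above can be modeled independently of the page framework. In the following Python sketch, the class names `Annotation` and `LabelingSession`, the method names, the use of character offsets as mouse positions, and the rule that a new selection replaces any overlapping pre-label are all assumptions made for the sketch; in the disclosed system the equivalent logic is driven by VUE mouse events.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Annotation:
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    label: str


@dataclass
class LabelingSession:
    text: str
    annotations: List[Annotation] = field(default_factory=list)  # pre-labels go here
    press_offset: Optional[int] = None

    def mouse_down(self, offset: int) -> None:
        """Remember where the mouse was pressed."""
        self.press_offset = offset

    def mouse_up(self, offset: int, label: str) -> Optional[str]:
        """Label the characters between press and release; return the selected string."""
        if self.press_offset is None:
            return None
        # The user may drag in either direction.
        start, end = sorted((self.press_offset, offset))
        self.press_offset = None
        if start == end:
            return None  # a plain click selects nothing
        # Assumption: a manual selection overrides overlapping pre-labels.
        self.annotations = [
            a for a in self.annotations if a.end <= start or a.start >= end
        ]
        self.annotations.append(Annotation(start, end, label))
        self.annotations.sort(key=lambda a: a.start)
        return self.text[start:end]


# The model pre-labeled only the first character, and with the wrong label.
session = LabelingSession("张三在济南工作", [Annotation(0, 1, "LOCATION")])
session.mouse_down(0)
selected = session.mouse_up(2, "PERSON")  # the user drags over "张三"
```

The page would then apply the style of each stored label to the characters in its span.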

The secondary checking module is used by a task auditor to perform, within this module, a secondary check of the results of model labeling and manual labeling.

The data that passes the check is stored in the server, so that the entity relationship extraction model can be updated with the latest labeled corpus, which improves the accuracy of the model.

The task management module is used for querying task status and completion progress. Specifically, it is used for querying annotation tasks, modeling tasks, and their progress, and for viewing the current task status, the current handler, the completion status, and the like.

The server is configured to include:

The personnel management module is used for managing the basic information and authority information of personnel. Personnel with different authorities can use different functional modules in their clients; that is, the authority information limits the functional modules available in the corresponding client.

For example, a person with administrator authority can use all functional modules in the client, can configure and issue tasks, and can also perform labeling personally, whereas a person with ordinary authority can only use the manual labeling module and the secondary checking module and can only label according to the tasks assigned. The specific authority configuration can be adjusted according to specific requirements and is not limited herein.
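As a non-limiting illustration, the authority example above amounts to a mapping from authority level to the client modules that may be used. In the following Python sketch, the identifiers `CLIENT_MODULES`, `ROLE_MODULES`, and `can_use`, as well as the module and role strings, are assumptions made for the sketch; only the two authority levels and their module sets come from the example in the text.

```python
from typing import Dict, FrozenSet

# Client-side functional modules described in this embodiment (names are placeholders).
CLIENT_MODULES: FrozenSet[str] = frozenset({
    "task_configuration",
    "task_issuing",
    "manual_labeling",
    "secondary_checking",
    "task_management",
})

# Authority level -> modules available in the client.
ROLE_MODULES: Dict[str, FrozenSet[str]] = {
    "administrator": CLIENT_MODULES,
    "ordinary": frozenset({"manual_labeling", "secondary_checking"}),
}


def can_use(role: str, module: str) -> bool:
    """Return True if a person with `role` may use `module`; unknown roles get nothing."""
    return module in ROLE_MODULES.get(role, frozenset())
```

Keeping the mapping as data rather than code reflects the statement that the authority configuration can be adjusted to specific requirements.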

The data source management module is used for managing the data source files that have been labeled historically.

The label management module is used for managing the manually configured entity labels and relationship labels required by labeling tasks.

The model management module is used for managing entity relationship models, which comprise model architectures and trained models. It receives the client's configuration of the entity relationship extraction model to obtain the entity relationship extraction model, and it receives the client's labeling results after secondary checking and updates the entity relationship extraction model. A model architecture can be parameterized through the model configuration submodule of the client, after which historical labeled data is selected for training. The trained models comprise models trained by the model management module and offline models uploaded by users. Specifically, the TensorFlow deep learning framework is built into the model management module so that modeling can be performed online, and a BERT model is used to extract Chinese text features during modeling.

The intelligent labeling module is used for calling, after the task is issued, a deep-learning-based entity relationship extraction model (the intelligent labeling model) to label the entities and relationships of the data in the data source, so as to obtain a pre-labeling result, and for sending the pre-labeling result to the client.

The historical labeled data management module is used for storing labeled data files as training data for constructing or updating the entity relationship extraction model.

Example two

The embodiment discloses an intelligent labeling method for Chinese texts, which comprises the following steps:

Step 1: acquiring a data set to be labeled, and the entity and relationship labels to be applied;

Step 2: performing entity and relationship recognition on the data set to be labeled based on an entity relationship extraction model to obtain a pre-labeling result;

Step 3: receiving the user's corrections to the pre-labeling result to complete the labeling.
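As a non-limiting illustration, the three steps above can be sketched in Python as a small pipeline. Everything named here is an assumption made for the sketch: the types `Entity`, `Relation`, and `Labeled`, the function `label_dataset`, and in particular `toy_model`, which is a simple dictionary lookup standing in for the entity relationship extraction model and is not the model of the disclosure. The `correct` callback stands in for the user's manual correction on the interactive page.

```python
from typing import Callable, Dict, List, NamedTuple, Set


class Entity(NamedTuple):
    start: int
    end: int
    label: str


class Relation(NamedTuple):
    head: Entity
    tail: Entity
    label: str


class Labeled(NamedTuple):
    text: str
    entities: List[Entity]
    relations: List[Relation]


# Toy dictionary used by the stand-in model (assumed content).
GAZETTEER: Dict[str, str] = {"张三": "PERSON", "济南": "LOCATION"}


def toy_model(text: str, entity_labels: Set[str], relation_labels: Set[str]) -> Labeled:
    """Stand-in for the extraction model: dictionary lookup, first occurrence only."""
    entities = sorted(
        Entity(text.find(term), text.find(term) + len(term), label)
        for term, label in GAZETTEER.items()
        if term in text and label in entity_labels
    )
    relations: List[Relation] = []
    if "works_in" in relation_labels:
        people = [e for e in entities if e.label == "PERSON"]
        places = [e for e in entities if e.label == "LOCATION"]
        # Naive guess that every person works in every place mentioned.
        relations = [Relation(p, loc, "works_in") for p in people for loc in places]
    return Labeled(text, entities, relations)


def label_dataset(
    dataset: List[str],                      # step 1: data set to be labeled
    entity_labels: Set[str],                 # step 1: entity labels
    relation_labels: Set[str],               # step 1: relationship labels
    model: Callable[[str, Set[str], Set[str]], Labeled],
    correct: Callable[[Labeled], Labeled],
) -> List[Labeled]:
    results = []
    for text in dataset:
        pre = model(text, entity_labels, relation_labels)  # step 2: pre-labeling
        results.append(correct(pre))                       # step 3: user correction
    return results
```

Passing the model and the correction step as callables keeps the pipeline independent of which trained model is configured and of how the corrections are collected.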

Step 2 specifically comprises:

Step 2.1: performing entity and relationship recognition on the data set to be labeled based on the entity relationship extraction model;

Step 2.2: labeling the recognized words to obtain a pre-labeling result, and displaying the pre-labeling result.

Displaying the pre-labeling result comprises: displaying the text content of the data set to be labeled together with the pre-labeling result, in which different entities or relationships are displayed distinctively according to set styles. As a specific implementation, different display styles are set for different entities or relationships, for example different highlight colors. Other ways of distinguishing them may of course be used, for example assigning different colors to different entities or relationships, which is not limited herein.
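As a non-limiting illustration, the association between labels and display styles can be expressed as one style sheet rule per label, with each labeled span wrapped in an element carrying the matching class. The following Python sketch generates such markup; the names `LABEL_STYLES`, `css_rules`, and `render`, the `label-` class prefix, and the colors are assumptions made for the sketch, and only entity spans are shown (relationship labels could be mapped to styles in the same way).

```python
from html import escape
from typing import Dict, List, Tuple

# Label -> style declaration (placeholder colors).
LABEL_STYLES: Dict[str, str] = {
    "PERSON": "background-color: #ffe08a",
    "LOCATION": "background-color: #a8d8ff",
}


def css_rules(styles: Dict[str, str]) -> str:
    """Build one style sheet rule per label, e.g. `.label-PERSON { ... }`."""
    return "\n".join(f".label-{label} {{ {decl}; }}" for label, decl in styles.items())


def render(text: str, spans: List[Tuple[int, int, str]]) -> str:
    """Wrap each non-overlapping (start, end, label) span in a styled <span>."""
    parts = []
    cursor = 0
    for start, end, label in sorted(spans):
        parts.append(escape(text[cursor:start]))  # unlabeled text before the span
        parts.append(f'<span class="label-{label}">{escape(text[start:end])}</span>')
        cursor = end
    parts.append(escape(text[cursor:]))  # unlabeled text after the last span
    return "".join(parts)


html = render("张三在济南工作", [(3, 5, "LOCATION"), (0, 2, "PERSON")])
```

Changing a label during manual correction then only requires changing the class on the element, since the style itself lives in the style sheet.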

The entity relationship extraction model in step 2 is obtained by training on historical labeled data.

Step 3 specifically comprises:

Step 3.1: capturing mouse events on the screen, and distinctively displaying, in real time, the word that the mouse points to as the mouse moves;

Specifically, mouse events on the screen are captured; when the mouse moves over a word in the text of the data set to be labeled, the system looks up whether the word exists in its built-in word segmentation table; if so, the word in front of or behind it is combined with it and the combination is looked up in the word segmentation table; if the combination exists, it is identified as one word and displayed distinctively.

With this method, when the mouse sweeps over the text to be labeled, the word that the mouse points to is highlighted in real time, so the user does not need to read word by word, which greatly improves the efficiency of manual labeling.

Step 3.2: when the word that the mouse points to needs to be labeled or its label needs to be modified, receiving the user's labeling operation and applying the corresponding style to the word.

Specifically, because the pre-labeling result, which the system produces from the recognition of the machine learning model, is already displayed, the user judges, as the mouse moves over a word, whether that word needs to be labeled or its label needs to be modified; when it does, the word is selected and given the style of the label to which it belongs.

Finally, the labeling result obtained after manual review and correction is used for updating the parameters of the entity relationship extraction model in step 2.
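As a non-limiting illustration, corrected span labels must be turned into training targets before they can be used to update a model. The disclosure does not specify the target format; the following Python sketch assumes the widely used character-level BIO tagging scheme for the entity part only, and the names `spans_to_bio` and `add_to_corpus` are hypothetical. Relationship labels and the actual parameter update are outside the sketch.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, label), end exclusive


def spans_to_bio(text: str, spans: List[Span]) -> List[str]:
    """Convert non-overlapping entity spans into one BIO tag per character."""
    tags = ["O"] * len(text)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first character of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining characters of the entity
    return tags


def add_to_corpus(
    corpus: List[Tuple[List[str], List[str]]], text: str, corrected: List[Span]
) -> None:
    """Append one manually corrected sample as (characters, tags) to the corpus."""
    corpus.append((list(text), spans_to_bio(text, corrected)))


corpus: List[Tuple[List[str], List[str]]] = []
add_to_corpus(corpus, "张三在济南", [(0, 2, "PER"), (3, 5, "LOC")])
```

Each appended sample pairs the characters of a sentence with their tags, which is a form that sequence labeling models commonly accept as training input.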

The one or more embodiments above provide an intelligent labeling system and method for Chinese text. After configuring the entity labels and relationship labels, a user creating a labeling task selects the data set to be labeled, the entity and relationship labels, and the corresponding entity relationship extraction model, and the labeling task is then generated. Intelligent labeling of entities and relationships is performed when the labeling task is generated, and Chinese word segments are identified automatically by means of VUE mouse events and a cascading style sheet to improve the user experience. Entities and relationships can also be labeled manually and flexibly by mouse click, which refines the model's intelligent labeling result. In this way, labeling accuracy is improved and the data quality of the labeled corpus is improved continuously. The corpus obtained by this labeling method can subsequently be used for constructing knowledge graphs, machine learning models, and the like.

Those skilled in the art will appreciate that the modules or steps of the present invention described above can be implemented using general purpose computer means, or alternatively, they can be implemented using program code that is executable by computing means, such that they are stored in memory means for execution by the computing means, or they are separately fabricated into individual integrated circuit modules, or multiple modules or steps of them are fabricated into a single integrated circuit module. The present invention is not limited to any specific combination of hardware and software.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. An intelligent labeling method for Chinese texts is characterized by comprising the following steps:

acquiring a data set to be marked, an entity to be marked and a relation label;

2. The intelligent labeling method for Chinese text as claimed in claim 1, wherein after obtaining the pre-labeling result, displaying the pre-labeling result, specifically comprising:

3. The intelligent labeling method for Chinese text as claimed in claim 2, wherein receiving user modifications to the pre-labeling results comprises:

4. The intelligent labeling method for Chinese text as claimed in claim 3, wherein the real-time differential display of the words pointed by the mouse comprises:

5. A client connected to a server, comprising:

6. The client of claim 5, wherein the displaying the pre-labeling result in the manual labeling module comprises:

7. The client of claim 6, wherein the receiving, in the manual annotation module, the user's modification to the pre-annotation result comprises:

8. The client of claim 7, wherein the real-time differential display of the words pointed by the mouse comprises:

9. A server connected to the client according to any one of claims 5 to 8, comprising:

10. An intelligent annotation system for Chinese text, comprising a client according to any one of claims 5 to 8 and a server according to claim 9.