CN110750990A

CN110750990A - Entity identification corpus labeling method, system, device and storage medium

Info

Publication number: CN110750990A
Application number: CN201910875551.5A
Authority: CN
Inventors: 陈秀玲
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2019-09-17
Filing date: 2019-09-17
Publication date: 2020-02-04

Abstract

The invention relates to the technical field of semantic analysis and provides a semi-automatic labeling method, system, device and storage medium for entity recognition corpora. The method comprises: obtaining model training data and training an entity recognition model on it, while also acquiring a recognition tool and setting a recognition rule; inputting the corpus to be recognized into the trained entity recognition model, the recognition tool and the recognition rule respectively to perform entity recognition, and acquiring the corresponding recognition results; grouping the recognition results of the entity recognition model, the recognition tool and the recognition rule in pairs, and taking the union of the two recognition results in each group; and taking the intersection of the resulting unions to obtain the final recognition result of the corpus to be recognized. The invention can improve the accuracy of text recognition labeling and greatly reduce the workload of manual labeling.

Description

Entity identification corpus labeling method, system, device and storage medium

Technical Field

The invention relates to the technical field of semantic parsing, in particular to a semi-automatic labeling method, a semi-automatic labeling system, a semi-automatic labeling device and a computer-readable storage medium for entity identification corpora.

Background

At present, the importance of labeled data to an algorithm is self-evident; data can be said to be the lifeblood of the algorithm. In practice, however, labeling data takes a large amount of labor and time. As a result, either less data is labeled, which significantly degrades the algorithm's performance, or the labeling workload is so large that some otherwise effective algorithms are abandoned.

Many entity recognition algorithms exist, but supervised algorithms are generally used to obtain higher accuracy, and supervised algorithms require a labeled (tagged) corpus. At present, entity recognition corpora are labeled in a BEMSO or BISO scheme, in which every character of a text is labeled with an entity type plus one of the BEMSO or BISO letters. Such labeling is labor intensive and incurs a significant labor cost.
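The character-level labeling described above can be illustrated with a minimal sketch. It assumes the conventional reading of the letters (B = beginning, M = middle, E = end, S = single-character entity, O = outside any entity) and a `LETTER-TYPE` tag format such as `B-ORG`; the function name `spans_to_tags`, the span representation and the sample sentence are hypothetical and not taken from the patent.

```python
def spans_to_tags(text, entities):
    """Turn entity spans into one tag per character.

    entities: list of (start, end, entity_type), end exclusive.
    Assumed tag format: B/M/E/S plus entity type, or "O" for non-entities.
    """
    tags = ["O"] * len(text)
    for start, end, entity_type in entities:
        if end - start == 1:
            # A one-character entity gets the "single" tag.
            tags[start] = f"S-{entity_type}"
        else:
            tags[start] = f"B-{entity_type}"
            for i in range(start + 1, end - 1):
                tags[i] = f"M-{entity_type}"
            tags[end - 1] = f"E-{entity_type}"
    return tags


# Hypothetical sentence: "Ping An Technology is in Shenzhen".
sentence = "平安科技在深圳"
for char, tag in zip(sentence, spans_to_tags(sentence, [(0, 4, "ORG"), (5, 7, "LOC")])):
    print(char, tag)
```

Even this short sentence needs seven tags, which shows why labeling every character by hand is costly.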

In addition, some open-source entity recognition tools and publicly labeled entity recognition corpora exist, but most of them are built on general-domain corpora (for example, most Chinese resources use the 1998 People's Daily corpus). They perform poorly on new words and on proper nouns of specific domains, and cannot meet practical requirements. In actual business development, most scenarios call for entity recognition in a specific domain, so a corpus from that domain must be labeled in order to train one's own entity recognition model.

Disclosure of Invention

The invention provides a semi-automatic labeling method, system, electronic device and computer-readable storage medium for entity recognition corpora. Its main purpose is to recognize the same corpus separately with an entity recognition model, a recognition rule and a recognition tool, and to apply union and/or intersection processing to their results, so that the accuracy of text recognition labeling is improved, the labeling results subsequently only need to be checked manually, and the workload of manual labeling is greatly reduced.

In order to achieve the above object, the present invention provides a semi-automatic labeling method for entity recognition corpus, which is applied to an electronic device, and the method comprises:

obtaining model training data, and training an entity recognition model based on the model training data; meanwhile, acquiring a recognition tool and setting a recognition rule;

inputting the corpus to be recognized into the trained entity recognition model, the recognition tool and the recognition rule respectively to perform entity recognition, and acquiring the corresponding recognition results;

grouping the recognition results of the entity recognition model, the recognition tool and the recognition rule in pairs, and taking the union of the two recognition results in each group to obtain the corresponding union results;

and taking the intersection of the union results to obtain the final recognition result of the corpus to be recognized.

Preferably, the entity recognition model is one or more of a long short-term memory (LSTM) network model, a bidirectional long short-term memory (BiLSTM) network model, a conditional random field (CRF) model and a Bidirectional Encoder Representations from Transformers (BERT) model;

and the recognition rule is set according to the characteristics of the corpus to be recognized and the types of named entities.

Preferably, the step of taking the union of the two recognition results in each group to obtain the corresponding union result includes:

when the recognition results of the recognition tool and the entity recognition model form a group, taking the union of the recognition result of the recognition tool and the recognition result of the entity recognition model comprises the following steps:

acquiring the recognition results and recognition accuracy of the entity recognition model and the recognition tool on the same corpus, and ranking the two by recognition accuracy;

constructing an empty set, and adding the recognition result with the lower recognition accuracy to the empty set to form a preliminary set;

and adding the recognition results with the higher recognition accuracy to the preliminary set one by one to obtain the union of the recognition result of the recognition tool and the recognition result of the entity recognition model.

Preferably, the step of adding the recognition results with the higher recognition accuracy to the preliminary set one by one includes:

if the newly added recognition result does not conflict with the recognition results in the preliminary set, merging it with the recognition results in the preliminary set;

and if the newly added recognition result conflicts with a recognition result in the preliminary set, retaining the newly added recognition result and deleting the conflicting recognition result from the preliminary set.

Preferably, the step of taking the union of the two recognition results in each group to obtain the corresponding union result includes:

when the recognition results of the recognition rule and the recognition tool form a group, or the recognition results of the recognition rule and the entity recognition model form a group, taking the union of the recognition result of the recognition rule with the recognition result of the recognition tool or of the entity recognition model comprises the following steps:

acquiring the recognition results and recognition accuracy of the recognition rule and of the entity recognition model or the recognition tool on the same corpus, and ranking the two by recognition accuracy;

and adding the recognition results with the higher recognition accuracy to the preliminary set one by one to obtain the union of the recognition results of the recognition rule and the recognition tool, or of the recognition rule and the entity recognition model.

Preferably, the step of adding the recognition results of the higher-accuracy recognition module to the preliminary set one by one includes:

if the newly added recognition result conflicts with a recognition result in the preliminary set, judging whether the two have an inclusion relationship, and if so, retaining the longer recognition result;

if no inclusion relationship exists, judging whether the two have a crossing relationship, and if so, retaining the recognition result of the recognition rule.

Preferably, the recognition tool is Stanford CoreNLP or HanLP.

In order to achieve the above object, the present invention further provides a semi-automatic labeling system for entity recognition corpus, the system comprising:

the identification module confirming unit is used for acquiring model training data and training an entity recognition model based on the model training data, and meanwhile acquiring a recognition tool and setting a recognition rule;

the recognition result acquisition unit is used for inputting the corpus to be recognized into the trained entity recognition model, the recognition tool and the recognition rule respectively to perform entity recognition, and acquiring the corresponding recognition results;

the recognition result processing unit is used for grouping the recognition results of the entity recognition model, the recognition tool and the recognition rule in pairs, and taking the union of the two recognition results in each group to obtain the corresponding union results;

and the recognition result confirming unit is used for taking the intersection of the union results to obtain the final recognition result of the corpus to be recognized.

To achieve the above object, the present invention also provides an electronic device comprising a memory and a processor, wherein the memory stores a semi-automatic labeling program for entity recognition corpora, and the program, when executed by the processor, implements the semi-automatic labeling method for entity recognition corpora.

In addition, to achieve the above object, the present invention further provides a computer-readable storage medium, where the computer-readable storage medium includes a semi-automatic labeling program of an entity identification corpus, and when the semi-automatic labeling program of the entity identification corpus is executed by a processor, any step in the semi-automatic labeling method of the entity identification corpus is implemented.

With the semi-automatic labeling method, system, electronic device and computer-readable storage medium for entity recognition corpora, the corpus to be recognized is recognized by the entity recognition model, the recognition tool and the recognition rule, and their results are subjected to union and/or intersection processing. This improves the accuracy of text recognition labeling, so that the labeling results subsequently only need to be checked manually and the workload of manual labeling is greatly reduced.

Drawings

FIG. 1 is a schematic diagram of an application environment of a semi-automatic labeling method for entity recognition corpus according to a preferred embodiment of the present invention;

FIG. 2 is a block diagram illustrating a preferred embodiment of the semi-automatic labeling program for entity recognition corpora of FIG. 1;

FIG. 3 is a flowchart illustrating a semi-automatic labeling method for entity identification corpus according to a preferred embodiment of the present invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

The invention provides a semi-automatic labeling method for entity identification corpora, which is applied to an electronic device 1. Fig. 1 is a schematic diagram of an application environment of a semi-automatic labeling method for entity recognition corpus according to a preferred embodiment of the present invention.

In the present embodiment, the electronic device 1 may be a terminal device having an arithmetic function, such as a server, a smart phone, a tablet computer, a portable computer, or a desktop computer.

The electronic device 1 includes: a processor 12, a memory 11, a network interface 14, and a communication bus 15.

The memory 11 includes at least one type of readable storage medium, which may be a non-volatile storage medium such as a flash memory, a hard disk, a multimedia card or a card-type memory. In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 1, such as its hard disk. In other embodiments, the readable storage medium may be an external memory of the electronic device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash card provided on the electronic device 1.

In the present embodiment, the readable storage medium of the memory 11 is generally used for storing a semi-automatic labeling program 10 and the like of the entity identification corpus installed in the electronic device 1. The memory 11 may also be used to temporarily store data that has been output or is to be output.

The processor 12 may be, in some embodiments, a Central Processing Unit (CPU), a microprocessor or another data processing chip, used for running program code stored in the memory 11 or for processing data, for example executing the semi-automatic labeling program 10 for entity recognition corpora.

The network interface 14 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface), and is typically used to establish a communication link between the electronic device 1 and other electronic devices.

The communication bus 15 is used to realize connection communication between these components.

Fig. 1 only shows the electronic device 1 with components 11-15, but it is to be understood that not all of the shown components are required to be implemented, and that more or fewer components may alternatively be implemented.

Optionally, the electronic device 1 may further include a user interface. The user interface may include an input unit such as a keyboard, a voice input device such as a microphone or other equipment with a voice recognition function, and a voice output device such as a loudspeaker or a headset; optionally, the user interface may also include a standard wired interface and a wireless interface.

Optionally, the electronic device 1 may further comprise a display, which may also be referred to as a display screen or a display unit. In some embodiments, the display device may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an Organic Light-Emitting Diode (OLED) touch device, or the like. The display is used for displaying information processed in the electronic apparatus 1 and for displaying a visualized user interface.

Optionally, the electronic device 1 further comprises a touch sensor. The area provided by the touch sensor for the user to perform touch operation is called a touch area. Further, the touch sensor described herein may be a resistive touch sensor, a capacitive touch sensor, or the like. The touch sensor may include not only a contact type touch sensor but also a proximity type touch sensor. Further, the touch sensor may be a single sensor, or may be a plurality of sensors arranged in an array, for example.

The area of the display of the electronic device 1 may be the same as or different from the area of the touch sensor. Optionally, a display is stacked with the touch sensor to form a touch display screen. The device detects touch operation triggered by a user based on the touch display screen.

Optionally, the electronic device 1 may further include a Radio Frequency (RF) circuit, a sensor, an audio circuit, and the like, which are not described herein again.

In the embodiment of the apparatus shown in Fig. 1, the memory 11, as a computer storage medium, may include an operating system and the semi-automatic labeling program 10 for entity recognition corpora; the processor 12 executes the program 10 stored in the memory 11 to implement the steps of the semi-automatic labeling method for entity recognition corpora (steps S110 to S140 described below).

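The invention also provides a semi-automatic labeling system for entity recognition corpora, which comprises the identification module confirming unit, the recognition result acquisition unit, the recognition result processing unit and the recognition result confirming unit described above.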

Correspondingly, in other embodiments, the semi-automatic labeling program 10 for entity recognition corpora may be divided into one or more modules, which are stored in the memory 11 and executed by the processor 12 to implement the present invention. A module, as referred to herein, is a series of computer program instruction segments capable of performing a specified function. Fig. 2 shows a block diagram of a preferred embodiment of the semi-automatic labeling program 10 of Fig. 1. The program 10 may be divided into: an identification module confirming unit 11, a recognition result acquisition unit 12, a recognition result processing unit 13 and a recognition result confirming unit 14. The functions or operation steps performed by the units 11-14 are similar to those described above and are not described in detail here.

In addition, the invention also provides a semi-automatic labeling method for the entity identification corpus. FIG. 3 is a flowchart illustrating a semi-automatic labeling method for entity recognition corpus according to a preferred embodiment of the present invention. The method may be performed by an apparatus, which may be implemented by software and/or hardware.

In this embodiment, a semi-automatic labeling method for entity identification corpora includes: step S110-step S140.

Step S110: obtaining model training data, and training an entity recognition model based on the model training data; meanwhile, acquiring a recognition tool and setting a recognition rule.

The entity recognition model, the recognition tool and the recognition rule form three recognition modules. The entity recognition model is one or more of a long short-term memory (LSTM) network model, a bidirectional long short-term memory (BiLSTM) network model, a conditional random field (CRF) model and a Bidirectional Encoder Representations from Transformers (BERT) model. In addition, the recognition rule is set according to the characteristics of the corpus to be recognized and the types of named entities.

Specifically, the entity recognition model may use the LSTM algorithm alone, the BiLSTM algorithm alone, the CRF algorithm alone, the BERT algorithm alone, or a combination of several of these algorithms. In this semi-automatic labeling method, a combined BiLSTM + CRF model is preferred. The model training data may be an existing corpus such as the People's Daily corpus, or a small-scale corpus labeled by others; when a small-scale corpus is selected, it needs to be downloaded and checked to confirm that its labeling format meets the requirements.
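The format check mentioned above is not specified further in this document. As a minimal sketch, assume the downloaded corpus stores one "character tag" pair per line with blank lines between sentences, and that valid tags are `O` or a B/M/E/S letter followed by an entity type. The file layout, the tag pattern and the name `find_format_errors` are all assumptions made for illustration.

```python
import re

# Assumed tag format: "O", or one of B/M/E/S followed by "-" and an entity type.
TAG_PATTERN = re.compile(r"^(O|[BMES]-[A-Za-z]+)$")


def find_format_errors(lines):
    """Return (line_number, message) for every line that breaks the assumed format."""
    errors = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines separate sentences
        fields = line.split()
        if len(fields) != 2:
            errors.append((number, "expected 'character tag'"))
        elif not TAG_PATTERN.match(fields[1]):
            errors.append((number, f"unknown tag {fields[1]!r}"))
    return errors


sample = ["王 B-PER", "明 E-PER", "", "是", "在 X-LOC"]
print(find_format_errors(sample))
```

A corpus that produces an empty error list under such a check could be passed to training; any other result would call for conversion or manual cleanup first.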

In other words, in step S110 the entity recognition model may be trained based on the BiLSTM + CRF model and the labeled People's Daily training data; meanwhile, a recognition tool such as Stanford CoreNLP (a natural language processing toolkit from Stanford) or HanLP (Han Language Processing, a natural language processing package) is acquired, and a recognition rule is set.

The recognition rule may also be called a rule algorithm. When the corpus to be recognized is recognized and labeled by the recognition rule, a reading rule is set in the rule algorithm; the rule algorithm reads the corpus according to this reading rule, and whenever it reads a word containing preset content, that word is taken as a recognition result.

Specifically, the recognition rule is set according to the characteristics of the corpus to be recognized and the types of named entities. For example, Chinese place names can be recognized using a dictionary of Chinese provinces and cities; company names can be determined from preceding words such as "create", "manage", "offer", "work" and "leave", combined with suffix words such as "limited liability company" and "stock company", and with a dictionary of registered Chinese company names.
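A rule of this kind might be sketched as follows. The dictionaries below are tiny hypothetical excerpts, the Chinese keywords are illustrative guesses at the preceding and suffix words named above, and the function `recognize_by_rules` and its output format are assumptions rather than the patent's implementation.

```python
import re

# Hypothetical excerpts; real rules would load full dictionaries.
PLACE_NAMES = ["深圳", "广东", "北京"]
PRECEDING_WORDS = ["创办", "就职于", "任职于"]
COMPANY_SUFFIXES = ["有限责任公司", "股份有限公司"]


def recognize_by_rules(text):
    """Return sorted (start, end, type) spans found by dictionary and affix rules."""
    entities = []
    # Place names: plain dictionary lookup.
    for name in PLACE_NAMES:
        for match in re.finditer(re.escape(name), text):
            entities.append((match.start(), match.end(), "LOC"))
    # Company names: text between a preceding word and a company suffix.
    prefix = "|".join(re.escape(word) for word in PRECEDING_WORDS)
    suffix = "|".join(re.escape(word) for word in COMPANY_SUFFIXES)
    pattern = re.compile(rf"(?:{prefix})(.+?(?:{suffix}))")
    for match in pattern.finditer(text):
        entities.append((match.start(1), match.end(1), "ORG"))
    return sorted(entities)


# Hypothetical sentence: "He works at Ping An Technology LLC and lives in Shenzhen."
print(recognize_by_rules("他就职于平安科技有限责任公司，住在深圳"))
```

Such rules tend to be precise on the patterns they cover and blind to everything else, which is one reason the method combines them with a trained model and an existing tool.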

Step S120: inputting the corpus to be recognized into the trained entity recognition model, the recognition tool and the recognition rule respectively to perform entity recognition, and acquiring the corresponding recognition results.

The recognition tool may be Stanford CoreNLP, which provides a callable entity recognition interface: the text to be recognized is input directly, and the interface outputs the named entity recognition result. The recognition rules may be set as obvious rules of a specific domain, for example the preceding and suffix words of company names, surnames in person names, and provinces and cities in place names; after the text to be recognized is input, its named entity recognition result is determined and output according to these rules.

In this step, the corpus to be recognized is input into the entity recognition model, the recognition tool and the recognition rule respectively, and three recognition results are obtained. For example, if the input corpus is "I am Wang Xiaoming", the recognition result may be: "I" is a non-entity, "am" is a non-entity, "Wang" is the beginning of an entity, "Xiao" is the middle of an entity, and "Ming" is the end of an entity. A recognition result of this kind is obtained from each of the entity recognition model, the recognition tool and the recognition rule.
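For the set operations in the following steps, it is convenient to turn such a per-character result into entity spans. The sketch below assumes tags of the form `B-PER`, `M-PER`, `E-PER`, `S-PER` and `O` (the tag strings and the entity type `PER` are illustrative assumptions) and returns `(start, end, type)` tuples with an exclusive end.

```python
def tags_to_spans(tags):
    """Collect (start, end, entity_type) spans from per-character tags."""
    spans = []
    start = None  # index where the currently open entity began
    for i, tag in enumerate(tags):
        if tag == "O":
            start = None
            continue
        position, entity_type = tag.split("-", 1)
        if position == "S":
            spans.append((i, i + 1, entity_type))
            start = None
        elif position == "B":
            start = i
        elif position == "E" and start is not None:
            spans.append((start, i + 1, entity_type))
            start = None
        # An "M" tag simply continues the open entity.
    return spans


# The five-character example above: non-entity, non-entity, beginning, middle, end.
print(tags_to_spans(["O", "O", "B-PER", "M-PER", "E-PER"]))
```

Representing each module's output as a set of such spans makes "union", "intersection" and "conflict" straightforward to define.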

Step S130: and grouping the recognition results of the entity recognition model, the recognition tool and the recognition rule in pairs, and performing union processing on the two recognition results in each group to obtain corresponding union results.

Grouping the recognition results in pairs yields three combinations: the recognition results of the entity recognition model and the recognition tool as one group, those of the entity recognition model and the recognition rule as another, and those of the recognition tool and the recognition rule as the third. Union processing is then performed on the two recognition results in each group. The union processing is divided into two cases:

In the first case, when a group consists of the recognition result of the recognition tool and the recognition result of the entity recognition model, taking their union includes:

1. according to the model training data, obtaining the recognition accuracy of the entity recognition model and of the recognition tool on the same corpus, and determining which recognition module has the lower accuracy;

2. constructing an empty set, and adding the recognition results of the lower-accuracy recognition module to it to form a preliminary set;

3. adding the recognition results of the higher-accuracy recognition module to the preliminary set one by one to obtain the union of the recognition results of the recognition tool and the entity recognition model.
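Step 1 can be sketched as below. The document does not define how recognition accuracy is measured, so this example assumes per-character tag accuracy against a labeled sample; `tag_accuracy`, `order_by_accuracy` and the sample tags are hypothetical.

```python
def tag_accuracy(predicted, gold):
    """Fraction of characters whose predicted tag equals the gold tag (assumed metric)."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold tag sequences must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


def order_by_accuracy(predictions, gold):
    """Return module names ordered from lowest to highest accuracy."""
    return sorted(predictions, key=lambda name: tag_accuracy(predictions[name], gold))


gold = ["O", "O", "B-PER", "M-PER", "E-PER"]
predictions = {
    "model": ["O", "O", "B-PER", "M-PER", "E-PER"],
    "tool": ["O", "O", "B-PER", "E-PER", "O"],
}
lower, higher = order_by_accuracy(predictions, gold)
print(lower, higher)
```

The first name returned identifies the module whose results seed the preliminary set in step 2; the second identifies the module whose results are added one by one in step 3.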

In step 3, each newly added recognition result is handled as follows:

if the newly added recognition result does not conflict with the recognition results in the preliminary set, it is merged into the preliminary set;

and if the newly added recognition result conflicts with a recognition result in the preliminary set, the newly added recognition result is retained and the conflicting recognition result in the preliminary set is deleted.
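The first-case union can be sketched as follows. The document does not define "conflict"; this sketch assumes two results conflict when their character spans overlap but the results are not identical. The `Entity` type, `overlaps` and `union_by_accuracy` are hypothetical names.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str


def overlaps(a, b):
    """True when the two spans share at least one character."""
    return a.start < b.end and b.start < a.end


def union_by_accuracy(low_accuracy_results, high_accuracy_results):
    """Seed with the lower-accuracy results, then add higher-accuracy ones one by one."""
    preliminary = set(low_accuracy_results)
    for new in high_accuracy_results:
        # Assumed conflict definition: overlapping but not identical.
        conflicting = {old for old in preliminary if old != new and overlaps(old, new)}
        preliminary -= conflicting  # the newly added result wins every conflict
        preliminary.add(new)
    return preliminary


tool = [Entity(0, 2, "PER"), Entity(5, 9, "ORG")]    # assumed lower accuracy
model = [Entity(0, 3, "PER"), Entity(10, 12, "LOC")]  # assumed higher accuracy
print(sorted(union_by_accuracy(tool, model)))
```

Because the higher-accuracy results are added last and win every conflict, the outcome agrees with the more reliable module wherever the two disagree and is a plain union elsewhere.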

In the second case, when a group consists of the recognition result of the recognition rule together with the recognition result of either the recognition tool or the entity recognition model, taking their union includes:

1. according to the model training data, obtaining the recognition results and recognition accuracy of the recognition rule and of the entity recognition model or the recognition tool on the same corpus, ranking the two by recognition accuracy, and determining which recognition module has the lower accuracy;

The preliminary set is then formed and extended as in steps 2 and 3 of the first case. In step 3, if the newly added recognition result does not conflict with the recognition results in the preliminary set, it is merged into the preliminary set;

if the newly added recognition result conflicts with a recognition result in the preliminary set, it is judged whether the two have an inclusion relationship, and if so, the longer recognition result is retained;

if no inclusion relationship exists, it is further judged whether the two have a crossing relationship; if so, the recognition result of the recognition rule is retained, and otherwise the processing ends.
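The second-case conflict handling can be sketched as follows, under stated assumptions: a conflict is an overlap between non-identical results, an inclusion relationship means one span lies entirely within the other, and a crossing relationship means a partial overlap. The tie-break for identical spans with different labels (keep the newly added result) and the handling of a new result that conflicts with several existing ones are choices made for this sketch, not rules stated in the document. All names are hypothetical.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str


def overlaps(a, b):
    return a.start < b.end and b.start < a.end


def contains(outer, inner):
    return outer.start <= inner.start and inner.end <= outer.end


def resolve(new, old, new_is_rule):
    """Return whichever of two conflicting results should be kept."""
    if contains(new, old) or contains(old, new):
        # Inclusion: keep the longer span (assumption: a tie goes to the new result).
        return new if (new.end - new.start) >= (old.end - old.start) else old
    # Crossing: keep whichever result came from the recognition rule.
    return new if new_is_rule else old


def union_with_rule(rule_results, other_results, rule_is_more_accurate):
    """Union of the rule's results with those of the model or the tool."""
    if rule_is_more_accurate:
        low, high = other_results, rule_results
    else:
        low, high = rule_results, other_results
    preliminary = set(low)
    for new in high:
        conflicts = [old for old in preliminary if old != new and overlaps(old, new)]
        # Sketch choice: add the new result only if it wins every conflict.
        if all(resolve(new, old, rule_is_more_accurate) == new for old in conflicts):
            preliminary.difference_update(conflicts)
            preliminary.add(new)
    return preliminary


rule = [Entity(2, 6, "ORG")]
model = [Entity(4, 8, "ORG"), Entity(10, 12, "PER")]
print(sorted(union_with_rule(rule, model, rule_is_more_accurate=False)))
```

Unlike the first case, accuracy only decides the order of insertion here; the outcome of a conflict depends on span length and on which result came from the rule.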

Step S140: and performing intersection processing on the union result to obtain a final recognition result of the corpus to be recognized.

In this step, intersection processing is performed on the union results obtained in step S130: the content contained in all three union results is selected as the final recognition result of the corpus to be recognized. The recognition result information of the corpus is then output to a manual checking and editing page for manual review.
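Steps S130 and S140 together can be sketched as below. To stay short, the sketch uses plain set union as a stand-in for the conflict-aware union described earlier; a different union function can be passed in through the `union` parameter. The function `final_result` and the sample spans are hypothetical.

```python
from itertools import combinations


def final_result(results, union=lambda a, b: set(a) | set(b)):
    """Pairwise unions of the modules' results, then the intersection of those unions.

    results: dict mapping a module name to its set of recognized entities.
    union:   function combining two result sets (plain set union by default).
    """
    pairwise_unions = [
        union(results[first], results[second])
        for first, second in combinations(sorted(results), 2)
    ]
    return set.intersection(*pairwise_unions)


results = {
    "model": {(2, 5, "PER"), (7, 9, "LOC")},
    "tool": {(2, 5, "PER")},
    "rule": {(7, 9, "LOC"), (10, 12, "ORG")},
}
print(sorted(final_result(results)))
```

With plain set union and three modules, an entity survives the intersection exactly when at least two modules found it, so the scheme behaves like a two-out-of-three vote; the conflict-aware unions additionally settle disagreements about entity boundaries before the vote.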

It should be noted that, in the semi-automatic labeling method for entity identification corpora provided in the above embodiment, through the entity identification model, the identification tool and the identification rule, the corpora to be identified are identified, and the results thereof are subjected to merging and/or intersection processing, so that the accuracy of text identification labeling can be improved, and subsequently, only the labeling result needs to be manually checked, thereby greatly reducing the workload of manual labeling.

In addition, an embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium includes a semi-automatic labeling program of an entity identification corpus, and when executed by a processor, the semi-automatic labeling program of the entity identification corpus implements the following operations:

Preferably, the recognition tool is Stanford CoreNLP or HanLP.

The specific implementation of the computer-readable storage medium of the present invention is substantially the same as the specific implementation of the above-mentioned semi-automatic labeling method, system and electronic device for entity identification corpus, and will not be described herein again.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, apparatus, article, or method that includes the element.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments. Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.

The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. A semi-automatic labeling method for entity recognition corpora is applied to an electronic device, and is characterized by comprising the following steps:

2. The method for semi-automated labeling of entity identification corpora according to claim 1,

the entity recognition model is one or more of a long short-term memory (LSTM) network model, a bidirectional long short-term memory (BiLSTM) network model, a conditional random field (CRF) model and a Bidirectional Encoder Representations from Transformers (BERT) model;

3. The semi-automatic labeling method for entity identification corpus according to claim 1, wherein said merging the two identification results in each group to obtain the corresponding merging result comprises,

4. The semi-automatic labeling method for entity identification corpora according to claim 3, wherein the step of adding the identification results corresponding to the maximum identification accuracy into the preliminary set one by one includes:

5. The semi-automatic labeling method for entity identification corpora according to claim 1, wherein the step of merging the two identification results in each group to obtain a corresponding merged result includes:

6. The semi-automatic labeling method for entity identification corpora according to claim 5, wherein the step of adding the identification results corresponding to the identification modules with high accuracy to the preliminary set one by one includes:

7. The semi-automatic labeling method for entity identification corpora according to claim 1, wherein the identification tool is Stanford CoreNLP or HanLP.

8. A semi-automated annotation system for entity recognition corpora, the system comprising:

9. An electronic device, comprising a memory and a processor, wherein the memory stores a semi-automatic labeling program of the entity identification corpus, and the semi-automatic labeling program of the entity identification corpus, when executed by the processor, implements the steps of the semi-automatic labeling method of the entity identification corpus according to any one of claims 1 to 7.

10. A computer-readable storage medium, wherein the computer-readable storage medium includes a semi-automatic labeling program of an entity identification corpus, and when the semi-automatic labeling program of the entity identification corpus is executed by a processor, the steps of the semi-automatic labeling method of the entity identification corpus according to any one of claims 1 to 7 are implemented.