CN109977402B

CN109977402B - Named entity identification method and system

Info

Publication number: CN109977402B
Application number: CN201910202512.9A
Authority: CN
Inventors: 张金贺; 徐安华; 欧阳佑
Original assignee: Beijing Mininglamp Software System Co ltd
Current assignee: Beijing Mininglamp Software System Co ltd
Priority date: 2019-03-11
Filing date: 2019-03-11
Publication date: 2022-11-11
Anticipated expiration: 2039-03-11
Also published as: CN109977402A

Abstract

The application discloses a named entity recognition method and a named entity recognition system. The method comprises the following steps: preprocessing a text to be processed to obtain a preprocessing result; obtaining, according to the preprocessing result, character-level expression information sensitive to context information corresponding to the text to be processed; creating conditional random field CRF decoding units in one-to-one correspondence with different named entity types, wherein each conditional random field CRF decoding unit separately decodes the character-level expression information sensitive to the context information to generate a label sequence corresponding to each named entity type; and extracting the corresponding named entities from the respective label sequences. The method and system solve the low efficiency of prior-art overlapping named entity recognition schemes: a sharing mechanism reduces redundant information and inference time, and different entity types assist one another during recognition, which improves the recognition of each single entity type.

Description

Named entity identification method and system

Technical Field

The present application relates to the field of natural language processing, and in particular, to a named entity recognition method and system.

Background

Natural Language Processing (NLP) is an important direction in computer science and artificial intelligence; it studies the theories and methods that enable effective communication between people and computers in natural language. Applications based on natural language processing, such as intelligent question-answering robots and automatic text summarization, have begun to affect many aspects of daily life and production. As a cornerstone of information extraction, Named Entity Recognition (NER) technology is applied in every mature NLP application. A named entity is an entity identified by a name, such as a person name, place name, organization name, or time. Because NER occupies this cornerstone position, its quality directly affects the quality of the entire information extraction chain. The problem an NER system must solve is to identify all entities contained in the input text. For example, the text "Zhang Xiaoming, born on September 27, 1961 in Hong Kong, China" contains three entities: Zhang Xiaoming (person name), September 27, 1961 (time), and Hong Kong, China (place).

Traditionally, NER systems have mostly been implemented with Conditional Random Fields (CRFs) based on a given feature template. The CRF algorithm decodes a text by assigning a predicted label to each character of the text. Under the common BIESO label scheme, taking the text "Zhang Xiaoming was born in Hong Kong, China" as an example, the labeled text is shown schematically in FIG. 1, wherein the three characters of the named entity "Zhang Xiaoming" are labeled B_PER, I_PER, and E_PER, respectively.
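By way of illustration only, the BIESO labeling described above can be sketched in Python as follows. The function name `bieso_tags`, the span format, and the romanized stand-ins for the characters are assumptions made for this sketch and do not appear in the original text; B, I, E, S, and O are taken in their usual sense of begin, inside, end, single, and outside.

```python
def bieso_tags(chars, spans):
    """Label each character with a BIESO tag.

    spans: list of (start, end_exclusive, entity_type). The spans must not
    overlap, because a single label sequence holds one tag per character.
    """
    tags = ["O"] * len(chars)
    for start, end, etype in spans:
        if end - start == 1:
            tags[start] = f"S_{etype}"      # single-character entity
            continue
        tags[start] = f"B_{etype}"          # first character of the entity
        for i in range(start + 1, end - 1):
            tags[i] = f"I_{etype}"          # interior characters
        tags[end - 1] = f"E_{etype}"        # last character of the entity
    return tags


# Hypothetical romanized stand-ins for the three characters of the person name.
chars = ["Zhang", "Xiao", "Ming", "was", "born"]
print(bieso_tags(chars, [(0, 3, "PER")]))
# ['B_PER', 'I_PER', 'E_PER', 'O', 'O']
```

Because each position holds exactly one tag, a single sequence of this kind cannot represent two entities that share characters.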

Recently, production and everyday use have placed further demands on named entity recognition systems, such as handling overlapping named entities. As shown in FIG. 2, the text "the family goes to Washington, D.C." contains the overlapping entities "Washington, D.C." (place) and "Washington" (person name). Here "Washington" carries two sets of labels: (1) B_PER, I_PER, and E_PER; and (2) B_LOC, I_LOC, and I_LOC. However, a CRF algorithm based on feature templates can assign only one label sequence to a text, and is therefore ineffective for texts containing overlapping entities.

To solve the above problem, one feasible solution is to allocate a separate NER system to each entity type, so that a single text sequence can be decoded into multiple label sequences. For the text containing overlapping named entities shown in FIG. 2, two NER systems can be created, as shown in FIG. 3: NER (person name) is responsible for recognizing person-name entities in the text, and NER (place name) is responsible for recognizing place-name entities in the text. However, because these sub NER systems are independent of one another, common knowledge is difficult to share among them, and the overall system carries a high degree of information redundancy. In practice, therefore, this solution is inefficient.

How to solve the low efficiency of prior-art overlapping named entity recognition schemes and reduce redundant information, thereby improving the recognition of single entity types, is a problem that urgently needs to be solved.

Disclosure of Invention

The main purpose of the present application is to provide a named entity recognition method that solves the low efficiency of prior-art overlapping named entity recognition schemes: a sharing mechanism reduces redundant information and shortens inference time, and different entity types can assist one another during recognition, thereby improving the recognition of each single entity type.

In order to achieve the above object, an embodiment of the present application provides a named entity identification method, including:

preprocessing a text to be processed to obtain a preprocessing result;

obtaining character-level expression information which is sensitive to context information corresponding to the text to be processed according to the preprocessing result;

creating conditional random field CRF decoding units which correspond to different named entity types one by one, wherein each conditional random field CRF decoding unit decodes the character-level expression information sensitive to the context information respectively to generate a label sequence corresponding to each named entity type;

and extracting corresponding named entities according to the label sequences respectively.

Optionally, the types of the preprocessing result include: a character set corresponding to the text to be processed, a word set obtained by performing word segmentation on the text to be processed, a sentence set obtained by performing sentence segmentation on the text to be processed, and a part-of-speech set corresponding to the word set.

Optionally, the obtaining, according to the preprocessing result, character-level expression information sensitive to context information corresponding to the text to be processed includes:

constructing feature information corresponding to the type according to the type of the preprocessing result;

and processing the feature information to obtain character-level expression information sensitive to the context information of the text to be processed.

Optionally, the feature information includes: character encoding information corresponding to the character set, word segmentation boundary information corresponding to the word set, sentence boundary distance information corresponding to the sentence set, and part-of-speech feature information corresponding to the part-of-speech set.

Optionally, the processing the feature information to obtain character-level expression information sensitive to context information corresponding to the text to be processed includes:

scanning the feature information in both the forward and reverse directions using a bidirectional long short-term memory (BiLSTM) recurrent neural network to construct character-level expression information sensitive to the context information of the text to be processed.

An embodiment of the present application further provides a named entity recognition system, including:

the text preprocessing module is used for preprocessing the text to be processed to obtain a preprocessing result;

the encoding module is used for obtaining character-level expression information which is sensitive to the context information corresponding to the text to be processed according to the preprocessing result;

the multitask CRF decoding module is arranged for creating conditional random field CRF decoding units which correspond to different named entity types one by one, and each conditional random field CRF decoding unit decodes the character-level expression information sensitive to the context information to generate a label sequence corresponding to each named entity type;

and the output integration module is arranged for extracting corresponding named entities according to the label sequences respectively.

Optionally, the types of the preprocessing result include: a character set corresponding to the text to be processed, a word set obtained by performing word segmentation on the text to be processed, a sentence set obtained by performing sentence segmentation on the text to be processed, and a part-of-speech set corresponding to the word set.

Optionally, the encoding module specifically includes:

a feature extraction module configured to construct feature information corresponding to each type of the preprocessing result;

and a context expression construction module configured to process the feature information to obtain character-level expression information sensitive to the context information corresponding to the text to be processed.

Optionally, wherein the feature information includes: character coding information corresponding to the character set, word segmentation boundary information corresponding to the word set, sentence boundary distance information corresponding to the sentence set and part of speech characteristic information corresponding to the part of speech set.

Optionally, the context expression construction module is specifically configured to:

scan the feature information in both the forward and reverse directions using a bidirectional long short-term memory (BiLSTM) recurrent neural network to construct character-level expression information sensitive to the context information corresponding to the text to be processed.

The technical scheme provided by the application comprises the following steps: preprocessing a text to be processed to obtain a preprocessing result; obtaining character-level expression information sensitive to context information corresponding to the text to be processed according to the preprocessing result; creating conditional random field CRF decoding units which correspond to different named entity types one by one, wherein each conditional random field CRF decoding unit decodes character-level expression information sensitive to the context information to generate a label sequence corresponding to each named entity type; and extracting corresponding named entities according to the label sequences respectively.

The present application provides a named entity recognition system based on a multitask learning mechanism to solve the low efficiency of prior-art overlapping named entity recognition schemes. A sharing mechanism reduces redundant information and inference time, and different entity types can assist one another during recognition, thereby improving the recognition of each single entity type.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application without limiting it. In the drawings:

FIG. 1 is a diagram illustrating a CRF decoding tag sequence in the prior art;

FIG. 2 is a diagram illustrating a tag sequence when exemplary text contains overlapping entities in the prior art;

FIG. 3 is a diagram of a prior art set of independent NER systems;

FIG. 4 is a schematic diagram of a multitasking learning system;

FIG. 5 is a schematic diagram of a named entity recognition system based on multitask learning according to the present application;

FIG. 6 is a flowchart of a named entity recognition method according to embodiment 1 of the present application;

FIG. 7 is a structural diagram of a named entity recognition system according to embodiment 2 of the present application.

The implementation, functional features, and advantages of the objectives of the present application will be further described with reference to the accompanying drawings.

Detailed Description

The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.

The multi-task learning mechanism learns a plurality of subtasks jointly; it can mine and exploit the knowledge common to different subtasks while also learning the knowledge specific to each subtask. The multi-task learning mechanism is widely applied in many fields of machine learning, such as images (semantic segmentation + depth prediction) and heterogeneous text classification. Compared with the strategy of learning each subtask independently, multi-task joint learning enables different subtasks to assist one another and achieve a better result. FIG. 4 is a schematic diagram of a multitask learning system.

The named entity recognition method and system of the present application are designed based on a multitask learning mechanism. Each type of entity recognition task is abstracted as a subtask, and the named entity recognition system is modeled as a multitask learning neural network system with an encoding module shared among the subtasks and decoding modules independent among the subtasks. The multitask CRFs structure in the decoding stage allows the multitask model to learn the knowledge specific to each type of named entity, while the sharing mechanism reduces redundant information, thereby solving the low efficiency of prior-art overlapping named entity recognition schemes. FIG. 5 is a schematic diagram of the named entity recognition system based on multitask learning.

Fig. 6 is a flowchart of a named entity recognition method according to embodiment 1 of the present application, including the following steps:

step 601: preprocessing a text to be processed to obtain a preprocessing result;

The "text to be processed" in the present application may be text input by a user and may contain overlapping named entities. For example, the text "the family goes to Washington, D.C." in FIG. 2 contains two named entities, "Washington" and "Washington, D.C.", both of which include "Washington"; that is, two named entities of different types partially overlap in the text.

In this step 601, the text to be processed is processed to generate various information that can be used for subsequent multitask model input.

In an exemplary embodiment, a corresponding character/word vocabulary may first be constructed from the data set, and low-frequency characters/words may be added to a low-frequency character/word vocabulary. For the text d to be processed, the preprocessing stage performs word segmentation, sentence segmentation, and part-of-speech tagging on the text d, and replaces low-frequency characters appearing in the text with a uniform invalid (placeholder) character.

In an exemplary embodiment, after step 601, a preprocessing result { C, W, S, P } may be obtained according to the text d to be processed, where C, W, S, P respectively represent a character set, a word set, a sentence set, and a part-of-speech set. This information can be integrated and input into subsequent multitask models for named entity recognition.
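By way of illustration only, the preprocessing of step 601 can be sketched as follows. The source does not name a word segmenter or part-of-speech tagger, so this sketch assumes their output is already available as lists of (word, part-of-speech) pairs per sentence. The names `UNK`, `min_count`, `build_low_freq_chars`, and `preprocess`, and the representation of the sentence set as character spans, are hypothetical choices made for the example.

```python
from collections import Counter

UNK = "<UNK>"  # hypothetical uniform placeholder for low-frequency characters


def build_low_freq_chars(corpus, min_count=2):
    """Collect characters seen fewer than min_count times in the data set."""
    counts = Counter(ch for text in corpus for ch in text)
    return {ch for ch, n in counts.items() if n < min_count}


def preprocess(tagged_sentences, low_freq_chars):
    """Build the result {C, W, S, P} from segmented, POS-tagged sentences.

    tagged_sentences: list of sentences, each a list of (word, pos) pairs,
    assumed to come from an external segmenter and tagger.
    """
    C, W, S, P = [], [], [], []
    for sentence in tagged_sentences:
        start = len(C)
        for word, pos in sentence:
            W.append(word)
            P.append(pos)
            # Replace low-frequency characters with the uniform placeholder.
            C.extend(UNK if ch in low_freq_chars else ch for ch in word)
        S.append((start, len(C)))  # sentence stored as a [start, end) span
    return {"C": C, "W": W, "S": S, "P": P}


low = build_low_freq_chars(["abab", "abc"])          # "c" occurs only once
result = preprocess([[("ab", "n"), ("c", "v")], [("ba", "n")]], low)
print(result["C"])  # ['a', 'b', '<UNK>', 'b', 'a']
```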

Step 602: obtaining character-level expression information sensitive to context information corresponding to the text to be processed according to the preprocessing result;

specifically, the step 602 may be implemented by the following specific steps:

step 6021: constructing characteristic information corresponding to the type according to the type of the preprocessing result;

In this step 6021, the text information from preprocessing is received and constructed into input features. By processing the preprocessed text information, four character-level features can be constructed: character encoding, word segmentation boundary, sentence boundary distance, and part of speech. These features are discretized and vectorized before being input into the subsequent multitask model. The features are constructed as follows:

Character encoding: each character in the text is converted to its corresponding character code by looking it up in the vocabulary.

Word segmentation boundary: given the word segmentation information of the input text, the word segmentation boundary feature of a character is encoded as (1) 0 if the character appears at the head of a word; (2) 1 if the character appears at the tail of a word; and (3) 2 otherwise.

Sentence boundary distance: given the sentence segmentation information of the input text, the sentence boundary distance features of a character are defined as $\log_2(d_1)$ and $\log_2(d_2)$, where $d_1$ and $d_2$ denote the distances from the character to the beginning and the end of the sentence, respectively.

Part-of-speech feature: given the part-of-speech information of the input text (nouns, verbs, adjectives, pronouns, numerals, quantifiers, and the like), the part-of-speech feature of a character is defined as the part-of-speech code of the word in which the character is located.
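By way of illustration only, the character-level features described above (word segmentation boundary, sentence boundary distance, and part of speech) can be sketched as follows. The source does not say whether the distances are counted from 0 or from 1; this sketch assumes 1-based distances so that the logarithm is always defined. The function name `char_features` and the tuple layout are hypothetical, and the character-encoding lookup and the later discretization are omitted.

```python
import math


def char_features(sentence):
    """Build per-character features for one sentence.

    sentence: list of (word, pos_code) pairs.
    Returns one (char, boundary, log2_d1, log2_d2, pos_code) tuple per character.
    """
    chars = []
    for word, pos in sentence:
        for k, ch in enumerate(word):
            if k == 0:
                boundary = 0      # character at the head of a word
            elif k == len(word) - 1:
                boundary = 1      # character at the tail of a word
            else:
                boundary = 2      # any other position
            chars.append((ch, boundary, pos))

    n = len(chars)
    feats = []
    for idx, (ch, boundary, pos) in enumerate(chars):
        d1 = idx + 1              # assumption: 1-based distance to sentence start
        d2 = n - idx              # assumption: 1-based distance to sentence end
        feats.append((ch, boundary, math.log2(d1), math.log2(d2), pos))
    return feats


for row in char_features([("abc", "n"), ("d", "v")]):
    print(row)
```

Note that a single-character word satisfies rule (1) first and is therefore encoded as 0 in this sketch.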

Step 6022: and processing the characteristic information to obtain character-level expression information which is sensitive to the context information corresponding to the text to be processed.

In this step 6022, a recurrent neural network of the kind commonly used in language models may be employed to capture the context of each character. Specifically, based on the four character-level features, a bidirectional long short-term memory (BiLSTM) recurrent neural network scans the text in both the forward and reverse directions to construct a character-level expression sensitive to the context information.
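By way of illustration only, the forward and reverse scanning of step 6022 can be sketched as follows. To keep the example free of external libraries, a toy tanh recurrent cell with a scalar hidden state stands in for the LSTM cell; it is not an LSTM, and a practical system would substitute a real BiLSTM layer. The names `rnn_scan` and `bidirectional_scan` and the weight values are hypothetical.

```python
import math


def rnn_scan(inputs, w_in, w_rec):
    """Scan a sequence with a toy cell: h_t = tanh(w_in . x_t + w_rec * h_{t-1})."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(sum(w * v for w, v in zip(w_in, x)) + w_rec * h)
        states.append(h)
    return states


def bidirectional_scan(inputs, fwd_params, bwd_params):
    """Pair each position's forward state with its backward state."""
    forward = rnn_scan(inputs, *fwd_params)
    # Scan the reversed sequence, then restore the original order.
    backward = rnn_scan(inputs[::-1], *bwd_params)[::-1]
    return list(zip(forward, backward))


features = [[1.0], [0.0], [0.0]]   # one feature vector per character
params = ([1.0], 1.0)              # (input weights, recurrent weight)
for state in bidirectional_scan(features, params, params):
    print(state)
```

The forward state at a position summarizes the characters before it and the backward state summarizes the characters after it, so the pair is sensitive to context on both sides.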

Step 603: creating conditional random field CRF decoding units which correspond to different named entity types one by one, wherein each conditional random field CRF decoding unit decodes character-level expression information sensitive to the context information respectively to generate a label sequence corresponding to each named entity type;

In this step 603, the present application defines the named entity types to be recognized according to design requirements and then assigns one conditional random field CRF decoding unit to each type of named entity; for N entity types, these units form the set $\{CRF_1, CRF_2, \ldots, CRF_N\}$. To exploit as much of the knowledge common to different entity types as possible and thereby improve each individual task, all of these conditional random field CRF decoding units receive a common input (the character-level expression information sensitive to the context information).

In this step, the character-level expression information sensitive to the context information from the previous step is decoded in parallel. Each conditional random field CRF decoding unit outputs a decoded label sequence for the text, $S_i = \{s_1, s_2, \ldots, s_{|M|}\}$.
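By way of illustration only, decoding one shared representation with several independent CRF decoding units can be sketched as follows. The source does not specify the decoding algorithm; this sketch assumes linear-chain CRFs decoded with the Viterbi algorithm. The names `viterbi` and `decode_all`, the `project` callables that map a shared character representation to per-tag scores, and all score values are hypothetical.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag index sequence.

    emissions[t][k]: score of tag k at position t.
    transitions[j][k]: score of moving from tag j to tag k.
    """
    n_tags = len(emissions[0])
    score = list(emissions[0])
    back = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for k in range(n_tags):
            best_j = max(range(n_tags), key=lambda j: score[j] + transitions[j][k])
            pointers.append(best_j)
            new_score.append(score[best_j] + transitions[best_j][k] + emissions[t][k])
        score = new_score
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(range(n_tags), key=lambda k: score[k])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


def decode_all(shared_repr, units):
    """Decode the same shared representation once per entity type.

    units: {entity_type: (project, transitions)}.
    """
    return {
        etype: viterbi([project(h) for h in shared_repr], transitions)
        for etype, (project, transitions) in units.items()
    }


zeros = [[0, 0], [0, 0]]
shared = [[1, 0], [0, 1], [1, 0]]
units = {"LOC": (lambda h: h, zeros), "PER": (lambda h: h[::-1], zeros)}
print(decode_all(shared, units))  # {'LOC': [0, 1, 0], 'PER': [1, 0, 1]}
```

The units do not depend on one another, so they could be run in parallel; the loop here is sequential only for simplicity.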

Step 604: and extracting corresponding named entities according to the label sequences respectively.

In this step, all N label sequences decoded by the different CRF decoding units in the previous step are processed, after which a set of named entities that may overlap can be extracted. For example, for the sentence "the family goes to Washington, D.C.", $CRF_1$ decodes the label sequence for named entities of the place type, from which the place "Washington, D.C." can be extracted in this step; $CRF_2$ decodes the label sequence for named entities of the person-name type, from which the person name "Washington" can be extracted in this step.
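By way of illustration only, extracting entities from the per-type label sequences and merging them into one set that may contain overlaps can be sketched as follows. The function names `extract_entities` and `merge` and the tuple layout are hypothetical, and the Chinese characters in the usage example are an assumed rendering of the example sentence, which the source gives only in translation.

```python
def extract_entities(chars, tags, etype):
    """Collect (text, start, end_exclusive, etype) spans from one BIESO sequence."""
    entities, start = [], None
    for i, tag in enumerate(tags):
        prefix = tag.split("_", 1)[0]
        if prefix == "S":                           # single-character entity
            entities.append((chars[i], i, i + 1, etype))
            start = None
        elif prefix == "B":                         # entity begins here
            start = i
        elif prefix == "E" and start is not None:   # entity ends here
            entities.append(("".join(chars[start:i + 1]), start, i + 1, etype))
            start = None
        elif prefix == "O":                         # outside any entity
            start = None
    return entities


def merge(chars, tag_sequences):
    """Combine the entities of every type; spans of different types may overlap."""
    out = []
    for etype, tags in tag_sequences.items():
        out.extend(extract_entities(chars, tags, etype))
    return sorted(out, key=lambda e: (e[1], e[2]))


chars = list("去华盛顿特区")  # assumed rendering of "goes to Washington, D.C."
sequences = {
    "LOC": ["O", "B_LOC", "I_LOC", "I_LOC", "I_LOC", "E_LOC"],
    "PER": ["O", "B_PER", "I_PER", "E_PER", "O", "O"],
}
for entity in merge(chars, sequences):
    print(entity)
```

Because each entity type has its own label sequence, the person name and the place can share characters without conflict.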

The named entity recognition system is trained by a learner. Unlike the strategy of training a multi-task model alternately, one subtask at a time, the present application uses a joint optimization mechanism to learn the multitask CRFs structure jointly. The optimization target (loss function) is as follows:

$$J(\theta) = \sum_{i=1}^{N} w_i J_i(\theta)$$

where $J_i(\theta)$ is the loss function of the i-th decoding unit and $w_i$ is a weighting factor used to balance the different tasks. Because the subtasks of the present application are all named entity recognition tasks whose loss functions are on the same scale, the present application sets the weighting factor $w_i = 1$.

Based on the joint optimization target, the parameters in the multi-task CRFs neural network structure can be learned by adopting a back propagation algorithm.
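By way of illustration only, the joint optimization target, a weighted sum of the per-unit losses, can be sketched as follows. The source does not give the form of each unit's loss; this sketch assumes it is the negative log-likelihood of a linear-chain CRF, computed with the forward algorithm. The names `log_sum_exp`, `crf_nll`, and `joint_loss` and all score values are hypothetical, and gradient computation by back propagation is left to an automatic differentiation framework.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def crf_nll(emissions, transitions, tags):
    """Negative log-likelihood of a gold tag path under a linear-chain CRF."""
    n_tags = len(emissions[0])
    # Score of the gold path.
    gold = emissions[0][tags[0]]
    for t in range(1, len(tags)):
        gold += transitions[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
    # Log partition function over all paths (forward algorithm).
    alpha = list(emissions[0])
    for t in range(1, len(emissions)):
        alpha = [
            log_sum_exp([alpha[j] + transitions[j][k] for j in range(n_tags)])
            + emissions[t][k]
            for k in range(n_tags)
        ]
    return log_sum_exp(alpha) - gold


def joint_loss(unit_losses, weights=None):
    """Weighted sum of per-unit losses; weights default to 1 as in the text."""
    weights = weights or [1.0] * len(unit_losses)
    return sum(w * j for w, j in zip(weights, unit_losses))


zeros = [[0.0, 0.0], [0.0, 0.0]]
loss_loc = crf_nll([[0.0, 0.0]], zeros, [0])                 # log 2
loss_per = crf_nll([[0.0, 0.0], [0.0, 0.0]], zeros, [0, 1])  # log 4
print(joint_loss([loss_loc, loss_per]))
```

Summing the losses before differentiating means that a single backward pass updates the shared encoder with gradients from every decoding unit.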

It should be noted that the present application provides a named entity recognition system based on a multitask learning mechanism to solve the low efficiency of prior-art overlapping named entity recognition schemes. A sharing mechanism reduces redundant information and inference time, and different entity types can assist one another during recognition, thereby improving the recognition of each single entity type.

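FIG. 7 is a structural diagram of a named entity recognition system according to embodiment 2 of the present application. As shown in FIG. 7, the system includes:

the text preprocessing module, configured to preprocess the text to be processed to obtain a preprocessing result;

the encoding module, configured to obtain character-level expression information sensitive to the context information of the text to be processed according to the preprocessing result;

the multitask CRF decoding module, configured to create conditional random field CRF decoding units in one-to-one correspondence with different named entity types, wherein each conditional random field CRF decoding unit decodes the character-level expression information sensitive to the context information to generate a label sequence corresponding to each named entity type;

and the output integration module, configured to extract the corresponding named entities from the respective label sequences.

The types of the preprocessing result include: a character set corresponding to the text to be processed, a word set obtained by performing word segmentation on the text to be processed, a sentence set obtained by performing sentence segmentation on the text to be processed, and a part-of-speech set corresponding to the word set.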

Specifically, the encoding module includes:

a feature extraction module configured to construct feature information corresponding to each type of the preprocessing result;

and a context expression construction module configured to process the feature information to obtain character-level expression information sensitive to the context information corresponding to the text to be processed.

The feature information includes: character encoding information corresponding to the character set, word segmentation boundary information corresponding to the word set, sentence boundary distance information corresponding to the sentence set, and part-of-speech feature information corresponding to the part-of-speech set.

Specifically, the context expression construction module is configured to scan the feature information in both the forward and reverse directions using a bidirectional long short-term memory (BiLSTM) recurrent neural network to construct character-level expression information sensitive to the context information corresponding to the text to be processed.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.

The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.

Through the description of the foregoing embodiments, it is clear to those skilled in the art that the method of the foregoing embodiments may be implemented by software plus a necessary general hardware platform, and certainly may also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.

The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application, or which are directly or indirectly applied to other related technical fields, are also included in the scope of the present application.

Claims

1. A named entity recognition method, comprising:

preprocessing a text to be processed to obtain a preprocessing result;

obtaining character-level expression information sensitive to context information corresponding to the text to be processed according to the preprocessing result;

creating conditional random field CRF decoding units in one-to-one correspondence with different named entity types, wherein each conditional random field CRF decoding unit separately decodes the character-level expression information sensitive to the context information, and each conditional random field CRF decoding unit outputs a decoded label sequence for the text;

and extracting corresponding named entities according to the label sequences, wherein all N label sequences decoded by the different CRF decoding units in the previous step are processed and a set of named entities that may overlap is then extracted.

2. The method of claim 1, wherein the types of the preprocessing result comprise: a character set corresponding to the text to be processed, a word set obtained by performing word segmentation on the text to be processed, a sentence set obtained by performing sentence segmentation on the text to be processed, and a part-of-speech set corresponding to the word set.

3. The method according to claim 2, wherein the obtaining of the character-level expression information sensitive to the context information corresponding to the text to be processed according to the preprocessing result comprises:

constructing feature information corresponding to each type of the preprocessing result;

and processing the feature information to obtain character-level expression information sensitive to the context information corresponding to the text to be processed.

4. The method of claim 3, wherein the feature information comprises: character encoding information corresponding to the character set, word segmentation boundary information corresponding to the word set, sentence boundary distance information corresponding to the sentence set, and part-of-speech feature information corresponding to the part-of-speech set.

5. The method according to claim 4, wherein the processing the feature information to obtain character-level expression information sensitive to context information corresponding to the text to be processed comprises:

scanning the feature information in both the forward and reverse directions using a bidirectional long short-term memory (BiLSTM) recurrent neural network to construct character-level expression information sensitive to the context information of the text to be processed.

6. A named entity recognition system, comprising:

the encoding module is arranged to obtain character-level expression information which is sensitive to the context information corresponding to the text to be processed according to the preprocessing result;

the multitask CRF decoding module, configured to create conditional random field CRF decoding units in one-to-one correspondence with different named entity types, wherein each conditional random field CRF decoding unit decodes the character-level expression information sensitive to the context information, and each conditional random field CRF decoding unit outputs a decoded label sequence for the text;

and the output integration module, configured to extract corresponding named entities according to the label sequences, wherein all N label sequences decoded by the different CRF decoding units are processed and a set of named entities that may overlap is then extracted.

7. The system of claim 6, wherein the types of the preprocessing result comprise: a character set corresponding to the text to be processed, a word set obtained by performing word segmentation on the text to be processed, a sentence set obtained by performing sentence segmentation on the text to be processed, and a part-of-speech set corresponding to the word set.

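8. The system of claim 7, wherein the encoding module specifically comprises: a feature extraction module configured to construct feature information corresponding to each type of the preprocessing result; and a context expression construction module configured to process the feature information to obtain character-level expression information sensitive to the context information corresponding to the text to be processed.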

9. The system of claim 8, wherein the feature information comprises: character encoding information corresponding to the character set, word segmentation boundary information corresponding to the word set, sentence boundary distance information corresponding to the sentence set, and part-of-speech feature information corresponding to the part-of-speech set.

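10. The system of claim 9, wherein the context expression construction module is specifically configured to: scan the feature information in both the forward and reverse directions using a bidirectional long short-term memory (BiLSTM) recurrent neural network to construct character-level expression information sensitive to the context information corresponding to the text to be processed.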