CN109800437B - Named entity recognition method based on feature fusion - Google Patents


Info

Publication number
CN109800437B
CN109800437B (application CN201910099671.0A)
Authority
CN
China
Prior art keywords
word
features
concept
feature
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910099671.0A
Other languages
Chinese (zh)
Other versions
CN109800437A (en)
Inventor
赵青
王丹
杜金莲
付利华
苏航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201910099671.0A priority Critical patent/CN109800437B/en
Publication of CN109800437A publication Critical patent/CN109800437A/en
Application granted granted Critical
Publication of CN109800437B publication Critical patent/CN109800437B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)

Abstract

A named entity recognition method based on feature fusion, belonging to the field of computers. Text features of different granularities, namely conceptual features and non-conceptual word features, are extracted and fused, which improves the accuracy of named entity recognition and reduces the amount of computation. The method comprises a data preprocessing module, a feature construction module, a named entity network model training module and a named entity classifier module; the feature construction module comprises four sub-modules: semantic feature extraction, word feature extraction, character feature extraction and feature fusion. The method uses the sequential memory of an LSTM (Long Short-Term Memory) or GRU (Gated Recurrent Unit) neural network to take the context information of the named entity task into account, and finally predicts the entity category label with softmax. During model construction, sparsely labeled data can be used as the training set and the LSTM and GRU neural network models can be compared, ensuring that the entity recognition task achieves a satisfactory result.

Description

Named entity recognition method based on feature fusion
Technical Field
The invention belongs to the field of computers and relates to a named entity recognition method based on feature fusion.
Background
In recent years, with the wide application of artificial intelligence in natural language processing (NLP), efforts to mine domain knowledge have been increasing. Named entity recognition is a fundamental and vital step in building domain knowledge; it is required, for example, in knowledge graph construction, text retrieval, text classification and information extraction.
Named entity recognition (NER) can be seen as a sequence labeling task in which entities are located in text and assigned to a fixed set of categories. The two main approaches to the traditional NER problem are rule-based methods and supervised learning methods, with supervised learning dominating. Both assume, when finding tag sequences for candidate entities in a document, that the available training data is fully annotated (i.e., every entity contained in the document is marked). However, producing fully annotated training data in today's big-data era is very time-consuming and labor-intensive, and because of the specificity of most domain terms, the named entity recognition task still faces the following challenges: (1) most information in real life is semi-structured or unstructured, and much of it is descriptive, lacks structural information, and is thus ill-suited to knowledge discovery and extraction; (2) domain entities are structurally complex and the same concept has multiple expressions, for example in the medical domain chronic obstructive pulmonary disease may be abbreviated as COPD; (3) named entities are typically made up of multiple words, and considering only word features splits their semantic information. For these reasons, traditional named entity recognition methods are difficult to adapt to today's application scenarios.
At present, with the excellent performance of deep learning in many fields, it is applied more and more to the named entity recognition task, and deep learning methods perform better than traditional ones. However, most NER methods combined with deep learning target English, or rely on word vectors and character vectors without considering conceptual features.
In 2016, the paper "Neural Architectures for Named Entity Recognition" by Guillaume Lample et al., published at ACL, proposed a named entity recognition method combining a recurrent neural network (RNN) with a conditional random field (CRF) for recognizing English person names, place names, etc.; the method extracts word features and character features through the RNN and finally classifies the entities through the CRF.
In 2017, the paper "Chemical drug named entity recognition based on an attention mechanism" by Yang Pei et al., published in the Journal of Computer Research and Development, proposed an entity recognition method based on word features combined with an attention mechanism, in which an entity recognition classifier is trained with an LSTM (Long Short-Term Memory) network and the final entity tag classification result is generated with a CRF.
Although the above methods can complete the named entity recognition task, they all assume that no domain knowledge is available and learn features only from the training set. In real life, however, most domains possess partial domain knowledge; although imperfect, it can help recognize named entities better in sparse data and can, to a certain extent, reduce the huge amount of computation caused by inconsistent expression.
Disclosure of Invention
A named entity recognition method based on feature fusion is provided, comprising the following steps:
(1) A named entity recognition method based on feature fusion is provided; according to the concepts contained in the domain ontology, new words in a sparsely labeled corpus can be predicted, and entities that are expressed inconsistently but share the same concept can be given a unified expression, so accuracy is improved and computation cost is reduced.
(2) First, a CBOW model is used to extract semantic features from the preprocessed data; the semantic features comprise conceptual features and non-conceptual word features. From concepts, concept, word and character features are extracted; from non-concept words, word and character features are extracted directly.
(3) Second, feature fusion is performed on the extracted new feature set; it likewise comprises two parts, concept-based feature fusion and non-concept-word-based feature fusion, and the dimension of the concept features is reduced by calculating concept similarity.
(4) The sequential-memory characteristic of an LSTM or GRU (Gated Recurrent Unit) neural network model is used to extract the context information relevant to the named entity, with the new feature set serving as the input of the training model.
The principle of the invention: the named entity recognition method based on feature fusion not only adopts traditional word vector features and character vector features, but also considers the concept features and character position features contained in words. The concept features can reduce the dimension of the word vectors; according to the concepts contained in the domain ontology, new words can to a certain extent be predicted in a sparsely labeled corpus; finally, the neural network LSTM or GRU attends to context information, so the accuracy of named entity recognition can be improved considerably.
In order to achieve the aim of the invention, the invention adopts the following technical scheme:
a named entity identification method based on feature fusion comprises the following steps: the system comprises a data preprocessing module, a feature construction module, a named entity network model training module and a named entity classifier module. The feature construction module mainly extracts and fuses text features with different granularities, and specifically comprises four sub-modules, namely a semantic feature extraction module, a word feature extraction module, a character feature extraction module and a feature fusion module.
Semantic feature extraction module: semantic features comprise two parts, conceptual features and non-conceptual word features. A concept is a special domain term composed of multiple independent words that each carry semantics, e.g., chronic obstructive pulmonary disease; a non-conceptual word is a single-semantic vocabulary item, e.g., difficulty. For words that can be mapped to concepts in the domain ontology, concept features are extracted, since the word features of a concept cannot be extracted directly; finally the semantic features are extracted with a CBOW model.
Word feature extraction module: since a concept is composed of multiple words (for example, chronic pulmonary heart disease), the meaning of the concept is determined by the words it contains. To preserve the integrity of semantic information, extraction is divided into two cases, concept-based word feature extraction and non-concept-word feature extraction; the latter uses the same CBOW model as semantic feature extraction.
Character feature extraction module: the character is the smallest semantic unit of Chinese and also carries semantic information; the meaning of a word is determined by the characters it contains, and character-level semantics can to some extent predict new words, which helps infer entity categories. For example, the sum of the vectors of the two characters composing the word "pain" is close to the word vector of "pain" itself. Meanwhile, character position is critical: the same character in different positions can give two words completely different meanings. Therefore, to improve the accuracy of entity recognition, the method considers not only character features but also character position features.
Feature fusion module: the extracted conceptual features, word features and character features are fused into a new feature set. A new fusion method is proposed that considers two cases: for words whose concept can be extracted from the domain ontology, concept, word and character features are fused; for words whose concept cannot be extracted from the ontology, word features are extracted directly and fused with character features. Finally, feature dimensionality reduction is performed on the extracted conceptual features through the domain ontology, so the amount of computation is reduced while the accuracy of named entity recognition is improved; the fused features are used as the input of the model for training.
By extracting text features of different granularities and proposing a novel feature fusion method, the invention can fully learn the semantic information contained in the text and can address both the ambiguity of domain terms and the huge amount of computation caused by inconsistent expression.
Drawings
FIG. 1 is a diagram of an overall architecture of a named entity recognition method based on feature fusion;
FIG. 2 is a flow chart of a named entity recognition method based on feature fusion.
Detailed Description
Features and exemplary embodiments of various aspects of the invention are described in detail below.
The method of multi-granularity feature extraction and feature fusion is used for named entity recognition and is expected to improve recognition accuracy while reducing the amount of computation. The overall architecture, shown in FIG. 1, is divided into a data preprocessing module (1), a feature construction module (2), a named entity network model training module (3) and a named entity classifier module (4). The specific process flow is shown in FIG. 2.
Data preprocessing module (1): first, unlabeled data is added to the labeled training set to form a sparsely labeled corpus, and the domain ontology is loaded; second, the whole sparsely labeled corpus is segmented into shorter Chinese character strings at special symbols (punctuation marks, digits and space characters), and stop words are removed.
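The preprocessing step can be sketched in Python; the stop-word list, the splitting pattern and the sample sentence below are illustrative assumptions, not specified by the patent:

```python
import re

# A hypothetical stop-word list; the patent does not specify one.
STOP_WORDS = {"的", "了", "和"}

def preprocess(corpus: str) -> list[str]:
    """Split text into shorter strings at punctuation marks, digits and
    space characters, then drop stop words, as module (1) describes."""
    pieces = re.split(r"[0-9\s,.;:!?，。；：！？、()（）]+", corpus)
    return [p for p in pieces if p and p not in STOP_WORDS]

chunks = preprocess("患者有慢性阻塞性肺疾病, 病程3年。")
print(chunks)  # ['患者有慢性阻塞性肺疾病', '病程', '年']
```

The splitting pattern can be extended with whatever "special symbols" the corpus actually contains.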
Feature construction module (2): the module mainly extracts features with different granularities from the text and fuses the extracted features. More specifically, the method can be divided into semantic feature extraction, word feature extraction, character feature extraction and feature fusion.
Semantic feature extraction module (21): map the segmented character strings L = (L_1, …, L_n) to the ontology O and use the maximum matching method to find the length L_max of the longest initial matching semantic unit contained in each string (if L_max equals the string length L_len, the whole string is one semantic unit). L_max is then extracted from L, the segments on both sides of L_max become new character strings, and all segmented strings are defined as a semantic set {Y_1, …, Y_N} ∈ D, which contains a concept set and a non-concept vocabulary set {G_1, …, G_U} ∪ {F_1, …, F_V} ∈ Y. Semantic features are then extracted through a CBOW model, whose training goal is to maximize the following average log probability:

(1/N) Σ_{i=K}^{N-K} log Pr(Y_i | Y_{i-K}, …, Y_{i+K})

wherein K is the size of the context window of the target word in the data set D, and Y_i is a semantic unit in the data set D.
In CBOW, probability Pr (Y i |Y i-K ,...,Y i+K ) Is calculated by the following formula:
wherein y is 0 And y i For target semantics Y i Vector representations of input and output, and y 0 For the average vector representation of all contexts, W is the semantic dictionary.
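The maximum matching step of module (21) can be illustrated as follows; the toy ontology, its concept strings, and the character-by-character handling of non-concept text are assumptions made for this sketch:

```python
# Hypothetical mini-ontology; the patent uses a real domain ontology O.
ONTOLOGY = {"慢性阻塞性肺疾病", "肺疾病", "呼吸困难"}
MAX_LEN = max(len(c) for c in ONTOLOGY)

def max_match(s: str):
    """Forward maximum matching: at each position take the longest
    substring found in the ontology (a concept); otherwise emit a
    single character belonging to a non-concept segment."""
    out, i = [], 0
    while i < len(s):
        for L in range(min(MAX_LEN, len(s) - i), 0, -1):
            cand = s[i:i + L]
            if L > 1 and cand in ONTOLOGY:
                out.append((cand, "concept"))
                i += L
                break
        else:  # no ontology concept starts here
            out.append((s[i], "non-concept"))
            i += 1
    return out

print(max_match("患者有慢性阻塞性肺疾病"))
```

A production version would also merge adjacent non-concept characters back into strings before feeding them to the CBOW model.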
Word feature extraction module (22): word feature is divided into two cases, concept-based word feature extraction and non-concept word-based feature extraction.
Feature extraction based on concept words: since a concept is typically composed of multiple words, G = {C_1, …, C_N}, the meaning of the concept is determined by the words it contains, so the method extracts word features on the basis of the concept features. The specific formula is:

P_i = g_i + (1/g_n) Σ_{j=1}^{g_n} c_j

wherein g_i is the concept vector of concept G_i, c_j is the j-th word vector in G_i, g_n is the number of words contained in concept G_i, + is the vector addition operation, and P_i is obtained by adding the concept vector and the average word vector. According to past experimental experience, addition is simpler and faster than concatenation without losing accuracy, so vector addition is used in the following methods as well.
Feature extraction based on non-concept words employs the CBOW model of the semantic feature extraction module (21).
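The concept-based word feature P_i above (concept vector plus average word vector) is plain vector addition; the embedding dimension and random vectors below are placeholders for pretrained CBOW vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8  # toy embedding dimension (assumed)

# Hypothetical pretrained vectors: one concept vector g_i and the
# word vectors c_1..c_gn of the g_n words the concept contains.
g_i = rng.normal(size=dim)             # concept vector of G_i
word_vecs = rng.normal(size=(3, dim))  # c_j, j = 1..g_n with g_n = 3

# P_i = g_i + (1/g_n) * sum_j c_j : concept vector + average word vector.
P_i = g_i + word_vecs.mean(axis=0)
print(P_i.shape)  # (8,)
```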
Character feature extraction module (23): character features are also classified into two cases, conceptual word-based character feature extraction and non-conceptual word-based character feature extraction.
Character feature extraction based on concept words: character features are extracted on the basis of the extracted concept-and-word feature P_i. The specific formula is:

Q_i = P_i + (1/c_n) Σ_{k=1}^{c_n} z_k

wherein z_k is the k-th character vector of the concept word C_i, c_n is the number of characters contained in C_i, + is the vector addition operation, and Q_i is the sum of the concept vector, the average word vector and the average character vector. The character feature formula based on non-concept word features is:

w′_i = w_i + (1/f_n) Σ_{m=1}^{f_n} d_m

wherein w_i is the word vector representation of the non-concept word F_i, f_n is the number of characters in F_i, d_m is the m-th character vector, + is the vector addition operation, and w′_i is obtained by adding the non-concept word vector and the average character vector.
Since the meaning of a Chinese word generally depends on the positions of its characters (the same character expresses different meanings in different positions), extracting character position features allows the semantic information of a word to be inferred more accurately. Each character is marked B (beginning), I (middle) or E (end), which can be expressed as:

C_i^B = z_1,  C_i^I = (1/(c_n - 2)) Σ_{k=2}^{c_n-1} z_k,  C_i^E = z_{c_n}

wherein z_k is the k-th character vector of the word C_i and c_n is the number of characters it contains. The same expressions are used to extract the character position features of non-concept words.
Feature fusion (24): based on the feature extraction work, the feature fusion part is also considered in two cases, namely a feature fusion method based on concept and a feature fusion method based on non-concept words. The method fuses the extracted new feature sets through vector addition operation, mainly considers that the concept features are very important as word features in the named entity recognition task based on the partial domain ontology, and can directly extract the named entities with partial labels in the sparse labeled corpus, thereby reducing the calculated amount.
The concept-based feature fusion method fuses the extracted conceptual features, word features, character features and character position features:

V_i = Q_i + C_i^B + C_i^I + C_i^E

The non-concept-word feature fusion method fuses the extracted word features, character features and character position features:

U_i = w′_i + F_i^B + F_i^I + F_i^E

wherein Q_i is the concept-word-character feature, w′_i is the non-concept word vector plus the average character vector, f_n is the number of characters contained in the word F_i, F_i^B is the first character feature of F_i, F_i^I is the middle character feature, F_i^E is the last character feature, C_i^B, C_i^I and C_i^E are the corresponding position features of a concept word C_i, and V_i and U_i are the fused feature vectors.
Chinese domain terms are generally expressed inconsistently; especially in the medical domain, medical terms for the same concept have multiple expressions, for example chronic obstructive pulmonary disease may also be expressed as COPD. As data grow, this brings a huge amount of computation, so a method of calculating concept-feature similarity based on the ontology is adopted to reduce the dimension of the concept vectors. The rule is:

if R(g_i, o_i) > α and R(g_m, o_i) > α, then g_i and g_m are regarded as the same ontology concept o_i

wherein o_i is a concept feature in the ontology, g_i and g_m are concept features identified in the data set D, R(·,·) is the cosine similarity, and α is the similarity threshold. According to previous experiments, a threshold set too small easily causes false merges and one set too large easily causes misses, so the normal similarity threshold lies between 0.87 and 0.93; the recommended initial threshold is 0.9. The error is then minimized by gradient descent, i.e., the descending slope of the error function is calculated smoothly and continuously, the gradient becoming smaller as the error function approaches its minimum, and adjusting the step size reduces the risk of overshooting. In experiments the step size can be set to 0.01 and the threshold adjusted within the range 0.87 to 0.93 until the gradient reaches its minimum, which gives the optimal similarity threshold.
More specifically, the concept features are mapped to the domain ontology O. If two concepts g_i and g_m are close to an ontology concept o_i, the cosine similarity of g_i and g_m to o_i is calculated; if it is less than the similarity threshold α, g_i and g_m are each treated as independent concepts in the ontology; if it is greater than α, g_i and g_m can be considered the same concept, and g_i can be replaced by g_m or g_m by g_i, thereby reducing the dimension of the conceptual features and the amount of computation.
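The ontology-based concept merging can be sketched as below; the threshold α = 0.9 follows the recommended initial value in the text, while the toy 2-dimensional vectors are assumptions:

```python
import numpy as np

def cos(a, b):
    """Cosine similarity R(a, b)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def merge_concepts(o_i, g_i, g_m, alpha=0.9):
    """If both identified concept vectors are within cosine similarity
    alpha of the same ontology concept o_i, treat them as one concept
    (keep a single vector), reducing the concept-feature dimension;
    otherwise keep them as independent concepts."""
    if cos(g_i, o_i) > alpha and cos(g_m, o_i) > alpha:
        return [g_i]
    return [g_i, g_m]

o = np.array([1.0, 0.0])    # ontology concept o_i
a = np.array([0.99, 0.05])  # e.g. "慢性阻塞性肺疾病"
b = np.array([0.98, 0.08])  # e.g. "COPD"
print(len(merge_concepts(o, a, b)))  # 1
```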
Named entity network model training module (3): the fused features are used as the input of the model for training. Since named entity recognition is a sequence labeling task, context information is very important, so a neural network LSTM or GRU model with sequential memory is adopted as the training model. The specific formulas of the LSTM are as follows:
i_t = σ(W_i x_t + U_i h_{t-1} + b_i)

f_t = σ(W_f x_t + U_f h_{t-1} + b_f)

o_t = σ(W_o x_t + U_o h_{t-1} + b_o)

c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_c x_t + U_c h_{t-1} + b_c)

h_t = o_t ⊙ tanh(c_t)

wherein i_t, f_t and o_t denote the input, forget and output gates at time node t, σ is a nonlinear (sigmoid) function, and ⊙ denotes element-wise multiplication. The parameters of each control gate consist of two matrices and a bias vector, so the matrix parameters of the three control gates are W_i, U_i, W_f, U_f, W_o, U_o, and the bias parameters are b_i, b_f, b_o. The memory cell parameters of the LSTM are W_c, U_c and b_c. These parameters are updated and stored at each training step.
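A minimal NumPy sketch of one LSTM time step using exactly the parameters named above (W_i, U_i, b_i, etc.); the dimensions and random initialisation are illustrative, not the patent's trained model:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step implementing the gate equations above;
    p holds W_*, U_* and b_* for the i, f, o gates and memory cell c."""
    i_t = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev + p["bi"])
    f_t = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev + p["bf"])
    o_t = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev + p["bo"])
    c_t = f_t * c_prev + i_t * np.tanh(p["Wc"] @ x_t + p["Uc"] @ h_prev + p["bc"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(3)
d_in, d_h = 6, 4  # fused-feature and hidden dimensions (assumed)
p = {f"W{g}": rng.normal(size=(d_h, d_in)) for g in "ifoc"}
p.update({f"U{g}": rng.normal(size=(d_h, d_h)) for g in "ifoc"})
p.update({f"b{g}": np.zeros(d_h) for g in "ifoc"})

h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(5, d_in)):  # a 5-step fused-feature sequence
    h, c = lstm_step(x_t, h, c, p)
print(h.shape)  # (4,)
```

In training, these parameters would be updated by backpropagation through time rather than fixed random values.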
Named entity classifier module (4): the final entity tag classification result is generated by the softmax classifier of the neural network LSTM or GRU model.

Claims (1)

1. A named entity recognition method based on feature fusion is characterized by comprising the following four modules: the system comprises a data preprocessing module (1), a feature construction module (2), a training named entity network model module (3) and a named entity classifier module (4);
(1) Data preprocessing module
Unlabeled data is added to the labeled training set to form a sparsely labeled corpus, and the domain ontology is loaded; the text to be processed is segmented into Chinese character strings at punctuation marks, digits and space characters, and stop words are removed;
(2) Feature construction module
The module is divided into feature extraction and feature fusion, and is specifically divided into four sub-modules: semantic feature extraction, word feature extraction, character feature extraction and feature fusion;
(3) Training named entity network model module
The fused features are trained as the input of the model; since named entity recognition is a sequence labeling task, context information is extracted to assist in inferring entity categories, so a neural network model LSTM or GRU with sequential memory is adopted as the training model;
(4) Named entity classifier module
Generating a final entity tag classification result according to a softmax classifier of a neural network LSTM or GRU model;
step (2), the concrete steps are as follows:
Semantic feature extraction (21): semantic features contain two parts: conceptual features and non-conceptual word features; a concept is a special domain term composed of multiple independent words that carry semantics; a non-concept word is a single-semantic vocabulary item; for words that can be mapped to concepts in the domain ontology, concept features are extracted, since the word features of a concept cannot be extracted directly;
The preprocessed corpus is first mapped to the domain ontology, and the data is segmented by the maximum matching method into a semantic set {Y_1, …, Y_N} ∈ D, which contains a concept set and a non-concept word set {G_1, …, G_U} ∪ {F_1, …, F_V} ∈ Y; second, a CBOW model is adopted to extract semantic features, the training goal of CBOW being to maximize the following average log probability:

(1/N) Σ_{i=K}^{N-K} log Pr(Y_i | Y_{i-K}, …, Y_{i+K})

wherein K is the size of the context window of the target word in the data set D, and Y_i is a semantic unit in the data set D;
In CBOW, the probability Pr(Y_i | Y_{i-K}, …, Y_{i+K}) is calculated by the following formula:

Pr(Y_i | Y_{i-K}, …, Y_{i+K}) = exp(y_iᵀ y_0) / Σ_{y∈W} exp(yᵀ y_0)

wherein y_0 and y_i are the input and output vector representations of the target semantic unit Y_i, y_0 being the average vector representation of all contexts, ᵀ denoting the transpose, and W is the semantic dictionary;
word feature extraction (22): word feature extraction is divided into two cases, concept-based word feature extraction and non-concept-based word feature extraction;
Concept-based word feature extraction extracts word features on the basis of concept features: since a concept G = {C_1, …, C_N} is composed of multiple words, the meaning of the concept is determined by the words it contains; the formula for concept-based word feature extraction is expressed as:

P_i = g_i + (1/g_n) Σ_{j=1}^{g_n} c_j

wherein g_i is the concept vector of concept G_i, c_j is the j-th word vector in G_i, g_n is the number of words contained in concept G_i, + is the vector addition operation, and P_i is obtained by adding the concept vector and the average word vector;
the non-conceptual word feature extraction method directly extracts word features by adopting a CBOW model of the semantic feature extraction module (21);
Character feature extraction (23): character features are extracted on the basis of concept words and on the basis of non-concept words; the character feature formula based on words in a concept is:

Q_i = P_i + (1/c_n) Σ_{k=1}^{c_n} z_k

wherein z_k is the k-th character vector of C_i, c_n is the number of characters contained in the concept word C_i, + is the vector addition operation, and Q_i is obtained by adding the concept vector, the average word vector and the average character vector;
The character feature formula based on non-concept word features is:

w′_i = w_i + (1/f_n) Σ_{m=1}^{f_n} d_m

wherein w_i is the word vector representation of the non-concept word F_i, f_n is the number of characters in F_i, d_m is the m-th character vector of F_i, + is the vector addition operation, and w′_i is obtained by adding the non-concept word vector and the average character vector;
In Chinese, the same character expresses different meanings in different positions, so extracting character position features also assists in inferring the semantic information of a word; the beginning, middle and end of a word are denoted B, I and E respectively, giving:

C_i^B = z_1,  C_i^I = (1/(c_n - 2)) Σ_{k=2}^{c_n-1} z_k,  C_i^E = z_{c_n}

wherein c_n is the number of characters in the word C_i, C_i^B is the first character feature of C_i, C_i^I is the middle character feature, and C_i^E is the last character feature;
extracting the position features of the characters of the non-conceptual feature words by adopting the same expression mode;
Feature fusion (24): feature fusion is likewise divided into two cases, conceptual feature fusion and non-concept word feature fusion; in a named entity recognition task based on a partial domain ontology, concept features are as important as word features, and some unlabeled named entities can be extracted directly from the sparsely labeled corpus, thereby reducing the amount of computation;
Conceptual feature fusion: the extracted conceptual features, word features, character features and character position features are fused:

V_i = Q_i + C_i^B + C_i^I + C_i^E

Non-concept word feature fusion: the extracted word features, character features and character position features are fused:

U_i = w′_i + F_i^B + F_i^I + F_i^E

wherein Q_i is the concept-word-character feature, w′_i is the non-concept word vector plus the average character vector, f_n is the number of characters contained in the word F_i, F_i^B is the first character feature of F_i, F_i^I is the middle character feature, F_i^E is the last character feature, C_i^B, C_i^I and C_i^E are the corresponding position features of a concept word C_i, and V_i and U_i are the fused feature vectors;
The dimension of the concept vectors is reduced by calculating the similarity of ontology concept features:

if R(g_i, o_i) > α and R(g_m, o_i) > α, then g_i and g_m are regarded as the same ontology concept o_i

wherein o_i is a concept feature in the ontology, g_i and g_m are concept features identified in the data set D, R(·,·) is the cosine similarity, and α is the similarity threshold; the optimal threshold is found by gradient descent, i.e., the descending slope of the error function is calculated smoothly and continuously, the gradient becoming smaller as the minimum is approached; when the gradient reaches its minimum, α is the optimal similarity threshold.
CN201910099671.0A 2019-01-31 2019-01-31 Named entity recognition method based on feature fusion Active CN109800437B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910099671.0A CN109800437B (en) 2019-01-31 2019-01-31 Named entity recognition method based on feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910099671.0A CN109800437B (en) 2019-01-31 2019-01-31 Named entity recognition method based on feature fusion

Publications (2)

Publication Number Publication Date
CN109800437A (en) 2019-05-24
CN109800437B true CN109800437B (en) 2023-11-14

Family

ID=66560740

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910099671.0A Active CN109800437B (en) 2019-01-31 2019-01-31 Named entity recognition method based on feature fusion

Country Status (1)

Country Link
CN (1) CN109800437B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112329465A (en) * 2019-07-18 2021-02-05 株式会社理光 Named entity identification method and device and computer readable storage medium
CN110852359B (en) * 2019-07-24 2023-05-26 上海交通大学 Family tree identification method and system based on deep learning
CN110704640A (en) * 2019-09-30 2020-01-17 北京邮电大学 Representation learning method and device of knowledge graph
CN110866399B (en) * 2019-10-24 2023-05-02 同济大学 Chinese short text entity recognition and disambiguation method based on enhanced character vector
CN111489746B (en) * 2020-03-05 2022-07-26 国网浙江省电力有限公司 Power grid dispatching voice recognition language model construction method based on BERT
CN111539209B (en) * 2020-04-15 2023-09-15 北京百度网讯科技有限公司 Method and apparatus for entity classification
CN111832307A (en) * 2020-07-09 2020-10-27 北京工业大学 Entity relationship extraction method and system based on knowledge enhancement
CN112101028B (en) * 2020-08-17 2022-08-26 淮阴工学院 Multi-feature bidirectional gating field expert entity extraction method and system
CN112331332A (en) * 2020-10-14 2021-02-05 北京工业大学 Disease prediction method and system based on multi-granularity feature fusion
CN112257417A (en) * 2020-10-29 2021-01-22 重庆紫光华山智安科技有限公司 Multi-task named entity recognition training method, medium and terminal
CN113035362B (en) * 2021-02-26 2024-04-09 北京工业大学 Medical prediction method and system based on semantic graph network
CN113378569A (en) * 2021-06-02 2021-09-10 北京三快在线科技有限公司 Model generation method, entity identification method, model generation device, entity identification device, electronic equipment and storage medium
CN113361272B (en) * 2021-06-22 2023-03-21 海信视像科技股份有限公司 Method and device for extracting concept words of media asset title
CN113593709B (en) * 2021-07-30 2022-09-30 江先汉 Disease coding method, system, readable storage medium and device
CN114638222B (en) * 2022-05-17 2022-08-16 天津卓朗科技发展有限公司 Natural disaster data classification method and model training method and device thereof

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107133220A (en) * 2017-06-07 2017-09-05 东南大学 Named entity recognition method for the geography field
CN107203511A (en) * 2017-05-27 2017-09-26 中国矿业大学 Network text named entity recognition method based on neural network probabilistic disambiguation
CN107885721A (en) * 2017-10-12 2018-04-06 北京知道未来信息技术有限公司 Named entity recognition method based on LSTM
EP3407208A1 (en) * 2017-05-22 2018-11-28 Fujitsu Limited Ontology alignment apparatus, program, and method

Also Published As

Publication number Publication date
CN109800437A (en) 2019-05-24

Similar Documents

Publication Publication Date Title
CN109800437B (en) Named entity recognition method based on feature fusion
Wang et al. Application of convolutional neural network in natural language processing
CN108959252B (en) Semi-supervised Chinese named entity recognition method based on deep learning
Gasmi et al. LSTM recurrent neural networks for cybersecurity named entity recognition
CN109858041B (en) Named entity recognition method combining semi-supervised learning with user-defined dictionary
CN111737496A (en) Power equipment fault knowledge map construction method
CN111046179B (en) Text classification method for open network question in specific field
CN109325231B (en) Method for generating word vector by multitasking model
CN111738003B (en) Named entity recognition model training method, named entity recognition method and medium
CN110263325B (en) Chinese word segmentation system
CN111666758B (en) Chinese word segmentation method, training device and computer readable storage medium
CN111611802B (en) Multi-field entity identification method
CN115238690A (en) Military field composite named entity identification method based on BERT
Deng et al. Self-attention-based BiGRU and capsule network for named entity recognition
CN114781375A (en) Military equipment relation extraction method based on BERT and attention mechanism
Aggarwal et al. Recurrent neural networks
CN112699685A (en) Named entity recognition method based on label-guided word fusion
Mankolli et al. Machine learning and natural language processing: Review of models and optimization problems
CN115169349A (en) Chinese electronic resume named entity recognition method based on ALBERT
Luan Information extraction from scientific literature for method recommendation
Qi et al. Semi-supervised sequence labeling with self-learned features
Pingili et al. Target-based sentiment analysis using a bert embedded model
CN113361277A (en) Medical named entity recognition modeling method based on attention mechanism
Yang et al. Hierarchical dialog state tracking with unknown slot values
Xu et al. A Data‐Driven Model for Automated Chinese Word Segmentation and POS Tagging

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant