CN109800437A - A named entity recognition method based on feature fusion - Google Patents
A named entity recognition method based on feature fusion
- Publication number
- CN109800437A (application number CN201910099671.0A)
- Authority
- CN
- China
- Prior art keywords
- word
- feature
- concept
- character
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
A named entity recognition method based on feature fusion belongs to the field of computers. It extracts and fuses, from two perspectives, text features of different granularities, concept features, and non-concept word features, thereby improving the accuracy of named entity recognition while reducing the amount of computation. The method comprises a data preprocessing module, a feature construction module, a named entity network model training module, and a named entity classifier module, where the feature construction module comprises four submodules: semantic feature extraction, word feature extraction, character feature extraction, and feature fusion. The method uses the sequential memory property of the LSTM (Long Short-Term Memory) or GRU (Gated Recurrent Unit) neural network model to take into account the contextual information of the named entity task, and finally predicts entity class labels with softmax. During model construction, sparse data can be used as the training set and the two neural network models, LSTM and GRU, can be compared, ensuring that the invention obtains satisfactory results on entity recognition tasks.
Description
Technical field
The invention belongs to the field of computers and relates to a named entity recognition method based on feature fusion.
Background technique
In recent years, with the wide application of artificial intelligence technology in the field of natural language processing (NLP), people have been exploring domain knowledge more and more deeply. Named entity recognition is the foundation of constructing domain knowledge and a vital step: tasks such as knowledge graph construction, text retrieval, text classification, and information extraction all require named entity recognition.
Named entity recognition (NER) can be regarded as a sequence labeling task: entities are found from the extracted information and classified into a fixed set of categories. The two main traditional approaches to the NER problem are rule-based learning and supervised learning, with supervised learning dominating. Both assume that the available training data are fully labeled (that is, every entity contained in a document is annotated) and then search the documents for the label sequences of candidate entities. However, in today's era of big data, using fully annotated data as a training set is very time-consuming and labor-intensive, and owing to the particularity of most domain terminology, present-day named entity recognition still faces the following challenges: (1) a large part of real-world data is semi-structured or unstructured, and much information is narrative, without structural information, unsuitable for knowledge discovery and extraction; (2) domain entities themselves are structurally complex and the same concept has multiple expressions; for example, in the medical field, Chronic Obstructive Pulmonary Disease can be abbreviated as COPD; (3) a named entity is usually composed of multiple words, and considering word features alone isolates the semantic information. Owing to the above problems, traditional named entity recognition methods are no longer well suited to today's application scenarios.
At present, with deep learning performing excellently in every field, its application to named entity recognition tasks is also increasing, and compared with conventional methods, deep learning methods perform better. However, most deep-learning-based NER methods target English or rely only on word vectors and character vectors, without considering concept features.
In 2016, the paper "Neural Architectures for Named Entity Recognition" by Guillaume Lample et al., published at ACL, proposed a named entity recognition method combining a recurrent neural network (RNN) and conditional random fields (CRF) for identifying English person names, place names, and so on. The method extracts word features and character features with an RNN and finally classifies entities with a CRF.
In 2017, the paper "Chemical drug named entity recognition based on an attention mechanism" by You Yangpei et al., published in the Journal of Computer Research and Development, proposed an entity recognition method based on word and character features combined with an attention mechanism. The method trains an entity recognition classifier with an LSTM (Long Short-Term Memory) neural network and finally generates the entity label classification results with a CRF.
Although the above methods can complete named entity recognition tasks, existing named entity recognition methods all assume that no domain knowledge is available and learn features only from the training set. In real life, however, most domains come with partial domain knowledge; although still imperfect, this domain knowledge can help us recognize named entities in sparse data better, and can also reduce, to a certain extent, the huge amount of computation caused by inconsistent expressions.
Summary of the invention
The contents of the present invention:
A named entity recognition method based on feature fusion, the method comprising:
1. A named entity recognition method based on feature fusion is proposed. The method can not only achieve the effect of predicting new words in a sparsely labeled corpus according to the concepts contained in a domain ontology, but can also adopt a unified expression for entities whose expressions are inconsistent yet share the same concept, which both improves accuracy and reduces computational cost.
2. First, semantic features are extracted from the preprocessed data with the CBOW model. Semantic features comprise concept features and non-concept word features: for concept features, concept, word, and character features are extracted; for non-concept word features, word and character features are extracted directly.
3. Next, the extracted new feature sets undergo feature fusion, which also comprises two parts: concept-based feature fusion and non-concept-word-based feature fusion. The dimensionality of the concept features is reduced by computing concept similarity.
4. The sequential memory characteristic of the LSTM or GRU (Gated Recurrent Unit) neural network model is used to extract the contextual information relevant to the named entities, and the new feature set serves as the input to the training model.
The principle of the present invention is a named entity recognition method based on feature fusion that uses not only traditional word vector features and character vector features but also the concept features contained in words and character position features. The concept features not only reduce the word vector dimensionality but, via the concepts contained in the ontology, also achieve to a certain extent the effect of predicting new words in the sparsely labeled corpus. Finally, the contextual information is attended to by an LSTM or GRU neural network, so that the accuracy of named entity recognition is well improved.
To achieve the above objects, the present invention adopts the following technical scheme:
A named entity recognition method based on feature fusion, comprising: a data preprocessing module, a feature construction module, a named entity network model training module, and a named entity classifier module. The feature construction module mainly extracts and fuses text features of different granularities and specifically comprises four submodules: a semantic feature extraction module, a word feature extraction module, a character feature extraction module, and a feature fusion module.
Semantic feature extraction module: semantic features comprise two parts, concept features and non-concept word features. A concept is an independent semantic vocabulary item composed of multiple specialized domain terms, for example, Chronic Obstructive Pulmonary Disease; a non-concept word is an individual semantic vocabulary item, for example, "difficult". Concept features are extracted for words that can be mapped to a concept in the domain ontology; word features are extracted directly for words from which no concept can be extracted. Semantic features are finally extracted with the CBOW model.
Word feature extraction module: since a concept is composed of multiple words, for example chronic cor pulmonale, the meaning of a concept is determined by the words it contains. To keep the semantic information intact, this method considers two aspects, extracting word features based on concepts and extracting word features based on non-concept words, where the extraction of non-concept word features uses the CBOW model, the same as semantic feature extraction.
Character feature extraction module: characters are the smallest semantic units of Chinese and also carry certain semantic information; the meaning of a word is determined by the characters it contains. Moreover, the semantic information of the characters themselves can also achieve, to a certain extent, the effect of predicting new words, which helps infer the entity class; for example, the sum of the vectors of the two characters in the Chinese word for "pain" is close to the vector of the word itself. Meanwhile, the positional information of a character is also crucial: the same characters in different positions may give two words completely different meanings. Therefore, to improve the accuracy of entity recognition, this method considers not only character features but also character position features.
Feature fusion module: first, the extracted concept features, word features, and character features are fused into a new feature set. Second, a new fusion method is proposed that mainly considers two cases: for words whose concept can be extracted from the domain ontology, the concept, word, and character features are fused; for words whose concept cannot be extracted from the ontology, word features are extracted directly and blended with the character features. Finally, dimensionality reduction is performed on the concept features extracted through the domain ontology, so that the amount of computation is reduced while the accuracy of named entity recognition is improved, and the fused features are used as the input to train the model.
The present invention extracts text features of different granularities and proposes a new feature fusion method, which can not only fully learn the semantic information contained in the text but also resolve the ambiguity of domain terminology and the huge amount of computation caused by inconsistent expressions.
Detailed description of the invention
Fig. 1 is the overall architecture diagram of the named entity recognition method based on feature fusion;
Fig. 2 is the flow chart of the named entity recognition method based on feature fusion;
Specific embodiment
The features and exemplary embodiments of various aspects of the present invention are described in detail below.
The present invention recognizes named entities by extracting features of different granularities and fusing them, aiming to improve the accuracy of named entity recognition while reducing the amount of computation. The overall architecture, shown in Fig. 1, is divided into a data preprocessing module (1), a feature construction module (2), a named entity network model training module (3), and a named entity classifier module (4). The specific method flow chart is shown in Fig. 2.
Data preprocessing module (1): first, unlabeled data are added to the labeled training set to form a sparsely labeled corpus, and the domain ontology is loaded; second, all sparsely labeled corpus text is cut into shorter Chinese character strings according to special characters (including punctuation marks, digits, and whitespace), and stop words are removed.
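The preprocessing step above can be sketched as follows; this is a minimal illustration, where the stopword list is a toy placeholder (a real implementation would load a full stopword file):

```python
import re

# Hypothetical mini stopword list; the patent does not specify one.
STOP_WORDS = {"的", "了", "和"}

def preprocess(text):
    # Cut the text on runs of punctuation, digits, and whitespace
    # (anything that is not a letter), keeping Chinese and Latin chunks.
    chunks = re.split(r"[\W\d_]+", text)
    # Keep non-empty chunks that are not stop words.
    return [c for c in chunks if c and c not in STOP_WORDS]

segments = preprocess("咳嗽 的 加重")  # the stop word 的 is dropped
```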
Feature construction module (2): this module mainly extracts features of different granularities from the text and fuses the extracted features. More specifically, it is divided into semantic feature extraction, word feature extraction, character feature extraction, and feature fusion.
Semantic feature extraction module (21): the segmented character string L = (L1, ..., Ln) is mapped onto the ontology O, and the maximum matching method finds the length Lmax of the longest initial substring that matches a semantic unit (if the maximum initial match length Lmax equals the string length Llen, the whole string is one semantic unit). Lmax is then extracted from L, and the strings on both sides of the match become new strings to be segmented. All segmented strings are defined as a semantic set {Y1, ..., YN} ∈ D, which comprises the concept set and the non-concept word set, {G1, ..., GN} ∪ {F1, ..., FN} ∈ Y. Semantic features are then extracted with the CBOW model, whose training objective is to maximize the following average log probability:
  (1/N) Σ_{i=1..N} log Pr(Yi | Yi−K, ..., Yi+K)
where K is the context window of the target word in data set D, and Yi is a semantic unit in data set D.
In CBOW, the probability Pr(Yi | Yi−K, ..., Yi+K) is computed by the following formula:
  Pr(Yi | Yi−K, ..., Yi+K) = exp(y0^T yi) / Σ_{y∈W} exp(y0^T y)
where y0 and yi are the input and output vector representations of the target semantic unit Yi, y0 being the average vector of all contexts, and W is the semantic dictionary.
Word feature extraction module (22): word features are considered in two cases, concept-based word feature extraction and non-concept-word-based feature extraction.
Concept-based feature extraction: since a concept is usually composed of multiple words, G = {C1, ..., CN}, and the meaning of a concept is determined by the words it contains, this method extracts word features on the basis of the concept features. The specific formula is as follows:
  Qi = gi + (1/gn) Σ_{j=1..gn} cj
where gi is the concept vector of concept Gi, cj is the j-th word vector, gn is the number of words contained in concept Gi, Qi is obtained by adding the concept vector to its average word vector, and + is the vector addition operation. According to previous experimental experience, compared with concatenation, addition is simpler and faster to compute without losing precision, so the following methods all compute with vector addition.
Non-concept-word feature extraction uses the CBOW model of the semantic feature extraction module (21).
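The concept-based word feature Qi = gi + average word vector can be sketched as a simple numpy computation; the 4-dimensional toy vectors below are illustrative stand-ins for vectors that would really come from CBOW:

```python
import numpy as np

def concept_word_feature(g_i, word_vectors):
    """Fuse a concept vector with the average of its word vectors by
    vector addition, as in Qi = gi + (1/gn) * sum(cj)."""
    return g_i + np.mean(word_vectors, axis=0)

g = np.array([1.0, 0.0, 1.0, 0.0])        # toy concept vector gi
words = np.array([[0.0, 2.0, 0.0, 2.0],   # toy word vectors c1, c2
                  [2.0, 0.0, 2.0, 0.0]])
q = concept_word_feature(g, words)        # -> [2. 1. 2. 1.]
```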
Character feature extraction module (23): character features are likewise considered in two cases, concept-word-based character feature extraction and non-concept-word-based character feature extraction.
Concept-word-based character feature extraction: character features are extracted on the basis of the already extracted concept and word feature Qi. The specific formula is as follows:
  Q'i = Qi + (1/cn) Σ_{k=1..cn} zk
where zk is the k-th character vector, cn is the number of characters contained in the concept word Ci, + is the vector addition operation, and Q'i is obtained by adding the concept vector, its average word vector, and its average character vector. The formula for extracting character features based on non-concept words is as follows:
  F'i = wi + (1/fn) Σ_{m=1..fn} dm
where wi is the word vector representation of the non-concept word Fi, fn is the number of characters contained in the non-concept word Fi, dm is the m-th character vector, + is the vector addition operation, and F'i is obtained by adding the non-concept word vector to its average character vector.
Since the meaning of a Chinese word generally depends on where its characters sit, and the same character in different positions expresses different meanings, extracting character position features allows the semantic information of a word to be inferred more accurately. We mark each character with B (beginning), I (middle), or E (end); the formula can be expressed as:
  pos(zk) = B if k = 1; I if 1 < k < cn; E if k = cn
The position features of the characters of non-concept words are expressed in the same way.
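The B/I/E position marking described above can be sketched as follows; how a single-character word is tagged is not specified in the patent, so treating it as B is an assumption of this sketch:

```python
def position_tags(word):
    """Tag each character of a word: first B, last E, the rest I."""
    n = len(word)
    if n == 1:
        return ["B"]  # assumption: a lone character counts as a beginning
    return ["B"] + ["I"] * (n - 2) + ["E"]

tags = position_tags("慢阻肺")  # -> ['B', 'I', 'E']
```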
Feature fusion (24): building on the feature extraction work, the feature fusion part is likewise considered in two cases, a concept-based feature fusion method and a non-concept-word-based feature fusion method. This method fuses the extracted new feature sets by the vector addition operation. The main consideration is that, in a named entity recognition task based on a specific domain ontology, concept features are just as important as word and character features: they can directly extract some of the unlabeled named entities from the sparsely labeled corpus, thereby reducing the amount of computation.
Concept-based feature fusion: we merge the extracted concept features, word features, character features, and character position features; the formula is as follows:
  Ti = gi + (1/gn) Σ_{j=1..gn} cj + (1/cn) Σ_{k=1..cn} zk + (1/cn) Σ_{k=1..cn} pk
Non-concept-word-based feature fusion: we merge the extracted word features, character features, and character position features; the formula is as follows:
  T'i = wi + (1/fn) Σ_{m=1..fn} dm + (1/fn) Σ_{m=1..fn} pm
where fn is the number of characters contained in the word Fi, pm is the position feature vector of the m-th character, p1 = B being the feature of the first character of Fi, pm = I (1 < m < fn) the middle character features, and pfn = E the feature of the last character.
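The fusion-by-addition step can be sketched as below. The 3-dimensional one-hot embeddings for the B/I/E position tags are an illustrative assumption (the patent does not specify how position tags are vectorized), and all toy vectors share one dimensionality so that addition is defined:

```python
import numpy as np

# Hypothetical one-hot embeddings for the B/I/E position tags.
POS = {"B": np.array([1.0, 0.0, 0.0]),
       "I": np.array([0.0, 1.0, 0.0]),
       "E": np.array([0.0, 0.0, 1.0])}

def fuse(concept_vec, word_vecs, char_vecs, tags):
    """Concept vector + average word vector + average character vector
    + average position vector, all by element-wise addition."""
    pos_vecs = np.stack([POS[t] for t in tags])
    return (concept_vec
            + np.mean(word_vecs, axis=0)
            + np.mean(char_vecs, axis=0)
            + np.mean(pos_vecs, axis=0))

g = np.array([1.0, 1.0, 1.0])
ws = np.array([[3.0, 0.0, 0.0], [0.0, 3.0, 0.0]])
cs = np.array([[0.0, 0.0, 3.0], [3.0, 0.0, 0.0], [0.0, 3.0, 0.0]])
fused = fuse(g, ws, cs, ["B", "I", "E"])
```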
Chinese domain terminology usually has the characteristic of inconsistent expression, especially in the medical field, where medical terms with the same concept can have many expressions; for example, Chronic Obstructive Pulmonary Disease can also be expressed as COPD. As the data grow, this brings a huge amount of computation. To address this problem, we use a method that computes concept feature similarity based on the ontology to reduce the dimensionality of the concept vectors; the formula is as follows:
  R(gi, gm) = same concept, if maxsimilarity(gi, oi) > α and maxsimilarity(gm, oi) > α; otherwise gi and gm remain independent concepts
where oi is a concept feature in the ontology, gi and gm are concept features recognized in data set D, R(·) is the relation between gi and gm, maxsimilarity(·) is the cosine similarity, and α is the similarity threshold. According to previous experiments, a threshold that is too small easily causes false merges, while one that is too large easily causes misses, so the similarity threshold usually lies between 0.87 and 0.93, with a recommended initial value of 0.9. The error is computed with the gradient descent method: the error function is made smooth and continuous so that the slope of the gradient descent can be computed; the closer to the minimum, the smaller the gradient, and the overshoot risk can be reduced by adjusting the step size. During the experiments the step size can be set to 0.01, and the threshold is adjusted within the range 0.87 to 0.93 until the slope of the gradient reaches its minimum, which gives the optimal similarity threshold.
More specifically, the concept features are mapped onto the domain ontology O. If two concepts gi and gm are both close to an ontology concept oi, the similarity distances from gi and gm to oi are computed by cosine similarity. If they are less than the similarity threshold α, then gi and gm are each an independent concept in the ontology; if they are greater than α, then gi and gm can be considered the same concept, and gi can be replaced with gm or gm with gi, thereby reducing the dimensionality of the concept features and the amount of computation.
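The similarity-based merging rule above can be sketched as follows; the toy vectors are illustrative, and a real system would compare concept vectors learned by CBOW against ontology concept vectors:

```python
import numpy as np

def cosine(a, b):
    # Standard cosine similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_concept(g_i, g_m, o_i, alpha=0.9):
    """True if both recognized concepts g_i and g_m exceed the similarity
    threshold alpha with respect to the same ontology concept o_i, in which
    case one can replace the other to reduce dimensionality."""
    return cosine(g_i, o_i) > alpha and cosine(g_m, o_i) > alpha
```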
Named entity network model training module (3): the fused features are used as the input to train the model. Since named entity recognition is also called a sequence labeling task, contextual information is extremely important, and the training model uses the LSTM or GRU neural network model with a sequential memory function. The specific LSTM formulas are as follows:
  i_t = σ(W_i x_t + U_i h_{t−1} + b_i)
  f_t = σ(W_f x_t + U_f h_{t−1} + b_f)
  o_t = σ(W_o x_t + U_o h_{t−1} + b_o)
  c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_c x_t + U_c h_{t−1} + b_c)
  h_t = o_t ⊙ tanh(c_t)
where i_t, f_t, and o_t are the input, forget, and output gates at time step t, σ is a nonlinear (sigmoid) function, and ⊙ is element-wise multiplication. The parameters of each gate consist of two matrices and a bias vector; therefore the matrix parameters of the three gates are W_i, U_i, W_f, U_f, W_o, U_o, with bias parameters b_i, b_f, b_o. The memory cell parameters of the LSTM are W_c, U_c, and b_c. All these parameters are updated at every step of training.
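One step of the LSTM gate equations above can be sketched in numpy; the randomly initialized parameters stand in for trained ones, and the toy sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 4, 3  # toy input and hidden sizes

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One (W, U, b) triple per gate i, f, o and for the memory cell c.
W = {g: rng.normal(scale=0.1, size=(d_h, d_in)) for g in "ifoc"}
U = {g: rng.normal(scale=0.1, size=(d_h, d_h)) for g in "ifoc"}
b = {g: np.zeros(d_h) for g in "ifoc"}

def lstm_step(x_t, h_prev, c_prev):
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # input gate
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # forget gate
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # output gate
    c_hat = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])
    c = f * c_prev + i * c_hat      # new memory cell
    h = o * np.tanh(c)              # new hidden state
    return h, c

h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h))
```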
Named entity classifier module (4): the final entity label classification results are generated by the softmax classifier of the LSTM or GRU neural network model.
Claims (2)
1. A named entity recognition method based on feature fusion, characterized by comprising the following four modules: a data preprocessing module (1), a feature construction module (2), a named entity network model training module (3), and a named entity classifier module (4);
(1) Data preprocessing module
Unlabeled data are added to the labeled training set to form a sparsely labeled corpus, and the domain ontology is loaded; the text to be processed is cut into Chinese character strings according to punctuation marks, digits, and whitespace, and stop words are removed;
(2) Feature construction module
The module is divided into feature extraction and feature fusion, specifically into four submodules: semantic feature extraction, word feature extraction, character feature extraction, and feature fusion;
(3) Named entity network model training module
The fused features are used as the input to train the model; since named entity recognition is also called a sequence labeling task, contextual information needs to be extracted to assist in inferring the entity class, so the training model uses the LSTM or GRU neural network model with a sequential memory function;
(4) Named entity classifier module
The final entity label classification results are generated by the softmax classifier of the LSTM or GRU neural network model.
2. The named entity recognition method based on feature fusion according to claim 1, characterized in that step (2) is specifically as follows:
Semantic feature extraction (21): semantic features comprise two parts: concept features and non-concept word features; a concept is an independent semantic vocabulary item composed of multiple specialized domain terms; a non-concept word is an individual semantic vocabulary item; concept features are extracted for words that can be mapped to a concept in the domain ontology, and word features are extracted directly for words from which no concept can be extracted;
The preprocessed corpus is first mapped onto the domain ontology, and the data are cut by the maximum matching method into the semantic set {Y1, ..., YN} ∈ D, comprising the concept set and the non-concept word set {G1, ..., GN} ∪ {F1, ..., FN} ∈ Y; next, semantic features are extracted with the CBOW model, whose training objective is to maximize the following average log probability:
  (1/N) Σ_{i=1..N} log Pr(Yi | Yi−K, ..., Yi+K)
where K is the context window of the target word in data set D and Yi is a semantic unit in data set D;
In CBOW, the probability Pr(Yi | Yi−K, ..., Yi+K) is computed by the following formula:
  Pr(Yi | Yi−K, ..., Yi+K) = exp(y0^T yi) / Σ_{y∈W} exp(y0^T y)
where y0 and yi are the input and output vector representations of the target semantic unit Yi, y0 being the average vector of all contexts, T denoting transposition, and W is the semantic dictionary;
Word feature extraction (22): word feature extraction is divided into two cases, concept-based word feature extraction and non-concept-word-based feature extraction;
Concept-based word feature extraction extracts word features on the basis of the concept features; since a concept is composed of multiple words, G = {C1, ..., CN}, the meaning of a concept is determined by the words it contains; the concept-based word feature extraction formula is expressed as:
  Qi = gi + (1/gn) Σ_{j=1..gn} cj
where gi is the concept vector of concept Gi, cj is the j-th word vector, gn is the number of words contained in concept Gi, Qi is obtained by adding the concept vector to its average word vector, and + is the vector addition operation;
The word feature extraction method for non-concept words directly extracts word features with the CBOW model of the semantic feature extraction module (21);
Character feature extraction (23): character features are extracted on the basis of concept words and on the basis of non-concept words; the formula for extracting character features from the words in a concept is as follows:
  Q'i = Qi + (1/cn) Σ_{k=1..cn} zk
where zk is the k-th character vector, cn is the number of characters contained in the concept word Ci, + is the vector addition operation, and Q'i is obtained by adding the concept vector, its average word vector, and its average character vector; the formula for extracting character features from non-concept words is as follows:
  F'i = wi + (1/fn) Σ_{m=1..fn} dm
where wi is the word vector representation of the non-concept word Fi, fn is the number of characters contained in the non-concept word Fi, dm is the m-th character vector, + is the vector addition operation, and F'i is obtained by adding the non-concept word vector to its average character vector;
In Chinese, the same character in different positions expresses different meanings, so extracting character position features also helps infer the semantic information of a word; each character is marked with B (beginning), I (middle), or E (end), expressed by the formula:
  pos(zk) = B if k = 1; I if 1 < k < cn; E if k = cn
where cn is the number of characters contained in the word Ci, B is the feature of the first character in Ci, I is a middle character feature of Ci, and E is the feature of the last character in Ci;
The position features of the characters of non-concept words are expressed in the same way;
Feature fusion (24): following the above, feature fusion is likewise divided into two cases, concept feature fusion and non-concept word feature fusion; the main consideration is that, in a named entity recognition task based on a specific domain ontology, concept features are as important as word and character features and can directly extract some of the unlabeled named entities from the sparsely labeled corpus, thereby reducing the amount of computation;
Concept feature fusion: the extracted concept features, word features, character features, and character position features are merged; the concept feature fusion formula is expressed as:
  Ti = gi + (1/gn) Σ_{j=1..gn} cj + (1/cn) Σ_{k=1..cn} zk + (1/cn) Σ_{k=1..cn} pk
Non-concept word feature fusion: the extracted word features, character features, and character position features are blended; the non-concept word feature fusion formula is expressed as:
  T'i = wi + (1/fn) Σ_{m=1..fn} dm + (1/fn) Σ_{m=1..fn} pm
where fn is the number of characters contained in the word Fi, pm is the position feature vector of the m-th character, p1 = B being the first character of Fi, pm = I (1 < m < fn) the middle character features, and pfn = E the last character feature;
The dimensionality of the concept vectors is reduced with the method of computing ontology concept feature similarity; the formula is as follows:
  R(gi, gm) = same concept, if maxsimilarity(gi, oi) > α and maxsimilarity(gm, oi) > α; otherwise gi and gm remain independent concepts
where oi is a concept feature in the ontology, gi and gm are concept features recognized in data set D, R(·) is the relation between gi and gm, maxsimilarity(·) is the cosine similarity, and α is the similarity threshold, with the initial threshold set to 0.9; the error is computed with the gradient descent method, that is, the error function is made smooth and continuous so that the slope of the gradient descent can be computed, the gradient becoming smaller as the minimum is approached; when the slope of the gradient reaches its minimum, the corresponding threshold is the optimal similarity threshold.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910099671.0A CN109800437B (en) | 2019-01-31 | 2019-01-31 | Named entity recognition method based on feature fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109800437A true CN109800437A (en) | 2019-05-24 |
CN109800437B CN109800437B (en) | 2023-11-14 |
Family
ID=66560740
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910099671.0A Active CN109800437B (en) | 2019-01-31 | 2019-01-31 | Named entity recognition method based on feature fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109800437B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107133220A (en) * | 2017-06-07 | 2017-09-05 | 东南大学 | Name entity recognition method in a kind of Geography field |
CN107203511A (en) * | 2017-05-27 | 2017-09-26 | 中国矿业大学 | A kind of network text name entity recognition method based on neutral net probability disambiguation |
CN107885721A (en) * | 2017-10-12 | 2018-04-06 | 北京知道未来信息技术有限公司 | A kind of name entity recognition method based on LSTM |
EP3407208A1 (en) * | 2017-05-22 | 2018-11-28 | Fujitsu Limited | Ontology alignment apparatus, program, and method |
Cited By (28)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112329465B (en) * | 2019-07-18 | 2024-06-25 | 株式会社理光 | Named entity recognition method, named entity recognition device and computer readable storage medium |
CN112329465A (en) * | 2019-07-18 | 2021-02-05 | 株式会社理光 | Named entity identification method and device and computer readable storage medium |
CN110852359A (en) * | 2019-07-24 | 2020-02-28 | 上海交通大学 | Family tree identification method and system based on deep learning |
CN110852359B (en) * | 2019-07-24 | 2023-05-26 | 上海交通大学 | Family tree identification method and system based on deep learning |
CN110704640A (en) * | 2019-09-30 | 2020-01-17 | 北京邮电大学 | Representation learning method and device of knowledge graph |
CN110866399A (en) * | 2019-10-24 | 2020-03-06 | 同济大学 | Chinese short text entity identification and disambiguation method based on enhanced character vector |
CN110866399B (en) * | 2019-10-24 | 2023-05-02 | 同济大学 | Chinese short text entity recognition and disambiguation method based on enhanced character vector |
CN111489746A (en) * | 2020-03-05 | 2020-08-04 | 国网浙江省电力有限公司 | Power grid dispatching voice recognition language model construction method based on BERT |
CN111489746B (en) * | 2020-03-05 | 2022-07-26 | 国网浙江省电力有限公司 | Power grid dispatching voice recognition language model construction method based on BERT |
CN111539209A (en) * | 2020-04-15 | 2020-08-14 | 北京百度网讯科技有限公司 | Method and apparatus for entity classification |
CN111539209B (en) * | 2020-04-15 | 2023-09-15 | 北京百度网讯科技有限公司 | Method and apparatus for entity classification |
CN111832307A (en) * | 2020-07-09 | 2020-10-27 | 北京工业大学 | Entity relationship extraction method and system based on knowledge enhancement |
CN112101028A (en) * | 2020-08-17 | 2020-12-18 | 淮阴工学院 | Multi-feature bidirectional gating field expert entity extraction method and system |
CN112101028B (en) * | 2020-08-17 | 2022-08-26 | 淮阴工学院 | Multi-feature bidirectional gating field expert entity extraction method and system |
CN112015901A (en) * | 2020-09-08 | 2020-12-01 | 迪爱斯信息技术股份有限公司 | Text classification method and device and warning situation analysis system |
CN112331332A (en) * | 2020-10-14 | 2021-02-05 | 北京工业大学 | Disease prediction method and system based on multi-granularity feature fusion |
CN112257417A (en) * | 2020-10-29 | 2021-01-22 | 重庆紫光华山智安科技有限公司 | Multi-task named entity recognition training method, medium and terminal |
CN113035362A (en) * | 2021-02-26 | 2021-06-25 | 北京工业大学 | Medical prediction method and system based on semantic graph network |
CN113035362B (en) * | 2021-02-26 | 2024-04-09 | 北京工业大学 | Medical prediction method and system based on semantic graph network |
CN113378569A (en) * | 2021-06-02 | 2021-09-10 | 北京三快在线科技有限公司 | Model generation method, entity identification method, model generation device, entity identification device, electronic equipment and storage medium |
CN113361272B (en) * | 2021-06-22 | 2023-03-21 | 海信视像科技股份有限公司 | Method and device for extracting concept words of media asset title |
CN113361272A (en) * | 2021-06-22 | 2021-09-07 | 海信视像科技股份有限公司 | Method and device for extracting concept words of media asset title |
CN113593709B (en) * | 2021-07-30 | 2022-09-30 | 江先汉 | Disease coding method, system, readable storage medium and device |
CN113593709A (en) * | 2021-07-30 | 2021-11-02 | 江先汉 | Disease coding method, system, readable storage medium and device |
CN114925198A (en) * | 2022-04-11 | 2022-08-19 | 华东师范大学 | Knowledge-driven text classification method fusing character information |
CN114925198B (en) * | 2022-04-11 | 2024-07-12 | 华东师范大学 | Knowledge-driven text classification method integrating character information |
CN114638222A (en) * | 2022-05-17 | 2022-06-17 | 天津卓朗科技发展有限公司 | Natural disaster data classification method and model training method and device thereof |
CN115577117A (en) * | 2022-09-26 | 2023-01-06 | 兰州理工大学 | Ontology matching method based on true-false triple-connected neural network |
Also Published As
Publication number | Publication date |
---|---|
CN109800437B (en) | 2023-11-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109800437A (en) | Named entity recognition method based on feature fusion | |
CN109902145B (en) | Attention mechanism-based entity relationship joint extraction method and system | |
CN111738003B (en) | Named entity recognition model training method, named entity recognition method and medium | |
CN111046179B (en) | Text classification method for open network question in specific field | |
CN109858041B (en) | Named entity recognition method combining semi-supervised learning with user-defined dictionary | |
CN113268995B (en) | Chinese academy keyword extraction method, device and storage medium | |
CN108628823A (en) | Named entity recognition method combining attention mechanism and multi-task collaborative training | |
CN110263325B (en) | Chinese word segmentation system | |
CN113591483A (en) | Document-level event argument extraction method based on sequence labeling | |
CN111666758B (en) | Chinese word segmentation method, training device and computer readable storage medium | |
CN113673254B (en) | Knowledge distillation position detection method based on similarity maintenance | |
CN111222318A (en) | Trigger word recognition method based on two-channel bidirectional LSTM-CRF network | |
CN113515632A (en) | Text classification method based on graph path knowledge extraction | |
Song et al. | Classification of traditional chinese medicine cases based on character-level bert and deep learning | |
CN112699685A (en) | Named entity recognition method based on label-guided word fusion | |
CN111666752A (en) | Circuit teaching material entity relation extraction method based on keyword attention mechanism | |
CN114781375A (en) | Military equipment relation extraction method based on BERT and attention mechanism | |
CN115169349A (en) | Chinese electronic resume named entity recognition method based on ALBERT | |
CN111145914A (en) | Method and device for determining lung cancer clinical disease library text entity | |
CN113254602B (en) | Knowledge graph construction method and system for science and technology policy field | |
Cao et al. | Knowledge guided short-text classification for healthcare applications | |
CN113934835A (en) | Retrieval type reply dialogue method and system combining keywords and semantic understanding representation | |
Nouhaila et al. | Arabic sentiment analysis based on 1-D convolutional neural network | |
CN111813927A (en) | Sentence similarity calculation method based on topic model and LSTM | |
Mars et al. | Combination of DE-GAN with CNN-LSTM for Arabic OCR on Images with Colorful Backgrounds |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||