CN110110335A - A named entity recognition method based on a stacked model - Google Patents


Info

Publication number
CN110110335A (application CN201910384659.4A)
Authority
CN
China
Prior art keywords: model, named entity, result, entity, label
Prior art date
Legal status (assumed; not a legal conclusion)
Granted
Application number
CN201910384659.4A
Other languages
Chinese (zh)
Other versions
CN110110335B (en)
Inventor
吴骏
顾溢
张哲成
谈志文
李宁
Current Assignee
Nanjing University
Original Assignee
Nanjing University
Priority date (assumed)
Filing date
Publication date
Application filed by Nanjing University
Priority to CN201910384659.4A (patent CN110110335B)
Publication of CN110110335A
Application granted
Publication of CN110110335B
Legal status: Active
Anticipated expiration


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 — Handling natural language data
    • G06F40/20 — Natural language analysis
    • G06F40/279 — Recognition of textual entities
    • G06F40/289 — Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 — Named entity recognition
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/04 — Architecture, e.g. interconnection topology
    • G06N3/045 — Combinations of networks
    • G06N3/08 — Learning methods

Abstract

A complex Chinese named entity recognition method based on a stacked model. 1) Model training stage: a. train a low-level BiLSTM-CRF model on an annotated named entity corpus under an improved loss function and save it; b. train a high-level BiLSTM-CRF model on an annotated named entity recognition corpus and save it. 2) Model prediction stage: the corpus to be predicted is fed into the low-level model, which identifies coarse-grained named entities and passes them to the high-level model as preliminary results. The high-level model continues to recognize the preliminary results; any result that is not a single named entity is fed back into the high-level model until every result is a single named entity. 3) Output stage: all named entities obtained by passing the corpus through the stacked model, i.e. all entities output by the high-level network, are collected as the final result of the whole recognition process.

Description

A named entity recognition method based on a stacked model
Technical field
The present invention relates to a named entity recognition method based on a stacked model. The method addresses the recognition of complex Chinese named entities in Internet text.
Background technique
Natural language processing (NLP) is a subfield of computer information engineering. Its goal is to manage and analyze massive text data so that computer programs can use lexical, syntactic, and semantic information to recognize, understand, and generate natural language text, in tasks such as word segmentation, named entity recognition, relation extraction, machine translation, natural language generation, question answering, and sentiment analysis. NLP techniques have matured through exploration and research on rule-based and statistical learning methods, and in recent years representation learning and deep neural network methods have brought new directions and development, achieving good and stable results on part of the NLP problems. NLP has applications in many industries: comment text from social media can help monitor trends in public opinion; financial news contains much economic data and company information that can assist the execution of quantitative trading; mass text data from news media can be used to model user interest topics and efficiently filter and recommend information for readers; machine translation provides automatic translation of text between different languages, promoting cross-cultural communication and exchange; and knowledge graph technology can link different people and organizations to build knowledge bases serving many business applications.
Named entity recognition (NER), also known as entity extraction or entity chunking, is a subfield of NLP. It aims to extract the named entities mentioned in unstructured text, including person names, organization names, location names, medical terms, legal terms, times, quantities, monetary values, and so on. For example, financial articles require accurate extraction of enterprise names, key person names, and monetary values; political news requires politicians' names, country and place names, organization names, and event names; judgment documents require information such as defendant names, penalty clauses, sentencing details, and associated organizations. NER is one of the most basic tasks of NLP, and the precision and recall of NER directly affect downstream NLP research directions such as information extraction, text classification, text summarization, and question answering.
In practical engineering, Chinese NER still has many open problems. Engineering projects using NER systems encounter issues that rarely or never arise in experiments on standard datasets: (1) many nested place, person, and organization names appear in practical applications, and model accuracy drops on such entities; (2) Internet text is messy in structure and varied in form, and feeding it directly into a Chinese NER system gives poor results; (3) when the input text is long, the ability of an NER model degrades noticeably, so the text must be segmented sensibly to improve recognition. We analyze these named entity cases one by one:
1) Nested named entities. For example, the target entity is "Shanghai agriculture firm", which contains the sub-entity place name "Shanghai". After BiLSTM computes the probability of each label, the conditional random field compares the scores of the candidate sequences, and the final system fails to recognize "Shanghai agriculture firm" as a whole. Analysis of this result shows that the "B-LOC" and "I-LOC" label scores of the two characters of "Shanghai" are so high that, even after adding the non-entity label scores of the following characters, their path still outscores the label sequence "B-ORG" "I-ORG" "I-ORG" "I-ORG" "I-ORG" "I-ORG" for "Shanghai agriculture firm", lowering the success rate of recognizing the whole entity as an organization name.
2) Erroneous context association. Boundary errors caused by semantic association between a named entity and its surrounding context are another common class of errors made by NER systems. As the example table "mix" shows, the named entity "Nanjing banking operation" center" is mislabeled together with its preceding context, and its boundary is wrong. When the analyzed text is long, factors such as contextual semantic association and insufficient annotated training data make an NER system grasp entity boundaries inaccurately.
3) Long sentences with complex components. Overly long text is harder for an NER model, especially in engineering applications, where the regularity of web text and its punctuation is far worse than in standard datasets. When the text is very long, the CRF algorithm still outputs the maximum-scoring path computed by Viterbi dynamic programming as the final result, but in practice accuracy often drops. As the example table "long" shows, recognition errors appear when a long text is input, whereas after cutting it into shorter texts the system successfully recognizes the correct named entities. This shows that the influence of excessive text length on an NER system is a major problem to be solved in practical applications.
Summary of the invention
For the above reasons, the object of the present invention is a named entity recognition method based on a stacked model: a Chinese NER model built on stacked components to handle Chinese NER under complex conditions. The method addresses the recognition of complex Chinese named entities in Internet text. The stacked model consists of two BiLSTM-CRF named entity models, with different improvements made to the low-level and high-level models for their different purposes.
The technical scheme of the invention is a complex Chinese named entity recognition method based on a stacked model, characterized by the following steps. 1) Model training stage: train a low-level BiLSTM-CRF model and a high-level BiLSTM-CRF model separately on a Chinese named entity dataset, save both, and stack the two models for named entity recognition; a. train the low-level BiLSTM-CRF model on an annotated named entity corpus under an improved loss function and save it; b. train the high-level BiLSTM-CRF model on an annotated named entity recognition corpus and save it. 2) Model prediction stage: feed the corpus to be predicted into the low-level model, which identifies coarse-grained named entities and passes them to the high-level model as preliminary results; the high-level model continues to recognize the preliminary results, and any result that is not a single named entity is fed back into the high-level model until all results are single named entities; a. feed the corpus to be predicted into the low-level model with the optimized decoding method and send the coarse-grained results to the high-level model; b. feed the coarse-grained named entities into the high-level network for recognition; c. judge the high-level output: if it can be divided further, return to 2)b; if not, output the result. 3) Output stage: collect all named entities obtained by passing the corpus through the stacked model, i.e. all entities output by the high-level network, as the final result of the whole recognition process. Fig. 4 shows the basic framework of the invention.
Step 1) trains the low-level named entity recognition model and the high-level named entity recognition model separately. Both are BiLSTM-CRF models, but their training methods and purposes differ.
Beneficial effects: the invention improves the success rate of recognizing a whole entity as an organization name. When the analyzed text is long or the annotated training data are insufficient, the NER system still grasps the boundaries of named entities accurately, and accuracy is preserved even when the text is very long.
Detailed description of the invention
Fig. 1 is the flow chart of the BiLSTM named entity recognition model.
Fig. 2 is the structure chart of the high-level model in the stacked model.
Fig. 3 is the flow chart of the whole stacked model.
Fig. 4 is the basic flow block diagram of the invention.
Specific embodiment
As shown in Fig. 1, the BiLSTM-CRF named entity model consists of a distributed embedding layer, a deep neural network layer, and a conditional random field layer. The distributed embedding module uses word2vec to train word vectors, so that the distributed representation of text captures the semantic connections between words and eliminates the gap between isolated word symbols. Using pre-trained word vectors as the input of deep learning for natural language problems has become a classic, mature method. Much work shows that, compared with random embeddings, pre-trained word vectors make the whole neural network converge faster and give a noticeable improvement in accuracy and recall; the advantage of the word2vec method is especially obvious when the amount of data is small.
Information sequences have complex temporal dependencies between their elements and, importantly for the NER task, vary in length; a recurrent neural network (RNN) is a good fit for this. The LSTM model is a variant of the RNN that, besides modeling sequences well, is easy to train and can preserve important information over long spans. The bidirectional long short-term memory network (BiLSTM) is an improved version of LSTM: a traditional RNN takes the preceding context as input and infers what follows from it, while a bidirectional RNN also uses information in the reverse direction, letting the model learn from both directions. This matches how Chinese forms words and builds sentences; BiLSTM is the bidirectional version of LSTM.
The conditional random field (CRF) layer handles the dependencies at the output level and can fully consider contextual relations when predicting labels. Importantly, the CRF's Viterbi algorithm finds the maximum-probability path by dynamic programming, which fits the NER task well and avoids illegal sequences in the result, such as a "B-LOC" label followed by an "I-ORG" label. The sequence labeling module therefore uses a CRF model.
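As an illustration of why the CRF layer can rule out sequences like "B-LOC" followed by "I-ORG": a minimal sketch of a transition matrix that assigns illegal transitions an effectively infinite penalty. The label set here is an illustrative assumption, not the patent's exact inventory.

```python
labels = ["O", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]
idx = {l: i for i, l in enumerate(labels)}
NEG_INF = float("-inf")

# Transition score matrix A: A[i][j] = score of moving from label i to label j.
A = [[0.0] * len(labels) for _ in labels]

# Forbid "I-X" following anything other than "B-X" or "I-X" of the same type.
for j, to in enumerate(labels):
    if to.startswith("I-"):
        ent = to[2:]
        for i, frm in enumerate(labels):
            if frm not in (f"B-{ent}", f"I-{ent}"):
                A[i][j] = NEG_INF
```

A Viterbi decoder that adds these transition scores to each path can then never output "B-LOC" followed by "I-ORG", since that path's score is minus infinity.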
In training, the loss function of the low-level BiLSTM-CRF model is improved. The purpose of this layer is to recognize the corpus at coarse granularity without losing potential named entity information, so the traditional BiLSTM-CRF is improved. The loss function used for model training is modified as follows:
For an input sequence X = (x_1, x_2, …, x_n), let P be the matrix output by the BiLSTM network after the sentence passes through the distributed embedding. P has dimension n × k, where k is the number of distinct labels; P(i, j) is the score of tagging the i-th character with the j-th label, called the emission probability. For a candidate prediction sequence y = (y_1, y_2, …, y_n), the score of this sequence is defined as:

score(X, y) = Σ_{i=0..n} A(y_i, y_{i+1}) + Σ_{i=1..n} P(i, y_i)
where A is the transition probability matrix of size k × k and A(i, j) is the transition probability from label i to label j. To recognize the text preliminarily while omitting as little entity information as possible, the score formula is optimized to:

score'(X, y) = Σ_{i=0..n} A(y_i, y_{i+1}) + Σ_{i=1..n} w_i · P(i, y_i), where w_i = λ if the true label of x_i is "O" and w_i = 1 otherwise
In the formula, λ is a penalty factor with a value between 0 and 1. The adjustment means that, when computing the score of the label-sequence path, positions whose true label is "O" (not a named entity) are multiplied by a penalty coefficient. In real corpora, the named entities we care about make up a small fraction of the whole dataset, which biases the model toward predicting the non-entity label, since that keeps the loss small. This preference conflicts with the goal of finding all named entities. The penalty factor reduces the weight of training positions whose true label is "O" and, by contrast, raises the weight of positions whose label belongs to any kind of named entity. When the loss is computed this way, predictions for characters whose true labels are "B-PER", "I-PER", "B-ORG", and so on have greater influence on network training. To make the low-level network more inclined to output named entity labels rather than the non-entity label when decoding, the probability of assigning the label "O" (i.e. non-entity) to each character during low-level decoding is multiplied by a penalty factor μ, with μ between 0 and 1, so that sequences containing more named entity labels more easily obtain high scores and are output as the result.
When computing the CRF path score, the λ penalty factor reduces the predicted score weight of non-entity labels, so as to improve recall.
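The λ-weighted path score described above can be sketched in pure Python. The label set, scores, and λ value below are illustrative assumptions, not the patent's data.

```python
LAMBDA = 0.5  # penalty factor λ, between 0 and 1 (illustrative value)

def penalized_path_score(emissions, transitions, path, gold, lam=LAMBDA):
    """Score of a label path in which emission scores at positions whose
    true (gold) label is "O" are down-weighted by the penalty factor λ."""
    score = 0.0
    prev = None
    for i, tag in enumerate(path):
        w = lam if gold[i] == "O" else 1.0
        score += w * emissions[i][tag]          # weighted emission score
        if prev is not None:
            score += transitions[(prev, tag)]   # transition score
        prev = tag
    return score

# Toy example: two positions, gold labels ["O", "B-LOC"].
emissions = [{"O": 2.0, "B-LOC": 1.0}, {"O": 0.5, "B-LOC": 1.5}]
transitions = {("O", "B-LOC"): 0.1, ("O", "O"): 0.2,
               ("B-LOC", "O"): 0.0, ("B-LOC", "B-LOC"): 0.0}
gold = ["O", "B-LOC"]
s = penalized_path_score(emissions, transitions, gold, gold)
# s = 0.5*2.0 + 1.0*1.5 + 0.1 = 2.6
```

The "O" position contributes only half of its emission score, so paths that replace "O" with an entity label lose less ground, which is the bias toward recall the text describes.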
The decoding method is improved as follows:
S01: compute the emission probability matrix from the word-vector matrix of the text to be predicted through the model;
S02: multiply the entries of the emission probability matrix whose label is "O" (non-entity) by the penalty factor μ;
S03: create a (sequence length × label count) zero matrix S to record the score of each sub-path in the dynamic program;
S04: create a (sequence length × label count) matrix B to record the path clues of the S matrix, storing for each node its predecessor node;
S05: traverse from the first node to the last: using the emission and transition probability matrices, compute in S the maximum-probability path from the start to each label of each node, recording the path in B;
S06: find the score of the maximum-probability path in the last column of S, trace back through the B matrix, and output the label sequence of that path as the final result.
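Steps S01–S06 can be sketched as a small Viterbi decoder in which the "O" emission scores are multiplied by μ. The labels, score values, and μ here are illustrative assumptions; in this toy example the penalty flips the winning path from all-"O" to an entity.

```python
def viterbi_with_penalty(emissions, transitions, labels, mu=0.7):
    """Viterbi decoding in which the emission score of the non-entity
    label "O" is multiplied by the penalty factor μ (step S02), so that
    paths containing entity labels score relatively higher."""
    n = len(emissions)
    # S02: penalize the "O" column of the emission matrix.
    em = [{t: (mu * sc if t == "O" else sc) for t, sc in row.items()}
          for row in emissions]
    # S03/S04: score matrix S and backpointer matrix B.
    S = [dict(em[0])]
    B = [{}]
    # S05: dynamic-programming forward pass.
    for i in range(1, n):
        S.append({})
        B.append({})
        for t in labels:
            best_prev = max(labels, key=lambda p: S[i-1][p] + transitions[(p, t)])
            S[i][t] = S[i-1][best_prev] + transitions[(best_prev, t)] + em[i][t]
            B[i][t] = best_prev
    # S06: backtrack from the best final label.
    last = max(labels, key=lambda t: S[-1][t])
    path = [last]
    for i in range(n - 1, 0, -1):
        path.append(B[i][path[-1]])
    return list(reversed(path))

labels = ["O", "B-LOC", "I-LOC"]
transitions = {(a, b): 0.0 for a in labels for b in labels}
transitions[("O", "I-LOC")] = -100.0   # illegal: I-LOC cannot follow O
emissions = [{"O": 1.0, "B-LOC": 0.9, "I-LOC": 0.0},
             {"O": 1.0, "B-LOC": 0.0, "I-LOC": 0.9}]
# viterbi_with_penalty(emissions, transitions, labels, mu=1.0) → ["O", "O"]
# viterbi_with_penalty(emissions, transitions, labels, mu=0.7) → ["B-LOC", "I-LOC"]
```

Without the penalty (μ = 1) the all-"O" path scores 2.0 and wins; with μ = 0.7 it drops to 1.4, and the entity path "B-LOC" "I-LOC" (1.8) is output instead.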
As shown in Fig. 2, the high-level network model receives the output of the low-level network model and processes the received text further; the key is to find the boundaries of named entities accurately. When training the high-level BiLSTM-CRF model, a convolutional neural network is added after the character distributed embedding to improve the high-level model's ability to judge named entity boundaries.
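The role of the added convolution can be illustrated with a single-channel 1-D convolution over character embeddings. The dimensions, kernel, and values below are illustrative assumptions; a real model would use a deep-learning framework's Conv1d with many channels.

```python
def conv1d_same(embeddings, kernel):
    """1-D convolution over a character-embedding sequence with zero
    padding ("same" length). Each output mixes a local window around a
    character - the kind of local evidence used to judge entity boundaries."""
    k = len(kernel)              # kernel width, assumed odd
    half = k // 2
    dim = len(embeddings[0])
    padded = [[0.0] * dim] * half + embeddings + [[0.0] * dim] * half
    out = []
    for i in range(len(embeddings)):
        window = padded[i:i + k]
        # One output feature per position: kernel-weighted sum over the window
        # (each embedding collapsed to its coordinate sum for simplicity).
        out.append(sum(kernel[j] * sum(window[j]) for j in range(k)))
    return out

# Four characters with 2-dim "embeddings" and a width-3 averaging kernel.
emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
feats = conv1d_same(emb, [0.25, 0.5, 0.25])
```

In the model these convolved features would be concatenated with (or fed into) the BiLSTM input, so that each character's representation already encodes its neighbors.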
The whole stacked model is constructed as shown in Fig. 3. In training, the low-level BiLSTM-CRF model is trained with the loss function optimized by the penalty factor λ; a convolutional layer attending to local information is added when training the high-level BiLSTM-CRF model; and the two models are saved separately. In prediction, the test corpus is fed into the saved low-level model, the decoding process is optimized with the penalty factor μ, the coarse-grained named entities are extracted, and the results are sent to the high-level model. The high-level model carefully recognizes the named entities in this material; its output is judged, and prediction ends when the high-level result contains only single named entities. The single named entities output, together with their boundary information, form the final recognition result.
The model prediction stage is characterized as follows: step 2) feeds the corpus to be predicted into the low-level model, optimizes the decoding method, and recognizes the coarse-grained named entities. The preliminary results are fed into the high-level entity recognition model, which recognizes them accurately; each result is judged to be a single named entity or not, output if it is, and passed back into the high-level network if it is not, until the output is a single named entity.
In the model prediction stage, the low-level network decoding method is characterized as follows: to make the low-level network more inclined to output named entity labels rather than the non-entity label when decoding, during the low-level decoding computation the probability of assigning the label "O" (i.e. non-entity) to each character is multiplied by the penalty factor μ, with μ between 0 and 1, so that sequences containing more named entity labels more easily obtain high scores and are output as the result.
In the model prediction stage, the high-level network prediction method is characterized as follows: the legal named entity sequences recognized by the low-level network are passed to the high-level network as the coarse-grained recognition result. For the high-level network, the Viterbi decoding is left unchanged, to guarantee the accuracy and boundary strictness of the result. The high-level network predicts on the text passed in from the low-level network, with the following cases: 1) the high-level network recognizes a single entity, and this accurately identified entity is taken as a final output result; 2) the high-level network recognizes multiple entities, each of which is passed back into the high-level network as input, repeating the above steps.
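The case analysis above amounts to a loop that re-feeds divisible results until only single entities remain. A minimal sketch, where `toy_recognize` is a hypothetical stand-in for the high-level BiLSTM-CRF model and the multi-entity example strings are invented for illustration:

```python
def refine(spans, recognize):
    """Repeatedly re-feed each coarse span into the high-level recognizer
    until every result is a single named entity (the patent's case 1/2 loop)."""
    finals = []
    stack = list(spans)          # coarse-grained results from the low level
    while stack:
        span = stack.pop()
        results = recognize(span)
        if len(results) == 1 and results[0] == span:
            finals.append(span)     # case 1: single entity, final output
        else:
            stack.extend(results)   # case 2: divisible, recognize each part again
    return finals

# Hypothetical recognizer: splits a nested span into its parts once.
def toy_recognize(span):
    table = {"Shanghai agriculture firm": ["Shanghai agriculture firm"],
             "Nanjing University Gulou campus": ["Nanjing University", "Gulou campus"],
             "Nanjing University": ["Nanjing University"],
             "Gulou campus": ["Gulou campus"]}
    return table[span]

out = refine(["Nanjing University Gulou campus", "Shanghai agriculture firm"],
             toy_recognize)
```

The stack here mirrors the stack data structure the output stage uses to collect the entities received across the stacked networks.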
In the output stage, the output of the final high-level network is taken as the output of the stacked model as a whole. A stack data structure collects the legal named entities and their boundaries received across the stacked networks, and the set of named entities is taken as the prediction result for the corpus.
In summary, the Chinese named entity recognition method based on a stacked model of the invention uses the coarse-grained entity information recognized by the low-level model to cut the text sensibly without omitting named entity information, providing effective help for the high-level model's accurate recognition. Adding a convolution-pooling step to the high-level model improves its judgment of named entity boundaries.
The above is only a preferred embodiment of the invention. It should be pointed out that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the invention, and these improvements and modifications should also be regarded as falling within the protection scope of the invention.

Claims (10)

1. A complex Chinese named entity recognition method based on a stacked model, characterized by the following steps: 1) model training stage: train a low-level BiLSTM-CRF model and a high-level BiLSTM-CRF model separately on a Chinese named entity dataset, save both, and stack the two models for named entity recognition; a. train the low-level BiLSTM-CRF model on an annotated named entity corpus under an improved loss function and save it; b. train the high-level BiLSTM-CRF model on an annotated named entity recognition corpus and save it; 2) model prediction stage: feed the corpus to be predicted into the low-level model, which identifies coarse-grained named entities and passes them to the high-level model as preliminary results; the high-level model continues to recognize the preliminary results, and any result that is not a single named entity is fed back into the high-level model until all results are single named entities; a. feed the corpus to be predicted into the low-level model with the optimized decoding method and send the coarse-grained results into the high-level model; b. feed the coarse-grained named entities into the high-level network for recognition; c. judge the high-level output: if it can be divided further, return to 2)b; if not, output the result; 3) output stage: collect all named entities obtained by passing the corpus through the stacked model, i.e. all entities output by the high-level network, as the final result of the whole recognition process.
2. The complex Chinese named entity recognition method based on a stacked model according to claim 1, characterized in that: in step 1), the low-level named entity recognition model and the high-level named entity recognition model are trained separately, and both are BiLSTM-CRF models.
3. The complex Chinese named entity recognition method based on a stacked model according to claim 2, characterized in that: the low-level named entity recognition model recognizes the corpus at coarse granularity without losing potential named entity information; the traditional BiLSTM-CRF is improved, modifying the loss function used for model training as follows:
For an input sequence X = (x_1, x_2, …, x_n), let P be the matrix output by the BiLSTM network after the sentence passes through the distributed embedding; P has dimension n × k, where k is the number of distinct labels; P(i, j) is the score of tagging the i-th character with the j-th label, called the emission probability; for a candidate prediction sequence y = (y_1, y_2, …, y_n), the score of this sequence is defined as: score(X, y) = Σ_{i=0..n} A(y_i, y_{i+1}) + Σ_{i=1..n} P(i, y_i);
where A is the transition probability matrix of size k × k and A(i, j) is the transition probability from label i to label j;
under the premise of not omitting entity information, the text is preliminarily recognized, with the score formula optimized to: score'(X, y) = Σ_{i=0..n} A(y_i, y_{i+1}) + Σ_{i=1..n} w_i · P(i, y_i), where w_i = λ if the true label of x_i is "O" and w_i = 1 otherwise;
in the formula, λ is a penalty factor with a value between 0 and 1; the adjustment means that, when computing the score of the label-sequence path, positions whose true label is "O" (not a named entity) are multiplied by a penalty coefficient; the penalty factor reduces the weight of training positions whose true label is "O" and, by contrast, raises the weight of positions whose label belongs to any kind of named entity; when the loss is computed this way, predictions for characters whose true labels are "B-PER", "I-PER", "B-ORG", and so on have greater influence on network training; to make the low-level network more inclined to output named entity labels rather than the non-entity label when decoding, during the low-level decoding computation the probability of assigning the label "O" (i.e. non-entity) to each character is multiplied by a penalty factor μ, with μ between 0 and 1, so that sequences containing more named entity labels more easily obtain high scores and are output as the result.
4. The complex Chinese named entity recognition method based on a stacked model according to claim 2, characterized in that: the high-level named entity recognition model receives the output of the low-level network model and processes the received text further, the key being to find the boundaries of named entities accurately; when training the high-level BiLSTM-CRF model, a convolutional neural network (CNN) model is added after the character distributed embedding to improve the high-level model's ability to judge named entity boundaries; the purpose of adding the CNN is to extract finer features from the character distributed representation, so that local information forms more effective connections and the recognition of entity boundaries is more accurate.
5. The complex Chinese named entity recognition method based on a stacked model according to claim 1, characterized in that: in the model prediction stage, step 2) feeds the corpus to be predicted into the low-level model, optimizes the decoding method, and recognizes the coarse-grained named entities; the preliminary results are fed into the high-level entity recognition model, which recognizes them accurately; each result is judged to be a single named entity or not, output if it is, and passed back into the high-level network if it is not, until the output is a single named entity.
6. The complex Chinese named entity recognition method based on a stacked model according to claim 5, characterized in that: in the model prediction stage, the high-level network prediction method takes the legal named entity sequences recognized by the low-level network as the coarse-grained recognition result and passes them to the high-level network; for the high-level network, the Viterbi decoding is left unchanged, to guarantee the accuracy and boundary strictness of the result; the high-level network predicts on the text passed in from the low-level network, with the following cases: 1) the high-level network recognizes a single entity, and this accurately identified entity is taken as a final output result; 2) the high-level network recognizes multiple entities, each of which is passed back into the high-level network as input, repeating the above steps.
7. The complex Chinese named entity recognition method based on a stacked model according to claim 1, characterized in that: in the output stage, the output of the final high-level network is taken as the output of the stacked model as a whole; a stack data structure collects the legal named entities and their boundaries received across the stacked networks, and the set of named entities is taken as the prediction result of the corpus to be predicted.
8. The complex Chinese named entity recognition method based on a stacked model according to claim 1, characterized in that: the BiLSTM-CRF named entity model consists of a distributed embedding layer, a deep neural network layer, and a conditional random field layer; the distributed embedding module uses word2vec to train word vectors, so that the distributed representation of text captures the semantic connections between words and eliminates the gap between isolated word symbols; pre-trained word vectors are used as the input of the deep neural network layer for deep learning on natural language problems.
9. The complex Chinese named entity recognition method based on a stacked model according to claim 1, characterized in that: the bidirectional long short-term memory network (BiLSTM) is an improved version of the LSTM model; a traditional RNN takes the preceding context as input and infers what follows from it, while a bidirectional RNN also uses information in the reverse direction, letting the model learn from both directions, which matches how Chinese forms words and builds sentences; BiLSTM is the bidirectional version of LSTM.
10. The complex Chinese named entity recognition method based on a stacked model according to claim 1, characterized in that: the conditional random field (CRF) layer handles the dependencies at the output level separately, so that contextual relations are fully considered when predicting labels, avoiding illegal sequences in the result such as a "B-LOC" label followed directly by an "I-ORG" label;
When computing the conditional random field path score, the penalty factor λ is used to reduce the predicted weighted score of non-named-entity labels, so as to improve recall;
The decoding method is improved as follows:
S01: the emission probability matrix is obtained after the word vector matrix of the text to be predicted passes through the model;
S02: in the current emission probability matrix, the probabilities of labels corresponding to non-named entities are multiplied by the penalty factor μ;
S03: a zero matrix S of size sequence length × number of labels is created to record each sub-path score of the dynamic programming;
S04: a matrix B of size sequence length × number of labels is created to record the path clues of the S matrix, recording the path by storing the predecessor of the current node;
S05: traversing from the first node to the last node, the maximum-probability path from the starting point to each label of each node is computed in the S matrix from the emission probability matrix and the transition probability matrix, and the path is recorded in B;
S06: the score of the maximum-probability path is found in the last column of S, and the B matrix is traversed by backtracking to recover the label sequence of this maximum-probability path as the final output.
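Steps S01–S06 can be sketched as a modified Viterbi decoder: before the standard dynamic programming sweep, the emission scores of the non-entity label ("O") are multiplied by a penalty factor, biasing decoding toward entity labels to raise recall. This is a simplified sketch under stated assumptions: the toy label set is illustrative, scores are treated as log-probabilities (so multiplying a negative score by μ > 1 penalizes it), and the transition matrix is not learned here.

```python
import numpy as np

labels = ["B-LOC", "I-LOC", "O"]

def decode(emissions, transitions, mu):
    """Viterbi decoding with a penalty factor on the non-entity label (S01-S06)."""
    T, L = emissions.shape
    emissions = emissions.copy()
    emissions[:, labels.index("O")] *= mu       # S02: penalize non-entity label
    S = np.zeros((T, L))                        # S03: sub-path scores
    B = np.zeros((T, L), dtype=int)             # S04: predecessor (path clue) matrix
    S[0] = emissions[0]
    for t in range(1, T):                       # S05: forward DP sweep
        scores = S[t - 1][:, None] + transitions + emissions[t]
        B[t] = scores.argmax(axis=0)            # best predecessor for each label
        S[t] = scores.max(axis=0)               # best path score ending in each label
    path = [int(S[-1].argmax())]                # S06: best final score, then backtrack
    for t in range(T - 1, 0, -1):
        path.append(int(B[t][path[-1]]))
    return [labels[i] for i in reversed(path)]
```

With uniform transitions, a toy 3-token sentence whose last token slightly favors "O" decodes to `["B-LOC", "I-LOC", "O"]` at μ = 1, but a large enough penalty flips the last label to "I-LOC", illustrating the recall-oriented bias.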
CN201910384659.4A 2019-05-09 2019-05-09 Named entity identification method based on stack model Active CN110110335B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910384659.4A CN110110335B (en) 2019-05-09 2019-05-09 Named entity identification method based on stack model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910384659.4A CN110110335B (en) 2019-05-09 2019-05-09 Named entity identification method based on stack model

Publications (2)

Publication Number Publication Date
CN110110335A true CN110110335A (en) 2019-08-09
CN110110335B CN110110335B (en) 2023-01-06

Family

ID=67489125

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910384659.4A Active CN110110335B (en) 2019-05-09 2019-05-09 Named entity identification method based on stack model

Country Status (1)

Country Link
CN (1) CN110110335B (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110688854A (en) * 2019-09-02 2020-01-14 平安科技(深圳)有限公司 Named entity recognition method, device and computer readable storage medium
CN110866402A (en) * 2019-11-18 2020-03-06 北京香侬慧语科技有限责任公司 Named entity identification method and device, storage medium and electronic equipment
CN110929521A (en) * 2019-12-06 2020-03-27 北京知道智慧信息技术有限公司 Model generation method, entity identification method, device and storage medium
CN111090981A (en) * 2019-12-06 2020-05-01 中国人民解放军战略支援部队信息工程大学 Method and system for building Chinese text automatic sentence segmentation and punctuation generation model based on bidirectional long short-term memory network
CN111209362A (en) * 2020-01-07 2020-05-29 苏州城方信息技术有限公司 Address data analysis method based on deep learning
CN111460821A (en) * 2020-03-13 2020-07-28 云知声智能科技股份有限公司 Entity identification and linking method and device
CN111597804A (en) * 2020-05-15 2020-08-28 腾讯科技(深圳)有限公司 Entity recognition model training method and related device
CN111753840A (en) * 2020-06-18 2020-10-09 北京同城必应科技有限公司 Order dispatching technique for intra-city logistics distribution
CN112035635A (en) * 2020-08-28 2020-12-04 康键信息技术(深圳)有限公司 Medical field intention recognition method, device, equipment and storage medium
CN112699682A (en) * 2020-12-11 2021-04-23 山东大学 Named entity identification method and device based on combinable weak authenticator
CN112784602A (en) * 2020-12-03 2021-05-11 南京理工大学 News emotion entity extraction method based on remote supervision
CN112800768A (en) * 2021-02-03 2021-05-14 北京金山数字娱乐科技有限公司 Training method and device for nested named entity recognition model
CN113051918A (en) * 2019-12-26 2021-06-29 北京中科闻歌科技股份有限公司 Named entity identification method, device, equipment and medium based on ensemble learning
CN113779992A (en) * 2021-07-19 2021-12-10 西安理工大学 Method for realizing BcBERT-SW-BiLSTM-CRF model based on vocabulary enhancement and pre-training
CN114118093A (en) * 2022-01-27 2022-03-01 华东交通大学 Method and system for identifying flat mark enhanced nested named entity

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170060835A1 (en) * 2015-08-27 2017-03-02 Xerox Corporation Document-specific gazetteers for named entity recognition
CN106569998A (en) * 2016-10-27 2017-04-19 浙江大学 Text named entity recognition method based on Bi-LSTM, CNN and CRF
CN107844474A (en) * 2017-09-29 2018-03-27 华南师范大学 Disease data name entity recognition method and system based on stacking condition random field
CN109359293A (en) * 2018-09-13 2019-02-19 内蒙古大学 Mongolian name entity recognition method neural network based and its identifying system


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Maimaitiayifu et al.: "Uyghur Named Entity Recognition Based on the BiLSTM-CNN-CRF Model", Computer Engineering (《计算机工程》) *
Gao Qiang et al.: "Research on Named Entity Recognition in the Defense Domain Based on a Cascaded Model", New Technology of Library and Information Service (《现代图书情报技术》) *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110688854A (en) * 2019-09-02 2020-01-14 平安科技(深圳)有限公司 Named entity recognition method, device and computer readable storage medium
CN110866402B (en) * 2019-11-18 2023-11-28 北京香侬慧语科技有限责任公司 Named entity identification method and device, storage medium and electronic equipment
CN110866402A (en) * 2019-11-18 2020-03-06 北京香侬慧语科技有限责任公司 Named entity identification method and device, storage medium and electronic equipment
CN111090981A (en) * 2019-12-06 2020-05-01 中国人民解放军战略支援部队信息工程大学 Method and system for building Chinese text automatic sentence segmentation and punctuation generation model based on bidirectional long short-term memory network
CN110929521A (en) * 2019-12-06 2020-03-27 北京知道智慧信息技术有限公司 Model generation method, entity identification method, device and storage medium
CN110929521B (en) * 2019-12-06 2023-10-27 北京知道创宇信息技术股份有限公司 Model generation method, entity identification method, device and storage medium
CN113051918A (en) * 2019-12-26 2021-06-29 北京中科闻歌科技股份有限公司 Named entity identification method, device, equipment and medium based on ensemble learning
CN111209362A (en) * 2020-01-07 2020-05-29 苏州城方信息技术有限公司 Address data analysis method based on deep learning
CN111460821A (en) * 2020-03-13 2020-07-28 云知声智能科技股份有限公司 Entity identification and linking method and device
CN111460821B (en) * 2020-03-13 2023-08-29 云知声智能科技股份有限公司 Entity identification and linking method and device
CN111597804A (en) * 2020-05-15 2020-08-28 腾讯科技(深圳)有限公司 Entity recognition model training method and related device
CN111597804B (en) * 2020-05-15 2023-03-10 腾讯科技(深圳)有限公司 Method and related device for training entity recognition model
CN111753840A (en) * 2020-06-18 2020-10-09 北京同城必应科技有限公司 Order dispatching technique for intra-city logistics distribution
CN112035635A (en) * 2020-08-28 2020-12-04 康键信息技术(深圳)有限公司 Medical field intention recognition method, device, equipment and storage medium
CN112784602A (en) * 2020-12-03 2021-05-11 南京理工大学 News emotion entity extraction method based on remote supervision
CN112699682A (en) * 2020-12-11 2021-04-23 山东大学 Named entity identification method and device based on combinable weak authenticator
CN112800768A (en) * 2021-02-03 2021-05-14 北京金山数字娱乐科技有限公司 Training method and device for nested named entity recognition model
CN113779992A (en) * 2021-07-19 2021-12-10 西安理工大学 Method for realizing BcBERT-SW-BiLSTM-CRF model based on vocabulary enhancement and pre-training
CN114118093B (en) * 2022-01-27 2022-04-15 华东交通大学 Method and system for identifying flat mark enhanced nested named entity
CN114118093A (en) * 2022-01-27 2022-03-01 华东交通大学 Method and system for identifying flat mark enhanced nested named entity

Also Published As

Publication number Publication date
CN110110335B (en) 2023-01-06

Similar Documents

Publication Publication Date Title
CN110110335A (en) A kind of name entity recognition method based on Overlay model
CN110633409B (en) Automobile news event extraction method integrating rules and deep learning
CN110427623B (en) Semi-structured document knowledge extraction method and device, electronic equipment and storage medium
CN104881401B (en) A kind of patent document clustering method
CN108416384A (en) A kind of image tag mask method, system, equipment and readable storage medium storing program for executing
CN107861951A (en) Session subject identifying method in intelligent customer service
CN108804512A (en) Generating means, method and the computer readable storage medium of textual classification model
CN109241255A (en) A kind of intension recognizing method based on deep learning
CN107729309A (en) A kind of method and device of the Chinese semantic analysis based on deep learning
CN109165950A (en) A kind of abnormal transaction identification method based on financial time series feature, equipment and readable storage medium storing program for executing
CN108763510A (en) Intension recognizing method, device, equipment and storage medium
CN109255027B (en) E-commerce comment sentiment analysis noise reduction method and device
CN104408153A (en) Short text hash learning method based on multi-granularity topic models
CN107145573A (en) The problem of artificial intelligence customer service robot, answers method and system
CN109710768A (en) A kind of taxpayer's industry two rank classification method based on MIMO recurrent neural network
CN109902285A (en) Corpus classification method, device, computer equipment and storage medium
CN108470022A (en) A kind of intelligent work order quality detecting method based on operation management
CN108090223A (en) A kind of opening scholar portrait method based on internet information
CN109271546A (en) The foundation of image retrieval Feature Selection Model, Database and search method
CN112016313A (en) Spoken language element identification method and device and alarm situation analysis system
CN111191051A (en) Method and system for constructing emergency knowledge map based on Chinese word segmentation technology
CN110008699A (en) A kind of software vulnerability detection method neural network based and device
CN110019820A (en) Main suit and present illness history symptom Timing Coincidence Detection method in a kind of case history
CN112668323B (en) Text element extraction method based on natural language processing and text examination system thereof
CN108241867A (en) A kind of sorting technique and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant