CN110298042A - Film and television entity recognition method based on BiLSTM-CRF and knowledge graph - Google Patents

Film and television entity recognition method based on BiLSTM-CRF and knowledge graph Download PDF

Info

Publication number
CN110298042A
CN110298042A (application CN201910572843.1A)
Authority
CN
China
Prior art keywords
vector
film and television
crf
entity recognition
bilstm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910572843.1A
Other languages
Chinese (zh)
Inventor
孙云云
唐军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Changhong Electric Co Ltd
Original Assignee
Sichuan Changhong Electric Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Changhong Electric Co Ltd filed Critical Sichuan Changhong Electric Co Ltd
Priority to CN201910572843.1A priority Critical patent/CN110298042A/en
Publication of CN110298042A publication Critical patent/CN110298042A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Animal Behavior & Ethology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Character Discrimination (AREA)

Abstract

The invention discloses a film and television entity recognition method based on BiLSTM-CRF and a knowledge graph. A character vector and a part-of-speech vector are obtained for the text to be recognised, the two vectors are combined by weighted summation, and the result is fed into a trained BiLSTM model to obtain a text feature sequence; the feature sequence is then passed to a trained CRF model to obtain the named entity recognition result for the text; finally, the recognised entities are queried in a film and television knowledge graph for further verification. The method can effectively extract entities from users' colloquial film and television search text, makes full use of the knowledge graph to uncover the user's implicit search intent, and improves the user experience. When labelled data is scarce, word vectors trained with a language model are used as the bottom-level input of the neural network, which improves training efficiency. The method has good application prospects and can be widely applied to entity recognition scenarios in many fields.

Description

Film and television entity recognition method based on BiLSTM-CRF and knowledge graph
Technical field
The present invention relates to the field of deep learning and natural language processing, and in particular to a film and television entity recognition method based on BiLSTM-CRF and a knowledge graph.
Background technique
Television sets are indispensable in almost every household, and new films and TV series go online almost every day, so people can search a large library of film and television resources from the TV, for example by director, actor, title or genre. How to extract film and television entities accurately and effectively, so as to help users quickly find the programmes they like, has therefore become an important requirement.
Traditional named entity recognition mostly relies on rules and statistical machine learning. Early systems used dictionary- and rule-based methods: rule bases and dictionaries built by linguistic experts, combined with pattern matching or string matching, were used to identify named entities. For highly regular text, rule-based methods are accurate and efficient, but for irregular text the rules become hard to write and recognition quality suffers; dictionary-based recognition also depends heavily on the dictionary and cannot identify out-of-vocabulary words. Attention therefore turned to machine learning methods.
Common machine learning methods in the field of named entity recognition include the hidden Markov model (HMM), conditional random fields (CRF), the maximum entropy model and the support vector machine (SVM); among these, the most typical and most successfully applied are the hidden Markov model and the conditional random field model. Machine learning methods outperform rule-based methods in recognition quality, but named entity recognition models based on statistical machine learning still have limitations. On the one hand, to keep inference tractable they require specific independence assumptions; on the other hand, they place high demands on feature selection: various features relevant to the named entity recognition task must be engineered (feature engineering), which strongly affects the result but is time-consuming and laborious. Moreover, frequency-based HMM and CRF methods can only associate the semantics of the word immediately preceding the current word, so recognition accuracy is limited, especially for out-of-vocabulary words. Finally, they usually require a large amount of task-specific knowledge, such as designing the state model of the HMM or choosing the input features of the CRF.
In recent years, with improvements in hardware and the emergence of distributed word representations (word embeddings), neural networks have become effective models for many NLP (natural language processing) tasks. These methods treat sequence labelling tasks such as POS tagging and NER (named entity recognition) similarly: each token is mapped from a discrete one-hot representation to a dense embedding in a low-dimensional space, the embedding sequence of the sentence is fed into an RNN (recurrent neural network), features are extracted automatically by the network, and a Softmax (normalised exponential) layer predicts the label of each token. This turns model training into an end-to-end process rather than a conventional pipeline, does not depend on feature engineering, and is data-driven; however, there are many network variants, performance depends heavily on parameter settings, and the models are hard to interpret. A further drawback is that each token is classified independently during labelling: the labels predicted earlier cannot be used directly (contextual information can only be passed implicitly through the hidden states), so the predicted label sequence may be invalid. For example, in the BIO tagging scheme the tag B-PER cannot be followed by I-LOC, but Softmax has no way to use this constraint.
Recently, academia has proposed the LSTM-CRF model (long short-term memory network plus conditional random field) for sequence labelling: a CRF (conditional random field) layer is attached after the LSTM (long short-term memory network) layer to perform sentence-level tag prediction, so that the labelling process is no longer an independent classification of each token.
Summary of the invention
In view of the above problems, the invention proposes a film and television entity recognition method based on BiLSTM-CRF and a knowledge graph, which solves the entity recognition problem for short, colloquial film and television text when labelled data is scarce.
The present invention achieves the above objective through the following technical solution:
A film and television entity recognition method based on BiLSTM-CRF and a knowledge graph comprises the following steps:
Step 1: collect film and television data in real time from major data sources, e.g. crawl Douban, Baidu Baike and the like for entity information such as title, actor, role and character relations, and build a film and television knowledge graph;
Step 2: collect, from television sets, film and television search queries that have been converted from speech to text; analyse the collected data and label the common user search statements that follow certain patterns, for model training and word vector training;
Step 3: train the entity recognition model, which consists of three parts: a feature representation layer, a BiLSTM layer and a CRF layer:
(1) Feature representation layer: composed of a part-of-speech vector and a character vector; the character vector is obtained by language model (LM) training, the part-of-speech vector is a one-hot vector obtained by part-of-speech tagging after word segmentation, and the part-of-speech vector layer and the character vector layer are combined by weight into the final word vector layer, so that the combination of part-of-speech and character-level vectors represents each word's features in a specific semantic space;
(2) BiLSTM: composed of a forward and a backward long short-term memory network (LSTM); the forward and backward LSTMs take the output of the feature representation layer as input and separately encode the left and right context of the current time step, and the two encodings are concatenated to form the score information to be decoded;
(3) CRF: the CRF layer takes the output scores of the BiLSTM as input and introduces a transition score matrix, selecting the optimal label sequence according to the total sequence score;
Step 4: result verification: the model predictions are verified by combining rules and the knowledge graph, improving entity recognition efficiency.
In a further embodiment, in step 2, frequency statistics and k-means cluster analysis are performed on the large volume of user data collected from television sets.
The beneficial effects of the present invention are:
The method can effectively extract entities from users' colloquial film and television search text, and makes full use of the film and television knowledge graph to uncover the user's implicit search intent, improving the user experience. When labelled data is scarce, word vectors trained with a language model are used as the bottom-level input of the neural network, which improves training efficiency. The method has good application prospects and can be widely applied to entity recognition scenarios in many fields.
Detailed description of the invention
In order to describe the technical solutions in the embodiments of the present invention more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is the film and television entity recognition flow chart of the method of the present invention;
Fig. 2 is the BiLSTM-CRF model structure of the method of the present invention.
Specific embodiment
To make the objectives, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention are described in detail below. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the scope protected by the present invention.
In any embodiment, as shown in Fig. 1, the film and television entity recognition method based on BiLSTM-CRF and a knowledge graph according to the invention comprises the following steps:
Step 1: collect film and television data in real time from major data sources, e.g. crawl Douban, Baidu Baike and the like for entity information such as title, actor, role and character relations, and build a film and television knowledge graph;
Step 2: collect, from television sets, film and television search queries that have been converted from speech to text; analyse the collected data and label the common user search statements that follow certain patterns, for model training and word vector training;
Step 3: train the entity recognition model, which consists of three parts: a feature representation layer, a BiLSTM layer and a CRF layer:
(1) Feature representation layer: composed of a part-of-speech vector and a character vector; the character vector is obtained by language model (LM) training, the part-of-speech vector is a one-hot vector obtained by part-of-speech tagging after word segmentation, and the part-of-speech vector layer and the character vector layer are combined by weight into the final word vector layer, so that the combination of part-of-speech and character-level vectors represents each word's features in a specific semantic space;
(2) BiLSTM: composed of a forward and a backward long short-term memory network (LSTM); the forward and backward LSTMs take the output of the feature representation layer as input and separately encode the left and right context of the current time step, and the two encodings are concatenated to form the score information to be decoded;
(3) CRF: the CRF layer takes the output scores of the BiLSTM as input and introduces a transition score matrix, selecting the optimal label sequence according to the total sequence score;
Step 4: result verification: the model predictions are verified by combining rules and the knowledge graph, improving entity recognition efficiency.
In step 2, frequency statistics and k-means cluster analysis are performed on the large volume of user data collected from television sets.
The k-means clustering algorithm is summarised as follows:
Select K points as the initial centroids;
Assign each point to the nearest centroid, forming K clusters;
Recompute the centroid of each cluster;
Repeat until the clusters no longer change or the maximum number of iterations is reached.
Here, 15 clusters are chosen after parameter tuning and testing, so that sentences with similar user phrasing and intent are clustered together.
Combining the frequency analysis with the clustering results, the typical film and television search sentences of users are anticipated, the entity types to be recognised are determined, and the labels are given uniform names; there are currently 27 labels.
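For illustration only, the query clustering described above could be implemented along the following lines; scikit-learn and the step that turns each query into a fixed-length vector are assumptions, not prescribed by the patent.

```python
# Minimal sketch of the query-clustering step (assumptions: scikit-learn is
# available and queries have already been turned into fixed-length vectors,
# e.g. by averaging word2vec vectors; none of this is mandated by the patent).
import numpy as np
from sklearn.cluster import KMeans

def cluster_queries(query_vectors: np.ndarray, n_clusters: int = 15):
    """Group user search queries into intent clusters (15 clusters per the text)."""
    km = KMeans(n_clusters=n_clusters, max_iter=300, n_init=10, random_state=0)
    labels = km.fit_predict(query_vectors)   # cluster index for each query
    return labels, km.cluster_centers_

# Example usage with random placeholder vectors:
# labels, centers = cluster_queries(np.random.rand(1000, 300))
```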
Language model pre-training: before word vector training, the data are preprocessed by removing special punctuation, normalising the case of English text, and so on. The cleaned, large-scale user behaviour data are then used to train word vectors with the word2vec implementation in the gensim toolkit, producing 300-dimensional vectors. For the part-of-speech vector, each sentence is first segmented with jieba; the segmentation dictionary contains the fixed user phrases obtained from data analysis, such as 'I want to watch' and 'I want to listen to', each annotated with its part of speech and weight. The part-of-speech sequence of each sentence is then converted into a 300-dimensional one-hot part-of-speech vector, which is added to the character vector trained by word2vec with a certain weight; the resulting word vectors serve as the initial parameters of the bidirectional LSTM network.
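A minimal sketch of this pre-training step, using the gensim and jieba tools named above; the mixing weights, the POS inventory and the helper names are illustrative assumptions not specified in the text.

```python
# Sketch of the word-vector pre-training described above (gensim word2vec +
# jieba segmentation). The 0.7/0.3 mixing weights and the POS inventory are
# illustrative assumptions, not values given in the patent.
import jieba.posseg as pseg
import numpy as np
from gensim.models import Word2Vec

DIM = 300  # the text specifies 300-dimensional vectors

def train_char_vectors(corpus):            # corpus: list of query strings
    """Train character-level word2vec vectors on the user query corpus."""
    sentences = [list(q) for q in corpus]  # character-level tokens
    return Word2Vec(sentences, vector_size=DIM, window=5, min_count=1, workers=4)

def pos_one_hot(flag, pos_list):
    """One-hot part-of-speech vector padded to DIM dimensions."""
    vec = np.zeros(DIM)
    if flag in pos_list:
        vec[pos_list.index(flag)] = 1.0
    return vec

def query_vectors(query, w2v, pos_list, w_char=0.7, w_pos=0.3):
    """Weighted sum of character vector and POS one-hot vector per character."""
    out = []
    for word, flag in pseg.cut(query):     # jieba segmentation with POS tags
        for ch in word:
            char_vec = w2v.wv[ch] if ch in w2v.wv else np.zeros(DIM)
            out.append(w_char * char_vec + w_pos * pos_one_hot(flag, pos_list))
    return np.stack(out)
```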
Using word vectors trained on a large amount of real data alleviates, to some extent, the problem of doing entity recognition with a deep neural network when labelled data is scarce: the initial parameters of the BiLSTM network are no longer meaningless random values, the word vectors trained on large data already capture initial information such as Chinese character radicals and serve as the bottom-level input of the network, and the additional part-of-speech vector helps the network identify the domain of the input text effectively.
Model training data preparation: data annotation.
Frequently occurring data covering each label are filtered out of the user data and annotated manually, using the BIO tag set employed in the Bakeoff-3 evaluation, in the following form:
I want to watch Liu Dehua's A World Without Thieves;
O O O B-actor I-actor I-actor O B-movie_name I-movie_name I-movie_name I-movie_name;
So that no label becomes too sparse, tags whose type occurs too rarely are merged into a single label; after prediction, the knowledge graph is queried, by priority, to verify them. There are currently 25,674 items of model training data; as user demand changes, the model is iterated continuously and the corresponding training data will also grow.
Model training:
All annotated training data are divided into a training set, a test set and a validation set in the ratio 0.6 : 0.3 : 0.1.
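For illustration, the split could be implemented as follows; scikit-learn and shuffling are assumptions, the patent only specifies the 0.6/0.3/0.1 ratio.

```python
# Sketch of the 0.6 / 0.3 / 0.1 split described above; the shuffling and the
# use of scikit-learn are illustrative assumptions.
from sklearn.model_selection import train_test_split

def split_dataset(samples, labels, seed=0):
    """Split annotated sentences into train / test / validation sets (0.6/0.3/0.1)."""
    x_train, x_rest, y_train, y_rest = train_test_split(
        samples, labels, train_size=0.6, random_state=seed)
    # 0.3 / 0.1 of the total is 0.75 / 0.25 of the remaining 40 %
    x_test, x_val, y_test, y_val = train_test_split(
        x_rest, y_rest, train_size=0.75, random_state=seed)
    return (x_train, y_train), (x_test, y_test), (x_val, y_val)
```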
Taking the sentence as the unit, a sentence containing n characters (a sequence of characters) is denoted
x = (x_1, x_2, ..., x_n),
where x_i is the id, in the dictionary, of the i-th character of the sentence; a word-to-id vector can thus be obtained for each character, and its dimension is the dictionary size.
Here the dictionary is built by counting the frequency of each character over all training data and sorting from high to low; each character then receives a unique id, and out-of-vocabulary characters are mapped to the marker 'UNK'.
The first layer of the model is the look-up layer. Using a pre-trained or randomly initialised embedding matrix, each character x_i of the sentence is mapped from a one-hot vector to a dense, low-dimensional character embedding x_i ∈ R^d, where d is the embedding dimension. Dropout is applied before the input is passed to the next layer, to alleviate overfitting.
The second layer of the model is the bidirectional LSTM layer, which extracts sentence features automatically. The character embedding sequence (x_1, x_2, ..., x_n) of a sentence is used as the input of each time step of the bidirectional LSTM; the hidden state sequence output by the forward LSTM and the hidden state sequence output by the backward LSTM are then concatenated position by position, h_t = [h_t(forward); h_t(backward)] ∈ R^m, giving the complete hidden state sequence
(h_1, h_2, ..., h_n) ∈ R^(n×m).
After dropout, a linear layer maps the hidden state vectors from m dimensions to k dimensions, where k is the number of tags in the tag set, giving the automatically extracted sentence features, denoted as the matrix P = (p_1, p_2, ..., p_n) ∈ R^(n×k). Each component p_ij of p_i ∈ R^k can be regarded as the score for classifying character x_i into the j-th tag. Applying Softmax to P would amount to an independent k-class classification at each position, but such labelling cannot exploit the tags that have already been assigned, so a CRF layer is added next to perform the labelling.
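A minimal sketch of the look-up layer, BiLSTM layer and linear projection described above, assuming PyTorch; the framework, layer sizes and dropout rate are illustrative, not prescribed by the patent.

```python
# Sketch of the look-up layer, BiLSTM layer and linear projection to tag scores.
# PyTorch and all hyper-parameters are illustrative assumptions.
import torch
import torch.nn as nn

class BiLstmEncoder(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=300, hidden_dim=128, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)        # look-up layer
        self.dropout = nn.Dropout(dropout)                        # applied before the BiLSTM
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                              bidirectional=True)                 # forward + backward LSTM
        self.hidden2tag = nn.Linear(2 * hidden_dim, num_tags)     # m -> k projection

    def forward(self, char_ids):                                  # (batch, seq_len)
        emb = self.dropout(self.embedding(char_ids))
        hidden, _ = self.bilstm(emb)                              # (batch, seq_len, 2*hidden_dim)
        return self.hidden2tag(self.dropout(hidden))              # emission scores P, (batch, seq_len, k)

# Example: scores = BiLstmEncoder(5000, 27)(torch.randint(0, 5000, (2, 10)))
```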
The third layer of the model is the CRF layer, which performs sentence-level sequence labelling. The parameter of the CRF layer is a (k+2) × (k+2) matrix A, where A_ij denotes the transition score from the i-th tag to the j-th tag, so that labelling a position can make use of the tags already assigned; 2 is added because a start state is introduced for the beginning of the sentence and an end state for its end. For a tag sequence y = (y_1, y_2, ..., y_n) whose length equals the sentence length, the model's score for labelling sentence x with the sequence y is
score(x, y) = Σ_{i=1..n} (A_{y_(i-1), y_i} + P_{i, y_i}),
that is, the score of the whole sequence equals the sum of the scores of all positions, and the score of each position consists of two parts: one determined by the LSTM output p_i and the other by the transition matrix A of the CRF. The normalised probability is then obtained with a Softmax over all possible tag sequences:
P(y | x) = exp(score(x, y)) / Σ_{y'} exp(score(x, y')).
During model training, the log-likelihood of a training sample (x, y_x) is maximised; the log-likelihood is given by
log P(y_x | x) = score(x, y_x) − log Σ_{y'} exp(score(x, y')).
When implementing this formula, note that the logarithm of a sum of exponentials should be computed with the log-sum-exp trick, log Σ_i exp(z_i) = a + log Σ_i exp(z_i − a) with a = max_i z_i, for numerical stability; the second term of the above formula is computed efficiently with the forward-backward algorithm of the CRF.
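Continuing the same illustrative PyTorch setup, the sequence score and the log-sum-exp partition term could be computed roughly as follows; this is a sketch, not the patent's implementation, and the start/end tag indices are assumptions.

```python
# Sketch of the CRF sequence score and the log-sum-exp partition term,
# continuing the illustrative PyTorch setup above (not the patent's code).
import torch

def sequence_score(emissions, tags, transitions, start_tag, end_tag):
    """score(x, y) = sum_i (A[y_{i-1}, y_i] + P[i, y_i]) including start/end transitions."""
    score = transitions[start_tag, tags[0]] + emissions[0, tags[0]]
    for i in range(1, len(tags)):
        score = score + transitions[tags[i - 1], tags[i]] + emissions[i, tags[i]]
    return score + transitions[tags[-1], end_tag]

def log_partition(emissions, transitions, start_tag, end_tag):
    """log sum_y' exp(score(x, y')) computed with the forward algorithm and logsumexp."""
    alpha = transitions[start_tag] + emissions[0]          # scores for the first position
    for t in range(1, emissions.size(0)):
        # alpha[j] = logsumexp_i(alpha[i] + A[i, j]) + P[t, j]
        alpha = torch.logsumexp(alpha.unsqueeze(1) + transitions, dim=0) + emissions[t]
    return torch.logsumexp(alpha + transitions[:, end_tag], dim=0)
```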
At prediction (decoding) time, the model finds the optimal path with the Viterbi algorithm, a dynamic programming method.
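A corresponding sketch of Viterbi decoding over the same emission and transition scores; illustrative only, with the same assumed start/end tag conventions.

```python
# Sketch of Viterbi decoding over the same emission/transition scores
# (illustrative only; tag indices for start/end are assumptions).
import torch

def viterbi_decode(emissions, transitions, start_tag, end_tag):
    """Return the highest-scoring tag sequence for one sentence."""
    seq_len, num_tags = emissions.shape
    score = transitions[start_tag] + emissions[0]          # best score ending in each tag
    backpointers = []
    for t in range(1, seq_len):
        cand = score.unsqueeze(1) + transitions            # cand[i, j] = score[i] + A[i, j]
        best_prev = cand.argmax(dim=0)                     # best previous tag for each current tag
        score = cand.max(dim=0).values + emissions[t]
        backpointers.append(best_prev)
    score = score + transitions[:, end_tag]
    best_last = int(score.argmax())
    path = [best_last]
    for bp in reversed(backpointers):                      # trace back the best path
        path.append(int(bp[path[-1]]))
    return list(reversed(path))
```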
The reason for using BiLSTM plus CRF rather than BiLSTM alone is that the CRF layer can learn constraint rules from the training data and add constraints to the final predicted labels to ensure that they are valid. These constraints are learned automatically by the CRF layer during training.
These constraints may include:
I. The first tag in a sentence always starts with 'B-' or 'O', never 'I-';
II. In a pattern 'B-label1 I-label2 I-label3 ...', label1, label2 and label3 should belong to the same entity class: for example, 'B-Person I-Person' is a valid sequence, while 'B-Person I-Organization' is an invalid tag sequence;
III. A tag sequence beginning 'O I-label1' is invalid: the first tag of an entity should be 'B-', not 'I-'; in other words, the valid sequence is 'O B-label1'.
With these constraints, the probability of an invalid sequence appearing in the predicted tag sequence is greatly reduced. Since the BiLSTM output is a score for each tag of each unit, one could simply pick the highest-scoring tag for each unit; this would usually give the correct tag for each unit of sentence x, but it cannot guarantee that every prediction is correct. The structure of the whole model is shown in Fig. 2.
In addition, the data analysis process of the invention is as follows:
From the large volume of collected user data, the basic film and television search needs of users are analysed with k-means clustering, frequency statistics and the like: common search patterns are identified, for example which conditions users search videos by, and the entity classes and names are determined in combination with business requirements. The training data are then annotated manually according to the BIO standard. Because no ready-made labelled data are available, 300-dimensional character vectors are trained with the word2vec language model on a large amount of real user data together with one-hot part-of-speech vectors, and the character vectors and part-of-speech vectors of the text are merged with a certain weight as the bottom-level input of the bidirectional LSTM.
The entity recognition model is trained as follows:
All annotated training data are divided into a training set, a test set and a validation set in the ratio 0.6 : 0.3 : 0.1.
Taking the sentence as the unit, a sentence containing n characters (a sequence of characters) is denoted
x = (x_1, x_2, ..., x_n),
where x_i is the id, in the dictionary, of the i-th character of the sentence; a word-to-id vector can thus be obtained for each character, and its dimension is the dictionary size.
Here the dictionary is built by counting the frequency of each character over all training data and sorting from high to low; each character then receives a unique id, and out-of-vocabulary characters are mapped to the marker 'UNK'.
Model training mainly comprises the following three parts:
1. Input character/word vector representation.
Each word is represented with a dense vector, loading the pre-trained word vectors (word2vec) and the part-of-speech vectors. Some meaning is extracted from the individual characters, and the meaning of the sentence is obtained from the part-of-speech vector. For each word, a vector is constructed to capture both the meaning of the word and the features useful for entity recognition; this vector is the weighted combination of the word vector trained by word2vec and the feature vector extracted from the part of speech.
2. Contextual semantic representation of the text: for each word in context, a meaningful vector representation is needed, and it is obtained with the BiLSTM. After the final vector representation of each word is obtained, a bi-LSTM is run over the word vector sequence, using the hidden state at every time step rather than only the final state. From m input word vectors, m hidden state vectors are obtained; whereas the word vectors only contain word-level information, the hidden state vectors take the context into account.
3. Decoding: once the vector representation of each word is available, the entity tags are predicted.
In the decoding stage the tag scores are computed; the final prediction uses the hidden state vector of each word, and the score of each entity tag can be obtained with a fully connected neural network.
Assuming there are 9 classes, W ∈ R^(9×k) and b ∈ R^9 are used to compute the score s = W·h + b ∈ R^9, where s[i] can be understood as the score that word w receives for tag i.
A linear CRF is applied to the entity tag scores: the softmax approach makes only local choices; in other words, even though the hidden state h produced by the bi-LSTM contains some contextual information, the tag decision is still local and the surrounding tags are not used to aid the decision. For example, for '杨幂' (Yang Mi), once the second character has been given the tag 'I-actor', this should help determine that the first character marks the start of the actor entity; the linear CRF defines such a global score.
Finally, the trained model and the relevant parameters are saved.
The data preprocessing process is as follows:
This step mainly removes special characters and the like from the data before model prediction, and converts the text data into the format required for prediction, i.e. the text is converted into a wordId vector whose dimension is the length of the training data dictionary.
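A minimal sketch of this preprocessing, assuming a simple character-frequency dictionary; the regular expression and helper names are illustrative assumptions.

```python
# Sketch of the preprocessing step: strip special characters and map each
# character to its dictionary id ('UNK' for out-of-vocabulary characters).
# The regex and dictionary format are illustrative assumptions.
import re

def build_dictionary(training_sentences):
    """Assign ids by descending character frequency, reserving an 'UNK' id."""
    freq = {}
    for sent in training_sentences:
        for ch in sent:
            freq[ch] = freq.get(ch, 0) + 1
    ordered = sorted(freq, key=freq.get, reverse=True)
    word2id = {ch: i + 1 for i, ch in enumerate(ordered)}
    word2id["UNK"] = 0
    return word2id

def text_to_ids(text, word2id):
    """Clean the query and convert it into a sequence of character ids."""
    cleaned = re.sub(r"[^\w\u4e00-\u9fa5]", "", text)   # drop special characters
    return [word2id.get(ch, word2id["UNK"]) for ch in cleaned]
```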
Model prediction is as follows:
The preprocessed data are fed into the model for prediction; possible prediction results are as follows (the tag sequences correspond to the characters of the original Chinese queries):
(1) I want to watch Zhang Yimou's Coming Home;
O O O B-director_name I-director_name I-director_name O O B-movie_name I-movie_name;
(2) Liu Dehua's A Chinese Odyssey;
B-actor I-actor I-actor O B-movie_name I-movie_name I-movie_name;
(3) the film starring Deng Chao's wife;
B-actor I-actor O O B-relation I-relation O O O;
(4) cool life cannot be sad;
O O O O O O B-movie_name I-movie_name I-movie_name;
(5) recommend a popular film;
O O O O O O O O O;
1. Handling prediction results with no entities:
For cases such as (5) above, in which the model prediction contains no entity:
Data processing: 1) strip redundant leading and trailing filler such as 'I want to watch', 'I am going to watch', 'play', 'is there', and so on;
2) extract rule-based entities such as episode/season/part, version and language with regular expressions. Values such as language, version and country, which rarely change and are limited in number, are maintained in advance; these data also exist in the knowledge graph, and the invention keeps this part of the data in memory as a dictionary, in a form such as {'English': 'English', 'English-language': 'English', 'foreign language': 'English'}, so that all synonyms of a value map to one canonical form. Such entities are matched with regular expressions and then removed from the query. For example, for 'I want to watch Fast and Furious, English version', if the model predicts no entity, the leading/trailing redundancy and the special (version/language) entity are stripped, and the remaining text 'Fast and Furious' is searched in the knowledge graph to obtain the corresponding entity result.
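For illustration, a rough sketch of this rule-based post-processing; the filler phrases, synonym entries and helper names are assumptions for illustration only.

```python
# Sketch of the rule-based post-processing for queries where the model finds
# no entity: strip filler phrases, pull out version/language attributes via a
# synonym dictionary, and keep the remainder for a knowledge-graph lookup.
# All phrases and dictionary entries below are illustrative assumptions.
import re

FILLERS = ["I want to watch", "I want to see", "play", "is there"]
LANGUAGE_SYNONYMS = {"English": "English", "English-language": "English",
                     "foreign language": "English"}

def rule_based_extract(query: str):
    """Return (cleaned_query, attributes) for a query with no model-predicted entity."""
    attributes = {}
    for phrase in FILLERS:                                   # strip leading/trailing filler
        query = query.replace(phrase, "")
    for synonym, canonical in LANGUAGE_SYNONYMS.items():     # normalise language mentions
        if synonym.lower() in query.lower():
            attributes["language"] = canonical
            query = re.sub(re.escape(synonym) + r"(\s*version)?", "", query,
                           flags=re.IGNORECASE)
    return query.strip(), attributes

# Example: rule_based_extract("I want to watch Fast and Furious English version")
# -> ("Fast and Furious", {"language": "English"})
```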
2. Handling prediction results that contain entities:
For predictions (1), (2), (3) and (4) above, which contain entity labels, the corresponding entities are searched in the knowledge graph to verify whether each entity really exists. For example, in (2) Liu Dehua did not in fact act in A Chinese Odyssey, so other Liu Dehua films are recommended to the user instead of telling the user that the film was not found, which improves the user experience. In prediction (3) what the user actually wants to watch are Sun Li's films; this is the mining of an abstract entity relation and better satisfies the user's need. Knowledge graph verification thus further improves the entity results. For (4), although there is an entity result, no corresponding title entity is found in the knowledge graph, so the prediction is treated as a failure and the following entity result packaging and output processing is executed.
Entity result packaging and output are as follows:
Logically inconsistent entity predictions are handled; for example, for 'Liu Dehua season 3', where recognition gives actor: Liu Dehua and season: 3, the season entity is deleted and the result is then packaged.
The above is only a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto; any change or replacement that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall be covered by the scope of protection of the present invention. The specific technical features described in the above specific embodiments may be combined in any suitable manner as long as they do not contradict each other; to avoid unnecessary repetition, the present invention does not describe the possible combinations further. The various embodiments of the present invention may also be combined arbitrarily, and such combinations shall likewise be regarded as disclosed by the present invention as long as they do not depart from the idea of the present invention.

Claims (2)

1. A film and television entity recognition method based on BiLSTM-CRF and a knowledge graph, characterised by comprising the following steps:
Step 1: collect film and television data in real time from major data sources, crawl entity information such as title, actor, role and character relations, and build a film and television knowledge graph;
Step 2: collect, from television sets, film and television search queries that have been converted from speech to text; analyse the collected data and label the common user search statements that follow certain patterns, for model training and word vector training;
Step 3: train the entity recognition model, which consists of three parts: a feature representation layer, a BiLSTM layer and a CRF layer:
(1) Feature representation layer: composed of a part-of-speech vector and a character vector; the character vector is obtained by language model (LM) training, the part-of-speech vector is a one-hot vector obtained by part-of-speech tagging after word segmentation, and the part-of-speech vector layer and the character vector layer are combined by weight into the final word vector layer, so that the combination of part-of-speech and character-level vectors represents each word's features in a specific semantic space;
(2) BiLSTM: composed of a forward and a backward long short-term memory network (LSTM); the forward and backward LSTMs take the output of the feature representation layer as input and separately encode the left and right context of the current time step, and the two encodings are concatenated to form the score information to be decoded;
(3) CRF: the CRF layer takes the output scores of the BiLSTM as input and introduces a transition score matrix, selecting the optimal label sequence according to the total sequence score;
Step 4: result verification: the model predictions are verified by combining rules and the knowledge graph, improving entity recognition efficiency.
2. The film and television entity recognition method based on BiLSTM-CRF and a knowledge graph according to claim 1, characterised in that in step 2, frequency statistics and k-means cluster analysis are performed on the large volume of user data collected from television sets.
CN201910572843.1A 2019-06-26 2019-06-26 Film and television entity recognition method based on BiLSTM-CRF and knowledge graph Pending CN110298042A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910572843.1A CN110298042A (en) 2019-06-26 2019-06-26 Film and television entity recognition method based on BiLSTM-CRF and knowledge graph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910572843.1A CN110298042A (en) 2019-06-26 2019-06-26 Film and television entity recognition method based on BiLSTM-CRF and knowledge graph

Publications (1)

Publication Number Publication Date
CN110298042A true CN110298042A (en) 2019-10-01

Family

ID=68029238

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910572843.1A Pending CN110298042A (en) 2019-06-26 2019-06-26 Film and television entity recognition method based on BiLSTM-CRF and knowledge graph

Country Status (1)

Country Link
CN (1) CN110298042A (en)

Cited By (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110716991A (en) * 2019-10-11 2020-01-21 掌阅科技股份有限公司 Method for displaying entity associated information based on electronic book and electronic equipment
CN110782881A (en) * 2019-10-25 2020-02-11 四川长虹电器股份有限公司 Video entity error correction method after speech recognition and entity recognition
CN110807324A (en) * 2019-10-09 2020-02-18 四川长虹电器股份有限公司 Video entity identification method based on IDCNN-crf and knowledge graph
CN110909174A (en) * 2019-11-19 2020-03-24 南京航空航天大学 Knowledge graph-based method for improving entity link in simple question answering
CN111090754A (en) * 2019-11-20 2020-05-01 新华智云科技有限公司 Method for automatically constructing movie comprehensive knowledge map based on encyclopedic entries
CN111125378A (en) * 2019-12-25 2020-05-08 同方知网(北京)技术有限公司 Closed-loop entity extraction method based on automatic sample labeling
CN111159017A (en) * 2019-12-17 2020-05-15 北京中科晶上超媒体信息技术有限公司 Test case generation method based on slot filling
CN111241810A (en) * 2020-01-16 2020-06-05 百度在线网络技术(北京)有限公司 Punctuation prediction method and device
CN111274788A (en) * 2020-01-16 2020-06-12 创新工场(广州)人工智能研究有限公司 Dual-channel joint processing method and device
CN111274817A (en) * 2020-01-16 2020-06-12 北京航空航天大学 Intelligent software cost measurement method based on natural language processing technology
CN111274794A (en) * 2020-01-19 2020-06-12 浙江大学 Synonym expansion method based on transmission
CN111310470A (en) * 2020-01-17 2020-06-19 西安交通大学 Chinese named entity recognition method fusing word and word features
CN111444720A (en) * 2020-03-30 2020-07-24 华南理工大学 Named entity recognition method for English text
CN111553158A (en) * 2020-04-21 2020-08-18 中国电力科学研究院有限公司 Method and system for identifying named entities in power scheduling field based on BilSTM-CRF model
CN111666418A (en) * 2020-04-23 2020-09-15 北京三快在线科技有限公司 Text regeneration method and device, electronic equipment and computer readable medium
CN111832306A (en) * 2020-07-09 2020-10-27 昆明理工大学 Image diagnosis report named entity identification method based on multi-feature fusion
CN111859967A (en) * 2020-06-12 2020-10-30 北京三快在线科技有限公司 Entity identification method and device and electronic equipment
CN111882124A (en) * 2020-07-20 2020-11-03 武汉理工大学 Homogeneous platform development effect prediction method based on generation confrontation simulation learning
CN111917861A (en) * 2020-07-28 2020-11-10 广东工业大学 Knowledge storage method and system based on block chain and knowledge graph and application thereof
CN112084783A (en) * 2020-09-24 2020-12-15 中国民航大学 Entity identification method and system based on civil aviation non-civilized passengers
CN112101009A (en) * 2020-09-23 2020-12-18 中国农业大学 Knowledge graph-based method for judging similarity of people relationship frame of dream of Red mansions
CN112541088A (en) * 2020-12-29 2021-03-23 浙大城市学院 Dangerous chemical library construction method based on knowledge graph
CN112905884A (en) * 2021-02-10 2021-06-04 北京百度网讯科技有限公司 Method, apparatus, medium, and program product for generating sequence annotation model
CN112906367A (en) * 2021-02-08 2021-06-04 上海宏原信息科技有限公司 Information extraction structure, labeling method and identification method of consumer text
CN112989787A (en) * 2021-02-05 2021-06-18 杭州云嘉云计算有限公司 Text element extraction method
CN113255354A (en) * 2021-06-03 2021-08-13 北京达佳互联信息技术有限公司 Search intention recognition method, device, server and storage medium
CN113326380A (en) * 2021-08-03 2021-08-31 国能大渡河大数据服务有限公司 Equipment measurement data processing method, system and terminal based on deep neural network
CN113392649A (en) * 2021-07-08 2021-09-14 上海浦东发展银行股份有限公司 Identification method, device, equipment and storage medium
CN113536793A (en) * 2020-10-14 2021-10-22 腾讯科技(深圳)有限公司 Entity identification method, device, equipment and storage medium
CN113673248A (en) * 2021-08-23 2021-11-19 中国人民解放军32801部队 Named entity identification method for testing and identifying small sample text
CN113821592A (en) * 2021-06-23 2021-12-21 腾讯科技(深圳)有限公司 Data processing method, device, equipment and storage medium
CN114647727A (en) * 2022-03-17 2022-06-21 北京百度网讯科技有限公司 Model training method, device and equipment applied to entity information recognition
CN114691889A (en) * 2022-04-15 2022-07-01 中北大学 Method for constructing fault diagnosis knowledge map of turnout switch machine
CN116401369A (en) * 2023-06-07 2023-07-07 佰墨思(成都)数字技术有限公司 Entity identification and classification method for biological product production terms

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108763218A (en) * 2018-06-04 2018-11-06 四川长虹电器股份有限公司 A kind of video display retrieval entity recognition method based on CRF
CN108874997A (en) * 2018-06-13 2018-11-23 广东外语外贸大学 A kind of name name entity recognition method towards film comment
CN109033374A (en) * 2018-07-27 2018-12-18 四川长虹电器股份有限公司 Knowledge mapping search method based on Bayes classifier

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108763218A (en) * 2018-06-04 2018-11-06 四川长虹电器股份有限公司 A kind of video display retrieval entity recognition method based on CRF
CN108874997A (en) * 2018-06-13 2018-11-23 广东外语外贸大学 A kind of name name entity recognition method towards film comment
CN109033374A (en) * 2018-07-27 2018-12-18 四川长虹电器股份有限公司 Knowledge mapping search method based on Bayes classifier

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
周浩 et al., "Chinese opinion target extraction integrating semantic and syntactic information", CAAI Transactions on Intelligent Systems *

Cited By (48)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110807324A (en) * 2019-10-09 2020-02-18 四川长虹电器股份有限公司 Video entity identification method based on IDCNN-crf and knowledge graph
CN110716991A (en) * 2019-10-11 2020-01-21 掌阅科技股份有限公司 Method for displaying entity associated information based on electronic book and electronic equipment
CN110782881A (en) * 2019-10-25 2020-02-11 四川长虹电器股份有限公司 Video entity error correction method after speech recognition and entity recognition
CN110909174B (en) * 2019-11-19 2022-01-04 南京航空航天大学 Knowledge graph-based method for improving entity link in simple question answering
CN110909174A (en) * 2019-11-19 2020-03-24 南京航空航天大学 Knowledge graph-based method for improving entity link in simple question answering
CN111090754A (en) * 2019-11-20 2020-05-01 新华智云科技有限公司 Method for automatically constructing movie comprehensive knowledge map based on encyclopedic entries
CN111090754B (en) * 2019-11-20 2023-04-07 新华智云科技有限公司 Method for automatically constructing movie comprehensive knowledge map based on encyclopedic entries
CN111159017A (en) * 2019-12-17 2020-05-15 北京中科晶上超媒体信息技术有限公司 Test case generation method based on slot filling
CN111125378A (en) * 2019-12-25 2020-05-08 同方知网(北京)技术有限公司 Closed-loop entity extraction method based on automatic sample labeling
CN111274788A (en) * 2020-01-16 2020-06-12 创新工场(广州)人工智能研究有限公司 Dual-channel joint processing method and device
CN111274817A (en) * 2020-01-16 2020-06-12 北京航空航天大学 Intelligent software cost measurement method based on natural language processing technology
CN111241810A (en) * 2020-01-16 2020-06-05 百度在线网络技术(北京)有限公司 Punctuation prediction method and device
CN111310470A (en) * 2020-01-17 2020-06-19 西安交通大学 Chinese named entity recognition method fusing word and word features
CN111310470B (en) * 2020-01-17 2021-11-19 西安交通大学 Chinese named entity recognition method fusing word and word features
CN111274794A (en) * 2020-01-19 2020-06-12 浙江大学 Synonym expansion method based on transmission
CN111444720A (en) * 2020-03-30 2020-07-24 华南理工大学 Named entity recognition method for English text
CN111553158A (en) * 2020-04-21 2020-08-18 中国电力科学研究院有限公司 Method and system for identifying named entities in power scheduling field based on BilSTM-CRF model
CN111666418A (en) * 2020-04-23 2020-09-15 北京三快在线科技有限公司 Text regeneration method and device, electronic equipment and computer readable medium
CN111666418B (en) * 2020-04-23 2024-01-16 北京三快在线科技有限公司 Text regeneration method, device, electronic equipment and computer readable medium
CN111859967B (en) * 2020-06-12 2024-04-09 北京三快在线科技有限公司 Entity identification method and device and electronic equipment
CN111859967A (en) * 2020-06-12 2020-10-30 北京三快在线科技有限公司 Entity identification method and device and electronic equipment
CN111832306A (en) * 2020-07-09 2020-10-27 昆明理工大学 Image diagnosis report named entity identification method based on multi-feature fusion
CN111882124A (en) * 2020-07-20 2020-11-03 武汉理工大学 Homogeneous platform development effect prediction method based on generation confrontation simulation learning
CN111882124B (en) * 2020-07-20 2022-06-07 武汉理工大学 Homogeneous platform development effect prediction method based on generation confrontation simulation learning
CN111917861A (en) * 2020-07-28 2020-11-10 广东工业大学 Knowledge storage method and system based on block chain and knowledge graph and application thereof
CN112101009A (en) * 2020-09-23 2020-12-18 中国农业大学 Knowledge graph-based method for judging similarity of people relationship frame of dream of Red mansions
CN112101009B (en) * 2020-09-23 2024-03-26 中国农业大学 Method for judging similarity of red-building dream character relationship frames based on knowledge graph
CN112084783A (en) * 2020-09-24 2020-12-15 中国民航大学 Entity identification method and system based on civil aviation non-civilized passengers
CN112084783B (en) * 2020-09-24 2022-04-12 中国民航大学 Entity identification method and system based on civil aviation non-civilized passengers
CN113536793A (en) * 2020-10-14 2021-10-22 腾讯科技(深圳)有限公司 Entity identification method, device, equipment and storage medium
CN112541088B (en) * 2020-12-29 2022-05-17 浙大城市学院 Dangerous chemical library construction method based on knowledge graph
CN112541088A (en) * 2020-12-29 2021-03-23 浙大城市学院 Dangerous chemical library construction method based on knowledge graph
CN112989787A (en) * 2021-02-05 2021-06-18 杭州云嘉云计算有限公司 Text element extraction method
CN112906367A (en) * 2021-02-08 2021-06-04 上海宏原信息科技有限公司 Information extraction structure, labeling method and identification method of consumer text
CN112905884B (en) * 2021-02-10 2024-05-31 北京百度网讯科技有限公司 Method, apparatus, medium and program product for generating sequence annotation model
CN112905884A (en) * 2021-02-10 2021-06-04 北京百度网讯科技有限公司 Method, apparatus, medium, and program product for generating sequence annotation model
CN113255354B (en) * 2021-06-03 2021-12-07 北京达佳互联信息技术有限公司 Search intention recognition method, device, server and storage medium
CN113255354A (en) * 2021-06-03 2021-08-13 北京达佳互联信息技术有限公司 Search intention recognition method, device, server and storage medium
CN113821592A (en) * 2021-06-23 2021-12-21 腾讯科技(深圳)有限公司 Data processing method, device, equipment and storage medium
CN113392649A (en) * 2021-07-08 2021-09-14 上海浦东发展银行股份有限公司 Identification method, device, equipment and storage medium
CN113326380A (en) * 2021-08-03 2021-08-31 国能大渡河大数据服务有限公司 Equipment measurement data processing method, system and terminal based on deep neural network
CN113673248B (en) * 2021-08-23 2022-02-01 中国人民解放军32801部队 Named entity identification method for testing and identifying small sample text
CN113673248A (en) * 2021-08-23 2021-11-19 中国人民解放军32801部队 Named entity identification method for testing and identifying small sample text
CN114647727A (en) * 2022-03-17 2022-06-21 北京百度网讯科技有限公司 Model training method, device and equipment applied to entity information recognition
CN114691889A (en) * 2022-04-15 2022-07-01 中北大学 Method for constructing fault diagnosis knowledge map of turnout switch machine
CN114691889B (en) * 2022-04-15 2024-04-12 中北大学 Construction method of fault diagnosis knowledge graph of switch machine
CN116401369A (en) * 2023-06-07 2023-07-07 佰墨思(成都)数字技术有限公司 Entity identification and classification method for biological product production terms
CN116401369B (en) * 2023-06-07 2023-08-11 佰墨思(成都)数字技术有限公司 Entity identification and classification method for biological product production terms

Similar Documents

Publication Publication Date Title
CN110298042A (en) Film and television entity recognition method based on BiLSTM-CRF and knowledge graph
CN109189942B (en) Construction method and device of patent data knowledge graph
CN117076653B (en) Knowledge base question-answering method based on thinking chain and visual lifting context learning
CN112818093B (en) Evidence document retrieval method, system and storage medium based on semantic matching
CN112417097B (en) Multi-modal data feature extraction and association method for public opinion analysis
CN110188197B (en) Active learning method and device for labeling platform
CN112183094B (en) Chinese grammar debugging method and system based on multiple text features
CN111324771B (en) Video tag determination method and device, electronic equipment and storage medium
CN108304373B (en) Semantic dictionary construction method and device, storage medium and electronic device
CN104102721A (en) Method and device for recommending information
CN113505204B (en) Recall model training method, search recall device and computer equipment
CN112101027A (en) Chinese named entity recognition method based on reading understanding
CN109189862A (en) A kind of construction of knowledge base method towards scientific and technological information analysis
CN110704624A (en) Geographic information service metadata text multi-level multi-label classification method
CN111061939B (en) Scientific research academic news keyword matching recommendation method based on deep learning
CN110888991A (en) Sectional semantic annotation method in weak annotation environment
CN111814477B (en) Dispute focus discovery method and device based on dispute focus entity and terminal
CN114661872B (en) Beginner-oriented API self-adaptive recommendation method and system
CN113535949B (en) Multi-modal combined event detection method based on pictures and sentences
CN110941958A (en) Text category labeling method and device, electronic equipment and storage medium
CN113537304A (en) Cross-modal semantic clustering method based on bidirectional CNN
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN114239730B (en) Cross-modal retrieval method based on neighbor ordering relation
CN117573869A (en) Network connection resource key element extraction method
CN116483990B (en) Internet news content automatic generation method based on big data

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20191001