CN110298042A - Film and television entity recognition method based on BiLSTM-CRF and knowledge graph - Google Patents

Film and television entity recognition method based on BiLSTM-CRF and knowledge graph Download PDF

Info

Publication number
CN110298042A
CN110298042A (application CN201910572843.1A)
Authority
CN
China
Prior art keywords
vector
film and television
crf
entity recognition
bilstm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910572843.1A
Other languages
Chinese (zh)
Inventor
孙云云
唐军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Changhong Electric Co Ltd
Original Assignee
Sichuan Changhong Electric Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Changhong Electric Co Ltd filed Critical Sichuan Changhong Electric Co Ltd
Priority to CN201910572843.1A priority Critical patent/CN110298042A/en
Publication of CN110298042A publication Critical patent/CN110298042A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Animal Behavior & Ethology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Character Discrimination (AREA)

Abstract

The invention discloses a film and television entity recognition method based on BiLSTM-CRF and a knowledge graph. A character vector and a part-of-speech vector are obtained for the text to be recognised, the two vectors are combined by weighted summation, and the result is fed into a trained BiLSTM model to obtain a text feature sequence; the feature sequence is then passed to a trained CRF model to obtain the named entity recognition result for the text; finally, the recognised entities are queried in a film and television knowledge graph for further verification. The method can effectively extract entities from users' colloquial film and television search text, makes full use of the knowledge graph to uncover the user's implicit search intent, and improves the user experience. When labelled data is scarce, word vectors trained with a language model are used as the bottom-level input of the neural network, which improves training efficiency. The method has good application prospects and can be widely applied to entity recognition scenarios in many fields.

Description

Film and television entity recognition method based on BiLSTM-CRF and knowledge graph
Technical field
The present invention relates to the field of deep learning and natural language processing, and in particular to a film and television entity recognition method based on BiLSTM-CRF and a knowledge graph.
Background technique
Television sets are indispensable in almost every household, and new films and TV series go online almost every day, so people can search a large library of film and television resources from the TV, for example by director, actor, title or genre. How to extract film and television entities accurately and effectively, so as to help users quickly find the programmes they like, has therefore become an important requirement.
Traditional named entity recognition mostly relies on rules and statistical machine learning. Early systems used dictionary- and rule-based methods: rule bases and dictionaries built by linguistic experts, combined with pattern matching or string matching, were used to identify named entities. For highly regular text, rule-based methods are accurate and efficient, but for irregular text the rules become hard to write and recognition quality suffers; dictionary-based recognition also depends heavily on the dictionary and cannot identify out-of-vocabulary words. Attention therefore turned to machine learning methods.
Common machine learning methods in the field of named entity recognition include the hidden Markov model (HMM), conditional random fields (CRF), the maximum entropy model and the support vector machine (SVM); among these, the most typical and most successfully applied are the hidden Markov model and the conditional random field model. Machine learning methods outperform rule-based methods in recognition quality, but named entity recognition models based on statistical machine learning still have limitations. On the one hand, to keep inference tractable they require specific independence assumptions; on the other hand, they place high demands on feature selection: various features relevant to the named entity recognition task must be engineered (feature engineering), which strongly affects the result but is time-consuming and laborious. Moreover, frequency-based HMM and CRF methods can only associate the semantics of the word immediately preceding the current word, so recognition accuracy is limited, especially for out-of-vocabulary words. Finally, they usually require a large amount of task-specific knowledge, such as designing the state model of the HMM or choosing the input features of the CRF.
In recent years, with improvements in hardware and the emergence of distributed word representations (word embeddings), neural networks have become effective models for many NLP (natural language processing) tasks. These methods treat sequence labelling tasks such as POS tagging and NER (named entity recognition) similarly: each token is mapped from a discrete one-hot representation to a dense embedding in a low-dimensional space, the embedding sequence of the sentence is fed into an RNN (recurrent neural network), features are extracted automatically by the network, and a Softmax (normalised exponential) layer predicts the label of each token. This turns model training into an end-to-end process rather than a conventional pipeline, does not depend on feature engineering, and is data-driven; however, there are many network variants, performance depends heavily on parameter settings, and the models are hard to interpret. A further drawback is that each token is classified independently during labelling: the labels predicted earlier cannot be used directly (contextual information can only be passed implicitly through the hidden states), so the predicted label sequence may be invalid. For example, in the BIO tagging scheme the tag B-PER cannot be followed by I-LOC, but Softmax has no way to use this constraint.
Recently, academia has proposed the LSTM-CRF model (long short-term memory network plus conditional random field) for sequence labelling: a CRF (conditional random field) layer is attached after the LSTM (long short-term memory network) layer to perform sentence-level tag prediction, so that the labelling process is no longer an independent classification of each token.
Summary of the invention
In view of the above problems, the invention proposes a film and television entity recognition method based on BiLSTM-CRF and a knowledge graph, which solves the entity recognition problem for short, colloquial film and television text when labelled data is scarce.
The present invention achieves the above objective through the following technical solution:
A film and television entity recognition method based on BiLSTM-CRF and a knowledge graph comprises the following steps:
Step 1: collect film and television data in real time from major data sources, e.g. crawl Douban, Baidu Baike and the like for entity information such as title, actor, role and character relations, and build a film and television knowledge graph;
Step 2: collect, from television sets, film and television search queries that have been converted from speech to text; analyse the collected data and label the common user search statements that follow certain patterns, for model training and word vector training;
Step 3: train the entity recognition model, which consists of three parts: a feature representation layer, a BiLSTM layer and a CRF layer:
(1) Feature representation layer: composed of a part-of-speech vector and a character vector; the character vector is obtained by language model (LM) training, the part-of-speech vector is a one-hot vector obtained by part-of-speech tagging after word segmentation, and the part-of-speech vector layer and the character vector layer are combined by weight into the final word vector layer, so that the combination of part-of-speech and character-level vectors represents each word's features in a specific semantic space;
(2) BiLSTM: composed of a forward and a backward long short-term memory network (LSTM); the forward and backward LSTMs take the output of the feature representation layer as input and separately encode the left and right context of the current time step, and the two encodings are concatenated to form the score information to be decoded;
(3) CRF: the CRF layer takes the output scores of the BiLSTM as input and introduces a transition score matrix, selecting the optimal label sequence according to the total sequence score;
Step 4: result verification: the model predictions are verified by combining rules and the knowledge graph, improving entity recognition efficiency.
In a further embodiment, in step 2, frequency statistics and k-means cluster analysis are performed on the large volume of user data collected from television sets.
The beneficial effects of the present invention are:
The method can effectively extract entities from users' colloquial film and television search text, and makes full use of the film and television knowledge graph to uncover the user's implicit search intent, improving the user experience. When labelled data is scarce, word vectors trained with a language model are used as the bottom-level input of the neural network, which improves training efficiency. The method has good application prospects and can be widely applied to entity recognition scenarios in many fields.
Detailed description of the invention
In order to describe the technical solutions in the embodiments of the present invention more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is the film and television entity recognition flow chart of the method of the present invention;
Fig. 2 is the BiLSTM-CRF model structure of the method of the present invention.
Specific embodiment
To make the objectives, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention are described in detail below. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the scope protected by the present invention.
In any embodiment, as shown in Fig. 1, the film and television entity recognition method based on BiLSTM-CRF and a knowledge graph according to the invention comprises the following steps:
Step 1: collect film and television data in real time from major data sources, e.g. crawl Douban, Baidu Baike and the like for entity information such as title, actor, role and character relations, and build a film and television knowledge graph;
Step 2: collect, from television sets, film and television search queries that have been converted from speech to text; analyse the collected data and label the common user search statements that follow certain patterns, for model training and word vector training;
Step 3: train the entity recognition model, which consists of three parts: a feature representation layer, a BiLSTM layer and a CRF layer:
(1) Feature representation layer: composed of a part-of-speech vector and a character vector; the character vector is obtained by language model (LM) training, the part-of-speech vector is a one-hot vector obtained by part-of-speech tagging after word segmentation, and the part-of-speech vector layer and the character vector layer are combined by weight into the final word vector layer, so that the combination of part-of-speech and character-level vectors represents each word's features in a specific semantic space;
(2) BiLSTM: composed of a forward and a backward long short-term memory network (LSTM); the forward and backward LSTMs take the output of the feature representation layer as input and separately encode the left and right context of the current time step, and the two encodings are concatenated to form the score information to be decoded;
(3) CRF: the CRF layer takes the output scores of the BiLSTM as input and introduces a transition score matrix, selecting the optimal label sequence according to the total sequence score;
Step 4: result verification: the model predictions are verified by combining rules and the knowledge graph, improving entity recognition efficiency.
In step 2, frequency statistics and k-means cluster analysis are performed on the large volume of user data collected from television sets.
The k-means clustering algorithm is summarised as follows:
Select K points as the initial centroids;
Assign each point to the nearest centroid, forming K clusters;
Recompute the centroid of each cluster;
Repeat until the clusters no longer change or the maximum number of iterations is reached.
Here, 15 clusters are chosen after parameter tuning and testing, so that sentences with similar user phrasing and intent are clustered together.
Combining the frequency analysis with the clustering results, the typical film and television search sentences of users are anticipated, the entity types to be recognised are determined, and the labels are given uniform names; there are currently 27 labels.
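For illustration only, the query clustering described above could be implemented along the following lines; scikit-learn and the step that turns each query into a fixed-length vector are assumptions, not prescribed by the patent.

```python
# Minimal sketch of the query-clustering step (assumptions: scikit-learn is
# available and queries have already been turned into fixed-length vectors,
# e.g. by averaging word2vec vectors; none of this is mandated by the patent).
import numpy as np
from sklearn.cluster import KMeans

def cluster_queries(query_vectors: np.ndarray, n_clusters: int = 15):
    """Group user search queries into intent clusters (15 clusters per the text)."""
    km = KMeans(n_clusters=n_clusters, max_iter=300, n_init=10, random_state=0)
    labels = km.fit_predict(query_vectors)   # cluster index for each query
    return labels, km.cluster_centers_

# Example usage with random placeholder vectors:
# labels, centers = cluster_queries(np.random.rand(1000, 300))
```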
Language model pre-training: before word vector training, the data are preprocessed by removing special punctuation, normalising the case of English text, and so on. The cleaned, large-scale user behaviour data are then used to train word vectors with the word2vec implementation in the gensim toolkit, producing 300-dimensional vectors. For the part-of-speech vector, each sentence is first segmented with jieba; the segmentation dictionary contains the fixed user phrases obtained from data analysis, such as 'I want to watch' and 'I want to listen to', each annotated with its part of speech and weight. The part-of-speech sequence of each sentence is then converted into a 300-dimensional one-hot part-of-speech vector, which is added to the character vector trained by word2vec with a certain weight; the resulting word vectors serve as the initial parameters of the bidirectional LSTM network.
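A minimal sketch of this pre-training step, using the gensim and jieba tools named above; the mixing weights, the POS inventory and the helper names are illustrative assumptions not specified in the text.

```python
# Sketch of the word-vector pre-training described above (gensim word2vec +
# jieba segmentation). The 0.7/0.3 mixing weights and the POS inventory are
# illustrative assumptions, not values given in the patent.
import jieba.posseg as pseg
import numpy as np
from gensim.models import Word2Vec

DIM = 300  # the text specifies 300-dimensional vectors

def train_char_vectors(corpus):            # corpus: list of query strings
    """Train character-level word2vec vectors on the user query corpus."""
    sentences = [list(q) for q in corpus]  # character-level tokens
    return Word2Vec(sentences, vector_size=DIM, window=5, min_count=1, workers=4)

def pos_one_hot(flag, pos_list):
    """One-hot part-of-speech vector padded to DIM dimensions."""
    vec = np.zeros(DIM)
    if flag in pos_list:
        vec[pos_list.index(flag)] = 1.0
    return vec

def query_vectors(query, w2v, pos_list, w_char=0.7, w_pos=0.3):
    """Weighted sum of character vector and POS one-hot vector per character."""
    out = []
    for word, flag in pseg.cut(query):     # jieba segmentation with POS tags
        for ch in word:
            char_vec = w2v.wv[ch] if ch in w2v.wv else np.zeros(DIM)
            out.append(w_char * char_vec + w_pos * pos_one_hot(flag, pos_list))
    return np.stack(out)
```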
Using word vectors trained on a large amount of real data alleviates, to some extent, the problem of doing entity recognition with a deep neural network when labelled data is scarce: the initial parameters of the BiLSTM network are no longer meaningless random values, the word vectors trained on large data already capture initial information such as Chinese character radicals and serve as the bottom-level input of the network, and the additional part-of-speech vector helps the network identify the domain of the input text effectively.
Model training data preparation: data annotation.
Frequently occurring data covering each label are filtered out of the user data and annotated manually, using the BIO tag set employed in the Bakeoff-3 evaluation, in the following form:
I want to watch Liu Dehua's A World Without Thieves;
O O O B-actor I-actor I-actor O B-movie_name I-movie_name I-movie_name I-movie_name;
So that no label becomes too sparse, tags whose type occurs too rarely are merged into a single label; after prediction, the knowledge graph is queried, by priority, to verify them. There are currently 25,674 items of model training data; as user demand changes, the model is iterated continuously and the corresponding training data will also grow.
Model training:
All annotated training data are divided into a training set, a test set and a validation set in the ratio 0.6 : 0.3 : 0.1.
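For illustration, the split could be implemented as follows; scikit-learn and shuffling are assumptions, the patent only specifies the 0.6/0.3/0.1 ratio.

```python
# Sketch of the 0.6 / 0.3 / 0.1 split described above; the shuffling and the
# use of scikit-learn are illustrative assumptions.
from sklearn.model_selection import train_test_split

def split_dataset(samples, labels, seed=0):
    """Split annotated sentences into train / test / validation sets (0.6/0.3/0.1)."""
    x_train, x_rest, y_train, y_rest = train_test_split(
        samples, labels, train_size=0.6, random_state=seed)
    # 0.3 / 0.1 of the total is 0.75 / 0.25 of the remaining 40 %
    x_test, x_val, y_test, y_val = train_test_split(
        x_rest, y_rest, train_size=0.75, random_state=seed)
    return (x_train, y_train), (x_test, y_test), (x_val, y_val)
```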
Taking the sentence as the unit, a sentence containing n characters (a sequence of characters) is denoted
x = (x_1, x_2, ..., x_n),
where x_i is the id, in the dictionary, of the i-th character of the sentence; a word-to-id vector can thus be obtained for each character, and its dimension is the dictionary size.
Here the dictionary is built by counting the frequency of each character over all training data and sorting from high to low; each character then receives a unique id, and out-of-vocabulary characters are mapped to the marker 'UNK'.
The first layer of the model is the look-up layer. Using a pre-trained or randomly initialised embedding matrix, each character x_i of the sentence is mapped from a one-hot vector to a dense, low-dimensional character embedding x_i ∈ R^d, where d is the embedding dimension. Dropout is applied before the input is passed to the next layer, to alleviate overfitting.
The second layer of the model is the bidirectional LSTM layer, which extracts sentence features automatically. The character embedding sequence (x_1, x_2, ..., x_n) of a sentence is used as the input of each time step of the bidirectional LSTM; the hidden state sequence output by the forward LSTM and the hidden state sequence output by the backward LSTM are then concatenated position by position, h_t = [h_t(forward); h_t(backward)] ∈ R^m, giving the complete hidden state sequence
(h_1, h_2, ..., h_n) ∈ R^(n×m).
After dropout, a linear layer maps the hidden state vectors from m dimensions to k dimensions, where k is the number of tags in the tag set, giving the automatically extracted sentence features, denoted as the matrix P = (p_1, p_2, ..., p_n) ∈ R^(n×k). Each component p_ij of p_i ∈ R^k can be regarded as the score for classifying character x_i into the j-th tag. Applying Softmax to P would amount to an independent k-class classification at each position, but such labelling cannot exploit the tags that have already been assigned, so a CRF layer is added next to perform the labelling.
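A minimal sketch of the look-up layer, BiLSTM layer and linear projection described above, assuming PyTorch; the framework, layer sizes and dropout rate are illustrative, not prescribed by the patent.

```python
# Sketch of the look-up layer, BiLSTM layer and linear projection to tag scores.
# PyTorch and all hyper-parameters are illustrative assumptions.
import torch
import torch.nn as nn

class BiLstmEncoder(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=300, hidden_dim=128, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)        # look-up layer
        self.dropout = nn.Dropout(dropout)                        # applied before the BiLSTM
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                              bidirectional=True)                 # forward + backward LSTM
        self.hidden2tag = nn.Linear(2 * hidden_dim, num_tags)     # m -> k projection

    def forward(self, char_ids):                                  # (batch, seq_len)
        emb = self.dropout(self.embedding(char_ids))
        hidden, _ = self.bilstm(emb)                              # (batch, seq_len, 2*hidden_dim)
        return self.hidden2tag(self.dropout(hidden))              # emission scores P, (batch, seq_len, k)

# Example: scores = BiLstmEncoder(5000, 27)(torch.randint(0, 5000, (2, 10)))
```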
The third layer of the model is the CRF layer, which performs sentence-level sequence labelling. The parameter of the CRF layer is a (k+2) × (k+2) matrix A, where A_ij denotes the transition score from the i-th tag to the j-th tag, so that labelling a position can make use of the tags already assigned; 2 is added because a start state is introduced for the beginning of the sentence and an end state for its end. For a tag sequence y = (y_1, y_2, ..., y_n) whose length equals the sentence length, the model's score for labelling sentence x with the sequence y is
score(x, y) = Σ_{i=1..n} (A_{y_(i-1), y_i} + P_{i, y_i}),
that is, the score of the whole sequence equals the sum of the scores of all positions, and the score of each position consists of two parts: one determined by the LSTM output p_i and the other by the transition matrix A of the CRF. The normalised probability is then obtained with a Softmax over all possible tag sequences:
P(y | x) = exp(score(x, y)) / Σ_{y'} exp(score(x, y')).
During model training, the log-likelihood of a training sample (x, y_x) is maximised; the log-likelihood is given by
log P(y_x | x) = score(x, y_x) − log Σ_{y'} exp(score(x, y')).
When implementing this formula, note that the logarithm of a sum of exponentials should be computed with the log-sum-exp trick, log Σ_i exp(z_i) = a + log Σ_i exp(z_i − a) with a = max_i z_i, for numerical stability; the second term of the above formula is computed efficiently with the forward-backward algorithm of the CRF.
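Continuing the same illustrative PyTorch setup, the sequence score and the log-sum-exp partition term could be computed roughly as follows; this is a sketch, not the patent's implementation, and the start/end tag indices are assumptions.

```python
# Sketch of the CRF sequence score and the log-sum-exp partition term,
# continuing the illustrative PyTorch setup above (not the patent's code).
import torch

def sequence_score(emissions, tags, transitions, start_tag, end_tag):
    """score(x, y) = sum_i (A[y_{i-1}, y_i] + P[i, y_i]) including start/end transitions."""
    score = transitions[start_tag, tags[0]] + emissions[0, tags[0]]
    for i in range(1, len(tags)):
        score = score + transitions[tags[i - 1], tags[i]] + emissions[i, tags[i]]
    return score + transitions[tags[-1], end_tag]

def log_partition(emissions, transitions, start_tag, end_tag):
    """log sum_y' exp(score(x, y')) computed with the forward algorithm and logsumexp."""
    alpha = transitions[start_tag] + emissions[0]          # scores for the first position
    for t in range(1, emissions.size(0)):
        # alpha[j] = logsumexp_i(alpha[i] + A[i, j]) + P[t, j]
        alpha = torch.logsumexp(alpha.unsqueeze(1) + transitions, dim=0) + emissions[t]
    return torch.logsumexp(alpha + transitions[:, end_tag], dim=0)
```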
At prediction (decoding) time, the model finds the optimal path with the Viterbi algorithm, a dynamic programming method.
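A corresponding sketch of Viterbi decoding over the same emission and transition scores; illustrative only, with the same assumed start/end tag conventions.

```python
# Sketch of Viterbi decoding over the same emission/transition scores
# (illustrative only; tag indices for start/end are assumptions).
import torch

def viterbi_decode(emissions, transitions, start_tag, end_tag):
    """Return the highest-scoring tag sequence for one sentence."""
    seq_len, num_tags = emissions.shape
    score = transitions[start_tag] + emissions[0]          # best score ending in each tag
    backpointers = []
    for t in range(1, seq_len):
        cand = score.unsqueeze(1) + transitions            # cand[i, j] = score[i] + A[i, j]
        best_prev = cand.argmax(dim=0)                     # best previous tag for each current tag
        score = cand.max(dim=0).values + emissions[t]
        backpointers.append(best_prev)
    score = score + transitions[:, end_tag]
    best_last = int(score.argmax())
    path = [best_last]
    for bp in reversed(backpointers):                      # trace back the best path
        path.append(int(bp[path[-1]]))
    return list(reversed(path))
```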
The reason for using BiLSTM plus CRF rather than BiLSTM alone is that the CRF layer can learn constraint rules from the training data and add constraints to the final predicted labels to ensure that they are valid. These constraints are learned automatically by the CRF layer during training.
These constraints may include:
I. The first tag in a sentence always starts with 'B-' or 'O', never 'I-';
II. In a pattern 'B-label1 I-label2 I-label3 ...', label1, label2 and label3 should belong to the same entity class: for example, 'B-Person I-Person' is a valid sequence, while 'B-Person I-Organization' is an invalid tag sequence;
III. A tag sequence beginning 'O I-label1' is invalid: the first tag of an entity should be 'B-', not 'I-'; in other words, the valid sequence is 'O B-label1'.
With these constraints, the probability of an invalid sequence appearing in the predicted tag sequence is greatly reduced. Since the BiLSTM output is a score for each tag of each unit, one could simply pick the highest-scoring tag for each unit; this would usually give the correct tag for each unit of sentence x, but it cannot guarantee that every prediction is correct. The structure of the whole model is shown in Fig. 2.
In addition, the data analysis process of the invention is as follows:
From the large volume of collected user data, the basic film and television search needs of users are analysed with k-means clustering, frequency statistics and the like: common search patterns are identified, for example which conditions users search videos by, and the entity classes and names are determined in combination with business requirements. The training data are then annotated manually according to the BIO standard. Because no ready-made labelled data are available, 300-dimensional character vectors are trained with the word2vec language model on a large amount of real user data together with one-hot part-of-speech vectors, and the character vectors and part-of-speech vectors of the text are merged with a certain weight as the bottom-level input of the bidirectional LSTM.
The entity recognition model is trained as follows:
All annotated training data are divided into a training set, a test set and a validation set in the ratio 0.6 : 0.3 : 0.1.
Taking the sentence as the unit, a sentence containing n characters (a sequence of characters) is denoted
x = (x_1, x_2, ..., x_n),
where x_i is the id, in the dictionary, of the i-th character of the sentence; a word-to-id vector can thus be obtained for each character, and its dimension is the dictionary size.
Here the dictionary is built by counting the frequency of each character over all training data and sorting from high to low; each character then receives a unique id, and out-of-vocabulary characters are mapped to the marker 'UNK'.
Model training mainly comprises the following three parts:
1. Input character/word vector representation.
Each word is represented with a dense vector, loading the pre-trained word vectors (word2vec) and the part-of-speech vectors. Some meaning is extracted from the individual characters, and the meaning of the sentence is obtained from the part-of-speech vector. For each word, a vector is constructed to capture both the meaning of the word and the features useful for entity recognition; this vector is the weighted combination of the word vector trained by word2vec and the feature vector extracted from the part of speech.
2. Contextual semantic representation of the text: for each word in context, a meaningful vector representation is needed, and it is obtained with the BiLSTM. After the final vector representation of each word is obtained, a bi-LSTM is run over the word vector sequence, using the hidden state at every time step rather than only the final state. From m input word vectors, m hidden state vectors are obtained; whereas the word vectors only contain word-level information, the hidden state vectors take the context into account.
3. Decoding: once the vector representation of each word is available, the entity tags are predicted.
In the decoding stage the tag scores are computed; the final prediction uses the hidden state vector of each word, and the score of each entity tag can be obtained with a fully connected neural network.
Assuming there are 9 classes, W ∈ R^(9×k) and b ∈ R^9 are used to compute the score s = W·h + b ∈ R^9, where s[i] can be understood as the score that word w receives for tag i.
A linear CRF is applied to the entity tag scores: the softmax approach makes only local choices; in other words, even though the hidden state h produced by the bi-LSTM contains some contextual information, the tag decision is still local and the surrounding tags are not used to aid the decision. For example, for '杨幂' (Yang Mi), once the second character has been given the tag 'I-actor', this should help determine that the first character marks the start of the actor entity; the linear CRF defines such a global score.
Finally, the trained model and the relevant parameters are saved.
The data preprocessing process is as follows:
This step mainly removes special characters and the like from the data before model prediction, and converts the text data into the format required for prediction, i.e. the text is converted into a wordId vector whose dimension is the length of the training data dictionary.
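A minimal sketch of this preprocessing, assuming a simple character-frequency dictionary; the regular expression and helper names are illustrative assumptions.

```python
# Sketch of the preprocessing step: strip special characters and map each
# character to its dictionary id ('UNK' for out-of-vocabulary characters).
# The regex and dictionary format are illustrative assumptions.
import re

def build_dictionary(training_sentences):
    """Assign ids by descending character frequency, reserving an 'UNK' id."""
    freq = {}
    for sent in training_sentences:
        for ch in sent:
            freq[ch] = freq.get(ch, 0) + 1
    ordered = sorted(freq, key=freq.get, reverse=True)
    word2id = {ch: i + 1 for i, ch in enumerate(ordered)}
    word2id["UNK"] = 0
    return word2id

def text_to_ids(text, word2id):
    """Clean the query and convert it into a sequence of character ids."""
    cleaned = re.sub(r"[^\w\u4e00-\u9fa5]", "", text)   # drop special characters
    return [word2id.get(ch, word2id["UNK"]) for ch in cleaned]
```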
Model prediction is as follows:
The preprocessed data are fed into the model for prediction; possible prediction results are as follows (the tag sequences correspond to the characters of the original Chinese queries):
(1) I want to watch Zhang Yimou's Coming Home;
O O O B-director_name I-director_name I-director_name O O B-movie_name I-movie_name;
(2) Liu Dehua's A Chinese Odyssey;
B-actor I-actor I-actor O B-movie_name I-movie_name I-movie_name;
(3) the film starring Deng Chao's wife;
B-actor I-actor O O B-relation I-relation O O O;
(4) cool life cannot be sad;
O O O O O O B-movie_name I-movie_name I-movie_name;
(5) recommend a popular film;
O O O O O O O O O;
1. Handling prediction results with no entities:
For cases such as (5) above, in which the model prediction contains no entity:
Data processing: 1) strip redundant leading and trailing filler such as 'I want to watch', 'I am going to watch', 'play', 'is there', and so on;
2) extract rule-based entities such as episode/season/part, version and language with regular expressions. Values such as language, version and country, which rarely change and are limited in number, are maintained in advance; these data also exist in the knowledge graph, and the invention keeps this part of the data in memory as a dictionary, in a form such as {'English': 'English', 'English-language': 'English', 'foreign language': 'English'}, so that all synonyms of a value map to one canonical form. Such entities are matched with regular expressions and then removed from the query. For example, for 'I want to watch Fast and Furious, English version', if the model predicts no entity, the leading/trailing redundancy and the special (version/language) entity are stripped, and the remaining text 'Fast and Furious' is searched in the knowledge graph to obtain the corresponding entity result.
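For illustration, a rough sketch of this rule-based post-processing; the filler phrases, synonym entries and helper names are assumptions for illustration only.

```python
# Sketch of the rule-based post-processing for queries where the model finds
# no entity: strip filler phrases, pull out version/language attributes via a
# synonym dictionary, and keep the remainder for a knowledge-graph lookup.
# All phrases and dictionary entries below are illustrative assumptions.
import re

FILLERS = ["I want to watch", "I want to see", "play", "is there"]
LANGUAGE_SYNONYMS = {"English": "English", "English-language": "English",
                     "foreign language": "English"}

def rule_based_extract(query: str):
    """Return (cleaned_query, attributes) for a query with no model-predicted entity."""
    attributes = {}
    for phrase in FILLERS:                                   # strip leading/trailing filler
        query = query.replace(phrase, "")
    for synonym, canonical in LANGUAGE_SYNONYMS.items():     # normalise language mentions
        if synonym.lower() in query.lower():
            attributes["language"] = canonical
            query = re.sub(re.escape(synonym) + r"(\s*version)?", "", query,
                           flags=re.IGNORECASE)
    return query.strip(), attributes

# Example: rule_based_extract("I want to watch Fast and Furious English version")
# -> ("Fast and Furious", {"language": "English"})
```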
2. Handling prediction results that contain entities:
For predictions (1), (2), (3) and (4) above, which contain entity labels, the corresponding entities are searched in the knowledge graph to verify whether each entity really exists. For example, in (2) Liu Dehua did not in fact act in A Chinese Odyssey, so other Liu Dehua films are recommended to the user instead of telling the user that the film was not found, which improves the user experience. In prediction (3) what the user actually wants to watch are Sun Li's films; this is the mining of an abstract entity relation and better satisfies the user's need. Knowledge graph verification thus further improves the entity results. For (4), although there is an entity result, no corresponding title entity is found in the knowledge graph, so the prediction is treated as a failure and the following entity result packaging and output processing is executed.
Entity result packaging and output are as follows:
Logically inconsistent entity predictions are handled; for example, for 'Liu Dehua season 3', where recognition gives actor: Liu Dehua and season: 3, the season entity is deleted and the result is then packaged.
The above is only a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto; any change or replacement that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall be covered by the scope of protection of the present invention. The specific technical features described in the above specific embodiments may be combined in any suitable manner as long as they do not contradict each other; to avoid unnecessary repetition, the present invention does not describe the possible combinations further. The various embodiments of the present invention may also be combined arbitrarily, and such combinations shall likewise be regarded as disclosed by the present invention as long as they do not depart from the idea of the present invention.

Claims (2)

1. A film and television entity recognition method based on BiLSTM-CRF and a knowledge graph, characterised by comprising the following steps:
Step 1: collect film and television data in real time from major data sources, crawl entity information such as title, actor, role and character relations, and build a film and television knowledge graph;
Step 2: collect, from television sets, film and television search queries that have been converted from speech to text; analyse the collected data and label the common user search statements that follow certain patterns, for model training and word vector training;
Step 3: train the entity recognition model, which consists of three parts: a feature representation layer, a BiLSTM layer and a CRF layer:
(1) Feature representation layer: composed of a part-of-speech vector and a character vector; the character vector is obtained by language model (LM) training, the part-of-speech vector is a one-hot vector obtained by part-of-speech tagging after word segmentation, and the part-of-speech vector layer and the character vector layer are combined by weight into the final word vector layer, so that the combination of part-of-speech and character-level vectors represents each word's features in a specific semantic space;
(2) BiLSTM: composed of a forward and a backward long short-term memory network (LSTM); the forward and backward LSTMs take the output of the feature representation layer as input and separately encode the left and right context of the current time step, and the two encodings are concatenated to form the score information to be decoded;
(3) CRF: the CRF layer takes the output scores of the BiLSTM as input and introduces a transition score matrix, selecting the optimal label sequence according to the total sequence score;
Step 4: result verification: the model predictions are verified by combining rules and the knowledge graph, improving entity recognition efficiency.
2. The film and television entity recognition method based on BiLSTM-CRF and a knowledge graph according to claim 1, characterised in that in step 2, frequency statistics and k-means cluster analysis are performed on the large volume of user data collected from television sets.
CN201910572843.1A 2019-06-26 2019-06-26 Film and television entity recognition method based on BiLSTM-CRF and knowledge graph Pending CN110298042A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910572843.1A CN110298042A (en) 2019-06-26 2019-06-26 Film and television entity recognition method based on BiLSTM-CRF and knowledge graph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910572843.1A CN110298042A (en) 2019-06-26 2019-06-26 Film and television entity recognition method based on BiLSTM-CRF and knowledge graph

Publications (1)

Publication Number Publication Date
CN110298042A true CN110298042A (en) 2019-10-01

Family

ID=68029238

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910572843.1A Pending CN110298042A (en) 2019-06-26 2019-06-26 Film and television entity recognition method based on BiLSTM-CRF and knowledge graph

Country Status (1)

Country Link
CN (1) CN110298042A (en)

Cited By (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110716991A (en) * 2019-10-11 2020-01-21 掌阅科技股份有限公司 Method for displaying entity associated information based on electronic book and electronic equipment
CN110782881A (en) * 2019-10-25 2020-02-11 四川长虹电器股份有限公司 Video entity error correction method after speech recognition and entity recognition
CN110807324A (en) * 2019-10-09 2020-02-18 四川长虹电器股份有限公司 Video entity identification method based on IDCNN-crf and knowledge graph
CN110909174A (en) * 2019-11-19 2020-03-24 南京航空航天大学 Knowledge graph-based method for improving entity link in simple question answering
CN111090754A (en) * 2019-11-20 2020-05-01 新华智云科技有限公司 Method for automatically constructing movie comprehensive knowledge map based on encyclopedic entries
CN111125378A (en) * 2019-12-25 2020-05-08 同方知网(北京)技术有限公司 Closed-loop entity extraction method based on automatic sample labeling
CN111159017A (en) * 2019-12-17 2020-05-15 北京中科晶上超媒体信息技术有限公司 Test case generation method based on slot filling
CN111241810A (en) * 2020-01-16 2020-06-05 百度在线网络技术(北京)有限公司 Punctuation prediction method and device
CN111274788A (en) * 2020-01-16 2020-06-12 创新工场(广州)人工智能研究有限公司 Dual-channel joint processing method and device
CN111274817A (en) * 2020-01-16 2020-06-12 北京航空航天大学 Intelligent software cost measurement method based on natural language processing technology
CN111274794A (en) * 2020-01-19 2020-06-12 浙江大学 Synonym expansion method based on transmission
CN111310470A (en) * 2020-01-17 2020-06-19 西安交通大学 Chinese named entity recognition method fusing word and word features
CN111444720A (en) * 2020-03-30 2020-07-24 华南理工大学 Named entity recognition method for English text
CN111553158A (en) * 2020-04-21 2020-08-18 中国电力科学研究院有限公司 Method and system for identifying named entities in power scheduling field based on BilSTM-CRF model
CN111666418A (en) * 2020-04-23 2020-09-15 北京三快在线科技有限公司 Text regeneration method and device, electronic equipment and computer readable medium
CN111832306A (en) * 2020-07-09 2020-10-27 昆明理工大学 Image diagnosis report named entity identification method based on multi-feature fusion
CN111859967A (en) * 2020-06-12 2020-10-30 北京三快在线科技有限公司 Entity identification method and device and electronic equipment
CN111882124A (en) * 2020-07-20 2020-11-03 武汉理工大学 Homogeneous platform development effect prediction method based on generation confrontation simulation learning
CN111917861A (en) * 2020-07-28 2020-11-10 广东工业大学 Knowledge storage method and system based on block chain and knowledge graph and application thereof
CN112084783A (en) * 2020-09-24 2020-12-15 中国民航大学 Entity identification method and system based on civil aviation non-civilized passengers
CN112101009A (en) * 2020-09-23 2020-12-18 中国农业大学 Knowledge graph-based method for judging similarity of people relationship frame of dream of Red mansions
CN112541088A (en) * 2020-12-29 2021-03-23 浙大城市学院 Dangerous chemical library construction method based on knowledge graph
CN112905884A (en) * 2021-02-10 2021-06-04 北京百度网讯科技有限公司 Method, apparatus, medium, and program product for generating sequence annotation model
CN112906367A (en) * 2021-02-08 2021-06-04 上海宏原信息科技有限公司 Information extraction structure, labeling method and identification method of consumer text
CN112989787A (en) * 2021-02-05 2021-06-18 杭州云嘉云计算有限公司 Text element extraction method
CN113255354A (en) * 2021-06-03 2021-08-13 北京达佳互联信息技术有限公司 Search intention recognition method, device, server and storage medium
CN113326380A (en) * 2021-08-03 2021-08-31 国能大渡河大数据服务有限公司 Equipment measurement data processing method, system and terminal based on deep neural network
CN113392649A (en) * 2021-07-08 2021-09-14 上海浦东发展银行股份有限公司 Identification method, device, equipment and storage medium
CN113536793A (en) * 2020-10-14 2021-10-22 腾讯科技(深圳)有限公司 Entity identification method, device, equipment and storage medium
CN113673248A (en) * 2021-08-23 2021-11-19 中国人民解放军32801部队 Named entity identification method for testing and identifying small sample text
CN113821592A (en) * 2021-06-23 2021-12-21 腾讯科技(深圳)有限公司 Data processing method, device, equipment and storage medium
CN114647727A (en) * 2022-03-17 2022-06-21 北京百度网讯科技有限公司 Model training method, device and equipment applied to entity information recognition
CN114691889A (en) * 2022-04-15 2022-07-01 中北大学 Method for constructing fault diagnosis knowledge map of turnout switch machine
CN116401369A (en) * 2023-06-07 2023-07-07 佰墨思(成都)数字技术有限公司 Entity identification and classification method for biological product production terms

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108763218A (en) * 2018-06-04 2018-11-06 四川长虹电器股份有限公司 A kind of video display retrieval entity recognition method based on CRF
CN108874997A (en) * 2018-06-13 2018-11-23 广东外语外贸大学 A kind of name name entity recognition method towards film comment
CN109033374A (en) * 2018-07-27 2018-12-18 四川长虹电器股份有限公司 Knowledge mapping search method based on Bayes classifier

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108763218A (en) * 2018-06-04 2018-11-06 四川长虹电器股份有限公司 A kind of video display retrieval entity recognition method based on CRF
CN108874997A (en) * 2018-06-13 2018-11-23 广东外语外贸大学 A kind of name name entity recognition method towards film comment
CN109033374A (en) * 2018-07-27 2018-12-18 四川长虹电器股份有限公司 Knowledge mapping search method based on Bayes classifier

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
周浩 et al., "Chinese opinion target extraction integrating semantic and syntactic information", CAAI Transactions on Intelligent Systems *

Cited By (48)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110807324A (en) * 2019-10-09 2020-02-18 四川长虹电器股份有限公司 Video entity identification method based on IDCNN-crf and knowledge graph
CN110716991A (en) * 2019-10-11 2020-01-21 掌阅科技股份有限公司 Method for displaying entity associated information based on electronic book and electronic equipment
CN110782881A (en) * 2019-10-25 2020-02-11 四川长虹电器股份有限公司 Video entity error correction method after speech recognition and entity recognition
CN110909174B (en) * 2019-11-19 2022-01-04 南京航空航天大学 Knowledge graph-based method for improving entity link in simple question answering
CN110909174A (en) * 2019-11-19 2020-03-24 南京航空航天大学 Knowledge graph-based method for improving entity link in simple question answering
CN111090754A (en) * 2019-11-20 2020-05-01 新华智云科技有限公司 Method for automatically constructing movie comprehensive knowledge map based on encyclopedic entries
CN111090754B (en) * 2019-11-20 2023-04-07 新华智云科技有限公司 Method for automatically constructing movie comprehensive knowledge map based on encyclopedic entries
CN111159017A (en) * 2019-12-17 2020-05-15 北京中科晶上超媒体信息技术有限公司 Test case generation method based on slot filling
CN111125378A (en) * 2019-12-25 2020-05-08 同方知网(北京)技术有限公司 Closed-loop entity extraction method based on automatic sample labeling
CN111274788A (en) * 2020-01-16 2020-06-12 创新工场(广州)人工智能研究有限公司 Dual-channel joint processing method and device
CN111274817A (en) * 2020-01-16 2020-06-12 北京航空航天大学 Intelligent software cost measurement method based on natural language processing technology
CN111241810A (en) * 2020-01-16 2020-06-05 百度在线网络技术(北京)有限公司 Punctuation prediction method and device
CN111310470A (en) * 2020-01-17 2020-06-19 西安交通大学 Chinese named entity recognition method fusing word and word features
CN111310470B (en) * 2020-01-17 2021-11-19 西安交通大学 Chinese named entity recognition method fusing word and word features
CN111274794A (en) * 2020-01-19 2020-06-12 浙江大学 Synonym expansion method based on transmission
CN111444720A (en) * 2020-03-30 2020-07-24 华南理工大学 Named entity recognition method for English text
CN111553158A (en) * 2020-04-21 2020-08-18 中国电力科学研究院有限公司 Method and system for identifying named entities in power scheduling field based on BilSTM-CRF model
CN111666418A (en) * 2020-04-23 2020-09-15 北京三快在线科技有限公司 Text regeneration method and device, electronic equipment and computer readable medium
CN111666418B (en) * 2020-04-23 2024-01-16 北京三快在线科技有限公司 Text regeneration method, device, electronic equipment and computer readable medium
CN111859967B (en) * 2020-06-12 2024-04-09 北京三快在线科技有限公司 Entity identification method and device and electronic equipment
CN111859967A (en) * 2020-06-12 2020-10-30 北京三快在线科技有限公司 Entity identification method and device and electronic equipment
CN111832306A (en) * 2020-07-09 2020-10-27 昆明理工大学 Image diagnosis report named entity identification method based on multi-feature fusion
CN111882124A (en) * 2020-07-20 2020-11-03 武汉理工大学 Homogeneous platform development effect prediction method based on generation confrontation simulation learning
CN111882124B (en) * 2020-07-20 2022-06-07 武汉理工大学 Homogeneous platform development effect prediction method based on generation confrontation simulation learning
CN111917861A (en) * 2020-07-28 2020-11-10 广东工业大学 Knowledge storage method and system based on block chain and knowledge graph and application thereof
CN112101009A (en) * 2020-09-23 2020-12-18 中国农业大学 Knowledge graph-based method for judging similarity of people relationship frame of dream of Red mansions
CN112101009B (en) * 2020-09-23 2024-03-26 中国农业大学 Method for judging similarity of red-building dream character relationship frames based on knowledge graph
CN112084783A (en) * 2020-09-24 2020-12-15 中国民航大学 Entity identification method and system based on civil aviation non-civilized passengers
CN112084783B (en) * 2020-09-24 2022-04-12 中国民航大学 Entity identification method and system based on civil aviation non-civilized passengers
CN113536793A (en) * 2020-10-14 2021-10-22 腾讯科技(深圳)有限公司 Entity identification method, device, equipment and storage medium
CN112541088B (en) * 2020-12-29 2022-05-17 浙大城市学院 Dangerous chemical library construction method based on knowledge graph
CN112541088A (en) * 2020-12-29 2021-03-23 浙大城市学院 Dangerous chemical library construction method based on knowledge graph
CN112989787A (en) * 2021-02-05 2021-06-18 杭州云嘉云计算有限公司 Text element extraction method
CN112906367A (en) * 2021-02-08 2021-06-04 上海宏原信息科技有限公司 Information extraction structure, labeling method and identification method of consumer text
CN112905884B (en) * 2021-02-10 2024-05-31 北京百度网讯科技有限公司 Method, apparatus, medium and program product for generating sequence annotation model
CN112905884A (en) * 2021-02-10 2021-06-04 北京百度网讯科技有限公司 Method, apparatus, medium, and program product for generating sequence annotation model
CN113255354B (en) * 2021-06-03 2021-12-07 北京达佳互联信息技术有限公司 Search intention recognition method, device, server and storage medium
CN113255354A (en) * 2021-06-03 2021-08-13 北京达佳互联信息技术有限公司 Search intention recognition method, device, server and storage medium
CN113821592A (en) * 2021-06-23 2021-12-21 腾讯科技(深圳)有限公司 Data processing method, device, equipment and storage medium
CN113392649A (en) * 2021-07-08 2021-09-14 上海浦东发展银行股份有限公司 Identification method, device, equipment and storage medium
CN113326380A (en) * 2021-08-03 2021-08-31 国能大渡河大数据服务有限公司 Equipment measurement data processing method, system and terminal based on deep neural network
CN113673248B (en) * 2021-08-23 2022-02-01 中国人民解放军32801部队 Named entity identification method for testing and identifying small sample text
CN113673248A (en) * 2021-08-23 2021-11-19 中国人民解放军32801部队 Named entity identification method for testing and identifying small sample text
CN114647727A (en) * 2022-03-17 2022-06-21 北京百度网讯科技有限公司 Model training method, device and equipment applied to entity information recognition
CN114691889A (en) * 2022-04-15 2022-07-01 中北大学 Method for constructing fault diagnosis knowledge map of turnout switch machine
CN114691889B (en) * 2022-04-15 2024-04-12 中北大学 Construction method of fault diagnosis knowledge graph of switch machine
CN116401369A (en) * 2023-06-07 2023-07-07 佰墨思(成都)数字技术有限公司 Entity identification and classification method for biological product production terms
CN116401369B (en) * 2023-06-07 2023-08-11 佰墨思(成都)数字技术有限公司 Entity identification and classification method for biological product production terms

Similar Documents

Publication Publication Date Title
CN110298042A (en) Film and television entity recognition method based on BiLSTM-CRF and knowledge graph
CN109189942B (en) Construction method and device of patent data knowledge graph
CN117076653B (en) Knowledge base question-answering method based on thinking chain and visual lifting context learning
CN112818093B (en) Evidence document retrieval method, system and storage medium based on semantic matching
CN112417097B (en) Multi-modal data feature extraction and association method for public opinion analysis
CN110188197B (en) Active learning method and device for labeling platform
CN112183094B (en) Chinese grammar debugging method and system based on multiple text features
CN111324771B (en) Video tag determination method and device, electronic equipment and storage medium
CN108304373B (en) Semantic dictionary construction method and device, storage medium and electronic device
CN104102721A (en) Method and device for recommending information
CN113505204B (en) Recall model training method, search recall device and computer equipment
CN112101027A (en) Chinese named entity recognition method based on reading understanding
CN109189862A (en) A kind of construction of knowledge base method towards scientific and technological information analysis
CN110704624A (en) Geographic information service metadata text multi-level multi-label classification method
CN111061939B (en) Scientific research academic news keyword matching recommendation method based on deep learning
CN110888991A (en) Sectional semantic annotation method in weak annotation environment
CN111814477B (en) Dispute focus discovery method and device based on dispute focus entity and terminal
CN114661872B (en) Beginner-oriented API self-adaptive recommendation method and system
CN113535949B (en) Multi-modal combined event detection method based on pictures and sentences
CN110941958A (en) Text category labeling method and device, electronic equipment and storage medium
CN113537304A (en) Cross-modal semantic clustering method based on bidirectional CNN
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN114239730B (en) Cross-modal retrieval method based on neighbor ordering relation
CN117573869A (en) Network connection resource key element extraction method
CN116483990B (en) Internet news content automatic generation method based on big data

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20191001