CN110298042A - Film and television entity recognition method based on BiLSTM-CRF and knowledge graph - Google Patents
Film and television entity recognition method based on BiLSTM-CRF and knowledge graph
- Publication number: CN110298042A (application CN201910572843.1A)
- Authority
- CN
- China
- Prior art keywords
- vector
- film and television
- crf
- entity recognition
- bilstm
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS; G06F—ELECTRIC DIGITAL DATA PROCESSING; G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06F16/367—Information retrieval of unstructured textual data; Creation of semantic tools; Ontology
- G06F16/951—Retrieval from the web; Indexing; Web crawling techniques
- G06F18/23213—Pattern recognition; Non-hierarchical clustering techniques using statistics or function optimisation, with a fixed number of clusters, e.g. K-means clustering
- G06F40/295—Natural language analysis; Named entity recognition
- G06N3/045—Neural networks; Combinations of networks
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/08—Neural networks; Learning methods
Abstract
The invention discloses a film and television entity recognition method based on BiLSTM-CRF and a knowledge graph. A character vector and a part-of-speech vector are obtained for the text to be recognized and combined by weighted summation; the result is fed into a trained BiLSTM model to obtain a text feature sequence; the feature sequence is then fed into a trained CRF model to obtain the named entity recognition result for the text; finally, the recognition result is looked up in the film and television knowledge graph for further verification. The invention can effectively extract entities from users' colloquial film and television search text, and makes full use of the knowledge graph to mine the user's implicit search intention, improving the user experience. When labeled data is scarce, word vectors trained by a language model are used as the bottom-layer input of the neural network, which improves training efficiency. The method has good application prospects and can be widely applied to entity recognition scenarios in various fields.
Description
Technical field
The present invention relates to the technical field of deep learning for natural language processing, and in particular to a film and television entity recognition method based on BiLSTM-CRF and a knowledge graph.
Background technique
A television set is indispensable in almost every household, and new films and series go online almost daily, so people search a large number of film and television resources from the TV — for example by director, actor, title or genre. How to extract film and television entities accurately and effectively, so as to help users quickly find the works they like, has therefore become an important requirement.
Traditional named entity recognition mostly uses rule-based methods and statistical machine learning. Initially, named entity recognition adopted dictionary- and rule-based methods. These methods are mostly built on knowledge bases and dictionaries written by linguistic experts, and identify named entities by pattern matching or string matching. For text with strong regularity, rule-based methods are accurate and efficient; but for text with weak regularity, writing rules becomes difficult and the recognition results are unsatisfactory. Dictionary-based named entity recognition, moreover, depends heavily on the dictionary and cannot recognize out-of-vocabulary words. People therefore turned their attention to machine learning methods.
Common machine learning methods in the named entity recognition field include the hidden Markov model (HMM), conditional random fields (Conditional Random Fields, CRF), the maximum entropy model and the support vector machine (SVM), of which the most typical and most successful are the hidden Markov model and conditional random fields. Machine learning methods generally outperform rule-based methods in recognition quality, but named entity recognition models built with statistical machine learning still have limitations. On the one hand, to keep inference tractable they require specific independence assumptions; on the other hand, statistical machine learning methods place high demands on feature selection: the various features that influence the named entity recognition task must be selected by hand, i.e. feature engineering, which strongly affects the recognition result but is time-consuming and laborious. Moreover, HMM and frequency-based CRF methods can only associate the semantics of the word immediately preceding the current word, so their recognition accuracy is limited, especially for out-of-vocabulary words. Finally, these methods usually require a large amount of task-specific knowledge, such as designing the state model of an HMM or choosing the input features of a CRF.
In recent years, with advances in hardware and the appearance of distributed word representations (word embeddings), neural networks have become models that can effectively handle many NLP (natural language processing) tasks. Such methods treat sequence labeling tasks (e.g. POS tagging and NER, named entity recognition) similarly: each token is mapped from a discrete one-hot representation to a dense low-dimensional embedding; the embedding sequence of a sentence is fed into an RNN (recurrent neural network), which extracts features automatically; and a Softmax layer predicts the label of each token. Training thus becomes a single end-to-end process rather than a conventional pipeline, does not depend on feature engineering, and is a data-driven approach. However, there are many network variants, performance depends heavily on parameter settings, and model interpretability is poor. A further disadvantage of this approach is that each token is classified independently during labeling: the labels predicted earlier cannot be used directly (information can only be passed forward through the hidden state), so the predicted label sequence may be illegal — for example, in the BIO labeling scheme, the label B-PER cannot be followed by I-LOC, but Softmax cannot exploit this information.
At present, the research community has proposed LSTM-CRF (long short-term memory network plus conditional random field) models for sequence labeling: a CRF (conditional random field) layer is attached after the LSTM layer to perform sentence-level label prediction, so that labeling is no longer an independent classification of each token.
Summary of the invention
In view of the above problems, the invention proposes a film and television entity recognition method based on BiLSTM-CRF and a knowledge graph, which solves the entity recognition problem for short, colloquial film and television text with little labeled data.
The present invention achieves the above object through the following technical solution:
A film and television entity recognition method based on BiLSTM-CRF and a knowledge graph, comprising the following steps:
Step 1: collect film and television data in real time from major movie data sources, e.g. Douban and Baidu Baike; crawl entity information such as titles, actors, roles and character relations, and build the film and television knowledge graph.
Step 2: collect from TV terminals the users' film and television search data that has been converted from speech to text; analyze the collected data and label the common search sentences that follow certain patterns, for model training and word vector training.
Step 3: train the entity recognition model, which consists of three parts — a feature representation layer, a BiLSTM layer and a CRF layer:
(1) feature representation layer: composed of a part-of-speech vector and a character vector; the character vector is obtained by language model training, while the part-of-speech vector is a one-hot vector obtained by POS tagging after word segmentation; the part-of-speech vector layer and the character vector layer are combined by weight into the final word vector layer; the combination of the part-of-speech vector and the character-level vector represents the word's features in a specific semantic space;
(2) BiLSTM: composed of a forward and a backward long short-term memory network (LSTM); the forward and backward LSTMs take the output of the feature representation layer as input and separately encode the left and right context of the current time step; the two encodings are merged into the score information to be decoded;
(3) CRF: the CRF layer takes the output scores of the BiLSTM as input and introduces a transition score matrix, selecting the optimal label sequence according to the scores of whole sequences.
Step 4: result verification: the model's prediction results are verified; the combination of rules and the knowledge graph improves the effectiveness of entity recognition.
In a further scheme, in step 2, frequency statistics and k-means clustering are performed on the large amount of user data collected from TV terminals.
The beneficial effects of the present invention are as follows:
The invention can effectively extract entities from users' colloquial film and television search text, and makes full use of the film and television knowledge graph to mine the user's implicit search intention, improving the user experience. When labeled data is scarce, word vectors trained by a language model serve as the bottom-layer input of the neural network, improving training efficiency. The method has good application prospects and can be widely applied to entity recognition scenarios in various fields.
Brief description of the drawings
In order to explain the technical solutions in the embodiments of the invention more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below cover only some of the embodiments; for those of ordinary skill in the art, other drawings can be derived from them without creative effort.
Fig. 1 is the film and television entity recognition flow chart of the method of the invention;
Fig. 2 is the structure of the BiLSTM-CRF model in the method of the invention.
Specific embodiment
To make the object, technical solutions and advantages of the present invention clearer, technical solution of the present invention will be carried out below
Detailed description.Obviously, the described embodiment is only a part of the embodiment of the present invention, instead of all the embodiments.It is based on
Embodiment in the present invention, those of ordinary skill in the art without making creative work it is obtained it is all its
Its embodiment belongs to the range that the present invention is protected.
In one embodiment, as shown in Fig. 1, the film and television entity recognition method based on BiLSTM-CRF and a knowledge graph of the invention comprises the following steps:
Step 1: collect film and television data in real time from major movie data sources, e.g. Douban and Baidu Baike; crawl entity information such as titles, actors, roles and character relations, and build the film and television knowledge graph.
Step 2: collect from TV terminals the users' film and television search data that has been converted from speech to text; analyze the collected data and label the common search sentences that follow certain patterns, for model training and word vector training.
Step 3: train the entity recognition model, which consists of three parts — a feature representation layer, a BiLSTM layer and a CRF layer:
(1) feature representation layer: composed of a part-of-speech vector and a character vector; the character vector is obtained by language model training, while the part-of-speech vector is a one-hot vector obtained by POS tagging after word segmentation; the part-of-speech vector layer and the character vector layer are combined by weight into the final word vector layer; the combination of the part-of-speech vector and the character-level vector represents the word's features in a specific semantic space;
(2) BiLSTM: composed of a forward and a backward long short-term memory network (LSTM); the forward and backward LSTMs take the output of the feature representation layer as input and separately encode the left and right context of the current time step; the two encodings are merged into the score information to be decoded;
(3) CRF: the CRF layer takes the output scores of the BiLSTM as input and introduces a transition score matrix, selecting the optimal label sequence according to the scores of whole sequences.
Step 4: result verification: the model's prediction results are verified; the combination of rules and the knowledge graph improves the effectiveness of entity recognition.
In step 2, frequency statistics and k-means clustering are performed on the large amount of user data collected from TV terminals.
The k-means clustering algorithm is summarized as follows:
1. select K points as the initial centroids;
2. assign each point to its nearest centroid, forming K clusters;
3. recompute the centroid of each cluster;
4. repeat until the clusters no longer change or the maximum number of iterations is reached.
Here, 15 cluster centers were chosen by parameter tuning and testing, which clusters sentences with similar user intent together. Combining the frequency analysis with the clustering results, the common film and television search sentences can be anticipated and the entity types to be recognized determined; the labels are given uniform names, currently 27 labels in total.
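The k-means loop above can be sketched in Python as follows. This is an illustrative sketch, not the patent's actual code: the toy 2-D points and K = 3 stand in for the real query embeddings and the 15 clusters chosen by tuning, and the explicit initial-centroid indices are an assumption made so the toy run is deterministic.

```python
import numpy as np

def kmeans(points, k, init=None, max_iter=100):
    """Minimal k-means: assign each point to its nearest centroid,
    recompute the centroids, repeat until nothing changes."""
    centroids = points[init] if init is not None else points[:k].copy()
    for _ in range(max_iter):
        # assign each point to the nearest centroid
        dists = np.linalg.norm(points[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each cluster's centroid (keep old one if cluster is empty)
        new = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):   # clusters no longer change
            break
        centroids = new
    return labels, centroids

# toy 2-D stand-ins for user-query embeddings; the real setting used 15 clusters
pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0],
                [10.0, 0.0], [10.0, 0.1]])
labels, cents = kmeans(pts, k=3, init=[0, 2, 4])
```

In practice a library implementation (e.g. scikit-learn's `KMeans`) with several random restarts would normally be preferred over a hand-rolled loop.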
Language model pre-training: before word vector training, the data is preprocessed — special punctuation is removed, letter case is normalized, and so on. The preprocessed mass of normalized user data is then used to train word vectors of dimension 300 with the word2vec implementation in the gensim toolkit. For the part-of-speech vector, each sentence is first segmented with jieba; the segmentation dictionary contains the fixed user phrases obtained from data analysis, such as 'I want to watch' and 'I want to listen to', and each entry is labeled with its part of speech by weight. The part-of-speech sequence of each sentence is then converted into 300-dimensional one-hot vectors; finally, the part-of-speech vector and the word2vec character vector are added with a certain weight, and the result serves as the initial parameters of the bidirectional LSTM network.
Using word vectors trained on a large amount of real data alleviates, to some extent, the problem of doing entity recognition with deep neural networks when labeled data is scarce: the initial parameters of the BiLSTM network are no longer meaningless random values. Word vectors trained on large data capture initial information such as Chinese character radicals and are fed in as the bottom layer of the network; the additional part-of-speech vector further helps identify the domain of the input text.
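The weighted fusion of the character vector and the one-hot part-of-speech vector described above can be sketched as follows. This is illustrative only: the 0.7/0.3 weights and the tiny POS tag set are assumptions (the patent only says "by a certain weight", and the real tag set comes from jieba), while the 300-dimensional vectors follow the description.

```python
import numpy as np

D = 300                          # vector dimension used in the patent
POS_TAGS = ["n", "v", "a", "r"]  # toy POS tag set; the real one comes from jieba

def pos_one_hot(tag, dim=D):
    """One-hot part-of-speech vector padded to the embedding dimension."""
    v = np.zeros(dim)
    v[POS_TAGS.index(tag)] = 1.0
    return v

def fuse(char_vec, pos_tag, w_char=0.7, w_pos=0.3):
    """Weighted sum of the character vector and the one-hot POS vector,
    used as the initial BiLSTM input (the weights are assumptions)."""
    return w_char * char_vec + w_pos * pos_one_hot(pos_tag)

rng = np.random.default_rng(0)
char_vec = rng.normal(size=D)    # stand-in for a trained word2vec vector
x = fuse(char_vec, "n")
```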
Model training data preparation: data labeling.
Common sentences containing each label are filtered out of the user data and labeled manually, using the BIO tag set employed in the Bakeoff-3 evaluation, in the following labeling form:
I want to see Liu Dehua's A World Without Thieves;
O O O B-actor I-actor I-actor O B-movie_name I-movie_name I-movie_name I-movie_name;
To prevent any label type from being too skewed, tags whose type occurs rarely are merged into a single label; after prediction, the knowledge graph is queried in priority order for verification. There are currently 25,674 items of model training data; as user demand changes and the model iterates, the training data will grow accordingly.
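The BIO labeling form above can be produced with a small helper. This is a sketch: the entity span positions are supplied by hand here for the example sentence, whereas the patent's data is labeled manually.

```python
def bio_tags(tokens, entities):
    """entities: list of (start, end_exclusive, label) spans over tokens.
    Returns one BIO tag per token: B- at span start, I- inside, O elsewhere."""
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags

# 我想看刘德华的天下无贼 ("I want to see Liu Dehua's A World Without Thieves")
tokens = list("我想看刘德华的天下无贼")
tags = bio_tags(tokens, [(3, 6, "actor"), (7, 11, "movie_name")])
```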
Model training:
All labeled training data are divided in the ratio 0.6 : 0.3 : 0.1 into a training set, a test set and a validation set.
Taking the sentence as the unit, a sentence (a sequence of characters) containing n characters is written
x = (x_1, x_2, ..., x_n),
where x_i is the id of the i-th character of the sentence in the dictionary, so a word2Id vector whose dimension is the dictionary size is available for every character.
The dictionary is built by counting the frequency of every character in all the training data and sorting in descending order, giving each character a unique id; unregistered words are marked 'UNK'.
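Building the frequency-sorted dictionary with the 'UNK' marker can be sketched as follows (illustrative; the toy corpus and the choice of id 0 for 'UNK' are assumptions):

```python
from collections import Counter

def build_dict(corpus):
    """Map each character to a unique id, ordered by descending frequency;
    id 0 is reserved for the unregistered-word marker 'UNK'."""
    counts = Counter(ch for sent in corpus for ch in sent)
    word2id = {"UNK": 0}
    for ch, _ in counts.most_common():
        word2id[ch] = len(word2id)
    return word2id

def encode(sentence, word2id):
    """Convert a sentence to its id sequence, falling back to 'UNK'."""
    return [word2id.get(ch, word2id["UNK"]) for ch in sentence]

corpus = ["我想看电影", "我想听歌"]   # toy training corpus
word2id = build_dict(corpus)
ids = encode("我想看球", word2id)     # '球' is unregistered
```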
The first layer of the model is the look-up layer: using a pre-trained or randomly initialized embedding matrix, each character x_i of the sentence is mapped from a one-hot vector to a dense low-dimensional character embedding x_i ∈ R^d, where d is the embedding dimension. Dropout is applied before the input enters the next layer, to alleviate overfitting.
The second layer of the model is the bidirectional LSTM layer, which extracts sentence features automatically. The character embedding sequence (x_1, x_2, ..., x_n) of a sentence is fed in as the input of each time step of the bidirectional LSTM; the hidden state sequence output by the forward LSTM and the hidden states output at each position by the backward LSTM are concatenated position by position, giving the complete hidden state sequence
(h_1, h_2, ..., h_n) ∈ R^{n×m}.
After dropout, a linear layer maps the hidden state vectors from dimension m to dimension k, where k is the number of tags in the tag set, giving the automatically extracted sentence features as a matrix P = (p_1, p_2, ..., p_n) ∈ R^{n×k}. Each component p_ij of p_i ∈ R^k can be regarded as the score for classifying character x_i into the j-th tag. Applying Softmax to P would amount to an independent k-class classification at each position; but labeling each position this way cannot use the labels already assigned, so a CRF layer is entered next to perform the labeling.
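The shapes involved in this layer can be shown with a plain NumPy sketch of a bidirectional LSTM pass. This is a toy, untrained sketch: the random weights and the dimensions n = 5, d = 8, hidden size 6 (so m = 12) and k = 4 are assumptions; a real implementation would use a deep-learning framework with trained parameters and dropout.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_pass(xs, W, U, b, h_dim):
    """Single-direction LSTM over a sequence; returns all hidden states."""
    h = np.zeros(h_dim)
    c = np.zeros(h_dim)
    hs = []
    for x in xs:
        z = W @ x + U @ h + b                     # all four gates at once
        i, f, o, g = np.split(z, 4)
        i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
        c = f * c + i * g                         # cell state update
        h = o * np.tanh(c)                        # hidden state
        hs.append(h)
    return np.array(hs)

rng = np.random.default_rng(0)
n, d, h_dim, k = 5, 8, 6, 4          # toy sizes: words, embed dim, hidden, tags
xs = rng.normal(size=(n, d))         # character-embedding sequence

def rand_params():
    return (rng.normal(scale=0.1, size=(4 * h_dim, d)),
            rng.normal(scale=0.1, size=(4 * h_dim, h_dim)),
            np.zeros(4 * h_dim))

fwd = lstm_pass(xs, *rand_params(), h_dim)              # left-to-right context
bwd = lstm_pass(xs[::-1], *rand_params(), h_dim)[::-1]  # right-to-left context
H = np.concatenate([fwd, bwd], axis=1)                  # (n, m) with m = 2*h_dim

Wp = rng.normal(scale=0.1, size=(2 * h_dim, k))
P = H @ Wp                                              # (n, k) tag score matrix
```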
The third layer of the model is the CRF layer, which performs sentence-level sequence labeling. The parameter of the CRF layer is a (k+2)×(k+2) matrix A, where A_ij is the transition score from the i-th tag to the j-th tag, so that the tags already assigned can be used when labeling a position; 2 is added because a start state is attached to the head of the sentence and an end state to its tail. If y = (y_1, y_2, ..., y_n) is a tag sequence whose length equals the sentence length, then the model's score for labeling sentence x with y is
score(x, y) = Σ_{i=1}^{n} (A_{y_{i-1}, y_i} + P_{i, y_i}).
The score of the whole sequence thus equals the sum of the scores of the positions, and the score of each position is obtained from two parts: one determined by the LSTM output p_i, the other by the transition matrix A of the CRF. The normalized probability is then obtained with a Softmax over all tag sequences:
P(y | x) = exp(score(x, y)) / Σ_{y'} exp(score(x, y')).
During model training, the log-likelihood is maximized; for a training sample (x, y_x) the log-likelihood is
log P(y_x | x) = score(x, y_x) − log Σ_{y'} exp(score(x, y')).
When implementing this formula, note that the logarithm of the sum of exponentials should be computed as log-sum-exp for numerical stability; the second term of the above formula is computed efficiently with the forward-backward algorithm of the CRF.
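The sequence score and the log-partition term log Σ_{y'} exp(score(x, y')) can be sketched as follows, checked against brute-force enumeration on a toy instance. Illustrative only: for brevity this sketch omits the start and end states that make the transition matrix (k+2)×(k+2), and the scores are random stand-ins for the BiLSTM output.

```python
import itertools
import numpy as np

def path_score(P, A, y):
    """score(x, y) = sum_i A[y_{i-1}, y_i] + P[i, y_i] (no start/end states)."""
    s = P[0, y[0]]
    for i in range(1, len(y)):
        s += A[y[i - 1], y[i]] + P[i, y[i]]
    return s

def log_partition(P, A):
    """log sum_{y'} exp(score(x, y')) via the forward algorithm + log-sum-exp."""
    alpha = P[0].copy()                     # log-scores of prefixes ending in each tag
    for i in range(1, len(P)):
        m = alpha[:, None] + A              # m[t, j] = alpha_t + A[t, j]
        mx = m.max(axis=0)
        alpha = mx + np.log(np.exp(m - mx).sum(axis=0)) + P[i]
    mx = alpha.max()
    return mx + np.log(np.exp(alpha - mx).sum())

rng = np.random.default_rng(0)
n, k = 4, 3
P = rng.normal(size=(n, k))   # emission scores (stand-in for BiLSTM output)
A = rng.normal(size=(k, k))   # transition scores
logZ = log_partition(P, A)
# brute-force check over all k**n label sequences
brute = np.log(sum(np.exp(path_score(P, A, y))
                   for y in itertools.product(range(k), repeat=n)))
```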
During prediction (decoding), the model solves for the optimal path with the dynamic-programming Viterbi algorithm.
The reason for using BiLSTM plus CRF rather than BiLSTM alone is that the CRF layer can learn constraint rules from the training data: it adds constraints to the finally predicted labels to guarantee that the predicted labels are legal. During training, these constraints are learned automatically by the CRF layer.
These constraints may be:
I: the first label of a sentence always starts with 'B-' or 'O', not 'I-';
II: in a pattern 'B-label1 I-label2 I-label3 ...', label1, label2 and label3 should belong to the same entity class. For example, 'B-Person I-Person' is a legal sequence, but 'B-Person I-Organization' is an illegal label sequence;
III: 'O I-label' is an illegal sequence — the first label of an entity should be 'B-', not 'I-'; in other words, the valid sequence is 'O B-label'.
With these constraints, the probability of illegal sequences appearing in label prediction is greatly reduced. Since the BiLSTM output is a score for every label of each unit, we could simply select the highest-scoring label for each unit; that would yield a label for every unit of sentence x, but it could not guarantee that the labels are predicted correctly every time. The structure of the whole model is shown in Fig. 2.
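The Viterbi decoding mentioned above can be sketched as follows, again without the start/end states, and verified against brute-force search on a toy instance (the random scores are stand-ins for a trained model's output):

```python
import itertools
import numpy as np

def viterbi(P, A):
    """Dynamic-programming search for the highest-scoring label sequence."""
    n, k = P.shape
    delta = P[0].copy()                  # best score of a prefix ending in each tag
    back = np.zeros((n, k), dtype=int)   # backpointers
    for i in range(1, n):
        m = delta[:, None] + A           # m[t, j]: best prefix via prev tag t
        back[i] = m.argmax(axis=0)
        delta = m.max(axis=0) + P[i]
    path = [int(delta.argmax())]
    for i in range(n - 1, 0, -1):        # follow backpointers
        path.append(int(back[i, path[-1]]))
    return path[::-1], float(delta.max())

def path_score(P, A, y):
    s = P[0, y[0]]
    for i in range(1, len(y)):
        s += A[y[i - 1], y[i]] + P[i, y[i]]
    return s

rng = np.random.default_rng(1)
P = rng.normal(size=(5, 3))
A = rng.normal(size=(3, 3))
best_path, best_score = viterbi(P, A)
# brute-force check over all 3**5 sequences
brute_path = max(itertools.product(range(3), repeat=5),
                 key=lambda y: path_score(P, A, y))
```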
In addition, the data analysis process of the invention is as follows:
The basic needs of users' film and television searches — e.g. the common search clauses and what conditions users search videos by — are analyzed from the large amount of collected user data with k-means clustering, frequency statistics and the like, and the entity classes and names are determined together with the business requirements. The training data is then labeled manually in the BIO standard. Since no ready-made labeled data is available, the 300-dimensional word2vec character vectors and the one-hot part-of-speech vectors are trained on a large amount of real user data, and the character vector and part-of-speech vector of the text are merged with a certain weight as the bottom-layer input of the bidirectional LSTM.
The entity recognition model is trained as follows:
All labeled training data are divided in the ratio 0.6 : 0.3 : 0.1 into a training set, a test set and a validation set.
Taking the sentence as the unit, a sentence (a sequence of characters) containing n characters is written
x = (x_1, x_2, ..., x_n),
where x_i is the id of the i-th character of the sentence in the dictionary, so a word2Id vector whose dimension is the dictionary size is available for every character.
The dictionary is built by counting the frequency of every character in all the training data and sorting in descending order, giving each character a unique id; unregistered words are marked 'UNK'.
Model training mainly comprises the following three parts:
1. Input character/word vector representation.
Each word is represented by a dense vector; the pre-trained word vectors (word2vec) and part-of-speech vectors are loaded. Some meaning is extracted from the individual characters, and the meaning of the sentence is obtained from the part-of-speech vectors. For each word we construct a vector that captures the meaning of the word together with features useful for entity recognition: this vector is stacked, by weight, from the word2vec-trained word vector and the feature vector extracted from the part of speech.
2. Semantic representation of the context: for each word in the context, a meaningful vector representation is needed. The vector representation of the words in context is obtained with the BiLSTM: once the final vector representation of each word is available, a bi-LSTM is run over the word vector sequence, using the hidden state at every time step rather than only the final state. From m input word vectors we obtain m hidden state vectors; whereas the word vectors contain only word-level information, the hidden state vectors also take the context into account.
3. Decoding: once we have the vector representation of each word, the entity label is predicted.
In the decoding stage the label scores are computed: the final prediction is made with the hidden state vector corresponding to each word, and the score of each entity label can be obtained with a fully connected neural network.
Suppose there are 9 classes; with W ∈ R^{9×k} and b ∈ R^9 the score s = Wh + b ∈ R^9 is computed, and s[i] can be understood as the score of label i for word w.
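This fully connected scoring step can be sketched as follows (the hidden size, the random weights and the zero bias are toy assumptions; only the 9 classes come from the text above):

```python
import numpy as np

rng = np.random.default_rng(0)
k, n_labels = 12, 9               # hidden size (toy) and number of classes
h = rng.normal(size=k)            # BiLSTM hidden state for one word
W = rng.normal(scale=0.1, size=(n_labels, k))
b = np.zeros(n_labels)

s = W @ h + b                     # s[i]: score of label i for this word
```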
A linear-chain CRF is applied to the entity label scores. The Softmax approach makes only local choices: even though the hidden state h produced by the bi-LSTM contains some contextual information, the label decision is still local, and the surrounding labels are not used to aid the decision. For example, for "Yang Mi" (杨幂): once "幂" has been given the label "I-actor", this should help us decide that "杨" corresponds to the beginning of the actor entity. A linear-chain CRF defines a global score.
Finally, the trained model and its parameters are saved.
The data preprocessing flow is as follows:
Before model prediction, the data is cleaned, mainly by removing special characters and the like; the text data is then converted into the format required by model prediction, i.e. into wordId vectors whose dimension is the length of the training data dictionary.
Model prediction is as follows:
The preprocessed data is fed into the model for prediction; the possible prediction results are as follows:
(1) I want to see Coming Home directed by Zhang Yimou;
O O O B-director_name I-director_name I-director_name O O B-movie_name I-movie_name;
(2) Liu Dehua's A Chinese Odyssey;
B-actor I-actor I-actor O B-movie_name I-movie_name I-movie_name;
(3) the films that Deng Chao's wife acted in;
B-actor I-actor O O B-relation I-relation O O O;
(4) Liang Sheng, may we not be sad;
O O O O O O B-movie_name I-movie_name I-movie_name;
(5) recommend the hottest movie;
O O O O O O O O O;
1. Handling prediction results without entities.
For cases such as (5) above, where no entity appears in the model's prediction result:
Data processing: (a) remove the leading and trailing redundancy such as 'I want to see', 'I am going to watch', 'play', 'is there', etc.; (b) extract rule-based entities such as episode/season/part, version and language. Data such as languages, versions and countries changes rarely and has limited special values, and it also exists in the knowledge graph; the invention keeps this part of the data in memory in dictionary form, e.g. {'英语': 'English', '英文': 'English', '外语': 'English'}, so that all synonyms are taken into account. The corresponding entity is matched with a regular expression and then replaced by the empty string. For example, for 'I want to see The Fast and the Furious, English version': if the model predicts no entity, the leading/trailing redundancy and the special entity are removed, and the remaining 'The Fast and the Furious' is searched in the knowledge graph to obtain the corresponding entity result.
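The redundancy stripping and dictionary/regex extraction of special entities can be sketched as follows. Illustrative only: the prefix list and any language-dictionary entries beyond the three quoted above are assumptions, and the real system keeps these dictionaries in memory synchronized with the knowledge graph.

```python
import re

# assumed prefix list; the quoted language synonyms come from the description
PREFIXES = ["我想看", "我要看", "播放", "有没有"]   # "I want to see", "play", ...
LANG_SYNONYMS = {"英语": "English", "英文": "English", "外语": "English"}

def normalize(query):
    """Strip a leading redundant phrase, extract the rule-based language
    entity via regex, and return the residual text to search plus the slots."""
    for p in PREFIXES:
        if query.startswith(p):
            query = query[len(p):]
            break
    slots = {}
    lang_pat = "|".join(map(re.escape, LANG_SYNONYMS))
    m = re.search(f"({lang_pat})版?", query)        # e.g. 英文版 "English version"
    if m:
        slots["language"] = LANG_SYNONYMS[m.group(1)]
        query = query[:m.start()] + query[m.end():]  # replace entity with empty
    return query, slots

residual, slots = normalize("我想看速度与激情英文版")  # "I want to see F&F, English version"
```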
2. Handling prediction results that contain entities.
For the prediction results (1), (2), (3) and (4) above, which contain entity results, the labeled entities are searched in the knowledge graph to verify whether they really exist. In (2), for instance, Liu Dehua did not actually act in A Chinese Odyssey, so other Liu Dehua films will be recommended to the user instead of reporting that the film was not found, which improves the user experience. In (3), what the user actually wants to watch are the films of Deng Chao's wife, Sun Li; this mining of abstract entity relations satisfies the user's demand better. Knowledge-graph verification thus further improves the entity recognition results. For (4), although there is an entity result, no corresponding title entity is found in the knowledge graph; it is treated as a failed prediction, and the following entity result packaging and output is executed.
Entity result packaging and output:
Logically inconsistent entity predictions are handled here. For example, if 'Liu Dehua season 3' is recognized as actor: Liu Dehua plus a season entity, the season entity is deleted and the result is packaged.
The above description is merely a specific embodiment, but the protection scope of the invention is not limited to it; any person familiar with the technical field can easily think of changes or substitutions within the technical scope disclosed by the invention, and these should all be covered by the protection scope of the invention. The specific technical features described in the above specific embodiments can, where not contradictory, be combined in any suitable manner; to avoid unnecessary repetition, the various possible combinations are not described further. The various embodiments of the invention can also be combined arbitrarily, and as long as they do not depart from the idea of the invention, such combinations should likewise be regarded as disclosed by the invention.
Claims (2)
1. A film and television entity recognition method based on BiLSTM-CRF and a knowledge graph, characterized by comprising the following steps:
Step 1: collect film and television data in real time from major movie data sources, crawling entity information such as each film/TV title, actor, role and character relationship, and build a film and television knowledge graph;
Step 2: collect from televisions the film and television search data produced by users' voice queries converted to text; analyze the collected data, and label the common user search statements that follow certain patterns, for model training and word-vector training;
Step 3: train the entity recognition model, which consists of three parts: a feature representation layer, a BiLSTM layer and a CRF layer:
(1) Feature representation layer: composed of part-of-speech vectors and character vectors. The character vectors are obtained by language-model (LM) training, and the part-of-speech vectors are one-hot vectors obtained through part-of-speech tagging after word segmentation; the part-of-speech vector layer and the character vector layer are then concatenated by weight into the final word-vector layer. Finally, the part-of-speech vectors and character-level vectors are concatenated to represent the features of a word in a given semantic space;
(2) BiLSTM: composed of a forward and a backward long short-term memory network (LSTM). The forward and backward LSTMs take the output features of the representation layer as input and separately encode the context before and after the current time step; the two encodings are merged to form the score information to be decoded;
(3) CRF: the CRF layer takes the output scores of the BiLSTM as input and at the same time introduces a transition score matrix, selecting the complete optimal label sequence according to the sequence scores;
Step 4: result verification: the model prediction results are verified; the combination of rules and the knowledge graph improves entity-recognition effectiveness.
2. The film and television entity recognition method based on BiLSTM-CRF and a knowledge graph according to claim 1, characterized in that, in said Step 2, frequency statistics and k-Means clustering are performed on the large volume of user data collected from televisions.
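The CRF decoding in Step 3 (3), which selects the globally optimal label sequence from the BiLSTM's emission scores together with a transition score matrix, is standard Viterbi decoding. A minimal NumPy sketch, offered as an illustration rather than the patent's implementation:

```python
import numpy as np

def viterbi_decode(emissions: np.ndarray, transitions: np.ndarray) -> list:
    """Pick the optimal label sequence.

    emissions:   [T, L] per-step label scores from the BiLSTM
    transitions: [L, L] score of moving from label i to label j
    """
    T, L = emissions.shape
    score = emissions[0].copy()          # best score ending in each label
    back = np.zeros((T, L), dtype=int)   # backpointers
    for t in range(1, T):
        # cand[i, j] = best path ending in i, then transitioning i -> j
        cand = score[:, None] + transitions + emissions[t][None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    # follow backpointers from the best final label
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

The transition matrix is what lets the CRF forbid label sequences that are locally plausible but globally invalid (e.g. an I-tag following an O-tag), which per-step argmax over BiLSTM scores alone cannot do.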
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910572843.1A CN110298042A (en) | 2019-06-26 | 2019-06-26 | Based on Bilstm-crf and knowledge mapping video display entity recognition method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110298042A (en) | 2019-10-01 |
Family
ID=68029238
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910572843.1A Pending CN110298042A (en) | 2019-06-26 | 2019-06-26 | Based on Bilstm-crf and knowledge mapping video display entity recognition method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110298042A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108763218A (en) * | 2018-06-04 | 2018-11-06 | 四川长虹电器股份有限公司 | A kind of video display retrieval entity recognition method based on CRF |
CN108874997A (en) * | 2018-06-13 | 2018-11-23 | 广东外语外贸大学 | A kind of name name entity recognition method towards film comment |
CN109033374A (en) * | 2018-07-27 | 2018-12-18 | 四川长虹电器股份有限公司 | Knowledge mapping search method based on Bayes classifier |
Non-Patent Citations (1)
Title |
---|
Zhou Hao et al.: "Chinese evaluation object extraction fusing semantic and syntactic information" (融合语义与语法信息的中文评价对象提取), CAAI Transactions on Intelligent Systems (《智能系统学报》) * |
Cited By (48)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110807324A (en) * | 2019-10-09 | 2020-02-18 | 四川长虹电器股份有限公司 | Video entity identification method based on IDCNN-crf and knowledge graph |
CN110716991A (en) * | 2019-10-11 | 2020-01-21 | 掌阅科技股份有限公司 | Method for displaying entity associated information based on electronic book and electronic equipment |
CN110782881A (en) * | 2019-10-25 | 2020-02-11 | 四川长虹电器股份有限公司 | Video entity error correction method after speech recognition and entity recognition |
CN110909174B (en) * | 2019-11-19 | 2022-01-04 | 南京航空航天大学 | Knowledge graph-based method for improving entity link in simple question answering |
CN110909174A (en) * | 2019-11-19 | 2020-03-24 | 南京航空航天大学 | Knowledge graph-based method for improving entity link in simple question answering |
CN111090754A (en) * | 2019-11-20 | 2020-05-01 | 新华智云科技有限公司 | Method for automatically constructing movie comprehensive knowledge map based on encyclopedic entries |
CN111090754B (en) * | 2019-11-20 | 2023-04-07 | 新华智云科技有限公司 | Method for automatically constructing movie comprehensive knowledge map based on encyclopedic entries |
CN111159017A (en) * | 2019-12-17 | 2020-05-15 | 北京中科晶上超媒体信息技术有限公司 | Test case generation method based on slot filling |
CN111125378A (en) * | 2019-12-25 | 2020-05-08 | 同方知网(北京)技术有限公司 | Closed-loop entity extraction method based on automatic sample labeling |
CN111274788A (en) * | 2020-01-16 | 2020-06-12 | 创新工场(广州)人工智能研究有限公司 | Dual-channel joint processing method and device |
CN111274817A (en) * | 2020-01-16 | 2020-06-12 | 北京航空航天大学 | Intelligent software cost measurement method based on natural language processing technology |
CN111241810A (en) * | 2020-01-16 | 2020-06-05 | 百度在线网络技术(北京)有限公司 | Punctuation prediction method and device |
CN111310470A (en) * | 2020-01-17 | 2020-06-19 | 西安交通大学 | Chinese named entity recognition method fusing word and word features |
CN111310470B (en) * | 2020-01-17 | 2021-11-19 | 西安交通大学 | Chinese named entity recognition method fusing word and word features |
CN111274794A (en) * | 2020-01-19 | 2020-06-12 | 浙江大学 | Synonym expansion method based on transmission |
CN111444720A (en) * | 2020-03-30 | 2020-07-24 | 华南理工大学 | Named entity recognition method for English text |
CN111553158A (en) * | 2020-04-21 | 2020-08-18 | 中国电力科学研究院有限公司 | Method and system for identifying named entities in power scheduling field based on BilSTM-CRF model |
CN111666418A (en) * | 2020-04-23 | 2020-09-15 | 北京三快在线科技有限公司 | Text regeneration method and device, electronic equipment and computer readable medium |
CN111666418B (en) * | 2020-04-23 | 2024-01-16 | 北京三快在线科技有限公司 | Text regeneration method, device, electronic equipment and computer readable medium |
CN111859967B (en) * | 2020-06-12 | 2024-04-09 | 北京三快在线科技有限公司 | Entity identification method and device and electronic equipment |
CN111859967A (en) * | 2020-06-12 | 2020-10-30 | 北京三快在线科技有限公司 | Entity identification method and device and electronic equipment |
CN111832306A (en) * | 2020-07-09 | 2020-10-27 | 昆明理工大学 | Image diagnosis report named entity identification method based on multi-feature fusion |
CN111882124A (en) * | 2020-07-20 | 2020-11-03 | 武汉理工大学 | Homogeneous platform development effect prediction method based on generation confrontation simulation learning |
CN111882124B (en) * | 2020-07-20 | 2022-06-07 | 武汉理工大学 | Homogeneous platform development effect prediction method based on generation confrontation simulation learning |
CN111917861A (en) * | 2020-07-28 | 2020-11-10 | 广东工业大学 | Knowledge storage method and system based on block chain and knowledge graph and application thereof |
CN112101009A (en) * | 2020-09-23 | 2020-12-18 | 中国农业大学 | Knowledge graph-based method for judging similarity of people relationship frame of dream of Red mansions |
CN112101009B (en) * | 2020-09-23 | 2024-03-26 | 中国农业大学 | Method for judging similarity of red-building dream character relationship frames based on knowledge graph |
CN112084783A (en) * | 2020-09-24 | 2020-12-15 | 中国民航大学 | Entity identification method and system based on civil aviation non-civilized passengers |
CN112084783B (en) * | 2020-09-24 | 2022-04-12 | 中国民航大学 | Entity identification method and system based on civil aviation non-civilized passengers |
CN113536793A (en) * | 2020-10-14 | 2021-10-22 | 腾讯科技(深圳)有限公司 | Entity identification method, device, equipment and storage medium |
CN112541088B (en) * | 2020-12-29 | 2022-05-17 | 浙大城市学院 | Dangerous chemical library construction method based on knowledge graph |
CN112541088A (en) * | 2020-12-29 | 2021-03-23 | 浙大城市学院 | Dangerous chemical library construction method based on knowledge graph |
CN112989787A (en) * | 2021-02-05 | 2021-06-18 | 杭州云嘉云计算有限公司 | Text element extraction method |
CN112906367A (en) * | 2021-02-08 | 2021-06-04 | 上海宏原信息科技有限公司 | Information extraction structure, labeling method and identification method of consumer text |
CN112905884B (en) * | 2021-02-10 | 2024-05-31 | 北京百度网讯科技有限公司 | Method, apparatus, medium and program product for generating sequence annotation model |
CN112905884A (en) * | 2021-02-10 | 2021-06-04 | 北京百度网讯科技有限公司 | Method, apparatus, medium, and program product for generating sequence annotation model |
CN113255354B (en) * | 2021-06-03 | 2021-12-07 | 北京达佳互联信息技术有限公司 | Search intention recognition method, device, server and storage medium |
CN113255354A (en) * | 2021-06-03 | 2021-08-13 | 北京达佳互联信息技术有限公司 | Search intention recognition method, device, server and storage medium |
CN113821592A (en) * | 2021-06-23 | 2021-12-21 | 腾讯科技(深圳)有限公司 | Data processing method, device, equipment and storage medium |
CN113392649A (en) * | 2021-07-08 | 2021-09-14 | 上海浦东发展银行股份有限公司 | Identification method, device, equipment and storage medium |
CN113326380A (en) * | 2021-08-03 | 2021-08-31 | 国能大渡河大数据服务有限公司 | Equipment measurement data processing method, system and terminal based on deep neural network |
CN113673248B (en) * | 2021-08-23 | 2022-02-01 | 中国人民解放军32801部队 | Named entity identification method for testing and identifying small sample text |
CN113673248A (en) * | 2021-08-23 | 2021-11-19 | 中国人民解放军32801部队 | Named entity identification method for testing and identifying small sample text |
CN114647727A (en) * | 2022-03-17 | 2022-06-21 | 北京百度网讯科技有限公司 | Model training method, device and equipment applied to entity information recognition |
CN114691889A (en) * | 2022-04-15 | 2022-07-01 | 中北大学 | Method for constructing fault diagnosis knowledge map of turnout switch machine |
CN114691889B (en) * | 2022-04-15 | 2024-04-12 | 中北大学 | Construction method of fault diagnosis knowledge graph of switch machine |
CN116401369A (en) * | 2023-06-07 | 2023-07-07 | 佰墨思(成都)数字技术有限公司 | Entity identification and classification method for biological product production terms |
CN116401369B (en) * | 2023-06-07 | 2023-08-11 | 佰墨思(成都)数字技术有限公司 | Entity identification and classification method for biological product production terms |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110298042A (en) | Based on Bilstm-crf and knowledge mapping video display entity recognition method | |
CN109189942B (en) | Construction method and device of patent data knowledge graph | |
CN117076653B (en) | Knowledge base question-answering method based on thinking chain and visual lifting context learning | |
CN112818093B (en) | Evidence document retrieval method, system and storage medium based on semantic matching | |
CN112417097B (en) | Multi-modal data feature extraction and association method for public opinion analysis | |
CN110188197B (en) | Active learning method and device for labeling platform | |
CN112183094B (en) | Chinese grammar debugging method and system based on multiple text features | |
CN111324771B (en) | Video tag determination method and device, electronic equipment and storage medium | |
CN108304373B (en) | Semantic dictionary construction method and device, storage medium and electronic device | |
CN104102721A (en) | Method and device for recommending information | |
CN113505204B (en) | Recall model training method, search recall device and computer equipment | |
CN112101027A (en) | Chinese named entity recognition method based on reading understanding | |
CN109189862A (en) | A kind of construction of knowledge base method towards scientific and technological information analysis | |
CN110704624A (en) | Geographic information service metadata text multi-level multi-label classification method | |
CN111061939B (en) | Scientific research academic news keyword matching recommendation method based on deep learning | |
CN110888991A (en) | Sectional semantic annotation method in weak annotation environment | |
CN111814477B (en) | Dispute focus discovery method and device based on dispute focus entity and terminal | |
CN114661872B (en) | Beginner-oriented API self-adaptive recommendation method and system | |
CN113535949B (en) | Multi-modal combined event detection method based on pictures and sentences | |
CN110941958A (en) | Text category labeling method and device, electronic equipment and storage medium | |
CN113537304A (en) | Cross-modal semantic clustering method based on bidirectional CNN | |
CN114818717A (en) | Chinese named entity recognition method and system fusing vocabulary and syntax information | |
CN114239730B (en) | Cross-modal retrieval method based on neighbor ordering relation | |
CN117573869A (en) | Network connection resource key element extraction method | |
CN116483990B (en) | Internet news content automatic generation method based on big data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
RJ01 | Rejection of invention patent application after publication ||
Application publication date: 20191001 |