CN110688485B - Word vector language model based on emergency - Google Patents
Word vector language model based on emergency
- Publication number
- CN110688485B (application number CN201910915299.6A)
- Authority
- CN
- China
- Prior art keywords
- emergency
- representation
- word
- vector
- word vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Biophysics (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Biomedical Technology (AREA)
- Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Machine Translation (AREA)
Abstract
The invention provides a Word vector language model based on emergency events, which uses a traditional Word2Vec model to train on the context. The training comprises: calculating, at the input layer of the model, the hidden-layer information h_c of the input context; simultaneously adding a vector representation b of the emergency event at the input layer; and combining h_c and b by weighted summation to obtain the final hidden-layer representation h, so that the two jointly influence the final hidden-layer representation, which is thereby related not only to the context but also to the emergency. The invention proposes a new emergency-related word vector model for modeling text stream data containing emergency events. The method can learn a word vector model carrying the characteristics of the emergency so as to identify semantic changes, and the added emergency vector representation improves semantic relevance.
Description
Technical Field
The invention relates to the technical field of dynamic word vector generation models, in particular to a word vector language model based on an emergency.
Background
Word semantics change over time, and many factors affect them, including cultural shifts and the creation of new technologies. For example, the word "amazon" initially referred to the tropical rainforest, but after the appearance of the e-commerce company of the same name, "amazon" now more often refers to that company.
An emergency event refers to an event that is suddenly mentioned many times within a short period, such as "the fire at Notre-Dame de Paris" or "the earthquake in Taiwan". In emergency detection models, an emergency is generally represented as <target word, start time, end time>. Previous work focused only on the expression of semantics within an emergency; we propose to combine emergencies with word vectors in order to capture and understand word correlations and semantic changes across emergencies.
Inspired by recent research, dynamic word vector models are used to learn semantic representations of words in different time periods and can thereby discover dynamic changes in word semantics. A dynamic word vector model constructs, for each word, a sequence of time-indexed vectors over a continuous time interval and can track the semantic changes that arise in the word's use. Furthermore, dynamic word vector models make it possible to find different words with the same meaning in different time periods, for example by retrieving word vectors from similar regions of the vector spaces corresponding to different time periods. However, due to the randomness of neural network training, if a word vector model is applied independently on each time slice, the output vectors of each slice will lie in vector spaces with different coordinate systems. In order to compare the semantic changes of words, these spaces need to be aligned and mapped into the same vector space. Existing methods for exploring semantic change of words include the following:
1) Comparing word similarity: this method calculates the similarity of two entities at the same time point and compares how that similarity changes across different times, thereby finding semantic changes while avoiding the direct comparison of word vectors from different times.
2) Linear-transformation alignment: assuming that the semantics of most words do not change, this method learns a transformation matrix that minimizes the distance of the same word across different times, thereby aligning the word vectors.
3) Non-random initialization alignment: this method initializes the word vector at time t with the word vector at time t−1, so that the word vector changes smoothly along its time trajectory.
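As an illustration of method 2), linear-transformation alignment can be sketched as an orthogonal Procrustes problem: given the vectors of a set of shared anchor words in two time slices, learn the rotation that maps the new space onto the old one. The following is a minimal numpy sketch under that assumption; the function name and toy data are illustrative, not part of the invention.

```python
import numpy as np

def align_embeddings(X_old, X_new):
    """Orthogonal Procrustes: find rotation R minimizing ||X_new @ R - X_old||_F.

    Rows of X_old / X_new are the vectors of the same anchor words in two
    time slices; the learned R maps the new space onto the old one.
    """
    # SVD of the cross-covariance matrix gives the optimal orthogonal map.
    U, _, Vt = np.linalg.svd(X_new.T @ X_old)
    return U @ Vt

# Toy check: a space rotated by 90 degrees is recovered exactly.
theta = np.pi / 2
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
X_old = np.random.RandomState(0).randn(50, 2)
X_new = X_old @ R_true.T          # "new" slice = rotated old slice
R = align_embeddings(X_old, X_new)
aligned = X_new @ R               # mapped back into the old coordinate system
```

On exactly rotated data the recovered map is the true rotation, so `aligned` coincides with `X_old`; on real embeddings the alignment is only approximate, which is precisely the over-smoothing risk discussed below.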
These dynamic word vector methods have the following problems when applied to an emergency:
1) although the method of comparing word similarity is simple and feasible, it can only compare the semantics between fixed entities, and lacks flexibility.
2) The basis of linear-transformation alignment is the assumption that most words do not change their meaning over time. This approach works well in general, but for words whose senses have in fact changed over time, it may over-smooth the differences between word senses.
3) Although the non-random initialization method solves the alignment problem, it endows words with the rich semantic information of the earlier corpus and thereby weakens their capacity to express the semantics of the current emergency.
In order to solve the above problems, the present invention proposes a new bursty word vector model for capturing word semantic transitions related to emergency events in a text stream. Unlike previous methods that split the entire time span evenly, the present invention uses, for a given target word, the multiple burst segments of the word detected by a burst detection model as the time spans. We assume that each word has a unique representation for each emergency. These burst-specific embeddings can represent very different semantics in different bursts of the target word without using linear dependencies or alignment constraints. Intuitively, the words in a text that are related to an emergency typically share some common meaning, which collectively represents the semantics of the emergency. Accordingly, the present invention further incorporates an emergency vector to force all related words to share similar semantics.
The information disclosed in this background section is only for enhancement of understanding of the general background of the application and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person skilled in the art.
Disclosure of Invention
The present invention is directed to provide a word vector language model based on emergency events, so as to solve the technical problems in the prior art.
In order to solve the technical problem, the invention provides a Word vector language model based on emergency events, characterized in that the language model uses a traditional Word2Vec model to train on the context, the training comprising: calculating, at the input layer of the model, the hidden-layer information h_c of the input context; simultaneously adding a vector representation b of the emergency event at the input layer; and combining h_c and b by weighted summation to obtain the final hidden-layer representation h, so that the two jointly influence the final hidden-layer representation, which is related not only to the context but also to the emergency.
As a further technical solution, the word vector x_{j+m} of the input layer is composed of two parts: one part is a static word representation generated from the corpus of non-burst (i.e. normal) periods, and the other part is a dynamic word vector representation carrying the characteristics of the emergency event.
As a further technical solution, the hidden-layer representation h is calculated as follows:

h = λ·h_c + (1 − λ)·b,  with  h_c = (1/2m) Σ_{−m ≤ i ≤ m, i ≠ 0} x_{j+i}

where x_{j+i} is the context word vector of the word w_{j+i}, b is an emergency representation in the same vector space as the x_{j+i}, and λ is the weighting coefficient of the summation; each x_{j+m} is itself composed of two parts, the static representation and the emergency representation of the word w_{j+m}.
As a further technical solution, the static representation refers to the word vector obtained by Word2Vec training on the corpus of non-emergency periods; the emergency representation refers to the vector representation of the word within the emergency, which is obtained by training it as a learnable parameter. x_{j+m} is calculated as follows:

x_{j+m} = (1 − γ)·s_{j+m} + γ·u_{j+m}

where s_{j+m} and u_{j+m} are respectively the static representation of the word w_{j+m} and its representation in the emergency, and γ is a weight value balancing the two representations; for the word w_{j+m}, γ is calculated as follows:

γ = f_b / (f_b + f̄)

where f_b and f̄ denote, respectively, the average number of tweets per day in which w_j appears during the burst period b_i and over the whole time period.
As a further technical solution, the vector representation b of the emergency event is affected by its relevance to other emergencies; the relevance depends on whether the emergencies coincide in time, and on the similarity of the words occurring in them.
As a further technical solution, the similarity is expressed as co-occurrence information between words in the emergency corpus and measured by PMI, calculated as follows:

PMI(w1, w2) = log( #(w1, w2) · |D| / ( #(w1) · #(w2) ) )

where #(w1, w2) denotes the number of co-occurrences of the words w1 and w2, #(w1) and #(w2) denote respectively the numbers of occurrences of w1 and w2 in the corpus, and |D| denotes the number of documents in the corpus.
By adopting the technical scheme, the invention has the following beneficial effects:
the word vector learned by the invention can improve the performance on text classification and emergency summary tasks. In the text classification task, the average of the word vectors is used as a vector representation of the text and classification is performed using an SVM classifier. For the emergency summarizing task, the invention selects the top10 neighbor words of the words as the keywords of the emergency, and the keywords can be seen to well summarize the emergency.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the description in the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a diagram of a prior art CBOW;
FIG. 2 is a block diagram of a prior art LSTM;
FIG. 3 is a plot of the tanh function, the activation function selected for the LSTM;
FIG. 4 is a plot of the logistic regression function used to distinguish whether the target word comes from the true distribution or from the noise distribution.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The present invention will be further explained with reference to specific embodiments.
The method of the invention relates to technologies such as intelligent analysis and language modeling, and can be used to generate word vectors for text stream data such that the generated word vectors carry the information of emergency events.
The invention provides a Word vector model (BWE) based on an emergency on the basis of a CBOW structure in a Word2Vec model.
The invention provides a Word vector language model based on emergency events, characterized in that the language model uses a traditional Word2Vec model to train on the context, the training comprising: calculating, at the input layer of the model, the hidden-layer information h_c of the input context; simultaneously adding a vector representation b of the emergency event at the input layer; and combining h_c and b by weighted summation to obtain the final hidden-layer representation h, so that the two jointly influence the final hidden-layer representation, which is related not only to the context but also to the emergency event.
As shown in fig. 1, CBOW is composed of three parts: an input layer, a hidden layer, and an output layer. At the input layer, the model takes the vector representation of each context word as input. The hidden layer takes the context word vectors as input and outputs a vector h_t that represents the historical information. The output layer is of size |V|; it accepts the history representation h_t as input and outputs the posterior probability of each word in the vocabulary using a softmax classifier, calculated as follows:
yt=softmax(Oht+b) (1)
wherein the output vector y_t is a probability distribution whose k-th dimension is the posterior probability that the k-th word of the vocabulary appears, and O is the weight matrix leading directly from the hidden layer to the output layer. O is also called the output word-embedding matrix, and each row of the matrix can likewise be regarded as a word vector.
Given the history information h_t, the posterior probability that the k-th word v_k in the vocabulary V occurs is:

p_θ(v_k | h_t) = softmax(s(v_k, h_t; θ)) (2)
wherein s(v_k, h_t; θ) is an unnormalized score calculated by a neural network, and θ represents all parameters in the network, including the word vectors and the weights and biases of the neural network.
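The output-layer computation of equations (1)–(2) can be sketched in numpy as follows; the vocabulary size, hidden dimension, and randomly generated parameters are purely illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical sizes: vocabulary |V| = 5, hidden dimension d = 3.
rng = np.random.RandomState(1)
O = rng.randn(5, 3)              # output word-embedding matrix of eq. (1)
b = rng.randn(5)                 # output bias
h_t = rng.randn(3)               # hidden representation of the history

y_t = softmax(O @ h_t + b)       # posterior probability over the vocabulary
```

Each row of `O` plays the role of an output word vector; `y_t` sums to 1, matching the probability-distribution reading of equation (1).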
The BWE model is based on the CBOW structure and learns word vector representations using emergency corpora. At the model's input layer, a hidden-layer representation is generated jointly from the context words and the emergency representation. Each emergency may convey some associated semantics that affect the words in texts related to that emergency, so an emergency representation is added at the input layer. The calculation formula is as follows:
in order to hide the layer representation,is the word wjIs used to represent the context word vector of (a),is andand (3) emergency representation in the same vector space.
WhereinIs composed of a word wj+mThe static representation and the emergency representation of (1) are composed of two parts. Static representation refers to Word vectors trained by Word2Vec using non-bursty-time corpora. The emergency representation refers to a vector representation of words in the emergency, which is obtained by training and learning as a parameter.
andare respectively the word wj+mStatic representation of (2) and representation in an emergency. Gamma is a weight value that measures two expressions, and the word wj+mThe calculation formula is as follows:
andeach represents biBurst period and whole time period wjThe number of tweets present per day is averaged. It can be seen thatAbove 1, γ is greater than 0.5. For a given context word, the present invention uses both static and bursty word vectors to represent its semantics. A burst indication should have more weight if its word frequency is significantly larger than a normal word.
As shown in FIG. 2, the emergency representation b is the output representation generated from the input X using the LSTM structure. LSTM is used to learn the emergency representation because the current emergency may be related to earlier emergencies. Each row x_t of the input X represents one emergency representation, a vector derived from the words' PMI information, and the incoming emergency representations are ordered by time of occurrence. The formulas of the LSTM are as follows:
i_t = σ(W_i x_t + U_i h_{t−1} + b_i) (6)
o_t = σ(W_o x_t + U_o h_{t−1} + b_o) (7)
f_t = σ(W_f x_t + U_f h_{t−1} + b_f) (8)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_c x_t + U_c h_{t−1} + b_c) (9)
h_t = o_t ⊙ tanh(c_t) (10)
wherein i_t, o_t, f_t, c_t, and h_t denote, respectively, the input gate, the output gate, the forget gate, the cell state, and the hidden-layer history information. The forget gate f_t controls how much information of the previous internal state c_{t−1} needs to be forgotten; the input gate i_t controls the information of the candidate state at the current time; the output gate o_t controls how much information of the current internal state c_t needs to be output to the external state h_t. W_i, W_o, W_f, W_c and U_i, U_o, U_f, U_c denote the weights of the input gate, output gate, forget gate, and cell state applied to the input x_t and to the previous hidden state h_{t−1}, respectively; b_i, b_o, b_f, and b_c denote the corresponding biases of the gates. In addition, the activation unit selected for the LSTM is the tanh function, which maps a real input into the range [−1, 1]; its curve is shown in FIG. 3.
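A single LSTM step with the gates described above can be sketched in numpy as follows, ending with h_t = o_t · tanh(c_t) of equation (10); the parameter shapes and random initialization are illustrative, not the invention's trained values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM update over parameter dict p (W*: input weights, U*: recurrent)."""
    i = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev + p["bi"])   # input gate
    o = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev + p["bo"])   # output gate
    f = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev + p["bf"])   # forget gate
    c_tilde = np.tanh(p["Wc"] @ x_t + p["Uc"] @ h_prev + p["bc"])
    c = f * c_prev + i * c_tilde          # cell state, cf. eq. (9)
    h = o * np.tanh(c)                    # hidden state, cf. eq. (10)
    return h, c

d_in, d_h = 4, 3                          # hypothetical input / hidden sizes
rng = np.random.RandomState(2)
p = {k: rng.randn(d_h, d_in) * 0.1 for k in ("Wi", "Wo", "Wf", "Wc")}
p.update({k: rng.randn(d_h, d_h) * 0.1 for k in ("Ui", "Uo", "Uf", "Uc")})
p.update({k: np.zeros(d_h) for k in ("bi", "bo", "bf", "bc")})

h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.randn(5, d_in):            # one emergency representation per step
    h, c = lstm_step(x_t, h, c, p)
```

Because o_t ∈ (0, 1) and tanh(c_t) ∈ (−1, 1), every component of the final h is strictly inside (−1, 1), matching the tanh range discussed above.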
In the training process, the invention first learns the vector representation b of the emergency using the LSTM; b is initialized to the word vector learned by the standard Word2Vec model over the whole corpus. To learn the burst-aware embeddings, the invention optimizes the parameters in the temporal order of the emergencies. In this way, the emergency vector can be treated as a fixed constant while learning the representations of the words occurring in the current emergency. For each emergency, a single tweet is sampled first, and the parameters are then optimized by stochastic gradient descent with negative sampling.
When training with the negative sampling method, for each positive example (w_O, h), k negative examples w_i randomly sampled from the noise distribution P_n(w) are used. The objective function of negative sampling is:

J = log σ(s(w_O, h)) + Σ_{i=1..k} log σ(−s(w_i, h)) (11)
The objective of the negative sampling method is likewise a binary classification problem: logistic regression is used to distinguish whether the target word comes from the true distribution or from the noise distribution; the function curve is shown in fig. 4.
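The negative-sampling objective can be sketched as below, using a simple dot-product score s(w, h) = v_w · h; the variable names and toy vectors are illustrative. The objective (to be maximized) is larger when the positive target scores high against the hidden representation and the noise words score low.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_sampling_loss(h, v_pos, V_neg):
    """Negative-sampling objective for one positive example (to maximize).

    h     : hidden representation of the context
    v_pos : output vector of the true target word
    V_neg : k output vectors of words sampled from the noise distribution
    """
    pos_term = np.log(sigmoid(v_pos @ h))            # true word classified as real
    neg_term = np.sum(np.log(sigmoid(-(V_neg @ h)))) # noise words classified as fake
    return pos_term + neg_term

h = np.array([1.0, 0.0])
good = neg_sampling_loss(h, np.array([5.0, 0.0]),
                         np.array([[-5.0, 0.0], [-5.0, 0.0]]))
bad = neg_sampling_loss(h, np.array([-5.0, 0.0]),
                        np.array([[5.0, 0.0], [5.0, 0.0]]))
```

Since every log-sigmoid term is negative, the objective is always below zero; well-separated positive and negative vectors push it close to zero, as the `good` versus `bad` comparison shows.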
In the present invention, the vector representation b of an emergency event is affected by its relevance to other emergencies; the relevance depends on whether the emergencies coincide in time, and on the similarity of the words occurring in them. The similarity is expressed as co-occurrence information between words in the emergency corpus and measured by PMI, calculated as follows:

PMI(w1, w2) = log( #(w1, w2) · |D| / ( #(w1) · #(w2) ) )

where #(w1, w2) denotes the number of co-occurrences of the words w1 and w2, #(w1) and #(w2) denote respectively the numbers of occurrences of w1 and w2 in the corpus, and |D| denotes the number of documents in the corpus.
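The document-level PMI described here, log(#(w1, w2) · |D| / (#(w1) · #(w2))), can be computed as in the following sketch, where occurrences and co-occurrences are counted per document (e.g. per tweet); the toy corpus is illustrative.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_table(documents):
    """PMI over a toy corpus, counting at the document (tweet) level."""
    n_docs = len(documents)                 # |D|
    word_count = Counter()                  # #(w)
    pair_count = Counter()                  # #(w1, w2)
    for doc in documents:
        words = set(doc.split())            # count each word once per document
        word_count.update(words)
        pair_count.update(frozenset(p) for p in combinations(sorted(words), 2))
    return {
        pair: math.log(c * n_docs / (word_count[min(pair)] * word_count[max(pair)]))
        for pair, c in pair_count.items()
    }

docs = ["fire paris", "fire paris rescue", "earthquake taiwan", "fire taiwan"]
pmi = pmi_table(docs)
```

Pairs that co-occur more often than chance (here "fire"/"paris") get positive PMI, while incidental pairs (here "fire"/"taiwan") get negative PMI.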
In addition, the word vectors learned by the method improve performance on text classification and emergency summarization tasks. In the text classification task, the average of the word vectors is used as the vector representation of a text, and classification is performed with an SVM classifier. For the emergency summarization task, the invention selects the top-10 neighboring words of a word as the keywords of the emergency; these keywords are seen to summarize the emergency well.
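Both downstream uses can be sketched with plain vector operations: averaging word vectors to represent a text (the features then fed to an SVM), and ranking nearest neighbors by cosine similarity to extract keywords (top-10 in the invention; k = 2 here on a toy table). The embedding table below is hypothetical.

```python
import numpy as np

def text_vector(words, emb):
    """Average of the word vectors as the representation of a text."""
    return np.mean([emb[w] for w in words if w in emb], axis=0)

def top_k_neighbors(word, emb, k=10):
    """Nearest neighbors of a word by cosine similarity."""
    target = emb[word]
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    scored = [(w, cos(target, v)) for w, v in emb.items() if w != word]
    return [w for w, _ in sorted(scored, key=lambda t: -t[1])[:k]]

emb = {                      # tiny hypothetical embedding table
    "fire":   np.array([1.0, 0.1]),
    "smoke":  np.array([0.9, 0.2]),
    "rescue": np.array([0.8, 0.4]),
    "stock":  np.array([-0.2, 1.0]),
}
doc_vec = text_vector(["fire", "smoke"], emb)   # feature vector for a classifier
keywords = top_k_neighbors("fire", emb, k=2)    # candidate emergency keywords
```

The averaged `doc_vec` would be the input to the SVM classifier, and `keywords` illustrates how nearby words summarize the event.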
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.
Claims (5)
1. An emergency-based Word vector language model, wherein the language model uses a traditional Word2Vec model to train on the context, the training comprising: calculating, at the input layer of the model, the hidden-layer information h_c of the input context; simultaneously adding a vector representation b of the emergency at the input layer; and combining h_c and b by weighted summation to obtain the final hidden-layer representation h, causing the two to jointly affect the final hidden-layer representation, the generated hidden-layer representation being related not only to the context but also to the emergency; the hidden-layer representation h is calculated as follows:

h = λ·h_c + (1 − λ)·b,  with  h_c = (1/2m) Σ_{−m ≤ i ≤ m, i ≠ 0} x_{j+i}

where x_{j+i} is the context word vector of the word w_{j+i}, b is an emergency representation in the same vector space, and λ is the weighting coefficient of the summation.
2. The emergency-based word vector language model of claim 1, wherein the word vector x_{j+m} of the input layer is composed of two parts: one part is a static word representation generated from the corpus of non-burst (i.e. normal) periods, and the other part is a dynamic word vector representation carrying the characteristics of the emergency.
3. The emergency-based Word vector language model of claim 1, wherein the static representation is the word vector obtained by Word2Vec training on the corpus of non-emergency periods; the emergency representation refers to the vector representation of the word within the emergency, obtained by training it as a learnable parameter; x_{j+m} is calculated as follows:

x_{j+m} = (1 − γ)·s_{j+m} + γ·u_{j+m}

wherein s_{j+m} and u_{j+m} are respectively the static representation of the word w_{j+m} and its representation in the emergency; γ is a weight value balancing the two representations and, for the word w_{j+m}, is calculated as follows:

γ = f_b / (f_b + f̄)

wherein f_b and f̄ denote, respectively, the average number of tweets per day in which w_j appears during the burst period b_i and over the whole time period.
4. The emergency-based word vector language model of claim 1, wherein the vector representation b of the emergency is affected by its relevance to other emergencies; the relevance depends on whether the emergencies coincide in time, and on the similarity of the words occurring in them.
5. The emergency-based word vector language model of claim 4, wherein the similarity is expressed as co-occurrence information between words in the emergency corpus and measured by PMI, the calculation formula being:

PMI(w1, w2) = log( #(w1, w2) · |D| / ( #(w1) · #(w2) ) )

wherein #(w1, w2) denotes the number of co-occurrences of the words w1 and w2, #(w1) and #(w2) denote respectively the numbers of occurrences of w1 and w2 in the corpus, and |D| denotes the number of documents in the corpus.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910915299.6A CN110688485B (en) | 2019-09-26 | 2019-09-26 | Word vector language model based on emergency |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910915299.6A CN110688485B (en) | 2019-09-26 | 2019-09-26 | Word vector language model based on emergency |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110688485A CN110688485A (en) | 2020-01-14 |
CN110688485B true CN110688485B (en) | 2022-03-11 |
Family
ID=69110299
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910915299.6A Active CN110688485B (en) | 2019-09-26 | 2019-09-26 | Word vector language model based on emergency |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110688485B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106886567A (en) * | 2017-01-12 | 2017-06-23 | 北京航空航天大学 | Microblogging incident detection method and device based on semantic extension |
CN108804417A (en) * | 2018-05-21 | 2018-11-13 | 山东科技大学 | A kind of documentation level sentiment analysis method based on specific area emotion word |
CN108932311A (en) * | 2018-06-20 | 2018-12-04 | 天津大学 | The method of incident detection and prediction |
CN109582785A (en) * | 2018-10-31 | 2019-04-05 | 天津大学 | Emergency event public sentiment evolution analysis method based on text vector and machine learning |
CN109635116A (en) * | 2018-12-17 | 2019-04-16 | 腾讯科技(深圳)有限公司 | Training method, electronic equipment and the computer storage medium of text term vector model |
CN109960798A (en) * | 2019-03-01 | 2019-07-02 | 国网新疆电力有限公司信息通信公司 | Uighur text emergency event element recognition methods |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10572585B2 (en) * | 2017-11-30 | 2020-02-25 | International Business Machines Coporation | Context-based linguistic analytics in dialogues |
- 2019-09-26: CN application CN201910915299.6A filed; patent CN110688485B granted (status: Active)
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106886567A (en) * | 2017-01-12 | 2017-06-23 | 北京航空航天大学 | Microblogging incident detection method and device based on semantic extension |
CN108804417A (en) * | 2018-05-21 | 2018-11-13 | 山东科技大学 | A kind of documentation level sentiment analysis method based on specific area emotion word |
CN108932311A (en) * | 2018-06-20 | 2018-12-04 | 天津大学 | The method of incident detection and prediction |
CN109582785A (en) * | 2018-10-31 | 2019-04-05 | 天津大学 | Emergency event public sentiment evolution analysis method based on text vector and machine learning |
CN109635116A (en) * | 2018-12-17 | 2019-04-16 | 腾讯科技(深圳)有限公司 | Training method, electronic equipment and the computer storage medium of text term vector model |
CN109960798A (en) * | 2019-03-01 | 2019-07-02 | 国网新疆电力有限公司信息通信公司 | Uighur text emergency event element recognition methods |
Non-Patent Citations (2)
Title |
---|
Word Vector Compositionality based Relevance Feedback using Kernel Density Estimation; Dwaipayan Roy; CIKM '16: Proceedings of the 25th ACM International Conference on Information and Knowledge Management; 2016-10-24; full text *
Research on sentiment recognition of online public opinion about sudden disaster events based on long short-term memory networks; Jin Zhanyong, Tian Yapeng, Bai Mang; Information Science (《情报科学》); 2019-05-01; Vol. 37, No. 5; full text *
Also Published As
Publication number | Publication date |
---|---|
CN110688485A (en) | 2020-01-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108984745B (en) | Neural network text classification method fusing multiple knowledge maps | |
Jupalle et al. | Automation of human behaviors and its prediction using machine learning | |
CN108733792B (en) | Entity relation extraction method | |
CN111368996B (en) | Retraining projection network capable of transmitting natural language representation | |
CN109635109B (en) | Sentence classification method based on LSTM and combined with part-of-speech and multi-attention mechanism | |
US11580975B2 (en) | Systems and methods for response selection in multi-party conversations with dynamic topic tracking | |
CN109992773B (en) | Word vector training method, system, device and medium based on multi-task learning | |
CN108549658B (en) | Deep learning video question-answering method and system based on attention mechanism on syntax analysis tree | |
Asr et al. | Comparing Predictive and Co-occurrence Based Models of Lexical Semantics Trained on Child-directed Speech. | |
CN109800437A (en) | A kind of name entity recognition method based on Fusion Features | |
CN112800190B (en) | Intent recognition and slot value filling joint prediction method based on Bert model | |
US11023686B2 (en) | Method and system for resolving abstract anaphora using hierarchically-stacked recurrent neural network (RNN) | |
CN110866113B (en) | Text classification method based on sparse self-attention mechanism fine-tuning burt model | |
CN110532395B (en) | Semantic embedding-based word vector improvement model establishing method | |
CN111125520B (en) | Event line extraction method based on deep clustering model for news text | |
CN113435208B (en) | Training method and device for student model and electronic equipment | |
CN110276396B (en) | Image description generation method based on object saliency and cross-modal fusion features | |
CN112131367A (en) | Self-auditing man-machine conversation method, system and readable storage medium | |
CN115393933A (en) | Video face emotion recognition method based on frame attention mechanism | |
CN111914553A (en) | Financial information negative subject judgment method based on machine learning | |
Tao et al. | News text classification based on an improved convolutional neural network | |
Verma et al. | Semantic similarity between short paragraphs using Deep Learning | |
Liu et al. | A Hybrid Neural Network BERT‐Cap Based on Pre‐Trained Language Model and Capsule Network for User Intent Classification | |
CN110688485B (en) | Word vector language model based on emergency | |
CN116306869A (en) | Method for training text classification model, text classification method and corresponding device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |