CN109670171B - Word vector representation learning method based on word pair asymmetric co-occurrence - Google Patents
- Publication number
- CN109670171B (application CN201811413427.9A)
- Authority
- CN
- China
- Prior art keywords
- word
- occurrence
- vector representation
- corpus
- low
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Abstract
The invention belongs to the field of natural language processing, and particularly relates to a word vector representation learning method based on word pair asymmetric co-occurrence. The method comprises the following steps. S100, counting a word list from the corpus: count the occurrence frequency of each word in a given corpus and order the words from high to low by frequency. S200, sequentially traversing the words in the corpus and counting a left co-occurrence matrix and a right co-occurrence matrix, denoted X_L and X_R. S300, setting the model hyper-parameters and, adopting the objective function of the GloVe model, training from X_L and X_R respectively the left low-dimensional vector representation V_L and the right low-dimensional vector representation V_R of each word, then concatenating them to obtain the low-dimensional vector representation V = [V_L, V_R]. The invention adopts a parallel computing method to train word vectors from the two co-occurrence matrices simultaneously, greatly reducing program running time.
Description
Technical Field
The invention belongs to the field of natural language processing, and particularly relates to a word vector representation learning method based on word pair asymmetric co-occurrence.
Background
In the field of natural language processing, there are many ways to represent words inside computers; the typical ones are the following:
1) One-hot representation, used in conventional rule-based and statistical natural language processing methods. Each word is represented as a vector whose length is the vocabulary size; exactly one dimension has the value 1, identifying the current word, and all other dimensions are 0. Such a representation is not conducive to semantic computation over words.
2) Distributional representation. The vector length is again the vocabulary size. The vectors are obtained by counting a co-occurrence matrix from a corpus: each row of the matrix corresponds to a word, each column also corresponds to a word, and each element records how often the two words co-occur in the corpus. Each row of the matrix is then the word vector of the corresponding word. This representation carries more semantic information than the one-hot representation.
3) Distributed representation: a low-dimensional dense vector obtained by reducing the dimensionality of the distributional representation through various methods. It overcomes the shortcomings of the distributional representation and supports semantic computation better.
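The three representations can be contrasted in a short Python sketch (a toy illustration with an invented three-word vocabulary and made-up dense vectors, not code from the patent):

```python
import numpy as np

vocab = ["cat", "dog", "apple"]            # toy vocabulary, size n = 3
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # 1) one-hot: a single 1 at the word's position, 0 elsewhere
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

# Any two distinct one-hot vectors are orthogonal, so this encoding
# carries no notion of similarity between words.
assert one_hot("cat") @ one_hot("dog") == 0.0

# 3) distributed: low-dimensional dense vectors (values made up here)
# do allow semantic comparison, e.g. via cosine similarity.
dense = {"cat": np.array([0.9, 0.1]), "dog": np.array([0.8, 0.2])}
cos = dense["cat"] @ dense["dog"] / (
    np.linalg.norm(dense["cat"]) * np.linalg.norm(dense["dog"])
)
```

With these made-up vectors, "cat" and "dog" come out highly similar under the dense encoding while remaining orthogonal under one-hot, which is exactly the gap the low-dimensional representations close.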
The low-dimensional word representation method based on the GloVe model is one of the main representation learning methods at present; the GloVe model's learning algorithm is relatively simple, efficient, and easy to implement, and the trained word vectors perform well on semantic similarity and word inference tasks.
Detailed description of the Glove model is described in the following documents:
Pennington J, Socher R, Manning C. GloVe: Global Vectors for Word Representation [C] // Conference on Empirical Methods in Natural Language Processing. 2014: 1532-1543.
The GloVe model mainly comprises the following steps: set a fixed window size, take the words within the fixed window on both sides of each word (the target word) as its context, count co-occurrence frequencies to generate a co-occurrence matrix, and then train with stochastic gradient descent to obtain the vector representation of each word. Although the model performs well, it does not consider word order: when counting the co-occurrence matrix of a target word, the words on its left and right sides are not treated differently but are mixed together as the target word's context, so the precision of the word vectors trained from such a co-occurrence matrix can be further improved.
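For reference, the symmetric windowing described above can be sketched as follows (a minimal reading of the GloVe counting step with 1/distance weighting; the function and variable names are our own, not from the GloVe code):

```python
from collections import defaultdict

def symmetric_cooccurrence(tokens, window=2):
    """Count weighted co-occurrences, mixing left and right context together."""
    X = defaultdict(float)
    for i in range(len(tokens)):
        target = tokens[i]
        for j in range(max(0, i - window), i):          # left context
            X[(target, tokens[j])] += 1.0 / (i - j)     # weight = 1/distance
        for j in range(i + 1, min(len(tokens), i + window + 1)):  # right context
            X[(target, tokens[j])] += 1.0 / (j - i)     # mixed into the same count
    return X

X = symmetric_cooccurrence("a b a c".split(), window=1)
```

Note that left and right contributions land in the same entry, so the resulting counts are symmetric: X[("a", "b")] equals X[("b", "a")]. This is precisely the order information the patent's left/right split preserves.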
Disclosure of Invention
In order to solve the problems, the invention provides a word vector representation learning method based on word pair asymmetric co-occurrence.
The invention adopts the following technical scheme: a word vector representation learning method based on word pair asymmetric co-occurrence comprises the following steps.
S100, counting a word list from the corpus: count the number of occurrences of each word in a given corpus and order the words by frequency from high to low; c_i denotes the i-th word and f_i its frequency, with 1 ≤ i ≤ n, where n is the number of distinct words in the corpus.
S200, setting the fixed window size to w, sequentially traversing the words in the corpus, and counting a left co-occurrence matrix and a right co-occurrence matrix, denoted X_L and X_R; both matrices are of size n × n.
The rows of each matrix are indexed by the words' sequence numbers in the vocabulary, and so are the columns; the position of the k-th co-occurrence of c_i and c_j in the corpus determines the weight contributed to the corresponding matrix entry.
The process of counting the left co-occurrence matrix and the right co-occurrence matrix is as follows:
S201, initialize the matrices X_L and X_R to 0;
S202, traverse each word in the corpus and find its sequence number i in the word list;
S203, traverse each word co-occurring to the left of the current word within the fixed window, find its sequence number j in the word list, compute a weight from the relative positions of word i and word j, and accumulate it into X^L_{ij}; at the same time, accumulate the same weight into X^R_{ji}. After the traversal finishes, the left co-occurrence matrix X_L and the right co-occurrence matrix X_R have been generated.
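Steps S201 to S203 can be sketched as follows (a hedged reading of the patent: the weight is assumed here to be the reciprocal of the distance between the two words, which the text does not spell out beyond "calculated from the relative position"; names are our own):

```python
import numpy as np

def asymmetric_cooccurrence(token_ids, n, window):
    """Build separate left (X_L) and right (X_R) co-occurrence matrices."""
    XL = np.zeros((n, n))                  # S201: initialize both matrices to 0
    XR = np.zeros((n, n))
    for pos, i in enumerate(token_ids):    # S202: traverse the corpus
        for d in range(1, window + 1):     # S203: words within the left window
            if pos - d < 0:
                break
            j = token_ids[pos - d]
            w = 1.0 / d                    # assumed 1/distance weighting
            XL[i, j] += w                  # j is left context of i ...
            XR[j, i] += w                  # ... so i is right context of j
    return XL, XR

XL, XR = asymmetric_cooccurrence([0, 1, 0], n=2, window=1)
```

Scanning only the left window and mirroring each pair into X_R covers every ordered co-occurrence exactly once per pass; by construction X_R is the transpose of X_L, but the two matrices feed two separate trainings.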
S300, setting the model hyper-parameters, adopting the objective function of the GloVe model, and training from X_L and X_R respectively the left low-dimensional vector representation V_L and the right low-dimensional vector representation V_R of each word, then concatenating them to obtain the low-dimensional vector representation V = [V_L, V_R].
Training V_L, the objective function is:

$$J_L = \sum_{i,j=1}^{n} f\left(X_{ij}^{L}\right)\left( {v_i^{L}}^{\top} \tilde{v}_j^{L} + b_i^{L} + \tilde{b}_j^{L} - \log X_{ij}^{L} \right)^2$$

where $v_i^{L}$ and $\tilde{v}_j^{L}$ respectively denote the left low-dimensional word vectors of $c_i$ and $c_j$, $b_i^{L}$ and $\tilde{b}_j^{L}$ are the corresponding bias terms, and $f(X_{ij}^{L})$ is a weighting function that weights each term of the objective according to the co-occurrence frequency of the word pair.
Training V_R, the objective function is:

$$J_R = \sum_{i,j=1}^{n} f\left(X_{ij}^{R}\right)\left( {v_i^{R}}^{\top} \tilde{v}_j^{R} + b_i^{R} + \tilde{b}_j^{R} - \log X_{ij}^{R} \right)^2$$

where $v_i^{R}$ and $\tilde{v}_j^{R}$ respectively denote the right low-dimensional word vectors of $c_i$ and $c_j$, $b_i^{R}$ and $\tilde{b}_j^{R}$ are the corresponding bias terms, and $f(X_{ij}^{R})$ is a weighting function that weights each term of the objective according to the co-occurrence frequency of the word pair.

$f(X_{ij}^{L})$ and $f(X_{ij}^{R})$ are weighted in the same way as in the GloVe model; the function is:

$$f(x) = \begin{cases} (x/x_{\max})^{\alpha}, & x < x_{\max} \\ 1, & x \ge x_{\max} \end{cases}$$
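That weighting function, taken unchanged from the GloVe paper (where the defaults are x_max = 100 and alpha = 3/4), can be written as:

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(x) = (x / x_max)^alpha for x < x_max, else 1.

    Rare pairs are down-weighted; very frequent pairs are capped at 1 so
    that stop-word co-occurrences do not dominate the objective.
    """
    return (x / x_max) ** alpha if x < x_max else 1.0
```

Note that f(0) = 0, so word pairs that never co-occur contribute nothing to the objective, which also keeps the log term well defined since those entries are simply skipped.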
Compared with the prior art, the invention provides a new windowing mode in which the words in fixed windows before and after the target word are taken as separate contexts, and effectively fuses the word vectors trained under the two windowing modes into a single word representation vector. This improves word vector precision, yields a clear accuracy gain on public test sets for the grammatical word inference task, and lends itself to parallel computation.
The invention improves the way the GloVe model counts the co-occurrence matrix, with three main advantages:
1. an asymmetric mode statistical method of the word pair co-occurrence is provided, and a left co-occurrence matrix and a right co-occurrence matrix are counted.
2. An effective fusion mode of vectors trained by two co-occurrence matrixes is provided, and word expression vectors with higher precision than those under a symmetrical window under the same dimensionality can be obtained.
3. The word vectors are trained by two co-occurrence matrixes simultaneously by adopting a parallel computing method, so that the running time of a program is greatly reduced.
Drawings
FIG. 1 is a flow chart of the present invention.
Fig. 2 is a flow chart for generating a left-side co-occurrence matrix and a right-side co-occurrence matrix.
Detailed Description
As shown in Fig. 1, a word vector representation learning method based on word pair asymmetric co-occurrence includes the following steps.
S100, counting a word list from the corpus: count the number of occurrences of each word in a given corpus and order the words by frequency from high to low; c_i denotes the i-th word and f_i its frequency, with 1 ≤ i ≤ n, where n is the number of distinct words in the corpus.
S200, setting the fixed window size to w, sequentially traversing the words in the corpus, and counting a left co-occurrence matrix and a right co-occurrence matrix, denoted X_L and X_R; both matrices are of size n × n;
The rows of each matrix are indexed by the words' sequence numbers in the vocabulary, and so are the columns; the position of the k-th co-occurrence of c_i and c_j in the corpus determines the weight contributed to the corresponding matrix entry.
The process of counting the left co-occurrence matrix and the right co-occurrence matrix is as follows:
S201, initialize the matrices X_L and X_R to 0;
S202, traverse each word in the corpus and find its sequence number i in the word list;
S203, traverse each word co-occurring to the left of the current word within the fixed window, find its sequence number j in the word list, compute a weight from the relative positions of word i and word j, and accumulate it into X^L_{ij}; at the same time, accumulate the same weight into X^R_{ji}. After the traversal finishes, the left co-occurrence matrix X_L and the right co-occurrence matrix X_R have been generated.
S300, setting the model hyper-parameters, adopting the objective function of the GloVe model, and training from X_L and X_R respectively the left low-dimensional vector representation V_L and the right low-dimensional vector representation V_R of each word, then concatenating them to obtain the low-dimensional vector representation V = [V_L, V_R].
Training V_L, the objective function is:

$$J_L = \sum_{i,j=1}^{n} f\left(X_{ij}^{L}\right)\left( {v_i^{L}}^{\top} \tilde{v}_j^{L} + b_i^{L} + \tilde{b}_j^{L} - \log X_{ij}^{L} \right)^2$$

where $v_i^{L}$ and $\tilde{v}_j^{L}$ respectively denote the left low-dimensional word vectors of $c_i$ and $c_j$, $b_i^{L}$ and $\tilde{b}_j^{L}$ are the corresponding bias terms, and $f(X_{ij}^{L})$ is a weighting function that weights each term of the objective according to the co-occurrence frequency of the word pair;
training V_R, the objective function is:

$$J_R = \sum_{i,j=1}^{n} f\left(X_{ij}^{R}\right)\left( {v_i^{R}}^{\top} \tilde{v}_j^{R} + b_i^{R} + \tilde{b}_j^{R} - \log X_{ij}^{R} \right)^2$$

where $v_i^{R}$ and $\tilde{v}_j^{R}$ respectively denote the right low-dimensional word vectors of $c_i$ and $c_j$, $b_i^{R}$ and $\tilde{b}_j^{R}$ are the corresponding bias terms, and $f(X_{ij}^{R})$ is a weighting function that weights each term of the objective according to the co-occurrence frequency of the word pair.

$f(X_{ij}^{L})$ and $f(X_{ij}^{R})$ are weighted in the same way as in the GloVe model; the function is:

$$f(x) = \begin{cases} (x/x_{\max})^{\alpha}, & x < x_{\max} \\ 1, & x \ge x_{\max} \end{cases}$$
Example:
1. Select the English Wikipedia corpus and generate a vocabulary from the 100000 most frequent words.
2. Set the fixed window size to 10, and count for each word in the corpus the ten words before it and the ten words after it separately, obtaining the left and right co-occurrence matrices X_L and X_R.
3. Set the initial learning rate to 0.05 and the number of iterations to 50; train from X_L and X_R respectively a 300-dimensional left word vector representation V_L and a 300-dimensional right word vector representation V_R, then concatenate them to obtain a 600-dimensional word vector representation.
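The splicing in step 3 amounts to a row-wise concatenation; a sketch with random placeholders standing in for the trained vectors (and a smaller vocabulary than the 100000-word example, for brevity):

```python
import numpy as np

n, d = 1000, 300                        # vocabulary size, per-side dimension
rng = np.random.default_rng(0)
VL = rng.standard_normal((n, d))        # placeholder for the trained left vectors
VR = rng.standard_normal((n, d))        # placeholder for the trained right vectors

V = np.concatenate([VL, VR], axis=1)    # V = [V_L, V_R]: each word's final
                                        # vector is its left and right halves
```

Each row of V is a 600-dimensional word vector whose first 300 components come from the left-context training and whose last 300 come from the right-context training.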
Table 1 compares, on the grammar-based word inference task, the word vector representation trained by this method with that trained by the GloVe model. The GloVe model uses a symmetric window with fixed window size 10, initial learning rate 0.05, 50 iterations, and 600-dimensional word vectors. Four corpora of different sizes were split from the English Wikipedia corpus, containing 200 million, 500 million, 1 billion, and 1.6 billion words respectively, with file sizes of 1.09 GB, 2.71 GB, 5.42 GB, and 8.64 GB. The data in the table compare the accuracy of the 600-dimensional word vectors trained by the invention and by the GloVe model on the grammatical word inference task.
TABLE 1. Comparison of the present invention and the GloVe model on the grammar-based word inference task
The experimental results show that the method achieves higher task accuracy than the GloVe model on corpora of all sizes. Moreover, when training word vectors of the same final dimensionality, the method uses parallel processing to train V_L and V_R simultaneously and then splices them into the word vector V = [V_L, V_R]; since V_L and V_R each have half the dimensionality of the word vector trained by the GloVe model, training time is greatly reduced.
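Because X_L and X_R define two independent optimizations, the two trainings can run concurrently; a minimal sketch (the `train` function below is a placeholder, not the patent's actual trainer):

```python
from concurrent.futures import ThreadPoolExecutor

def train(matrix_name):
    # placeholder standing in for optimizing the GloVe objective on one matrix
    return f"300-d vectors from {matrix_name}"

# X_L and X_R share no parameters, so the two trainings can proceed in
# parallel; each produces vectors of half the final dimensionality.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(train, ["X_L", "X_R"]))
```

In practice a real speedup in Python would require processes (or a multi-threaded C implementation, as the GloVe reference code uses) rather than threads, because of the interpreter lock; the sketch only illustrates the independence of the two jobs.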
Claims (3)
1. A word vector representation learning method based on word pair asymmetric co-occurrence, characterized in that it comprises the following steps:
S100, counting a word list from the corpus: count the number of occurrences of each word in a given corpus and order the words by frequency from high to low; c_i denotes the i-th word and f_i its frequency, with 1 ≤ i ≤ n, where n is the number of distinct words in the corpus;
S200, setting the fixed window size to w, sequentially traversing the words in the corpus, and counting a left co-occurrence matrix and a right co-occurrence matrix, denoted X_L and X_R; both matrices are of size n × n;
the rows of each matrix are indexed by the words' sequence numbers in the vocabulary, and so are the columns; the position of the k-th co-occurrence of c_i and c_j in the corpus determines the weight contributed to the corresponding matrix entry;
S300, setting the model hyper-parameters, adopting the objective function of the GloVe model, and training from X_L and X_R respectively the left low-dimensional vector representation V_L and the right low-dimensional vector representation V_R of each word, then concatenating them to obtain the low-dimensional vector representation V = [V_L, V_R].
2. The word vector representation learning method based on word pair asymmetric co-occurrence according to claim 1, characterized in that: in step S200, the process of counting the left co-occurrence matrix and the right co-occurrence matrix is as follows:
S201, initialize the matrices X_L and X_R to 0;
S202, traverse each word in the corpus and find its sequence number i in the word list;
S203, traverse each word co-occurring to the left of the current word within the fixed window, find its sequence number j in the word list, compute a weight according to the relative positions of c_i and c_j, and accumulate it into X^L_{ij}; at the same time, accumulate the same weight into X^R_{ji}; after the traversal finishes, the left co-occurrence matrix X_L and the right co-occurrence matrix X_R have been generated.
3. The word vector representation learning method based on word pair asymmetric co-occurrence according to claim 2, characterized in that: the step S300 specifically adopts the following method,
training V_L, the objective function is:

$$J_L = \sum_{i,j=1}^{n} f\left(X_{ij}^{L}\right)\left( {v_i^{L}}^{\top} \tilde{v}_j^{L} + b_i^{L} + \tilde{b}_j^{L} - \log X_{ij}^{L} \right)^2$$

where $v_i^{L}$ and $\tilde{v}_j^{L}$ respectively denote the left low-dimensional word vectors of $c_i$ and $c_j$, $b_i^{L}$ and $\tilde{b}_j^{L}$ are the corresponding bias terms, and $f(X_{ij}^{L})$ is a weighting function that weights each term of the objective according to the co-occurrence frequency of the word pair;

training V_R, the objective function is:

$$J_R = \sum_{i,j=1}^{n} f\left(X_{ij}^{R}\right)\left( {v_i^{R}}^{\top} \tilde{v}_j^{R} + b_i^{R} + \tilde{b}_j^{R} - \log X_{ij}^{R} \right)^2$$

where $v_i^{R}$ and $\tilde{v}_j^{R}$ respectively denote the right low-dimensional word vectors of $c_i$ and $c_j$, $b_i^{R}$ and $\tilde{b}_j^{R}$ are the corresponding bias terms, and $f(X_{ij}^{R})$ is a weighting function that weights each term of the objective according to the co-occurrence frequency of the word pair.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811413427.9A CN109670171B (en) | 2018-11-23 | 2018-11-23 | Word vector representation learning method based on word pair asymmetric co-occurrence |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811413427.9A CN109670171B (en) | 2018-11-23 | 2018-11-23 | Word vector representation learning method based on word pair asymmetric co-occurrence |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109670171A CN109670171A (en) | 2019-04-23 |
CN109670171B true CN109670171B (en) | 2021-05-14 |
Family
ID=66142590
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811413427.9A Active CN109670171B (en) | 2018-11-23 | 2018-11-23 | Word vector representation learning method based on word pair asymmetric co-occurrence |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109670171B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110781686B (en) * | 2019-10-30 | 2023-04-18 | 普信恒业科技发展(北京)有限公司 | Statement similarity calculation method and device and computer equipment |
CN111859910B (en) * | 2020-07-15 | 2022-03-18 | 山西大学 | Word feature representation method for semantic role recognition and fusing position information |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106682089A (en) * | 2016-11-26 | 2017-05-17 | 山东大学 | RNNs-based method for automatic safety checking of short message |
CN107577668A (en) * | 2017-09-15 | 2018-01-12 | 电子科技大学 | Social media non-standard word correcting method based on semanteme |
Family Cites Families (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9779085B2 (en) * | 2015-05-29 | 2017-10-03 | Oracle International Corporation | Multilingual embeddings for natural language processing |
US9880999B2 (en) * | 2015-07-03 | 2018-01-30 | The University Of North Carolina At Charlotte | Natural language relatedness tool using mined semantic analysis |
CN105243083B (en) * | 2015-09-08 | 2018-09-07 | 百度在线网络技术(北京)有限公司 | Document subject matter method for digging and device |
US20170161275A1 (en) * | 2015-12-08 | 2017-06-08 | Luminoso Technologies, Inc. | System and method for incorporating new terms in a term-vector space from a semantic lexicon |
US10019438B2 (en) * | 2016-03-18 | 2018-07-10 | International Business Machines Corporation | External word embedding neural network language models |
CN107220220A (en) * | 2016-03-22 | 2017-09-29 | 索尼公司 | Electronic equipment and method for text-processing |
CN106844342B (en) * | 2017-01-12 | 2019-10-08 | 北京航空航天大学 | Term vector generation method and device based on incremental learning |
US20180260381A1 (en) * | 2017-03-09 | 2018-09-13 | Xerox Corporation | Prepositional phrase attachment over word embedding products |
CN108460022A (en) * | 2018-03-20 | 2018-08-28 | 福州大学 | A kind of text Valence-Arousal emotional intensities prediction technique and system |
CN108399163B (en) * | 2018-03-21 | 2021-01-12 | 北京理工大学 | Text similarity measurement method combining word aggregation and word combination semantic features |
CN108829667A (en) * | 2018-05-28 | 2018-11-16 | 南京柯基数据科技有限公司 | It is a kind of based on memory network more wheels dialogue under intension recognizing method |
CN108829672A (en) * | 2018-06-05 | 2018-11-16 | 平安科技(深圳)有限公司 | Sentiment analysis method, apparatus, computer equipment and the storage medium of text |
CN108694476A (en) * | 2018-06-29 | 2018-10-23 | 山东财经大学 | A kind of convolutional neural networks Stock Price Fluctuation prediction technique of combination financial and economic news |
- 2018-11-23 CN CN201811413427.9A patent/CN109670171B/en active Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106682089A (en) * | 2016-11-26 | 2017-05-17 | 山东大学 | RNNs-based method for automatic safety checking of short message |
CN107577668A (en) * | 2017-09-15 | 2018-01-12 | 电子科技大学 | Social media non-standard word correcting method based on semanteme |
Non-Patent Citations (4)
Title |
---|
Chinese sign language recognition based on gray-level co-occurrence matrix and other multi-features fusion; Yulong Li et al.; 2009 4th IEEE Conference on Industrial Electronics and Applications; 2009-06-30; full text *
Topic mover's distance based document classification; Xinhui Wu et al.; 2017 IEEE 17th International Conference on Communication Technology (ICCT); 2018-05-17; full text *
Sentiment prediction for image-text fusion media based on convolutional neural networks; Cai Guoyong et al.; Journal of Computer Applications; 2016-02-10; Vol. 36, No. 2; full text *
An attention model for sentiment analysis using recurrent neural networks; Li Songru et al.; Journal of Huaqiao University (Natural Science); 2018-03-31; Vol. 39, No. 2; full text *
Also Published As
Publication number | Publication date |
---|---|
CN109670171A (en) | 2019-04-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107291693B (en) | Semantic calculation method for improved word vector model | |
CN110222349B (en) | Method and computer for deep dynamic context word expression | |
CN107273355B (en) | Chinese word vector generation method based on word and phrase joint training | |
WO2020062770A1 (en) | Method and apparatus for constructing domain dictionary, and device and storage medium | |
CN107358948B (en) | Language input relevance detection method based on attention model | |
CN106202010B (en) | Method and apparatus based on deep neural network building Law Text syntax tree | |
CN106547737B (en) | Sequence labeling method in natural language processing based on deep learning | |
CN108446271B (en) | Text emotion analysis method of convolutional neural network based on Chinese character component characteristics | |
CN109635124A (en) | A kind of remote supervisory Relation extraction method of combination background knowledge | |
CN111291556B (en) | Chinese entity relation extraction method based on character and word feature fusion of entity meaning item | |
CN110222178A (en) | Text sentiment classification method, device, electronic equipment and readable storage medium storing program for executing | |
Huang et al. | Sndcnn: Self-normalizing deep cnns with scaled exponential linear units for speech recognition | |
CN107480143A (en) | Dialogue topic dividing method and system based on context dependence | |
CN111966812B (en) | Automatic question answering method based on dynamic word vector and storage medium | |
CN107526834A (en) | Joint part of speech and the word2vec improved methods of the correlation factor of word order training | |
CN107766320A (en) | A kind of Chinese pronoun resolution method for establishing model and device | |
CN110826338A (en) | Fine-grained semantic similarity recognition method for single-choice gate and inter-class measurement | |
CN110489554B (en) | Attribute-level emotion classification method based on location-aware mutual attention network model | |
CN113204674B (en) | Video-paragraph retrieval method and system based on local-overall graph inference network | |
CN110858480B (en) | Speech recognition method based on N-element grammar neural network language model | |
CN109670171B (en) | Word vector representation learning method based on word pair asymmetric co-occurrence | |
CN115438154A (en) | Chinese automatic speech recognition text restoration method and system based on representation learning | |
CN110874392B (en) | Text network information fusion embedding method based on depth bidirectional attention mechanism | |
CN114254645A (en) | Artificial intelligence auxiliary writing system | |
Rei | Online representation learning in recurrent neural language models |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | ||
TR01 | Transfer of patent right |
Effective date of registration: 2023-10-11
Address after: Room 305, Yunlv Tianxia Maker Space, AB Podium Building, No. 529 South Zhonghuan Street, Xuefu Industrial Park, Shanxi Transformation and Comprehensive Reform Demonstration Zone, Taiyuan City, Shanxi Province, 030006
Patentee after: Shanxi Zhonghuida Technology Co.,Ltd.
Address before: 030006 No. 92, Hollywood Road, Taiyuan, Shanxi
Patentee before: SHANXI University