CN111026848B - Chinese word vector generation method based on similar context and reinforcement learning - Google Patents
- Publication number: CN111026848B (application number CN201911301344.5A)
- Authority: CN (China)
- Prior art keywords: chinese, word, words, similar, representing
- Prior art date
- Legal status: Expired - Fee Related (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses a Chinese word vector generation method based on similar context and reinforcement learning. Existing Chinese word vector generation methods predict a target word only from the relation between the target word and its adjacent context, ignoring the fact that some adjacent Chinese words are semantically unrelated, so the resulting word vectors are of limited quality. The method comprises the following steps: selecting a corpus and preprocessing it to construct a Chinese corpus; performing similar context discovery on each Chinese target word to obtain the similar context semantically related to it; and constructing a Chinese word vector reinforcement learning framework and performing reinforcement learning to obtain the word vector representation of the Chinese target word. The method overcomes the problem of semantically unrelated adjacent Chinese words and generates high-quality Chinese word vectors.
Description
Technical Field
The invention relates to the technical field of natural language processing, in particular to a Chinese word vector generation method based on similar context and reinforcement learning.
Background
Natural language processing is an important direction in the fields of computer science and artificial intelligence. Current natural language processing tasks include machine translation, sentiment analysis, text summarization, text classification, information extraction, and so on. In any such task, the first question is how to make natural language representable by a computer. Since a computer cannot process natural language directly, a method is needed to represent natural language mathematically, namely the word vector. A word vector is a real-valued vector that represents a natural language unit together with its semantics: words are mapped into a vector space and represented as vectors. Generally speaking, the higher the quality of a word vector, the richer and more accurate the semantic information it contains, the more easily a computer can understand the semantics of natural language, and the better the results of downstream natural language processing tasks. How to generate high-quality word vectors is therefore a foundation of, and an important research topic in, natural language processing.
Current research on word vectors follows two main directions. First, general word vector methods: these apply to many languages, such as Chinese, English, and Japanese, and can represent the words of each as word vectors. They fall into two categories: those that represent a word as a point in a vector space, and those that represent a word as a Gaussian distribution. Second, language-specific word vector methods: these apply only to a particular language and exploit its fine-grained features, such as Chinese radicals, strokes, and pinyin, or English letters and suffixes. Chinese patent CN107273355A, "a Chinese word vector generation method based on word combination training", provides a Chinese word vector method. That patent takes the character information inside words as an important feature, combines context words and characters, and jointly trains Chinese word vector representations: on top of a word-based word vector model, it introduces the character composition of words, so that the target word is predicted from context characters as well as from context words. Chinese patent CN109815476A, "a word vector representation method based on joint statistics of Chinese morphemes and pinyin", provides another Chinese word vector generation method. It uses the morpheme and pinyin features of Chinese characters and, based on the joint morpheme-pinyin features of the context words, trains a three-layer neural network to predict the central target word and then generate word vectors.
These existing methods lay a foundation for word vector research, particularly for Chinese word vectors, but have the following shortcomings. First, general word vector methods apply to most languages but ignore language-specific features; they generalize well but are less accurate, and cannot substantially improve the precision of downstream natural language processing tasks. Second, existing Chinese word vector methods simply add Chinese features without improving the underlying neural network architecture, so their accuracy cannot be raised further. Moreover, all of these methods predict a target word from its adjacent context, without considering that some adjacent words in Chinese are semantically unrelated. In short, existing word vector generation methods predict words from adjacent-context features using only a simple, unimproved neural network architecture, and thus cannot obtain high-quality Chinese word vectors.
Disclosure of Invention
The technical problem to be solved by the invention is that existing Chinese word vector generation methods predict a target word from the relation between the target word and its adjacent context, ignoring that some adjacent Chinese words are semantically unrelated, so the resulting word vectors are of low quality. The invention provides a Chinese word vector generation method based on similar context and reinforcement learning that solves these problems: it adaptively selects the similar context of a target word and provides a Chinese word vector reinforcement learning framework that interacts with the corpus and obtains feedback, automatically learns the relations between words in the corpus, searches for similar context, reduces the effective corpus size, predicts the target word from its similar context, and thereby generates Chinese word vectors. This avoids semantically unrelated adjacent Chinese context, strengthens the learning framework, reduces training time, and improves the quality of the Chinese word vectors.
The invention is realized by the following technical scheme:
a Chinese word vector generation method based on similar context and reinforcement learning comprises the following steps:
selecting a corpus, and preprocessing the corpus to construct a Chinese corpus;
similar context discovery is carried out on the Chinese target words, and similar context related to the semantics of the Chinese target words is obtained;
and constructing a Chinese word vector reinforcement learning framework, and performing reinforcement learning to obtain word vector representation of the Chinese target word.
The method adaptively selects the similar context of the Chinese target word and provides a Chinese word vector reinforcement learning framework that interacts with the corpus and obtains feedback, automatically learns the relations between words in the corpus, searches for similar context, reduces the effective corpus size, predicts the target word from its similar context, and thereby generates Chinese word vectors. This avoids semantically unrelated adjacent Chinese context, strengthens the learning framework, reduces training time, and improves the quality of the Chinese word vectors. The method overcomes the problem of semantically unrelated adjacent Chinese words and generates high-quality Chinese word vectors.
Further, the corpus preprocessing comprises: converting the downloaded Internet text between traditional and simplified Chinese, removing garbled characters, English, and punctuation, and performing Chinese word segmentation.
Further, the similar context discovery for the Chinese target word includes similar-above discovery: searching the history vocabulary of the Chinese target word w_t for words similar to w_t, where the history vocabulary of w_t denotes the words near w_t and to its left, and t denotes the index of the Chinese target word. The specific steps are as follows:
① determining the window size c;
② calculating an adaptive similarity threshold T, equal to the average similarity between the Chinese target word w_t and all words in the range [w_{t-c}, w_{t+c}];
③ for each word w_i in the range [w_{t-c}, w_t): if its similarity with the Chinese target word w_t is greater than the adaptive similarity threshold T, the word is taken into the similar above and the similar-above word count n_1 is increased by 1;
④ if n_1 < c, the search range is expanded leftward by c words, i.e., the similar above is searched for in [w_{t-2c}, w_{t-c}]; the adaptive similarity threshold T is then updated to the average similarity between w_t and all words in [w_{t-2c}, w_{t+2c}], and any word in the range whose similarity with w_t exceeds T is taken into the similar above, increasing n_1 by 1;
⑤ the search continues leftward, adding c words and updating the adaptive threshold T each time, iterating until the number of similar-above words equals c.
Further, the similar context discovery for the Chinese target word also includes similar-below discovery: searching the future vocabulary of the Chinese target word w_t for words similar to w_t, where the future vocabulary of w_t denotes the words near w_t and to its right, and t denotes the index of the Chinese target word. The specific steps are as follows:
① determining the below-window size c, the same as the above-window size;
② calculating an adaptive similarity threshold T, equal to the average similarity between the Chinese target word w_t and all words in the range [w_{t-c}, w_{t+c}];
③ for each word w_j in the range (w_t, w_{t+c}]: if its similarity with the Chinese target word w_t is greater than the adaptive similarity threshold T, the word is taken into the similar below and the similar-below word count n_2 is increased by 1;
④ if n_2 < c, the search range is expanded rightward by c words, i.e., the similar below is searched for in [w_{t+c}, w_{t+2c}]; the adaptive similarity threshold T is then updated to the average similarity between w_t and all words in [w_{t-2c}, w_{t+2c}], and any word in the range whose similarity with w_t exceeds T is taken into the similar below, increasing n_2 by 1;
⑤ the search continues rightward, adding c words and updating the adaptive threshold T each time, iterating until the number of similar-below words equals c.
Further, word similarity measures the degree of semantic similarity between words, where the similarity s(w_i, w_j) is calculated as follows:

s(w_i, w_j) = (v(w_i) · v(w_j)) / (|v(w_i)| · |v(w_j)|)

where w_i and w_j are two words in the Chinese corpus; v(w_i) and v(w_j) denote the initial word vectors of w_i and w_j; s(w_i, w_j) denotes the similarity between the words; i and j denote word indices; and |·| denotes the norm (modulus) of a word vector.
Further, in constructing the Chinese word vector reinforcement learning framework and performing reinforcement learning: the corpus is defined as the Environment; the Chinese target word w_t together with its similar context SC_t is defined as a State; and a classifier with the two behaviors CBOW and SG is defined as the Agent. When the agent is in a state of the environment, it takes an Action; the environment then gives a Reward according to that action; the agent judges the quality of the current action through the reward, learns, and takes a better action in the next state. As the agent processes the different states, this process iterates until the set maximum number of iterations is reached, the reinforcement learning is complete, and the Chinese word vectors are finally generated.
Further, constructing the Chinese word vector reinforcement learning framework and performing reinforcement learning specifically comprises:
① initializing the agent π_θ with parameter θ;
② setting the learning rate η and the maximum number of iterations T_max, and inputting the Chinese corpus E;
③ letting the agent π_θ interact with the environment E and sampling N trajectories τ_s = {τ^1, …, τ^n, …, τ^N};
④ calculating the total return of each trajectory:

R(τ^n) = Σ_{t=1}^{|E|} r_t^n

where R(τ^n) denotes the total return of the n-th trajectory, τ^n denotes the n-th trajectory, r_t^n denotes the return after an action is taken in the t-th state of the n-th trajectory, t denotes the t-th state, and |E| denotes the total number of words in the Chinese corpus;
⑤ calculating the expected total return over the N trajectories:

R̄_θ = (1/N) Σ_{n=1}^{N} R(τ^n)

where R̄_θ denotes the expected total return of the N trajectories and N denotes the total number of trajectories;
⑥ calculating the gradient of the expected total return:

∇R̄_θ = (1/N) Σ_{n=1}^{N} Σ_{t=1}^{|E|} R(τ^n) ∇ log p_θ(a_t^n | s_t^n)

where ∇ log p_θ(a_t^n | s_t^n) denotes the gradient of the log-probability, under the agent π_θ with parameter θ, of the action a_t^n taken in the t-th state s_t^n of the n-th trajectory; the action a_t^n may be a_CBOW (the CBOW behavior) or a_SG (the SG behavior); the state s_t^n consists of the Chinese target word w_t and its similar context SC_t, where w_i denotes a word in the similar context and c denotes the similar-context window size;
⑦ updating the parameter θ:

θ ← θ + η ∇R̄_θ

where θ denotes the parameter of the agent π_θ, ∇R̄_θ denotes the gradient of the expected total return of the N trajectories, and η denotes the learning rate;
⑧ increasing the iteration count by 1; if the maximum number of iterations is reached, stopping and outputting the Chinese word vectors; otherwise returning to step ③ to continue the iterative training.
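The sampling-and-update steps above follow the standard REINFORCE policy gradient. As a minimal, hedged sketch (not the patent's full CBOW/SG agent), the toy function below applies one such update to a single-parameter sigmoid policy over the two behaviors, encoding a_CBOW as 0 and a_SG as 1; the trajectory encoding and reward values are illustrative assumptions:

```python
import math

def reinforce_step(theta, trajectories, eta=0.1):
    """One REINFORCE update: theta <- theta + eta * grad(R_bar).
    Each trajectory is a list of (action, reward) pairs with action in {0, 1}.
    The policy is P(action = 1) = sigmoid(theta), for which
    grad_theta log p(a) = a - sigmoid(theta)."""
    sig = 1.0 / (1.0 + math.exp(-theta))
    grad = 0.0
    for tau in trajectories:
        R = sum(r for _, r in tau)       # total return R of this trajectory
        for a, _ in tau:                 # accumulate R * grad log p(a)
            grad += R * (a - sig)
    grad /= len(trajectories)            # average over the N trajectories
    return theta + eta * grad            # gradient-ascent parameter update
```

Trajectories where action 1 earns a higher return push θ up, making action 1 more likely on the next sample, which is the feedback loop the framework relies on.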
Further, the CBOW behavior predicts the Chinese target word from its similar context, while the SG behavior predicts the similar context from the Chinese target word; both the CBOW behavior and the SG behavior are three-layer neural networks.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. the Chinese word vector generation method based on similar context and reinforcement learning computes similar context from word-to-word similarity, replacing the traditional adjacent context; this avoids semantically unrelated adjacent Chinese words and increases the semantic relevance of the context;
2. the method uses reinforcement learning to generate Chinese word vectors and can generate vectors of excellent quality on corpora of various sizes;
3. a trained reinforcement learning agent can be applied directly to a new corpus to generate Chinese word vectors, reducing training time;
4. the method has a wide application range: given any corpus, subsequent Chinese word vector generation can be carried out; the invention is applicable to the technical field of natural language processing.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. In the drawings:
fig. 1 is a general flowchart of a chinese word vector generation method based on similar context and reinforcement learning according to the present invention.
FIG. 2 is a schematic diagram of the similar context discovery of the present invention.
FIG. 3 is a flowchart of the similar-above discovery of the present invention.
FIG. 4 is a flowchart of the similar-below discovery of the present invention.
FIG. 5 is a model diagram of reinforcement learning according to the present invention.
FIG. 6 is a flowchart illustrating reinforcement learning according to the present invention.
FIG. 7 is a diagram of the experimental results of the word analogy task of the present invention at different corpus sizes.
FIG. 8 is a diagram of the experimental results of the word similarity task of the present invention at different corpus sizes.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to examples and accompanying drawings, and the exemplary embodiments and descriptions thereof are only used for explaining the present invention and are not meant to limit the present invention.
Examples
As shown in fig. 1 to 8, the present invention relates to a method for generating chinese word vector based on similar context and reinforcement learning, the method comprising:
selecting a corpus, and preprocessing the corpus to construct a Chinese corpus;
similar context discovery is carried out on the Chinese target words, and similar context related to the semantics of the Chinese target words is obtained;
and constructing a Chinese word vector reinforcement learning framework, and performing reinforcement learning to obtain word vector representation of the Chinese target word.
The general flow of the invention is shown in fig. 1, and the specific implementation steps are as follows:
1.1 corpus preprocessing: converting the downloaded Internet text between traditional and simplified Chinese with the OpenCC toolkit, removing garbled characters, English, and punctuation with regular expressions, and performing Chinese word segmentation with the jieba segmenter, thereby preprocessing the text;
1.2 storing the final result to construct the Chinese corpus.
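As a minimal sketch of the cleaning stage of step 1.1, the function below keeps only Chinese characters using the standard library's `re` module. It is deliberately dependency-free: in a full pipeline the OpenCC traditional-to-simplified conversion would run before this step and jieba word segmentation after it, as named above.

```python
import re

def clean_text(text: str) -> str:
    """Drop mojibake, English letters, and punctuation, keeping only
    Chinese characters (the CJK Unified Ideographs block U+4E00-U+9FFF).
    OpenCC conversion and jieba segmentation are omitted from this sketch."""
    return "".join(re.findall(r"[\u4e00-\u9fff]+", text))
```

For example, `clean_text("Hello, 世界! 你好abc。")` keeps only the four Chinese characters.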
2.1 word similarity calculation
All words are first assigned an initial word vector, and the similarity s(w_i, w_j) between words is calculated to indicate the degree of semantic similarity:

s(w_i, w_j) = (v(w_i) · v(w_j)) / (|v(w_i)| · |v(w_j)|)

where w_i and w_j are two words in the Chinese corpus; v(w_i) and v(w_j) denote the initial word vectors of w_i and w_j; s(w_i, w_j) denotes the similarity between the words; i and j denote word indices; and |·| denotes the norm (modulus) of a word vector.
2.2 similar-above discovery
In the history vocabulary of the Chinese target word w_t, words similar to w_t are searched for, where the history vocabulary of w_t denotes the words near w_t and to its left, and t denotes the index of the Chinese target word; the schematic diagram is shown in FIG. 2 and the flowchart in FIG. 3. The specific steps are as follows:
① determining the window size c, i.e., how many semantically similar words are to be found for each word (the below window has the same size as the above window, namely c);
② calculating an adaptive similarity threshold T, equal to the average similarity between the Chinese target word w_t and all words in the range [w_{t-c}, w_{t+c}];
③ for each word w_i in the range [w_{t-c}, w_t): if its similarity with the Chinese target word w_t is greater than the adaptive similarity threshold T, the word is taken into the similar above and the similar-above word count n_1 is increased by 1;
④ if n_1 < c, the search range is expanded leftward by c words, i.e., the similar above is searched for in [w_{t-2c}, w_{t-c}]; the adaptive similarity threshold T is then updated to the average similarity between w_t and all words in [w_{t-2c}, w_{t+2c}], and any word in the range whose similarity with w_t exceeds T is taken into the similar above, increasing n_1 by 1;
⑤ the search continues leftward, adding c words and updating the adaptive threshold T each time, iterating until the number of similar-above words equals c (if the corpus boundary is reached before c words are found, the iteration stops and the current number of similar-above words is used).
2.3 similar-below discovery
In the future vocabulary of the Chinese target word w_t, words similar to w_t are searched for, where the future vocabulary of w_t denotes the words near w_t and to its right, and t denotes the index of the Chinese target word; the schematic diagram is shown in FIG. 2 and the flowchart in FIG. 4. The specific steps are as follows:
① determining the below-window size c, i.e., how many semantically similar words are to be found below for each word; the below window has the same size as the above window;
② calculating an adaptive similarity threshold T, equal to the average similarity between the Chinese target word w_t and all words in the range [w_{t-c}, w_{t+c}];
③ for each word w_j in the range (w_t, w_{t+c}]: if its similarity with the Chinese target word w_t is greater than the adaptive similarity threshold T, the word is taken into the similar below and the similar-below word count n_2 is increased by 1;
④ if n_2 < c, the search range is expanded rightward by c words, i.e., the similar below is searched for in [w_{t+c}, w_{t+2c}]; the adaptive similarity threshold T is then updated to the average similarity between w_t and all words in [w_{t-2c}, w_{t+2c}], and any word in the range whose similarity with w_t exceeds T is taken into the similar below, increasing n_2 by 1;
⑤ the search continues rightward, adding c words and updating the adaptive threshold T each time, iterating until the number of similar-below words equals c (if the corpus boundary is reached before c words are found, the iteration stops and the current number of similar-below words is used).
3.1, basic definition
For Chinese word vector generation, the invention provides a reinforcement learning framework, as shown in FIG. 5: the corpus is defined as the Environment; the Chinese target word w_t together with its similar context SC_t is defined as a State; and a classifier with the two behaviors CBOW and SG is defined as the Agent. The reinforcement learning process is as follows: when the agent is in a state of the environment, it takes an Action; the environment then gives a Reward according to that action; the agent judges the quality of the current action through the reward, learns, and takes a better action in the next state, as shown in FIG. 5. As the agent processes the different states, this process iterates until the set maximum number of iterations is reached, the reinforcement learning is complete, and the Chinese word vectors are finally generated. The specific definitions are as follows:
Environment (Environment E): the environment of the Chinese word vector generation method is the given, preprocessed Chinese corpus.
State (State S): each state S_t is defined as the combination of a Chinese target word w_t and its similar context SC_t. The environment contains many states; each word and its similar context constitute one state.
Action (Action A): two possible actions the agent may take in different states are defined, CBOW and SG, i.e., A = {a_CBOW, a_SG}. The CBOW behavior predicts the target word from the similar context of the Chinese target word, and the SG behavior predicts the similar context from the Chinese target word. Specifically:
a) CBOW behaviour
Known Chinese target word w t Similar context of SC t Predicting the Chinese target word w on the premise of t Such as Action1 portion of FIG. 5; the CBOW is a three-layer neural network, and the corresponding layers are specifically as follows:
an input layer: for inputting Chinese target words w t Contains similar context SC t Word vectors of 2c words;
projection layer: 2c vectors of the input layer are accumulated and summed, and the output is
In the formula (I), the compound is shown in the specification,representing CBOW projection layer with Chinese target word w t Is output under the premise of input;andrepresenting a Chinese target word w t And the word w t+i The word vector of (2);
Output layer: the probability p(w_t | SC_t) of accurately predicting the Chinese target word from its similar context, which equals the prediction weight of the Chinese target word w_t among all words in the corpus, computed with the softmax function:

p(w_t | SC_t) = exp(u(w_t) · x_{w_t}) / Σ_{w_j ∈ E} exp(u(w_j) · x_{w_t})

where p(w_t | SC_t) denotes the probability that CBOW accurately predicts the Chinese target word w_t from its context SC_t; w_j denotes the j-th word in the corpus; E denotes the corpus; u(w_t) and u(w_j) denote the output word vectors of the corresponding words; and t and j denote word subscripts;
The objective function of CBOW is the following maximum likelihood:

ζ_CBOW = (1/|E|) Σ_{t=1}^{|E|} log p(w_t | SC_t)

where ζ_CBOW is the objective function; p(w_t | SC_t) denotes the probability that CBOW accurately predicts the Chinese target word w_t from its context SC_t; E denotes the corpus; |E| denotes the total number of words in the corpus; and t denotes a word subscript;
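As an illustration only (not part of the claimed method), the CBOW behavior above can be sketched numerically; the vocabulary size, embedding dimension and random vectors below are assumptions, not values from the patent:

```python
import numpy as np

# Toy stand-ins for the input word vectors v(w) and output word vectors u(w).
rng = np.random.default_rng(0)
V, d = 10, 8                                # assumed vocabulary size and dimension
v_in = rng.normal(scale=0.1, size=(V, d))   # v(w): input word vectors
v_out = rng.normal(scale=0.1, size=(V, d))  # u(w): output word vectors

def cbow_prob(context_ids, target_id):
    """p(w_t | SC_t): sum the context vectors (projection layer), then
    softmax against the output vectors of all words (output layer)."""
    x = v_in[context_ids].sum(axis=0)       # projection: sum of the 2c context vectors
    scores = v_out @ x                      # one score per word in the vocabulary
    e = np.exp(scores - scores.max())       # numerically stable softmax
    return float(e[target_id] / e.sum())

p = cbow_prob([1, 2, 4, 5], 3)              # SC_t = {w1, w2, w4, w5}, target w_t = w3
```

Summing `cbow_prob` over all candidate target words yields 1, since the output layer is a softmax over the whole vocabulary.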
b) SG behavior
On the premise that the Chinese target word w_t is known, predict its similar context SC_t: each Chinese target word is used as input, and its similar context is predicted, as in the Action2 portion of FIG. 5. SG is also a three-layer neural network whose layers are as follows:
Projection layer: holds the word vector of the current word; in fact the projection layer in SG has no practical effect and exists only to keep the structure consistent with CBOW; its output is

x_{w_t} = v(w_t)
Output layer: the probability p(w_{t+i} | w_t) of accurately predicting the context SC_t from the Chinese target word w_t, which equals the prediction weight of each context word among all words in the corpus, computed with the softmax function:

p(w_{t+i} | w_t) = exp(u(w_{t+i}) · v(w_t)) / Σ_{w_j ∈ E} exp(u(w_j) · v(w_t))

where p(w_{t+i} | w_t) denotes the probability that SG accurately predicts each word of the context from the Chinese target word w_t;
The objective function of SG is the following maximum likelihood:

ζ_SG = (1/|E|) Σ_{t=1}^{|E|} Σ_{-c ≤ i ≤ c, i ≠ 0} log p(w_{t+i} | w_t)

where ζ_SG is the objective function; |E| denotes the total number of words in the corpus; t and i denote word subscripts; and c is half the context size.
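As a companion illustration (again with assumed sizes and random vectors), the SG behavior's contribution of one target word to ζ_SG can be sketched as:

```python
import numpy as np

# Toy stand-ins for the input and output word vectors (assumed values).
rng = np.random.default_rng(1)
V, d = 10, 8
v_in = rng.normal(scale=0.1, size=(V, d))   # v(w)
v_out = rng.normal(scale=0.1, size=(V, d))  # u(w)

def sg_log_likelihood(target_id, context_ids):
    """Sum over the similar context of log p(w_{t+i} | w_t).
    The projection layer just passes v(w_t) through."""
    scores = v_out @ v_in[target_id]
    log_probs = scores - scores.max()
    log_probs = log_probs - np.log(np.exp(log_probs).sum())  # log-softmax
    return float(sum(log_probs[i] for i in context_ids))

ll = sg_log_likelihood(3, [1, 2, 4, 5])     # log-likelihood of SC_t given w_t = w3
```

Because each factor is a probability strictly below 1, the log-likelihood is always negative; training maximizes it toward 0.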
Reward R: the reward r_t is the feedback of the environment on an action, used to evaluate whether the action taken by the agent was good or bad. The reward under the two behaviors is defined as:

r_t^n = log p_θ(w_t | SC_t),　if a_t^n = a_cbow

r_t^n = Σ_{-c ≤ i ≤ c, i ≠ 0} log p_θ(w_{t+i} | w_t),　if a_t^n = a_sg

where log p_θ(·) denotes the log-probability assigned by the agent π_θ under parameter θ; w_t denotes the Chinese target word; SC_t denotes the similar context of the Chinese target word; a_t^n denotes the action taken in the t-th state of the nth trajectory, which may be the CBOW behavior a_cbow or the SG behavior a_sg; w_i denotes a word in the similar context; and c denotes the similar context window size.
Agent π_θ: the agent is a mapping π: S → A. It can be viewed as a classifier with parameter θ whose input is a state and whose output is the action taken.
Episode: one episode is a trajectory of reinforcement learning from an initial state to a final state, including the action taken and the reward obtained in each state; the nth episode is defined as

τ_n = {s_1^n, a_1^n, r_1^n, ..., s_t^n, a_t^n, r_t^n, ..., s_|E|^n, a_|E|^n, r_|E|^n}
3.2 Reinforcement learning process
In the Chinese word vector reinforcement learning process, the agent continuously interacts with the environment: in each state of the environment the agent takes an action, the environment returns a reward for that action, and from the reward the agent judges whether the action was good and learns what action to take in the next state. This iterates until the set maximum number of iterations is reached, as shown in FIG. 5; the specific steps are as follows, and the flowchart is shown in FIG. 6.
In the embodiment of the invention, Sogou news from 2008 is selected as the corpus. In corpus preprocessing, traditional characters are converted to simplified characters with the OpenCC toolkit, mojibake, English and punctuation are removed with regular expressions, and Chinese word segmentation is performed with jieba, finally yielding a Chinese corpus containing 300 million Chinese words with a dictionary of about 420,000 entries.
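The cleaning stage of this preprocessing pipeline can be sketched with a regular expression that keeps only CJK characters; the OpenCC traditional-to-simplified conversion and jieba segmentation used in the embodiment are omitted here to keep the sketch dependency-free (they would slot in before and after this step, respectively):

```python
import re

def clean(text):
    """Drop mojibake, English letters, digits and punctuation, keeping CJK
    characters and whitespace (the CJK Unified Ideographs block is an
    assumption about what 'Chinese text' means here)."""
    text = re.sub(r"[^\u4e00-\u9fff\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

cleaned = clean("Hello, 世界! 123 词向量…")   # → "世界 词向量"
```

In the full pipeline, `jieba.lcut(cleaned)` would then split the cleaned text into the word sequence from which the corpus dictionary is built.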
In the embodiment of the present invention, similar context discovery is performed on the constructed corpus; the similar context discovery model is shown in FIG. 2, and its flow in FIG. 3 and FIG. 4. For reinforcement learning, the learning rate is set to 0.01 and the maximum number of iterations to 4; the reinforcement learning model is shown in FIG. 5 and its flowchart in FIG. 6. We compared 7 baseline methods on the analogy task, the similarity tasks (WS-240, WS-296), text classification and named entity recognition, running experiments on 100% of the corpus; the results are shown in Table 1. Scalability experiments were then performed on 25%, 50% and 75% of the corpus, with results shown in FIG. 7 and FIG. 8. (The method of the invention is named sc2vec.)
TABLE 1 results of the experiment
The experimental results show that the proposed sc2vec method achieves the best results on the analogy, similarity, text classification and named entity recognition tasks, and obtains the best results on corpora of different sizes in the scalability experiments. The method performs well under different conditions, indicating that similar contexts overcome the semantic irrelevance of adjacent contexts and that reinforcement learning strengthens the learning architecture; the invention is therefore a Chinese word vector generation model with better performance that captures semantic information more accurately. Each experimental result demonstrates the feasibility of the Chinese word vector generation method based on similar context and reinforcement learning.
The method adaptively selects the similar context of each Chinese target word and provides a reinforcement learning framework for generating Chinese word vectors. By interacting with the corpus and obtaining feedback, it automatically learns the relations between words in the corpus, finds similar contexts, reduces the effective corpus size, and predicts target words from their similar contexts to generate Chinese word vectors. This avoids the semantic irrelevance of adjacent Chinese contexts, strengthens the learning architecture, reduces training time, and improves the quality of the Chinese word vectors. The method addresses the problem of semantically irrelevant adjacent Chinese words and generates high-quality Chinese word vectors.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (4)
1. A method for generating Chinese word vectors based on similar contexts and reinforcement learning is characterized by comprising the following steps:
selecting a corpus, and preprocessing the corpus to construct a Chinese corpus;
similar context discovery is carried out on the Chinese target words, and similar context related to the semantics of the Chinese target words is obtained;
constructing a Chinese word vector reinforcement learning framework, and performing reinforcement learning to obtain word vector representation of a Chinese target word;
the similar context discovery for the Chinese target word includes similar-above discovery: for the Chinese target word w_t, searching its history vocabulary for words similar to w_t, where the history vocabulary of w_t denotes the words near w_t and to its left, and t denotes the subscript of the Chinese target word, with the following specific steps:
determining the size c of the window;
calculating the adaptive similarity threshold T, equal to the average similarity between the Chinese target word w_t and all words in the range [w_{t-c}, w_{t+c}];
③ for a word w_i in the range [w_{t-c}, w_t), if its similarity to the Chinese target word w_t is greater than the adaptive similarity threshold T, the word is taken as similar-above text, and the number n_1 of words in the similar-above text is increased by 1;
④ if n_1 < c, the search range is expanded to the left by c words, i.e. similar-above text is searched in the range [w_{t-2c}, w_{t-c}]; the adaptive similarity threshold T is then updated to equal the average similarity between the Chinese target word w_t and all words in the range [w_{t-2c}, w_{t+2c}]; if the similarity of a word in the range to the Chinese target word w_t is greater than the adaptive similarity threshold T, the word is taken as similar-above text, and n_1 is increased by 1;
⑤ the search continues to the left, adding c words and updating the adaptive threshold T each time, iterating until the number of similar-above words equals c;
the similar context discovery for the Chinese target word further includes similar-below discovery: for the Chinese target word w_t, searching its future vocabulary for words similar to w_t, where the future vocabulary of w_t denotes the words near w_t and to its right, and t denotes the subscript of the Chinese target word, with the following specific steps:
① determining the size c of the below window, which is the same as the above window size;
② calculating the adaptive similarity threshold T, equal to the average similarity between the Chinese target word w_t and all words in the range [w_{t-c}, w_{t+c}];
③ for a word w_j in the range (w_t, w_{t+c}], if its similarity to the Chinese target word w_t is greater than the adaptive similarity threshold T, the word is taken as similar-below text, and the number n_2 of words in the similar-below text is increased by 1;
④ if n_2 < c, the search range is expanded to the right by c words, i.e. similar-below text is searched in the range [w_{t+c}, w_{t+2c}]; the adaptive similarity threshold T is then updated to equal the average similarity between the Chinese target word w_t and all words in the range [w_{t-2c}, w_{t+2c}]; if the similarity of a word in the range to the Chinese target word w_t is greater than the adaptive similarity threshold T, the word is taken as similar-below text, and n_2 is increased by 1;
⑤ the search continues to the right, adding c words and updating the adaptive threshold T each time, iterating until the number of similar-below words equals c;
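The adaptive search above can be sketched as follows, for purely illustrative purposes. Cosine similarity is an assumption (claim 3 leaves the similarity s(w_i, w_j) to its own formula), and for simplicity each widening rescans the whole side with the new threshold rather than only the newly added c words:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def similar_side(vecs, t, c, step):
    """Collect up to c words similar to position t, to the left (step=-1,
    similar above) or right (step=+1, similar below). Each round widens the
    search by c words and recomputes the adaptive threshold T as the mean
    similarity between w_t and all words in the symmetric range [t-kc, t+kc]."""
    n, k, found = len(vecs), 1, []
    while len(found) < c:
        lo, hi = max(0, t - k * c), min(n - 1, t + k * c)
        others = [i for i in range(lo, hi + 1) if i != t]
        T = np.mean([cosine(vecs[t], vecs[i]) for i in others])  # adaptive threshold
        scan = range(t - 1, lo - 1, -1) if step < 0 else range(t + 1, hi + 1)
        found = [i for i in scan if cosine(vecs[t], vecs[i]) > T][:c]
        at_edge = (lo == 0) if step < 0 else (hi == n - 1)
        if at_edge:
            break   # sentence boundary reached before c similar words were found
        k += 1
    return found

# Toy sentence of 6 word vectors: positions 1 and 3 point the same way as w_t = w2.
vecs = np.array([[0, 1], [1, 0], [1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
above = similar_side(vecs, t=2, c=2, step=-1)   # similar above → [1]
below = similar_side(vecs, t=2, c=2, step=+1)   # similar below → [3]
```

The union of the two sides forms SC_t; near sentence boundaries it may contain fewer than 2c words, as the claim's iteration terminates when the text runs out.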
the constructing of the Chinese word vector reinforcement learning framework and performing of reinforcement learning comprises: defining the corpus as the Environment, defining a Chinese target word w_t together with its similar context SC_t as a State, and defining a classifier with both CBOW and SG behaviors as the Agent; when the agent is in a state of the environment, the agent takes an Action; the environment then gives a Reward according to the action; the agent judges from the reward whether the current action was good or bad, learns, and takes a better action in the next state; as the agent processes different states, this process iterates until the set maximum number of iterations is reached, reinforcement learning is completed, and the Chinese word vectors are finally generated;
the method for constructing the Chinese word vector reinforcement learning framework and performing reinforcement learning specifically comprises the following steps:
A1: initializing the agent π_θ with parameter θ;
A2: setting the learning rate η and the maximum number of iterations T_max, and inputting the Chinese corpus E as the environment;
A3: letting the agent π_θ interact with the environment E and sampling N trajectory episodes τ_s;
A4: from the N episodes τ_s, calculating the total return of each trajectory;
A5: calculating the expectation of the total return over the N trajectories from the per-trajectory totals;
A6: calculating the gradient of the expected total return of the N trajectories;
A7: updating the parameter θ according to this gradient;
A8: increasing the iteration count by 1; stopping and outputting the Chinese word vectors if the maximum number of iterations is reached, and otherwise returning to A2 to continue iterative training;
the CBOW behavior predicts the target word from the similar context of the Chinese target word, and the SG behavior predicts the similar context of the Chinese target word from the Chinese target word itself; both the CBOW behavior and the SG behavior are three-layer neural networks.
2. The method of claim 1, wherein the corpus preprocessing comprises: performing traditional-to-simplified conversion on the downloaded Internet text, removing mojibake, English and punctuation, and performing Chinese word segmentation.
3. The method for generating Chinese word vectors based on similar context and reinforcement learning according to claim 1, wherein the word similarity expresses the degree of semantic similarity between words, and the similarity s(w_i, w_j) is calculated as follows:
4. The method for generating Chinese word vectors based on similar context and reinforcement learning according to claim 1, wherein constructing the Chinese word vector reinforcement learning framework and performing reinforcement learning specifically comprises the following steps:
① initializing the agent π_θ with parameter θ;
② setting the learning rate η and the maximum number of iterations T_max, and inputting the Chinese corpus E;
③ letting the agent π_θ interact with the environment E and sampling N trajectory episodes τ_s = {τ_1, ..., τ_n, ..., τ_N};
④ calculating the total return of each trajectory, with the calculation formula:

R(τ_n) = Σ_{t=1}^{|E|} r_t^n

where R(τ_n) denotes the total return of the nth trajectory, τ_n denotes the nth trajectory, r_t^n denotes the reward obtained after the action taken in the t-th state of the nth trajectory, t denotes the t-th state, and |E| denotes the total number of words in the Chinese corpus;
⑤ calculating the expectation of the total return over the N trajectories:

R̄_θ = (1/N) Σ_{n=1}^{N} R(τ_n) = (1/N) Σ_{n=1}^{N} Σ_{t=1}^{|E|} r_t^n

where R̄_θ denotes the expectation of the total return over the N trajectories, N denotes the total number of trajectories, R(τ_n) denotes the total return of the nth trajectory, τ_n denotes the nth trajectory, r_t^n denotes the reward obtained after the action taken in the t-th state of the nth trajectory, t denotes the t-th state, and |E| denotes the total number of words in the Chinese corpus;
⑥ calculating the gradient of the expected total return of the N trajectories:

∇R̄_θ = (1/N) Σ_{n=1}^{N} R(τ_n) Σ_{t=1}^{|E|} ∇log π_θ(a_t^n | s_t^n)

where ∇R̄_θ denotes the gradient of the expected total return of the N trajectories; N denotes the total number of trajectories; R(τ_n) denotes the total return of the nth trajectory; τ_n denotes the nth trajectory; t denotes the t-th state; |E| denotes the total number of words in the Chinese corpus; ∇log π_θ(a_t^n | s_t^n) denotes the gradient of the log-probability that the agent π_θ, under parameter θ, takes action a_t^n in state s_t^n of the nth trajectory; a_t^n denotes the action taken in the t-th state of the nth trajectory, which is the CBOW behavior a_cbow or the SG behavior a_sg; s_t^n denotes the t-th state of the nth trajectory, consisting of the Chinese target word w_t and its similar context SC_t; w_i denotes a word in the similar context; and c denotes the similar context window size;
⑦ updating the parameter θ:

θ ← θ + η · ∇R̄_θ

where θ denotes the parameter of the agent π_θ, ∇R̄_θ denotes the gradient of the expected total return of the N trajectories, and η denotes the learning rate;
⑧ increasing the iteration count by 1; stopping and outputting the Chinese word vectors if the maximum number of iterations is reached, and otherwise returning to step ② to continue iterative training.
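Steps ④ through ⑦ amount to one policy-gradient (REINFORCE-style) update, which can be sketched numerically as follows; the rewards and log-probability gradients below are toy stand-ins (in the method they come from the CBOW/SG log-probabilities and the agent's classifier):

```python
import numpy as np

def reinforce_update(theta, rewards, grad_logps, eta=0.01):
    """One update of steps 4-7:
    rewards[n][t]    = r_t^n,
    grad_logps[n][t] = grad of log pi_theta(a_t^n | s_t^n) w.r.t. theta.
    Computes R(tau_n) = sum_t r_t^n, the gradient estimate
    (1/N) sum_n R(tau_n) * sum_t grad log pi_theta(a_t^n | s_t^n),
    then the ascent step theta <- theta + eta * grad."""
    N = len(rewards)
    grad = sum(np.sum(r) * np.sum(g, axis=0)
               for r, g in zip(rewards, grad_logps)) / N
    return theta + eta * grad

# Two toy trajectories of two states each, over a 3-dimensional theta.
theta = np.zeros(3)
rewards = [np.array([-1.0, -0.5]), np.array([-2.0, -0.1])]   # log-prob rewards are negative
grad_logps = [np.ones((2, 3)), -np.ones((2, 3))]
theta_new = reinforce_update(theta, rewards, grad_logps, eta=0.1)
```

Wrapping this update in a loop over iterations, with fresh trajectory sampling each round (step ③) and a stop at T_max, reproduces the outer loop of steps ① through ⑧.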
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911301344.5A CN111026848B (en) | 2019-12-17 | 2019-12-17 | Chinese word vector generation method based on similar context and reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111026848A CN111026848A (en) | 2020-04-17 |
CN111026848B true CN111026848B (en) | 2022-08-02 |
Family
ID=70209462
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911301344.5A Expired - Fee Related CN111026848B (en) | 2019-12-17 | 2019-12-17 | Chinese word vector generation method based on similar context and reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111026848B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111291165B (en) * | 2020-05-09 | 2020-08-14 | 支付宝(杭州)信息技术有限公司 | Method and device for embedding training word vector into model |
CN112883169B (en) * | 2021-04-29 | 2021-07-16 | 南京视察者智能科技有限公司 | Contradiction evolution analysis method and device based on big data |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090061399A1 (en) * | 2007-08-30 | 2009-03-05 | Digital Directions International, Inc. | Educational software with embedded sheltered instruction |
EP3454260A1 (en) * | 2017-09-11 | 2019-03-13 | Tata Consultancy Services Limited | Bilstm-siamese network based classifier for identifying target class of queries and providing responses thereof |
CN108763504B (en) * | 2018-05-30 | 2020-07-24 | 浙江大学 | Dialog reply generation method and system based on reinforced double-channel sequence learning |
CN109597876B (en) * | 2018-11-07 | 2023-04-11 | 中山大学 | Multi-round dialogue reply selection model based on reinforcement learning and method thereof |
CN109918162B (en) * | 2019-02-28 | 2021-11-02 | 集智学园(北京)科技有限公司 | High-dimensional graph interactive display method for learnable mass information |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20220802 |