CN111026848B - Chinese word vector generation method based on similar context and reinforcement learning

Chinese word vector generation method based on similar context and reinforcement learning

Info

Publication number
CN111026848B
Authority
CN
China
Prior art keywords: chinese, word, words, similar, representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201911301344.5A
Other languages
Chinese (zh)
Other versions
CN111026848A (en)
Inventors
杨尚明
张云
刘勇国
李巧勤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN201911301344.5A
Publication of CN111026848A
Application granted
Publication of CN111026848B
Status: Expired - Fee Related
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3344: Query execution using natural language analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a Chinese word vector generation method based on similar context and reinforcement learning. It addresses the shortcomings of existing Chinese word vector generation methods, which predict a target word only from its relation to the adjacent context, ignore the fact that some adjacent Chinese words are semantically unrelated, and therefore produce word vectors of low representational quality. The method comprises the following steps: selecting a corpus and preprocessing it to construct a Chinese corpus; performing similar context discovery on the Chinese target words to obtain similar contexts semantically related to the Chinese target words; and constructing a Chinese word vector reinforcement learning framework and performing reinforcement learning to obtain the word vector representation of each Chinese target word. The method overcomes the problem that adjacent Chinese words may be semantically unrelated and generates high-quality Chinese word vectors.

Description

Chinese word vector generation method based on similar context and reinforcement learning
Technical Field
The invention relates to the technical field of natural language processing, in particular to a Chinese word vector generation method based on similar context and reinforcement learning.
Background
Natural language processing is an important direction in the fields of computer science and artificial intelligence. Current natural language processing tasks include machine translation, sentiment analysis, text summarization, text classification, information extraction, and more. The first question any such task must address is how to make natural language representable by a computer. A computer cannot process natural language directly, so a method must be designed to represent natural language mathematically; this representation is the word vector. A word vector is a real-valued vector that represents a natural-language word together with its semantics: words are mapped into a vector space and represented as vectors. Generally speaking, the higher the quality of a word vector, the richer and more accurate the semantic information it carries, the more easily a computer can resolve the semantics of natural language, and the better the results of downstream natural language processing tasks. How to generate high-quality word vectors is therefore a foundation of, and an important research topic in, natural language processing.
Current research on word vectors follows two main directions. The first is general-purpose word vector methods: these apply to many languages, such as Chinese, English, and Japanese, and can represent words of any of them as word vectors. They fall into two categories: methods that represent a word as a point vector in a vector space, and methods that represent a word as a Gaussian distribution. The second is language-specific word vector methods: these apply only to a particular language and exploit its fine-grained features, such as radicals, strokes, and pinyin for Chinese, or letters and suffixes for English. Chinese patent CN107273355A, "A Chinese word vector generation method based on word combination training", proposes a Chinese word vector method that treats the character information inside words as an important feature, combines context words with characters, and jointly trains Chinese word vector representations: building on a word-based word vector model, it introduces the character composition of words so that, while the target word is predicted from the context words, it is also predicted from character information. Chinese patent CN109815476A, "A word vector representation method based on joint statistics of Chinese morphemes and pinyin", proposes a Chinese word vector generation method that uses the morpheme and pinyin features of Chinese characters: based on the joint morpheme and pinyin features of the context words, a three-layer neural network is trained to predict the central target word, and word vectors are then generated.
These existing methods lay a foundation for word vector research, in particular for Chinese word vectors, but have the following defects. First, general-purpose word vector methods apply to most languages but ignore language-specific features; they generalize well but are less accurate, and cannot substantially improve the precision of downstream natural language processing tasks. Second, existing Chinese word vector methods simply add Chinese features without improving the underlying neural network architecture, so their accuracy cannot be raised further. Moreover, all of these methods predict the target word from its relation to the adjacent context and do not consider that some adjacent Chinese words are semantically unrelated. In short, existing word vector generation methods predict words from the features of their adjacent contexts using only simple neural network architectures, without improving the networks, and therefore cannot obtain high-quality Chinese word vectors.
Disclosure of Invention
The technical problem to be solved by the invention is that existing Chinese word vector generation methods predict a target word from its relation to the adjacent context, fail to account for Chinese words that are adjacent but semantically unrelated, and produce word vectors of low representational quality. The invention provides a Chinese word vector generation method based on similar context and reinforcement learning that solves these problems: it adaptively selects the similar context of a target word and provides a Chinese word vector reinforcement learning framework that interacts with the corpus and obtains feedback, automatically learns the relations between words in the corpus, searches for similar contexts, reduces the effective corpus size, predicts the target word from its similar context, and thereby generates Chinese word vectors. This avoids the semantic irrelevance of adjacent Chinese contexts, strengthens the learning architecture, shortens training time, and improves the quality of the Chinese word vectors.
The invention is realized by the following technical scheme:
a Chinese word vector generation method based on similar context and reinforcement learning comprises the following steps:
selecting a corpus, and preprocessing the corpus to construct a Chinese corpus;
similar context discovery is carried out on the Chinese target words, and similar context related to the semantics of the Chinese target words is obtained;
and constructing a Chinese word vector reinforcement learning framework, and performing reinforcement learning to obtain word vector representation of the Chinese target word.
The method adaptively selects the similar context of the Chinese target word and provides a Chinese word vector reinforcement learning framework that interacts with the corpus and obtains feedback, automatically learns the relations between words in the corpus, searches for similar contexts, reduces the effective corpus size, predicts the target word from its similar context, and thereby generates Chinese word vectors. This avoids the semantic irrelevance of adjacent Chinese contexts, strengthens the learning architecture, shortens training time, and improves the quality of the Chinese word vectors. The method overcomes the problem of semantically unrelated adjacent Chinese words and generates high-quality Chinese word vectors.
Further, the corpus preprocessing comprises: converting the downloaded Internet text between traditional and simplified Chinese, removing garbled characters, English text, and punctuation, and performing Chinese word segmentation.
Further, performing similar context discovery on the Chinese target word includes similar-above discovery: searching the history vocabulary of the Chinese target word w_t for words similar to w_t, where the history vocabulary of w_t denotes the words near w_t and to its left, and t denotes the subscript of the Chinese target word. The specific steps are as follows:
① Determine the window size c.
② Compute the adaptive similarity threshold T, equal to the average similarity between the Chinese target word w_t and all words in the range [w_{t-c}, w_{t+c}].
③ For each word w_i in the range [w_{t-c}, w_t), if its similarity with the Chinese target word w_t is greater than the adaptive similarity threshold T, take the word as similar-above and increase the similar-above word count n_1 by 1.
④ If n_1 < c, expand the search range leftwards by c words, i.e., search for similar-above words in [w_{t-2c}, w_{t-c}]; the adaptive similarity threshold T is updated to the average similarity between w_t and all words in [w_{t-2c}, w_{t+2c}]; if a word in the range has similarity with w_t greater than T, take it as similar-above and increase n_1 by 1.
⑤ Keep searching leftwards, adding c words and updating the adaptive threshold T each time, and iterate until the number of similar-above words equals c.
Further, performing similar context discovery on the Chinese target word also includes similar-below discovery: searching the future vocabulary of the Chinese target word w_t for words similar to w_t, where the future vocabulary of w_t denotes the words near w_t and to its right, and t denotes the subscript of the Chinese target word. The specific steps are as follows:
① Determine the below-window size c, which equals the above-window size.
② Compute the adaptive similarity threshold T, equal to the average similarity between the Chinese target word w_t and all words in the range [w_{t-c}, w_{t+c}].
③ For each word w_j in the range (w_t, w_{t+c}], if its similarity with the Chinese target word w_t is greater than the adaptive similarity threshold T, take the word as similar-below and increase the similar-below word count n_2 by 1.
④ If n_2 < c, expand the search range rightwards by c words, i.e., search for similar-below words in [w_{t+c}, w_{t+2c}]; the adaptive similarity threshold T is updated to the average similarity between w_t and all words in [w_{t-2c}, w_{t+2c}]; if a word in the range has similarity with w_t greater than T, take it as similar-below and increase n_2 by 1.
⑤ Keep searching rightwards, adding c words and updating the adaptive threshold T each time, and iterate to find similar-below words until their number equals c.
Further, word similarity expresses the degree of semantic similarity between words. The similarity s(w_i, w_j) is computed as:

$$ s(w_i, w_j) = \frac{\vec{w}_i \cdot \vec{w}_j}{\lvert \vec{w}_i \rvert \, \lvert \vec{w}_j \rvert} $$

where w_i and w_j denote two words in the Chinese corpus; \vec{w}_i and \vec{w}_j denote their initial word vectors; s(w_i, w_j) denotes the similarity between the two words; i and j denote the subscripts of the words; and |·| denotes the modulus (length) of a word vector.
Further, in constructing the Chinese word vector reinforcement learning framework and performing reinforcement learning: the corpus is defined as the environment (Environment); the combination of a Chinese target word w_t and its similar context SC_t is defined as a state (State); and a classifier with the two behaviors CBOW and SG is defined as the agent (Agent). When the agent is in a state of the environment, it takes an action (Action); the environment then gives a reward (Reward) for that action; the agent judges the quality of the current action from the reward, learns, and takes a better action in the next state. As the agent processes the different states, this process iterates until the set maximum number of iterations is reached; the reinforcement learning is then complete and the Chinese word vectors are finally generated.
Further, constructing the Chinese word vector reinforcement learning framework and performing reinforcement learning specifically comprises the following steps:
① Initialize the agent π_θ with parameter θ.
② Set the learning rate η and the maximum number of iterations T_max, and input the Chinese corpus E.
③ Let the agent π_θ interact with the environment E and sample N trajectories τ_s = {τ_1, …, τ_n, …, τ_N}.
④ Compute the total return of each trajectory:

$$ R(\tau_n) = \sum_{t=1}^{\lvert E \rvert} r_t^n $$

where R(τ_n) denotes the total return of the n-th trajectory, τ_n denotes the n-th trajectory, r_t^n denotes the return obtained after the t-th state of the n-th trajectory takes an action, t denotes the t-th state, and |E| denotes the total number of words in the Chinese corpus.
⑤ Compute the expected total return over the N trajectories:

$$ \bar{R}_\theta = \frac{1}{N} \sum_{n=1}^{N} R(\tau_n) = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{\lvert E \rvert} r_t^n $$

where \bar{R}_θ denotes the expected total return of the N trajectories and N denotes the total number of trajectories; the remaining symbols are as defined in step ④.
⑥ Compute the gradient of the expected total return of the N trajectories:

$$ \nabla \bar{R}_\theta = \frac{1}{N} \sum_{n=1}^{N} R(\tau_n) \, \nabla \log p_\theta(\tau_n) = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{\lvert E \rvert} R(\tau_n) \, \nabla \log p_\theta(a_t^n \mid s_t^n) $$

where ∇\bar{R}_θ denotes the gradient of the expected total return of the N trajectories; N denotes the total number of trajectories; R(τ_n) denotes the total return of the n-th trajectory; τ_n denotes the n-th trajectory; t denotes the t-th state; |E| denotes the total number of words in the Chinese corpus; ∇ log p_θ(τ_n) denotes the gradient of the log-probability of the n-th trajectory under the agent π_θ with parameter θ; a_t^n denotes the action taken in the t-th state of the n-th trajectory, which is either a_CBOW (the CBOW behavior) or a_SG (the SG behavior); and s_t^n = (w_t, SC_t) denotes the t-th state of the n-th trajectory, where w_t denotes the Chinese target word, SC_t its similar context, w_i a word in the similar context, and c the similar-context window size.
⑦ Update the parameter θ:

$$ \theta \leftarrow \theta + \eta \, \nabla \bar{R}_\theta $$

where θ denotes the parameter of the agent π_θ, ∇\bar{R}_θ denotes the gradient of the expected total return of the N trajectories, and η denotes the learning rate.
⑧ Increase the iteration count by 1. If the maximum number of iterations has been reached, stop and output the Chinese word vectors; otherwise return to step ② and continue the iterative training.
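For concreteness, the following is a minimal sketch of the policy-gradient loop in steps ① to ⑧, under several assumptions not fixed by the patent: the agent is modeled as a logistic classifier over the two actions with one feature vector per state, a trajectory visits every state of the corpus once, and `reward_fn` stands for the per-state returns (the CBOW/SG log-likelihoods defined later); updating the word vectors inside the CBOW and SG behaviors themselves is left to the models behind `reward_fn`. All names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def act_probs(theta, feat):
    """Agent pi_theta: logistic classifier over the two actions {CBOW, SG}."""
    p_cbow = 1.0 / (1.0 + np.exp(-feat @ theta))
    return np.array([p_cbow, 1.0 - p_cbow])

def train(states, feats, reward_fn, theta, eta=0.01, T_max=4, N=8):
    """REINFORCE loop of steps 1-8; states are (w_t, SC_t) pairs, feats their features."""
    for _ in range(T_max):                          # step 8: iterate to the maximum
        grad = np.zeros_like(theta)
        for _ in range(N):                          # step 3: sample N trajectories
            R, glogp = 0.0, np.zeros_like(theta)
            for s, f in zip(states, feats):         # one trajectory visits every state
                p = act_probs(theta, f)
                a = rng.choice(2, p=p)              # a=0: CBOW behavior, a=1: SG behavior
                R += reward_fn(s, a)                # step 4: accumulate the returns r_t^n
                sign = 1.0 if a == 0 else -1.0
                glogp += sign * (1.0 - p[a]) * f    # grad of log p_theta(a|s) for logistic
            grad += R * glogp                       # step 6: R(tau_n) * grad log p(tau_n)
        theta += eta * grad / N                     # steps 5-7: average over N and ascend
    return theta
```

The defaults eta=0.01 and T_max=4 mirror the settings used in the embodiment below; the trajectory sampling and gradient form follow the REINFORCE estimator of step ⑥.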
Further, the CBOW behavior predicts the Chinese target word from its similar context, and the SG behavior predicts the similar context from the Chinese target word; both the CBOW and SG behaviors are three-layer neural networks.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. the Chinese word vector generation method based on similar context and reinforcement learning computes similar contexts from inter-word similarity, replacing the traditional adjacent context; this avoids adjacent but unrelated Chinese words and increases the semantic relevance of the context;
2. the method uses reinforcement learning to generate Chinese word vectors and can produce word vectors of excellent quality on corpora of various sizes;
3. a trained reinforcement learning agent can be applied directly to a new corpus to generate Chinese word vectors, reducing training time;
4. the method has a wide application range: given any corpus, Chinese word vector generation can be carried out. The invention is suitable for the technical field of natural language processing.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. In the drawings:
FIG. 1 is a general flowchart of the Chinese word vector generation method based on similar context and reinforcement learning according to the present invention.
FIG. 2 is a schematic diagram of similar context discovery in the present invention.
FIG. 3 is a flowchart of similar-above discovery in the present invention.
FIG. 4 is a flowchart of similar-below discovery in the present invention.
FIG. 5 is a model diagram of reinforcement learning according to the present invention.
FIG. 6 is a flowchart illustrating reinforcement learning according to the present invention.
FIG. 7 shows the results of the scalability experiments on the analogy task with different corpus sizes.
FIG. 8 shows the results of the scalability experiments on the similarity task with different corpus sizes.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to examples and accompanying drawings, and the exemplary embodiments and descriptions thereof are only used for explaining the present invention and are not meant to limit the present invention.
Examples
As shown in FIGS. 1 to 8, the present invention relates to a method for generating Chinese word vectors based on similar context and reinforcement learning, the method comprising:
selecting a corpus, and preprocessing the corpus to construct a Chinese corpus;
similar context discovery is carried out on the Chinese target words, and similar context related to the semantics of the Chinese target words is obtained;
and constructing a Chinese word vector reinforcement learning framework, and performing reinforcement learning to obtain word vector representation of the Chinese target word.
The general flow of the invention is shown in fig. 1, and the specific implementation steps are as follows:
step 1, corpus construction: selecting a corpus, and preprocessing the corpus to construct a Chinese corpus;
1.1 Corpus preprocessing: the downloaded Internet text is converted between traditional and simplified Chinese with the opencc toolkit, garbled characters, English text, and punctuation are removed with regular expressions, and Chinese word segmentation is performed with the jieba segmenter;
1.2 The final result is stored to construct the Chinese corpus.
Step 2, finding similar context: similar context discovery is carried out on the Chinese target words, and similar context related to the semantics of the Chinese target words is obtained;
2.1 Word similarity calculation
All words are assigned an initial word vector, and the similarity s(w_i, w_j) between words is computed to express the degree of semantic similarity:

$$ s(w_i, w_j) = \frac{\vec{w}_i \cdot \vec{w}_j}{\lvert \vec{w}_i \rvert \, \lvert \vec{w}_j \rvert} $$

where w_i and w_j denote two words in the Chinese corpus; \vec{w}_i and \vec{w}_j denote their initial word vectors; s(w_i, w_j) denotes the similarity between the two words; i and j denote the subscripts of the words; and |·| denotes the modulus (length) of a word vector.
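As a minimal illustration, the similarity above can be computed directly as the cosine of the angle between the initial word vectors; the sketch below assumes the vectors are stored in a Python dictionary, and the names `init_vecs` and `similarity` are illustrative rather than taken from the patent:

```python
import numpy as np

def similarity(w_i: str, w_j: str, init_vecs: dict) -> float:
    """Similarity s(w_i, w_j): cosine between the initial word vectors."""
    v_i, v_j = init_vecs[w_i], init_vecs[w_j]
    # dot product divided by the product of the vector moduli
    return float(np.dot(v_i, v_j) / (np.linalg.norm(v_i) * np.linalg.norm(v_j)))
```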
2.2 Similar-above discovery for words
Search the history vocabulary of the Chinese target word w_t for words similar to w_t, where the history vocabulary of w_t denotes the words near w_t and to its left, and t denotes the subscript of the Chinese target word; the schematic diagram is shown in FIG. 2 and the flowchart in FIG. 3. The specific steps are as follows (a code sketch of the whole discovery procedure follows section 2.3):
① Determine the window size c, i.e., how many semantically similar words to find for each word (the below window has the same size c as the above window).
② Compute the adaptive similarity threshold T, equal to the average similarity between the Chinese target word w_t and all words in the range [w_{t-c}, w_{t+c}].
③ For each word w_i in the range [w_{t-c}, w_t), if its similarity with the Chinese target word w_t is greater than the adaptive similarity threshold T, take the word as similar-above and increase the similar-above word count n_1 by 1.
④ If n_1 < c, expand the search range leftwards by c words, i.e., search for similar-above words in [w_{t-2c}, w_{t-c}]; the adaptive similarity threshold T is updated to the average similarity between w_t and all words in [w_{t-2c}, w_{t+2c}]; if a word in the range has similarity with w_t greater than T, take it as similar-above and increase n_1 by 1.
⑤ Keep searching leftwards, adding c words and updating the adaptive threshold T each time, and iterate until the number of similar-above words equals c (if the corpus boundary is reached before c words are found, stop iterating and use the current number of similar-above words).
2.3 Similar-below discovery for words
Search the future vocabulary of the Chinese target word w_t for words similar to w_t, where the future vocabulary of w_t denotes the words near w_t and to its right, and t denotes the subscript of the Chinese target word; the schematic diagram is shown in FIG. 2 and the flowchart in FIG. 4. The specific steps are as follows:
① Determine the below-window size c, i.e., how many semantically similar words to find in the text following each word; the below window has the same size as the above window.
② Compute the adaptive similarity threshold T, equal to the average similarity between the Chinese target word w_t and all words in the range [w_{t-c}, w_{t+c}].
③ For each word w_j in the range (w_t, w_{t+c}], if its similarity with the Chinese target word w_t is greater than the adaptive similarity threshold T, take the word as similar-below and increase the similar-below word count n_2 by 1.
④ If n_2 < c, expand the search range rightwards by c words, i.e., search for similar-below words in [w_{t+c}, w_{t+2c}]; the adaptive similarity threshold T is updated to the average similarity between w_t and all words in [w_{t-2c}, w_{t+2c}]; if a word in the range has similarity with w_t greater than T, take it as similar-below and increase n_2 by 1.
⑤ Keep searching rightwards, adding c words and updating the adaptive threshold T each time, and iterate to find similar-below words until their number equals c (if the corpus boundary is reached before c words are found, stop iterating and use the current number of similar-below words).
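As referenced in section 2.2, the following sketch implements the adaptive discovery of steps ① to ⑤ for both directions (`step = -1` for similar-above, `step = +1` for similar-below). It assumes the corpus is given as a tokenized list of words and reuses the illustrative `similarity` helper from section 2.1; all other names are likewise illustrative:

```python
import numpy as np

def adaptive_threshold(corpus, t, radius, init_vecs):
    """T: average similarity between w_t and all words in [w_{t-radius}, w_{t+radius}]."""
    lo, hi = max(0, t - radius), min(len(corpus) - 1, t + radius)
    sims = [similarity(corpus[t], corpus[k], init_vecs)
            for k in range(lo, hi + 1) if k != t]   # exclude w_t itself
    return float(np.mean(sims)) if sims else 0.0

def similar_context(corpus, t, c, init_vecs, step):
    """Collect up to c similar-above (step=-1) or similar-below (step=+1) words of w_t."""
    found, radius = [], c
    # nearest block first: [w_{t-c}, w_t) on the left, (w_t, w_{t+c}] on the right
    block = range(t - 1, t - c - 1, -1) if step < 0 else range(t + 1, t + c + 1)
    while True:
        T = adaptive_threshold(corpus, t, radius, init_vecs)  # re-averaged after expansion
        for k in block:
            if 0 <= k < len(corpus) and similarity(corpus[t], corpus[k], init_vecs) > T:
                found.append(corpus[k])
                if len(found) == c:
                    return found
        # stop at the corpus boundary even if fewer than c similar words were found
        if (step < 0 and t - radius <= 0) or (step > 0 and t + radius >= len(corpus) - 1):
            return found
        # expand the search range by another c words; T is updated on the next pass
        block = (range(t - radius - 1, t - radius - c - 1, -1) if step < 0
                 else range(t + radius + 1, t + radius + c + 1))
        radius += c
```

The similar context SC_t of w_t is then the union of the two directions, e.g. `similar_context(corpus, t, c, vecs, -1) + similar_context(corpus, t, c, vecs, +1)`.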
Step 3, reinforcement learning: and constructing a Chinese word vector reinforcement learning framework, and performing reinforcement learning to obtain word vector representation of the Chinese target word.
3.1 Basic definitions
For the generation of Chinese word vectors the invention provides a reinforcement learning framework, as shown in FIG. 5: the corpus is defined as the environment (Environment), the combination of a Chinese target word w_t and its similar context SC_t is defined as a state (State), and a classifier with the two behaviors CBOW and SG is defined as the agent (Agent). The reinforcement learning process is as follows: when the agent is in a state of the environment, it takes an action (Action); the environment then gives a reward (Reward) for that action; the agent judges the quality of the current action from the reward, learns, and takes a better action in the next state, as shown in FIG. 5. As the agent processes the different states, this process iterates until the set maximum number of iterations is reached; the reinforcement learning is then complete and the Chinese word vectors are finally generated. The specific definitions are as follows:
Environment (Environment E): the environment of the Chinese word vector generation method is the given, preprocessed Chinese corpus.
State S: each state S_t is defined as the combination of a Chinese target word w_t and its similar context SC_t. The environment contains many states: each word together with its similar context constitutes one state.
Action A: two possible actions that the agent may take in different states are defined, CBOW and SG, i.e., A = {a_CBOW, a_SG}. The CBOW behavior predicts the target word from the similar context of the Chinese target word, and the SG behavior predicts the similar context from the Chinese target word, as follows:
a) CBOW behavior
The Chinese target word w_t is predicted given its similar context SC_t, as in the Action 1 part of FIG. 5. CBOW is a three-layer neural network whose layers are as follows:
Input layer: inputs the word vectors of the 2c words contained in the similar context SC_t of the Chinese target word w_t.
Projection layer: accumulates and sums the 2c vectors of the input layer; its output is

$$ x_{w_t} = \sum_{-c \le i \le c,\, i \neq 0} \vec{w}_{t+i} $$

where x_{w_t} denotes the output of the CBOW projection layer given the similar context of the Chinese target word w_t as input, and \vec{w}_t and \vec{w}_{t+i} denote the word vectors of the Chinese target word w_t and of the word w_{t+i}.
Output layer: the probability p(w_t | SC_t) of correctly predicting the Chinese target word from its similar context, equal to the prediction weight of the Chinese target word w_t among all words of the corpus, computed with a softmax function:

$$ p(w_t \mid SC_t) = \frac{\exp\left( \vec{w}^{\,\prime}_t \cdot x_{w_t} \right)}{\sum_{w_j \in E} \exp\left( \vec{w}^{\,\prime}_j \cdot x_{w_t} \right)} $$

where p(w_t | SC_t) denotes the probability that CBOW correctly predicts the Chinese target word w_t from the context SC_t; w_j denotes the j-th word in the corpus; E denotes the corpus; \vec{w}'_t and \vec{w}'_j denote the output word vectors of the corresponding words; and t and j denote the subscripts of the words.
The objective function of CBOW is the following maximum log-likelihood:

$$ \zeta_{CBOW} = \sum_{t=1}^{\lvert E \rvert} \log p(w_t \mid SC_t) $$

where ζ_CBOW is the objective function; p(w_t | SC_t) denotes the probability that CBOW correctly predicts the Chinese target word w_t from the context SC_t; E denotes the corpus; |E| denotes the total number of words in the corpus; and t denotes the subscript of the word.
b) SG behavior
The similar context SC_t is predicted given the Chinese target word w_t: each Chinese target word is used as input to predict its similar context, as in the Action 2 part of FIG. 5. SG is likewise a three-layer neural network whose layers are as follows:
Input layer: inputs the initial word vector \vec{w}_t of w_t.
Projection layer: holds the word vector of the current word. In fact the projection layer in SG has no practical effect and exists only to keep the structure consistent with CBOW; its output is \vec{w}_t.
Output layer: the probability p(w_{t+i} | w_t) of correctly predicting the context SC_t from the Chinese target word w_t, equal to the prediction weight of each context word among all words of the corpus, computed with a softmax function:

$$ p(w_{t+i} \mid w_t) = \frac{\exp\left( \vec{w}^{\,\prime}_{t+i} \cdot \vec{w}_t \right)}{\sum_{w_j \in E} \exp\left( \vec{w}^{\,\prime}_j \cdot \vec{w}_t \right)} $$

where p(w_{t+i} | w_t) denotes the probability that SG correctly predicts each context word from the Chinese target word w_t.
The objective function of SG is the following maximum log-likelihood:

$$ \zeta_{SG} = \sum_{t=1}^{\lvert E \rvert} \sum_{-c \le i \le c,\, i \neq 0} \log p(w_{t+i} \mid w_t) $$

where ζ_SG is the objective function; |E| denotes the total number of words in the corpus; t and i denote the subscripts of the words; and c is half the number of context words.
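A minimal sketch of the two behaviors follows, assuming input word vectors `V` and output word vectors `U` stored as NumPy matrices indexed by word id. The plain full-softmax formulation mirrors the formulas above; the names are illustrative, and the patent does not specify speed-ups such as negative sampling or hierarchical softmax:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cbow_prob(target_id, context_ids, V, U):
    """p(w_t | SC_t): predict the target word from its similar context."""
    x = V[context_ids].sum(axis=0)       # projection layer: sum of the 2c context vectors
    return softmax(U @ x)[target_id]     # output layer: softmax over the whole corpus

def sg_log_prob(target_id, context_ids, V, U):
    """sum_i log p(w_i | w_t): predict each similar-context word from the target."""
    probs = softmax(U @ V[target_id])    # projection layer passes the input vector through
    return float(np.sum(np.log(probs[context_ids])))
```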
Return (Reward R): the return r_t is the environment's feedback on an action and is used to evaluate whether the action taken by the agent is good or bad. The return under the two behaviors is defined as:

$$ r_t^n = \begin{cases} \log p_\theta(w_t \mid SC_t), & a_t^n = a_{CBOW} \\ \sum_{w_i \in SC_t} \log p_\theta(w_i \mid w_t), & a_t^n = a_{SG} \end{cases} $$

where log p_θ(·) denotes a log-probability under the agent π_θ with parameter θ; w_t denotes the Chinese target word and SC_t its similar context; a_t^n denotes the action taken in the t-th state of the n-th trajectory, which is either a_CBOW (the CBOW behavior) or a_SG (the SG behavior); w_i denotes a word in the similar context; and c denotes the similar-above window size.
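In code, the two-case return above reduces to the log-likelihood of whichever behavior was taken, reusing the illustrative `cbow_prob` and `sg_log_prob` helpers sketched earlier in this section:

```python
import numpy as np

def reward(action, target_id, context_ids, V, U):
    """r_t: log-likelihood of the chosen behavior on the current state (w_t, SC_t)."""
    if action == "cbow":
        # log p(w_t | SC_t) under the CBOW behavior
        return float(np.log(cbow_prob(target_id, context_ids, V, U)))
    # sum of log p(w_i | w_t) over the similar context under the SG behavior
    return sg_log_prob(target_id, context_ids, V, U)
```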
Agent (Agent π_θ): the agent is a mapping function π: S → A, which can be viewed as a classifier with parameter θ whose input is a state and whose output is the action to take.
Segment (trajectory τ): a segment is one trajectory of the reinforcement learning from the initial state to the final state, recording the action taken and the return obtained in each state. The n-th segment is defined as

$$ \tau_n = \{ s_1^n, a_1^n, r_1^n, \ldots, s_{\lvert E \rvert}^n, a_{\lvert E \rvert}^n, r_{\lvert E \rvert}^n \} $$
3.2 Reinforcement learning process
In the Chinese word vector reinforcement learning process, the agent continuously interacts with the environment and visits its different states; in each state the agent takes an action, the environment gives a return according to that action, the agent uses the return to judge the action it took, and it learns which action is better to take in the next state. This iterates until the set maximum number of iterations is reached, as shown in FIG. 5; the specific steps are those given above, and the flowchart is shown in FIG. 6.
In the embodiment of the invention, Sogou news from 2008 is selected as the corpus. Corpus preprocessing converts between traditional and simplified Chinese with the opencc toolkit, removes garbled characters, English text, and punctuation with regular expressions, and performs Chinese word segmentation with the jieba segmenter; the resulting Chinese corpus contains 300 million Chinese words with a dictionary size of about 420,000.
In the embodiment of the invention, similar context discovery is performed on the constructed corpus; the similar context discovery model is shown in FIG. 2 and the flows in FIG. 3 and FIG. 4. For reinforcement learning, the learning rate is set to 0.01 and the maximum number of iterations to 4; the reinforcement learning model diagram is shown in FIG. 5 and the flowchart in FIG. 6. The proposed method, named sc2vec, is compared with 7 baseline methods on the analogy task, the similarity tasks (WS-240, WS-296), text classification, and named entity recognition, with experiments on 100% of the corpus; the results are shown in Table 1. Scalability experiments are then performed on 25%, 50%, and 75% of the corpus, with the results shown in FIG. 7 and FIG. 8.
TABLE 1 Experimental results
The experimental results show that the proposed sc2vec method achieves the best results on the analogy, similarity, text classification, and named entity recognition tasks, and the scalability experiments show that it also achieves the best results on corpora of different sizes. The method performs well under varied conditions, indicating that similar contexts overcome the semantic irrelevance of adjacent contexts and that reinforcement learning strengthens the learning architecture; the invention is thus a Chinese word vector generation model with better performance that captures semantic information more accurately. Each experimental result demonstrates the feasibility of the Chinese word vector generation method based on similar context and reinforcement learning.
The method adaptively selects the similar context of the Chinese target word and provides a Chinese word vector reinforcement learning framework that interacts with the corpus and obtains feedback, automatically learns the relations between words in the corpus, searches for similar contexts, reduces the effective corpus size, predicts the target word from its similar context, and thereby generates Chinese word vectors. This avoids the semantic irrelevance of adjacent Chinese contexts, strengthens the learning architecture, shortens training time, and improves the quality of the Chinese word vectors. The method overcomes the problem of semantically unrelated adjacent Chinese words and generates high-quality Chinese word vectors.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (4)

1. A Chinese word vector generation method based on similar context and reinforcement learning, characterized by comprising the following steps:
selecting a corpus and preprocessing it to construct a Chinese corpus;
performing similar context discovery on the Chinese target words to obtain similar contexts semantically related to the Chinese target words;
constructing a Chinese word vector reinforcement learning framework and performing reinforcement learning to obtain the word vector representation of each Chinese target word;
wherein performing similar context discovery on the Chinese target word includes similar-above discovery: searching the history vocabulary of the Chinese target word w_t for words similar to w_t, where the history vocabulary of w_t denotes the words near w_t and to its left, and t denotes the subscript of the Chinese target word, with the following specific steps:
① determine the window size c;
② compute the adaptive similarity threshold T, equal to the average similarity between the Chinese target word w_t and all words in the range [w_{t-c}, w_{t+c}];
③ for each word w_i in the range [w_{t-c}, w_t), if its similarity with the Chinese target word w_t is greater than the adaptive similarity threshold T, take the word as similar-above and increase the similar-above word count n_1 by 1;
④ if n_1 < c, expand the search range leftwards by c words, i.e., search for similar-above words in [w_{t-2c}, w_{t-c}], with the adaptive similarity threshold T updated to the average similarity between w_t and all words in [w_{t-2c}, w_{t+2c}]; if a word in the range has similarity with w_t greater than T, take it as similar-above and increase n_1 by 1;
⑤ keep searching leftwards, adding c words and updating the adaptive threshold T each time, and iterate until the number of similar-above words equals c;
wherein performing similar context discovery on the Chinese target word further includes similar-below discovery: searching the future vocabulary of the Chinese target word w_t for words similar to w_t, where the future vocabulary of w_t denotes the words near w_t and to its right, and t denotes the subscript of the Chinese target word, with the following specific steps:
① determine the below-window size c, which equals the above-window size;
② compute the adaptive similarity threshold T, equal to the average similarity between the Chinese target word w_t and all words in the range [w_{t-c}, w_{t+c}];
③ for each word w_j in the range (w_t, w_{t+c}], if its similarity with the Chinese target word w_t is greater than the adaptive similarity threshold T, take the word as similar-below and increase the similar-below word count n_2 by 1;
④ if n_2 < c, expand the search range rightwards by c words, i.e., search for similar-below words in [w_{t+c}, w_{t+2c}], with the adaptive similarity threshold T updated to the average similarity between w_t and all words in [w_{t-2c}, w_{t+2c}]; if a word in the range has similarity with w_t greater than T, take it as similar-below and increase n_2 by 1;
⑤ keep searching rightwards, adding c words and updating the adaptive threshold T each time, and iterate to find similar-below words until their number equals c;
wherein constructing the Chinese word vector reinforcement learning framework and performing reinforcement learning comprises: defining the corpus as the environment (Environment), the combination of a Chinese target word w_t and its similar context SC_t as a state (State), and a classifier with the two behaviors CBOW and SG as the agent (Agent); when the agent is in a state of the environment, the agent takes an action (Action), the environment gives a reward (Reward) for that action, and the agent judges the quality of the current action from the reward, learns, and takes a better action in the next state; as the agent processes the different states, this process iterates until the set maximum number of iterations is reached, the reinforcement learning is complete, and the Chinese word vectors are finally generated;
wherein constructing the Chinese word vector reinforcement learning framework and performing reinforcement learning specifically comprises the following steps:
A1: initialize the agent π_θ;
A2: set the learning rate η and the maximum number of iterations T_max, and input the Chinese corpus E as the environment E;
A3: according to the agent π_θ and the environment E, let the agent π_θ interact with the environment E and sample N trajectories τ_s;
A4: compute the total return of each trajectory from the N trajectories τ_s;
A5: compute the expected total return of the N trajectories from the per-trajectory total returns;
A6: compute the gradient of the expected total return of the N trajectories from the expected total return;
A7: update the parameter θ according to the gradient of the expected total return;
A8: increase the iteration count by 1; if the maximum number of iterations is reached, stop and output the Chinese word vectors; otherwise return to A2 and continue the iterative training;
wherein the CBOW behavior predicts the Chinese target word from its similar context and the SG behavior predicts the similar context from the Chinese target word, both the CBOW and SG behaviors being three-layer neural networks.
2. The Chinese word vector generation method based on similar context and reinforcement learning according to claim 1, wherein the corpus preprocessing comprises: converting the downloaded Internet text between traditional and simplified Chinese, removing garbled characters, English text, and punctuation, and performing Chinese word segmentation.
3. The Chinese word vector generation method based on similar context and reinforcement learning according to claim 1, characterized in that word similarity expresses the degree of semantic similarity between words, the similarity s(w_i, w_j) being computed as:

$$ s(w_i, w_j) = \frac{\vec{w}_i \cdot \vec{w}_j}{\lvert \vec{w}_i \rvert \, \lvert \vec{w}_j \rvert} $$

where w_i and w_j denote two words in the Chinese corpus; \vec{w}_i and \vec{w}_j denote their initial word vectors; s(w_i, w_j) denotes the similarity between the two words; i and j denote the subscripts of the words; and |·| denotes the modulus (length) of a word vector.
4. The Chinese word vector generation method based on similar context and reinforcement learning according to claim 1, wherein constructing the Chinese word vector reinforcement learning framework and performing reinforcement learning specifically comprises the following steps:
① initialize the agent π_θ with parameter θ;
② set the learning rate η and the maximum number of iterations T_max, and input the Chinese corpus E;
③ let the agent π_θ interact with the environment E and sample N trajectories τ_s = {τ_1, …, τ_n, …, τ_N};
④ compute the total return of each trajectory:

$$ R(\tau_n) = \sum_{t=1}^{\lvert E \rvert} r_t^n $$

where R(τ_n) denotes the total return of the n-th trajectory, τ_n denotes the n-th trajectory, r_t^n denotes the return obtained after the t-th state of the n-th trajectory takes an action, t denotes the t-th state, and |E| denotes the total number of words in the Chinese corpus;
⑤ compute the expected total return over the N trajectories:

$$ \bar{R}_\theta = \frac{1}{N} \sum_{n=1}^{N} R(\tau_n) = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{\lvert E \rvert} r_t^n $$

where \bar{R}_θ denotes the expected total return of the N trajectories and N denotes the total number of trajectories, the remaining symbols being as defined in step ④;
⑥ compute the gradient of the expected total return of the N trajectories:

$$ \nabla \bar{R}_\theta = \frac{1}{N} \sum_{n=1}^{N} R(\tau_n) \, \nabla \log p_\theta(\tau_n) = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{\lvert E \rvert} R(\tau_n) \, \nabla \log p_\theta(a_t^n \mid s_t^n) $$

where ∇\bar{R}_θ denotes the gradient of the expected total return of the N trajectories; N denotes the total number of trajectories; R(τ_n) denotes the total return of the n-th trajectory; τ_n denotes the n-th trajectory; t denotes the t-th state; |E| denotes the total number of words in the Chinese corpus; ∇ log p_θ(τ_n) denotes the gradient of the log-probability of the n-th trajectory under the agent π_θ with parameter θ; a_t^n denotes the action taken in the t-th state of the n-th trajectory, which is either the CBOW behavior a_CBOW or the SG behavior a_SG; and s_t^n = (w_t, SC_t) denotes the t-th state of the n-th trajectory, where w_t denotes the Chinese target word, SC_t its similar context, w_i a word in the similar context, and c the similar-context window size;
⑦ update the parameter θ:

$$ \theta \leftarrow \theta + \eta \, \nabla \bar{R}_\theta $$

where θ denotes the parameter of the agent π_θ, ∇\bar{R}_θ denotes the gradient of the expected total return of the N trajectories, and η denotes the learning rate;
⑧ increase the iteration count by 1; if the maximum number of iterations is reached, stop and output the Chinese word vectors; otherwise return to step ② and continue the iterative training.
CN201911301344.5A 2019-12-17 2019-12-17 Chinese word vector generation method based on similar context and reinforcement learning Expired - Fee Related CN111026848B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911301344.5A CN111026848B (en) 2019-12-17 2019-12-17 Chinese word vector generation method based on similar context and reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911301344.5A CN111026848B (en) 2019-12-17 2019-12-17 Chinese word vector generation method based on similar context and reinforcement learning

Publications (2)

Publication Number Publication Date
CN111026848A CN111026848A (en) 2020-04-17
CN111026848B true CN111026848B (en) 2022-08-02

Family

ID=70209462

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911301344.5A Expired - Fee Related CN111026848B (en) 2019-12-17 2019-12-17 Chinese word vector generation method based on similar context and reinforcement learning

Country Status (1)

Country Link
CN (1) CN111026848B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111291165B (en) * 2020-05-09 2020-08-14 支付宝(杭州)信息技术有限公司 Method and device for embedding training word vector into model
CN112883169B (en) * 2021-04-29 2021-07-16 南京视察者智能科技有限公司 Contradiction evolution analysis method and device based on big data

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090061399A1 (en) * 2007-08-30 2009-03-05 Digital Directions International, Inc. Educational software with embedded sheltered instruction
EP3454260A1 (en) * 2017-09-11 2019-03-13 Tata Consultancy Services Limited Bilstm-siamese network based classifier for identifying target class of queries and providing responses thereof
CN108763504B (en) * 2018-05-30 2020-07-24 浙江大学 Dialog reply generation method and system based on reinforced double-channel sequence learning
CN109597876B (en) * 2018-11-07 2023-04-11 中山大学 Multi-round dialogue reply selection model based on reinforcement learning and method thereof
CN109918162B (en) * 2019-02-28 2021-11-02 集智学园(北京)科技有限公司 High-dimensional graph interactive display method for learnable mass information

Also Published As

Publication number Publication date
CN111026848A (en) 2020-04-17

Similar Documents

Publication Publication Date Title
CN107924680B (en) Spoken language understanding system
Chang et al. Chinese named entity recognition method based on BERT
CN109472024B (en) Text classification method based on bidirectional circulation attention neural network
CN108984526B (en) Document theme vector extraction method based on deep learning
CN107358948B (en) Language input relevance detection method based on attention model
CN106502985B (en) neural network modeling method and device for generating titles
Deng et al. Use of kernel deep convex networks and end-to-end learning for spoken language understanding
CN110046248B (en) Model training method for text analysis, text classification method and device
CN109887484B (en) Dual learning-based voice recognition and voice synthesis method and device
CN113239700A (en) Text semantic matching device, system, method and storage medium for improving BERT
CN110619034A (en) Text keyword generation method based on Transformer model
CN112541356B (en) Method and system for recognizing biomedical named entities
CN110969020A (en) CNN and attention mechanism-based Chinese named entity identification method, system and medium
CN109299479A (en) Translation memory is incorporated to the method for neural machine translation by door control mechanism
CN110162789A (en) A kind of vocabulary sign method and device based on the Chinese phonetic alphabet
CN111354333A (en) Chinese prosody hierarchy prediction method and system based on self-attention
CN111026848B (en) Chinese word vector generation method based on similar context and reinforcement learning
Deng et al. Self-attention-based BiGRU and capsule network for named entity recognition
Mocialov et al. Transfer learning for british sign language modelling
Wu et al. An effective approach of named entity recognition for cyber threat intelligence
CN115831102A (en) Speech recognition method and device based on pre-training feature representation and electronic equipment
CN114417851A (en) Emotion analysis method based on keyword weighted information
CN113609284A (en) Method and device for automatically generating text abstract fused with multivariate semantics
Seilsepour et al. Self-supervised sentiment classification based on semantic similarity measures and contextual embedding using metaheuristic optimizer
US11941360B2 (en) Acronym definition network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
Granted publication date: 20220802