CN111026848B - Chinese word vector generation method based on similar context and reinforcement learning

Chinese word vector generation method based on similar context and reinforcement learning

Info

Publication number
CN111026848B
Authority
CN
China
Prior art keywords: chinese, word, words, similar, representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201911301344.5A
Other languages
Chinese (zh)
Other versions
CN111026848A (en)
Inventors
杨尚明
张云
刘勇国
李巧勤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN201911301344.5A
Publication of CN111026848A
Application granted
Publication of CN111026848B
Status: Expired - Fee Related
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3344: Query execution using natural language analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a Chinese word vector generation method based on similar context and reinforcement learning. It addresses the shortcomings of existing Chinese word vector generation methods, which predict a target word only from its relation to the adjacent context, ignore the fact that some adjacent Chinese words are semantically unrelated, and therefore produce word vectors of low representational quality. The method comprises the following steps: selecting a corpus and preprocessing it to construct a Chinese corpus; performing similar context discovery on the Chinese target words to obtain similar contexts semantically related to the Chinese target words; and constructing a Chinese word vector reinforcement learning framework and performing reinforcement learning to obtain the word vector representation of each Chinese target word. The method overcomes the problem that adjacent Chinese words may be semantically unrelated and generates high-quality Chinese word vectors.

Description

Chinese word vector generation method based on similar context and reinforcement learning
Technical Field
The invention relates to the technical field of natural language processing, in particular to a Chinese word vector generation method based on similar context and reinforcement learning.
Background
Natural language processing is an important direction in the fields of computer science and artificial intelligence. Current natural language processing tasks include machine translation, sentiment analysis, text summarization, text classification, information extraction, and more. The first question any such task must address is how to make natural language representable by a computer. A computer cannot process natural language directly, so a method must be designed to represent natural language mathematically; this representation is the word vector. A word vector is a real-valued vector that represents a natural-language word together with its semantics: words are mapped into a vector space and represented as vectors. Generally speaking, the higher the quality of a word vector, the richer and more accurate the semantic information it carries, the more easily a computer can resolve the semantics of natural language, and the better the results of downstream natural language processing tasks. How to generate high-quality word vectors is therefore a foundation of, and an important research topic in, natural language processing.
Current research on word vectors follows two main directions. The first is general-purpose word vector methods: these apply to many languages, such as Chinese, English, and Japanese, and can represent words of any of them as word vectors. They fall into two categories: methods that represent a word as a point vector in a vector space, and methods that represent a word as a Gaussian distribution. The second is language-specific word vector methods: these apply only to a particular language and exploit its fine-grained features, such as radicals, strokes, and pinyin for Chinese, or letters and suffixes for English. Chinese patent CN107273355A, "A Chinese word vector generation method based on word combination training", proposes a Chinese word vector method that treats the character information inside words as an important feature, combines context words with characters, and jointly trains Chinese word vector representations: building on a word-based word vector model, it introduces the character composition of words so that, while the target word is predicted from the context words, it is also predicted from character information. Chinese patent CN109815476A, "A word vector representation method based on joint statistics of Chinese morphemes and pinyin", proposes a Chinese word vector generation method that uses the morpheme and pinyin features of Chinese characters: based on the joint morpheme and pinyin features of the context words, a three-layer neural network is trained to predict the central target word, and word vectors are then generated.
These existing methods lay a foundation for word vector research, in particular for Chinese word vectors, but have the following defects. First, general-purpose word vector methods apply to most languages but ignore language-specific features; they generalize well but are less accurate, and cannot substantially improve the precision of downstream natural language processing tasks. Second, existing Chinese word vector methods simply add Chinese features without improving the underlying neural network architecture, so their accuracy cannot be raised further. Moreover, all of these methods predict the target word from its relation to the adjacent context and do not consider that some adjacent Chinese words are semantically unrelated. In short, existing word vector generation methods predict words from the features of their adjacent contexts using only simple neural network architectures, without improving the networks, and therefore cannot obtain high-quality Chinese word vectors.
Disclosure of Invention
The technical problem to be solved by the invention is that existing Chinese word vector generation methods predict a target word from its relation to the adjacent context, fail to account for Chinese words that are adjacent but semantically unrelated, and produce word vectors of low representational quality. The invention provides a Chinese word vector generation method based on similar context and reinforcement learning that solves these problems: it adaptively selects the similar context of a target word and provides a Chinese word vector reinforcement learning framework that interacts with the corpus and obtains feedback, automatically learns the relations between words in the corpus, searches for similar contexts, reduces the effective corpus size, predicts the target word from its similar context, and thereby generates Chinese word vectors. This avoids the semantic irrelevance of adjacent Chinese contexts, strengthens the learning architecture, shortens training time, and improves the quality of the Chinese word vectors.
The invention is realized by the following technical scheme:
a Chinese word vector generation method based on similar context and reinforcement learning comprises the following steps:
selecting a corpus, and preprocessing the corpus to construct a Chinese corpus;
similar context discovery is carried out on the Chinese target words, and similar context related to the semantics of the Chinese target words is obtained;
and constructing a Chinese word vector reinforcement learning framework, and performing reinforcement learning to obtain word vector representation of the Chinese target word.
The method adaptively selects the similar context of the Chinese target word and provides a Chinese word vector reinforcement learning framework that interacts with the corpus and obtains feedback, automatically learns the relations between words in the corpus, searches for similar contexts, reduces the effective corpus size, predicts the target word from its similar context, and thereby generates Chinese word vectors. This avoids the semantic irrelevance of adjacent Chinese contexts, strengthens the learning architecture, shortens training time, and improves the quality of the Chinese word vectors. The method overcomes the problem of semantically unrelated adjacent Chinese words and generates high-quality Chinese word vectors.
Further, the corpus preprocessing comprises: converting the downloaded Internet text between traditional and simplified Chinese, removing garbled characters, English text, and punctuation, and performing Chinese word segmentation.
Further, performing similar context discovery on the Chinese target word includes similar-above discovery: searching the history vocabulary of the Chinese target word w_t for words similar to w_t, where the history vocabulary of w_t denotes the words near w_t and to its left, and t denotes the subscript of the Chinese target word. The specific steps are as follows:
① Determine the window size c.
② Compute the adaptive similarity threshold T, equal to the average similarity between the Chinese target word w_t and all words in the range [w_{t-c}, w_{t+c}].
③ For each word w_i in the range [w_{t-c}, w_t), if its similarity with the Chinese target word w_t is greater than the adaptive similarity threshold T, take the word as similar-above and increase the similar-above word count n_1 by 1.
④ If n_1 < c, expand the search range leftwards by c words, i.e., search for similar-above words in [w_{t-2c}, w_{t-c}]; the adaptive similarity threshold T is updated to the average similarity between w_t and all words in [w_{t-2c}, w_{t+2c}]; if a word in the range has similarity with w_t greater than T, take it as similar-above and increase n_1 by 1.
⑤ Keep searching leftwards, adding c words and updating the adaptive threshold T each time, and iterate until the number of similar-above words equals c.
Further, performing similar context discovery on the Chinese target word also includes similar-below discovery: searching the future vocabulary of the Chinese target word w_t for words similar to w_t, where the future vocabulary of w_t denotes the words near w_t and to its right, and t denotes the subscript of the Chinese target word. The specific steps are as follows:
① Determine the below-window size c, which equals the above-window size.
② Compute the adaptive similarity threshold T, equal to the average similarity between the Chinese target word w_t and all words in the range [w_{t-c}, w_{t+c}].
③ For each word w_j in the range (w_t, w_{t+c}], if its similarity with the Chinese target word w_t is greater than the adaptive similarity threshold T, take the word as similar-below and increase the similar-below word count n_2 by 1.
④ If n_2 < c, expand the search range rightwards by c words, i.e., search for similar-below words in [w_{t+c}, w_{t+2c}]; the adaptive similarity threshold T is updated to the average similarity between w_t and all words in [w_{t-2c}, w_{t+2c}]; if a word in the range has similarity with w_t greater than T, take it as similar-below and increase n_2 by 1.
⑤ Keep searching rightwards, adding c words and updating the adaptive threshold T each time, and iterate to find similar-below words until their number equals c.
Further, word similarity expresses the degree of semantic similarity between words. The similarity s(w_i, w_j) is computed as:

$$ s(w_i, w_j) = \frac{\vec{w}_i \cdot \vec{w}_j}{\lvert \vec{w}_i \rvert \, \lvert \vec{w}_j \rvert} $$

where w_i and w_j denote two words in the Chinese corpus; \vec{w}_i and \vec{w}_j denote their initial word vectors; s(w_i, w_j) denotes the similarity between the two words; i and j denote the subscripts of the words; and |·| denotes the modulus (length) of a word vector.
Further, in constructing the Chinese word vector reinforcement learning framework and performing reinforcement learning: the corpus is defined as the environment (Environment); the combination of a Chinese target word w_t and its similar context SC_t is defined as a state (State); and a classifier with the two behaviors CBOW and SG is defined as the agent (Agent). When the agent is in a state of the environment, it takes an action (Action); the environment then gives a reward (Reward) for that action; the agent judges the quality of the current action from the reward, learns, and takes a better action in the next state. As the agent processes the different states, this process iterates until the set maximum number of iterations is reached; the reinforcement learning is then complete and the Chinese word vectors are finally generated.
Further, constructing the Chinese word vector reinforcement learning framework and performing reinforcement learning specifically comprises the following steps:
① Initialize the agent π_θ with parameter θ.
② Set the learning rate η and the maximum number of iterations T_max, and input the Chinese corpus E.
③ Let the agent π_θ interact with the environment E and sample N trajectories τ_s = {τ_1, …, τ_n, …, τ_N}.
④ Compute the total return of each trajectory:

$$ R(\tau_n) = \sum_{t=1}^{\lvert E \rvert} r_t^n $$

where R(τ_n) denotes the total return of the n-th trajectory, τ_n denotes the n-th trajectory, r_t^n denotes the return obtained after the t-th state of the n-th trajectory takes an action, t denotes the t-th state, and |E| denotes the total number of words in the Chinese corpus.
⑤ Compute the expected total return over the N trajectories:

$$ \bar{R}_\theta = \frac{1}{N} \sum_{n=1}^{N} R(\tau_n) = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{\lvert E \rvert} r_t^n $$

where \bar{R}_θ denotes the expected total return of the N trajectories and N denotes the total number of trajectories; the remaining symbols are as defined in step ④.
⑥ Compute the gradient of the expected total return of the N trajectories:

$$ \nabla \bar{R}_\theta = \frac{1}{N} \sum_{n=1}^{N} R(\tau_n) \, \nabla \log p_\theta(\tau_n) = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{\lvert E \rvert} R(\tau_n) \, \nabla \log p_\theta(a_t^n \mid s_t^n) $$

where ∇\bar{R}_θ denotes the gradient of the expected total return of the N trajectories; N denotes the total number of trajectories; R(τ_n) denotes the total return of the n-th trajectory; τ_n denotes the n-th trajectory; t denotes the t-th state; |E| denotes the total number of words in the Chinese corpus; ∇ log p_θ(τ_n) denotes the gradient of the log-probability of the n-th trajectory under the agent π_θ with parameter θ; a_t^n denotes the action taken in the t-th state of the n-th trajectory, which is either a_CBOW (the CBOW behavior) or a_SG (the SG behavior); and s_t^n = (w_t, SC_t) denotes the t-th state of the n-th trajectory, where w_t denotes the Chinese target word, SC_t its similar context, w_i a word in the similar context, and c the similar-context window size.
⑦ Update the parameter θ:

$$ \theta \leftarrow \theta + \eta \, \nabla \bar{R}_\theta $$

where θ denotes the parameter of the agent π_θ, ∇\bar{R}_θ denotes the gradient of the expected total return of the N trajectories, and η denotes the learning rate.
⑧ Increase the iteration count by 1. If the maximum number of iterations has been reached, stop and output the Chinese word vectors; otherwise return to step ② and continue the iterative training.
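For concreteness, the following is a minimal sketch of the policy-gradient loop in steps ① to ⑧, under several assumptions not fixed by the patent: the agent is modeled as a logistic classifier over the two actions with one feature vector per state, a trajectory visits every state of the corpus once, and `reward_fn` stands for the per-state returns (the CBOW/SG log-likelihoods defined later); updating the word vectors inside the CBOW and SG behaviors themselves is left to the models behind `reward_fn`. All names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def act_probs(theta, feat):
    """Agent pi_theta: logistic classifier over the two actions {CBOW, SG}."""
    p_cbow = 1.0 / (1.0 + np.exp(-feat @ theta))
    return np.array([p_cbow, 1.0 - p_cbow])

def train(states, feats, reward_fn, theta, eta=0.01, T_max=4, N=8):
    """REINFORCE loop of steps 1-8; states are (w_t, SC_t) pairs, feats their features."""
    for _ in range(T_max):                          # step 8: iterate to the maximum
        grad = np.zeros_like(theta)
        for _ in range(N):                          # step 3: sample N trajectories
            R, glogp = 0.0, np.zeros_like(theta)
            for s, f in zip(states, feats):         # one trajectory visits every state
                p = act_probs(theta, f)
                a = rng.choice(2, p=p)              # a=0: CBOW behavior, a=1: SG behavior
                R += reward_fn(s, a)                # step 4: accumulate the returns r_t^n
                sign = 1.0 if a == 0 else -1.0
                glogp += sign * (1.0 - p[a]) * f    # grad of log p_theta(a|s) for logistic
            grad += R * glogp                       # step 6: R(tau_n) * grad log p(tau_n)
        theta += eta * grad / N                     # steps 5-7: average over N and ascend
    return theta
```

The defaults eta=0.01 and T_max=4 mirror the settings used in the embodiment below; the trajectory sampling and gradient form follow the REINFORCE estimator of step ⑥.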
Further, the CBOW behavior predicts the Chinese target word from its similar context, and the SG behavior predicts the similar context from the Chinese target word; both the CBOW and SG behaviors are three-layer neural networks.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. the Chinese word vector generation method based on similar context and reinforcement learning computes similar contexts from inter-word similarity, replacing the traditional adjacent context; this avoids adjacent but unrelated Chinese words and increases the semantic relevance of the context;
2. the method uses reinforcement learning to generate Chinese word vectors and can produce word vectors of excellent quality on corpora of various sizes;
3. a trained reinforcement learning agent can be applied directly to a new corpus to generate Chinese word vectors, reducing training time;
4. the method has a wide application range: given any corpus, Chinese word vector generation can be carried out. The invention is suitable for the technical field of natural language processing.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. In the drawings:
FIG. 1 is a general flowchart of the Chinese word vector generation method based on similar context and reinforcement learning according to the present invention.
FIG. 2 is a schematic diagram of similar context discovery in the present invention.
FIG. 3 is a flowchart of similar-above discovery in the present invention.
FIG. 4 is a flowchart of similar-below discovery in the present invention.
FIG. 5 is a model diagram of reinforcement learning according to the present invention.
FIG. 6 is a flowchart illustrating reinforcement learning according to the present invention.
FIG. 7 shows the results of the scalability experiments on the analogy task with different corpus sizes.
FIG. 8 shows the results of the scalability experiments on the similarity task with different corpus sizes.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to examples and accompanying drawings, and the exemplary embodiments and descriptions thereof are only used for explaining the present invention and are not meant to limit the present invention.
Examples
As shown in FIGS. 1 to 8, the present invention relates to a method for generating Chinese word vectors based on similar context and reinforcement learning, the method comprising:
selecting a corpus, and preprocessing the corpus to construct a Chinese corpus;
similar context discovery is carried out on the Chinese target words, and similar context related to the semantics of the Chinese target words is obtained;
and constructing a Chinese word vector reinforcement learning framework, and performing reinforcement learning to obtain word vector representation of the Chinese target word.
The general flow of the invention is shown in fig. 1, and the specific implementation steps are as follows:
step 1, corpus construction: selecting a corpus, and preprocessing the corpus to construct a Chinese corpus;
1.1 Corpus preprocessing: the downloaded Internet text is converted between traditional and simplified Chinese with the opencc toolkit, garbled characters, English text, and punctuation are removed with regular expressions, and Chinese word segmentation is performed with the jieba segmenter;
1.2 The final result is stored to construct the Chinese corpus.
Step 2, finding similar context: similar context discovery is carried out on the Chinese target words, and similar context related to the semantics of the Chinese target words is obtained;
2.1 Word similarity calculation
All words are assigned an initial word vector, and the similarity s(w_i, w_j) between words is computed to express the degree of semantic similarity:

$$ s(w_i, w_j) = \frac{\vec{w}_i \cdot \vec{w}_j}{\lvert \vec{w}_i \rvert \, \lvert \vec{w}_j \rvert} $$

where w_i and w_j denote two words in the Chinese corpus; \vec{w}_i and \vec{w}_j denote their initial word vectors; s(w_i, w_j) denotes the similarity between the two words; i and j denote the subscripts of the words; and |·| denotes the modulus (length) of a word vector.
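As a minimal illustration, the similarity above can be computed directly as the cosine of the angle between the initial word vectors; the sketch below assumes the vectors are stored in a Python dictionary, and the names `init_vecs` and `similarity` are illustrative rather than taken from the patent:

```python
import numpy as np

def similarity(w_i: str, w_j: str, init_vecs: dict) -> float:
    """Similarity s(w_i, w_j): cosine between the initial word vectors."""
    v_i, v_j = init_vecs[w_i], init_vecs[w_j]
    # dot product divided by the product of the vector moduli
    return float(np.dot(v_i, v_j) / (np.linalg.norm(v_i) * np.linalg.norm(v_j)))
```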
2.2 Similar-above discovery for words
Search the history vocabulary of the Chinese target word w_t for words similar to w_t, where the history vocabulary of w_t denotes the words near w_t and to its left, and t denotes the subscript of the Chinese target word; the schematic diagram is shown in FIG. 2 and the flowchart in FIG. 3. The specific steps are as follows (a code sketch of the whole discovery procedure follows section 2.3):
① Determine the window size c, i.e., how many semantically similar words to find for each word (the below window has the same size c as the above window).
② Compute the adaptive similarity threshold T, equal to the average similarity between the Chinese target word w_t and all words in the range [w_{t-c}, w_{t+c}].
③ For each word w_i in the range [w_{t-c}, w_t), if its similarity with the Chinese target word w_t is greater than the adaptive similarity threshold T, take the word as similar-above and increase the similar-above word count n_1 by 1.
④ If n_1 < c, expand the search range leftwards by c words, i.e., search for similar-above words in [w_{t-2c}, w_{t-c}]; the adaptive similarity threshold T is updated to the average similarity between w_t and all words in [w_{t-2c}, w_{t+2c}]; if a word in the range has similarity with w_t greater than T, take it as similar-above and increase n_1 by 1.
⑤ Keep searching leftwards, adding c words and updating the adaptive threshold T each time, and iterate until the number of similar-above words equals c (if the corpus boundary is reached before c words are found, stop iterating and use the current number of similar-above words).
2.3 Similar-below discovery for words
Search the future vocabulary of the Chinese target word w_t for words similar to w_t, where the future vocabulary of w_t denotes the words near w_t and to its right, and t denotes the subscript of the Chinese target word; the schematic diagram is shown in FIG. 2 and the flowchart in FIG. 4. The specific steps are as follows:
① Determine the below-window size c, i.e., how many semantically similar words to find in the text following each word; the below window has the same size as the above window.
② Compute the adaptive similarity threshold T, equal to the average similarity between the Chinese target word w_t and all words in the range [w_{t-c}, w_{t+c}].
③ For each word w_j in the range (w_t, w_{t+c}], if its similarity with the Chinese target word w_t is greater than the adaptive similarity threshold T, take the word as similar-below and increase the similar-below word count n_2 by 1.
④ If n_2 < c, expand the search range rightwards by c words, i.e., search for similar-below words in [w_{t+c}, w_{t+2c}]; the adaptive similarity threshold T is updated to the average similarity between w_t and all words in [w_{t-2c}, w_{t+2c}]; if a word in the range has similarity with w_t greater than T, take it as similar-below and increase n_2 by 1.
⑤ Keep searching rightwards, adding c words and updating the adaptive threshold T each time, and iterate to find similar-below words until their number equals c (if the corpus boundary is reached before c words are found, stop iterating and use the current number of similar-below words).
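As referenced in section 2.2, the following sketch implements the adaptive discovery of steps ① to ⑤ for both directions (`step = -1` for similar-above, `step = +1` for similar-below). It assumes the corpus is given as a tokenized list of words and reuses the illustrative `similarity` helper from section 2.1; all other names are likewise illustrative:

```python
import numpy as np

def adaptive_threshold(corpus, t, radius, init_vecs):
    """T: average similarity between w_t and all words in [w_{t-radius}, w_{t+radius}]."""
    lo, hi = max(0, t - radius), min(len(corpus) - 1, t + radius)
    sims = [similarity(corpus[t], corpus[k], init_vecs)
            for k in range(lo, hi + 1) if k != t]   # exclude w_t itself
    return float(np.mean(sims)) if sims else 0.0

def similar_context(corpus, t, c, init_vecs, step):
    """Collect up to c similar-above (step=-1) or similar-below (step=+1) words of w_t."""
    found, radius = [], c
    # nearest block first: [w_{t-c}, w_t) on the left, (w_t, w_{t+c}] on the right
    block = range(t - 1, t - c - 1, -1) if step < 0 else range(t + 1, t + c + 1)
    while True:
        T = adaptive_threshold(corpus, t, radius, init_vecs)  # re-averaged after expansion
        for k in block:
            if 0 <= k < len(corpus) and similarity(corpus[t], corpus[k], init_vecs) > T:
                found.append(corpus[k])
                if len(found) == c:
                    return found
        # stop at the corpus boundary even if fewer than c similar words were found
        if (step < 0 and t - radius <= 0) or (step > 0 and t + radius >= len(corpus) - 1):
            return found
        # expand the search range by another c words; T is updated on the next pass
        block = (range(t - radius - 1, t - radius - c - 1, -1) if step < 0
                 else range(t + radius + 1, t + radius + c + 1))
        radius += c
```

The similar context SC_t of w_t is then the union of the two directions, e.g. `similar_context(corpus, t, c, vecs, -1) + similar_context(corpus, t, c, vecs, +1)`.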
Step 3, reinforcement learning: and constructing a Chinese word vector reinforcement learning framework, and performing reinforcement learning to obtain word vector representation of the Chinese target word.
3.1 Basic definitions
For the generation of Chinese word vectors the invention provides a reinforcement learning framework, as shown in FIG. 5: the corpus is defined as the environment (Environment), the combination of a Chinese target word w_t and its similar context SC_t is defined as a state (State), and a classifier with the two behaviors CBOW and SG is defined as the agent (Agent). The reinforcement learning process is as follows: when the agent is in a state of the environment, it takes an action (Action); the environment then gives a reward (Reward) for that action; the agent judges the quality of the current action from the reward, learns, and takes a better action in the next state, as shown in FIG. 5. As the agent processes the different states, this process iterates until the set maximum number of iterations is reached; the reinforcement learning is then complete and the Chinese word vectors are finally generated. The specific definitions are as follows:
Environment (Environment E): the environment of the Chinese word vector generation method is the given, preprocessed Chinese corpus.
State S: each state S_t is defined as the combination of a Chinese target word w_t and its similar context SC_t. The environment contains many states: each word together with its similar context constitutes one state.
Action A: two possible actions that the agent may take in different states are defined, CBOW and SG, i.e., A = {a_CBOW, a_SG}. The CBOW behavior predicts the target word from the similar context of the Chinese target word, and the SG behavior predicts the similar context from the Chinese target word, as follows:
a) CBOW behavior
The Chinese target word w_t is predicted given its similar context SC_t, as in the Action 1 part of FIG. 5. CBOW is a three-layer neural network whose layers are as follows:
Input layer: inputs the word vectors of the 2c words contained in the similar context SC_t of the Chinese target word w_t.
Projection layer: accumulates and sums the 2c vectors of the input layer; its output is

$$ x_{w_t} = \sum_{-c \le i \le c,\, i \neq 0} \vec{w}_{t+i} $$

where x_{w_t} denotes the output of the CBOW projection layer given the similar context of the Chinese target word w_t as input, and \vec{w}_t and \vec{w}_{t+i} denote the word vectors of the Chinese target word w_t and of the word w_{t+i}.
Output layer: the probability p(w_t | SC_t) of correctly predicting the Chinese target word from its similar context, equal to the prediction weight of the Chinese target word w_t among all words of the corpus, computed with a softmax function:

$$ p(w_t \mid SC_t) = \frac{\exp\left( \vec{w}^{\,\prime}_t \cdot x_{w_t} \right)}{\sum_{w_j \in E} \exp\left( \vec{w}^{\,\prime}_j \cdot x_{w_t} \right)} $$

where p(w_t | SC_t) denotes the probability that CBOW correctly predicts the Chinese target word w_t from the context SC_t; w_j denotes the j-th word in the corpus; E denotes the corpus; \vec{w}'_t and \vec{w}'_j denote the output word vectors of the corresponding words; and t and j denote the subscripts of the words.
The objective function of CBOW is the following maximum log-likelihood:

$$ \zeta_{CBOW} = \sum_{t=1}^{\lvert E \rvert} \log p(w_t \mid SC_t) $$

where ζ_CBOW is the objective function; p(w_t | SC_t) denotes the probability that CBOW correctly predicts the Chinese target word w_t from the context SC_t; E denotes the corpus; |E| denotes the total number of words in the corpus; and t denotes the subscript of the word.
b) SG behavior
The similar context SC_t is predicted given the Chinese target word w_t: each Chinese target word is used as input to predict its similar context, as in the Action 2 part of FIG. 5. SG is likewise a three-layer neural network whose layers are as follows:
Input layer: inputs the initial word vector \vec{w}_t of w_t.
Projection layer: holds the word vector of the current word. In fact the projection layer in SG has no practical effect and exists only to keep the structure consistent with CBOW; its output is \vec{w}_t.
Output layer: the probability p(w_{t+i} | w_t) of correctly predicting the context SC_t from the Chinese target word w_t, equal to the prediction weight of each context word among all words of the corpus, computed with a softmax function:

$$ p(w_{t+i} \mid w_t) = \frac{\exp\left( \vec{w}^{\,\prime}_{t+i} \cdot \vec{w}_t \right)}{\sum_{w_j \in E} \exp\left( \vec{w}^{\,\prime}_j \cdot \vec{w}_t \right)} $$

where p(w_{t+i} | w_t) denotes the probability that SG correctly predicts each context word from the Chinese target word w_t.
The objective function of SG is the following maximum log-likelihood:

$$ \zeta_{SG} = \sum_{t=1}^{\lvert E \rvert} \sum_{-c \le i \le c,\, i \neq 0} \log p(w_{t+i} \mid w_t) $$

where ζ_SG is the objective function; |E| denotes the total number of words in the corpus; t and i denote the subscripts of the words; and c is half the number of context words.
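A minimal sketch of the two behaviors follows, assuming input word vectors `V` and output word vectors `U` stored as NumPy matrices indexed by word id. The plain full-softmax formulation mirrors the formulas above; the names are illustrative, and the patent does not specify speed-ups such as negative sampling or hierarchical softmax:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cbow_prob(target_id, context_ids, V, U):
    """p(w_t | SC_t): predict the target word from its similar context."""
    x = V[context_ids].sum(axis=0)       # projection layer: sum of the 2c context vectors
    return softmax(U @ x)[target_id]     # output layer: softmax over the whole corpus

def sg_log_prob(target_id, context_ids, V, U):
    """sum_i log p(w_i | w_t): predict each similar-context word from the target."""
    probs = softmax(U @ V[target_id])    # projection layer passes the input vector through
    return float(np.sum(np.log(probs[context_ids])))
```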
Return (Reward R): the return r_t is the environment's feedback on an action and is used to evaluate whether the action taken by the agent is good or bad. The return under the two behaviors is defined as:

$$ r_t^n = \begin{cases} \log p_\theta(w_t \mid SC_t), & a_t^n = a_{CBOW} \\ \sum_{w_i \in SC_t} \log p_\theta(w_i \mid w_t), & a_t^n = a_{SG} \end{cases} $$

where log p_θ(·) denotes a log-probability under the agent π_θ with parameter θ; w_t denotes the Chinese target word and SC_t its similar context; a_t^n denotes the action taken in the t-th state of the n-th trajectory, which is either a_CBOW (the CBOW behavior) or a_SG (the SG behavior); w_i denotes a word in the similar context; and c denotes the similar-above window size.
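In code, the two-case return above reduces to the log-likelihood of whichever behavior was taken, reusing the illustrative `cbow_prob` and `sg_log_prob` helpers sketched earlier in this section:

```python
import numpy as np

def reward(action, target_id, context_ids, V, U):
    """r_t: log-likelihood of the chosen behavior on the current state (w_t, SC_t)."""
    if action == "cbow":
        # log p(w_t | SC_t) under the CBOW behavior
        return float(np.log(cbow_prob(target_id, context_ids, V, U)))
    # sum of log p(w_i | w_t) over the similar context under the SG behavior
    return sg_log_prob(target_id, context_ids, V, U)
```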
Agent (Agent π_θ): the agent is a mapping function π: S → A, which can be viewed as a classifier with parameter θ whose input is a state and whose output is the action to take.
Segment (trajectory τ): a segment is one trajectory of the reinforcement learning from the initial state to the final state, recording the action taken and the return obtained in each state. The n-th segment is defined as

$$ \tau_n = \{ s_1^n, a_1^n, r_1^n, \ldots, s_{\lvert E \rvert}^n, a_{\lvert E \rvert}^n, r_{\lvert E \rvert}^n \} $$
3.2 Reinforcement learning process
In the Chinese word vector reinforcement learning process, the agent continuously interacts with the environment and visits its different states; in each state the agent takes an action, the environment gives a return according to that action, the agent uses the return to judge the action it took, and it learns which action is better to take in the next state. This iterates until the set maximum number of iterations is reached, as shown in FIG. 5; the specific steps are those given above, and the flowchart is shown in FIG. 6.
In the embodiment of the invention, Sogou news from 2008 is selected as the corpus. Corpus preprocessing converts between traditional and simplified Chinese with the opencc toolkit, removes garbled characters, English text, and punctuation with regular expressions, and performs Chinese word segmentation with the jieba segmenter; the resulting Chinese corpus contains 300 million Chinese words with a dictionary size of about 420,000.
In the embodiment of the invention, similar context discovery is performed on the constructed corpus; the similar context discovery model is shown in FIG. 2 and the flows in FIG. 3 and FIG. 4. For reinforcement learning, the learning rate is set to 0.01 and the maximum number of iterations to 4; the reinforcement learning model diagram is shown in FIG. 5 and the flowchart in FIG. 6. The proposed method, named sc2vec, is compared with 7 baseline methods on the analogy task, the similarity tasks (WS-240, WS-296), text classification, and named entity recognition, with experiments on 100% of the corpus; the results are shown in Table 1. Scalability experiments are then performed on 25%, 50%, and 75% of the corpus, with the results shown in FIG. 7 and FIG. 8.
TABLE 1 Experimental results
The experimental results show that the proposed sc2vec method achieves the best results on the analogy, similarity, text classification, and named entity recognition tasks, and the scalability experiments show that it also achieves the best results on corpora of different sizes. The method performs well under varied conditions, indicating that similar contexts overcome the semantic irrelevance of adjacent contexts and that reinforcement learning strengthens the learning architecture; the invention is thus a Chinese word vector generation model with better performance that captures semantic information more accurately. Each experimental result demonstrates the feasibility of the Chinese word vector generation method based on similar context and reinforcement learning.
The method adaptively selects the similar context of the Chinese target word and provides a Chinese word vector reinforcement learning framework that interacts with the corpus and obtains feedback, automatically learns the relations between words in the corpus, searches for similar contexts, reduces the effective corpus size, predicts the target word from its similar context, and thereby generates Chinese word vectors. This avoids the semantic irrelevance of adjacent Chinese contexts, strengthens the learning architecture, shortens training time, and improves the quality of the Chinese word vectors. The method overcomes the problem of semantically unrelated adjacent Chinese words and generates high-quality Chinese word vectors.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (4)

1. A Chinese word vector generation method based on similar context and reinforcement learning, characterized by comprising the following steps:
selecting a corpus and preprocessing it to construct a Chinese corpus;
performing similar context discovery on the Chinese target words to obtain similar contexts semantically related to the Chinese target words;
constructing a Chinese word vector reinforcement learning framework and performing reinforcement learning to obtain the word vector representation of each Chinese target word;
wherein performing similar context discovery on the Chinese target word includes similar-above discovery: searching the history vocabulary of the Chinese target word w_t for words similar to w_t, where the history vocabulary of w_t denotes the words near w_t and to its left, and t denotes the subscript of the Chinese target word, with the following specific steps:
① determine the window size c;
② compute the adaptive similarity threshold T, equal to the average similarity between the Chinese target word w_t and all words in the range [w_{t-c}, w_{t+c}];
③ for each word w_i in the range [w_{t-c}, w_t), if its similarity with the Chinese target word w_t is greater than the adaptive similarity threshold T, take the word as similar-above and increase the similar-above word count n_1 by 1;
④ if n_1 < c, expand the search range leftwards by c words, i.e., search for similar-above words in [w_{t-2c}, w_{t-c}], with the adaptive similarity threshold T updated to the average similarity between w_t and all words in [w_{t-2c}, w_{t+2c}]; if a word in the range has similarity with w_t greater than T, take it as similar-above and increase n_1 by 1;
⑤ keep searching leftwards, adding c words and updating the adaptive threshold T each time, and iterate until the number of similar-above words equals c;
wherein performing similar context discovery on the Chinese target word further includes similar-below discovery: searching the future vocabulary of the Chinese target word w_t for words similar to w_t, where the future vocabulary of w_t denotes the words near w_t and to its right, and t denotes the subscript of the Chinese target word, with the following specific steps:
① determine the below-window size c, which equals the above-window size;
② compute the adaptive similarity threshold T, equal to the average similarity between the Chinese target word w_t and all words in the range [w_{t-c}, w_{t+c}];
③ for each word w_j in the range (w_t, w_{t+c}], if its similarity with the Chinese target word w_t is greater than the adaptive similarity threshold T, take the word as similar-below and increase the similar-below word count n_2 by 1;
④ if n_2 < c, expand the search range rightwards by c words, i.e., search for similar-below words in [w_{t+c}, w_{t+2c}], with the adaptive similarity threshold T updated to the average similarity between w_t and all words in [w_{t-2c}, w_{t+2c}]; if a word in the range has similarity with w_t greater than T, take it as similar-below and increase n_2 by 1;
⑤ keep searching rightwards, adding c words and updating the adaptive threshold T each time, and iterate to find similar-below words until their number equals c;
wherein constructing the Chinese word vector reinforcement learning framework and performing reinforcement learning comprises: defining the corpus as the environment (Environment), the combination of a Chinese target word w_t and its similar context SC_t as a state (State), and a classifier with the two behaviors CBOW and SG as the agent (Agent); when the agent is in a state of the environment, the agent takes an action (Action), the environment gives a reward (Reward) for that action, and the agent judges the quality of the current action from the reward, learns, and takes a better action in the next state; as the agent processes the different states, this process iterates until the set maximum number of iterations is reached, the reinforcement learning is complete, and the Chinese word vectors are finally generated;
wherein constructing the Chinese word vector reinforcement learning framework and performing reinforcement learning specifically comprises the following steps:
A1: initialize the agent π_θ;
A2: set the learning rate η and the maximum number of iterations T_max, and input the Chinese corpus E as the environment E;
A3: according to the agent π_θ and the environment E, let the agent π_θ interact with the environment E and sample N trajectories τ_s;
A4: compute the total return of each trajectory from the N trajectories τ_s;
A5: compute the expected total return of the N trajectories from the per-trajectory total returns;
A6: compute the gradient of the expected total return of the N trajectories from the expected total return;
A7: update the parameter θ according to the gradient of the expected total return;
A8: increase the iteration count by 1; if the maximum number of iterations is reached, stop and output the Chinese word vectors; otherwise return to A2 and continue the iterative training;
wherein the CBOW behavior predicts the Chinese target word from its similar context and the SG behavior predicts the similar context from the Chinese target word, both the CBOW and SG behaviors being three-layer neural networks.
2. The Chinese word vector generation method based on similar context and reinforcement learning according to claim 1, wherein the corpus preprocessing comprises: converting the downloaded Internet text between traditional and simplified Chinese, removing garbled characters, English text, and punctuation, and performing Chinese word segmentation.
3. The Chinese word vector generation method based on similar context and reinforcement learning according to claim 1, characterized in that word similarity expresses the degree of semantic similarity between words, the similarity s(w_i, w_j) being computed as:

$$ s(w_i, w_j) = \frac{\vec{w}_i \cdot \vec{w}_j}{\lvert \vec{w}_i \rvert \, \lvert \vec{w}_j \rvert} $$

where w_i and w_j denote two words in the Chinese corpus; \vec{w}_i and \vec{w}_j denote their initial word vectors; s(w_i, w_j) denotes the similarity between the two words; i and j denote the subscripts of the words; and |·| denotes the modulus (length) of a word vector.
4. The Chinese word vector generation method based on similar context and reinforcement learning according to claim 1, wherein constructing the Chinese word vector reinforcement learning framework and performing reinforcement learning specifically comprises the following steps:
① initialize the agent π_θ with parameter θ;
② set the learning rate η and the maximum number of iterations T_max, and input the Chinese corpus E;
③ let the agent π_θ interact with the environment E and sample N trajectories τ_s = {τ_1, …, τ_n, …, τ_N};
④ compute the total return of each trajectory:

$$ R(\tau_n) = \sum_{t=1}^{\lvert E \rvert} r_t^n $$

where R(τ_n) denotes the total return of the n-th trajectory, τ_n denotes the n-th trajectory, r_t^n denotes the return obtained after the t-th state of the n-th trajectory takes an action, t denotes the t-th state, and |E| denotes the total number of words in the Chinese corpus;
⑤ compute the expected total return over the N trajectories:

$$ \bar{R}_\theta = \frac{1}{N} \sum_{n=1}^{N} R(\tau_n) = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{\lvert E \rvert} r_t^n $$

where \bar{R}_θ denotes the expected total return of the N trajectories and N denotes the total number of trajectories, the remaining symbols being as defined in step ④;
⑥ compute the gradient of the expected total return of the N trajectories:

$$ \nabla \bar{R}_\theta = \frac{1}{N} \sum_{n=1}^{N} R(\tau_n) \, \nabla \log p_\theta(\tau_n) = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{\lvert E \rvert} R(\tau_n) \, \nabla \log p_\theta(a_t^n \mid s_t^n) $$

where ∇\bar{R}_θ denotes the gradient of the expected total return of the N trajectories; N denotes the total number of trajectories; R(τ_n) denotes the total return of the n-th trajectory; τ_n denotes the n-th trajectory; t denotes the t-th state; |E| denotes the total number of words in the Chinese corpus; ∇ log p_θ(τ_n) denotes the gradient of the log-probability of the n-th trajectory under the agent π_θ with parameter θ; a_t^n denotes the action taken in the t-th state of the n-th trajectory, which is either the CBOW behavior a_CBOW or the SG behavior a_SG; and s_t^n = (w_t, SC_t) denotes the t-th state of the n-th trajectory, where w_t denotes the Chinese target word, SC_t its similar context, w_i a word in the similar context, and c the similar-context window size;
⑦ update the parameter θ:

$$ \theta \leftarrow \theta + \eta \, \nabla \bar{R}_\theta $$

where θ denotes the parameter of the agent π_θ, ∇\bar{R}_θ denotes the gradient of the expected total return of the N trajectories, and η denotes the learning rate;
⑧ increase the iteration count by 1; if the maximum number of iterations is reached, stop and output the Chinese word vectors; otherwise return to step ② and continue the iterative training.
CN201911301344.5A 2019-12-17 2019-12-17 Chinese word vector generation method based on similar context and reinforcement learning Expired - Fee Related CN111026848B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911301344.5A CN111026848B (en) 2019-12-17 2019-12-17 Chinese word vector generation method based on similar context and reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911301344.5A CN111026848B (en) 2019-12-17 2019-12-17 Chinese word vector generation method based on similar context and reinforcement learning

Publications (2)

Publication Number Publication Date
CN111026848A CN111026848A (en) 2020-04-17
CN111026848B true CN111026848B (en) 2022-08-02

Family

ID=70209462

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911301344.5A Expired - Fee Related CN111026848B (en) 2019-12-17 2019-12-17 Chinese word vector generation method based on similar context and reinforcement learning

Country Status (1)

Country Link
CN (1) CN111026848B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111291165B (en) * 2020-05-09 2020-08-14 支付宝(杭州)信息技术有限公司 Method and device for embedding training word vector into model
CN112883169B (en) * 2021-04-29 2021-07-16 南京视察者智能科技有限公司 Contradiction evolution analysis method and device based on big data

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090061399A1 (en) * 2007-08-30 2009-03-05 Digital Directions International, Inc. Educational software with embedded sheltered instruction
EP3454260A1 (en) * 2017-09-11 2019-03-13 Tata Consultancy Services Limited Bilstm-siamese network based classifier for identifying target class of queries and providing responses thereof
CN108763504B (en) * 2018-05-30 2020-07-24 浙江大学 Dialog reply generation method and system based on reinforced double-channel sequence learning
CN109597876B (en) * 2018-11-07 2023-04-11 中山大学 Multi-round dialogue reply selection model based on reinforcement learning and method thereof
CN109918162B (en) * 2019-02-28 2021-11-02 集智学园(北京)科技有限公司 High-dimensional graph interactive display method for learnable mass information

Also Published As

Publication number Publication date
CN111026848A (en) 2020-04-17

Similar Documents

Publication Publication Date Title
CN107924680B (en) Spoken language understanding system
Chang et al. Chinese named entity recognition method based on BERT
CN109472024B (en) Text classification method based on bidirectional circulation attention neural network
CN108984526B (en) Document theme vector extraction method based on deep learning
CN107358948B (en) Language input relevance detection method based on attention model
CN106502985B (en) neural network modeling method and device for generating titles
Deng et al. Use of kernel deep convex networks and end-to-end learning for spoken language understanding
CN110046248B (en) Model training method for text analysis, text classification method and device
CN109887484B (en) Dual learning-based voice recognition and voice synthesis method and device
CN113239700A (en) Text semantic matching device, system, method and storage medium for improving BERT
CN110619034A (en) Text keyword generation method based on Transformer model
CN112541356B (en) Method and system for recognizing biomedical named entities
CN110969020A (en) CNN and attention mechanism-based Chinese named entity identification method, system and medium
CN109299479A (en) Translation memory is incorporated to the method for neural machine translation by door control mechanism
CN110162789A (en) A kind of vocabulary sign method and device based on the Chinese phonetic alphabet
CN111354333A (en) Chinese prosody hierarchy prediction method and system based on self-attention
CN111026848B (en) Chinese word vector generation method based on similar context and reinforcement learning
Deng et al. Self-attention-based BiGRU and capsule network for named entity recognition
Mocialov et al. Transfer learning for british sign language modelling
Wu et al. An effective approach of named entity recognition for cyber threat intelligence
CN115831102A (en) Speech recognition method and device based on pre-training feature representation and electronic equipment
CN114417851A (en) Emotion analysis method based on keyword weighted information
CN113609284A (en) Method and device for automatically generating text abstract fused with multivariate semantics
Seilsepour et al. Self-supervised sentiment classification based on semantic similarity measures and contextual embedding using metaheuristic optimizer
US11941360B2 (en) Acronym definition network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
Granted publication date: 20220802