CN110309317A - Word vector generation method, system, electronic device and medium for Chinese corpus - Google Patents

Word vector generation method, system, electronic device and medium for Chinese corpus

Info

Publication number
CN110309317A
Authority
CN
China
Prior art keywords
word
meaning
vector
similarity
synonym
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910429450.5A
Other languages
Chinese (zh)
Other versions
CN110309317B (en)
Inventor
殷复莲
王颜颜
李利
李思彤
冀美琪
夏欣雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Communication University of China
Original Assignee
Communication University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Communication University of China
Priority claimed from CN201910429450.5A
Publication of CN110309317A
Application granted
Publication of CN110309317B
Legal status: Active


Classifications

    • G06F 16/3346: Physics; Computing; Electric digital data processing; Information retrieval of unstructured textual data; query execution using a probabilistic model
    • G06F 16/374: Physics; Computing; Electric digital data processing; Information retrieval of unstructured textual data; creation of semantic tools, e.g. ontology or thesauri; thesaurus
    • G06F 40/247: Physics; Computing; Electric digital data processing; Handling natural language data; natural language analysis; lexical tools; thesauruses, synonyms
    • G06F 40/289: Physics; Computing; Electric digital data processing; Handling natural language data; natural language analysis; recognition of textual entities; phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The present invention provides a word vector generation method, system, electronic device and medium for Chinese corpus, comprising: constructing a database that stores an independent word set, synonym sets and related word sets; collecting a Chinese corpus and segmenting it to obtain a word set; encoding the independent words, synonyms and related words of the words in the word set; inputting the coding vectors into a word representation model to obtain a primary vector for each word; judging whether a word belongs to the independent word set; if it does, taking its primary vector as its output vector; if it does not, inputting each primary vector of the word into a first probabilistic model and a second probabilistic model to obtain, respectively, the first probability that the word belongs to each meaning and the second probability of its context words under each meaning; inputting the first and second probabilities into a third probabilistic model to obtain the third probability that the word belongs to each meaning; and taking the primary vector of the meaning with the maximum third probability as the output vector of the word.

Description

Word vector generation method, system, electronic device and medium for Chinese corpus
Technical field
The present invention relates to the field of natural language processing, and more particularly to a word vector generation method, system, electronic device and medium for Chinese corpus.
Background technique
With the continuous development of Internet technology, more and more people express their viewpoints through online comments. To mine the data value hidden behind user behavior, natural language processing is increasingly applied to a variety of downstream tasks, such as sentiment analysis, entity recognition, machine translation and text summarization. In any natural language processing task, the first problem to consider is how words are represented in a computer.
Distributed vector representations are currently the most widely used word representations. Models such as Word2Vec and GloVe obtain a single vector for each word by training on a corpus, which makes it impossible to represent word ambiguity; the problem is especially prominent for Chinese corpora, which are rich in semantic information. To address ambiguous word representation, some researchers have proposed multi-prototype word representation models built by clustering corpus contexts: the maximum number of clusters is determined first, and a clustering algorithm then groups the contexts of a word into multiple concept items by vector similarity, each cluster representing one concept of the word. However, this approach has non-negligible problems, such as the uncertainty of the cluster number K and the fact that cluster results cannot be mapped to concrete word meanings. Moreover, relying solely on the corpus for model training requires large amounts of corpus data to improve accuracy, which aggravates the computational burden, and inaccurate word representations caused by poor corpus quality cannot be avoided. It is therefore very necessary to study multi-prototype representation of Chinese corpora.
Summary of the invention
In view of the above problems, the present invention provides a multi-prototype word vector generation method, system, electronic device and medium for Chinese corpus.
According to one aspect of the present invention, a word vector generation method for Chinese corpus is provided, comprising:
constructing a database, the database storing independent words as an independent word set, storing the synonyms of each meaning of a word as a synonym set, and storing the related words of each meaning of a word as a related word set;
collecting a Chinese corpus and segmenting it to obtain the word set W = [w_1, w_2, ..., w_b] constituted by the words of the Chinese corpus, where b is the total number of words in the word set;
encoding the independent word set, synonym sets and related word sets of the words in the word set in a first set order, to obtain the coding vector of each independent word, the coding vector of each synonym of each meaning of each word, the coding vector of each related word of each meaning of each word, and the coding vector of each meaning of each word;
inputting the coding vectors into a word representation model, converting the independent words, the synonyms of each meaning of each word, the related words of each meaning of each word and each meaning of each word into primary vectors; the primary vectors of the synonyms of a meaning constitute the primary vector set of that meaning's synonym set, the primary vectors of the related words of a meaning constitute the primary vector set of that meaning's related word set, the primary vector sets of a meaning's synonym set and related word set constitute the primary vector set of that meaning, and the primary vector sets of all meanings of a word constitute the primary vector set of the word;
judging, in a second set order, whether each word in the word set belongs to the independent word set;
if the word belongs to the independent word set, taking the primary vector of the word as the output vector of the word;
if the word does not belong to the independent word set, executing the following steps:
inputting each primary vector in the primary vector set of each meaning of the word into a first probabilistic model, to obtain the first probability that the word belongs to each meaning, wherein the first probabilistic model is constructed by the following formula (1):
where w_t is the t-th word in the word set, c_j is the j-th meaning of word w_t, p̂(c_j | w_t) denotes the first probability that word w_t belongs to meaning c_j, and v(w_ij^t) denotes the primary vector of the i-th synonym or i-th related word in the j-th meaning c_j of word w_t;
inputting the primary vector of each meaning of the word into a second probabilistic model, to obtain the second probability of the word's context words under each meaning, wherein the second probabilistic model is constructed by the following formula (2):
where w_{t+k} denotes a context word of word w_t, v(c_j^t) denotes the primary vector of meaning c_j of word w_t, the primary vector of any other word in the word set besides w_t is obtained by inputting its coding vector into the word representation model as an independent word, u(c_j^t) is the coding vector of meaning c_j of word w_t, and p(w_{t+k} | c_j) denotes the second probability of the context words of word w_t under meaning c_j;
inputting the first probability that the word belongs to each meaning and the second probability of the word's context words under each meaning into a third probabilistic model, to obtain the third probability that the word belongs to each meaning, wherein the third probabilistic model is constructed by the following formula (3):
where p(c_j | w_t) denotes the third probability that word w_t belongs to meaning c_j;
taking the primary vector of the meaning corresponding to the maximum third probability as the output vector of the word.
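To make the data layout underlying these steps concrete, the following is a minimal sketch of the database of the first step. The in-memory layout, resource names and example entries are illustrative assumptions, not taken from the patent:

```python
# Minimal sketch of the database: an independent word set plus, per word,
# one synonym list and one related-word list for each meaning. The example
# entries are illustrative only.
from dataclasses import dataclass, field

@dataclass
class WordEntry:
    synonym_sets: list[list[str]] = field(default_factory=list)  # one list per meaning
    related_sets: list[list[str]] = field(default_factory=list)  # one list per meaning

independent_words: set[str] = {"乙醇"}  # words with a single, unambiguous meaning
lexicon: dict[str, WordEntry] = {
    # "骄傲" (pride) with two meanings, matching the patent's later example
    "骄傲": WordEntry(
        synonym_sets=[["自豪", "傲慢"], ["自满", "自大"]],
        related_sets=[["光荣", "荣耀"], ["虚荣", "浮夸"]],
    ),
}

def meanings_of(word: str) -> int:
    """Number of stored meanings; an independent word counts as one."""
    return 1 if word in independent_words else len(lexicon[word].synonym_sets)
```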
According to another aspect of the present invention, a word vector generation system for Chinese corpus is provided, comprising:
a database, which stores independent words as an independent word set, stores the synonyms of each meaning of a word as a synonym set, and stores the related words of each meaning of a word as a related word set;
an acquisition module, which collects the Chinese corpus;
a word segmentation module, which segments the collected Chinese corpus to obtain the word set W = [w_1, w_2, ..., w_b] constituted by the words of the Chinese corpus, where b is the total number of words in the word set;
a coding module, which encodes the independent word set, synonym sets and related word sets of the words in the word set in a first set order, to obtain the coding vector of each independent word, the coding vector of each synonym of each meaning of each word, the coding vector of each related word of each meaning of each word, and the coding vector of each meaning of each word;
a primary vector set construction module, which inputs the coding vectors into a word representation model and converts the independent words, the synonyms of each meaning of each word, the related words of each meaning of each word and each meaning of each word into primary vectors; the primary vectors of the synonyms of a meaning constitute the primary vector set of that meaning's synonym set, the primary vectors of the related words of a meaning constitute the primary vector set of that meaning's related word set, the primary vector sets of a meaning's synonym set and related word set constitute the primary vector set of that meaning, and the primary vector sets of all meanings of a word constitute the primary vector set of the word;
a judgment module, which judges in a second set order whether each word in the word set belongs to the independent word set; if the word belongs to the independent word set, it sends a signal to the vector output module, and if the word does not belong to the independent word set, it sends a signal to the first probabilistic model module and the second probabilistic model module;
a first probabilistic model module, including a first probabilistic model construction unit and a first data processing unit; the first probabilistic model construction unit constructs the first probabilistic model, and the first data processing unit inputs each primary vector in the primary vector set of each meaning of the word into the first probabilistic model, to obtain the first probability that the word belongs to each meaning, wherein the first probabilistic model is constructed by the following formula (1):
where w_t is the t-th word in the word set, c_j is the j-th meaning of word w_t, p̂(c_j | w_t) denotes the first probability that word w_t belongs to meaning c_j, and v(w_ij^t) denotes the primary vector of the i-th synonym or i-th related word in the j-th meaning c_j of word w_t;
a second probabilistic model module, including a second probabilistic model construction unit and a second data processing unit; the second probabilistic model construction unit constructs the second probabilistic model, and the second data processing unit inputs the primary vector of each meaning of the word into the second probabilistic model, to obtain the second probability of the word's context words under each meaning, wherein the second probabilistic model is constructed by the following formula (2):
where w_{t+k} denotes a context word of word w_t, v(c_j^t) denotes the primary vector of meaning c_j of word w_t, the primary vector of any other word in the word set besides w_t is obtained by inputting its coding vector into the word representation model as an independent word, u(c_j^t) is the coding vector of meaning c_j of word w_t, and p(w_{t+k} | c_j) denotes the second probability of the context words of word w_t under meaning c_j;
a third probabilistic model module, including a third probabilistic model construction unit and a third data processing unit; the third probabilistic model construction unit constructs the third probabilistic model, and the third data processing unit inputs the first probability that the word belongs to each meaning and the second probability of the word's context words under each meaning into the third probabilistic model, to obtain the third probability that the word belongs to each meaning, wherein the third probabilistic model is constructed by the following formula (3):
where p(c_j | w_t) denotes the third probability that word w_t belongs to meaning c_j;
a vector output module, which takes the primary vector of a word belonging to the independent word set as the output vector of that word, and takes the primary vector of the meaning corresponding to the maximum third probability of a word not belonging to the independent word set as the output vector of that word.
In addition, the present invention provides an electronic device, including a memory and a processor, the memory storing a word vector generation program for Chinese corpus which, when executed by the processor, implements the steps of the above word vector generation method for Chinese corpus.
In addition, the present invention provides a computer readable storage medium containing a word vector generation program for Chinese corpus which, when executed by a processor, implements the steps of the above word vector generation method for Chinese corpus.
The above word vector generation method, system, electronic device and medium for Chinese corpus look up the synonym set and related word set of each meaning of a word, so that multiple meaning-representation vectors of the word can be obtained; this better resolves the representation of polysemous words and facilitates semantic disambiguation in word representation. At the same time, the introduction of an expert knowledge base supplements the relationships between words and mitigates inaccurate pre-training results caused by insufficient or poor-quality training corpora.
Description of the drawings
Fig. 1 is a flow chart of the word vector generation method for Chinese corpus of the present invention;
Fig. 2 is a block diagram of the word vector generation system for Chinese corpus of the present invention.
Specific embodiments
In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of one or more embodiments. It is evident, however, that these embodiments may also be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form to facilitate describing one or more embodiments.
Embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Fig. 1 is a flow chart of the word vector generation method for Chinese corpus of the present invention. As shown in Fig. 1, the word vector generation method includes:
Step S1: constructing a database, the database storing independent words as an independent word set, storing the synonyms of each meaning of a word as a synonym set, and storing the related words of each meaning of a word as a related word set;
Step S2: collecting a Chinese corpus and segmenting it to obtain the word set W = [w_1, w_2, ..., w_b] constituted by the words of the Chinese corpus, where b is the total number of words in the word set;
Step S3: encoding the independent word set, synonym sets and related word sets of the words in the word set in a first set order, to obtain the coding vector of each independent word, the coding vector of each synonym of each meaning of each word, the coding vector of each related word of each meaning of each word, and the coding vector of each meaning of each word;
Step S4: inputting the coding vectors into a word representation model, converting the independent words, the synonyms of each meaning of each word, the related words of each meaning of each word and each meaning of each word into primary vectors; the primary vectors of the synonyms of a meaning constitute the primary vector set of that meaning's synonym set, the primary vectors of the related words of a meaning constitute the primary vector set of that meaning's related word set, the primary vector sets of a meaning's synonym set and related word set constitute the primary vector set of that meaning, and the primary vector sets of all meanings of a word constitute the primary vector set of the word; preferably, the word representation model is a Skip-gram model, a CWE model (character-enhanced word embedding) or an SAT model (sememe attention over target);
Step S5: judging, in a second set order, whether each word in the word set belongs to the independent word set;
Step S6: if the word belongs to the independent word set, taking the primary vector of the word as the output vector of the word;
Step S7: if the word does not belong to the independent word set, executing the following steps:
Step S71: inputting each primary vector in the primary vector set of each meaning of the word into the first probabilistic model, to obtain the first probability that the word belongs to each meaning, wherein the first probabilistic model is constructed by the following formula (1):
where w_t is the t-th word in the word set, c_j is the j-th meaning of word w_t, p̂(c_j | w_t) denotes the first probability that word w_t belongs to meaning c_j, and v(w_ij^t) denotes the primary vector of the i-th synonym or i-th related word in the j-th meaning c_j of word w_t;
Step S72: inputting the primary vector of each meaning of the word into the second probabilistic model, to obtain the second probability of the word's context words under each meaning, wherein the second probabilistic model is constructed by the following formula (2):
where w_{t+k} denotes a context word of word w_t, v(c_j^t) denotes the primary vector of meaning c_j of word w_t, the primary vector of any other word in the word set besides w_t is obtained by inputting its coding vector into the word representation model as an independent word, u(c_j^t) is the coding vector of meaning c_j of word w_t, and p(w_{t+k} | c_j) denotes the second probability of the context words of word w_t under meaning c_j;
Step S73: inputting the first probability that the word belongs to each meaning and the second probability of the word's context words under each meaning into the third probabilistic model, to obtain the third probability that the word belongs to each meaning, wherein the third probabilistic model is constructed by the following formula (3):
where p(c_j | w_t) denotes the third probability that word w_t belongs to meaning c_j;
Step S74: taking the primary vector of the meaning corresponding to the maximum third probability as the output vector of the word.
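To make steps S71 to S74 concrete, the following is a minimal sketch of the meaning-selection pipeline. The patent references formulas (1) to (3) by number without reproducing them in this text, so the concrete forms below are assumptions: a softmax over mean member similarity for the first probability, a skip-gram-style softmax for the second, and a renormalized product for the third. All names and array shapes are hypothetical:

```python
# Illustrative sketch of steps S71-S74 under the assumptions stated above.
import numpy as np

def first_probabilities(meaning_member_vecs, word_vec):
    # meaning_member_vecs: list of (n_i, d) arrays, the primary vectors of each
    # meaning's synonyms/related words; word_vec: (d,) vector of w_t
    scores = np.array([float(np.mean(vs @ word_vec)) for vs in meaning_member_vecs])
    e = np.exp(scores - scores.max())
    return e / e.sum()                        # assumed p^(c_j | w_t) per meaning

def second_probabilities(meaning_vecs, context_ids, vocab_vecs):
    # meaning_vecs: (n_meanings, d); vocab_vecs: (V, d); context_ids: vocabulary
    # indices of the observed context words w_{t+k}
    out = []
    for m in meaning_vecs:
        logits = vocab_vecs @ m
        p = np.exp(logits - logits.max())
        p /= p.sum()                          # softmax over the vocabulary
        out.append(float(np.prod(p[context_ids])))
    return np.array(out)                      # assumed p(w_{t+k} | c_j) per meaning

def select_meaning_vector(meaning_vecs, p1, p2):
    p3 = p1 * p2                              # assumed combination of formula (3)
    p3 /= p3.sum()
    return meaning_vecs[int(np.argmax(p3))]  # output vector of the word
```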
The above word vector generation method for Chinese corpus proposes a Chinese multi-prototype word representation learning model (the combination of an existing word representation model with the first, second and third probabilistic models), realizes representation of Chinese polysemous words, solves the semantic ambiguity problem in Chinese word representation, and is finally applied to Chinese word similarity, improving the accuracy of word similarity detection.
Step S71 further includes a step of correcting the first probabilistic model through an attention mechanism; the step of correcting the first probabilistic model includes:
obtaining, from the coding vectors of any two synonyms or related words in each meaning of the word, the prior similarity of those two synonyms or related words according to the following formula (4):
where S_j^t denotes the synonym set of meaning c_j of word w_t, R_j^t denotes the related word set of meaning c_j of word w_t, w_it and w_nt are two synonyms in synonym set S_j^t or two related words in related word set R_j^t, f_p(w_it, w_nt) is the prior similarity of w_it and w_nt, and SIM(w_it, w_nt) is the prior similarity obtained by a similarity method from the coding vectors of w_it and w_nt when both are in related word set R_j^t;
obtaining the vector similarity of any two synonyms or related words in each meaning of the word by the following formula (5):
f_v(w_it, w_nt) = ⟨w_it, w_nt⟩ = Σ_d v_d(w_it) · v_d(w_nt)   (5)
where f_v(w_it, w_nt) is the vector similarity of w_it and w_nt, v_d(w_it) and v_d(w_nt) denote the primary vectors of words w_it and w_nt, and d indexes the vector dimensions;
correcting, by the prior similarity and the vector similarity, the primary vector of each synonym or each related word in each meaning of word w_t according to the following formulas (6) and (7):
where a(w_ij^t) is the correction factor of primary vector v(w_ij^t), representing the attention score of the i-th synonym or i-th related word w_ij^t in the j-th meaning c_j of word w_t, and v'(w_ij^t) is the modification vector of the i-th synonym or i-th related word in the j-th meaning c_j of word w_t;
substituting the modification vectors for the primary vectors as input to the first probabilistic model and/or the second probabilistic model.
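The following is a sketch of this attention correction. Formulas (6) and (7) are referenced by number only in this text, so the score combination below (a softmax over the sum of prior similarity and vector similarity) is an assumption:

```python
# Sketch of the attention correction of one meaning's synonym/related-word
# vectors, under the assumed score combination stated above.
import numpy as np

def vector_similarity(v1, v2):
    # formula (5): inner product over the vector dimensions
    return float(np.dot(v1, v2))

def attention_correct(member_vecs, prior_sims):
    # member_vecs: (n, d) primary vectors of one meaning's members;
    # prior_sims: (n, n) prior similarities f_p from formula (4)/(8)
    vec_sims = member_vecs @ member_vecs.T           # pairwise f_v
    scores = (prior_sims + vec_sims).mean(axis=1)    # assumed per-member score
    att = np.exp(scores - scores.max())
    att /= att.sum()                                 # attention scores a(w_ij)
    return att[:, None] * member_vecs                # modification vectors v'(w_ij)
```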
Preferably, in step S71, the prior similarity of any two synonyms or related words in each meaning of the word is obtained from their coding vectors according to the following formula (8):
where β is the harmonic coefficient weighing the influence of synonyms and related words, β ≤ 1.
In an alternative embodiment, step S71 includes a step of optimizing the harmonic coefficient β, which includes:
obtaining similarity values of paired words from an expert knowledge base or by manual annotation, and adding them to a first sequence in a set order;
obtaining the vector similarity of the primary vector sets of the paired words by a similarity method, and adding the vector similarities to a second sequence in the same set order;
obtaining the overall similarity of the first sequence and the second sequence using a similarity method;
taking the harmonic coefficient β corresponding to the maximum of the overall similarity as the best harmonic coefficient β.
Preferably, the method for obtaining the vector similarity of the primary vector sets of the paired words by a similarity method includes:
where sim(w_1, w_2) denotes the vector similarity of the primary vector sets of paired words w_1 and w_2, S_{k1}^{w1} denotes the primary vector set of the synonym set of the k1-th meaning of w_1, R_{k1}^{w1} denotes the primary vector set of the related word set of the k1-th meaning of w_1, S_{k2}^{w2} denotes the primary vector set of the synonym set of the k2-th meaning of w_2, R_{k2}^{w2} denotes the primary vector set of the related word set of the k2-th meaning of w_2, and β is the harmonic coefficient weighing the influence of synonyms and related words.
The above word vector generation method for Chinese corpus thus gives a Chinese multi-prototype word representation learning model based on an expert knowledge base.
Furthermore, preferably, the method for obtaining the overall similarity of the first sequence and the second sequence using a similarity method includes:
obtaining the overall similarity of the first sequence and the second sequence using Pearson's coefficient by the following formula (10):
ρ(s_1, s_2) = Σ_{i=1}^{N} (s_{1i} − s̄_1)(s_{2i} − s̄_2) / sqrt( Σ_{i=1}^{N} (s_{1i} − s̄_1)² · Σ_{i=1}^{N} (s_{2i} − s̄_2)² )   (10)
where s_1 denotes the first sequence, s_2 denotes the second sequence, N denotes the number of word pairs, and ρ(s_1, s_2) is the Pearson coefficient of s_1 and s_2, representing their overall similarity.
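The following sketches this β search. Formula (9) is not reproduced in this text, so pair_sim below assumes a maximum, over meaning pairs, of synonym-set similarity plus β times related-set similarity; the data layout and names are hypothetical:

```python
# Sketch of the beta grid search: score every labelled word pair, then keep
# the beta whose scores correlate best with the human labels per formula (10).
import numpy as np
from scipy.stats import pearsonr

def set_sim(vecs_a, vecs_b):
    # mean pairwise cosine similarity between two primary vector sets
    a = vecs_a / np.linalg.norm(vecs_a, axis=1, keepdims=True)
    b = vecs_b / np.linalg.norm(vecs_b, axis=1, keepdims=True)
    return float((a @ b.T).mean())

def pair_sim(meanings1, meanings2, beta):
    # meaningsN: list of (synonym_vecs, related_vecs) tuples, one per meaning
    return max(set_sim(s1, s2) + beta * set_sim(r1, r2)
               for s1, r1 in meanings1 for s2, r2 in meanings2)

def best_beta(word_pairs, human_scores, betas=np.linspace(0.1, 1.0, 10)):
    correlations = [pearsonr([pair_sim(m1, m2, b) for m1, m2 in word_pairs],
                             human_scores)[0] for b in betas]
    return float(betas[int(np.argmax(correlations))])
```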
In another alternative embodiment, the step of optimizing the harmonic coefficient β includes:
obtaining the output vectors of a word and of multiple other words through the third probabilistic model;
pairing the word with each of the other words;
obtaining the vector similarity of the output vectors of each pair by a similarity method;
sorting the vector similarities in descending order;
extracting the top-ranked set quantity of other words as the nearest-neighbor set of the word;
adjusting the harmonic coefficient β until the nearest-neighbor set meets a requirement, the requirement being that the output nearest-neighbor set is judged correct manually, or that the number of words in the word's nearest-neighbor set reaches a set proportion of the total number of words in its synonym set and related word set.
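The nearest-neighbor extraction used by this check can be sketched as follows; k and the names are illustrative:

```python
# Rank all other words by cosine similarity of output vectors and keep the
# top-k as the word's nearest-neighbor set.
import numpy as np

def nearest_neighbors(word_vec, other_vecs, other_words, k=10):
    sims = (other_vecs @ word_vec) / (
        np.linalg.norm(other_vecs, axis=1) * np.linalg.norm(word_vec))
    top = np.argsort(-sims)[:k]            # indices in descending similarity
    return [other_words[i] for i in top]   # the nearest-neighbor set
```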
In step S5, the method of judging in the second set order whether each word in the word set belongs to the independent word set includes:
constructing a window, inputting the word set into the window, and obtaining the center word of the window;
judging whether the center word belongs to the independent word set;
successively judging, in a sliding-window manner, whether each word in the word set belongs to the independent word set.
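A minimal sketch of this sliding-window scan follows; the window size of 5 matches the embodiment below, and the generator layout is an illustrative choice:

```python
# Slide a window over the word set, yielding each center word, its context
# words, and whether the center word belongs to the independent word set.
def scan_words(words, independent_words, window=5):
    half = window // 2
    for t in range(len(words)):
        lo, hi = max(0, t - half), min(len(words), t + half + 1)
        center = words[t]                        # center word of the window
        context = words[lo:t] + words[t + 1:hi]  # its context words w_{t+k}
        yield center, context, center in independent_words
```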
The above word vector generation system for Chinese corpus addresses the ambiguity problem in current Chinese word representation by proposing a Chinese multi-prototype word representation method, where "multi-prototype" assumes that the meaning of a word is not unique but is composed of multiple meanings, each represented by its own vector. An expert knowledge base refers to dictionary resources compiled over a long period by domain experts that contain comprehensive word relationships and word annotations, such as HowNet and the Chinese synonym thesaurus (Cilin). The present invention introduces the Chinese synonym thesaurus into Chinese word representation; compared with HowNet, it can more directly correct the poor model performance caused by inaccurate pre-training results, providing a new approach for research based on pre-trained word representations. Through the synonym set or related word set of each meaning of a word in the thesaurus, a multi-prototype representation of the word can be produced, realizing semantic disambiguation in Chinese word representation. Moreover, the introduction of external resources can, to a certain extent, compensate for insufficient data and the resulting inability to represent words absent from the training set, indirectly realizing word representation under small data. Experiments show that the method proposed by the present invention is substantially better than existing models.
Fig. 2 is a block diagram of the word vector generation system for Chinese corpus of the present invention. As shown in Fig. 2, the word vector generation system 1 for Chinese corpus includes:
a database 10, which stores independent words as an independent word set, stores the synonyms of each meaning of a word as a synonym set, and stores the related words of each meaning of a word as a related word set;
an acquisition module 20, which collects the Chinese corpus;
a word segmentation module 30, which segments the collected Chinese corpus to obtain the word set W = [w_1, w_2, ..., w_b] constituted by the words of the Chinese corpus, where b is the total number of words in the word set;
a coding module 40, which encodes the independent word set, synonym sets and related word sets of the words in the word set in a first set order, to obtain the coding vector of each independent word, the coding vector of each synonym of each meaning of each word, the coding vector of each related word of each meaning of each word, and the coding vector of each meaning of each word;
a primary vector set construction module 50, which inputs the coding vectors into a word representation model and converts the independent words, the synonyms of each meaning of each word, the related words of each meaning of each word and each meaning of each word into primary vectors; the primary vectors of the synonyms of a meaning constitute the primary vector set of that meaning's synonym set, the primary vectors of the related words of a meaning constitute the primary vector set of that meaning's related word set, the primary vector sets of a meaning's synonym set and related word set constitute the primary vector set of that meaning, and the primary vector sets of all meanings of a word constitute the primary vector set of the word;
a judgment module 60, which judges in a second set order whether each word in the word set belongs to the independent word set; if the word belongs to the independent word set, it sends a signal to the vector output module, and if the word does not belong to the independent word set, it sends a signal to the first probabilistic model module and the second probabilistic model module;
a first probabilistic model module 70, including a first probabilistic model construction unit 71 and a first data processing unit 72; the first probabilistic model construction unit 71 constructs the first probabilistic model, and the first data processing unit 72 inputs each primary vector in the primary vector set of each meaning of the word into the first probabilistic model, to obtain the first probability that the word belongs to each meaning, wherein the first probabilistic model is constructed by the following formula (1):
where w_t is the t-th word in the word set, c_j is the j-th meaning of word w_t, p̂(c_j | w_t) denotes the first probability that word w_t belongs to meaning c_j, and v(w_ij^t) denotes the primary vector of the i-th synonym or i-th related word in the j-th meaning c_j of word w_t;
a second probabilistic model module 80, including a second probabilistic model construction unit 81 and a second data processing unit 82; the second probabilistic model construction unit 81 constructs the second probabilistic model, and the second data processing unit 82 inputs the primary vector of each meaning of the word into the second probabilistic model, to obtain the second probability of the word's context words under each meaning, wherein the second probabilistic model is constructed by the following formula (2):
where w_{t+k} denotes a context word of word w_t, v(c_j^t) denotes the primary vector of meaning c_j of word w_t, the primary vector of any other word in the word set besides w_t is obtained by inputting its coding vector into the word representation model as an independent word, u(c_j^t) is the coding vector of meaning c_j of word w_t, and p(w_{t+k} | c_j) denotes the second probability of the context words of word w_t under meaning c_j;
a third probabilistic model module 90, including a third probabilistic model construction unit 91 and a third data processing unit 92; the third probabilistic model construction unit 91 constructs the third probabilistic model, and the third data processing unit 92 inputs the first probability that the word belongs to each meaning and the second probability of the word's context words under each meaning into the third probabilistic model, to obtain the third probability that the word belongs to each meaning, wherein the third probabilistic model is constructed by the following formula (3):
where p(c_j | w_t) denotes the third probability that word w_t belongs to meaning c_j;
a vector output module 100, which takes the primary vector of a word belonging to the independent word set as the output vector of that word, and takes the primary vector of the meaning corresponding to the maximum third probability of a word not belonging to the independent word set as the output vector of that word.
Compared with the traditional Skip-gram model, the above word vector generation system for Chinese corpus looks up the synonym set and related word set of each meaning of a word, so that multiple meaning-representation vectors of the word can be obtained; this better resolves the representation of polysemous words and facilitates semantic disambiguation in word representation. At the same time, the introduction of an expert knowledge base supplements the relationships between words and mitigates inaccurate pre-training results caused by insufficient or poor-quality training corpora.
Preferably, the system further includes a correction module 200, which corrects the first probabilistic model through an attention mechanism and includes:
a prior similarity obtaining unit 210, which obtains, from the coding vectors of any two synonyms or related words in each meaning of the word, the prior similarity of those two synonyms or related words according to the following formula (4):
where S_j^t denotes the synonym set of meaning c_j of word w_t, R_j^t denotes the related word set of meaning c_j of word w_t, w_it and w_nt are two synonyms in synonym set S_j^t or two related words in related word set R_j^t, f_p(w_it, w_nt) is the prior similarity of w_it and w_nt, and SIM(w_it, w_nt) is the prior similarity obtained by a similarity method from the coding vectors of w_it and w_nt when both are in related word set R_j^t;
a vector similarity obtaining unit 220, which obtains the vector similarity of any two synonyms or related words in each meaning of the word by the following formula (5):
f_v(w_it, w_nt) = ⟨w_it, w_nt⟩ = Σ_d v_d(w_it) · v_d(w_nt)   (5)
where f_v(w_it, w_nt) is the vector similarity of w_it and w_nt, v_d(w_it) and v_d(w_nt) denote the primary vectors of words w_it and w_nt, and d indexes the vector dimensions;
an amending unit 230, which corrects, by the prior similarity and the vector similarity, the primary vector of each synonym or each related word in each meaning of word w_t according to the following formulas (6) and (7):
where a(w_ij^t) is the correction factor of primary vector v(w_ij^t), representing the attention score of the i-th synonym or i-th related word w_ij^t in the j-th meaning c_j of word w_t, and v'(w_ij^t) is the modification vector of the i-th synonym or i-th related word in the j-th meaning c_j of word w_t;
wherein the modification vectors revised by the amending unit 230 are substituted for the primary vectors as input to the first probabilistic model and/or the second probabilistic model.
Preferably, the prior similarity obtaining unit 210 obtains the prior similarity of any two synonyms or related words in each meaning of the word from their coding vectors according to the following formula (8):
where β is the harmonic coefficient weighing the influence of synonyms and related words, β ≤ 1.
Preferably, the system further includes an optimization module 300, which optimizes the harmonic coefficient β.
In one alternative embodiment of the present invention, the optimization module 300 includes:
a first sequence construction unit 310, which obtains similarity values of paired words from an expert knowledge base or by manual annotation and adds them to the first sequence in a set order;
a second sequence construction unit 320, which obtains the vector similarity of the primary vector sets of the paired words by a similarity method and adds the vector similarities to the second sequence in the same set order;
an overall similarity obtaining unit 330, which obtains the overall similarity of the first sequence and the second sequence using a similarity method;
a best harmonic coefficient obtaining unit 340, which takes the harmonic coefficient β corresponding to the maximum of the overall similarity as the best harmonic coefficient β.
In another alternative embodiment of the present invention, the optimization module 300 includes:
a grouping unit 310', which obtains the output vectors of a word and of multiple other words through the third probabilistic model and pairs the word with each of the other words;
an output vector processing unit 320', which obtains the vector similarity of the output vectors of each pair by a similarity method and sorts the vector similarities in descending order;
a nearest-neighbor set obtaining unit 330', which extracts the top-ranked set quantity of other words as the nearest-neighbor set of the word;
an optimization unit 340', which adjusts the harmonic coefficient β until the nearest-neighbor set meets a requirement, the requirement being that the output nearest-neighbor set is judged correct manually, or that the number of words in the word's nearest-neighbor set reaches a set proportion of the total number of words in its synonym set and related word set.
In one particular embodiment of the present invention, the window size is set to 5, the number of iterations to 5, the minimum word frequency to 10, and the vector dimension to 300. The database uses the expert knowledge base, the Chinese synonym thesaurus (Cilin): after encoding, the words on each line of the thesaurus are in one of three states (synonym, related word or independent word), so by matching any word against the thesaurus, the word may contain multiple meanings, and each meaning may in turn contain multiple synonyms or related words.
The Chinese corpus is mainly drawn from Baidu Baike, Wikipedia, the People's Daily, Sogou News, Zhihu Q&A, Weibo and literary works;
the word representation model uses the Skip-gram model, with training data from Sogou Labs (SogouCA for short), totaling 1.51 GB;
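One way to reproduce this pre-training configuration (window 5, 5 iterations, minimum word frequency 10, 300-dimensional vectors, Skip-gram) is with the gensim library; gensim and the file path below are assumptions, not named in the patent:

```python
from gensim.models import Word2Vec

model = Word2Vec(
    corpus_file="sogou_ca_segmented.txt",  # hypothetical path: one segmented sentence per line
    vector_size=300,  # vector dimension
    window=5,         # window size
    min_count=10,     # minimum word frequency
    sg=1,             # Skip-gram rather than CBOW
    epochs=5,         # number of iterations
)
```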
WordSim-240 and WordSim-297 are used as evaluation data; these are hand-labeled data sets of paired word similarities. WordSim-240 contains 240 Chinese word pairs, most of which are related words, such as "Li Po" and "poetry". WordSim-297 contains 297 Chinese word pairs, largely similar words, such as two Chinese words both meaning "admission ticket". The word pairs in both data sets are ranked by their similarity values.
The evaluation data are fed to the prior-art Skip-gram, GloVe and SAT models and to the word vector generation system trained by the present invention (MP-CWR); with accuracy as the evaluation index, the results are shown in Table 1 below.
Table 1
Model              WordSim-240   WordSim-297
Skip-gram (2013)   0.5703        0.5851
GloVe (2014)       0.5241        0.5277
SAT (2017)         0.5220        0.6150
MP-CWR             0.5724        0.6170
As can be seen from the table above, on the similarity task the model of the present invention improves the accuracy on both the WordSim-240 and WordSim-297 evaluation sets while training on a small amount of data, reaching 0.5724 and 0.6170 respectively.
To verify the performance of the model on the nearest-neighbor detection task, the present invention selects several polysemous and univocal words, computes their vector similarities to other words, and compares against the existing Skip-gram, CWE and SAT models. For example, the polysemous words "pride" and "troop" are selected for nearest-neighbor detection; the results are shown in Table 2 below:
Table 2
As can be seen from Table 2, owing to the limitations of the existing methods, the Skip-gram, CWE and SAT models each ultimately yield only a single meaning. Although CWE and SAT both build multi-prototype representation models, they finally integrate a word's multiple meaning vectors into a single vector representation. With the model constructed by the present invention, however, "pride" and "troop" each have two meaning vectors: the two meanings of "pride" are pride and complacency, and the two meanings of "troop" are team and army. Moreover, whereas the nearest words produced by the other models are mostly related words, the words obtained by the present invention are mainly similar words.
In addition, nearest-neighbor detection is also performed with univocal words; for example, "frog" and "pregnancy" are selected and their nearest neighbors detected, with the results shown in Table 3 below:
Table 3
As can be seen from Table 3, the words obtained by the existing methods are mostly related words rather than similar words, while the model proposed by the present invention mostly obtains similar words; the proposed method therefore finds the nearest-neighbor words of a word better than the other methods.
Addressing the facts that existing word representation models cannot represent polysemous words, produce blurred word representations, and require large corpora, the word vector generation method for Chinese corpus of the present invention proposes a Chinese multi-prototype word representation model. By introducing the Chinese synonym thesaurus, the different meanings of each word and the synonym set and related word set under each meaning can be queried; attention is then used to assign different weights to each word in the synonym and related word sets under each meaning, and the combination finally yields distinct vector representations for the different meanings of the word, realizing polysemous word representation. On the similarity task, the model improves the accuracy on the WordSim-240 and WordSim-297 evaluation sets while training on a small amount of data, reaching 0.5724 and 0.6170 respectively. On the nearest-neighbor detection task, compared with existing methods whose nearest-neighbor words are semantically blurred and mostly related or other words, the nearest-neighbor results of the proposed model clearly distinguish the different meanings of a word, and the results consist mostly of words of similar import.
In addition, the present invention provides an electronic device, including a memory and a processor, the memory storing a word vector generation program for Chinese corpus which, when executed by the processor, implements the steps of the above word vector generation method for Chinese corpus.
In this embodiment, the memory includes at least one type of readable storage medium, which may be a non-volatile storage medium such as a flash memory, hard disk, multimedia card or card-type memory. In some embodiments, the readable storage medium may be an internal storage unit of the electronic device, such as its hard disk. In other embodiments, the readable storage medium may also be an external memory of the electronic device, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card or flash card equipped on the electronic device. The memory can also be used to temporarily store data that has been or will be output.
In some embodiments, the processor may be a central processing unit (CPU), microprocessor or other data processing chip, for running the program code or processing the data stored in the memory.
Preferably, the electronic device further includes a network interface, which may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is commonly used to establish communication connections between the electronic device and other electronic equipment; a communication bus is used to realize connection and communication between these components.
In addition, an embodiment of the present invention also proposes a computer readable storage medium containing a word vector generation program for Chinese corpus which, when executed by a processor, implements the steps of the above word vector generation method for Chinese corpus.
The specific embodiments of the electronic device and the computer readable storage medium of the present invention are substantially the same as those of the above word vector generation method and system for Chinese corpus, and are not repeated here.
Although the disclosure above shows exemplary embodiments of the present invention, it should be noted that many changes and modifications may be made without departing from the scope defined by the claims. The functions, steps and/or actions of the method claims according to the inventive embodiments described herein need not be performed in any particular order. In addition, although elements of the present invention may be described or claimed in the singular, the plural is also contemplated unless limitation to the singular is explicitly stated.

Claims (10)

1. A word vector generation method for Chinese corpus, characterized by comprising:
constructing a database, the database storing independent words as an independent word set, storing the synonyms of each meaning of a word as a synonym set, and storing the related words of each meaning of a word as a related word set;
collecting a Chinese corpus and segmenting it to obtain the word set W = [w_1, w_2, ..., w_b] constituted by the words of the Chinese corpus, where b is the total number of words in the word set;
encoding the independent word set, synonym sets and related word sets of the words in the word set in a first set order, to obtain the coding vector of each independent word, the coding vector of each synonym of each meaning of each word, the coding vector of each related word of each meaning of each word, and the coding vector of each meaning of each word;
inputting the coding vectors into a word representation model, converting the independent words, the synonyms of each meaning of each word, the related words of each meaning of each word and each meaning of each word into primary vectors; the primary vectors of the synonyms of a meaning constitute the primary vector set of that meaning's synonym set, the primary vectors of the related words of a meaning constitute the primary vector set of that meaning's related word set, the primary vector sets of a meaning's synonym set and related word set constitute the primary vector set of that meaning, and the primary vector sets of all meanings of a word constitute the primary vector set of the word;
judging, in a second set order, whether each word in the word set belongs to the independent word set;
if the word belongs to the independent word set, taking the primary vector of the word as the output vector of the word;
if the word does not belong to the independent word set, executing the following steps:
inputting each primary vector in the primary vector set of each meaning of the word into a first probabilistic model, to obtain the first probability that the word belongs to each meaning, wherein the first probabilistic model is constructed by the following formula (1):
where w_t is the t-th word in the word set, c_j is the j-th meaning of word w_t, p̂(c_j | w_t) denotes the first probability that word w_t belongs to meaning c_j, and v(w_ij^t) denotes the primary vector of the i-th synonym or i-th related word in the j-th meaning c_j of word w_t;
inputting the primary vector of each meaning of the word into a second probabilistic model, to obtain the second probability of the word's context words under each meaning, wherein the second probabilistic model is constructed by the following formula (2):
where w_{t+k} denotes a context word of word w_t, v(c_j^t) denotes the primary vector of meaning c_j of word w_t, the primary vector of any other word in the word set besides w_t is obtained by inputting its coding vector into the word representation model as an independent word, u(c_j^t) is the coding vector of meaning c_j of word w_t, and p(w_{t+k} | c_j) denotes the second probability of the context words of word w_t under meaning c_j;
inputting the first probability that the word belongs to each meaning and the second probability of the word's context words under each meaning into a third probabilistic model, to obtain the third probability that the word belongs to each meaning, wherein the third probabilistic model is constructed by the following formula (3):
where p(c_j | w_t) denotes the third probability that word w_t belongs to meaning c_j;
taking the primary vector of the meaning corresponding to the maximum third probability as the output vector of the word.
2. The word vector generation method for Chinese corpus according to claim 1, characterized by further comprising a step of correcting the first probabilistic model through an attention mechanism, the step of correcting the first probabilistic model comprising:
obtaining, from the coding vectors of any two synonyms or related words in each meaning of the word, the prior similarity of those two synonyms or related words according to the following formula (4):
where S_j^t denotes the synonym set of meaning c_j of word w_t, R_j^t denotes the related word set of meaning c_j of word w_t, w_it and w_nt are two synonyms in synonym set S_j^t or two related words in related word set R_j^t, f_p(w_it, w_nt) is the prior similarity of w_it and w_nt, and SIM(w_it, w_nt) is the prior similarity obtained by a similarity method from the coding vectors of w_it and w_nt when both are in related word set R_j^t;
obtaining the vector similarity of any two synonyms or related words in each meaning of the word by the following formula (5):
f_v(w_it, w_nt) = ⟨w_it, w_nt⟩ = Σ_d v_d(w_it) · v_d(w_nt)   (5)
where f_v(w_it, w_nt) is the vector similarity of w_it and w_nt, v_d(w_it) and v_d(w_nt) denote the primary vectors of words w_it and w_nt, and d indexes the vector dimensions;
correcting, by the prior similarity and the vector similarity, the primary vector of each synonym or each related word in each meaning of word w_t according to the following formulas (6) and (7):
where a(w_ij^t) is the correction factor of primary vector v(w_ij^t), representing the attention score of the i-th synonym or i-th related word w_ij^t in the j-th meaning c_j of word w_t, and v'(w_ij^t) is the modification vector of the i-th synonym or i-th related word in the j-th meaning c_j of word w_t;
substituting the modification vectors for the primary vectors as input to the first probabilistic model and/or the second probabilistic model.
3. The word vector generation method for Chinese corpus according to claim 1, characterized in that the method of judging in the second set order whether each word in the word set belongs to the independent word set comprises:
constructing a window, inputting the word set into the window, and obtaining the center word of the window;
judging whether the center word belongs to the independent word set;
successively judging, in a sliding-window manner, whether each word in the word set belongs to the independent word set.
4. The word vector generation method for Chinese corpus according to claim 2, characterized in that the prior similarity of any two synonyms or related words in each meaning of the word is obtained from their coding vectors according to the following formula (8):
where β is the harmonic coefficient weighing the influence of synonyms and related words, β ≤ 1.
5. The word vector generation method for Chinese corpus according to claim 4, characterized by further comprising a step of optimizing the harmonic coefficient β, the step comprising:
obtaining similarity values of paired words from an expert knowledge base or by manual annotation, and adding them to a first sequence in a set order;
obtaining the vector similarity of the primary vector sets of the paired words by a similarity method, and adding the vector similarities to a second sequence in the same set order;
obtaining the overall similarity of the first sequence and the second sequence using a similarity method;
taking the harmonic coefficient β corresponding to the maximum of the overall similarity as the best harmonic coefficient β.
6. a kind of term vector of Chinese corpus generates system characterized by comprising
Autonomous word is stored as independent word set by database, the synonym of each meaning of word is stored as synset, by word The related term of each meaning of language is stored as related word set;
Acquisition module acquires Chinese corpus;
Word segmentation module segments the Chinese corpus of acquisition, obtains the word collection W=[w that the word of Chinese corpus is constituted1, w2..., wb], b is the word sum of word collection;
Coding module, by the word concentrate the independent word set of word, synset and related word set by the first setting order into Row coding, obtains the coding vector of autonomous word, the coding vector of each synonym of each meaning of each word, each word Each meaning each related term coding vector and each word each meaning coding vector;
Primary vector collection constructs module, and coding vector input word is indicated model, autonomous word, each word that word is concentrated Each synonym, each related term of each meaning of each word and each meaning of each word of each meaning are converted into Primary vector, the primary vector of each synonym constitutes the synonymous of each meaning of each word in each meaning of each word The primary vector of the primary vector collection of word set, each related term of each meaning of each word constitutes each meaning of each word The primary vector collection of the related word set of think of, the primary vector collection structure of the synset of each meaning of each word and related word set At the primary vector collection of each meaning of each word, the interesting primary vector collection of each word constitutes each word Primary vector collection;
Judgment module judges that word concentrates whether each word belongs to independent word set with the second setting order, if the word Belong to independent word set, send a signal to vector output module, if the word is not belonging to independent word set, sends a signal to first Probabilistic model module and the second probabilistic model module;
a first probabilistic model module, comprising a first probabilistic model construction unit and a first data processing unit, wherein the first probabilistic model construction unit constructs a first probabilistic model, and the first data processing unit inputs each first vector in the first-vector set of each meaning of the word into the first probabilistic model to obtain a first probability that the word belongs to each meaning, the first probabilistic model being constructed by the following formula (1):
where w_t is the t-th word in the word set, c_j is the j-th meaning of word w_t, the left-hand side of formula (1) denotes the first probability that word w_t belongs to meaning c_j, and the first vectors entering formula (1) are those of the i-th synonyms or i-th related words in the j-th meaning c_j of word w_t;
a second probabilistic model module, comprising a second probabilistic model construction unit and a second data processing unit, wherein the second probabilistic model construction unit constructs a second probabilistic model, and the second data processing unit inputs the first vector of each meaning of the word into the second probabilistic model to obtain a second probability of the context words of the word under each meaning, the second probabilistic model being constructed by the following formula (2):
where w_{t+k} denotes a context word of word w_t, formula (2) involves the first vector of meaning c_j of word w_t, the first vectors obtained by inputting the coding vectors of the words other than w_t in the word set, taken as independent words, into the word representation model, and the coding vector of meaning c_j of word w_t, and its left-hand side denotes the second probability of the context words of word w_t under meaning c_j;
a third probabilistic model module, comprising a third probabilistic model construction unit and a third data processing unit, wherein the third probabilistic model construction unit constructs a third probabilistic model, and the third data processing unit inputs the first probabilities that the word belongs to each meaning and the second probabilities of the context words under each meaning into the third probabilistic model to obtain a third probability that the word belongs to each meaning, the third probabilistic model being constructed by the following formula (3):
where p(c_j|w_t) denotes the third probability that word w_t belongs to meaning c_j;
a vector output module, which takes the first vector of a word belonging to the independent word set as the output vector of that word and, for a word not belonging to the independent word set, takes the first vector of the meaning corresponding to the maximum third probability as the output vector of that word.
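The decision rule of the vector output module at the end of claim 6 can be sketched directly; the third probabilities are taken as precomputed inputs here, since formulas (1)–(3) appear only by reference in this text, and the dictionary names and sample values are illustrative assumptions.

def output_vector(word, independent_vectors, meaning_vectors, third_probs):
    """Claim 6's output rule: an independent word keeps its own first
    vector; an ambiguous word emits the first vector of the meaning
    with the largest third probability p(c_j | w_t)."""
    if word in independent_vectors:
        return independent_vectors[word]
    best_meaning = max(third_probs[word], key=third_probs[word].get)
    return meaning_vectors[word][best_meaning]

meaning_vectors = {"苹果": {"fruit": [0.1, 0.9], "company": [0.8, 0.2]}}
third_probs = {"苹果": {"fruit": 0.3, "company": 0.7}}
print(output_vector("苹果", {}, meaning_vectors, third_probs))  # company vector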
7. The word vector generation system for a Chinese corpus according to claim 6, further comprising a correction module, which corrects the first probabilistic model by an attention mechanism and comprises:
a prior similarity obtaining unit, which obtains the prior similarity of any two synonyms or related words in each meaning of a word from their coding vectors according to the following formula (4):
where formula (4) refers to the synonym set of meaning c_j of word w_t and the related word set of meaning c_j of word w_t, w_it and w_nt are two synonyms in the synonym set or two related words in the related word set, f_p(w_it, w_nt) is the prior similarity of w_it and w_nt, and SIM(w_it, w_nt) is the prior similarity obtained by applying a similarity method to the coding vectors of w_it and w_nt when both belong to the related word set;
a vector similarity obtaining unit, which obtains the vector similarity of any two synonyms or related words in each meaning of the word by the following formula (5):
f_v(w_it, w_nt) = ⟨w_it, w_nt⟩ = Σ_d v_d(w_it)·v_d(w_nt)   (5)
where f_v(w_it, w_nt) is the vector similarity of w_it and w_nt, v_d(w_it) and v_d(w_nt) denote the d-th components of the first vectors of words w_it and w_nt, and d runs over the vector dimensions;
a correction unit, which corrects the first vector of each synonym or related word in each meaning of word w_t by the prior similarity and the vector similarity according to the following formulas (6) and (7):
where the correction factor of a first vector denotes the attention score of the i-th synonym or i-th related word in the j-th meaning c_j of word w_t, and the modification vector is the corrected vector of the i-th synonym or i-th related word in the j-th meaning c_j of word w_t;
wherein the corrected modification vectors output by the correction unit replace the first vectors as input to the first probabilistic model and/or the second probabilistic model.
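Formulas (6) and (7) are given only by reference in this text, so the correction unit can be sketched only under assumptions. In the sketch below, the dot-product vector similarity is formula (5) as stated in the claim; the softmax attention over summed prior and vector similarities, and the scaling of each first vector by its attention score, are illustrative guesses at the attention form rather than the patent's exact definition.

import numpy as np

def vector_similarity(v_i, v_n):
    # Formula (5): f_v(w_it, w_nt) = <w_it, w_nt> = sum_d v_d(w_it) * v_d(w_nt)
    return float(np.dot(v_i, v_n))

def correct_meaning_vectors(first_vectors, prior_sim):
    """first_vectors: first vectors of the synonyms/related words of one
    meaning; prior_sim: matrix of prior similarities f_p between them."""
    n = len(first_vectors)
    scores = np.zeros(n)
    for i in range(n):
        # Aggregate each word's prior and vector similarity to its peers.
        scores[i] = sum(prior_sim[i][k] + vector_similarity(first_vectors[i], first_vectors[k])
                        for k in range(n) if k != i)
    attention = np.exp(scores - scores.max())
    attention /= attention.sum()  # softmax -> assumed correction factors
    # Assumed modification vectors: first vectors scaled by attention scores.
    return [attention[i] * first_vectors[i] for i in range(n)]

vecs = [np.array([1.0, 0.0]), np.array([0.8, 0.2]), np.array([0.0, 1.0])]
prior = [[0.0, 0.9, 0.1], [0.9, 0.0, 0.2], [0.1, 0.2, 0.0]]
for v in correct_meaning_vectors(vecs, prior):
    print(v)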
8. The word vector generation system for a Chinese corpus according to claim 6, wherein the judgment module comprises a window construction unit, a judging unit and a sliding unit, wherein the window construction unit constructs a window and inputs the word set into the window to obtain the center word of the window; the sliding unit slides the window; and the judging unit judges whether the center word belongs to the independent word set.
9. The word vector generation system for a Chinese corpus according to claim 7, wherein the prior similarity obtaining unit obtains the prior similarity of any two synonyms or related words in each meaning of the word from their coding vectors according to the following formula (8):
where β is a harmonic coefficient balancing the influence of synonyms against related words, with β ≤ 1.
10. The word vector generation system for a Chinese corpus according to claim 9, further comprising an optimization module, which optimizes the harmonic coefficient β and comprises:
a first sequence construction unit, which obtains similarity values for pairs of words from an expert knowledge base or by manual annotation and adds them to a first sequence in a set order;
a second sequence construction unit, which obtains the vector similarities of the first-vector sets of the paired words by a similarity method and adds the vector similarities to a second sequence in the same set order;
an overall similarity obtaining unit, which obtains the overall similarity of the first sequence and the second sequence by a similarity method;
a best harmonic coefficient obtaining unit, which takes the harmonic coefficient β corresponding to the maximum overall similarity as the best harmonic coefficient β.
CN201910429450.5A 2019-05-22 2019-05-22 Method, system, electronic device and medium for generating word vector of Chinese corpus Active CN110309317B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910429450.5A CN110309317B (en) 2019-05-22 2019-05-22 Method, system, electronic device and medium for generating word vector of Chinese corpus


Publications (2)

Publication Number Publication Date
CN110309317A true CN110309317A (en) 2019-10-08
CN110309317B CN110309317B (en) 2021-07-23

Family

ID=68075495

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910429450.5A Active CN110309317B (en) 2019-05-22 2019-05-22 Method, system, electronic device and medium for generating word vector of Chinese corpus

Country Status (1)

Country Link
CN (1) CN110309317B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106156004A * 2016-07-04 2016-11-23 Communication University of China Sentiment analysis system and method for film review information based on word vectors
US20170053646A1 * 2015-08-17 2017-02-23 Mitsubishi Electric Research Laboratories, Inc. Method for using a Multi-Scale Recurrent Neural Network with Pretraining for Spoken Language Understanding Tasks
CN106649250A * 2015-10-29 2017-05-10 Beijing Gridsum Technology Co., Ltd. Method and device for identifying emotional new words
CN107102981A * 2016-02-19 2017-08-29 Tencent Technology (Shenzhen) Co., Ltd. Word vector generation method and device
CN109325114A * 2018-07-24 2019-02-12 Wuhan University of Technology Text classification algorithm fusing statistical features and an Attention mechanism

Also Published As

Publication number Publication date
CN110309317B (en) 2021-07-23

Similar Documents

Publication Publication Date Title
CN111310438B (en) Chinese sentence semantic intelligent matching method and device based on multi-granularity fusion model
CN110633409B (en) Automobile news event extraction method integrating rules and deep learning
CN104391942B Short text feature extension method based on semantic graph
CN111241294B (en) Relationship extraction method of graph convolution network based on dependency analysis and keywords
CN110019843B (en) Knowledge graph processing method and device
CN110427623A Semi-structured document knowledge extraction method and device, electronic device and storage medium
CN106599032B Text event extraction method combining sparse coding and a structured perceptron
CN109871538A Chinese electronic medical record named entity recognition method
CN105243129A Commodity attribute feature word clustering method
CN102663129A (en) Medical field deep question and answer method and medical retrieval system
CN104573028A (en) Intelligent question-answer implementing method and system
CN101539907A (en) Part-of-speech tagging model training device and part-of-speech tagging system and method thereof
CN106708929B (en) Video program searching method and device
CN104484380A (en) Personalized search method and personalized search device
CN111444704B (en) Network safety keyword extraction method based on deep neural network
CN117076653B (en) Knowledge base question-answering method based on thinking chain and visual lifting context learning
CN113761890A Multi-level semantic information retrieval method based on BERT context awareness
CN106886565B Automatic aggregation method for basic house types
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN114936277A (en) Similarity problem matching method and user similarity problem matching system
CN111680131A (en) Document clustering method and system based on semantics and computer equipment
CN114997288A (en) Design resource association method
CN111222330A (en) Chinese event detection method and system
CN110516145A (en) Information searching method based on sentence vector coding
CN112883199A (en) Collaborative disambiguation method based on deep semantic neighbor and multi-entity association

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant