CN104391963A - Method for constructing correlation networks of keywords of natural language texts - Google Patents

Method for constructing correlation networks of keywords of natural language texts

Info

Publication number
CN104391963A
CN104391963A (application CN201410719639.5A)
Authority
CN
China
Prior art keywords
word
words
natural language
related network
construction method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410719639.5A
Other languages
Chinese (zh)
Inventor
郭光�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING ZHONGKE CHUANGYI TECHNOLOGY Co Ltd
Original Assignee
BEIJING ZHONGKE CHUANGYI TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING ZHONGKE CHUANGYI TECHNOLOGY Co Ltd filed Critical BEIJING ZHONGKE CHUANGYI TECHNOLOGY Co Ltd
Priority to CN201410719639.5A priority Critical patent/CN104391963A/en
Publication of CN104391963A publication Critical patent/CN104391963A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F 16/374 Thesaurus

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method for constructing a correlation network of keywords in natural language texts. The method comprises: building a keyword dictionary and segmenting a target corpus into words according to the dictionary; counting, on the basis of an N-gram statistical language model, the frequencies with which words co-occur with their preceding and following words; training the language model with a neural network, using the counted frequencies as training data, to obtain word vectors; computing the similarity between the word vectors of each pair of words as a measure of their semantic relatedness, thereby generating a semantic association degree between each pair of words; and generating the keyword correlation network of the texts according to the strength of the semantic association between words. Because word-vector similarity is used as the measure of semantic relatedness between two words, the method effectively improves the precision of text correlation networks in related applications.

Description

A method for constructing a keyword correlation network for natural language texts
Technical field
The invention belongs to the field of natural language processing and in particular relates to a method for constructing a keyword correlation network for natural language texts.
Background art
When evaluating massive science-and-technology project data or summarizing expert information, computer processing is all but indispensable. Within natural language processing, Chinese is considerably harder to process than Western languages of the Romance family because of its intrinsic linguistic features. A prerequisite for a computer to process natural language is text quantification. Text is quantified by extracting the feature words of its content, i.e. extracting industry or domain keywords from text materials such as scientific literature and project declarations and evaluations, and then building a correlation network between texts through keyword matching and the like.
For Chinese, a prerequisite of keyword extraction is word segmentation. After segmentation yields a vocabulary, the most common word representation today expresses each word as a very long vector whose dimension equals the vocabulary size; all elements are 0 except a single dimension whose value is 1, and that dimension identifies the word, assigning every word in the text a numeric code. Stored sparsely, this representation is compact and practical, but any two words are completely isolated from each other: the vectors cannot express any relation between words. Synonyms written with different characters, such as the two different Chinese words that both mean "microphone", therefore cannot be recognized as having the same meaning. As a result, highly related keywords sometimes go unrecognized, and the constructed correlation network has low precision.
Summary of the invention
The technical problem to be solved by the invention is to provide a method for constructing a keyword correlation network for natural language texts that addresses the problems described above.
To this end, the invention provides a method for constructing a keyword correlation network for natural language texts, comprising the steps of:
Step A: build a keyword dictionary, and segment the target corpus into words according to the dictionary, obtaining a plurality of words;
Step B: on the basis of an N-gram statistical language model, count the frequencies with which each obtained word co-occurs with its preceding and following words;
Step C: using the counted frequencies as training data, train the language model with a neural network and obtain word vectors;
Step D: compute the similarity between the word vectors of each pair of words as a measure of their semantic relatedness, generating a semantic association degree between the two words;
Step E: generate the text keyword correlation network according to the strength of the semantic association degree between words.
In Step A, building the keyword dictionary comprises: crawling keyword information from the target corpus with a web crawler and collecting the obtained keywords into a dictionary.
In Step A, segmenting the target corpus according to the dictionary comprises: segmenting based on string matching, combined with segmentation based on semantic understanding and/or segmentation based on statistics of adjacent word co-occurrence frequencies.
In Step C, the obtained word vectors are low-dimensional real-valued vectors of dimension no greater than 100.
In Step B, counting the co-occurrence frequencies on the basis of the N-gram statistical language model comprises: grouping the segmented words into adjacent 1-tuples, 2-tuples, ..., N-tuples, and counting the probability with which each word occurs given the N-1 words that precede it.
In Step C, training the language model with a neural network comprises:
training the language model with a three-layer neural network, in which the word vectors of the preceding N-1 words are concatenated end to end into an (N-1)m-dimensional vector that forms the first layer of the network, m being the dimension of the word vectors;
computing the second layer as d + Hx, with tanh as the activation function, d being a bias term;
outputting |V| nodes y_i in the third layer, where y_i is the unnormalized log-probability that the next word is word i, and then normalizing the output y into probabilities with the softmax activation function; y is computed as
y = b + Wx + U tanh(d + Hx)
where U is the parameter matrix from the second layer to the third layer and b is another bias term;
and optimizing the language model by stochastic gradient descent.
In Step D, computing the similarity of the word vectors of two words comprises computing the cosine distance between the two word vectors.
In summary, the invention provides a method for constructing a keyword correlation network for natural language texts. After segmenting Chinese natural language text, the frequencies of front and rear word co-occurrence are counted on the basis of an N-gram statistical language model; with the counted frequencies as training data, a neural network is used to train the language model and obtain word vectors; the similarity of two word vectors measures the semantic association degree between the two words, from which the correlation network is built. In other words, the semantic information of Chinese is captured through probability statistics, the language model is trained with a neural network, and the result is quantified as word-vector information. A correlation network built in this way incorporates semantic information; compared with approaches that merely assign codes to different words without considering semantics, the network provided by the invention is clearly more precise.
Brief description of the drawings
To describe the embodiments of the invention more clearly, the drawings required for their description are briefly introduced below. The drawings described below are obviously only some embodiments of the invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow chart of the method for constructing a keyword correlation network for natural language texts provided by an embodiment of the invention.
Detailed description of the embodiments
To help those skilled in the art better understand the invention, it is described in further detail below with reference to the drawings and specific embodiments.
The embodiments of the invention provide a method for constructing a keyword correlation network for natural language texts. As shown in Fig. 1, the method comprises the following steps.
Step S110: build a keyword dictionary, and segment the target corpus into words according to the dictionary, obtaining a plurality of words.
Keyword information is crawled from the target corpus with a web crawler, the obtained keywords are collected into a dictionary, and the corpus is segmented according to the dictionary.
The segmentation operation is based on string matching; preferably it is combined with segmentation based on semantic understanding and/or segmentation based on statistics of adjacent word co-occurrence frequencies, and the vocabulary is obtained by segmenting with these methods together. A single segmentation method alone may not be accurate enough, so reasonably combining the three approaches, string matching, understanding-based segmentation, and statistics-based segmentation, improves segmentation accuracy.
Preferably, an n-th order Markov model (n-gram model) can be used to segment the text to be segmented, yielding a first text in which words are separated by spaces; the n-gram model serves to resolve segmentation ambiguity. When the first text contains a target word string, i.e. a word string not yet stored in the dictionary, that string is added to the dictionary, producing an updated dictionary. According to the updated dictionary, the first text is then segmented with forward maximum matching and backward maximum matching, yielding a second text and a third text respectively. From the second and third texts, the one whose word-length statistics (such as the expectation and variance of word length) are more satisfactory is chosen as the segmentation result.
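The forward maximum matching mentioned above can be sketched as follows. This is an illustrative Python sketch, not the patented implementation; the toy lexicon and the English example string are hypothetical stand-ins for a crawled Chinese keyword dictionary.

```python
def forward_max_match(text, dictionary):
    """Greedy forward maximum matching: at each position take the longest
    dictionary word starting there, falling back to a single character so
    the scan always advances. Backward matching scans from the end instead."""
    max_len = max(map(len, dictionary))
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

# Hypothetical toy dictionary; in the method it is crawled from the corpus.
lexicon = {"natural", "language", "naturallanguage", "text", "keyword"}
print(forward_max_match("naturallanguagetext", lexicon))
# → ['naturallanguage', 'text']
```

Because the matcher is greedy, forward and backward scans can disagree on ambiguous strings, which is why the method keeps both results and picks the statistically better one.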
More preferably, a CRF model is obtained by training on a segmented corpus; the CRF model is then used to segment unsegmented corpus; corpus that is segmented successfully and meets a preset condition is added to the segmented corpus; and these steps are repeated until the segmented corpus no longer grows, yielding the final CRF model.
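The bootstrap loop described above can be sketched generically. This is a hypothetical skeleton: the CRF itself is abstracted behind `train`/`label`/`accept` callables, and the numeric toy stand-ins below exist only to show the loop terminating when the corpus stops growing.

```python
def self_training(labeled, unlabeled, train, label, accept):
    """Train a model on the labeled (segmented) corpus, apply it to the
    unlabeled corpus, keep results that meet the acceptance condition,
    and repeat until the labeled corpus no longer grows."""
    while True:
        model = train(labeled)
        grew = False
        for item in list(unlabeled):
            result = label(model, item)
            if accept(model, result):
                labeled.append(result)
                unlabeled.remove(item)
                grew = True
        if not grew:
            return model

# Toy stand-ins (hypothetical): the "model" is the mean of the accepted
# data, and an item is accepted when it lies close to the current mean.
train = lambda data: sum(data) / len(data)
label = lambda model, item: item
accept = lambda model, result: abs(result - model) <= 1.5
model = self_training([4.0, 5.0], [5.5, 6.5, 20.0], train, label, accept)
print(model)
```

The outlier 20.0 is never accepted, so the loop stops once no nearby items remain, mirroring the "until the corpus no longer expands" condition.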
Step S111: on the basis of the N-gram statistical language model, count the frequencies with which each obtained word co-occurs with its preceding and following words.
The statistics require dividing the segmented word string into tuples: the segmented words are grouped into adjacent 1-tuples, 2-tuples, ..., N-tuples, and the probability with which each word occurs, given the N-1 words before it, is counted. Here N is a natural number, i.e. a positive integer.
The N-gram statistical language model is formalized as follows. Given a word string, its probability of being natural language is P(w1, w2, ..., wt), where w1 to wt are the successive words of the text, and the chain rule gives:
P(w1,w2,...,wt) = P(w1) × P(w2|w1) × P(w3|w1,w2) × ... × P(wt|w1,w2,...,wt-1)
where P(w1) is the probability that the first word w1 occurs, P(w2|w1) the probability of the second word given the first, and so on. The probability of a word w thus depends on all the words before it. Since the vocabulary of everyday natural language is very large, computing P(w1, w2, ..., wt) this way is prohibitively complex, so the natural language processing field uses the N-gram language model, which assumes the probability of each word depends only on the N-1 preceding words; P(wt|wt-n+1, ..., wt-1) is therefore used to approximate P(wt|w1, w2, ..., wt-1).
For example, with a 3-gram language model, suppose the whole corpus has been segmented into the word string w1, w2, ..., wn. All consecutive 1-tuples (<w1>, <w2>, <w3>, ..., <wn>), 2-tuples (<w1, w2>, <w2, w3>, ..., <wn-1, wn>) and 3-tuples (<w1, w2, w3>, <w2, w3, w4>, ..., <wn-2, wn-1, wn>) are extracted, and then the probability that each word wt occurs, given that the two preceding words wt-2 and wt-1 have occurred, is counted.
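The counting step above can be sketched as follows: a minimal Python sketch that estimates the conditional probabilities by relative frequency (the English toy sentence is an illustrative stand-in for a segmented Chinese corpus).

```python
from collections import Counter

def ngram_conditional_probs(tokens, n=3):
    """Count n-grams and their (n-1)-word contexts, then estimate
    P(w_t | w_{t-n+1}, ..., w_{t-1}) as count(ngram) / count(context)."""
    context_counts = Counter()
    ngram_counts = Counter()
    for i in range(len(tokens) - n + 1):
        context_counts[tuple(tokens[i:i + n - 1])] += 1
        ngram_counts[tuple(tokens[i:i + n])] += 1
    return {g: c / context_counts[g[:-1]] for g, c in ngram_counts.items()}

tokens = "the cat sat on the cat sat".split()
probs = ngram_conditional_probs(tokens, n=3)
print(probs[("the", "cat", "sat")])  # 1.0: "sat" always follows "the cat"
```

These relative frequencies are exactly the statistics the next step feeds to the neural network as training targets.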
Step S112: using the counted frequencies as training data, train the language model with a neural network and obtain word vectors.
The word vectors used in the embodiments of the invention are low-dimensional real-valued vectors of the form [0.792, -0.177, -0.107, 0.109, -0.542, ...]; the dimension generally does not exceed 100 and is typically an integer such as 50 or 100. Semantic similarity is obtained from the distances between such vectors, while the representational complexity of a high-dimensional vocabulary is greatly reduced.
The word vectors of the invention are obtained by training a language model with a feedforward or recurrent neural network. Let C(w) denote the word vector of word w. The input of the network is the word vectors of the preceding N-1 words wt-n+1, ..., wt-1; the output is a vector whose i-th element is the probability that the next word is wi. The statistical probabilities computed from the corpus N-tuples serve as the training targets; the layer weights of the network are adjusted iteratively, and when optimization finishes both the language model and the word vectors are obtained.
As one embodiment, a three-layer neural network is used to build the language model.
wt-n+1, ..., wt-1 are the preceding N-1 words, from which the next word wt is to be predicted. C(w) denotes the word vector of word w; a single set of word vectors is shared across the whole model, stored in a |V| × m matrix C, where |V| is the vocabulary size (the total number of distinct words in the corpus) and m is the word-vector dimension. Mapping w to C(w) amounts to taking one row of the matrix.
The first layer (input layer) concatenates the N-1 vectors C(wt-n+1), ..., C(wt-2), C(wt-1) end to end into an (N-1)m-dimensional vector x.
The second layer (hidden layer) is computed directly as d + Hx, as in an ordinary neural network, where d is a bias term; tanh is then applied as the activation function.
The third layer (output layer) has |V| nodes; each node yi is the unnormalized log-probability that the next word is word i. Finally the softmax activation function normalizes the output y into probabilities. y is computed as:
y=b+Wx+Utanh(d+Hx)
U in the formula (a |V| × h matrix) holds the hidden-to-output parameters; most of the model's computation is concentrated in the matrix multiplication between U and the hidden layer. The model is finally optimized by stochastic gradient descent. Note that whereas the input layer of an ordinary neural network is just an input value, the input layer of this model is itself a parameter (stored in C) and must also be optimized. When optimization finishes, the word vectors and the language model are produced simultaneously.
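A single forward pass of this three-layer model can be sketched in a few lines of numpy. The parameters below are randomly initialised for illustration only; in the method they would be fitted by stochastic gradient descent, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
V, m, n, h_dim = 10, 4, 3, 8   # vocab size, vector dim, n-gram order, hidden units

C = rng.normal(size=(V, m))                # word-vector lookup table (also trained)
H = rng.normal(size=(h_dim, (n - 1) * m))  # input-to-hidden weights
d = np.zeros(h_dim)                        # hidden bias
U = rng.normal(size=(V, h_dim))            # hidden-to-output weights
W = rng.normal(size=(V, (n - 1) * m))      # direct input-to-output weights
b = np.zeros(V)                            # output bias

def forward(context_ids):
    """Concatenate the N-1 context word vectors (layer 1), apply
    y = b + Wx + U tanh(d + Hx), then softmax-normalise into
    next-word probabilities."""
    x = np.concatenate([C[i] for i in context_ids])
    y = b + W @ x + U @ np.tanh(d + H @ x)
    e = np.exp(y - y.max())                # numerically stable softmax
    return e / e.sum()

p = forward([1, 2])                        # N-1 = 2 context word ids
print(p.shape, round(float(p.sum()), 6))   # (10,) 1.0
```

The output is a proper probability distribution over the |V| candidate next words, matching the role of the softmax layer described above.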
More preferably, the following neural network can represent the language model:
h = Σ_{i=1}^{t-1} H_i C(w_i)
y_j = C(w_j)^T h
where C(w) is the word vector of w and h is the hidden layer of the three-layer n-gram network, carrying semantic information. H_i is an m × m matrix that can be understood as the contribution the i-th word makes, after transformation by H_i, to the t-th word; the hidden layer h is thus a summary of the preceding t-1 words, i.e. one prediction of the next word.
y_j, the predicted log-probability that the next word is w_j, is obtained from the inner product of C(w_j) and h, which directly reflects the similarity of the two. If the norms of the word vectors are roughly equal, the magnitude of the inner product reflects the cosine between the two vectors.
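This compact variant can be sketched as follows: a minimal numpy sketch with randomly initialised, untrained parameters, shown only to make the shapes of h and y_j concrete.

```python
import numpy as np

rng = np.random.default_rng(1)
V, m, t = 8, 4, 3              # vocab size, vector dim, position to predict

C = rng.normal(size=(V, m))    # shared word vectors
H = rng.normal(size=(t - 1, m, m))  # one m x m matrix per context position

def predict_scores(context_ids):
    """h = sum_i H_i C(w_i) summarises the preceding words; the score
    y_j = C(w_j)^T h is the inner product with each candidate's vector."""
    h = sum(H[i] @ C[w] for i, w in enumerate(context_ids))
    return C @ h               # y_j for every word j in the vocabulary

y = predict_scores([2, 5])
print(y.shape)                 # (8,)
```

Because every candidate is scored by an inner product with h, words with similar vectors necessarily receive similar scores, which is exactly the property the text exploits.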
Preferably, a large vocabulary can also be split into several small vocabularies, each assigned its own neural network language model. The models share the same input dimension and are first trained independently; their output vectors are then merged for a second round of training, yielding a normalized neural network language model.
Step S113: compute the similarity between the word vectors of each pair of words as a measure of their semantic relatedness, generating a semantic association degree between the two words.
Using the vector space model, the distance between the vectors of two words is computed as the measure of their semantic relatedness, generating the semantic association degree between them, from which the whole keyword semantic network is constructed. Computing the similarity of two word vectors comprises computing the cosine distance between them.
Each word is expressed as a floating-point vector, i.e. a vector in a high-dimensional space; the angle between two vectors is used to compute their distance and hence their degree of similarity.
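The cosine measure can be sketched in a few lines. The three vectors below are hypothetical illustrations, not output of the trained model.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors: the semantic
    relatedness measure used in step S113."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 3-d vectors; real ones come from the trained language model.
mic_a = np.array([0.79, -0.18, -0.11])   # e.g. "microphone"
mic_b = np.array([0.81, -0.15, -0.09])   # e.g. a synonym of "microphone"
other = np.array([-0.50, 0.90, 0.30])    # an unrelated word
print(cosine_similarity(mic_a, mic_b) > cosine_similarity(mic_a, other))  # True
```

Unlike one-hot codes, nearby dense vectors give the synonym pair a high score, which is what allows the network to link differently written synonyms.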
Step S114: generate the text keyword correlation network according to the strength of the semantic association degree between words.
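One simple way to realise this step is to link every pair of keywords whose similarity exceeds a threshold. The sketch below assumes this thresholding scheme (the patent does not fix one) and uses hypothetical 2-d vectors.

```python
import numpy as np

def build_keyword_network(vectors, threshold):
    """Connect every pair of keywords whose word-vector cosine similarity
    reaches the threshold; each edge carries the similarity as its weight."""
    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    words = list(vectors)
    edges = {}
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            sim = cos(vectors[a], vectors[b])
            if sim >= threshold:
                edges[(a, b)] = sim
    return edges

# Hypothetical 2-d word vectors for illustration only.
vectors = {
    "microphone": np.array([1.0, 0.1]),
    "mic":        np.array([0.9, 0.2]),
    "network":    np.array([-0.2, 1.0]),
}
net = build_keyword_network(vectors, threshold=0.8)
print(sorted(net))  # [('microphone', 'mic')]
```

The resulting weighted edge list is the keyword correlation network; the threshold controls how strong a semantic association must be before two keywords are linked.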
Science-and-technology projects, scientific achievements, and expert information are all described and expressed as text. To quantify, compare, and evaluate the content of large-scale achievement, expert, and literature databases, the computer must understand the semantics of the various texts before correlation can be computed accurately. For example, text similarity computation is needed when analyzing whether projects are similar or when performing fuzzy search, and pattern-matching analysis between the keywords describing an expert and project keywords is needed in expert-capability analysis.
Moreover, existing keyword networks are mostly built from manually constructed dictionaries and cannot recognize emerging words or words absent from the dictionary. The segmentation algorithms commonly used in Chinese information processing fail to recognize industry-specific keywords, and projects under evaluation, which often involve scientific and technological innovation, coin new technical terms and nouns. It is therefore necessary not only to recognize keywords but also to identify keyword associations more accurately on the basis of their semantic relatedness, i.e. to unify techniques such as natural language processing, information retrieval, and pattern recognition, form a corpus from existing information, and analyze the correlations between keywords by statistical means.
The construction method provided by the embodiments of the invention can therefore be used in applications such as the quantitative evaluation of science-and-technology projects and achievements and the assessment and selection of experts. Because a word-vector algorithm that supports distance computation is adopted, the semantic similarity of word vectors can be obtained, and the generated word semantic network better represents the degree of relatedness between words. At the same time, when applied to large-scale corpora, the word-vector dimension is low, generally no more than 100, so the complexity is greatly reduced compared with the sparse word-vector representation in common use.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. The invention is therefore not limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (7)

1. A method for constructing a keyword correlation network for natural language texts, characterized by comprising the steps of:
Step A: building a keyword dictionary, and segmenting a target corpus into words according to the dictionary, obtaining a plurality of words;
Step B: counting, on the basis of an N-gram statistical language model, the frequencies with which each obtained word co-occurs with its preceding and following words;
Step C: using the counted frequencies as training data, training the language model with a neural network and obtaining word vectors;
Step D: computing the similarity between the word vectors of each pair of words as a measure of their semantic relatedness, generating a semantic association degree between the two words;
Step E: generating the text keyword correlation network according to the strength of the semantic association degree between words.
2. The method for constructing a keyword correlation network for natural language texts according to claim 1, characterized in that building the keyword dictionary in Step A comprises: crawling keyword information from the target corpus with a web crawler and collecting the obtained keywords into a dictionary.
3. The method for constructing a keyword correlation network for natural language texts according to claim 2, characterized in that segmenting the target corpus according to the dictionary in Step A comprises: segmenting based on string matching, combined with segmentation based on semantic understanding and/or segmentation based on statistics of adjacent word co-occurrence frequencies.
4. The method for constructing a keyword correlation network for natural language texts according to claim 1, characterized in that the word vectors obtained in Step C are low-dimensional real-valued vectors of dimension no greater than 100.
5. The method for constructing a keyword correlation network for natural language texts according to claim 1, characterized in that counting the frequencies in Step B on the basis of the N-gram statistical language model comprises: grouping the segmented words into adjacent 1-tuples, 2-tuples, ..., N-tuples, and counting the probability with which each word occurs given its preceding N-1 words.
6. The method for constructing a keyword correlation network for natural language texts according to claim 1, characterized in that training the language model with a neural network in Step C comprises:
training the language model with a three-layer neural network, in which the word vectors of the preceding N-1 words are concatenated end to end into an (N-1)m-dimensional vector forming the first layer of the network, m being the dimension of the word vectors;
computing the second layer as d + Hx, with tanh as the activation function, d being a bias term;
outputting |V| nodes y_i in the third layer, y_i being the unnormalized log-probability that the next word is word i, and then normalizing the output y into probabilities with the softmax activation function, y being computed as
y = b + Wx + U tanh(d + Hx)
where U is the parameter matrix from the second layer to the third layer and b is another bias term;
and optimizing the language model by stochastic gradient descent.
7. The method for constructing a keyword correlation network for natural language texts according to claim 1, characterized in that computing the similarity of the word vectors of two words in Step D comprises computing the cosine distance between the two word vectors.
CN201410719639.5A 2014-12-01 2014-12-01 Method for constructing correlation networks of keywords of natural language texts Pending CN104391963A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410719639.5A CN104391963A (en) 2014-12-01 2014-12-01 Method for constructing correlation networks of keywords of natural language texts

Publications (1)

Publication Number Publication Date
CN104391963A true CN104391963A (en) 2015-03-04

Family

ID=52609867

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410719639.5A Pending CN104391963A (en) 2014-12-01 2014-12-01 Method for constructing correlation networks of keywords of natural language texts

Country Status (1)

Country Link
CN (1) CN104391963A (en)

Cited By (50)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104881400A (en) * 2015-05-19 2015-09-02 上海交通大学 Semantic dependency calculating method based on associative network
CN105183714A (en) * 2015-08-27 2015-12-23 北京时代焦点国际教育咨询有限责任公司 Sentence similarity calculation method and apparatus
CN105488207A (en) * 2015-12-10 2016-04-13 合一网络技术(北京)有限公司 Semantic coding method and apparatus for network resources
CN105677769A (en) * 2015-12-29 2016-06-15 广州神马移动信息科技有限公司 Keyword recommending method and system based on latent Dirichlet allocation (LDA) model
CN105787078A (en) * 2016-03-02 2016-07-20 合网络技术(北京)有限公司 Method and device for displaying multimedia headlines
WO2016180268A1 (en) * 2015-05-13 2016-11-17 阿里巴巴集团控股有限公司 Text aggregate method and device
CN106293114A (en) * 2015-06-02 2017-01-04 阿里巴巴集团控股有限公司 The method and device of prediction user's word to be entered
CN106372086A (en) * 2015-07-23 2017-02-01 华中师范大学 Word vector acquisition method and apparatus
CN106503231A (en) * 2016-10-31 2017-03-15 北京百度网讯科技有限公司 Searching method and device based on artificial intelligence
CN106610972A (en) * 2015-10-21 2017-05-03 阿里巴巴集团控股有限公司 Query rewriting method and apparatus
CN106815252A (en) * 2015-12-01 2017-06-09 阿里巴巴集团控股有限公司 A kind of searching method and equipment
CN106844327A (en) * 2015-12-07 2017-06-13 科大讯飞股份有限公司 Text code method and system
CN106874643A (en) * 2016-12-27 2017-06-20 中国科学院自动化研究所 Build the method and system that knowledge base realizes assisting in diagnosis and treatment automatically based on term vector
CN106997376A (en) * 2017-02-28 2017-08-01 浙江大学 The problem of one kind is based on multi-stage characteristics and answer sentence similarity calculating method
CN107122413A (en) * 2017-03-31 2017-09-01 北京奇艺世纪科技有限公司 A kind of keyword extracting method and device based on graph model
CN107122451A (en) * 2017-04-26 2017-09-01 北京科技大学 A kind of legal documents case by grader method for auto constructing
CN107291690A (en) * 2017-05-26 2017-10-24 北京搜狗科技发展有限公司 Punctuate adding method and device, the device added for punctuate
CN107291836A (en) * 2017-05-31 2017-10-24 北京大学 A kind of Chinese text summary acquisition methods based on semantic relevancy model
CN107341152A (en) * 2016-04-28 2017-11-10 阿里巴巴集团控股有限公司 A kind of method and device of parameter input
WO2017206492A1 (en) * 2016-05-31 2017-12-07 北京百度网讯科技有限公司 Binary feature dictionary construction method and apparatus
CN107729509A (en) * 2017-10-23 2018-02-23 中国电子科技集团公司第二十八研究所 Discourse similarity determination method based on implicit high-dimensional distributed feature representation
CN107818080A (en) * 2017-09-22 2018-03-20 新译信息科技(北京)有限公司 Term recognition method and device
CN107871158A (en) * 2016-09-26 2018-04-03 清华大学 Knowledge graph representation learning method and device incorporating sequential text information
CN107920773A (en) * 2016-01-18 2018-04-17 国立研究开发法人情报通信研究机构 Material evaluation method and material evaluation apparatus
CN108135520A (en) * 2015-10-23 2018-06-08 美国西门子医疗解决公司 Generating natural language representations of mental content from functional brain images
CN108280061A (en) * 2018-01-17 2018-07-13 北京百度网讯科技有限公司 Text processing method and device based on ambiguous entity words
CN108287916A (en) * 2018-02-11 2018-07-17 北京方正阿帕比技术有限公司 Resource recommendation method
WO2018157703A1 (en) * 2017-03-02 2018-09-07 腾讯科技(深圳)有限公司 Natural language semantic extraction method and device, and computer storage medium
CN108920455A (en) * 2018-06-13 2018-11-30 北京信息科技大学 Automatic evaluation method for automatically generated Chinese text
CN109543041A (en) * 2018-11-30 2019-03-29 安徽听见科技有限公司 Language model score generation method and device
CN109558586A (en) * 2018-11-02 2019-04-02 中国科学院自动化研究所 Self-evidence scoring method, device and storage medium for information statements
CN109614538A (en) * 2018-12-17 2019-04-12 广东工业大学 Extraction method, device and equipment for agricultural product price data
CN109614617A (en) * 2018-06-01 2019-04-12 安徽省泰岳祥升软件有限公司 Word vector generation method and device supporting polarity differentiation and polysemy
CN109871530A (en) * 2018-12-28 2019-06-11 广州索答信息科技有限公司 Automatic extraction method and storage medium for seed words in the menu domain
CN109918654A (en) * 2019-02-21 2019-06-21 北京一品智尚信息科技有限公司 Logo paraphrasing method, device and medium
CN110532547A (en) * 2019-07-31 2019-12-03 厦门快商通科技股份有限公司 Corpus construction method and apparatus, electronic device and medium
CN110765765A (en) * 2019-09-16 2020-02-07 平安科技(深圳)有限公司 Contract key clause extraction method, device and storage medium based on artificial intelligence
US10593422B2 (en) 2017-12-01 2020-03-17 International Business Machines Corporation Interaction network inference from vector representation of words
CN111192682A (en) * 2019-12-25 2020-05-22 上海联影智能医疗科技有限公司 Image exercise data processing method, system and storage medium
CN111199154A (en) * 2019-12-20 2020-05-26 重庆邮电大学 Fault-tolerant rough set-based polysemous word expression method, system and medium
CN111460169A (en) * 2020-03-27 2020-07-28 科大讯飞股份有限公司 Semantic expression generation method, device and equipment
CN111583910A (en) * 2019-01-30 2020-08-25 北京猎户星空科技有限公司 Model updating method and device, electronic equipment and storage medium
CN111581952A (en) * 2020-05-20 2020-08-25 长沙理工大学 Large-scale replaceable word bank construction method for natural language information hiding
CN111611809A (en) * 2020-05-26 2020-09-01 西藏大学 Chinese sentence similarity calculation method based on neural network
CN111694961A (en) * 2020-06-23 2020-09-22 上海观安信息技术股份有限公司 Keyword semantic classification method and system for sensitive data leakage detection
US10922486B2 (en) 2019-03-13 2021-02-16 International Business Machines Corporation Parse tree based vectorization for natural language processing
CN113377965A (en) * 2021-06-30 2021-09-10 中国农业银行股份有限公司 Method and related device for perceiving text keywords
US11182665B2 (en) 2016-09-21 2021-11-23 International Business Machines Corporation Recurrent neural network processing pooling operation
CN114154513A (en) * 2022-02-07 2022-03-08 杭州远传新业科技有限公司 Automatic domain semantic web construction method and system
CN115168563A (en) * 2022-09-05 2022-10-11 深圳市华付信息技术有限公司 Airport service guiding method, system and device based on intention recognition

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101477566A (en) * 2009-01-19 2009-07-08 腾讯科技(深圳)有限公司 Method and apparatus for placing candidate keyword advertisements
CN101894351A (en) * 2010-08-09 2010-11-24 北京邮电大学 Multi-agent-based personalized tourism multimedia information service system
US20120253792A1 (en) * 2011-03-30 2012-10-04 Nec Laboratories America, Inc. Sentiment Classification Based on Supervised Latent N-Gram Analysis
CN103678282A (en) * 2014-01-07 2014-03-26 苏州思必驰信息科技有限公司 Word segmentation method and device
CN103810999A (en) * 2014-02-27 2014-05-21 清华大学 Linguistic model training method and system based on distributed neural networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
罗灏 (Luo Hao): "Research on Semantic-Based Similarity Calculation of Science and Technology Projects", China Master's Theses Full-text Database (Information Science and Technology Series) *

Cited By (83)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016180268A1 (en) * 2015-05-13 2016-11-17 阿里巴巴集团控股有限公司 Text aggregation method and device
CN104881400A (en) * 2015-05-19 2015-09-02 上海交通大学 Semantic dependency calculation method based on associative network
CN104881400B (en) * 2015-05-19 2018-01-19 上海交通大学 Semantic dependency calculation method based on associative network
CN106293114A (en) * 2015-06-02 2017-01-04 阿里巴巴集团控股有限公司 Method and device for predicting a user's word to be entered
CN106293114B (en) * 2015-06-02 2019-03-29 阿里巴巴集团控股有限公司 Method and device for predicting a user's word to be entered
CN106372086A (en) * 2015-07-23 2017-02-01 华中师范大学 Word vector acquisition method and apparatus
CN106372086B (en) * 2015-07-23 2019-12-03 华中师范大学 Word vector acquisition method and apparatus
CN105183714A (en) * 2015-08-27 2015-12-23 北京时代焦点国际教育咨询有限责任公司 Sentence similarity calculation method and apparatus
CN106610972A (en) * 2015-10-21 2017-05-03 阿里巴巴集团控股有限公司 Query rewriting method and apparatus
CN108135520B (en) * 2015-10-23 2021-06-04 美国西门子医疗解决公司 Generating natural language representations of psychological content from functional brain images
US10856815B2 (en) 2015-10-23 2020-12-08 Siemens Medical Solutions Usa, Inc. Generating natural language representations of mental content from functional brain images
CN108135520A (en) * 2015-10-23 2018-06-08 美国西门子医疗解决公司 Generating natural language representations of mental content from functional brain images
CN106815252B (en) * 2015-12-01 2020-08-25 阿里巴巴集团控股有限公司 Searching method and device
CN106815252A (en) * 2015-12-01 2017-06-09 阿里巴巴集团控股有限公司 Search method and device
CN106844327A (en) * 2015-12-07 2017-06-13 科大讯飞股份有限公司 Text coding method and system
CN106844327B (en) * 2015-12-07 2020-11-17 科大讯飞股份有限公司 Text coding method and system
CN105488207A (en) * 2015-12-10 2016-04-13 合一网络技术(北京)有限公司 Semantic coding method and apparatus for network resources
CN105677769B (en) * 2015-12-29 2018-01-05 广州神马移动信息科技有限公司 Keyword recommendation method and system based on latent Dirichlet allocation (LDA) model
CN105677769A (en) * 2015-12-29 2016-06-15 广州神马移动信息科技有限公司 Keyword recommendation method and system based on latent Dirichlet allocation (LDA) model
CN107920773A (en) * 2016-01-18 2018-04-17 国立研究开发法人情报通信研究机构 Material evaluation method and material evaluating apparatus
CN107920773B (en) * 2016-01-18 2020-11-17 国立研究开发法人情报通信研究机构 Material evaluation method and material evaluation device
CN105787078B (en) * 2016-03-02 2020-02-14 合一网络技术(北京)有限公司 Multimedia title display method and device
CN105787078A (en) * 2016-03-02 2016-07-20 合网络技术(北京)有限公司 Method and device for displaying multimedia headlines
CN107341152B (en) * 2016-04-28 2020-05-08 创新先进技术有限公司 Parameter input method and device
CN107341152A (en) * 2016-04-28 2017-11-10 阿里巴巴集团控股有限公司 Parameter input method and device
US10831993B2 (en) 2016-05-31 2020-11-10 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for constructing binary feature dictionary
WO2017206492A1 (en) * 2016-05-31 2017-12-07 北京百度网讯科技有限公司 Binary feature dictionary construction method and apparatus
US11182665B2 (en) 2016-09-21 2021-11-23 International Business Machines Corporation Recurrent neural network processing pooling operation
CN107871158A (en) * 2016-09-26 2018-04-03 清华大学 Knowledge graph representation learning method and device incorporating sequential text information
CN106503231B (en) * 2016-10-31 2020-02-04 北京百度网讯科技有限公司 Search method and device based on artificial intelligence
CN106503231A (en) * 2016-10-31 2017-03-15 北京百度网讯科技有限公司 Searching method and device based on artificial intelligence
CN106874643A (en) * 2016-12-27 2017-06-20 中国科学院自动化研究所 Method and system for automatically building a knowledge base based on word vectors to assist diagnosis and treatment
CN106997376A (en) * 2017-02-28 2017-08-01 浙江大学 Question and answer sentence similarity calculation method based on multi-level features
CN106997376B (en) * 2017-02-28 2020-12-08 浙江大学 Question and answer sentence similarity calculation method based on multi-level features
WO2018157703A1 (en) * 2017-03-02 2018-09-07 腾讯科技(深圳)有限公司 Natural language semantic extraction method and device, and computer storage medium
US11113234B2 (en) 2017-03-02 2021-09-07 Tencent Technology (Shenzhen) Company Ltd Semantic extraction method and apparatus for natural language, and computer storage medium
CN107122413B (en) * 2017-03-31 2020-04-10 北京奇艺世纪科技有限公司 Keyword extraction method and device based on graph model
CN107122413A (en) * 2017-03-31 2017-09-01 北京奇艺世纪科技有限公司 Keyword extraction method and device based on graph model
CN107122451A (en) * 2017-04-26 2017-09-01 北京科技大学 Automatic construction method of legal document case classifier
CN107122451B (en) * 2017-04-26 2020-01-21 北京科技大学 Automatic construction method of legal document case classifier
CN107291690B (en) * 2017-05-26 2020-10-27 北京搜狗科技发展有限公司 Punctuation adding method and device and punctuation adding device
CN107291690A (en) * 2017-05-26 2017-10-24 北京搜狗科技发展有限公司 Punctuation adding method and device, and punctuation adding device
CN107291836A (en) * 2017-05-31 2017-10-24 北京大学 Chinese text summary acquisition method based on semantic relevancy model
CN107291836B (en) * 2017-05-31 2020-06-02 北京大学 Chinese text abstract obtaining method based on semantic relevancy model
CN107818080A (en) * 2017-09-22 2018-03-20 新译信息科技(北京)有限公司 Term recognition method and device
CN107729509A (en) * 2017-10-23 2018-02-23 中国电子科技集团公司第二十八研究所 Discourse similarity determination method based on implicit high-dimensional distributed feature representation
CN107729509B (en) * 2017-10-23 2020-07-07 中国电子科技集团公司第二十八研究所 Discourse similarity determination method based on implicit high-dimensional distributed feature representation
US10593422B2 (en) 2017-12-01 2020-03-17 International Business Machines Corporation Interaction network inference from vector representation of words
CN108280061B (en) * 2018-01-17 2021-10-26 北京百度网讯科技有限公司 Text processing method and device based on ambiguous entity words
CN108280061A (en) * 2018-01-17 2018-07-13 北京百度网讯科技有限公司 Text processing method and device based on ambiguous entity words
US11455542B2 (en) 2018-01-17 2022-09-27 Beijing Baidu Netcom Science And Technology Co., Ltd. Text processing method and device based on ambiguous entity words
CN108287916A (en) * 2018-02-11 2018-07-17 北京方正阿帕比技术有限公司 Resource recommendation method
CN109614617B (en) * 2018-06-01 2022-12-16 安徽省泰岳祥升软件有限公司 Word vector generation method and device supporting polarity differentiation and polysemy
CN109614617A (en) * 2018-06-01 2019-04-12 安徽省泰岳祥升软件有限公司 Word vector generation method and device supporting polarity differentiation and polysemy
CN108920455A (en) * 2018-06-13 2018-11-30 北京信息科技大学 Automatic evaluation method for automatically generated Chinese text
CN109558586B (en) * 2018-11-02 2023-04-18 中国科学院自动化研究所 Self-evidence scoring method, device and storage medium for information statements
CN109558586A (en) * 2018-11-02 2019-04-02 中国科学院自动化研究所 Self-evidence scoring method, device and storage medium for information statements
CN109543041A (en) * 2018-11-30 2019-03-29 安徽听见科技有限公司 Language model score generation method and device
CN109614538A (en) * 2018-12-17 2019-04-12 广东工业大学 Extraction method, device and equipment for agricultural product price data
CN109871530A (en) * 2018-12-28 2019-06-11 广州索答信息科技有限公司 Automatic extraction method and storage medium for seed words in the menu domain
CN109871530B (en) * 2018-12-28 2023-10-31 广州索答信息科技有限公司 Automatic extraction method and storage medium for seed words in the menu domain
CN111583910A (en) * 2019-01-30 2020-08-25 北京猎户星空科技有限公司 Model updating method and device, electronic equipment and storage medium
CN111583910B (en) * 2019-01-30 2023-09-26 北京猎户星空科技有限公司 Model updating method and device, electronic equipment and storage medium
CN109918654B (en) * 2019-02-21 2022-12-27 厦门一品威客网络科技股份有限公司 Logo paraphrasing method, device and medium
CN109918654A (en) * 2019-02-21 2019-06-21 北京一品智尚信息科技有限公司 Logo paraphrasing method, device and medium
US10922486B2 (en) 2019-03-13 2021-02-16 International Business Machines Corporation Parse tree based vectorization for natural language processing
CN110532547A (en) * 2019-07-31 2019-12-03 厦门快商通科技股份有限公司 Corpus construction method and apparatus, electronic device and medium
CN110765765B (en) * 2019-09-16 2023-10-20 平安科技(深圳)有限公司 Contract key term extraction method, device and storage medium based on artificial intelligence
CN110765765A (en) * 2019-09-16 2020-02-07 平安科技(深圳)有限公司 Contract key clause extraction method, device and storage medium based on artificial intelligence
CN111199154A (en) * 2019-12-20 2020-05-26 重庆邮电大学 Fault-tolerant rough set-based polysemous word expression method, system and medium
CN111199154B (en) * 2019-12-20 2022-12-27 重庆邮电大学 Fault-tolerant rough set-based polysemous word expression method, system and medium
CN111192682B (en) * 2019-12-25 2024-04-09 上海联影智能医疗科技有限公司 Image exercise data processing method, system and storage medium
CN111192682A (en) * 2019-12-25 2020-05-22 上海联影智能医疗科技有限公司 Image exercise data processing method, system and storage medium
CN111460169A (en) * 2020-03-27 2020-07-28 科大讯飞股份有限公司 Semantic expression generation method, device and equipment
CN111581952A (en) * 2020-05-20 2020-08-25 长沙理工大学 Large-scale replaceable word bank construction method for natural language information hiding
CN111581952B (en) * 2020-05-20 2023-10-03 长沙理工大学 Large-scale replaceable word library construction method for natural language information hiding
CN111611809B (en) * 2020-05-26 2023-04-18 西藏大学 Chinese sentence similarity calculation method based on neural network
CN111611809A (en) * 2020-05-26 2020-09-01 西藏大学 Chinese sentence similarity calculation method based on neural network
CN111694961A (en) * 2020-06-23 2020-09-22 上海观安信息技术股份有限公司 Keyword semantic classification method and system for sensitive data leakage detection
CN113377965A (en) * 2021-06-30 2021-09-10 中国农业银行股份有限公司 Method and related device for perceiving text keywords
CN113377965B (en) * 2021-06-30 2024-02-23 中国农业银行股份有限公司 Method and related device for sensing text keywords
CN114154513A (en) * 2022-02-07 2022-03-08 杭州远传新业科技有限公司 Automatic domain semantic web construction method and system
CN115168563A (en) * 2022-09-05 2022-10-11 深圳市华付信息技术有限公司 Airport service guiding method, system and device based on intention recognition

Similar Documents

Publication Publication Date Title
CN104391963A (en) Method for constructing correlation networks of keywords of natural language texts
CN104375989A (en) Natural language text keyword association network construction system
CN110704598B (en) Statement information extraction method, extraction device and readable storage medium
Spedicato Discrete Time Markov Chains with R.
Neelakantan et al. Efficient non-parametric estimation of multiple embeddings per word in vector space
CN104933183B (en) Query word improvement method fusing word vector model and naive Bayes
CN104199857B (en) Hierarchical tax document classification method based on multi-label classification
CN110458181A (en) Syntax dependency model, training method and analysis method based on width random forest
CN111241294A (en) Graph convolution network relation extraction method based on dependency analysis and key words
CN104834747A (en) Short text classification method based on convolutional neural network
Pilehvar et al. A robust approach to aligning heterogeneous lexical resources
CN110750640A (en) Text data classification method and device based on neural network model and storage medium
CN111581954B (en) Text event extraction method and device based on grammar dependency information
CN105930413A (en) Training method for similarity model parameters, search processing method and corresponding apparatuses
CN104008187A (en) Semi-structured text matching method based on the minimum edit distance
CN116796744A (en) Entity relation extraction method and system based on deep learning
CN106485370B (en) Information prediction method and apparatus
Bortnikova et al. Queries classification using machine learning for implementation in intelligent manufacturing
Avrachenkov et al. The fundamental matrix of singularly perturbed Markov chains
Tiwari et al. Next word prediction using deep learning
Han et al. Automatic business process structure discovery using ordered neurons LSTM: a preliminary study
Kumagai et al. Human-like natural language generation using monte carlo tree search
Omar Performance Evaluation Of Supervised Machine Learning Classifiers For Mapping Natural Language Text To Entity Relationship Models
JP2007072610A (en) Information processing method, apparatus and program
Zhang Comparing the Effect of Smoothing and N-gram Order: Finding the Best Way to Combine the Smoothing and Order of N-gram

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20150304