CN106776580A - Topic sentence recognition method using hybrid CNN and RNN deep neural networks - Google Patents
Topic sentence recognition method using hybrid CNN and RNN deep neural networks
- Publication number
- CN106776580A CN106776580A CN201710047031.6A CN201710047031A CN106776580A CN 106776580 A CN106776580 A CN 106776580A CN 201710047031 A CN201710047031 A CN 201710047031A CN 106776580 A CN106776580 A CN 106776580A
- Authority
- CN
- China
- Prior art keywords
- sentence
- word
- rnn
- cnn
- theme
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
Abstract
The method of the invention trains word vectors on the full-network news dataset from Sogou Labs, so that semantically similar words lie close together in vector space. It then crawls 600 travel notes each from the Baidu travel site and the Mafengwo travel site, segments the travel notes into sentences, and splits these sentences into a training set and a test set at a ratio of 8:2. For the training set, the information entropy value and mutual information value of each word are computed from the formulas for information entropy and mutual information. Next, for each sentence in the training set, features are constructed from the computed word vectors together with the computed entropy and mutual information, and fed as input to the hybrid deep neural network CNN_RNN that has been built, yielding the model parameters. Likewise, features are constructed for each sentence in the test set from the computed word vectors, entropy and mutual information and fed into CNN_RNN; the class is computed with the obtained parameters, the error between the gold-standard results and the predictions is obtained, and the performance is evaluated.
Description
Technical field
The present invention relates to the field of text mining, and more particularly to a topic sentence recognition method using hybrid CNN and RNN deep neural networks.
Background art
In recent years, with economic development, more and more people have begun travelling to enrich their cultural life. Indeed, travel not only helps people relax and makes them happier, it also broadens their horizons. According to figures released by the National Tourism Administration, the tourism industry already contributes more than 10% of GDP. Tourism has now become a very important part of daily life. In the Internet age, many people share their travel experiences in text form through microblogs and social networking sites.
In general, most of a travel note describes what was seen and heard while travelling, gives the author's views on the scenic spots, and offers some recommendations to later visitors, but scattered irrelevant content still appears. Identifying the topic sentences is essential for successfully mining tourism knowledge, because the irrelevant content adds a certain amount of noise to the results.
For example, in a travel note about Guangzhou on Mafengwo, someone writes: "Thank you for your attention and support. If you feel this article is worth sharing, please recommend it to your friends or WeChat groups, and share it in your own Moments." This is clearly not a description of what was seen and heard while travelling, so it is not a topic sentence, and such sentences undoubtedly add noise to text analysis. As another example, someone writes: "Flower City Square at nightfall is brilliantly lit, and one can gaze out at the 'Xiaomanyao' (Canton Tower); it is even more charming than by day." This clearly is a topic sentence: it describes the night view of Zhujiang New Town in Guangzhou. It is precisely the content of these topic sentences that is the focus of attention.
When making travel recommendations, note that travel notes about Guangzhou do not refer only to Guangzhou; they also mention what was seen and heard in the cities around Guangzhou, such as Hong Kong, Zhuhai and Shenzhen. Removing the descriptions of and comments on those sights is of great significance for later knowledge discovery, since one shortcoming of the LDA model is that it is rather sensitive to noise; that is, noise strongly affects its results.
Therefore, although most sentences in a travel note describe the scenic spots visited and comment on them, correctly identifying these topic sentences is a challenging open problem.
Summary of the invention
The present invention provides a topic sentence recognition method using hybrid CNN and RNN deep neural networks with improved recognition performance.
To achieve the above technical effect, the technical scheme of the invention is as follows:
A topic sentence recognition method using hybrid CNN and RNN deep neural networks comprises the following steps:
S1: train word vectors on the full-network news dataset from Sogou Labs, so that semantically similar words lie close together in vector space;
S2: crawl 600 travel notes each from the Baidu travel site and the Mafengwo travel site, segment the travel notes into sentences, split these sentences into a training set and a test set at a ratio of 8:2, and then, for the training set, compute the information entropy value and mutual information value of each word from the formulas for information entropy and mutual information;
S3: for each sentence in the training set, construct features from the word vectors computed in S1 and the entropy and mutual information computed in S2, feed them as input to the hybrid deep neural network CNN_RNN that has been built, and obtain the parameters;
S4: likewise construct features for each sentence in the test set from the word vectors computed in S1 and the entropy and mutual information computed in S2, feed them into the deep neural network CNN_RNN, compute the class of each sentence with the parameters obtained in S3, obtain the error between the gold-standard results and the predictions, and evaluate the performance.
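The 8:2 split described in S2 can be sketched as follows; the random shuffle and the fixed seed are illustrative assumptions, not requirements of the method:

```python
import random

def split_sentences(sentences, train_ratio=0.8, seed=42):
    """Split the sentence list into training and test sets at the
    8:2 ratio described in S2."""
    shuffled = sentences[:]
    random.Random(seed).shuffle(shuffled)  # shuffle a copy, keep the input intact
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```

With 50000 labelled sentences this yields 40000 training and 10000 test sentences.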
Further, the detailed process of step S1 is as follows:
S11: first download the full-network news dataset from Sogou Labs and clean it, obtaining each complete news article;
S12: segment the dataset into words and write it to a file, separating words with "\t" and news articles with "\n";
S13: call the word2vec tool in Python's gensim package to train the words without supervision and obtain their word-vector representations.
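A minimal sketch of S11–S13; the corpus file layout follows S12, while the file name and the gensim 4 hyperparameters (skip-gram, window 5, 200-dimensional vectors to match S31) are assumptions for illustration:

```python
def read_corpus(corpus_path):
    """Read the cleaned corpus of S12: one news article per line,
    words separated by "\t"."""
    articles = []
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            words = [w for w in line.rstrip("\n").split("\t") if w]
            if words:
                articles.append(words)
    return articles

def train_word_vectors(corpus_path):
    from gensim.models import Word2Vec  # lazy import: only needed for training
    # Unsupervised training as in S13; vector_size=200 matches the
    # 200-dimensional word vectors used in S31 (gensim 4 API assumed).
    return Word2Vec(read_corpus(corpus_path), vector_size=200,
                    window=5, min_count=1, sg=1)
```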
Further, the detailed process of step S2 is as follows:
S21: segment each sentence in the training set into words and remove stop words, obtaining a set of words for each sentence; count the occurrence frequency of each word in topic sentences and in non-topic sentences;
S22: compute the information entropy value IG of each word by the formula
IG = -K Σ_{i=1}^{n} p_i log p_i
where K is a coefficient, n is the number of classes, and p_i is the probability that the word appears in class i; at the same time, set a frequency threshold and ignore the value of any word with frequency below 3;
S23: compute the mutual information value of each word in the different classes by the formula
PMI(w, c) = log( p(w, c) / (p(w) p(c)) )
where, for the word "pleasure", p(pleasure, topic sentence) denotes the number of times "pleasure" appears in topic sentences, and likewise p(pleasure, non-topic sentence) denotes the number of times it appears in non-topic sentences;
the PMI value of each word is then computed as
PMI(pleasure) = PMI(pleasure, topic sentence) / PMI(pleasure, non-topic sentence).
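A minimal sketch of the word statistics in S22–S23, assuming natural logarithms and raw co-occurrence counts as probability estimates:

```python
import math

def information_entropy(class_counts, K=1.0):
    """IG = -K * sum(p_i * log p_i) over the classes a word appears in (S22)."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return -K * sum((c / total) * math.log(c / total)
                    for c in class_counts if c > 0)

def pmi(word_class_count, word_count, class_count, total):
    """Pointwise mutual information of a word with a class (S23):
    PMI(w, c) = log(p(w, c) / (p(w) * p(c)))."""
    p_wc = word_class_count / total
    p_w = word_count / total
    p_c = class_count / total
    return math.log(p_wc / (p_w * p_c))
```

For a word split evenly between the two classes, the entropy is log 2, its maximum for n = 2.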
Further, the detailed process of step S3 is as follows:
S31: from the preceding steps, each word has a 200-dimensional word vector plus its information entropy IG and mutual information value PMI, so 202 features in total. The word count of the longest sentence in the training set is taken as the standard; suppose, for example, that this sentence has 200 words, giving a 200*202 feature representation. A sentence with fewer than 200 words, say 100, actually has only 100*202 features, and the missing (200-100)*202 values are padded with 0;
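The zero-padding of S31 can be sketched as follows; the function name is illustrative:

```python
def build_sentence_matrix(word_features, max_len=200, feat_dim=202):
    """Pad (or truncate) a sentence's per-word feature rows into a fixed
    max_len x feat_dim matrix, as in S31.  word_features is a list of
    per-word feature vectors (200-dim word vector + IG + PMI = 202)."""
    matrix = [row[:feat_dim] + [0.0] * (feat_dim - len(row))
              for row in word_features[:max_len]]
    while len(matrix) < max_len:
        matrix.append([0.0] * feat_dim)  # zero rows for the missing words
    return matrix
```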
S32: the 200*202 features of each sentence vector first enter the convolutional layer, computed as
x_j^l = f(x^{l-1} * k_j^l + b_j^l)
where x_j^l denotes the j-th feature map of the l-th convolutional layer; the right-hand side convolves the output x^{l-1} of the previous layer with the j-th convolution kernel k_j^l, adds the bias vector b_j^l, and finally applies the activation function f;
S33: with the 200*202 input above, suppose one convolution kernel of size 3 is set; the output of S32 is then 198-dimensional. It is next fed into the pooling layer of the convolutional neural network CNN, computed as
p_j = max(x_j^l)
that is, the maximum of each 198-dimensional feature map above is taken, reducing it to 1 dimension; in practice, however, n feature maps are set for each sentence, so each sentence has n features;
S34: the result of the convolutional neural network CNN above gives n features for each sentence; these serve as input to the recurrent neural network RNN, whose hidden-node vector is computed as
h_t = f(x_t U + h_{t-1} W + b_t)
where x_t is the input, U is the input-to-hidden transformation, h_{t-1} is the hidden state of the previous step, W is the hidden-to-hidden transformation, b_t is the bias vector, and the activation function f is applied last;
S35: since an RNN mainly processes time series, classification is done at the final step; the output is computed as
o_t = softmax(h_t V + b_t)
where o_t is the output, V is the hidden-to-output transformation, and the softmax function is applied last;
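A scalar sketch of the recurrence in S34–S35 (the patent's x_t, h_t and the transformations U, W, V are vectors and matrices; scalars are used here only to show the shape of the formulas, with tanh assumed for f):

```python
import math

def rnn_step(x_t, h_prev, U, W, b, f=math.tanh):
    """One recurrent step: h_t = f(x_t*U + h_{t-1}*W + b), scalar case."""
    return f(x_t * U + h_prev * W + b)

def softmax(scores):
    """o_t = softmax(h_t V + b_t): normalize final-step scores into
    class probabilities (numerically stabilized by subtracting the max)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```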
S36: after the result is computed, it is compared with the true labels, the loss function is computed, and the parameters are adjusted step by step so that the loss function is minimized.
Further, the detailed process of step S4 is as follows:
S41: segment each sentence in the test set into words, remove stop words, and obtain the word vector, information entropy value and mutual information value of each word; sentences with fewer than 200 words are padded with 0;
S42: express each sentence in the 200*202 form, feed it into the CNN_RNN model, and obtain the class of each sentence;
S43: compare the model output with the gold-standard results and compute the precision, recall, F value and accuracy.
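The evaluation in S43 can be sketched as follows, assuming class 1 marks topic sentences:

```python
def classification_metrics(predicted, gold, positive=1):
    """Precision, recall, F1 and accuracy of predictions against gold labels."""
    tp = sum(1 for p, g in zip(predicted, gold) if p == positive and g == positive)
    fp = sum(1 for p, g in zip(predicted, gold) if p == positive and g != positive)
    fn = sum(1 for p, g in zip(predicted, gold) if p != positive and g == positive)
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = correct / len(gold) if gold else 0.0
    return precision, recall, f1, accuracy
```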
Compared with the prior art, the beneficial effects of the technical scheme of the invention are:
The method of the invention trains word vectors on the full-network news dataset from Sogou Labs, so that semantically similar words lie close together in vector space. It crawls 600 travel notes each from the Baidu travel site and the Mafengwo travel site, segments the travel notes into sentences, and splits these sentences into a training set and a test set at a ratio of 8:2; then, for the training set, the information entropy value and mutual information value of each word are computed from the formulas for information entropy and mutual information. Next, for each sentence in the training set, features are constructed from the computed word vectors, entropy and mutual information and fed as input to the hybrid deep neural network CNN_RNN that has been built, obtaining the parameters. Likewise, features are constructed for each sentence in the test set from the computed word vectors, entropy and mutual information and fed into the deep neural network CNN_RNN; the class is computed with the obtained parameters, the error between the gold-standard results and the predictions is obtained, and the performance is evaluated. Experiments show that the method achieves good recognition results.
Brief description of the drawings
Fig. 1 is a flow chart of the method of the invention;
Fig. 2 is a schematic diagram of the CNN_RNN model structure built by the invention;
Fig. 3 is a histogram of the impact of the proposed information-entropy feature on the classification results;
Fig. 4 is a histogram comparing the classification effect of the proposed model with traditional SVM and xgboost classifiers.
Detailed description of the embodiments
The accompanying drawings are for illustration only and should not be construed as limiting this patent;
To better illustrate the embodiment, some parts of the drawings are omitted, enlarged or reduced and do not represent the size of the actual product;
Those skilled in the art will understand that some well-known structures and their descriptions may be omitted from the drawings.
The technical scheme of the invention is further described below with reference to the drawings and embodiments.
Embodiment 1
As shown in Fig. 1, a topic sentence recognition method using hybrid CNN and RNN deep neural networks comprises the following steps:
S1: train word vectors on the full-network news dataset from Sogou Labs, so that semantically similar words lie close together in vector space;
S2: crawl 600 travel notes each from the Baidu travel site and the Mafengwo travel site, segment the travel notes into sentences, split these sentences into a training set and a test set at a ratio of 8:2, and then, for the training set, compute the information entropy value and mutual information value of each word from the formulas for information entropy and mutual information;
S3: for each sentence in the training set, construct features from the word vectors computed in S1 and the entropy and mutual information computed in S2, feed them as input to the hybrid deep neural network CNN_RNN that has been built (the model is shown in Fig. 2), and obtain the parameters;
S4: likewise construct features for each sentence in the test set from the word vectors computed in S1 and the entropy and mutual information computed in S2, feed them into the deep neural network CNN_RNN, compute the class of each sentence with the parameters obtained in S3, obtain the error between the gold-standard results and the predictions, and evaluate the performance.
Further, the detailed process of step S1 is as follows:
S11: first download the full-network news dataset from Sogou Labs and clean it, obtaining each complete news article;
S12: segment the dataset into words and write it to a file, separating words with "\t" and news articles with "\n";
S13: call the word2vec tool in Python's gensim package to train the words without supervision and obtain their word-vector representations.
Further, the detailed process of step S2 is as follows:
S21: segment each sentence in the training set into words and remove stop words, obtaining a set of words for each sentence; count the occurrence frequency of each word in topic sentences and in non-topic sentences;
S22: compute the information entropy value IG of each word by the formula
IG = -K Σ_{i=1}^{n} p_i log p_i
where K is a coefficient, n is the number of classes, and p_i is the probability that the word appears in class i; at the same time, set a frequency threshold and ignore the value of any word with frequency below 3;
S23: compute the mutual information value of each word in the different classes by the formula
PMI(w, c) = log( p(w, c) / (p(w) p(c)) )
where, for the word "pleasure", p(pleasure, topic sentence) denotes the number of times "pleasure" appears in topic sentences, and likewise p(pleasure, non-topic sentence) denotes the number of times it appears in non-topic sentences;
the PMI value of each word is then computed as
PMI(pleasure) = PMI(pleasure, topic sentence) / PMI(pleasure, non-topic sentence).
Further, the detailed process of step S3 is as follows:
S31: from the preceding steps, each word has a 200-dimensional word vector plus its information entropy IG and mutual information value PMI, so 202 features in total. The word count of the longest sentence in the training set is taken as the standard; suppose, for example, that this sentence has 200 words, giving a 200*202 feature representation. A sentence with fewer than 200 words, say 100, actually has only 100*202 features, and the missing (200-100)*202 values are padded with 0;
S32: the 200*202 features of each sentence vector first enter the convolutional layer, computed as
x_j^l = f(x^{l-1} * k_j^l + b_j^l)
where x_j^l denotes the j-th feature map of the l-th convolutional layer; the right-hand side convolves the output x^{l-1} of the previous layer with the j-th convolution kernel k_j^l, adds the bias vector b_j^l, and finally applies the activation function f;
S33: with the 200*202 input above, suppose one convolution kernel of size 3 is set; the output of S32 is then 198-dimensional. It is next fed into the pooling layer of the convolutional neural network CNN, computed as
p_j = max(x_j^l)
that is, the maximum of each 198-dimensional feature map above is taken, reducing it to 1 dimension; in practice, however, n feature maps are set for each sentence, so each sentence has n features;
S34: the result of the convolutional neural network CNN above gives n features for each sentence; these serve as input to the recurrent neural network RNN, whose hidden-node vector is computed as
h_t = f(x_t U + h_{t-1} W + b_t)
where x_t is the input, U is the input-to-hidden transformation, h_{t-1} is the hidden state of the previous step, W is the hidden-to-hidden transformation, b_t is the bias vector, and the activation function f is applied last;
S35: since an RNN mainly processes time series, classification is done at the final step; the output is computed as
o_t = softmax(h_t V + b_t)
where o_t is the output, V is the hidden-to-output transformation, and the softmax function is applied last;
S36: after the result is computed, it is compared with the true labels, the loss function is computed, and the parameters are adjusted step by step so that the loss function is minimized.
Further, the detailed process of step S4 is as follows:
S41: segment each sentence in the test set into words, remove stop words, and obtain the word vector, information entropy value and mutual information value of each word; sentences with fewer than 200 words are padded with 0;
S42: express each sentence in the 200*202 form, feed it into the CNN_RNN model, and obtain the class of each sentence;
S43: compare the model output with the gold-standard results and compute the precision, recall, F value and accuracy.
The method was tested as follows:
1. Experimental dataset: 1200 travel notes from Baidu Travel and Mafengwo;
2. Experimental environment: Python 2.7.9 and TensorFlow;
3. Experimental toolset: Python open-source toolboxes;
4. Experimental method: the crawled dataset contains 1200 relevant Guangzhou travel notes; each travel note is between 2000 and 20000 characters long and splits into between 20 and 500 sentences. Splitting the travel notes yields 50000 sentences in total, containing 100000 words altogether; they were manually labelled as topic sentences or non-topic sentences, so the number of classes is 2.
First, each word in each sentence obtains its corresponding 200-dimensional word vector from the trained word2vec model; vectors for words that never occurred are generated randomly. The mutual information value PMI and the information entropy value are computed for each word in the training set, producing for each sentence a vector of 202 values per word. Each sentence is standardized to 35 words, padded with 0 if shorter. These preprocessing steps have already been described in the model section and are not repeated here.
First the combined model is compared against the single models, the convolutional neural network CNN and the recurrent neural network RNN, and the effect of adding the information entropy and mutual information features is compared separately. Fig. 3 mainly presents the evaluation of the different models; for convenience, the RNN is a two-layer stack of LSTM units.
5. Evaluation criteria: precision, recall, F value, accuracy.
6. Experimental results: as shown in Fig. 4, to compare the model with traditional classifiers, the SVM algorithm and the xgboost algorithm were used for comparison. Among traditional classifiers, SVM and xgboost are respectively known as the best single classifier and the best ensemble classifier.
It can be seen that the proposed model achieves very good results compared with the other models.
The same or similar reference signs correspond to the same or similar parts;
The positional relationships described in the drawings are for illustration only and should not be construed as limiting this patent;
Obviously, the above embodiment of the invention is merely an example given for the sake of clarity and is not a restriction on the embodiments of the invention. For those of ordinary skill in the art, other changes in different forms can also be made on the basis of the above description. There is no need, and no way, to exhaust all the implementations. Any modification, equivalent replacement and improvement made within the spirit and principle of the invention shall be included within the protection scope of the claims of the invention.
Claims (5)
1. A topic sentence recognition method using hybrid CNN and RNN deep neural networks, characterized by comprising the following steps:
S1: train word vectors on the full-network news dataset from Sogou Labs, so that semantically similar words lie close together in vector space;
S2: crawl 600 travel notes each from the Baidu travel site and the Mafengwo travel site, segment the travel notes into sentences, split these sentences into a training set and a test set at a ratio of 8:2, and then, for the training set, compute the information entropy value and mutual information value of each word from the formulas for information entropy and mutual information;
S3: for each sentence in the training set, construct features from the word vectors computed in S1 and the entropy and mutual information computed in S2, feed them as input to the hybrid deep neural network CNN_RNN that has been built, and obtain the parameters;
S4: likewise construct features for each sentence in the test set from the word vectors computed in S1 and the entropy and mutual information computed in S2, feed them into the deep neural network CNN_RNN, compute the class with the parameters obtained in S3, obtain the error between the gold-standard results and the predictions, and evaluate the performance.
2. The topic sentence recognition method using hybrid CNN and RNN deep neural networks according to claim 1, characterized in that the detailed process of step S1 is as follows:
S11: first download the full-network news dataset from Sogou Labs and clean it, obtaining each complete news article;
S12: segment the dataset into words and write it to a file, separating words with "\t" and news articles with "\n";
S13: call the word2vec tool in Python's gensim package to train the words without supervision and obtain their word-vector representations.
3. The topic sentence recognition method using hybrid CNN and RNN deep neural networks according to claim 2, characterized in that the detailed process of step S2 is as follows:
S21: segment each sentence in the training set into words and remove stop words, obtaining a set of words for each sentence; count the occurrence frequency of each word in topic sentences and in non-topic sentences;
S22: compute the information entropy value IG of each word by the formula
IG = -K Σ_{i=1}^{n} p_i log p_i
where K is a coefficient, n is the number of classes, and p_i is the probability that the word appears in class i; at the same time, set a frequency threshold and ignore the value of any word with frequency below 3;
S23: compute the mutual information value of each word in the different classes by the formula
PMI(w, c) = log( p(w, c) / (p(w) p(c)) )
where, for the word "pleasure", p(pleasure, topic sentence) denotes the number of times "pleasure" appears in topic sentences, and likewise p(pleasure, non-topic sentence) denotes the number of times it appears in non-topic sentences;
the PMI value of each word is then computed as
PMI(pleasure) = PMI(pleasure, topic sentence) / PMI(pleasure, non-topic sentence).
4. the theme line recognition methods of the deep neural network CNN and RNN of mixing according to claim 3, its feature exists
In the detailed process of the step S3 is as follows:
S31:According to 200 dimensions of term vector for drawing each word before, comentropy IG and association relationship PMI, therefore each word
202 features altogether, in training set, in selecting that maximum sentence word number as standard, such as, in this sentence
There are 200 words.So just there is 200*202 character representation, it is inadequate for word in sentence, that is to say, that the word number in sentence
Less than 200,100 words for example, then actually have 100*202 feature, not enough use 0 is supplemented and is also accomplished by
(200-100) * 202 values;
S32:For each sentence vector, 200*202 feature is obtained, first, be introduced into convolutional neural networks layer, calculate public
Formula:
WhereinJ-th characteristic pattern of l convolutional layers is represented, the right represents the result of last layerWith j-th convolution kernelCarry out
Convolution, and plus bias vectorIt is eventually adding activation primitive;
S33:By above-mentioned input 200*202 vectors, it is assumed that set a convolution kernel, convolution kernel size is 3, then passed through
The output of S32 is 198 dimensions, then, is input to pond layer in convolutional neural networks CNN, and its computing formula is:
Herein, it is 198 dimensions to each above-mentioned characteristic pattern, takes maximum, reforms into 1 dimension, but it is in fact, right
Each sentence sets n characteristic pattern, therefore, each sentence has n features;
S34:For the result of above-mentioned convolutional neural networks CNN, n feature is formed to each sentence, it is refreshing using this as circulation
Through the input of network RNN, the vector of concealed nodes is calculated, computing formula is:
ht=f (xtU+ht-1W+bt)
Wherein, xtIt is input, U is enter into the conversion of concealed nodes, ht-1The concealed nodes of last layer are represented, W represents hidden layer
To the conversion of hidden layer, b is bias vector, finally plus f activation primitives;
S35:It is time series models because RNN master is to be processed, therefore it is sorted in final step, the computing formula of output
It is:
ot=soft max (htV+bt)
otOutput is represented, wherein V is the conversion for representing hidden layer to output layer, finally plus softmax functions;
S36: after the result is computed, the output is compared with the true labels to compute the loss function, and the parameters are then adjusted step by step so that the loss function is minimized.
5. The theme line recognition method of the mixed deep neural networks CNN and RNN according to claim 4, characterized in that the detailed process of step S4 is as follows:
S41: for each sentence in the test set, perform word segmentation, remove stop words, and obtain the word vector, information entropy and mutual information value of each word; sentences with fewer than 200 words are padded with 0;
S42: represent each sentence in 200*202 form, input it into the CNN_RNN model, and obtain the class of each sentence;
S43: compare the results output by the model with the gold-standard results, and compute the precision, recall, F-measure and accuracy.
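The evaluation in S43 can be sketched with the standard definitions of these measures; the function name and the tiny example labels are illustrative:

```python
def classification_metrics(pred, gold, positive=1):
    """Precision, recall, F-measure and accuracy from predicted vs. gold labels (S43)."""
    tp = sum(p == positive and g == positive for p, g in zip(pred, gold))
    fp = sum(p == positive and g != positive for p, g in zip(pred, gold))
    fn = sum(p != positive and g == positive for p, g in zip(pred, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(p == g for p, g in zip(pred, gold)) / len(gold)
    return precision, recall, f1, accuracy

p, r, f1, acc = classification_metrics([1, 0, 1, 1], [1, 0, 0, 1])
print(p, r, f1, acc)  # precision 2/3, recall 1.0, F-measure ~0.8, accuracy 0.75
```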
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710047031.6A CN106776580A (en) | 2017-01-20 | 2017-01-20 | The theme line recognition methods of the deep neural network CNN and RNN of mixing |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106776580A true CN106776580A (en) | 2017-05-31 |
Family
ID=58943831
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710047031.6A Pending CN106776580A (en) | 2017-01-20 | 2017-01-20 | The theme line recognition methods of the deep neural network CNN and RNN of mixing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106776580A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102567308A (en) * | 2011-12-20 | 2012-07-11 | 上海电机学院 | Information processing feature extracting method |
CN104063472A (en) * | 2014-06-30 | 2014-09-24 | 电子科技大学 | KNN text classifying method for optimizing training sample set |
CN104572892A (en) * | 2014-12-24 | 2015-04-29 | 中国科学院自动化研究所 | Text classification method based on cyclic convolution network |
Non-Patent Citations (1)
Title |
---|
DEPENG LIANG: "AC-BLSTM:Asymmetric Convolutional Bidirectional LSTM Networks for Text Classification", 《HTTPS://ARXIV.ORG/ABS/1611.01884V1》 * |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107506350A (en) * | 2017-08-16 | 2017-12-22 | 京东方科技集团股份有限公司 | A kind of method and apparatus of identification information |
CN108268613A (en) * | 2017-12-29 | 2018-07-10 | 广州都市圈网络科技有限公司 | Tour schedule generation method, electronic equipment and storage medium based on semantic analysis |
WO2019164078A1 (en) * | 2018-02-23 | 2019-08-29 | (주)에어사운드 | Real-time multi-language interpretation wireless transmitting and receiving system capable of extracting topic sentence and transmitting and receiving method using same |
CN108491515A (en) * | 2018-03-26 | 2018-09-04 | 中国科学技术大学 | A kind of sentence pair matching degree prediction technique for campus psychological consultation |
CN108491515B (en) * | 2018-03-26 | 2021-10-01 | 中国科学技术大学 | Sentence pair matching degree prediction method for campus psychological consultation |
CN108647263A (en) * | 2018-04-28 | 2018-10-12 | 淮阴工学院 | A kind of network address method for evaluating confidence crawled based on segmenting web page |
CN108647263B (en) * | 2018-04-28 | 2022-04-12 | 淮阴工学院 | Network address confidence evaluation method based on webpage segmentation crawling |
CN109271989A (en) * | 2018-09-03 | 2019-01-25 | 广东电网有限责任公司东莞供电局 | Automatic handwritten test data identification method based on CNN and RNN models |
CN109472020A (en) * | 2018-10-11 | 2019-03-15 | 重庆邮电大学 | A kind of feature alignment Chinese word cutting method |
CN109472020B (en) * | 2018-10-11 | 2022-07-01 | 重庆邮电大学 | Feature alignment Chinese word segmentation method |
WO2020074023A1 (en) * | 2018-10-12 | 2020-04-16 | 北京大学第三医院 | Deep learning-based method and device for screening for key sentences in medical document |
CN109472021A (en) * | 2018-10-12 | 2019-03-15 | 北京诺道认知医学科技有限公司 | Critical sentence screening technique and device in medical literature based on deep learning |
CN109711253A (en) * | 2018-11-19 | 2019-05-03 | 国家电网有限公司 | Ammeter technique for partitioning based on convolutional neural networks and Recognition with Recurrent Neural Network |
CN110222328A (en) * | 2019-04-08 | 2019-09-10 | 平安科技(深圳)有限公司 | Participle and part-of-speech tagging method, apparatus, equipment and storage medium neural network based |
CN110222328B (en) * | 2019-04-08 | 2022-11-22 | 平安科技(深圳)有限公司 | Method, device and equipment for labeling participles and parts of speech based on neural network and storage medium |
CN112100367A (en) * | 2019-05-28 | 2020-12-18 | 贵阳海信网络科技有限公司 | Public opinion early warning method and device for scenic spot |
CN110502898A (en) * | 2019-07-31 | 2019-11-26 | 深圳前海达闼云端智能科技有限公司 | Method, system, device, storage medium and the electronic equipment of the intelligent contract of audit |
CN111542012A (en) * | 2020-04-28 | 2020-08-14 | 南昌航空大学 | Human body tumbling detection method based on SE-CNN |
CN111542012B (en) * | 2020-04-28 | 2022-05-03 | 南昌航空大学 | Human body tumbling detection method based on SE-CNN |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106776580A (en) | The theme line recognition methods of the deep neural network CNN and RNN of mixing | |
CN107578775B (en) | Multi-classification voice method based on deep neural network | |
CN112199608B (en) | Social media rumor detection method based on network information propagation graph modeling | |
CN109447140A (en) | A method of the image recognition based on neural network deep learning simultaneously recommends cognition | |
CN107153642A (en) | A kind of analysis method based on neural network recognization text comments Sentiment orientation | |
CN108038205B (en) | Viewpoint analysis prototype system for Chinese microblogs | |
CN107015963A (en) | Natural language semantic parsing system and method based on deep neural network | |
CN103729459A (en) | Method for establishing sentiment classification model | |
CN110569920B (en) | Prediction method for multi-task machine learning | |
CN108304493B (en) | Hypernym mining method and device based on knowledge graph | |
Seneviratne et al. | DALLE-URBAN: Capturing the urban design expertise of large text to image transformers | |
CN113392197B (en) | Question-answering reasoning method and device, storage medium and electronic equipment | |
As et al. | Artificial intelligence in urban planning and design: Technologies, implementation, and impacts | |
Birhane | Automating ambiguity: Challenges and pitfalls of artificial intelligence | |
CN113934846B (en) | Online forum topic modeling method combining behavior-emotion-time sequence | |
Nguyen et al. | Emotion analysis using multilayered networks for graphical representation of tweets | |
CN111598252A (en) | University computer basic knowledge problem solving method based on deep learning | |
CA2895121A1 (en) | Systems and methods for analyzing and deriving meaning from large scale data sets | |
Kim et al. | Constructing and evaluating a novel crowdsourcing-based paraphrased opinion spam dataset | |
Indriyanti et al. | K-means method for clustering learning classes | |
Shukla et al. | Deep Learning in Neural Networks: An Overview | |
CN104346327A (en) | Method and device for determining emotion complexity of texts | |
Sun et al. | Urban region function mining service based on social media text analysis | |
Adel et al. | Towards socially intelligent automated tutors: Predicting learning style dimensions from conversational dialogue | |
Wijenayake et al. | Deep LSTM for Generating Brand Personalities Using Social Media: A Case Study from Higher Education Institutions |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||