CN109189925A - Word vector model based on point mutual information and text classification method based on CNN - Google Patents

Word vector model based on point mutual information and text classification method based on CNN

Info

Publication number
CN109189925A
CN109189925A
Authority
CN
China
Prior art keywords
word vector
text
vector
model
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810938236.8A
Other languages
Chinese (zh)
Other versions
CN109189925B (en)
Inventor
李万理
吴海明
薛云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Airlines Intellectual Property Services Ltd
Nanjing Silicon Intelligence Technology Co Ltd
Original Assignee
South China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China Normal University filed Critical South China Normal University
Priority to CN201810938236.8A priority Critical patent/CN109189925B/en
Publication of CN109189925A publication Critical patent/CN109189925A/en
Application granted granted Critical
Publication of CN109189925B publication Critical patent/CN109189925B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/243: Classification techniques relating to the number of classes
    • G06F18/24323: Tree-organised classifiers

Abstract

The present invention discloses a word vector model based on point mutual information and a text classification method based on CNN. The method comprises: (1) training the word vector model by the global word vector method based on point mutual information; (2) determining the word vector matrix of a text according to the trained word vector model; (3) extracting features from the word vector matrix by CNN and training a classification model; (4) extracting features of the input text according to the trained word vector model and the CNN feature extraction model; (5) computing, from the text features produced by the CNN feature extraction model, the mapping distance between the text and each preset category by softmax and cross entropy, and taking the nearest category as the category of the text. The method overcomes the deficiencies of GloVe word vectors in semantic capture and co-occurrence statistics, reduces the complexity of model training, can accurately mine the classification features of texts, and is suitable for text classification in various fields, so it has great practical value.

Description

Word vector model based on point mutual information and text classification method based on CNN
Technical field
The present invention relates to the field of text classification in natural language processing, and specifically to a word vector model based on point mutual information and a text classification method based on CNN (convolutional neural networks).
Background technique
With the development of Internet technology, the amount of data on the World Wide Web grows daily, and a large proportion of it is text data touching every sector of society. Faced with text data of such enormous volume, how to classify it rationally has become an important research problem. Rational, mechanized classification of text can help people solve many difficulties in many settings, for example distinguishing junk information and discovering deceptive information. In recent years, to accomplish text classification, the representation of text has become all-important: a reasonable text representation yields accurate semantic information about the text.
1. Word vector technical background
In natural language representation, the vectorized representation of words is an important basic technique. The traditional representation creates a dictionary and numbers each word in order, i.e. one-hot encoding. This representation cannot capture the semantic similarity between words and easily suffers from the curse of dimensionality. For this reason, Hinton [1] proposed the distributed representation of word vectors in 1986, which represents a word by a vector of fixed dimension and expresses the semantic distance between words by the distance between their vectors; besides reducing dimensionality, it bridges the semantic gap between words, so that semantic relations between words are better described. As research deepened, Bengio proposed the neural network language model, which obtains word vectors in passing: the network performs unsupervised learning, captures the contextual relations between words, and trains the word vectors as parameters together with the network model. Bengio's model is effective but computationally expensive; to reduce the complexity, Mikolov proposed the improved model word2vec on this basis and obtained better results, while reducing the complexity of the model from n × V to n × log₂V, making the training of large-scale word vectors much more efficient. Although Mikolov achieved good results in word representation, many scholars continued to probe deeper into language representation. Among them, the GloVe model proposed by Pennington refines the objective function of word2vec and introduces global statistical information, the co-occurrence matrix, into word vector training, outperforming word2vec in several experiments.
To remedy the deficiencies of the GloVe model in semantic capture and co-occurrence statistics, the present method introduces point mutual information on the basis of that model, constructs a global point mutual information matrix, and trains the final word vectors. Experiments on several semantics-related data sets show that word vectors based on the global point mutual information matrix express semantic relations better. The main contributions are: 1. the global point mutual information matrix is introduced into word vector computation, making the statistical information of the word vectors more accurate; 2. the objective function of the GloVe model is improved and the truncation operation is removed, which significantly reduces the computation required for model training.
2. Classification method technical background
Common text classification techniques divide into methods based on a sentiment dictionary, methods based on machine learning, and methods based on deep learning. The main application characteristics of these methods are as follows:
1) Text classification methods based on a sentiment dictionary
Methods based on a sentiment dictionary use existing semantic dictionary resources to construct a domain lexicon, then compare the positive and negative sentiment words contained in the text and mark positive and negative integer values as sentiment values, while also considering the influence of special part-of-speech rules and syntactic structures, such as negation, progressive sentences and adversative sentences, on sentiment judgment.
Text classification based on a sentiment dictionary is easy to implement, but it requires a fairly large sentiment dictionary, and it is a linear model of limited capacity.
2) Text classification methods based on machine learning
Text classification based on machine learning hinges on three elements: feature selection, feature weight quantization, and the classification model. Feature selection methods include those based on information gain and on document frequency. Feature weighting schemes include term frequency, inverse document frequency, TF-IDF, and entropy weighting. Classifier models include naive Bayes, k-nearest neighbors, support vector machines, neural networks, decision trees, and so on. Three of these classifier models are described below.
(a) The support vector machine method
Classification problems may be linearly inseparable. The support vector machine method maps a sample set that is linearly inseparable in a low-dimensional space into a high-dimensional feature space where it becomes linearly separable, and finds in that high-dimensional space the optimal hyperplane that maximizes the class margin, finally classifying the samples.
Support vector machines nevertheless have some disadvantages:
First, the algorithm is difficult to apply to large-scale training samples, because it solves a quadratic programming problem whose solution involves computations on an m-th order matrix, where m is the number of samples. When m is large, storing and manipulating this matrix consumes a great deal of machine memory and computation time.
Second, multi-class problems are difficult to solve with support vector machines: the classical support vector machine algorithm only provides a two-class algorithm.
(b) The decision tree method
A decision tree is a tree structure usable for classification, composed of nodes and branches. Decision tree learning essentially induces a set of classification rules from the training data. The algorithm typically selects an optimal feature recursively and splits the training set by that feature, so that each sub-data set obtains a best classification.
Decision trees also have some disadvantages:
First, decision tree algorithms overfit very easily, so generalization is weak;
Second, a small change in the samples can cause a drastic change in the tree structure;
Third, some more complex relationships, such as exclusive-or, are difficult for a decision tree to learn.
(c) The neural network method
Neural network methods can also solve nonlinear classification problems. A neural network model generally contains a large number of interconnected neurons, each processing the weighted inputs of its neighboring neurons through an activation function. The model is trained on a large amount of data, adjusting the weights to optimize the value of a cost function so that classification performance is best, finally classifying the samples.
Neural networks also have some disadvantages:
First, when facing big data, features of the raw data must be extracted manually as input, whereas deep learning can select features of the raw data automatically;
Second, approximating complicated functions more accurately requires more hidden layers, which easily produces vanishing or exploding gradients;
Third, they cannot handle time series data (such as audio or text), because a plain neural network contains no time parameter.
3) Text classification methods based on deep learning
Deep learning refers to deep neural network models, generally network structures with three or more layers.
Some deep learning models also remedy deficiencies of plain neural network models. For example, the weight sharing of convolutional neural networks (CNN) greatly reduces the number of trained parameters, and recurrent neural networks (RNN) and long short-term memory networks (LSTM) can handle time series data. Two deep learning models are described below.
(a) The RNN method
A recurrent neural network is a neural network whose node connections form a directed cycle. In its network structure, the hidden layers of several simple neural networks are chained end to end along the time sequence.
However, during training of an RNN model, the back-propagated error can vanish or explode, so the model cannot establish dependencies over long time sequences; that is, an RNN model can only capture dependencies over short time sequences.
(b) The LSTM method
The LSTM model was proposed to overcome the inability of RNN models to establish long-time-sequence dependencies. Compared with an RNN network, it mainly differs in the internal structure of the hidden layer: the hidden layer of a standard RNN has only one activation function, while the hidden layer of an LSTM has a more complex network structure that also contains three gates, the input gate, the forget gate and the output gate.
However, LSTM can only avoid the vanishing gradient of RNN; it cannot combat the exploding gradient problem.
Summary of the invention
The object of the present invention is to overcome the above deficiencies of the prior art and to provide a word vector model based on point mutual information and a text classification method based on CNN.
The object of the present invention is achieved through the following technical solutions.
A Chinese text classification method based on a global point-mutual-information word vector model, comprising: (S1) training the word vector model by the global word vector method based on point mutual information;
(S2) determining the word vector matrix of the text according to the trained word vector model;
(S3) extracting features from the word vector matrix by a convolutional neural network (CNN) and training a classification model; (S4) extracting features of the input text according to the trained word vector model and the CNN feature extraction model;
(S5) computing, from the text features produced by the CNN feature extraction model, the mapping distance between the text and each preset category by softmax and cross entropy, and taking the nearest category as the category of the text.
Further, step (S1) specifically includes:
(1) inputting the Chinese Wikipedia data set and pre-processing the data, removing punctuation marks and spaces;
(2) performing word segmentation on the data set obtained in step (1) with a Chinese word segmentation tool, converting the corpus data into word sequences;
(3) counting word frequencies of the words obtained in step (2) and saving the statistics to disk in the format "word \t frequency";
(4) counting co-occurrences of the words obtained in step (2): the corpus is traversed with the configured window size, the number of co-occurrences of every two words within the window is obtained, and the result is saved to disk as triples in the format "word 1 \t word 2 \t co-occurrence count";
(5) shuffling the triples obtained in step (4) at random and saving the shuffled triples to disk in the format "word 1 \t word 2 \t co-occurrence count";
(6) randomly initializing a word vector for every word appearing in step (2) and keeping the vectors in memory, where the program can conveniently read and modify them;
(7) traversing all the triples obtained in step (5) and adjusting the word vectors by gradient descent according to the objective function

$$J = \sum_{i,k \in V}\left(w_i^{T}\tilde{w}_k - \mathrm{PMI}_{ik}\right)^{2}$$

where $w_i$ and $\tilde{w}_k$ are the word vectors of the center word and the context word respectively, and $V$ denotes all word vectors in the vocabulary;
(8) repeating step (7) until the result converges, which yields the word vectors based on point mutual information; the word vectors in memory are then saved to disk in the format "word \t word vector". (A Python sketch of steps (2) to (8) follows below.)
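As a concrete illustration of steps (2) to (8), the following minimal Python sketch builds the windowed co-occurrence counts, converts them to point mutual information, and fits center and context vectors by stochastic gradient descent on the objective above. It is a simplified reconstruction, not the patented implementation: the toy corpus, window size, dimension, learning rate and epoch count are placeholder assumptions, and saving to disk and the convergence test are omitted.

```python
import math
import random
from collections import Counter

# Toy segmented corpus standing in for the Chinese Wikipedia data (assumption).
corpus = [["我们", "训练", "词", "向量"], ["词", "向量", "表示", "语义"]]
window, dim, lr, epochs = 2, 10, 0.05, 50  # placeholder hyper-parameters

# Steps (3)-(4): word frequencies and windowed co-occurrence counts.
freq = Counter(w for sent in corpus for w in sent)
cooc = Counter()
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                cooc[(w, sent[j])] += 1

total = sum(freq.values())
pair_total = sum(cooc.values())

def pmi(a, b):
    # PMI_ik = log( P(w_i, w~_k) / (P(w_i) * P(w~_k)) )
    return math.log((cooc[(a, b)] / pair_total)
                    / ((freq[a] / total) * (freq[b] / total)))

# Step (6): random initialization of center (W) and context (C) vectors.
W = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in freq}
C = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in freq}

# Steps (5), (7), (8): shuffle the triples, then gradient descent on
# J = sum over pairs of (w_i . w~_k - PMI_ik)^2, for a fixed epoch count.
triples = list(cooc.items())
for _ in range(epochs):
    random.shuffle(triples)
    for (a, b), _count in triples:
        err = sum(x * y for x, y in zip(W[a], C[b])) - pmi(a, b)
        for d in range(dim):
            wa, cb = W[a][d], C[b][d]
            W[a][d] -= lr * 2 * err * cb
            C[b][d] -= lr * 2 * err * wa
```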
Further, in steps (5) to (8), for two words $w_i$ and $w_j$ that possess similar contexts, the relationship between $w_i$ and $w_j$ can be embodied through their relationship with a third word $\tilde{w}_k$; the relationship between $w_i$ and $w_j$ is modeled as:

$$F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}} \quad (1)$$

In this equation, $w_i$ and $w_j$ denote the two center words possessing similar contexts, $\tilde{w}_k$ is a context word vector, and $P_{ik}$ is the probability that $w_i$ and $\tilde{w}_k$ occur together.
The ratio on the right of the equation is the model output and represents the relationship between the words to be predicted. On the premise of keeping the output of the initial model unchanged, the input of the initial model is simplified so that an optimizable objective function can be established. Considering that the vector space has an intrinsic linear structure, the input function is restricted to depend only on the difference of the two center word vectors, which gives the following formula:

$$F(w_i - w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$$

Since the right side of the equation is a scalar, the input vectors could be converted into a scalar by some complicated linear or nonlinear transformation; but this would undoubtedly increase the complexity of the model and damage its linear structure. To avoid this situation, the vector operation is carried out in the form of a dot product to portray the relationship between the two words, as in the following formula:

$$F\big((w_i - w_j)^{T}\tilde{w}_k\big) = \frac{P_{ik}}{P_{jk}}$$

In order to convert the left side of the equation into the form of a ratio as well, combined with the continuity condition, the general solution of the functional equation on the left side is $F(x) = e^{ax}$; considering that the norms of the word vectors can be normalized, $F(x) = e^{x}$ is taken directly, so that:

$$\frac{e^{w_i^{T}\tilde{w}_k}}{e^{w_j^{T}\tilde{w}_k}} = \frac{P_{ik}}{P_{jk}} \quad (2)$$

Letting the numerators and the denominators of formulas (1) and (2) be equal to each other, it can be obtained that:

$$e^{w_i^{T}\tilde{w}_k} = P_{ik}$$

that is:

$$w_i^{T}\tilde{w}_k = \log P_{ik}$$

The objective function of the GloVe model is modified accordingly, replacing the co-occurrence count $X_{ik}$ in the original objective with the point mutual information

$$\mathrm{PMI}_{ik} = \log\frac{P(w_i, \tilde{w}_k)}{P(w_i)\,P(\tilde{w}_k)}$$

The objective function of the GloVe model:

$$J = \sum_{i,k \in V} f(X_{ik})\left(w_i^{T}\tilde{w}_k + b_i + \tilde{b}_k - \log X_{ik}\right)^{2}$$

The final objective function of the word vector model:

$$J = \sum_{i,k \in V}\left(w_i^{T}\tilde{w}_k - \mathrm{PMI}_{ik}\right)^{2}$$
After training by gradient descent, the word vector of the target word is obtained as $w_i$; repeating the above operation for each word yields the word vectors of all words.
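For concreteness, the gradients used in the descent step follow routinely from the objective above; they are supplied here as an aid and are not written out in the original text:

$$\frac{\partial J}{\partial w_i} = 2\sum_{k \in V}\left(w_i^{T}\tilde{w}_k - \mathrm{PMI}_{ik}\right)\tilde{w}_k, \qquad \frac{\partial J}{\partial \tilde{w}_k} = 2\sum_{i \in V}\left(w_i^{T}\tilde{w}_k - \mathrm{PMI}_{ik}\right)w_i$$

so that each update moves $w_i \leftarrow w_i - \eta\,\partial J/\partial w_i$ for a learning rate $\eta$.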
Further, step (S2) specifically includes:
Based on the word vector data stored by the global point-mutual-information word vector model, the word vector $w \in R^{d \times 1}$ corresponding to each word in the training corpus is found by matching (a word vector is a one-dimensional vector of length $d$), and these word vectors are combined, in the order of the original sentence $(w_1, w_2, w_3, \dots, w_s)$, into the sentence matrix $S_0$ ($S_0 \in R^{d \times s}$), where $d$ is the dimension of the word vectors and $s$ is the word count of the longest sentence in the corpus, i.e. the sentence length. (A small Python sketch of this lookup-and-padding step follows below.)
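As an illustration of step (S2), the following sketch, assuming NumPy and a hypothetical dictionary of trained word vectors, looks up each word of a sentence and zero-pads the result into the d × s sentence matrix $S_0$:

```python
import numpy as np

def sentence_matrix(words, vectors, d, s):
    """Stack the trained d-dimensional vectors of `words` column-wise into a
    d x s matrix S0; sentences shorter than s are completed by zero padding."""
    S0 = np.zeros((d, s))
    for col, w in enumerate(words[:s]):
        if w in vectors:          # out-of-vocabulary words stay zero columns
            S0[:, col] = vectors[w]
    return S0

# Hypothetical usage with placeholder vectors:
vecs = {"词": np.ones(4), "向量": np.full(4, 0.5)}
S0 = sentence_matrix(["词", "向量"], vecs, d=4, s=6)  # shape (4, 6)
```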
Further, step (S3) specifically includes:
(1) representing the classification labels in the Chinese Wikipedia data set as Boolean vectors $y_j \in Y$ ($j \in \{1, 2, \dots, l\}$), where $y_j$ is the label vector of the $j$-th class, $Y$ is the set of classes, $l$ is the total number of classes, and the dimension of $y_j$ is $l$;
(2) based on the obtained sentence matrix $S_0$ ($S_0 \in R^{d \times s}$), performing convolution with kernels of preferred size 3 × 2 to obtain the feature matrix $S_1$;
(3) applying 2 × 2 max pooling to the obtained feature matrix $S_1$, extracting the maximum of each 2 × 2 block and recombining the maxima into a new feature matrix $S_2$;
(4) repeating steps (2) and (3) until the feature matrix $S_n$ (where $n$ is the total number of convolution and pooling operations performed) contains only $l$ numbers;
(5) unfolding the feature matrix $S_n$ into a one-dimensional vector $\bar{y}$ of length $l$, then computing the probability $\hat{y}_i$ of text $i$ in each dimension by the softmax function, and finally computing by the cross-entropy function the distance $d_i$ between $\hat{y}_i$ and the correct category $y_i$ (i.e. the label vector of the class to which the text itself belongs);
(6) accumulating the $d_i$ and training the model with the objective of minimizing $\frac{1}{M}\sum_{i=1}^{M} d_i$ (where $M$ is the number of texts in the whole training corpus), and saving the parameters of the resulting CNN classification model. (A PyTorch training sketch follows below.)
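The sketch below illustrates steps (2) to (6) in PyTorch, an implementation choice not named by the patent. The sizes d = 20, s = 18, l = 9 are assumptions chosen so that two rounds of 3 × 2 convolution and 2 × 2 max pooling end in exactly l values; the ReLU activation stands in for the unspecified activation function f(·), and the random batch and SGD hyper-parameters are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, s, l = 20, 18, 9  # assumed sizes: (20, 18) -> (9, 8) -> (3, 3) = 9 values

class CNNClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 1, kernel_size=(3, 2))  # step (2): 3x2 kernel
        self.conv2 = nn.Conv2d(1, 1, kernel_size=(3, 2))
        self.pool = nn.MaxPool2d(2)                       # step (3): 2x2 max pool

    def forward(self, x):                         # x: (batch, 1, d, s)
        x = self.pool(torch.relu(self.conv1(x)))  # S1 -> S2
        x = self.pool(torch.relu(self.conv2(x)))  # step (4): repeat (2)-(3)
        return x.flatten(1)                       # step (5): length-l vector

model = CNNClassifier()
optim = torch.optim.SGD(model.parameters(), lr=0.1)

X = torch.randn(4, 1, d, s)    # placeholder batch of sentence matrices
y = torch.randint(0, l, (4,))  # placeholder class indices

for _ in range(100):           # step (6): minimize the accumulated distance
    optim.zero_grad()
    loss = F.cross_entropy(model(X), y)  # softmax + cross entropy of step (5)
    loss.backward()
    optim.step()
```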
Further, the sentence matrix $S_0$ is convolved with the kernel parameters $(W, b)$ to obtain the feature matrix $S_1$:

$$S_1 = f(S_0 \cdot W + b) \quad (3)$$

where $W$ is the convolution kernel parameter matrix, $b$ is the bias vector, and $f(\cdot)$ denotes the activation function. Based on the obtained feature matrix $S_1$, pooling is computed by the max pooling method:

$$S_2 = \mathrm{downsample}(S_1) \quad (4)$$

where $\mathrm{downsample}(\cdot)$ denotes the pooling function. Repeating the computations of formulas (3) and (4) yields the final feature matrix $S_n$, which is unfolded into a one-dimensional vector $\bar{y}$ of length $l$; the probability vector $\hat{y}_i$ of the text in each class is computed by the softmax function

$$\hat{y}_{ik} = \frac{e^{\bar{y}_{ik}}}{\sum_{k'=1}^{l} e^{\bar{y}_{ik'}}}$$

and finally the distance $d_i$ between $\hat{y}_i$ and $y_i$ is computed by the cross-entropy function

$$d_i = -\sum_{k=1}^{l} y_{ik} \log \hat{y}_{ik}$$

where $\bar{y}_{ik}$ is the $k$-th value ($1 \le k \le l$) of the one-dimensional vector $\bar{y}$ and $\hat{y}_{ik}$ is the probability of text $i$ in the $k$-th dimension of the category vector. The final objective function is:

$$\mathrm{Loss} = d_i$$

The method computes the parameters $(W, b)$ by gradient descent so as to minimize the loss, and the final $(W, b)$ are saved as model parameters for use when classifying texts. (A NumPy illustration of the softmax and cross-entropy computation follows below.)
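A small NumPy illustration of the softmax and cross-entropy computations just defined; the numbers are arbitrary:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())            # shifted for numerical stability
    return e / e.sum()

def cross_entropy_distance(y_bar, y_true):
    """d_i = -sum_k y_ik * log(y_hat_ik), with y_hat = softmax(y_bar)."""
    return -np.sum(y_true * np.log(softmax(y_bar)))

y_bar = np.array([2.0, 0.5, -1.0])   # flattened CNN output, l = 3
y_true = np.array([1.0, 0.0, 0.0])   # Boolean label vector of the true class
d_i = cross_entropy_distance(y_bar, y_true)
```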
Further, step (S5) specifically includes:
(1) pre-processing the text to be classified in the same way as the training samples, computing its sentence matrix $S'_0$ ($S'_0 \in R^{d \times s}$), and performing convolution with kernels of preferred size 3 × 2 to obtain the feature matrix $S'_1$;
(2) applying 2 × 2 max pooling to the obtained feature matrix $S'_1$, extracting the maximum of each 2 × 2 block and recombining the maxima into a new feature matrix $S'_2$;
(3) repeating the convolution and pooling of steps (1) and (2) until the feature matrix $S'_n$ likewise contains only $l$ numbers;
(4) unfolding the feature matrix $S'_n$ into a one-dimensional vector $\bar{y}'$ of length $l$, then computing the probability vector $\hat{y}'$ of the text in each class by the softmax function, and finally computing by the cross-entropy function the distance $d'_j$ between $\hat{y}'$ and each class label $y_j$;
(5) finally, finding among the $l$ obtained distances $d'_j$ the class corresponding to the smallest distance, which is the class label of the text: $\mathrm{label}' = \arg\min_j(d'_j)$. (A numeric illustration follows below.)
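A numeric illustration of step (5): because the labels $y_j$ are one-hot, the cross-entropy distance reduces to $d'_j = -\log\hat{y}'_j$, so the class of minimum distance is exactly the class of maximum predicted probability. The values below are arbitrary:

```python
import numpy as np

def predict_label(y_hat, labels):
    """label' = argmin_j d'_j over the one-hot class labels y_j."""
    d = [-np.sum(lab * np.log(y_hat)) for lab in labels]
    return int(np.argmin(d))

y_hat = np.array([0.1, 0.7, 0.2])         # softmax output for l = 3 classes
labels = list(np.eye(3))                  # one-hot label vectors y_j
assert predict_label(y_hat, labels) == 1  # same result as np.argmax(y_hat)
```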
Compared with the prior art, the present invention has the following advantages and technical effects:
The method fully extracts the contextual semantic information and local features of the text, introducing the convolutional neural network (CNN) method of deep learning on top of the original classification methods. CNN was originally applied to feature extraction in image processing and has excellent local-information extraction ability, which makes it well suited as the feature extraction method for the word vector matrix of a text. The method first remedies the shortcomings of GloVe word vectors in semantic capture and co-occurrence statistics: the global point mutual information matrix is introduced into the word vector matrix, making the statistical information of the word vectors more accurate, while the objective function of the GloVe model is improved and the truncation operation removed, reducing the complexity of model training. In addition, combined with the CNN feature extraction model, the key information of a text can be found effectively and its meaning determined accurately. The present invention can accurately mine the classification features of texts, is suitable for text classification in various fields, and has great practical value.
Description of the drawings
Fig. 1 is the flow chart of the word vector model based on point mutual information and the text classification method based on CNN.
Fig. 2 is the flow chart of the word vector training method based on point mutual information.
Fig. 3 is the training flow chart of the text classification model based on CNN.
Specific embodiments
The solution of the present invention has been explained in sufficient detail in the summary above. The specific implementation of the invention is described in detail below with reference to the accompanying drawings and specific embodiments, but the implementation of the invention is not limited thereto. Note that processes or symbols not described in special detail below, such as the conventional parameters W and b of a CNN neural network, can be understood or realized by those skilled in the art with reference to the prior art and the existing theory of CNNs, and are not repeated below.
Referring to Fig. 1, the word vector model based on point mutual information and the text classification method based on CNN in this example comprise:
(S1) training the word vector model by the global word vector method based on point mutual information;
(S2) determining the word vector matrix of the text according to the trained word vector model;
(S3) extracting features from the word vector matrix by a convolutional neural network (CNN) and training a classification model;
(S4) extracting features of the input text according to the trained word vector model and the CNN feature extraction model;
(S5) computing, from the text features produced by the CNN feature extraction model, the mapping distance between the text and each preset category by softmax and cross entropy, and taking the nearest category as the category of the text.
1. Training word vectors based on point mutual information
For a language model that trains word vectors from statistical information, the key to model training is how to portray the relations between words with comprehensive and accurate information. The present invention therefore improves the GloVe model: derivation shows that the matrix of point mutual information between words portrays the statistical relations between words better. Referring to Fig. 2, the specific technical solution is as follows:
The word vector training method based on point mutual information comprises the following steps:
(1) inputting the Chinese Wikipedia data set and pre-processing the data, removing punctuation marks and spaces;
(2) performing word segmentation on the data set obtained in step (1) with a Chinese word segmentation tool, converting the corpus data into word sequences;
(3) counting word frequencies of the words obtained in step (2) and saving the statistics to disk in the format "word \t frequency";
(4) counting co-occurrences of the words obtained in step (2): the corpus is traversed with the pre-configured window size, the number of co-occurrences of every two words within the window is obtained, and the result is saved to disk as triples in the format "word 1 \t word 2 \t co-occurrence count";
(5) shuffling the triples obtained in step (4) at random and saving the shuffled triples to disk in the format "word 1 \t word 2 \t co-occurrence count";
(6) randomly initializing a word vector for every word appearing in step (2) and keeping the vectors in memory, where the program can conveniently read and modify them;
(7) traversing all the triples obtained in step (5) and adjusting the word vectors by gradient descent according to the objective function

$$J = \sum_{i,k \in V}\left(w_i^{T}\tilde{w}_k - \mathrm{PMI}_{ik}\right)^{2}$$

where $w_i$ and $\tilde{w}_k$ are the word vectors of the center word and the context word respectively, and $V$ denotes all word vectors in the vocabulary;
(8) repeating step (7) until the result converges, which yields the word vectors based on point mutual information; the word vectors in memory are then saved to disk in the format "word \t word vector".
In the above word vector training method based on point mutual information, in steps (5) to (8), the method starts from the hypothesis that "for two words $w_i$ and $w_j$ that possess similar contexts, the relationship between $w_i$ and $w_j$ can be embodied through their relationship with a third word $\tilde{w}_k$", and models the relationship between $w_i$ and $w_j$ as:

$$F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}} \quad (1)$$

The ratio on the right of the equation is the model output and represents the relationship between the words to be predicted. On the premise of keeping the model output unchanged, the input of the model is simplified so that an optimizable objective function can be established. Considering that the vector space has an intrinsic linear structure, the input function is restricted to depend only on the difference of the two center word vectors, which gives:

$$F(w_i - w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$$

Since the right side of the equation is a scalar, the input vectors could be converted into a scalar by some complicated linear or nonlinear transformation; but this would undoubtedly increase the complexity of the model and damage its linear structure. To avoid this situation, the vector operation is carried out in the form of a dot product to portray the relationship between the two words, as in the following formula:

$$F\big((w_i - w_j)^{T}\tilde{w}_k\big) = \frac{P_{ik}}{P_{jk}}$$

In order to convert the left side of the equation into the form of a ratio as well, combined with the continuity condition, the general solution of the functional equation on the left side is $F(x) = e^{ax}$; considering that the norms of the word vectors can be normalized, $F(x) = e^{x}$ may be taken directly, so that:

$$\frac{e^{w_i^{T}\tilde{w}_k}}{e^{w_j^{T}\tilde{w}_k}} = \frac{P_{ik}}{P_{jk}} \quad (2)$$

Letting the numerators and the denominators of formulas (1) and (2) be equal to each other, it can be obtained that:

$$e^{w_i^{T}\tilde{w}_k} = P_{ik}$$

that is:

$$w_i^{T}\tilde{w}_k = \log P_{ik}$$

The objective function of the GloVe model is therefore modified, replacing the co-occurrence count $X_{ik}$ in the original objective with the point mutual information $\mathrm{PMI}_{ik}$. The objective function of the GloVe model:

$$J = \sum_{i,k \in V} f(X_{ik})\left(w_i^{T}\tilde{w}_k + b_i + \tilde{b}_k - \log X_{ik}\right)^{2}$$

The objective function of this method:

$$J = \sum_{i,k \in V}\left(w_i^{T}\tilde{w}_k - \mathrm{PMI}_{ik}\right)^{2}$$

It can be seen from the above formulas that the point mutual information in this model takes the occurrence probabilities of the two words into account in the denominator, so it is not disturbed by high-frequency words and portrays the relations between words better, making the trained word vectors reflect word meaning more accurately. In addition, comparing the objective functions of this model and of the GloVe model, the objective function defined by this method is clearly simpler in form and eliminates the truncation function, so the amount of computation can be reduced effectively.
After the objective function is determined, the model can carry out the word vector training process. First, by traversing the corpus, the point mutual information (PMI) matrix between words is obtained statistically. After the PMI matrix is obtained, the model traverses the values in the matrix while training the word vectors by gradient descent. Through continuous iteration, the correct word vector representation is finally obtained.
Finally, through the above steps, every text in the data set is represented as a d × s word vector matrix, where s is the word count of the longest text in the text set to be classified and d is the dimension of each word vector; sentences shorter than s are completed by a "zero padding" operation.
2. Establishing the text classification model based on CNN
Based on the text representation matrices obtained above, the method trains the classification model on the collected labelled corpus. Referring to Fig. 3, the specific steps are as follows:
(1) the classification labels in the data set (such as the Tan Songbo data set) are vectorized as $y_j \in Y$, where $y_j$ is the vector representation of class $j$ (the dimension of $y_j$ is $l$) and $Y$ is the set of classes;
(2) based on the obtained sentence matrix $S_0$, convolution is performed with kernels of preferred size 3 × 2 to obtain the feature matrix $S_1$;
(3) 2 × 2 max pooling is applied to the obtained feature matrix $S_1$, extracting the maximum of each 2 × 2 block and recombining the maxima into a new feature matrix $S_2$;
(4) steps (2) and (3) are repeated until the feature matrix $S_n$ contains only $l$ values;
(5) the feature matrix $S_n$ is unfolded into a one-dimensional vector $\bar{y}$ of length $l$, then the probability vector $\hat{y}_i$ of the text in each class is computed by the softmax function, and finally the distance $d_i$ between $\hat{y}_i$ and $y_i$ is computed by the cross-entropy function;
(6) the $d_i$ are accumulated, and model training is carried out with the objective of minimizing the accumulated $d_i$; the model parameters are then saved. In Fig. 3, n is the iteration counter and N is the configured maximum number of iterations.
In the training of the above CNN classification model, the sentence matrix $S_0$ is convolved with the kernel parameters $(W, b)$ to obtain the feature matrix $S_1$:

$$S_1 = f(S_0 \cdot W + b) \quad (3)$$

where $f(\cdot)$ denotes the activation function. Based on the obtained feature matrix $S_1$, pooling is computed by the max pooling method:

$$S_2 = \mathrm{downsample}(S_1) \quad (4)$$

where $\mathrm{downsample}(\cdot)$ denotes the pooling function. Repeating the computations of formulas (3) and (4) yields the final feature matrix $S_n$, which is unfolded into a one-dimensional vector $\bar{y}$ of length $l$; the probability vector $\hat{y}_i$ of the text in each class is computed by the softmax function

$$\hat{y}_{ik} = \frac{e^{\bar{y}_{ik}}}{\sum_{k'=1}^{l} e^{\bar{y}_{ik'}}}$$

and finally the distance $d_i$ between $\hat{y}_i$ and $y_i$ is computed by the cross-entropy function

$$d_i = -\sum_{k=1}^{l} y_{ik} \log \hat{y}_{ik}$$

where $\bar{y}_{ik}$ is the $k$-th value ($1 \le k \le l$) of the one-dimensional vector $\bar{y}$ and $\hat{y}_{ik}$ is the probability of text $i$ in the $k$-th dimension of the category vector. The final objective function is:

$$\mathrm{Loss} = d_i$$

The method computes the parameters $(W, b)$ by gradient descent so as to minimize the loss.
3. Text classification
Based on the obtained word vector model and CNN classification model, texts can be classified. The detailed process is as follows:
(1) the sentence matrix $S_0$ is computed from the text to be classified, and convolution is performed with kernels of preferred size 3 × 2 to obtain the feature matrix $S_1$;
(2) 2 × 2 max pooling is applied to the obtained feature matrix $S_1$, extracting the maximum of each 2 × 2 block and recombining the maxima into a new feature matrix $S_2$;
(3) the convolution and pooling of steps (1) and (2) are repeated until the feature matrix $S_n$ contains only $l$ values;
(4) the feature matrix $S_n$ is unfolded into a one-dimensional vector $\bar{y}$ of length $l$, then the probability vector $\hat{y}$ of the text in each class is computed by the softmax function, and finally the distance $d_j$ between $\hat{y}$ and each label $y_j$ is computed by the cross-entropy function;
(5) based on the obtained distances $d_j$, the class with the nearest distance is selected as the category label of the text by minimizing:

$$\mathrm{Label} = \arg\min_j(d_j)$$

(An end-to-end inference sketch follows below.)
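Tying the pieces together, the following sketch shows the inference of part 3 end to end. It assumes objects from the earlier sketches: a word segmenter, the dictionary of trained word vectors, and the trained CNN classifier; all names are illustrative rather than taken from the patent.

```python
import numpy as np
import torch

def classify(text, segment, vectors, model, d, s):
    """Segment `text`, build its d x s sentence matrix, run the trained CNN,
    and return the class with the minimum cross-entropy distance, which for
    one-hot labels is the class of maximum softmax probability."""
    words = segment(text)
    S0 = np.zeros((d, s))                     # step (S2) with zero padding
    for col, w in enumerate(words[:s]):
        if w in vectors:
            S0[:, col] = vectors[w]
    x = torch.tensor(S0, dtype=torch.float32).reshape(1, 1, d, s)
    with torch.no_grad():                     # steps (1)-(4): conv + pool
        probs = torch.softmax(model(x), dim=1)[0]
    return int(torch.argmax(probs))           # step (5): nearest class
```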
This method fully extracts the contextual semantic information and local features of the text, introducing the convolutional neural network (CNN) method of deep learning on top of the original classification methods. CNN was originally applied to feature extraction in image processing and has excellent local-information extraction ability, which makes it well suited as the feature extraction method for the word vector matrix of a text. The method first remedies the shortcomings of GloVe word vectors in semantic capture and co-occurrence statistics: the global point mutual information matrix is introduced into the word vector matrix, making the statistical information of the word vectors more accurate, while the objective function of the GloVe model is improved and the truncation operation removed, reducing the complexity of model training. In addition, combined with the CNN feature extraction model, the key information of a text can be found effectively and its meaning determined accurately.

Claims (7)

1. A word vector model based on point mutual information and a text classification method based on CNN, characterized by comprising: (S1) training the word vector model by the global word vector method based on point mutual information;
(S2) determining the word vector matrix of the text according to the trained word vector model;
(S3) extracting features from the word vector matrix by a convolutional neural network (CNN) and training a classification model; (S4) extracting features of the input text according to the trained word vector model and the CNN feature extraction model;
(S5) computing, from the text features produced by the CNN feature extraction model, the mapping distance between the text and each preset category by softmax and cross entropy, and taking the nearest category as the category of the text.
2. The word vector model based on point mutual information and text classification method based on CNN according to claim 1, characterized in that step (S1) specifically includes:
(1) inputting the Chinese Wikipedia data set and pre-processing the data, removing punctuation marks and spaces;
(2) performing word segmentation on the data set obtained in step (1) with a Chinese word segmentation tool, converting the corpus data into word sequences;
(3) counting word frequencies of the words obtained in step (2) and saving the statistics to disk;
(4) counting co-occurrences of the words obtained in step (2): the corpus is traversed with the configured window size, the number of co-occurrences of every two words within the window is obtained, and the result is saved to disk in the form of triples;
(5) shuffling the triples obtained in step (4) at random and saving the shuffled triples to disk;
(6) randomly initializing a word vector for every word appearing in step (2) and keeping the vectors in memory, where the program can conveniently read and modify them;
(7) traversing all the triples obtained in step (5) and adjusting the word vectors by gradient descent according to the objective function

$$J = \sum_{i,k \in V}\left(w_i^{T}\tilde{w}_k - \mathrm{PMI}_{ik}\right)^{2}$$

where $w_i$ and $\tilde{w}_k$ are the word vectors of the center word and the context word respectively, and $V$ denotes all word vectors in the vocabulary;
(8) repeating step (7) until the result converges, which yields the word vectors based on point mutual information; the word vectors in memory are then saved to disk.
3. The word vector model based on point mutual information and text classification method based on CNN according to claim 1, characterized in that, in steps (5) to (8), for two words $w_i$ and $w_j$ that possess similar contexts, the relationship between $w_i$ and $w_j$ can be embodied through their relationship with a third word $\tilde{w}_k$, and the relationship between $w_i$ and $w_j$ is modeled as:

$$F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}} \quad (1)$$

In this equation, $w_i$ and $w_j$ denote the two center words possessing similar contexts, $\tilde{w}_k$ is a context word vector, and $P_{ik}$ is the probability that $w_i$ and $\tilde{w}_k$ occur together;
The ratio on the right of the equation is the model output and represents the relationship between the words to be predicted; on the premise of keeping the output of the initial model unchanged, the input of the initial model is simplified so that an optimizable objective function can be established; considering that the vector space has an intrinsic linear structure, the input function is restricted to depend only on the difference of the two center word vectors, which gives the following formula:

$$F(w_i - w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$$

Since the right side of the equation is a scalar, the input vectors could be converted into a scalar by some complicated linear or nonlinear transformation; instead, the vector operation is carried out in the form of a dot product to portray the relationship between the two words, as in the following formula:

$$F\big((w_i - w_j)^{T}\tilde{w}_k\big) = \frac{P_{ik}}{P_{jk}}$$

In order to convert the left side of the equation into the form of a ratio as well, combined with the continuity condition, the general solution of the functional equation on the left side is $F(x) = e^{ax}$; considering that the norms of the word vectors can be normalized, $F(x) = e^{x}$ is taken directly, so that:

$$\frac{e^{w_i^{T}\tilde{w}_k}}{e^{w_j^{T}\tilde{w}_k}} = \frac{P_{ik}}{P_{jk}} \quad (2)$$

Letting the numerators and the denominators of formulas (1) and (2) be equal to each other, it can be obtained that:

$$e^{w_i^{T}\tilde{w}_k} = P_{ik}$$

that is:

$$w_i^{T}\tilde{w}_k = \log P_{ik}$$

The objective function of the GloVe model is modified accordingly, replacing the co-occurrence count $X_{ik}$ in the original objective with the point mutual information $\mathrm{PMI}_{ik} = \log\frac{P(w_i,\tilde{w}_k)}{P(w_i)\,P(\tilde{w}_k)}$, i.e. the objective function of the GloVe model:

$$J = \sum_{i,k \in V} f(X_{ik})\left(w_i^{T}\tilde{w}_k + b_i + \tilde{b}_k - \log X_{ik}\right)^{2}$$

becomes the final objective function of the word vector model:

$$J = \sum_{i,k \in V}\left(w_i^{T}\tilde{w}_k - \mathrm{PMI}_{ik}\right)^{2}$$

After training by gradient descent, the word vector of the target word is obtained as $w_i$; repeating the above operation for each word yields the word vectors of all words.
4. The word vector model based on point mutual information and text classification method based on CNN according to claim 1, characterized in that step (S2) specifically includes:
based on the word vector data stored by the global point-mutual-information word vector model, finding by matching the word vector $w \in R^{d \times 1}$ corresponding to each word in the training corpus, a word vector being a one-dimensional vector of length $d$, and combining these word vectors, in the order of the original sentence $(w_1, w_2, w_3, \dots, w_s)$, into the sentence matrix $S_0$ ($S_0 \in R^{d \times s}$), where $d$ is the dimension of the word vectors and $s$ is the word count of the longest sentence in the corpus, i.e. the sentence length.
5. The word vector model based on point mutual information and text classification method based on CNN according to claim 1, characterized in that step (S3) specifically includes:
(1) representing the classification labels in the Chinese Wikipedia data set as Boolean vectors $y_j \in Y$ ($j \in \{1, 2, \dots, l\}$), where $y_j$ is the label vector of the $j$-th class, $Y$ is the set of classes, $l$ is the total number of classes, and the dimension of $y_j$ is also $l$;
(2) based on the obtained sentence matrix $S_0$ ($S_0 \in R^{d \times s}$), performing convolution with kernels of preferred size 3 × 2 to obtain the feature matrix $S_1$;
(3) applying 2 × 2 max pooling to the obtained feature matrix $S_1$, extracting the maximum of each 2 × 2 block and recombining the maxima into a new feature matrix $S_2$;
(4) repeating steps (2) and (3) until the feature matrix $S_n$ contains only $l$ numbers, where $n$ is the total number of convolution and pooling operations performed;
(5) unfolding the feature matrix $S_n$ into a one-dimensional vector $\bar{y}$ of length $l$, then computing the probability $\hat{y}_i$ of text $i$ in each dimension by the softmax function, and finally computing by the cross-entropy function the distance $d_i$ between $\hat{y}_i$ and the correct category $y_i$, that is, the label vector of the class to which the text itself belongs;
(6) accumulating the $d_i$ and training the model with the objective of minimizing $\frac{1}{M}\sum_{i=1}^{M} d_i$, and saving the parameters of the resulting CNN classification model, where $M$ is the number of texts in the whole training corpus.
6. The word vector model based on point mutual information and text classification method based on CNN according to claim 5, characterized in that the sentence matrix $S_0$ is convolved with the kernel parameters $(W, b)$ to obtain the feature matrix $S_1$:

$$S_1 = f(S_0 \cdot W + b) \quad (3)$$

where $W$ is the convolution kernel parameter matrix, $b$ is the bias vector, and $f(\cdot)$ denotes the activation function; based on the obtained feature matrix $S_1$, pooling is computed by the max pooling method:

$$S_2 = \mathrm{downsample}(S_1) \quad (4)$$

where $\mathrm{downsample}(\cdot)$ denotes the pooling function; repeating the computations of formulas (3) and (4) yields the final feature matrix $S_n$, which is unfolded into a one-dimensional vector $\bar{y}$ of length $l$; the probability vector $\hat{y}_i$ of the text in each class is computed by the softmax function

$$\hat{y}_{ik} = \frac{e^{\bar{y}_{ik}}}{\sum_{k'=1}^{l} e^{\bar{y}_{ik'}}}$$

and the distance $d_i$ between $\hat{y}_i$ and $y_i$ by the cross-entropy function

$$d_i = -\sum_{k=1}^{l} y_{ik}\log\hat{y}_{ik}$$

where $\bar{y}_{ik}$ is the $k$-th value ($1 \le k \le l$) of the one-dimensional vector $\bar{y}$ and $\hat{y}_{ik}$ is the probability of text $i$ in the $k$-th dimension of the category vector; the final objective function is:

$$\mathrm{Loss} = d_i$$

The method computes the parameters $(W, b)$ by gradient descent so as to minimize the loss, and the final $(W, b)$ are saved as model parameters for use when classifying texts.
7. The word vector model based on point mutual information and text classification method based on CNN according to claim 1, characterized in that step (S5) specifically includes:
(1) pre-processing the text to be classified in the same way as the training samples, computing its sentence matrix $S'_0$ ($S'_0 \in R^{d \times s}$), and performing convolution with kernels of preferred size 3 × 2 to obtain the feature matrix $S'_1$;
(2) applying 2 × 2 max pooling to the obtained feature matrix $S'_1$, extracting the maximum of each 2 × 2 block and recombining the maxima into a new feature matrix $S'_2$;
(3) repeating the convolution and pooling of steps (1) and (2) until the feature matrix $S'_n$ likewise contains only $l$ numbers;
(4) unfolding the feature matrix $S'_n$ into a one-dimensional vector $\bar{y}'$ of length $l$, then computing the probability vector $\hat{y}'$ of the text in each class by the softmax function, and finally computing by the cross-entropy function the distance $d'_j$ between $\hat{y}'$ and each class label $y_j$;
(5) finally, finding among the $l$ obtained distances $d'_j$ the class corresponding to the smallest distance, which is the class label of the text: $\mathrm{label}' = \arg\min_j(d'_j)$.
CN201810938236.8A 2018-08-16 2018-08-16 Word vector model based on point mutual information and text classification method based on CNN Active CN109189925B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810938236.8A CN109189925B (en) 2018-08-16 2018-08-16 Word vector model based on point mutual information and text classification method based on CNN

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810938236.8A CN109189925B (en) 2018-08-16 2018-08-16 Word vector model based on point mutual information and text classification method based on CNN

Publications (2)

Publication Number Publication Date
CN109189925A true CN109189925A (en) 2019-01-11
CN109189925B CN109189925B (en) 2020-01-17

Family

ID=64918641

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810938236.8A Active CN109189925B (en) 2018-08-16 2018-08-16 Word vector model based on point mutual information and text classification method based on CNN

Country Status (1)

Country Link
CN (1) CN109189925B (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107301246A (en) * 2017-07-14 2017-10-27 河北工业大学 Chinese Text Categorization based on ultra-deep convolutional neural networks structural model
CN108399230A (en) * 2018-02-13 2018-08-14 上海大学 A kind of Chinese financial and economic news file classification method based on convolutional neural networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
郑毅 (Zheng Yi): "Research on Sentiment Analysis of Chinese Microblogs Based on a Sentiment Dictionary", China Master's Theses Full-text Database (Information Science and Technology) *

Cited By (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111610975A (en) * 2019-02-26 2020-09-01 深信服科技股份有限公司 Executable file type determination method, device, equipment and storage medium
CN110196909A (en) * 2019-05-14 2019-09-03 北京来也网络科技有限公司 Text denoising method and device based on intensified learning
CN110147449A (en) * 2019-05-27 2019-08-20 中国联合网络通信集团有限公司 File classification method and device
CN110289050A (en) * 2019-05-30 2019-09-27 湖南大学 A kind of drug based on figure convolution sum term vector-target interaction prediction method
CN110298391A (en) * 2019-06-12 2019-10-01 同济大学 A kind of iterative increment dialogue intention classification recognition methods based on small sample
CN110287319A (en) * 2019-06-13 2019-09-27 南京航空航天大学 Students' evaluation text analyzing method based on sentiment analysis technology
CN110287319B (en) * 2019-06-13 2021-06-15 南京航空航天大学 Student evaluation text analysis method based on emotion analysis technology
CN110287236A (en) * 2019-06-25 2019-09-27 平安科技(深圳)有限公司 A kind of data digging method based on interview information, system and terminal device
CN110287236B (en) * 2019-06-25 2024-03-19 平安科技(深圳)有限公司 Data mining method, system and terminal equipment based on interview information
CN110348497A (en) * 2019-06-28 2019-10-18 西安理工大学 A kind of document representation method based on the building of WT-GloVe term vector
CN110348497B (en) * 2019-06-28 2021-09-10 西安理工大学 Text representation method constructed based on WT-GloVe word vector
CN110659892A (en) * 2019-07-31 2020-01-07 林勇 Method and device for acquiring total price of article, computer equipment and storage medium
CN110472053A (en) * 2019-08-05 2019-11-19 广联达科技股份有限公司 A kind of automatic classification method and its system towards public resource bidding advertisement data
CN110598207A (en) * 2019-08-14 2019-12-20 华南师范大学 Word vector obtaining method and device and storage medium
CN110781662A (en) * 2019-10-21 2020-02-11 腾讯科技(深圳)有限公司 Method for determining point-to-point mutual information and related equipment
CN110750652A (en) * 2019-10-21 2020-02-04 广西大学 Story ending generation method combining context entity words and knowledge
CN110781662B (en) * 2019-10-21 2022-02-01 腾讯科技(深圳)有限公司 Method for determining point-to-point mutual information and related equipment
CN110955776A (en) * 2019-11-16 2020-04-03 中电科大数据研究院有限公司 Construction method of government affair text classification model
CN111159396A (en) * 2019-12-04 2020-05-15 中国电子科技集团公司第三十研究所 Method for establishing text data classification hierarchical model facing data sharing exchange
CN111159396B (en) * 2019-12-04 2022-04-22 中国电子科技集团公司第三十研究所 Method for establishing text data classification hierarchical model facing data sharing exchange
CN111259658A (en) * 2020-02-05 2020-06-09 中国科学院计算技术研究所 General text classification method and system based on category dense vector representation
CN111259658B (en) * 2020-02-05 2022-08-19 中国科学院计算技术研究所 General text classification method and system based on category dense vector representation
CN113495958A (en) * 2020-03-20 2021-10-12 北京沃东天骏信息技术有限公司 Text classification method and device
CN111611801A (en) * 2020-06-02 2020-09-01 腾讯科技(深圳)有限公司 Method, device, server and storage medium for identifying text region attribute
CN111881690B (en) * 2020-06-15 2024-03-29 华南师范大学 Word vector dynamic adjustment processing method, system, device and medium
CN111881690A (en) * 2020-06-15 2020-11-03 华南师范大学 Processing method, system, device and medium for dynamic adjustment of word vectors
CN111930892A (en) * 2020-08-07 2020-11-13 重庆邮电大学 Scientific and technological text classification method based on improved mutual information function
CN111930892B (en) * 2020-08-07 2023-09-29 重庆邮电大学 Scientific and technological text classification method based on improved mutual information function
WO2022116444A1 (en) * 2020-12-01 2022-06-09 平安科技(深圳)有限公司 Text classification method and apparatus, and computer device and medium
CN112612892B (en) * 2020-12-29 2022-11-01 达而观数据(成都)有限公司 Special field corpus model construction method, computer equipment and storage medium
CN112612892A (en) * 2020-12-29 2021-04-06 达而观数据(成都)有限公司 Special field corpus model construction method, computer equipment and storage medium
CN113011155B (en) * 2021-03-16 2023-09-05 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for text matching
CN113011155A (en) * 2021-03-16 2021-06-22 北京百度网讯科技有限公司 Method, apparatus, device, storage medium and program product for text matching
CN113095087A (en) * 2021-04-30 2021-07-09 哈尔滨理工大学 Chinese word sense disambiguation method based on graph convolution neural network
CN115828926A (en) * 2022-11-30 2023-03-21 华中科技大学 Construction quality hidden danger data mining model training method and mining system
CN115828926B (en) * 2022-11-30 2023-08-04 华中科技大学 Construction quality hidden danger data mining model training method and mining system

Also Published As

Publication number Publication date
CN109189925B (en) 2020-01-17

Similar Documents

Publication Publication Date Title
CN109189925A Word vector model based on point mutual information and text classification method based on CNN
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN110866117B (en) Short text classification method based on semantic enhancement and multi-level label embedding
Dhillon et al. Eigenwords: spectral word embeddings.
Wang et al. Research on Web text classification algorithm based on improved CNN and SVM
Gallant et al. Representing objects, relations, and sequences
CN111414461B (en) Intelligent question-answering method and system fusing knowledge base and user modeling
CN110222163A (en) A kind of intelligent answer method and system merging CNN and two-way LSTM
CN110765260A (en) Information recommendation method based on convolutional neural network and joint attention mechanism
CN111027595B (en) Double-stage semantic word vector generation method
CN111291556B (en) Chinese entity relation extraction method based on character and word feature fusion of entity meaning item
CN107895000B (en) Cross-domain semantic information retrieval method based on convolutional neural network
US20230169271A1 (en) System and methods for neural topic modeling using topic attention networks
Sokkhey et al. Development and optimization of deep belief networks applied for academic performance prediction with larger datasets
CN113806543B (en) Text classification method of gate control circulation unit based on residual jump connection
Bhende et al. Integrating multiclass light weighted BiLSTM model for classifying negative emotions
Rahman Robust and consistent estimation of word embedding for bangla language by fine-tuning word2vec model
CN111581364A (en) Chinese intelligent question-answer short text similarity calculation method oriented to medical field
Jeon et al. Dropout prediction over weeks in MOOCs via interpretable multi-layer representation learning
CN110705259A (en) Text matching method for capturing matching features in multiple granularities
Lin et al. Text classification feature extraction method based on deep learning for unbalanced data sets
Sadr et al. A novel deep learning method for textual sentiment analysis
CN114265936A (en) Method for realizing text mining of science and technology project
Gao et al. Chinese short text classification method based on word embedding and Long Short-Term Memory Neural Network
Fu et al. A hybrid algorithm for text classification based on CNN-BLSTM with attention

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20210706

Address after: 210012 4th floor, building C, Wanbo Science Park, 20 Fengxin Road, Yuhuatai District, Nanjing City, Jiangsu Province

Patentee after: NANJING SILICON INTELLIGENCE TECHNOLOGY Co.,Ltd.

Address before: Room 614-615, No.1, Lane 2277, Zuchongzhi Road, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai, 200120

Patentee before: Shanghai Airlines Intellectual Property Services Ltd.

Effective date of registration: 20210706

Address after: Room 614-615, No.1, Lane 2277, Zuchongzhi Road, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai, 200120

Patentee after: Shanghai Airlines Intellectual Property Services Ltd.

Address before: School of physics and telecommunication engineering, South China Normal University, No. 378, Waihuan West Road, Panyu District, Guangzhou City, Guangdong Province, 510006

Patentee before: SOUTH CHINA NORMAL University