CN109189925A - Word vector model based on point mutual information and text classification method based on CNN - Google Patents

Word vector model based on point mutual information and text classification method based on CNN

Info

Publication number
CN109189925A
CN109189925A
Authority
CN
China
Prior art keywords
word vector
text
vector
model
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810938236.8A
Other languages
Chinese (zh)
Other versions
CN109189925B (en)
Inventor
李万理
吴海明
薛云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Airlines Intellectual Property Services Ltd
Nanjing Silicon Intelligence Technology Co Ltd
Original Assignee
South China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China Normal University filed Critical South China Normal University
Priority to CN201810938236.8A priority Critical patent/CN109189925B/en
Publication of CN109189925A publication Critical patent/CN109189925A/en
Application granted granted Critical
Publication of CN109189925B publication Critical patent/CN109189925B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/243: Classification techniques relating to the number of classes
    • G06F18/24323: Tree-organised classifiers

Abstract

The present invention discloses a word vector model based on point mutual information and a text classification method based on CNN. The method comprises: (1) training the word vector model by the global word vector method based on point mutual information; (2) determining the word vector matrix of a text according to the trained word vector model; (3) extracting features from the word vector matrix by CNN and training a classification model; (4) extracting features of the input text according to the trained word vector model and the CNN feature extraction model; (5) computing, from the text features produced by the CNN feature extraction model, the mapping distance between the text and each preset category by softmax and cross entropy, and taking the nearest category as the category of the text. The method overcomes the deficiencies of GloVe word vectors in semantic capture and co-occurrence statistics, reduces the complexity of model training, can accurately mine the classification features of texts, and is suitable for text classification in various fields, so it has great practical value.

Description

Word vector model based on point mutual information and text classification method based on CNN
Technical field
The present invention relates to the field of text classification in natural language processing, and specifically to a word vector model based on point mutual information and a text classification method based on CNN (convolutional neural networks).
Background technique
With the development of Internet technology, the amount of data on the World Wide Web grows daily, and a large proportion of it is text data touching every sector of society. Faced with text data of such enormous volume, how to classify it rationally has become an important research problem. Rational, mechanized classification of text can help people solve many difficulties in many settings, for example distinguishing junk information and discovering deceptive information. In recent years, to accomplish text classification, the representation of text has become all-important: a reasonable text representation yields accurate semantic information about the text.
1. Word vector technical background
In natural language representation, the vectorized representation of words is an important basic technique. The traditional representation creates a dictionary and numbers each word in order, i.e. one-hot encoding. This representation cannot capture the semantic similarity between words and easily suffers from the curse of dimensionality. For this reason, Hinton [1] proposed the distributed representation of word vectors in 1986, which represents a word by a vector of fixed dimension and expresses the semantic distance between words by the distance between their vectors; besides reducing dimensionality, it bridges the semantic gap between words, so that semantic relations between words are better described. As research deepened, Bengio proposed the neural network language model, which obtains word vectors in passing: the network performs unsupervised learning, captures the contextual relations between words, and trains the word vectors as parameters together with the network model. Bengio's model is effective but computationally expensive; to reduce the complexity, Mikolov proposed the improved model word2vec on this basis and obtained better results, while reducing the complexity of the model from n × V to n × log₂V, making the training of large-scale word vectors much more efficient. Although Mikolov achieved good results in word representation, many scholars continued to probe deeper into language representation. Among them, the GloVe model proposed by Pennington refines the objective function of word2vec and introduces global statistical information, the co-occurrence matrix, into word vector training, outperforming word2vec in several experiments.
To remedy the deficiencies of the GloVe model in semantic capture and co-occurrence statistics, the present method introduces point mutual information on the basis of that model, constructs a global point mutual information matrix, and trains the final word vectors. Experiments on several semantics-related data sets show that word vectors based on the global point mutual information matrix express semantic relations better. The main contributions are: 1. the global point mutual information matrix is introduced into word vector computation, making the statistical information of the word vectors more accurate; 2. the objective function of the GloVe model is improved and the truncation operation is removed, which significantly reduces the computation required for model training.
2. Classification method technical background
Common text classification techniques divide into methods based on a sentiment dictionary, methods based on machine learning, and methods based on deep learning. The main application characteristics of these methods are as follows:
1) Text classification methods based on a sentiment dictionary
Methods based on a sentiment dictionary use existing semantic dictionary resources to construct a domain lexicon, then compare the positive and negative sentiment words contained in the text and mark positive and negative integer values as sentiment values, while also considering the influence of special part-of-speech rules and syntactic structures, such as negation, progressive sentences and adversative sentences, on sentiment judgment.
Text classification based on a sentiment dictionary is easy to implement, but it requires a fairly large sentiment dictionary, and it is a linear model of limited capacity.
2) Text classification methods based on machine learning
Text classification based on machine learning hinges on three elements: feature selection, feature weight quantization, and the classification model. Feature selection methods include those based on information gain and on document frequency. Feature weighting schemes include term frequency, inverse document frequency, TF-IDF, and entropy weighting. Classifier models include naive Bayes, k-nearest neighbors, support vector machines, neural networks, decision trees, and so on. Three of these classifier models are described below.
(a) The support vector machine method
Classification problems may be linearly inseparable. The support vector machine method maps a sample set that is linearly inseparable in a low-dimensional space into a high-dimensional feature space where it becomes linearly separable, and finds in that high-dimensional space the optimal hyperplane that maximizes the class margin, finally classifying the samples.
Support vector machines nevertheless have some disadvantages:
First, the algorithm is difficult to apply to large-scale training samples, because it solves a quadratic programming problem whose solution involves computations on an m-th order matrix, where m is the number of samples. When m is large, storing and manipulating this matrix consumes a great deal of machine memory and computation time.
Second, multi-class problems are difficult to solve with support vector machines: the classical support vector machine algorithm only provides a two-class algorithm.
(b) The decision tree method
A decision tree is a tree structure usable for classification, composed of nodes and branches. Decision tree learning essentially induces a set of classification rules from the training data. The algorithm typically selects an optimal feature recursively and splits the training set by that feature, so that each sub-data set obtains a best classification.
Decision trees also have some disadvantages:
First, decision tree algorithms overfit very easily, so generalization is weak;
Second, a small change in the samples can cause a drastic change in the tree structure;
Third, some more complex relationships, such as exclusive-or, are difficult for a decision tree to learn.
(c) The neural network method
Neural network methods can also solve nonlinear classification problems. A neural network model generally contains a large number of interconnected neurons, each processing the weighted inputs of its neighboring neurons through an activation function. The model is trained on a large amount of data, adjusting the weights to optimize the value of a cost function so that classification performance is best, finally classifying the samples.
Neural networks also have some disadvantages:
First, when facing big data, features of the raw data must be extracted manually as input, whereas deep learning can select features of the raw data automatically;
Second, approximating complicated functions more accurately requires more hidden layers, which easily produces vanishing or exploding gradients;
Third, they cannot handle time series data (such as audio or text), because a plain neural network contains no time parameter.
3) Text classification methods based on deep learning
Deep learning refers to deep neural network models, generally network structures with three or more layers.
Some deep learning models also remedy deficiencies of plain neural network models. For example, the weight sharing of convolutional neural networks (CNN) greatly reduces the number of trained parameters, and recurrent neural networks (RNN) and long short-term memory networks (LSTM) can handle time series data. Two deep learning models are described below.
(a) The RNN method
A recurrent neural network is a neural network whose node connections form a directed cycle. In its network structure, the hidden layers of several simple neural networks are chained end to end along the time sequence.
However, during training of an RNN model, the back-propagated error can vanish or explode, so the model cannot establish dependencies over long time sequences; that is, an RNN model can only capture dependencies over short time sequences.
(b) The LSTM method
The LSTM model was proposed to overcome the inability of RNN models to establish long-time-sequence dependencies. Compared with an RNN network, it mainly differs in the internal structure of the hidden layer: the hidden layer of a standard RNN has only one activation function, while the hidden layer of an LSTM has a more complex network structure that also contains three gates, the input gate, the forget gate and the output gate.
However, LSTM can only avoid the vanishing gradient of RNN; it cannot combat the exploding gradient problem.
Summary of the invention
The object of the present invention is to overcome the above deficiencies of the prior art and to provide a word vector model based on point mutual information and a text classification method based on CNN.
The object of the present invention is achieved through the following technical solutions.
A Chinese text classification method based on a global point-mutual-information word vector model, comprising: (S1) training the word vector model by the global word vector method based on point mutual information;
(S2) determining the word vector matrix of the text according to the trained word vector model;
(S3) extracting features from the word vector matrix by a convolutional neural network (CNN) and training a classification model; (S4) extracting features of the input text according to the trained word vector model and the CNN feature extraction model;
(S5) computing, from the text features produced by the CNN feature extraction model, the mapping distance between the text and each preset category by softmax and cross entropy, and taking the nearest category as the category of the text.
Further, step (S1) specifically includes:
(1) inputting the Chinese Wikipedia data set and pre-processing the data, removing punctuation marks and spaces;
(2) performing word segmentation on the data set obtained in step (1) with a Chinese word segmentation tool, converting the corpus data into word sequences;
(3) counting word frequencies of the words obtained in step (2) and saving the statistics to disk in the format "word \t frequency";
(4) counting co-occurrences of the words obtained in step (2): the corpus is traversed with the configured window size, the number of co-occurrences of every two words within the window is obtained, and the result is saved to disk as triples in the format "word 1 \t word 2 \t co-occurrence count";
(5) shuffling the triples obtained in step (4) at random and saving the shuffled triples to disk in the format "word 1 \t word 2 \t co-occurrence count";
(6) randomly initializing a word vector for every word appearing in step (2) and keeping the vectors in memory, where the program can conveniently read and modify them;
(7) traversing all the triples obtained in step (5) and adjusting the word vectors by gradient descent according to the objective function

$$J = \sum_{i,k \in V}\left(w_i^{T}\tilde{w}_k - \mathrm{PMI}_{ik}\right)^{2}$$

where $w_i$ and $\tilde{w}_k$ are the word vectors of the center word and the context word respectively, and $V$ denotes all word vectors in the vocabulary;
(8) repeating step (7) until the result converges, which yields the word vectors based on point mutual information; the word vectors in memory are then saved to disk in the format "word \t word vector". (A Python sketch of steps (2) to (8) follows below.)
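As a concrete illustration of steps (2) to (8), the following minimal Python sketch builds the windowed co-occurrence counts, converts them to point mutual information, and fits center and context vectors by stochastic gradient descent on the objective above. It is a simplified reconstruction, not the patented implementation: the toy corpus, window size, dimension, learning rate and epoch count are placeholder assumptions, and saving to disk and the convergence test are omitted.

```python
import math
import random
from collections import Counter

# Toy segmented corpus standing in for the Chinese Wikipedia data (assumption).
corpus = [["我们", "训练", "词", "向量"], ["词", "向量", "表示", "语义"]]
window, dim, lr, epochs = 2, 10, 0.05, 50  # placeholder hyper-parameters

# Steps (3)-(4): word frequencies and windowed co-occurrence counts.
freq = Counter(w for sent in corpus for w in sent)
cooc = Counter()
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                cooc[(w, sent[j])] += 1

total = sum(freq.values())
pair_total = sum(cooc.values())

def pmi(a, b):
    # PMI_ik = log( P(w_i, w~_k) / (P(w_i) * P(w~_k)) )
    return math.log((cooc[(a, b)] / pair_total)
                    / ((freq[a] / total) * (freq[b] / total)))

# Step (6): random initialization of center (W) and context (C) vectors.
W = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in freq}
C = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in freq}

# Steps (5), (7), (8): shuffle the triples, then gradient descent on
# J = sum over pairs of (w_i . w~_k - PMI_ik)^2, for a fixed epoch count.
triples = list(cooc.items())
for _ in range(epochs):
    random.shuffle(triples)
    for (a, b), _count in triples:
        err = sum(x * y for x, y in zip(W[a], C[b])) - pmi(a, b)
        for d in range(dim):
            wa, cb = W[a][d], C[b][d]
            W[a][d] -= lr * 2 * err * cb
            C[b][d] -= lr * 2 * err * wa
```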
Further, in steps (5) to (8), for two words $w_i$ and $w_j$ that possess similar contexts, the relationship between $w_i$ and $w_j$ can be embodied through their relationship with a third word $\tilde{w}_k$; the relationship between $w_i$ and $w_j$ is modeled as:

$$F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}} \quad (1)$$

In this equation, $w_i$ and $w_j$ denote the two center words possessing similar contexts, $\tilde{w}_k$ is a context word vector, and $P_{ik}$ is the probability that $w_i$ and $\tilde{w}_k$ occur together.
The ratio on the right of the equation is the model output and represents the relationship between the words to be predicted. On the premise of keeping the output of the initial model unchanged, the input of the initial model is simplified so that an optimizable objective function can be established. Considering that the vector space has an intrinsic linear structure, the input function is restricted to depend only on the difference of the two center word vectors, which gives the following formula:

$$F(w_i - w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$$

Since the right side of the equation is a scalar, the input vectors could be converted into a scalar by some complicated linear or nonlinear transformation; but this would undoubtedly increase the complexity of the model and damage its linear structure. To avoid this situation, the vector operation is carried out in the form of a dot product to portray the relationship between the two words, as in the following formula:

$$F\big((w_i - w_j)^{T}\tilde{w}_k\big) = \frac{P_{ik}}{P_{jk}}$$

In order to convert the left side of the equation into the form of a ratio as well, combined with the continuity condition, the general solution of the functional equation on the left side is $F(x) = e^{ax}$; considering that the norms of the word vectors can be normalized, $F(x) = e^{x}$ is taken directly, so that:

$$\frac{e^{w_i^{T}\tilde{w}_k}}{e^{w_j^{T}\tilde{w}_k}} = \frac{P_{ik}}{P_{jk}} \quad (2)$$

Letting the numerators and the denominators of formulas (1) and (2) be equal to each other, it can be obtained that:

$$e^{w_i^{T}\tilde{w}_k} = P_{ik}$$

that is:

$$w_i^{T}\tilde{w}_k = \log P_{ik}$$

The objective function of the GloVe model is modified accordingly, replacing the co-occurrence count $X_{ik}$ in the original objective with the point mutual information

$$\mathrm{PMI}_{ik} = \log\frac{P(w_i, \tilde{w}_k)}{P(w_i)\,P(\tilde{w}_k)}$$

The objective function of the GloVe model:

$$J = \sum_{i,k \in V} f(X_{ik})\left(w_i^{T}\tilde{w}_k + b_i + \tilde{b}_k - \log X_{ik}\right)^{2}$$

The final objective function of the word vector model:

$$J = \sum_{i,k \in V}\left(w_i^{T}\tilde{w}_k - \mathrm{PMI}_{ik}\right)^{2}$$
After training by gradient descent, the word vector of the target word is obtained as $w_i$; repeating the above operation for each word yields the word vectors of all words.
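For concreteness, the gradients used in the descent step follow routinely from the objective above; they are supplied here as an aid and are not written out in the original text:

$$\frac{\partial J}{\partial w_i} = 2\sum_{k \in V}\left(w_i^{T}\tilde{w}_k - \mathrm{PMI}_{ik}\right)\tilde{w}_k, \qquad \frac{\partial J}{\partial \tilde{w}_k} = 2\sum_{i \in V}\left(w_i^{T}\tilde{w}_k - \mathrm{PMI}_{ik}\right)w_i$$

so that each update moves $w_i \leftarrow w_i - \eta\,\partial J/\partial w_i$ for a learning rate $\eta$.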
Further, step (S2) specifically includes:
Based on the word vector data stored by the global point-mutual-information word vector model, the word vector $w \in R^{d \times 1}$ corresponding to each word in the training corpus is found by matching (a word vector is a one-dimensional vector of length $d$), and these word vectors are combined, in the order of the original sentence $(w_1, w_2, w_3, \dots, w_s)$, into the sentence matrix $S_0$ ($S_0 \in R^{d \times s}$), where $d$ is the dimension of the word vectors and $s$ is the word count of the longest sentence in the corpus, i.e. the sentence length. (A small Python sketch of this lookup-and-padding step follows below.)
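As an illustration of step (S2), the following sketch, assuming NumPy and a hypothetical dictionary of trained word vectors, looks up each word of a sentence and zero-pads the result into the d × s sentence matrix $S_0$:

```python
import numpy as np

def sentence_matrix(words, vectors, d, s):
    """Stack the trained d-dimensional vectors of `words` column-wise into a
    d x s matrix S0; sentences shorter than s are completed by zero padding."""
    S0 = np.zeros((d, s))
    for col, w in enumerate(words[:s]):
        if w in vectors:          # out-of-vocabulary words stay zero columns
            S0[:, col] = vectors[w]
    return S0

# Hypothetical usage with placeholder vectors:
vecs = {"词": np.ones(4), "向量": np.full(4, 0.5)}
S0 = sentence_matrix(["词", "向量"], vecs, d=4, s=6)  # shape (4, 6)
```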
Further, step (S3) specifically includes:
(1) representing the classification labels in the Chinese Wikipedia data set as Boolean vectors $y_j \in Y$ ($j \in \{1, 2, \dots, l\}$), where $y_j$ is the label vector of the $j$-th class, $Y$ is the set of classes, $l$ is the total number of classes, and the dimension of $y_j$ is $l$;
(2) based on the obtained sentence matrix $S_0$ ($S_0 \in R^{d \times s}$), performing convolution with kernels of preferred size 3 × 2 to obtain the feature matrix $S_1$;
(3) applying 2 × 2 max pooling to the obtained feature matrix $S_1$, extracting the maximum of each 2 × 2 block and recombining the maxima into a new feature matrix $S_2$;
(4) repeating steps (2) and (3) until the feature matrix $S_n$ (where $n$ is the total number of convolution and pooling operations performed) contains only $l$ numbers;
(5) unfolding the feature matrix $S_n$ into a one-dimensional vector $\bar{y}$ of length $l$, then computing the probability $\hat{y}_i$ of text $i$ in each dimension by the softmax function, and finally computing by the cross-entropy function the distance $d_i$ between $\hat{y}_i$ and the correct category $y_i$ (i.e. the label vector of the class to which the text itself belongs);
(6) accumulating the $d_i$ and training the model with the objective of minimizing $\frac{1}{M}\sum_{i=1}^{M} d_i$ (where $M$ is the number of texts in the whole training corpus), and saving the parameters of the resulting CNN classification model. (A PyTorch training sketch follows below.)
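The sketch below illustrates steps (2) to (6) in PyTorch, an implementation choice not named by the patent. The sizes d = 20, s = 18, l = 9 are assumptions chosen so that two rounds of 3 × 2 convolution and 2 × 2 max pooling end in exactly l values; the ReLU activation stands in for the unspecified activation function f(·), and the random batch and SGD hyper-parameters are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, s, l = 20, 18, 9  # assumed sizes: (20, 18) -> (9, 8) -> (3, 3) = 9 values

class CNNClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 1, kernel_size=(3, 2))  # step (2): 3x2 kernel
        self.conv2 = nn.Conv2d(1, 1, kernel_size=(3, 2))
        self.pool = nn.MaxPool2d(2)                       # step (3): 2x2 max pool

    def forward(self, x):                         # x: (batch, 1, d, s)
        x = self.pool(torch.relu(self.conv1(x)))  # S1 -> S2
        x = self.pool(torch.relu(self.conv2(x)))  # step (4): repeat (2)-(3)
        return x.flatten(1)                       # step (5): length-l vector

model = CNNClassifier()
optim = torch.optim.SGD(model.parameters(), lr=0.1)

X = torch.randn(4, 1, d, s)    # placeholder batch of sentence matrices
y = torch.randint(0, l, (4,))  # placeholder class indices

for _ in range(100):           # step (6): minimize the accumulated distance
    optim.zero_grad()
    loss = F.cross_entropy(model(X), y)  # softmax + cross entropy of step (5)
    loss.backward()
    optim.step()
```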
Further, the sentence matrix $S_0$ is convolved with the kernel parameters $(W, b)$ to obtain the feature matrix $S_1$:

$$S_1 = f(S_0 \cdot W + b) \quad (3)$$

where $W$ is the convolution kernel parameter matrix, $b$ is the bias vector, and $f(\cdot)$ denotes the activation function. Based on the obtained feature matrix $S_1$, pooling is computed by the max pooling method:

$$S_2 = \mathrm{downsample}(S_1) \quad (4)$$

where $\mathrm{downsample}(\cdot)$ denotes the pooling function. Repeating the computations of formulas (3) and (4) yields the final feature matrix $S_n$, which is unfolded into a one-dimensional vector $\bar{y}$ of length $l$; the probability vector $\hat{y}_i$ of the text in each class is computed by the softmax function

$$\hat{y}_{ik} = \frac{e^{\bar{y}_{ik}}}{\sum_{k'=1}^{l} e^{\bar{y}_{ik'}}}$$

and finally the distance $d_i$ between $\hat{y}_i$ and $y_i$ is computed by the cross-entropy function

$$d_i = -\sum_{k=1}^{l} y_{ik} \log \hat{y}_{ik}$$

where $\bar{y}_{ik}$ is the $k$-th value ($1 \le k \le l$) of the one-dimensional vector $\bar{y}$ and $\hat{y}_{ik}$ is the probability of text $i$ in the $k$-th dimension of the category vector. The final objective function is:

$$\mathrm{Loss} = d_i$$

The method computes the parameters $(W, b)$ by gradient descent so as to minimize the loss, and the final $(W, b)$ are saved as model parameters for use when classifying texts. (A NumPy illustration of the softmax and cross-entropy computation follows below.)
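A small NumPy illustration of the softmax and cross-entropy computations just defined; the numbers are arbitrary:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())            # shifted for numerical stability
    return e / e.sum()

def cross_entropy_distance(y_bar, y_true):
    """d_i = -sum_k y_ik * log(y_hat_ik), with y_hat = softmax(y_bar)."""
    return -np.sum(y_true * np.log(softmax(y_bar)))

y_bar = np.array([2.0, 0.5, -1.0])   # flattened CNN output, l = 3
y_true = np.array([1.0, 0.0, 0.0])   # Boolean label vector of the true class
d_i = cross_entropy_distance(y_bar, y_true)
```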
Further, step (S5) specifically includes:
(1) pre-processing the text to be classified in the same way as the training samples, computing its sentence matrix $S'_0$ ($S'_0 \in R^{d \times s}$), and performing convolution with kernels of preferred size 3 × 2 to obtain the feature matrix $S'_1$;
(2) applying 2 × 2 max pooling to the obtained feature matrix $S'_1$, extracting the maximum of each 2 × 2 block and recombining the maxima into a new feature matrix $S'_2$;
(3) repeating the convolution and pooling of steps (1) and (2) until the feature matrix $S'_n$ likewise contains only $l$ numbers;
(4) unfolding the feature matrix $S'_n$ into a one-dimensional vector $\bar{y}'$ of length $l$, then computing the probability vector $\hat{y}'$ of the text in each class by the softmax function, and finally computing by the cross-entropy function the distance $d'_j$ between $\hat{y}'$ and each class label $y_j$;
(5) finally, finding among the $l$ obtained distances $d'_j$ the class corresponding to the smallest distance, which is the class label of the text: $\mathrm{label}' = \arg\min_j(d'_j)$. (A numeric illustration follows below.)
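A numeric illustration of step (5): because the labels $y_j$ are one-hot, the cross-entropy distance reduces to $d'_j = -\log\hat{y}'_j$, so the class of minimum distance is exactly the class of maximum predicted probability. The values below are arbitrary:

```python
import numpy as np

def predict_label(y_hat, labels):
    """label' = argmin_j d'_j over the one-hot class labels y_j."""
    d = [-np.sum(lab * np.log(y_hat)) for lab in labels]
    return int(np.argmin(d))

y_hat = np.array([0.1, 0.7, 0.2])         # softmax output for l = 3 classes
labels = list(np.eye(3))                  # one-hot label vectors y_j
assert predict_label(y_hat, labels) == 1  # same result as np.argmax(y_hat)
```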
Compared with the prior art, the present invention has the following advantages and technical effects:
The method fully extracts the contextual semantic information and local features of the text, introducing the convolutional neural network (CNN) method of deep learning on top of the original classification methods. CNN was originally applied to feature extraction in image processing and has excellent local-information extraction ability, which makes it well suited as the feature extraction method for the word vector matrix of a text. The method first remedies the shortcomings of GloVe word vectors in semantic capture and co-occurrence statistics: the global point mutual information matrix is introduced into the word vector matrix, making the statistical information of the word vectors more accurate, while the objective function of the GloVe model is improved and the truncation operation removed, reducing the complexity of model training. In addition, combined with the CNN feature extraction model, the key information of a text can be found effectively and its meaning determined accurately. The present invention can accurately mine the classification features of texts, is suitable for text classification in various fields, and has great practical value.
Description of the drawings
Fig. 1 is the flow chart of the word vector model based on point mutual information and the text classification method based on CNN.
Fig. 2 is the flow chart of the word vector training method based on point mutual information.
Fig. 3 is the training flow chart of the text classification model based on CNN.
Specific embodiments
The solution of the present invention has been explained in sufficient detail in the summary above. The specific implementation of the invention is described in detail below with reference to the accompanying drawings and specific embodiments, but the implementation of the invention is not limited thereto. Note that processes or symbols not described in special detail below, such as the conventional parameters W and b of a CNN neural network, can be understood or realized by those skilled in the art with reference to the prior art and the existing theory of CNNs, and are not repeated below.
Referring to Fig. 1, the word vector model based on point mutual information and the text classification method based on CNN in this example comprise:
(S1) training the word vector model by the global word vector method based on point mutual information;
(S2) determining the word vector matrix of the text according to the trained word vector model;
(S3) extracting features from the word vector matrix by a convolutional neural network (CNN) and training a classification model;
(S4) extracting features of the input text according to the trained word vector model and the CNN feature extraction model;
(S5) computing, from the text features produced by the CNN feature extraction model, the mapping distance between the text and each preset category by softmax and cross entropy, and taking the nearest category as the category of the text.
1. Training word vectors based on point mutual information
For a language model that trains word vectors from statistical information, the key to model training is how to portray the relations between words with comprehensive and accurate information. The present invention therefore improves the GloVe model: derivation shows that the matrix of point mutual information between words portrays the statistical relations between words better. Referring to Fig. 2, the specific technical solution is as follows:
The word vector training method based on point mutual information comprises the following steps:
(1) inputting the Chinese Wikipedia data set and pre-processing the data, removing punctuation marks and spaces;
(2) performing word segmentation on the data set obtained in step (1) with a Chinese word segmentation tool, converting the corpus data into word sequences;
(3) counting word frequencies of the words obtained in step (2) and saving the statistics to disk in the format "word \t frequency";
(4) counting co-occurrences of the words obtained in step (2): the corpus is traversed with the pre-configured window size, the number of co-occurrences of every two words within the window is obtained, and the result is saved to disk as triples in the format "word 1 \t word 2 \t co-occurrence count";
(5) shuffling the triples obtained in step (4) at random and saving the shuffled triples to disk in the format "word 1 \t word 2 \t co-occurrence count";
(6) randomly initializing a word vector for every word appearing in step (2) and keeping the vectors in memory, where the program can conveniently read and modify them;
(7) traversing all the triples obtained in step (5) and adjusting the word vectors by gradient descent according to the objective function

$$J = \sum_{i,k \in V}\left(w_i^{T}\tilde{w}_k - \mathrm{PMI}_{ik}\right)^{2}$$

where $w_i$ and $\tilde{w}_k$ are the word vectors of the center word and the context word respectively, and $V$ denotes all word vectors in the vocabulary;
(8) repeating step (7) until the result converges, which yields the word vectors based on point mutual information; the word vectors in memory are then saved to disk in the format "word \t word vector".
In the above word vector training method based on point mutual information, in steps (5) to (8), the method starts from the hypothesis that "for two words $w_i$ and $w_j$ that possess similar contexts, the relationship between $w_i$ and $w_j$ can be embodied through their relationship with a third word $\tilde{w}_k$", and models the relationship between $w_i$ and $w_j$ as:

$$F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}} \quad (1)$$

The ratio on the right of the equation is the model output and represents the relationship between the words to be predicted. On the premise of keeping the model output unchanged, the input of the model is simplified so that an optimizable objective function can be established. Considering that the vector space has an intrinsic linear structure, the input function is restricted to depend only on the difference of the two center word vectors, which gives:

$$F(w_i - w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$$

Since the right side of the equation is a scalar, the input vectors could be converted into a scalar by some complicated linear or nonlinear transformation; but this would undoubtedly increase the complexity of the model and damage its linear structure. To avoid this situation, the vector operation is carried out in the form of a dot product to portray the relationship between the two words, as in the following formula:

$$F\big((w_i - w_j)^{T}\tilde{w}_k\big) = \frac{P_{ik}}{P_{jk}}$$

In order to convert the left side of the equation into the form of a ratio as well, combined with the continuity condition, the general solution of the functional equation on the left side is $F(x) = e^{ax}$; considering that the norms of the word vectors can be normalized, $F(x) = e^{x}$ may be taken directly, so that:

$$\frac{e^{w_i^{T}\tilde{w}_k}}{e^{w_j^{T}\tilde{w}_k}} = \frac{P_{ik}}{P_{jk}} \quad (2)$$

Letting the numerators and the denominators of formulas (1) and (2) be equal to each other, it can be obtained that:

$$e^{w_i^{T}\tilde{w}_k} = P_{ik}$$

that is:

$$w_i^{T}\tilde{w}_k = \log P_{ik}$$

The objective function of the GloVe model is therefore modified, replacing the co-occurrence count $X_{ik}$ in the original objective with the point mutual information $\mathrm{PMI}_{ik}$. The objective function of the GloVe model:

$$J = \sum_{i,k \in V} f(X_{ik})\left(w_i^{T}\tilde{w}_k + b_i + \tilde{b}_k - \log X_{ik}\right)^{2}$$

The objective function of this method:

$$J = \sum_{i,k \in V}\left(w_i^{T}\tilde{w}_k - \mathrm{PMI}_{ik}\right)^{2}$$

It can be seen from the above formulas that the point mutual information in this model takes the occurrence probabilities of the two words into account in the denominator, so it is not disturbed by high-frequency words and portrays the relations between words better, making the trained word vectors reflect word meaning more accurately. In addition, comparing the objective functions of this model and of the GloVe model, the objective function defined by this method is clearly simpler in form and eliminates the truncation function, so the amount of computation can be reduced effectively.
After the objective function is determined, the model can carry out the word vector training process. First, by traversing the corpus, the point mutual information (PMI) matrix between words is obtained statistically. After the PMI matrix is obtained, the model traverses the values in the matrix while training the word vectors by gradient descent. Through continuous iteration, the correct word vector representation is finally obtained.
Finally, through the above steps, every text in the data set is represented as a d × s word vector matrix, where s is the word count of the longest text in the text set to be classified and d is the dimension of each word vector; sentences shorter than s are completed by a "zero padding" operation.
2. Establishing the text classification model based on CNN
Based on the text representation matrices obtained above, the method trains the classification model on the collected labelled corpus. Referring to Fig. 3, the specific steps are as follows:
(1) the classification labels in the data set (such as the Tan Songbo data set) are vectorized as $y_j \in Y$, where $y_j$ is the vector representation of class $j$ (the dimension of $y_j$ is $l$) and $Y$ is the set of classes;
(2) based on the obtained sentence matrix $S_0$, convolution is performed with kernels of preferred size 3 × 2 to obtain the feature matrix $S_1$;
(3) 2 × 2 max pooling is applied to the obtained feature matrix $S_1$, extracting the maximum of each 2 × 2 block and recombining the maxima into a new feature matrix $S_2$;
(4) steps (2) and (3) are repeated until the feature matrix $S_n$ contains only $l$ values;
(5) the feature matrix $S_n$ is unfolded into a one-dimensional vector $\bar{y}$ of length $l$, then the probability vector $\hat{y}_i$ of the text in each class is computed by the softmax function, and finally the distance $d_i$ between $\hat{y}_i$ and $y_i$ is computed by the cross-entropy function;
(6) the $d_i$ are accumulated, and model training is carried out with the objective of minimizing the accumulated $d_i$; the model parameters are then saved. In Fig. 3, n is the iteration counter and N is the configured maximum number of iterations.
In the training of the above CNN classification model, the sentence matrix $S_0$ is convolved with the kernel parameters $(W, b)$ to obtain the feature matrix $S_1$:

$$S_1 = f(S_0 \cdot W + b) \quad (3)$$

where $f(\cdot)$ denotes the activation function. Based on the obtained feature matrix $S_1$, pooling is computed by the max pooling method:

$$S_2 = \mathrm{downsample}(S_1) \quad (4)$$

where $\mathrm{downsample}(\cdot)$ denotes the pooling function. Repeating the computations of formulas (3) and (4) yields the final feature matrix $S_n$, which is unfolded into a one-dimensional vector $\bar{y}$ of length $l$; the probability vector $\hat{y}_i$ of the text in each class is computed by the softmax function

$$\hat{y}_{ik} = \frac{e^{\bar{y}_{ik}}}{\sum_{k'=1}^{l} e^{\bar{y}_{ik'}}}$$

and finally the distance $d_i$ between $\hat{y}_i$ and $y_i$ is computed by the cross-entropy function

$$d_i = -\sum_{k=1}^{l} y_{ik} \log \hat{y}_{ik}$$

where $\bar{y}_{ik}$ is the $k$-th value ($1 \le k \le l$) of the one-dimensional vector $\bar{y}$ and $\hat{y}_{ik}$ is the probability of text $i$ in the $k$-th dimension of the category vector. The final objective function is:

$$\mathrm{Loss} = d_i$$

The method computes the parameters $(W, b)$ by gradient descent so as to minimize the loss.
3. Text classification
Based on the obtained word vector model and CNN classification model, texts can be classified. The detailed process is as follows:
(1) the sentence matrix $S_0$ is computed from the text to be classified, and convolution is performed with kernels of preferred size 3 × 2 to obtain the feature matrix $S_1$;
(2) 2 × 2 max pooling is applied to the obtained feature matrix $S_1$, extracting the maximum of each 2 × 2 block and recombining the maxima into a new feature matrix $S_2$;
(3) the convolution and pooling of steps (1) and (2) are repeated until the feature matrix $S_n$ contains only $l$ values;
(4) the feature matrix $S_n$ is unfolded into a one-dimensional vector $\bar{y}$ of length $l$, then the probability vector $\hat{y}$ of the text in each class is computed by the softmax function, and finally the distance $d_j$ between $\hat{y}$ and each label $y_j$ is computed by the cross-entropy function;
(5) based on the obtained distances $d_j$, the class with the nearest distance is selected as the category label of the text by minimizing:

$$\mathrm{Label} = \arg\min_j(d_j)$$

(An end-to-end inference sketch follows below.)
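Tying the pieces together, the following sketch shows the inference of part 3 end to end. It assumes objects from the earlier sketches: a word segmenter, the dictionary of trained word vectors, and the trained CNN classifier; all names are illustrative rather than taken from the patent.

```python
import numpy as np
import torch

def classify(text, segment, vectors, model, d, s):
    """Segment `text`, build its d x s sentence matrix, run the trained CNN,
    and return the class with the minimum cross-entropy distance, which for
    one-hot labels is the class of maximum softmax probability."""
    words = segment(text)
    S0 = np.zeros((d, s))                     # step (S2) with zero padding
    for col, w in enumerate(words[:s]):
        if w in vectors:
            S0[:, col] = vectors[w]
    x = torch.tensor(S0, dtype=torch.float32).reshape(1, 1, d, s)
    with torch.no_grad():                     # steps (1)-(4): conv + pool
        probs = torch.softmax(model(x), dim=1)[0]
    return int(torch.argmax(probs))           # step (5): nearest class
```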
This method fully extracts the contextual semantic information and local features of the text, introducing the convolutional neural network (CNN) method of deep learning on top of the original classification methods. CNN was originally applied to feature extraction in image processing and has excellent local-information extraction ability, which makes it well suited as the feature extraction method for the word vector matrix of a text. The method first remedies the shortcomings of GloVe word vectors in semantic capture and co-occurrence statistics: the global point mutual information matrix is introduced into the word vector matrix, making the statistical information of the word vectors more accurate, while the objective function of the GloVe model is improved and the truncation operation removed, reducing the complexity of model training. In addition, combined with the CNN feature extraction model, the key information of a text can be found effectively and its meaning determined accurately.

Claims (7)

1. A word vector model based on point mutual information and a text classification method based on CNN, characterized by comprising: (S1) training the word vector model by the global word vector method based on point mutual information;
(S2) determining the word vector matrix of the text according to the trained word vector model;
(S3) extracting features from the word vector matrix by a convolutional neural network (CNN) and training a classification model; (S4) extracting features of the input text according to the trained word vector model and the CNN feature extraction model;
(S5) computing, from the text features produced by the CNN feature extraction model, the mapping distance between the text and each preset category by softmax and cross entropy, and taking the nearest category as the category of the text.
2. The word vector model based on point mutual information and text classification method based on CNN according to claim 1, characterized in that step (S1) specifically includes:
(1) inputting the Chinese Wikipedia data set and pre-processing the data, removing punctuation marks and spaces;
(2) performing word segmentation on the data set obtained in step (1) with a Chinese word segmentation tool, converting the corpus data into word sequences;
(3) counting word frequencies of the words obtained in step (2) and saving the statistics to disk;
(4) counting co-occurrences of the words obtained in step (2): the corpus is traversed with the configured window size, the number of co-occurrences of every two words within the window is obtained, and the result is saved to disk in the form of triples;
(5) shuffling the triples obtained in step (4) at random and saving the shuffled triples to disk;
(6) randomly initializing a word vector for every word appearing in step (2) and keeping the vectors in memory, where the program can conveniently read and modify them;
(7) traversing all the triples obtained in step (5) and adjusting the word vectors by gradient descent according to the objective function

$$J = \sum_{i,k \in V}\left(w_i^{T}\tilde{w}_k - \mathrm{PMI}_{ik}\right)^{2}$$

where $w_i$ and $\tilde{w}_k$ are the word vectors of the center word and the context word respectively, and $V$ denotes all word vectors in the vocabulary;
(8) repeating step (7) until the result converges, which yields the word vectors based on point mutual information; the word vectors in memory are then saved to disk.
3. The word vector model based on point mutual information and text classification method based on CNN according to claim 1, characterized in that, in steps (5) to (8), for two words $w_i$ and $w_j$ that possess similar contexts, the relationship between $w_i$ and $w_j$ can be embodied through their relationship with a third word $\tilde{w}_k$, and the relationship between $w_i$ and $w_j$ is modeled as:

$$F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}} \quad (1)$$

In this equation, $w_i$ and $w_j$ denote the two center words possessing similar contexts, $\tilde{w}_k$ is a context word vector, and $P_{ik}$ is the probability that $w_i$ and $\tilde{w}_k$ occur together;
The ratio on the right of the equation is the model output and represents the relationship between the words to be predicted; on the premise of keeping the output of the initial model unchanged, the input of the initial model is simplified so that an optimizable objective function can be established; considering that the vector space has an intrinsic linear structure, the input function is restricted to depend only on the difference of the two center word vectors, which gives the following formula:

$$F(w_i - w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$$

Since the right side of the equation is a scalar, the input vectors could be converted into a scalar by some complicated linear or nonlinear transformation; instead, the vector operation is carried out in the form of a dot product to portray the relationship between the two words, as in the following formula:

$$F\big((w_i - w_j)^{T}\tilde{w}_k\big) = \frac{P_{ik}}{P_{jk}}$$

In order to convert the left side of the equation into the form of a ratio as well, combined with the continuity condition, the general solution of the functional equation on the left side is $F(x) = e^{ax}$; considering that the norms of the word vectors can be normalized, $F(x) = e^{x}$ is taken directly, so that:

$$\frac{e^{w_i^{T}\tilde{w}_k}}{e^{w_j^{T}\tilde{w}_k}} = \frac{P_{ik}}{P_{jk}} \quad (2)$$

Letting the numerators and the denominators of formulas (1) and (2) be equal to each other, it can be obtained that:

$$e^{w_i^{T}\tilde{w}_k} = P_{ik}$$

that is:

$$w_i^{T}\tilde{w}_k = \log P_{ik}$$

The objective function of the GloVe model is modified accordingly, replacing the co-occurrence count $X_{ik}$ in the original objective with the point mutual information $\mathrm{PMI}_{ik} = \log\frac{P(w_i,\tilde{w}_k)}{P(w_i)\,P(\tilde{w}_k)}$, i.e. the objective function of the GloVe model:

$$J = \sum_{i,k \in V} f(X_{ik})\left(w_i^{T}\tilde{w}_k + b_i + \tilde{b}_k - \log X_{ik}\right)^{2}$$

becomes the final objective function of the word vector model:

$$J = \sum_{i,k \in V}\left(w_i^{T}\tilde{w}_k - \mathrm{PMI}_{ik}\right)^{2}$$

After training by gradient descent, the word vector of the target word is obtained as $w_i$; repeating the above operation for each word yields the word vectors of all words.
4. The word vector model based on point mutual information and text classification method based on CNN according to claim 1, characterized in that step (S2) specifically includes:
based on the word vector data stored by the global point-mutual-information word vector model, finding by matching the word vector $w \in R^{d \times 1}$ corresponding to each word in the training corpus, a word vector being a one-dimensional vector of length $d$, and combining these word vectors, in the order of the original sentence $(w_1, w_2, w_3, \dots, w_s)$, into the sentence matrix $S_0$ ($S_0 \in R^{d \times s}$), where $d$ is the dimension of the word vectors and $s$ is the word count of the longest sentence in the corpus, i.e. the sentence length.
5. The word vector model based on point mutual information and text classification method based on CNN according to claim 1, characterized in that step (S3) specifically includes:
(1) representing the classification labels in the Chinese Wikipedia data set as Boolean vectors $y_j \in Y$ ($j \in \{1, 2, \dots, l\}$), where $y_j$ is the label vector of the $j$-th class, $Y$ is the set of classes, $l$ is the total number of classes, and the dimension of $y_j$ is also $l$;
(2) based on the obtained sentence matrix $S_0$ ($S_0 \in R^{d \times s}$), performing convolution with kernels of preferred size 3 × 2 to obtain the feature matrix $S_1$;
(3) applying 2 × 2 max pooling to the obtained feature matrix $S_1$, extracting the maximum of each 2 × 2 block and recombining the maxima into a new feature matrix $S_2$;
(4) repeating steps (2) and (3) until the feature matrix $S_n$ contains only $l$ numbers, where $n$ is the total number of convolution and pooling operations performed;
(5) unfolding the feature matrix $S_n$ into a one-dimensional vector $\bar{y}$ of length $l$, then computing the probability $\hat{y}_i$ of text $i$ in each dimension by the softmax function, and finally computing by the cross-entropy function the distance $d_i$ between $\hat{y}_i$ and the correct category $y_i$, that is, the label vector of the class to which the text itself belongs;
(6) accumulating the $d_i$ and training the model with the objective of minimizing $\frac{1}{M}\sum_{i=1}^{M} d_i$, and saving the parameters of the resulting CNN classification model, where $M$ is the number of texts in the whole training corpus.
6. The word vector model based on point mutual information and text classification method based on CNN according to claim 5, characterized in that the sentence matrix $S_0$ is convolved with the kernel parameters $(W, b)$ to obtain the feature matrix $S_1$:

$$S_1 = f(S_0 \cdot W + b) \quad (3)$$

where $W$ is the convolution kernel parameter matrix, $b$ is the bias vector, and $f(\cdot)$ denotes the activation function; based on the obtained feature matrix $S_1$, pooling is computed by the max pooling method:

$$S_2 = \mathrm{downsample}(S_1) \quad (4)$$

where $\mathrm{downsample}(\cdot)$ denotes the pooling function; repeating the computations of formulas (3) and (4) yields the final feature matrix $S_n$, which is unfolded into a one-dimensional vector $\bar{y}$ of length $l$; the probability vector $\hat{y}_i$ of the text in each class is computed by the softmax function

$$\hat{y}_{ik} = \frac{e^{\bar{y}_{ik}}}{\sum_{k'=1}^{l} e^{\bar{y}_{ik'}}}$$

and the distance $d_i$ between $\hat{y}_i$ and $y_i$ by the cross-entropy function

$$d_i = -\sum_{k=1}^{l} y_{ik}\log\hat{y}_{ik}$$

where $\bar{y}_{ik}$ is the $k$-th value ($1 \le k \le l$) of the one-dimensional vector $\bar{y}$ and $\hat{y}_{ik}$ is the probability of text $i$ in the $k$-th dimension of the category vector; the final objective function is:

$$\mathrm{Loss} = d_i$$

The method computes the parameters $(W, b)$ by gradient descent so as to minimize the loss, and the final $(W, b)$ are saved as model parameters for use when classifying texts.
7. The word vector model based on point mutual information and text classification method based on CNN according to claim 1, characterized in that step (S5) specifically includes:
(1) pre-processing the text to be classified in the same way as the training samples, computing its sentence matrix $S'_0$ ($S'_0 \in R^{d \times s}$), and performing convolution with kernels of preferred size 3 × 2 to obtain the feature matrix $S'_1$;
(2) applying 2 × 2 max pooling to the obtained feature matrix $S'_1$, extracting the maximum of each 2 × 2 block and recombining the maxima into a new feature matrix $S'_2$;
(3) repeating the convolution and pooling of steps (1) and (2) until the feature matrix $S'_n$ likewise contains only $l$ numbers;
(4) unfolding the feature matrix $S'_n$ into a one-dimensional vector $\bar{y}'$ of length $l$, then computing the probability vector $\hat{y}'$ of the text in each class by the softmax function, and finally computing by the cross-entropy function the distance $d'_j$ between $\hat{y}'$ and each class label $y_j$;
(5) finally, finding among the $l$ obtained distances $d'_j$ the class corresponding to the smallest distance, which is the class label of the text: $\mathrm{label}' = \arg\min_j(d'_j)$.
CN201810938236.8A 2018-08-16 2018-08-16 Word vector model based on point mutual information and text classification method based on CNN Active CN109189925B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810938236.8A CN109189925B (en) 2018-08-16 2018-08-16 Word vector model based on point mutual information and text classification method based on CNN

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810938236.8A CN109189925B (en) 2018-08-16 2018-08-16 Word vector model based on point mutual information and text classification method based on CNN

Publications (2)

Publication Number Publication Date
CN109189925A true CN109189925A (en) 2019-01-11
CN109189925B CN109189925B (en) 2020-01-17

Family

ID=64918641

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810938236.8A Active CN109189925B (en) 2018-08-16 2018-08-16 Word vector model based on point mutual information and text classification method based on CNN

Country Status (1)

Country Link
CN (1) CN109189925B (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107301246A (en) * 2017-07-14 2017-10-27 河北工业大学 Chinese Text Categorization based on ultra-deep convolutional neural networks structural model
CN108399230A (en) * 2018-02-13 2018-08-14 上海大学 A kind of Chinese financial and economic news file classification method based on convolutional neural networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
郑毅 (Zheng Yi): "Research on Sentiment Analysis of Chinese Microblogs Based on a Sentiment Dictionary", China Master's Theses Full-text Database (Information Science and Technology) *

Cited By (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111610975A (en) * 2019-02-26 2020-09-01 深信服科技股份有限公司 Executable file type determination method, device, equipment and storage medium
CN110196909A (en) * 2019-05-14 2019-09-03 北京来也网络科技有限公司 Text denoising method and device based on intensified learning
CN110147449A (en) * 2019-05-27 2019-08-20 中国联合网络通信集团有限公司 File classification method and device
CN110289050A (en) * 2019-05-30 2019-09-27 湖南大学 A kind of drug based on figure convolution sum term vector-target interaction prediction method
CN110298391A (en) * 2019-06-12 2019-10-01 同济大学 A kind of iterative increment dialogue intention classification recognition methods based on small sample
CN110287319A (en) * 2019-06-13 2019-09-27 南京航空航天大学 Students' evaluation text analyzing method based on sentiment analysis technology
CN110287319B (en) * 2019-06-13 2021-06-15 南京航空航天大学 Student evaluation text analysis method based on emotion analysis technology
CN110287236A (en) * 2019-06-25 2019-09-27 平安科技(深圳)有限公司 A kind of data digging method based on interview information, system and terminal device
CN110287236B (en) * 2019-06-25 2024-03-19 平安科技(深圳)有限公司 Data mining method, system and terminal equipment based on interview information
CN110348497A (en) * 2019-06-28 2019-10-18 西安理工大学 A kind of document representation method based on the building of WT-GloVe term vector
CN110348497B (en) * 2019-06-28 2021-09-10 西安理工大学 Text representation method constructed based on WT-GloVe word vector
CN110659892A (en) * 2019-07-31 2020-01-07 林勇 Method and device for acquiring total price of article, computer equipment and storage medium
CN110472053A (en) * 2019-08-05 2019-11-19 广联达科技股份有限公司 A kind of automatic classification method and its system towards public resource bidding advertisement data
CN110598207A (en) * 2019-08-14 2019-12-20 华南师范大学 Word vector obtaining method and device and storage medium
CN110781662A (en) * 2019-10-21 2020-02-11 腾讯科技(深圳)有限公司 Method for determining point-to-point mutual information and related equipment
CN110750652A (en) * 2019-10-21 2020-02-04 广西大学 Story ending generation method combining context entity words and knowledge
CN110781662B (en) * 2019-10-21 2022-02-01 腾讯科技(深圳)有限公司 Method for determining point-to-point mutual information and related equipment
CN110955776A (en) * 2019-11-16 2020-04-03 中电科大数据研究院有限公司 Construction method of government affair text classification model
CN111159396A (en) * 2019-12-04 2020-05-15 中国电子科技集团公司第三十研究所 Method for establishing text data classification hierarchical model facing data sharing exchange
CN111159396B (en) * 2019-12-04 2022-04-22 中国电子科技集团公司第三十研究所 Method for establishing text data classification hierarchical model facing data sharing exchange
CN111259658A (en) * 2020-02-05 2020-06-09 中国科学院计算技术研究所 General text classification method and system based on category dense vector representation
CN111259658B (en) * 2020-02-05 2022-08-19 中国科学院计算技术研究所 General text classification method and system based on category dense vector representation
CN113495958A (en) * 2020-03-20 2021-10-12 北京沃东天骏信息技术有限公司 Text classification method and device
CN111611801A (en) * 2020-06-02 2020-09-01 腾讯科技(深圳)有限公司 Method, device, server and storage medium for identifying text region attribute
CN111881690B (en) * 2020-06-15 2024-03-29 华南师范大学 Word vector dynamic adjustment processing method, system, device and medium
CN111881690A (en) * 2020-06-15 2020-11-03 华南师范大学 Processing method, system, device and medium for dynamic adjustment of word vectors
CN111930892A (en) * 2020-08-07 2020-11-13 重庆邮电大学 Scientific and technological text classification method based on improved mutual information function
CN111930892B (en) * 2020-08-07 2023-09-29 重庆邮电大学 Scientific and technological text classification method based on improved mutual information function
WO2022116444A1 (en) * 2020-12-01 2022-06-09 平安科技(深圳)有限公司 Text classification method and apparatus, and computer device and medium
CN112612892B (en) * 2020-12-29 2022-11-01 达而观数据(成都)有限公司 Special field corpus model construction method, computer equipment and storage medium
CN112612892A (en) * 2020-12-29 2021-04-06 达而观数据(成都)有限公司 Special field corpus model construction method, computer equipment and storage medium
CN113011155B (en) * 2021-03-16 2023-09-05 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for text matching
CN113011155A (en) * 2021-03-16 2021-06-22 北京百度网讯科技有限公司 Method, apparatus, device, storage medium and program product for text matching
CN113095087A (en) * 2021-04-30 2021-07-09 哈尔滨理工大学 Chinese word sense disambiguation method based on graph convolution neural network
CN115828926A (en) * 2022-11-30 2023-03-21 华中科技大学 Construction quality hidden danger data mining model training method and mining system
CN115828926B (en) * 2022-11-30 2023-08-04 华中科技大学 Construction quality hidden danger data mining model training method and mining system

Also Published As

Publication number Publication date
CN109189925B (en) 2020-01-17

Similar Documents

Publication Publication Date Title
CN109189925A Word vector model based on point mutual information and text classification method based on CNN
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN110866117B (en) Short text classification method based on semantic enhancement and multi-level label embedding
Dhillon et al. Eigenwords: spectral word embeddings.
Wang et al. Research on Web text classification algorithm based on improved CNN and SVM
Gallant et al. Representing objects, relations, and sequences
CN111414461B (en) Intelligent question-answering method and system fusing knowledge base and user modeling
CN110222163A (en) A kind of intelligent answer method and system merging CNN and two-way LSTM
CN110765260A (en) Information recommendation method based on convolutional neural network and joint attention mechanism
CN111027595B (en) Double-stage semantic word vector generation method
CN111291556B (en) Chinese entity relation extraction method based on character and word feature fusion of entity meaning item
CN107895000B (en) Cross-domain semantic information retrieval method based on convolutional neural network
US20230169271A1 (en) System and methods for neural topic modeling using topic attention networks
Sokkhey et al. Development and optimization of deep belief networks applied for academic performance prediction with larger datasets
CN113806543B (en) Text classification method of gate control circulation unit based on residual jump connection
Bhende et al. Integrating multiclass light weighted BiLSTM model for classifying negative emotions
Rahman Robust and consistent estimation of word embedding for bangla language by fine-tuning word2vec model
CN111581364A (en) Chinese intelligent question-answer short text similarity calculation method oriented to medical field
Jeon et al. Dropout prediction over weeks in MOOCs via interpretable multi-layer representation learning
CN110705259A (en) Text matching method for capturing matching features in multiple granularities
Lin et al. Text classification feature extraction method based on deep learning for unbalanced data sets
Sadr et al. A novel deep learning method for textual sentiment analysis
CN114265936A (en) Method for realizing text mining of science and technology project
Gao et al. Chinese short text classification method based on word embedding and Long Short-Term Memory Neural Network
Fu et al. A hybrid algorithm for text classification based on CNN-BLSTM with attention

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20210706

Address after: 210012 4th floor, building C, Wanbo Science Park, 20 Fengxin Road, Yuhuatai District, Nanjing City, Jiangsu Province

Patentee after: NANJING SILICON INTELLIGENCE TECHNOLOGY Co.,Ltd.

Address before: Room 614-615, No.1, Lane 2277, Zuchongzhi Road, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai, 200120

Patentee before: Shanghai Airlines Intellectual Property Services Ltd.

Effective date of registration: 20210706

Address after: Room 614-615, No.1, Lane 2277, Zuchongzhi Road, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai, 200120

Patentee after: Shanghai Airlines Intellectual Property Services Ltd.

Address before: School of physics and telecommunication engineering, South China Normal University, No. 378, Waihuan West Road, Panyu District, Guangzhou City, Guangdong Province, 510006

Patentee before: SOUTH CHINA NORMAL University