CN107357895A - A processing method for text representation based on the bag-of-words model

Info

Publication number: CN107357895A
Application number: CN201710569638.0A
Authority: CN
Inventors: 姚念民; 牛世雄
Original assignee: Dalian University of Technology
Current assignee: Dalian University of Technology
Priority date: 2017-01-05
Filing date: 2017-07-14
Publication date: 2017-11-17
Anticipated expiration: 2037-07-14
Also published as: CN107357895B

Abstract

The invention belongs to the field of computer applications and discloses a processing method for text representation based on the bag-of-words model. The method preprocesses the collected text data set by word segmentation, stop-word removal, low-frequency-word removal and feature selection; represents the processed text with a vector space model; trains word vectors on the processed text with a neural network; and modifies the weights of the feature words of the bag-of-words model according to the similarity of the word vectors, obtaining a new text representation model. The method addresses the text representation problem and improves classification accuracy.

Description

A processing method for text representation based on the bag-of-words model

Technical field

The invention belongs to the field of computer applications, and in particular relates to a processing method for text representation based on the bag-of-words model.

Background art

Text processing is now widely used in many fields. In general, a text must be segmented into words, stripped of stop words and low-frequency words, and reduced by feature selection; the text is then represented, and finally classified. Research on text processing has progressed unevenly across countries. Compared with other countries, research in China started later and lags behind.

Word segmentation. English words are separated by spaces, which act as natural delimiters, so no segmentation is needed. When a computer processes Chinese text, however, the text must first be segmented: automatic word segmentation requires the computer to cut each sentence into reasonable words according to its meaning. Natural language processing takes the word as its smallest unit, so the accuracy of segmentation directly affects the quality of text classification.

Feature selection. If a text is represented by all the feature words it contains, the dimension of the feature space usually exceeds 100,000. A space of such high dimension makes computation very inefficient, or even infeasible. In fact, some words contribute very little: certain adverbs, for example, occur in nearly every text, cannot serve as features of any particular text, and are therefore meaningless for the subsequent classification. Words that can represent the text must therefore be selected to form a new feature space, which reduces the dimension.

Text representation. Text that humans understand is in character-encoded form, whereas a computer operates on binary-encoded data. Text representation converts the text into a form on which the computer can perform calculations. The choice of text representation directly affects the result of text classification. The commonly used text representation model is the vector space model. However, many feature words have a weight of zero in the vector space model, which leads to unsatisfactory classification results. The present invention proposes modifying the feature weights in the vector space model to improve classification accuracy.

A word vector is the vector representation of a word, obtained by training a neural network natural language processing model on a text corpus. One such approach is Word2Vec, a neural network language model developed by Google, which captures context information while compressing the size of the data. Word2Vec comprises two different methods: Continuous Bag of Words (CBOW) and Skip-gram. CBOW predicts the probability of the current word from its context. Skip-gram does the opposite: it predicts the probability of the context from the current word (as shown in Fig. 2). Both methods use an artificial neural network as their classification algorithm. Initially, each word is a random N-dimensional vector. After training with CBOW or Skip-gram, the algorithm obtains an optimal vector for each word. These word vectors capture context information and can be used to make predictions on unseen data.
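The difference between the two Word2Vec objectives can be illustrated by the training examples each one derives from a sentence. The sketch below is illustrative only and trains nothing; the function names and the `window` parameter are assumptions made for this example and are not part of the patent or of any Word2Vec implementation.

```python
from typing import List, Tuple


def context_window(tokens: List[str], i: int, window: int) -> List[str]:
    """Return the words within `window` positions of index i, excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_examples(tokens: List[str], window: int = 2) -> List[Tuple[List[str], str]]:
    """CBOW: the context words are the input, the current word is the target."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]


def skipgram_examples(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Skip-gram: the current word is the input, each context word is a target."""
    return [(tokens[i], c)
            for i in range(len(tokens))
            for c in context_window(tokens, i, window)]


sentence = ["the", "cat", "sat", "on", "the", "mat"]
print(cbow_examples(sentence, window=1)[1])       # (['the', 'sat'], 'cat')
print(skipgram_examples(sentence, window=1)[:3])  # [('the', 'cat'), ('cat', 'the'), ('cat', 'sat')]
```

In both cases the network would then be trained to assign high probability to the target given the input, and the learned input weights serve as the word vectors.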

Summary of the invention

To solve the text representation problem in prior-art text processing and improve the accuracy of text classification, the present invention provides a processing method for text representation based on the bag-of-words model. The invention builds the text model by combining the vector space model with word vectors, so that whole text documents can be classified with improved accuracy. The technical solution of the invention is as follows:

Step 1: preprocessing.

The text data set is segmented into words, stop words and low-frequency words are removed, and feature words are then selected.

Step 2: the preprocessed text data set is represented with the bag-of-words model. The bag-of-words model is a text representation model weighted by TFIDF (term frequency-inverse document frequency).

Step 3: word vectors are obtained by training a neural network natural language processing model on the preprocessed text data set.

Step 4: according to the similarity of the word vectors obtained in step 3, the weights of the feature words of the bag-of-words model obtained in step 2 are modified, giving a new text representation model. In the TFIDF weight matrix of the vector space model, each feature corresponds to one dimension of the feature space; each text is represented as one row of the matrix, and each column represents one feature word. Many feature words in this matrix have a TFIDF weight of zero, and these zero weights degrade classification. Each zero entry is modified using the TFIDF values of its n similar words, where similarity is measured on the word vectors trained by the neural network. The modification is as follows: in the TFIDF-weighted text representation model obtained in step 2, consider a feature word t in some row of the corresponding feature weight matrix whose feature weight W_t is zero.

In the first case, the feature weight W_t is approximated by the weights W_t1, W_t2, W_t3, ..., W_tn of the words t₁, t₂, t₃, ..., t_n that are similar to feature word t. The number n of similar words is controlled by the size of the similarity threshold m applied to the feature words.

Here S_(t,tn) is the similarity between feature word t and feature word t_n.

In the second case, the feature weight W_t is approximated by the weight W_i of the single most similar word among the words t₁, t₂, t₃, ..., t_n that are similar to feature word t.

W_t = W_i * S_(t,i) (2)

Here S_(t,i) is the similarity between feature word t and feature word i.
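The two cases can be sketched on a single row of the weight matrix. This is a minimal sketch under stated assumptions, not the patented formula: formula (1) is not reproduced in this text, so the first case is assumed here to be a similarity-weighted mean over the similar words, while the second case follows formula (2). The function `fill_zero_weights`, the `similar` callback and the choice to consider only similar words with nonzero weight are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def fill_zero_weights(row: Dict[str, float],
                      similar: Callable[[str], List[Tuple[str, float]]],
                      mode: str = "average") -> Dict[str, float]:
    """Return a copy of one TFIDF row with its zero weights approximated.

    row     : feature word -> TFIDF weight for a single text
    similar : hypothetical helper returning [(word, similarity), ...] for a
              feature word, already filtered by the similarity threshold m
    mode    : "average" -> first case, ASSUMED to be a similarity-weighted mean
              "nearest" -> second case, formula (2): W_t = W_i * S_(t,i)
    """
    filled = dict(row)
    for t, w in row.items():
        if w != 0.0:
            continue
        # Assumption: only similar words with a nonzero original weight contribute.
        candidates = [(u, s) for u, s in similar(t) if row.get(u, 0.0) != 0.0]
        if not candidates:
            continue  # nothing to borrow from; the weight stays zero
        if mode == "nearest":
            u, s = max(candidates, key=lambda pair: pair[1])
            filled[t] = row[u] * s
        else:
            filled[t] = sum(row[u] * s for u, s in candidates) / len(candidates)
    return filled


row = {"good": 0.5, "great": 0.0, "film": 0.2}
neighbours = {"great": [("good", 0.8), ("film", 0.1)]}  # toy similarities


def lookup(t: str) -> List[Tuple[str, float]]:
    return neighbours.get(t, [])


print(fill_zero_weights(row, lookup, mode="nearest"))
print(fill_zero_weights(row, lookup, mode="average"))
```

Reading weights from the original `row` rather than from `filled` keeps one approximated value from feeding into another.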

Further, for a small data set, the preprocessed text data set is replicated n times, where n is a positive integer, to enlarge the data set before the neural network natural language processing model is trained to obtain the word vectors. The word vectors obtained in this way perform better.

The beneficial effect of the invention is that the text model is built by combining the vector space model with word vectors, so that whole text documents can be classified with improved accuracy.

Brief description of the drawings

Fig. 1 is a schematic diagram of the text representation process based on the bag-of-words model and word vectors.

Fig. 2 shows the CBOW and Skip-gram models used to train word vectors.

Fig. 3 compares classification results obtained with a RandomForest classifier.

Detailed description of the embodiments

The specific embodiment described here only illustrates one implementation of the invention and does not limit its scope. Embodiments of the invention are described in detail below with reference to the accompanying drawings, and comprise the following steps:

1. Formatting the data set. Data sets come in different formats: some store data in txt files, others in pkl files. The implementation of the invention provides a text processing system that converts every data set to a CSV file. CSV is a general-purpose, relatively simple plain-text file format. It uses some character set, such as ASCII, Unicode, GB2312 or UTF-8; it consists of records (typically one record per line); each record is divided into fields by a separator (typical separators are the comma, semicolon, tab or space); and every record has the same sequence of fields.
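A conversion of this kind might look as follows. The patent does not describe the layout of the input files, so the layouts below are assumptions: a txt file with one `label<TAB>text` record per line, and a pkl file holding a pickled list of `(label, text)` tuples. The function names are hypothetical.

```python
import csv
import pickle
from pathlib import Path


def load_records(path):
    """Load (label, text) records from a .txt or .pkl file.

    ASSUMED layouts: .txt holds one "label<TAB>text" record per line;
    .pkl holds a pickled list of (label, text) tuples.
    """
    path = Path(path)
    if path.suffix == ".pkl":
        # Only unpickle files from a trusted source.
        with path.open("rb") as f:
            return [tuple(record) for record in pickle.load(f)]
    with path.open(encoding="utf-8") as f:
        return [tuple(line.rstrip("\n").split("\t", 1)) for line in f if line.strip()]


def to_csv(records, out_path):
    """Write the records as CSV with a header row; every record has the same fields."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["label", "text"])
        writer.writerows(records)
```

The `csv` module quotes fields that contain the separator, so commas inside a text do not break the record structure.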

2. Preprocessing the data. The text generally needs to be segmented into words, and stop words and low-frequency words removed.

(1) Word segmentation. English words are separated by spaces, which act as natural delimiters, so no segmentation is needed; it is enough to remove punctuation and digits. Each Chinese word, however, consists of a varying number of characters, so Chinese text must first be segmented. Automatic word segmentation requires the computer to cut each sentence into reasonable words according to its meaning. Natural language processing takes the word as its smallest unit, so the accuracy of segmentation directly affects the quality of text classification. The implementation of the invention uses the jieba segmentation package for Chinese word segmentation.

(2) Stop-word removal. Very common function words and pronouns such as "I" occur in every text. They do not help to distinguish document categories, so they are removed. For English, NLTK provides a standard stopwords corpus, so stop words are easy to remove with good results. For Chinese there is no standard stop-word list, so a stop-word list has to be found and downloaded before stop words can be removed.

(3) Low-frequency words have little influence on a document and in some cases need to be removed. In other cases, however, it is precisely these specific words that distinguish a document from other documents.

(4) English words vary with tense and voice, so stemming is needed to reduce each word to its base form.
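The preprocessing steps above can be sketched for English text with the standard library alone. The stop-word set and the `min_count` cutoff are illustrative assumptions; Chinese text would first need a segmenter, and stemming (step 4) is left out of this sketch.

```python
import re
from collections import Counter

# Hypothetical stop-word set for illustration; a real list would be much longer.
STOP_WORDS = {"the", "a", "of", "is", "and", "i"}


def tokenize(text):
    """Lowercase English text and keep alphabetic runs, dropping punctuation and digits."""
    return re.findall(r"[a-z]+", text.lower())


def preprocess(texts, stop_words=STOP_WORDS, min_count=2):
    """Tokenize, remove stop words, then remove words rarer than min_count overall."""
    docs = [[w for w in tokenize(t) if w not in stop_words] for t in texts]
    counts = Counter(w for d in docs for w in d)
    return [[w for w in d if counts[w] >= min_count] for d in docs]


texts = ["The movie is good, 10/10!", "A good movie and a fine cast."]
print(preprocess(texts))  # [['movie', 'good'], ['good', 'movie']]
```

As item (3) notes, dropping rare words is not always desirable, which is why the cutoff is a parameter rather than a fixed rule.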

3. Feature selection. The dimension of the feature space usually exceeds 100,000. A space of such high dimension makes computation very inefficient, or even infeasible. Moreover, some words contribute very little: they occur in nearly every text, cannot serve as features of any particular text, and are therefore meaningless for the subsequent classification. Words that can represent the text must therefore be selected to form a new feature space, which reduces the dimension. Common feature selection methods include document frequency (DF), mutual information (MI), information gain (IG) and the χ² statistic (CHI). Information gain is the most widely used in text classification, and the present invention uses it for feature selection.
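Information gain scores a word by how much knowing whether it is present reduces the entropy of the class labels. The sketch below uses the standard textbook definition, IG(t) = H(C) - P(t) H(C|t) - P(not t) H(C|not t); the patent does not spell out its exact variant, and the function names and toy data are assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels; 0.0 for an empty list."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(docs, labels, term):
    """IG(term) = H(C) - P(t) * H(C | t present) - P(not t) * H(C | t absent)."""
    with_t = [y for d, y in zip(docs, labels) if term in d]
    without_t = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    return (entropy(labels)
            - len(with_t) / n * entropy(with_t)
            - len(without_t) / n * entropy(without_t))


def select_features(docs, labels, k):
    """Keep the k words with the highest information gain."""
    vocab = sorted({w for d in docs for w in d})
    return sorted(vocab, key=lambda w: information_gain(docs, labels, w), reverse=True)[:k]


docs = [["good", "film"], ["good", "plot"], ["bad", "film"], ["bad", "plot"]]
labels = ["pos", "pos", "neg", "neg"]
print(select_features(docs, labels, k=2))  # ['bad', 'good']
```

In this toy data "good" and "bad" determine the class completely (gain of 1 bit), while "film" and "plot" occur equally in both classes (gain of 0).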

4. Text representation. Text representation formalizes the text, expressing it as numbers that the computer can calculate with, so that the computer can process natural language text. The text representation model in general use today is the vector space model (VSM), which is also the most effective one for text classification. The choice of text representation directly affects the result of text classification. The basic idea of VSM is to represent a large number of texts as a feature word matrix, so that comparing the similarity of texts becomes comparing the similarity of feature vectors in a space, which is clear and easy to understand. In this feature word matrix, each feature corresponds to one dimension of the feature space; the number of rows equals the number of texts to be classified, each text is represented as one row of the matrix, and each column represents one feature word. In practice, the vector space model usually takes TFIDF as the weight. The TFIDF weight is calculated as follows:
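The formula itself is not reproduced in this text. A commonly used form is w(t, d) = tf(t, d) × log(N / df(t)), where tf(t, d) is the relative frequency of word t in text d, N is the number of texts and df(t) is the number of texts containing t. The sketch below assumes that form; the patent's exact variant (for example, any smoothing or normalization) may differ, and `tfidf_matrix` is a hypothetical name.

```python
import math
from collections import Counter


def tfidf_matrix(docs, vocab):
    """Rows are texts, columns are feature words; weight = tf * log(N / df)."""
    n_docs = len(docs)
    df = Counter(w for d in docs for w in set(d))  # number of texts containing each word
    matrix = []
    for d in docs:
        tf = Counter(d)
        matrix.append([
            (tf[w] / len(d)) * math.log(n_docs / df[w]) if tf[w] else 0.0
            for w in vocab
        ])
    return matrix


docs = [["good", "film"], ["bad", "film"]]
vocab = ["bad", "film", "good"]
for row in tfidf_matrix(docs, vocab):
    print(row)
```

Even in this tiny example most entries are zero: a word absent from a text gets weight zero, and so does a word such as "film" that occurs in every text. These are the zero entries that the method later approximates.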

5. A neural network language model (the Word2vec framework open-sourced by Google) is trained on the preprocessed data set. The data set used in the implementation of the invention is relatively small, so it is replicated n times to enlarge it. Training yields a dictionary in which each word is a vector, and these word vectors capture context information. The invention combines the vector space model with word vectors; this text representation method improves the classification result.

6. Consider the TFIDF weight matrix of the vector space model obtained in step 4. In this feature word matrix, each feature corresponds to one dimension of the feature space; the number of rows equals the number of texts to be classified, each text is represented as one row of the matrix, and each column represents one feature word. Many feature words in this matrix have a TFIDF weight of zero, and these zero weights degrade classification. Using the word vectors obtained in step 5, the invention proposes that, for a feature word whose TFIDF weight is zero, its similar words be looked up by word vector, and that the weights of those similar words whose TFIDF values are nonzero be used to approximate the weight of the feature word whose TFIDF value is zero. The implementation is as follows: in the TFIDF weight matrix corresponding to the vector space model, for a feature word t in some row whose feature weight W_t is zero, either of the following can be used:

(1) The feature weight W_t is approximated by the weights W_t1, W_t2, W_t3, ..., W_tn of the words t₁, t₂, t₃, ..., t_n that are similar to feature word t. The number n of similar words can be controlled by the size of the similarity threshold m applied to the feature words, as shown in formula (1).

(2) The feature weight W_t is approximated by the weight W_i of the single most similar word among the words t₁, t₂, t₃, ..., t_n that are similar to feature word t, as shown in formula (2).
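Looking up the similar words of a feature word, with their number controlled by the threshold m, can be sketched as below. The patent does not name the similarity measure; cosine similarity is assumed here because it is the usual choice for word vectors. The function names and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_words(t, vectors, m):
    """Words whose similarity to t is at least the threshold m, most similar first."""
    if t not in vectors:
        return []
    scored = [(w, cosine(vectors[t], v)) for w, v in vectors.items() if w != t]
    kept = [pair for pair in scored if pair[1] >= m]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy two-dimensional vectors; trained word vectors would have far more dimensions.
vectors = {"good": [1.0, 0.0], "great": [0.9, 0.1], "bad": [-1.0, 0.0]}
print(similar_words("great", vectors, m=0.5))
```

Raising m shrinks the list, so n falls as the threshold rises; the first element of the list is the most similar word used in option (2).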

7. The text model built by the invention is classified with a RandomForest classifier. As its name suggests, a random forest is a forest built in a random manner. The forest consists of many decision trees, and the decision trees are built independently of one another. Once the forest has been built, each decision tree in it judges every new input sample separately to decide which class the sample should belong to (for a classification algorithm); the class chosen most often is the predicted class of the sample. The classification data set used is SST (Stanford Sentiment Treebank dataset). Comparing the classification accuracy of the bag-of-words model with that of the model modified by the invention shows that the processing method for text representation based on the bag-of-words model proposed by the invention achieves higher classification accuracy.
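The voting step described above can be sketched in a few lines. This shows only how a forest combines the judgments of its trees, not how the trees are grown; the three "trees" are hypothetical hand-written rules on a two-element weight vector, used purely for illustration.

```python
from collections import Counter


def forest_predict(trees, sample):
    """Each tree votes once; the class with the most votes is the prediction."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]


# Three hypothetical "trees", each a trivial rule on a feature weight vector.
trees = [
    lambda x: "pos" if x[0] > 0.2 else "neg",
    lambda x: "pos" if x[1] > 0.1 else "neg",
    lambda x: "pos" if x[0] + x[1] > 0.5 else "neg",
]

print(forest_predict(trees, [0.4, 0.0]))  # votes: pos, neg, neg -> neg
```

In a real random forest each tree would be trained on a random sample of the texts and a random subset of the feature words, which is what makes the trees differ from one another.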

Claims

1. A processing method for text representation based on the bag-of-words model, characterized in that it comprises the following steps:

Step 1: preprocessing;

Step 2: representing the preprocessed text data set with the bag-of-words model, the bag-of-words model being a text representation model weighted by TFIDF;

Step 3: training a neural network natural language processing model on the preprocessed text data set to obtain word vectors;

Step 4: modifying, according to the similarity of the word vectors obtained in step 3, the weights of the feature words of the bag-of-words model obtained in step 2, to obtain a new text representation model; the modification being as follows: in the TFIDF-weighted text representation model obtained in step 2, for a feature word t in some row of the corresponding feature weight matrix, if its feature weight W_t is zero, the feature weight W_t is approximated by the weights W_t1, W_t2, W_t3, ..., W_tn of the words t₁, t₂, t₃, ..., t_n that are similar to feature word t, the number n of similar words being controlled by the size of the similarity threshold m applied to the feature words.

2. The processing method for text representation based on the bag-of-words model according to claim 1, characterized in that, in the second step, the preprocessed text data set is replicated n times, n being a positive integer, to enlarge the data set, and the neural network natural language processing model is then trained to obtain the word vectors.

3. The processing method for text representation based on the bag-of-words model according to claim 1 or 2, characterized in that, in step 4, the weights of the feature words of the bag-of-words model obtained in step 2 are modified according to the similarity of the word vectors obtained in step 3, to obtain a new text representation model; the modification being as follows: in the TFIDF-weighted text representation model obtained in step 2, for a feature word t in some row of the corresponding feature weight matrix, if its feature weight W_t is zero, the feature weight W_t is approximated by the weight W_i of the single most similar word among the words t₁, t₂, t₃, ..., t_n that are similar to feature word t.