CN107357895B - Text representation processing method based on bag-of-words model - Google Patents
- Publication number
- CN107357895B CN107357895B CN201710569638.0A CN201710569638A CN107357895B CN 107357895 B CN107357895 B CN 107357895B CN 201710569638 A CN201710569638 A CN 201710569638A CN 107357895 B CN107357895 B CN 107357895B
- Authority
- CN
- China
- Prior art keywords
- word
- weight
- model
- text
- words
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention belongs to the field of computer applications and discloses a text representation processing method based on a bag-of-words model. The method applies word segmentation, stop-word removal, low-frequency word removal, feature selection and similar processing to an acquired text data set; represents the processed text with a vector space model; in parallel, trains word vectors on the processed text with a neural network method; and modifies the feature-word weights of the bag-of-words model according to word-vector similarity to obtain a new text representation model. The method addresses the text representation problem and improves classification accuracy.
Description
Technical Field
The invention belongs to the field of computer application, and particularly relates to a text representation processing method based on a bag-of-words model.
Background
At present, text processing is widely applied in many fields. In general, a text must undergo word segmentation, stop-word removal, low-frequency word removal and feature selection, then be represented, and finally be classified. Research results on text processing vary across countries; compared with other countries, research and exploration of text processing in China started relatively late and still lags somewhat behind.
Word segmentation. Spaces between English words act as natural delimiters, so English requires no segmentation. When a computer processes Chinese text, however, the text must first be segmented: automatic word segmentation requires the computer to split sentences into reasonable words according to their meaning. Natural language processing takes the word as its smallest unit, so segmentation accuracy directly affects the quality of text classification.
Feature selection. If a text is represented by all of its feature words, the dimension of the feature space usually exceeds one hundred thousand, and such a high-dimensional space makes computation very inefficient or even infeasible. In fact, some words contribute very weakly: function words that appear in almost all texts cannot characterize any particular text, so they are meaningless for the subsequent classification. Therefore, words that can represent the text are selected to form a new feature space, achieving dimensionality reduction.
Text representation. Humans read text as character codes while computers operate on binary codes, so text representation concerns how to convert text into an encoding on which the computer can perform calculations. The choice of text representation directly affects the effect of text classification. A commonly used text representation model is the vector space model. However, many feature-word weights in the vector space model are zero, so the classification effect is not ideal.
Word vectors are vector representations of each word obtained by training text with a neural network natural language processing model. A method called Word2Vec, developed by Google, uses such a neural network language model; it captures context information while compressing the data scale. Word2Vec actually comprises two different approaches: Continuous Bag of Words (CBOW) and Skip-gram. The goal of CBOW is to predict the probability of the current word from its context; Skip-gram does the opposite, predicting the probability of the context from the current word (as shown in fig. 2). Both methods use an artificial neural network as the classification algorithm. Initially, each word is a random N-dimensional vector; after training with CBOW or Skip-gram, the algorithm obtains an optimal vector for each word. These word vectors capture context information and can be used, for example, to predict the sentiment of unseen data.
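The difference between the two approaches can be illustrated by how training pairs are drawn from a sentence. The sketch below (an illustration, not the patent's implementation) generates (input, target) pairs for a context window: CBOW pairs a word's context with the word itself, while Skip-gram pairs the word with each context word.

```python
def training_pairs(tokens, window, mode):
    """Generate (input, target) pairs for a toy CBOW or Skip-gram setup.

    CBOW: input is the list of context words, target is the center word.
    Skip-gram: input is the center word, target is each context word.
    """
    pairs = []
    for i in range(len(tokens)):
        # Context = words within `window` positions of the center word.
        context = [tokens[j]
                   for j in range(max(0, i - window),
                                  min(len(tokens), i + window + 1))
                   if j != i]
        if mode == "cbow":
            pairs.append((context, tokens[i]))
        elif mode == "skipgram":
            for c in context:
                pairs.append((tokens[i], c))
    return pairs

sent = ["the", "cat", "sat", "on", "the", "mat"]
cbow = training_pairs(sent, window=1, mode="cbow")
sg = training_pairs(sent, window=1, mode="skipgram")
```

A real Word2Vec model then trains a small neural network on such pairs to produce the N-dimensional vectors the text describes.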
Disclosure of Invention
The invention aims to solve the text representation problem in prior-art text processing and to improve the accuracy of text classification. It provides a text representation processing method based on a bag-of-words model that builds a text model by combining a vector space model with a word-vector method, so that whole text documents can be classified with higher accuracy. The technical scheme of the invention is as follows:
firstly, preprocessing;
performing word segmentation, stop word removal and low-frequency word removal on the text data set, and then performing feature word selection;
secondly, representing the preprocessed text data set by using a bag-of-words model; the bag-of-words model is a text representation model taking TFIDF (term frequency-inverse document frequency) as weight;
thirdly, training the preprocessed text data set by using a neural network natural language processing model to obtain word vectors;
and fourthly, modifying the feature-word weights of the bag-of-words model obtained in the second step according to the similarity of the word vectors obtained in the third step to obtain a new text representation model. In the TFIDF weight matrix of the vector space model, each feature corresponds to one dimension of the feature space, each text is represented as a row of the matrix, and each column represents a feature word. Many feature words in this matrix have TFIDF weights of zero, and these zero weights degrade the classification effect. A zero entry is modified using the TFIDF values of n similar words, based on the similarity of the word vectors trained by the neural network. The specific modification is as follows: for the TFIDF-weighted text representation model obtained in the second step, consider a feature word t in some row of the corresponding feature weight matrix whose feature weight W_t is zero.
in one case, the feature weight WtUsing similar words t of characteristic words t1,t2,t3,...,tnWeight W oft1,Wt2,Wt3,...,WtnTo approximate WtThe number n of similar words is controlled by controlling the size of the similarity threshold m of the characteristic wordsAnd (5) preparing.
Wherein S is(t,tn)The middle is the similarity between the characteristic word t and the characteristic word tn.
In another case, the feature weight W_t is approximated using the weight W_i of the word i nearest to t among the similar words t1, t2, t3, ..., tn:

W_t = W_i * S(t,i)   (2)

where S(t,i) is the similarity between the feature word t and the feature word i.
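The modification step can be sketched in pure Python. In this sketch (an illustration under stated assumptions, not the patent's code), similarity S is taken to be cosine similarity of the trained word vectors, and case (1) is read as averaging the similarity-weighted TFIDF values of the similar words above the threshold m; case (2) uses only the single most similar word.

```python
import math

def cosine(u, v):
    """Cosine similarity of two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def fill_zero_weight(t, weights, vectors, m, nearest_only=False):
    """Approximate a zero TFIDF weight for word t from its similar words.

    weights: word -> TFIDF weight in the current document row.
    vectors: word -> trained word vector.
    m: similarity threshold controlling how many similar words count.
    """
    sims = [(w, cosine(vectors[t], vectors[w]))
            for w in weights
            if w != t and weights[w] > 0 and w in vectors]
    sims = [(w, s) for w, s in sims if s >= m]
    if not sims:
        return 0.0
    if nearest_only:  # case (2): use the single most similar word
        w, s = max(sims, key=lambda x: x[1])
        return weights[w] * s
    # case (1): average similarity-weighted contribution of all similar words
    return sum(weights[w] * s for w, s in sims) / len(sims)

vectors = {"movie": [1.0, 0.0], "film": [0.9, 0.1], "tree": [0.0, 1.0]}
weights = {"movie": 0.5, "film": 0.0, "tree": 0.2}
w_nearest = fill_zero_weight("film", weights, vectors, m=0.5, nearest_only=True)
w_avg = fill_zero_weight("film", weights, vectors, m=0.05)
```

Raising m shrinks the set of qualifying similar words, which is how the text says n is controlled.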
Further, for a smaller data set, the preprocessed text data set is copied L times (L a positive integer) to enlarge the data set before training word vectors with the neural network natural language processing model, which yields better word vectors.
The method has the advantage that combining the vector space model with the word-vector method to build the text model allows whole text documents to be classified with improved accuracy.
Drawings
FIG. 1 is a schematic diagram of a text representation process based on a bag of words model and word vectors.
FIG. 2 trains the CBOW model and Skip-gram model of the word vector.
FIG. 3 is a comparison graph of the classification effect using the RandomForest classifier.
Detailed Description
The specific embodiments described are merely illustrative of implementations of the invention and do not limit the scope of the invention. The following detailed description of the embodiments of the present invention with reference to the drawings specifically includes the following steps:
1. Formatting of the data set. Data sets arrive in different forms, some stored as txt files and others as pkl files. The implementation of the invention provides a text processing system that uniformly converts the data set into a CSV file. CSV is a universal and relatively simple plain-text file format using some character set, such as ASCII, Unicode, GB2312 or UTF-8; it consists of records (typically one record per line); each record is separated into fields by a separator (typically a comma, semicolon, tab or space); and every record has the same field sequence.
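The CSV properties just listed (one record per line, a fixed separator, the same field sequence in every record) can be exercised with the standard library. The field names below are hypothetical, chosen only for illustration.

```python
import csv
import io

# Hypothetical unified record layout: one text sample per row,
# with the same field sequence (label, text) in every record.
rows = [("pos", "a fine film"), ("neg", "a dull film")]

buf = io.StringIO()
writer = csv.writer(buf)            # comma is the default separator
writer.writerow(["label", "text"])  # header shares the field sequence
writer.writerows(rows)

buf.seek(0)
back = list(csv.reader(buf))
```

Reading the buffer back recovers exactly the records written, header included.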
2. Data preprocessing. A text usually undergoes word segmentation, stop-word removal and low-frequency word removal.
(1) Word segmentation. Spaces between English words act as natural delimiters, so English requires no segmentation beyond removing punctuation and numbers. Each Chinese word, however, consists of a varying number of characters, so Chinese text must first be segmented: automatic word segmentation requires the computer to split sentences into reasonable words according to their meaning. Since natural language processing takes the word as its smallest unit and segmentation accuracy directly affects classification quality, the text is segmented first, using a Chinese word-segmentation package.
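Chinese segmentation needs a dedicated segmenter package, but the English side described here (drop punctuation and numbers, then split on whitespace) can be sketched with the standard library:

```python
import re

def tokenize_english(text):
    """Split English text on whitespace after dropping punctuation and digits."""
    cleaned = re.sub(r"[^A-Za-z\s]", " ", text)  # remove punctuation and numbers
    return cleaned.lower().split()

tokens = tokenize_english("The cat sat on 2 mats, quickly!")
```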
(2) Stop-word removal. Words such as "in", "do" and "i" appear in every text and do not affect document classification, so they are removed. With the standard stopwords library in English NLTK, stop words can be removed easily and with good results. For Chinese, however, there is no standard stop-word library, so a stop-word list must be found and downloaded to remove the stop words.
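The filtering itself is a set-membership test. In the sketch below, a small hand-written set stands in for a real stop-word list (NLTK's `stopwords.words("english")` would supply one for English; for Chinese a downloaded list would be loaded into the same set):

```python
# Toy stand-in for a stop-word list; a real pipeline would load
# NLTK's English list or a downloaded Chinese stop-word file here.
STOP_WORDS = {"in", "do", "i", "the", "a", "of"}

def remove_stop_words(tokens):
    """Drop tokens that carry no classification signal."""
    return [t for t in tokens if t not in STOP_WORDS]

kept = remove_stop_words(["i", "do", "like", "the", "film"])
```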
(3) Low-frequency word removal. Low-frequency words usually have little influence on a document and in some cases should be removed; in other cases, however, it is precisely these specific words that distinguish a document from others.
(4) For English, because of tense and morphological variation, stemming is needed to restore each word to its base form.
3. Feature selection. The dimension of the feature space is typically more than one hundred thousand, and such a high-dimensional space makes computation very inefficient or even impossible, while some words contribute very weakly: they appear in almost all texts, cannot characterize a particular text, and are meaningless for the subsequent classification. Therefore, words that can represent the text are selected to form a new feature space, achieving dimensionality reduction. Commonly used feature selection methods include document frequency (DF), mutual information (MI), information gain (IG) and the χ² statistic (CHI); among these, information gain is the most widely used in text classification, and the invention uses it for feature selection.
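Information gain for a term measures how much knowing the term's presence reduces uncertainty about the class: IG(t) = H(C) − [P(t)·H(C|t) + P(¬t)·H(C|¬t)]. A minimal sketch over a toy labeled collection (documents as word sets, an assumption for brevity):

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(term, docs, labels):
    """IG of a term's presence/absence over a labeled document collection."""
    with_t = [lab for doc, lab in zip(docs, labels) if term in doc]
    without = [lab for doc, lab in zip(docs, labels) if term not in doc]
    n = len(labels)
    cond = sum(len(s) / n * entropy(s) for s in (with_t, without) if s)
    return entropy(labels) - cond

docs = [{"good", "film"}, {"good", "plot"}, {"bad", "film"}, {"bad", "plot"}]
labels = ["pos", "pos", "neg", "neg"]
```

Here "good" separates the classes perfectly (IG = 1 bit), while "film" appears in both classes equally (IG = 0), so a feature selector would keep the former and drop the latter.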
4. Text representation. Text representation formalizes a text into numbers a computer can calculate with, so that the computer can work with natural language text. The generally used model today is the vector space model (VSM), which is the most effective in text classification; the choice of representation directly affects the classification result. The basic idea of the VSM is to represent a large amount of text as a feature-word matrix, so that comparing text similarity becomes comparing the similarity of feature vectors in space, which is clear and easy to understand. In the feature-word matrix, each feature corresponds to one dimension of the feature space, the number of rows equals the number of texts to be classified, each text is one row, and each column is one feature word. In practical applications, the vector space model often uses TFIDF as the weight value, computed as

W(t,d) = tf(t,d) * log(N / df(t))

where tf(t,d) is the frequency of feature word t in document d, N is the total number of documents, and df(t) is the number of documents containing t.
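The TFIDF weight can be computed directly from its definition. This is one common variant (length-normalized term frequency, natural log); the patent's exact formula image is not reproduced here, so treat the details as an assumption:

```python
import math

def tfidf(term, doc, corpus):
    """TFIDF of a term in one document: tf(t,d) * log(N / df(t))."""
    tf = doc.count(term) / len(doc)              # normalized term frequency
    df = sum(1 for d in corpus if term in d)     # document frequency
    return tf * math.log(len(corpus) / df) if df else 0.0

corpus = [["good", "film", "good"], ["bad", "film"], ["fine", "plot"]]
w = tfidf("good", corpus[0], corpus)  # tf = 2/3, df = 1, N = 3
```

A word frequent in one document but rare across the corpus ("good" above) gets a high weight; a word appearing everywhere gets a weight near zero, which is exactly the behavior the feature-word matrix relies on.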
5. The data set preprocessed in step 1 is trained with a neural network language model (Google's open-source Word2vec architecture). Because the data set adopted by the method is relatively small, it is enlarged by copying it L times. Training yields a vocabulary in which each word is a vector, and these word vectors capture context information. The invention combines the vector space model with the word vectors, and this text representation improves the classification effect.
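The enlargement step is simple replication of the preprocessed corpus. A sketch (the commented-out training call assumes gensim's Word2Vec API and is illustrative only):

```python
def enlarge(corpus, L):
    """Copy the preprocessed corpus L times (L a positive integer)."""
    return [sentence for _ in range(L) for sentence in corpus]

corpus = [["good", "film"], ["bad", "plot"]]
enlarged = enlarge(corpus, 3)

# A word-vector model would then be trained on the enlarged corpus, e.g.:
# model = gensim.models.Word2Vec(sentences=enlarged, sg=0)  # sg=0 -> CBOW
```

Replication does not add new contexts, but it gives the training loop more update steps over the same co-occurrences, which the text reports helps on very small data sets.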
6. In the TFIDF weight matrix of the vector space model obtained in step 4 — where each feature corresponds to one dimension of the feature space, the number of rows equals the number of texts to be classified, each text is one row, and each column is one feature word — many feature words have TFIDF weights of zero, and these zero weights degrade the classification effect. Using the word vectors obtained in step 5, the invention takes each feature word whose TFIDF weight is zero, finds its similar words via the word vectors, and approximates the zero weight with the weights of similar words whose TFIDF values are non-zero. Concretely, for a feature word t in some row of the TFIDF weight matrix whose feature weight W_t is zero, one can use:
(1) The feature weight W_t is approximated from the weights W_t1, W_t2, W_t3, ..., W_tn of the similar words t1, t2, t3, ..., tn of the feature word t; the number n of similar words is controlled through the similarity threshold m of the feature words, as shown in formula (1).
(2) The feature weight W_t is approximated using the weight W_i of the word i nearest to t among the similar words t1, t2, t3, ..., tn, as shown in formula (2).
7. The text model established by the invention is classified with a random forest classifier. As its name suggests, a random forest builds a forest in a random manner; the forest contains many decision trees, and the decision trees of a random forest are uncorrelated with one another. After the forest is obtained, each decision tree judges every new input sample and votes for the class it believes the sample belongs to (for a classification algorithm); the class receiving the most votes becomes the predicted class. On the SST (Stanford Sentiment Treebank) classification data set, comparing the classification accuracy of the plain bag-of-words model with that of the model modified by the invention shows that the text representation processing method based on the bag-of-words model achieves higher accuracy.
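The voting step described above reduces to taking the most common prediction across trees. A minimal sketch (the per-tree votes are hypothetical; a real pipeline would obtain them from a trained forest, e.g. scikit-learn's RandomForestClassifier):

```python
from collections import Counter

def majority_vote(tree_predictions):
    """A random forest classifies by letting every decision tree vote
    and taking the most common class as the prediction."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical votes from five decision trees for one text sample:
votes = ["pos", "neg", "pos", "pos", "neg"]
label = majority_vote(votes)
```

Because each tree is trained on a random subset of data and features, the trees err in different directions, and the vote averages those errors out.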
Claims (3)
1. A text representation processing method based on a bag-of-words model, characterized by comprising the following steps:
firstly, preprocessing;
performing word segmentation, stop word removal and low-frequency word removal on the text data set, and then performing feature word selection;
secondly, representing the preprocessed text data set by using a bag-of-words model; the bag-of-words model is a text representation model taking TFIDF as weight;
thirdly, training the preprocessed text data set by using a neural network natural language processing model to obtain word vectors;
fourthly, modifying the feature-word weights of the bag-of-words model obtained in the second step according to the similarity of the word vectors obtained in the third step to obtain a new text representation model; the specific modification is as follows: for the TFIDF-weighted text representation model obtained in the second step, for a feature word t in some row of the corresponding feature weight matrix whose feature weight W_t is zero, the feature weight W_t is approximated from the weights W_t1, W_t2, W_t3, ..., W_tn of the similar words t1, t2, t3, ..., tn of the feature word t, the number n of similar words being controlled through the similarity threshold m of the feature words.
2. The method as claimed in claim 1, characterized in that in the second step the preprocessed text data set is copied L times, L being a positive integer, to enlarge the data set, and word vectors are then obtained by training with the neural network natural language processing model.
3. The method as claimed in claim 1 or 2, characterized in that in the fourth step W_t can also be obtained as follows: the feature-word weights of the bag-of-words model obtained in the second step are modified according to the similarity of the word vectors obtained in the third step to obtain a new text representation model; the specific modification is as follows: for the TFIDF-weighted text representation model obtained in the second step, for a feature word t in some row of the corresponding feature weight matrix whose feature weight W_t is zero, the feature weight W_t is approximated using the weight W_i of the word i nearest to t among the similar words t1, t2, t3, ..., tn.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710005310 | 2017-01-05 | ||
CN2017100053106 | 2017-01-05 |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107357895A CN107357895A (en) | 2017-11-17 |
CN107357895B true CN107357895B (en) | 2020-05-19 |
Family
ID=60292842
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710569638.0A Expired - Fee Related CN107357895B (en) | 2017-01-05 | 2017-07-14 | Text representation processing method based on bag-of-words model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107357895B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110362815A (en) * | 2018-04-11 | 2019-10-22 | 北京京东尚科信息技术有限公司 | Text vector generation method and device |
CN109284382B (en) * | 2018-09-30 | 2021-05-28 | 武汉斗鱼网络科技有限公司 | Text classification method and computing device |
CN109543036A (en) * | 2018-11-20 | 2019-03-29 | 四川长虹电器股份有限公司 | Text Clustering Method based on semantic similarity |
CN110096591A (en) * | 2019-04-04 | 2019-08-06 | 平安科技(深圳)有限公司 | Long text classification method, device, computer equipment and storage medium based on bag of words |
CN111859901A (en) * | 2020-07-15 | 2020-10-30 | 大连理工大学 | English repeated text detection method, system, terminal and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102930063A (en) * | 2012-12-05 | 2013-02-13 | 电子科技大学 | Feature item selection and weight calculation based text classification method |
CN103927302A (en) * | 2013-01-10 | 2014-07-16 | 阿里巴巴集团控股有限公司 | Text classification method and system |
CN104778158A (en) * | 2015-03-04 | 2015-07-15 | 新浪网技术(中国)有限公司 | Method and device for representing text |
CN104809131A (en) * | 2014-01-27 | 2015-07-29 | 董靖 | Automatic classification system and method of electronic documents |
CN104881400A (en) * | 2015-05-19 | 2015-09-02 | 上海交通大学 | Semantic dependency calculating method based on associative network |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150026104A1 (en) * | 2013-07-17 | 2015-01-22 | Christopher Tambos | System and method for email classification |
-
2017
- 2017-07-14 CN CN201710569638.0A patent/CN107357895B/en not_active Expired - Fee Related
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102930063A (en) * | 2012-12-05 | 2013-02-13 | 电子科技大学 | Feature item selection and weight calculation based text classification method |
CN103927302A (en) * | 2013-01-10 | 2014-07-16 | 阿里巴巴集团控股有限公司 | Text classification method and system |
CN104809131A (en) * | 2014-01-27 | 2015-07-29 | 董靖 | Automatic classification system and method of electronic documents |
CN104778158A (en) * | 2015-03-04 | 2015-07-15 | 新浪网技术(中国)有限公司 | Method and device for representing text |
CN104881400A (en) * | 2015-05-19 | 2015-09-02 | 上海交通大学 | Semantic dependency calculating method based on associative network |
Non-Patent Citations (1)
Title |
---|
Microblog recommendation based on Word2Vec topic extraction; Zhu Xuemei; China Master's Theses Full-text Database, Information Science and Technology, No. 3, 2016; 20160315; full text *
Also Published As
Publication number | Publication date |
---|---|
CN107357895A (en) | 2017-11-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107357895B (en) | Text representation processing method based on bag-of-words model | |
Nguyen et al. | Relation extraction: Perspective from convolutional neural networks | |
CN110287328B (en) | Text classification method, device and equipment and computer readable storage medium | |
US11074412B1 (en) | Machine learning classification system | |
CN109299270B (en) | Text data unsupervised clustering method based on convolutional neural network | |
KR102217248B1 (en) | Feature extraction and learning method for summarizing text documents | |
CN113434858B (en) | Malicious software family classification method based on disassembly code structure and semantic features | |
CN112231477A (en) | Text classification method based on improved capsule network | |
CN111291177A (en) | Information processing method and device and computer storage medium | |
Farhoodi et al. | Applying machine learning algorithms for automatic Persian text classification | |
CN109993216B (en) | Text classification method and device based on K nearest neighbor KNN | |
CN109791570B (en) | Efficient and accurate named entity recognition method and device | |
CN115098690B (en) | Multi-data document classification method and system based on cluster analysis | |
CN112417153A (en) | Text classification method and device, terminal equipment and readable storage medium | |
CN107832307B (en) | Chinese word segmentation method based on undirected graph and single-layer neural network | |
CN110990676A (en) | Social media hotspot topic extraction method and system | |
CN113032253A (en) | Test data feature extraction method, test method and related device | |
Jayady et al. | Theme Identification using Machine Learning Techniques | |
Wang et al. | File fragment type identification with convolutional neural networks | |
Ghosh | Sentiment analysis of IMDb movie reviews: a comparative study on performance of hyperparameter-tuned classification algorithms | |
CN109359090A (en) | File fragmentation classification method and system based on convolutional neural networks | |
Elgeldawi et al. | Hyperparameter Tuning for Machine Learning Algorithms Used for Arabic Sentiment Analysis. Informatics 2021, 8, 79 | |
CN110348497B (en) | Text representation method constructed based on WT-GloVe word vector | |
Pak et al. | The impact of text representation and preprocessing on author identification | |
CN107729509B (en) | Discourse similarity determination method based on recessive high-dimensional distributed feature representation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 20200519 Termination date: 20210714 |