CN107357895B - Text representation processing method based on bag-of-words model - Google Patents
- Publication number
- CN107357895B CN107357895B CN201710569638.0A CN201710569638A CN107357895B CN 107357895 B CN107357895 B CN 107357895B CN 201710569638 A CN201710569638 A CN 201710569638A CN 107357895 B CN107357895 B CN 107357895B
- Authority
- CN
- China
- Prior art keywords
- word
- weight
- model
- text
- words
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention belongs to the field of computer applications and discloses a text representation processing method based on a bag-of-words model. The method applies word segmentation, stop-word removal, low-frequency word removal, feature selection and similar processing to an acquired text data set; represents the processed text with a vector space model; in parallel, trains word vectors on the processed text with a neural network method; and modifies the feature-word weights of the bag-of-words model according to word-vector similarity to obtain a new text representation model. The method addresses the text representation problem and improves classification accuracy.
Description
Technical Field
The invention belongs to the field of computer application, and particularly relates to a text representation processing method based on a bag-of-words model.
Background
At present, text processing is widely applied in many fields. In general, a text must undergo word segmentation, stop-word removal, low-frequency word removal and feature selection, then be represented, and finally be classified. Research results on text processing vary across countries; compared with other countries, research and exploration of text processing in China started relatively late and still lags somewhat behind.
Word segmentation. Spaces between English words act as natural delimiters, so English requires no segmentation. When a computer processes Chinese text, however, the text must first be segmented: automatic word segmentation requires the computer to split sentences into reasonable words according to their meaning. Natural language processing takes the word as its smallest unit, so segmentation accuracy directly affects the quality of text classification.
Feature selection. If a text is represented by all of its feature words, the dimension of the feature space usually exceeds one hundred thousand, and such a high-dimensional space makes computation very inefficient or even infeasible. In fact, some words contribute very weakly: function words that appear in almost all texts cannot characterize any particular text, so they are meaningless for the subsequent classification. Therefore, words that can represent the text are selected to form a new feature space, achieving dimensionality reduction.
Text representation. Humans read text as character codes while computers operate on binary codes, so text representation concerns how to convert text into an encoding on which the computer can perform calculations. The choice of text representation directly affects the effect of text classification. A commonly used text representation model is the vector space model. However, many feature-word weights in the vector space model are zero, so the classification effect is not ideal.
Word vectors are vector representations of each word obtained by training text with a neural network natural language processing model. A method called Word2Vec, developed by Google, uses such a neural network language model; it captures context information while compressing the data scale. Word2Vec actually comprises two different approaches: Continuous Bag of Words (CBOW) and Skip-gram. The goal of CBOW is to predict the probability of the current word from its context; Skip-gram does the opposite, predicting the probability of the context from the current word (as shown in fig. 2). Both methods use an artificial neural network as the classification algorithm. Initially, each word is a random N-dimensional vector; after training with CBOW or Skip-gram, the algorithm obtains an optimal vector for each word. These word vectors capture context information and can be used, for example, to predict the sentiment of unseen data.
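The difference between the two approaches can be illustrated by how training pairs are drawn from a sentence. The sketch below (an illustration, not the patent's implementation) generates (input, target) pairs for a context window: CBOW pairs a word's context with the word itself, while Skip-gram pairs the word with each context word.

```python
def training_pairs(tokens, window, mode):
    """Generate (input, target) pairs for a toy CBOW or Skip-gram setup.

    CBOW: input is the list of context words, target is the center word.
    Skip-gram: input is the center word, target is each context word.
    """
    pairs = []
    for i in range(len(tokens)):
        # Context = words within `window` positions of the center word.
        context = [tokens[j]
                   for j in range(max(0, i - window),
                                  min(len(tokens), i + window + 1))
                   if j != i]
        if mode == "cbow":
            pairs.append((context, tokens[i]))
        elif mode == "skipgram":
            for c in context:
                pairs.append((tokens[i], c))
    return pairs

sent = ["the", "cat", "sat", "on", "the", "mat"]
cbow = training_pairs(sent, window=1, mode="cbow")
sg = training_pairs(sent, window=1, mode="skipgram")
```

A real Word2Vec model then trains a small neural network on such pairs to produce the N-dimensional vectors the text describes.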
Disclosure of Invention
The invention aims to solve the text representation problem in prior-art text processing and to improve the accuracy of text classification. It provides a text representation processing method based on a bag-of-words model that builds a text model by combining a vector space model with a word-vector method, so that whole text documents can be classified with higher accuracy. The technical scheme of the invention is as follows:
firstly, preprocessing;
performing word segmentation, stop word removal and low-frequency word removal on the text data set, and then performing feature word selection;
secondly, representing the preprocessed text data set by using a bag-of-words model; the bag-of-words model is a text representation model taking TFIDF (term frequency-inverse document frequency) as weight;
thirdly, training the preprocessed text data set by using a neural network natural language processing model to obtain word vectors;
and fourthly, modifying the feature-word weights of the bag-of-words model obtained in the second step according to the similarity of the word vectors obtained in the third step to obtain a new text representation model. In the TFIDF weight matrix of the vector space model, each feature corresponds to one dimension of the feature space, each text is represented as a row of the matrix, and each column represents a feature word. Many feature words in this matrix have TFIDF weights of zero, and these zero weights degrade the classification effect. A zero entry is modified using the TFIDF values of n similar words, based on the similarity of the word vectors trained by the neural network. The specific modification is as follows: for the TFIDF-weighted text representation model obtained in the second step, consider a feature word t in some row of the corresponding feature weight matrix whose feature weight W_t is zero.
in one case, the feature weight WtUsing similar words t of characteristic words t1,t2,t3,...,tnWeight W oft1,Wt2,Wt3,...,WtnTo approximate WtThe number n of similar words is controlled by controlling the size of the similarity threshold m of the characteristic wordsAnd (5) preparing.
Wherein S is(t,tn)The middle is the similarity between the characteristic word t and the characteristic word tn.
In another case, the feature weight W_t is approximated using the weight W_i of the word i nearest to t among the similar words t1, t2, t3, ..., tn:

W_t = W_i * S(t,i)   (2)

where S(t,i) is the similarity between the feature word t and the feature word i.
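The modification step can be sketched in pure Python. In this sketch (an illustration under stated assumptions, not the patent's code), similarity S is taken to be cosine similarity of the trained word vectors, and case (1) is read as averaging the similarity-weighted TFIDF values of the similar words above the threshold m; case (2) uses only the single most similar word.

```python
import math

def cosine(u, v):
    """Cosine similarity of two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def fill_zero_weight(t, weights, vectors, m, nearest_only=False):
    """Approximate a zero TFIDF weight for word t from its similar words.

    weights: word -> TFIDF weight in the current document row.
    vectors: word -> trained word vector.
    m: similarity threshold controlling how many similar words count.
    """
    sims = [(w, cosine(vectors[t], vectors[w]))
            for w in weights
            if w != t and weights[w] > 0 and w in vectors]
    sims = [(w, s) for w, s in sims if s >= m]
    if not sims:
        return 0.0
    if nearest_only:  # case (2): use the single most similar word
        w, s = max(sims, key=lambda x: x[1])
        return weights[w] * s
    # case (1): average similarity-weighted contribution of all similar words
    return sum(weights[w] * s for w, s in sims) / len(sims)

vectors = {"movie": [1.0, 0.0], "film": [0.9, 0.1], "tree": [0.0, 1.0]}
weights = {"movie": 0.5, "film": 0.0, "tree": 0.2}
w_nearest = fill_zero_weight("film", weights, vectors, m=0.5, nearest_only=True)
w_avg = fill_zero_weight("film", weights, vectors, m=0.05)
```

Raising m shrinks the set of qualifying similar words, which is how the text says n is controlled.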
Further, for a smaller data set, the preprocessed text data set is copied L times (L a positive integer) to enlarge the data set before training word vectors with the neural network natural language processing model, which yields better word vectors.
The method has the advantage that combining the vector space model with the word-vector method to build the text model allows whole text documents to be classified with improved accuracy.
Drawings
FIG. 1 is a schematic diagram of a text representation process based on a bag of words model and word vectors.
FIG. 2 trains the CBOW model and Skip-gram model of the word vector.
FIG. 3 is a comparison graph of the classification effect using the RandomForest classifier.
Detailed Description
The specific embodiments described are merely illustrative of implementations of the invention and do not limit the scope of the invention. The following detailed description of the embodiments of the present invention with reference to the drawings specifically includes the following steps:
1. Formatting of the data set. Data sets arrive in different forms, some stored as txt files and others as pkl files. The implementation of the invention provides a text processing system that uniformly converts the data set into a CSV file. CSV is a universal and relatively simple plain-text file format using some character set, such as ASCII, Unicode, GB2312 or UTF-8; it consists of records (typically one record per line); each record is separated into fields by a separator (typically a comma, semicolon, tab or space); and every record has the same field sequence.
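The CSV properties just listed (one record per line, a fixed separator, the same field sequence in every record) can be exercised with the standard library. The field names below are hypothetical, chosen only for illustration.

```python
import csv
import io

# Hypothetical unified record layout: one text sample per row,
# with the same field sequence (label, text) in every record.
rows = [("pos", "a fine film"), ("neg", "a dull film")]

buf = io.StringIO()
writer = csv.writer(buf)            # comma is the default separator
writer.writerow(["label", "text"])  # header shares the field sequence
writer.writerows(rows)

buf.seek(0)
back = list(csv.reader(buf))
```

Reading the buffer back recovers exactly the records written, header included.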
2. Data preprocessing. A text usually undergoes word segmentation, stop-word removal and low-frequency word removal.
(1) Word segmentation. Spaces between English words act as natural delimiters, so English requires no segmentation beyond removing punctuation and numbers. Each Chinese word, however, consists of a varying number of characters, so Chinese text must first be segmented: automatic word segmentation requires the computer to split sentences into reasonable words according to their meaning. Since natural language processing takes the word as its smallest unit and segmentation accuracy directly affects classification quality, the text is segmented first, using a Chinese word-segmentation package.
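Chinese segmentation needs a dedicated segmenter package, but the English side described here (drop punctuation and numbers, then split on whitespace) can be sketched with the standard library:

```python
import re

def tokenize_english(text):
    """Split English text on whitespace after dropping punctuation and digits."""
    cleaned = re.sub(r"[^A-Za-z\s]", " ", text)  # remove punctuation and numbers
    return cleaned.lower().split()

tokens = tokenize_english("The cat sat on 2 mats, quickly!")
```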
(2) Stop-word removal. Words such as "in", "do" and "i" appear in every text and do not affect document classification, so they are removed. With the standard stopwords library in English NLTK, stop words can be removed easily and with good results. For Chinese, however, there is no standard stop-word library, so a stop-word list must be found and downloaded to remove the stop words.
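The filtering itself is a set-membership test. In the sketch below, a small hand-written set stands in for a real stop-word list (NLTK's `stopwords.words("english")` would supply one for English; for Chinese a downloaded list would be loaded into the same set):

```python
# Toy stand-in for a stop-word list; a real pipeline would load
# NLTK's English list or a downloaded Chinese stop-word file here.
STOP_WORDS = {"in", "do", "i", "the", "a", "of"}

def remove_stop_words(tokens):
    """Drop tokens that carry no classification signal."""
    return [t for t in tokens if t not in STOP_WORDS]

kept = remove_stop_words(["i", "do", "like", "the", "film"])
```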
(3) Low-frequency word removal. Low-frequency words usually have little influence on a document and in some cases should be removed; in other cases, however, it is precisely these specific words that distinguish a document from others.
(4) For English, because of tense and morphological variation, stemming is needed to restore each word to its base form.
3. Feature selection. The dimension of the feature space is typically more than one hundred thousand, and such a high-dimensional space makes computation very inefficient or even impossible, while some words contribute very weakly: they appear in almost all texts, cannot characterize a particular text, and are meaningless for the subsequent classification. Therefore, words that can represent the text are selected to form a new feature space, achieving dimensionality reduction. Commonly used feature selection methods include document frequency (DF), mutual information (MI), information gain (IG) and the χ² statistic (CHI); among these, information gain is the most widely used in text classification, and the invention uses it for feature selection.
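Information gain for a term measures how much knowing the term's presence reduces uncertainty about the class: IG(t) = H(C) − [P(t)·H(C|t) + P(¬t)·H(C|¬t)]. A minimal sketch over a toy labeled collection (documents as word sets, an assumption for brevity):

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(term, docs, labels):
    """IG of a term's presence/absence over a labeled document collection."""
    with_t = [lab for doc, lab in zip(docs, labels) if term in doc]
    without = [lab for doc, lab in zip(docs, labels) if term not in doc]
    n = len(labels)
    cond = sum(len(s) / n * entropy(s) for s in (with_t, without) if s)
    return entropy(labels) - cond

docs = [{"good", "film"}, {"good", "plot"}, {"bad", "film"}, {"bad", "plot"}]
labels = ["pos", "pos", "neg", "neg"]
```

Here "good" separates the classes perfectly (IG = 1 bit), while "film" appears in both classes equally (IG = 0), so a feature selector would keep the former and drop the latter.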
4. Text representation. Text representation formalizes a text into numbers a computer can calculate with, so that the computer can work with natural language text. The generally used model today is the vector space model (VSM), which is the most effective in text classification; the choice of representation directly affects the classification result. The basic idea of the VSM is to represent a large amount of text as a feature-word matrix, so that comparing text similarity becomes comparing the similarity of feature vectors in space, which is clear and easy to understand. In the feature-word matrix, each feature corresponds to one dimension of the feature space, the number of rows equals the number of texts to be classified, each text is one row, and each column is one feature word. In practical applications, the vector space model often uses TFIDF as the weight value, computed as

W(t,d) = tf(t,d) * log(N / df(t))

where tf(t,d) is the frequency of feature word t in document d, N is the total number of documents, and df(t) is the number of documents containing t.
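The TFIDF weight can be computed directly from its definition. This is one common variant (length-normalized term frequency, natural log); the patent's exact formula image is not reproduced here, so treat the details as an assumption:

```python
import math

def tfidf(term, doc, corpus):
    """TFIDF of a term in one document: tf(t,d) * log(N / df(t))."""
    tf = doc.count(term) / len(doc)              # normalized term frequency
    df = sum(1 for d in corpus if term in d)     # document frequency
    return tf * math.log(len(corpus) / df) if df else 0.0

corpus = [["good", "film", "good"], ["bad", "film"], ["fine", "plot"]]
w = tfidf("good", corpus[0], corpus)  # tf = 2/3, df = 1, N = 3
```

A word frequent in one document but rare across the corpus ("good" above) gets a high weight; a word appearing everywhere gets a weight near zero, which is exactly the behavior the feature-word matrix relies on.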
5. The data set preprocessed in step 1 is trained with a neural network language model (Google's open-source Word2vec architecture). Because the data set adopted by the method is relatively small, it is enlarged by copying it L times. Training yields a vocabulary in which each word is a vector, and these word vectors capture context information. The invention combines the vector space model with the word vectors, and this text representation improves the classification effect.
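The enlargement step is simple replication of the preprocessed corpus. A sketch (the commented-out training call assumes gensim's Word2Vec API and is illustrative only):

```python
def enlarge(corpus, L):
    """Copy the preprocessed corpus L times (L a positive integer)."""
    return [sentence for _ in range(L) for sentence in corpus]

corpus = [["good", "film"], ["bad", "plot"]]
enlarged = enlarge(corpus, 3)

# A word-vector model would then be trained on the enlarged corpus, e.g.:
# model = gensim.models.Word2Vec(sentences=enlarged, sg=0)  # sg=0 -> CBOW
```

Replication does not add new contexts, but it gives the training loop more update steps over the same co-occurrences, which the text reports helps on very small data sets.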
6. In the TFIDF weight matrix of the vector space model obtained in step 4 — where each feature corresponds to one dimension of the feature space, the number of rows equals the number of texts to be classified, each text is one row, and each column is one feature word — many feature words have TFIDF weights of zero, and these zero weights degrade the classification effect. Using the word vectors obtained in step 5, the invention takes each feature word whose TFIDF weight is zero, finds its similar words via the word vectors, and approximates the zero weight with the weights of similar words whose TFIDF values are non-zero. Concretely, for a feature word t in some row of the TFIDF weight matrix whose feature weight W_t is zero, one can use:
(1) The feature weight W_t is approximated from the weights W_t1, W_t2, W_t3, ..., W_tn of the similar words t1, t2, t3, ..., tn of the feature word t; the number n of similar words is controlled through the similarity threshold m of the feature words, as shown in formula (1).
(2) The feature weight W_t is approximated using the weight W_i of the word i nearest to t among the similar words t1, t2, t3, ..., tn, as shown in formula (2).
7. The text model established by the invention is classified with a random forest classifier. As its name suggests, a random forest builds a forest in a random manner; the forest contains many decision trees, and the decision trees of a random forest are uncorrelated with one another. After the forest is obtained, each decision tree judges every new input sample and votes for the class it believes the sample belongs to (for a classification algorithm); the class receiving the most votes becomes the predicted class. On the SST (Stanford Sentiment Treebank) classification data set, comparing the classification accuracy of the plain bag-of-words model with that of the model modified by the invention shows that the text representation processing method based on the bag-of-words model achieves higher accuracy.
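The voting step described above reduces to taking the most common prediction across trees. A minimal sketch (the per-tree votes are hypothetical; a real pipeline would obtain them from a trained forest, e.g. scikit-learn's RandomForestClassifier):

```python
from collections import Counter

def majority_vote(tree_predictions):
    """A random forest classifies by letting every decision tree vote
    and taking the most common class as the prediction."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical votes from five decision trees for one text sample:
votes = ["pos", "neg", "pos", "pos", "neg"]
label = majority_vote(votes)
```

Because each tree is trained on a random subset of data and features, the trees err in different directions, and the vote averages those errors out.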
Claims (3)
1. A text representation processing method based on a bag-of-words model, characterized by comprising the following steps:
firstly, preprocessing;
performing word segmentation, stop word removal and low-frequency word removal on the text data set, and then performing feature word selection;
secondly, representing the preprocessed text data set by using a bag-of-words model; the bag-of-words model is a text representation model taking TFIDF as weight;
thirdly, training the preprocessed text data set by using a neural network natural language processing model to obtain word vectors;
fourthly, modifying the feature-word weights of the bag-of-words model obtained in the second step according to the similarity of the word vectors obtained in the third step to obtain a new text representation model; the specific modification is as follows: for the TFIDF-weighted text representation model obtained in the second step, for a feature word t in some row of the corresponding feature weight matrix whose feature weight W_t is zero, the feature weight W_t is approximated from the weights W_t1, W_t2, W_t3, ..., W_tn of the similar words t1, t2, t3, ..., tn of the feature word t, the number n of similar words being controlled through the similarity threshold m of the feature words.
2. The method as claimed in claim 1, characterized in that in the second step the preprocessed text data set is copied L times, L being a positive integer, to enlarge the data set, and word vectors are then obtained by training with the neural network natural language processing model.
3. The method as claimed in claim 1 or 2, characterized in that in the fourth step W_t can also be obtained as follows: the feature-word weights of the bag-of-words model obtained in the second step are modified according to the similarity of the word vectors obtained in the third step to obtain a new text representation model; the specific modification is as follows: for the TFIDF-weighted text representation model obtained in the second step, for a feature word t in some row of the corresponding feature weight matrix whose feature weight W_t is zero, the feature weight W_t is approximated using the weight W_i of the word i nearest to t among the similar words t1, t2, t3, ..., tn.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710005310 | 2017-01-05 | ||
CN2017100053106 | 2017-01-05 |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107357895A CN107357895A (en) | 2017-11-17 |
CN107357895B true CN107357895B (en) | 2020-05-19 |
Family
ID=60292842
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710569638.0A Expired - Fee Related CN107357895B (en) | 2017-01-05 | 2017-07-14 | Text representation processing method based on bag-of-words model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107357895B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110362815A (en) * | 2018-04-11 | 2019-10-22 | 北京京东尚科信息技术有限公司 | Text vector generation method and device |
CN109284382B (en) * | 2018-09-30 | 2021-05-28 | 武汉斗鱼网络科技有限公司 | Text classification method and computing device |
CN109543036A (en) * | 2018-11-20 | 2019-03-29 | 四川长虹电器股份有限公司 | Text Clustering Method based on semantic similarity |
CN110096591A (en) * | 2019-04-04 | 2019-08-06 | 平安科技(深圳)有限公司 | Long text classification method, device, computer equipment and storage medium based on bag of words |
CN111859901A (en) * | 2020-07-15 | 2020-10-30 | 大连理工大学 | English repeated text detection method, system, terminal and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102930063A (en) * | 2012-12-05 | 2013-02-13 | 电子科技大学 | Feature item selection and weight calculation based text classification method |
CN103927302A (en) * | 2013-01-10 | 2014-07-16 | 阿里巴巴集团控股有限公司 | Text classification method and system |
CN104778158A (en) * | 2015-03-04 | 2015-07-15 | 新浪网技术(中国)有限公司 | Method and device for representing text |
CN104809131A (en) * | 2014-01-27 | 2015-07-29 | 董靖 | Automatic classification system and method of electronic documents |
CN104881400A (en) * | 2015-05-19 | 2015-09-02 | 上海交通大学 | Semantic dependency calculating method based on associative network |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150026104A1 (en) * | 2013-07-17 | 2015-01-22 | Christopher Tambos | System and method for email classification |
-
2017
- 2017-07-14 CN CN201710569638.0A patent/CN107357895B/en not_active Expired - Fee Related
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102930063A (en) * | 2012-12-05 | 2013-02-13 | 电子科技大学 | Feature item selection and weight calculation based text classification method |
CN103927302A (en) * | 2013-01-10 | 2014-07-16 | 阿里巴巴集团控股有限公司 | Text classification method and system |
CN104809131A (en) * | 2014-01-27 | 2015-07-29 | 董靖 | Automatic classification system and method of electronic documents |
CN104778158A (en) * | 2015-03-04 | 2015-07-15 | 新浪网技术(中国)有限公司 | Method and device for representing text |
CN104881400A (en) * | 2015-05-19 | 2015-09-02 | 上海交通大学 | Semantic dependency calculating method based on associative network |
Non-Patent Citations (1)
Title |
---|
Microblog recommendation based on Word2Vec topic extraction; Zhu Xuemei; China Master's Theses Full-text Database, Information Science and Technology, No. 3, 2016; 20160315; full text *
Also Published As
Publication number | Publication date |
---|---|
CN107357895A (en) | 2017-11-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107357895B (en) | Text representation processing method based on bag-of-words model | |
Nguyen et al. | Relation extraction: Perspective from convolutional neural networks | |
CN110287328B (en) | Text classification method, device and equipment and computer readable storage medium | |
US11074412B1 (en) | Machine learning classification system | |
CN109299270B (en) | Text data unsupervised clustering method based on convolutional neural network | |
KR102217248B1 (en) | Feature extraction and learning method for summarizing text documents | |
CN113434858B (en) | Malicious software family classification method based on disassembly code structure and semantic features | |
CN112231477A (en) | Text classification method based on improved capsule network | |
CN111291177A (en) | Information processing method and device and computer storage medium | |
Farhoodi et al. | Applying machine learning algorithms for automatic Persian text classification | |
CN109993216B (en) | Text classification method and device based on K nearest neighbor KNN | |
CN109791570B (en) | Efficient and accurate named entity recognition method and device | |
CN115098690B (en) | Multi-data document classification method and system based on cluster analysis | |
CN112417153A (en) | Text classification method and device, terminal equipment and readable storage medium | |
CN107832307B (en) | Chinese word segmentation method based on undirected graph and single-layer neural network | |
CN110990676A (en) | Social media hotspot topic extraction method and system | |
CN113032253A (en) | Test data feature extraction method, test method and related device | |
Jayady et al. | Theme Identification using Machine Learning Techniques | |
Wang et al. | File fragment type identification with convolutional neural networks | |
Ghosh | Sentiment analysis of IMDb movie reviews: a comparative study on performance of hyperparameter-tuned classification algorithms | |
CN109359090A (en) | File fragmentation classification method and system based on convolutional neural networks | |
Elgeldawi et al. | Hyperparameter Tuning for Machine Learning Algorithms Used for Arabic Sentiment Analysis. Informatics 2021, 8, 79 | |
CN110348497B (en) | Text representation method constructed based on WT-GloVe word vector | |
Pak et al. | The impact of text representation and preprocessing on author identification | |
CN107729509B (en) | Discourse similarity determination method based on recessive high-dimensional distributed feature representation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 20200519 Termination date: 20210714 |