CN107357895B - Text representation processing method based on bag-of-words model - Google Patents

Text representation processing method based on bag-of-words model

Info

Publication number
CN107357895B
CN107357895B
Authority
CN
China
Prior art keywords: word, weight, model, text, words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201710569638.0A
Other languages
Chinese (zh)
Other versions
CN107357895A (en)
Inventor
姚念民
牛世雄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University of Technology filed Critical Dalian University of Technology
Publication of CN107357895A publication Critical patent/CN107357895A/en
Application granted granted Critical
Publication of CN107357895B publication Critical patent/CN107357895B/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the field of computer applications and discloses a text representation processing method based on the bag-of-words model. An acquired text data set undergoes word segmentation, stop-word removal, low-frequency-word removal, feature selection, and similar preprocessing; the processed text is then represented with a vector space model; meanwhile, word vectors of the processed text are trained with a neural network method; and the feature-word weights of the bag-of-words model are modified according to the similarity of the word vectors to obtain a new text representation model. The method addresses the text representation problem and improves classification accuracy.

Description

Text representation processing method based on bag-of-words model
Technical Field
The invention belongs to the field of computer application, and particularly relates to a text representation processing method based on a bag-of-words model.
Background
Text processing is currently applied widely across many fields. In general, a text undergoes word segmentation, stop-word removal, low-frequency-word removal, and feature selection; the text is then represented formally and finally classified. Research results on text processing also differ from country to country; compared with other countries, research and exploration of text processing in China started relatively late and still lags behind.
Word segmentation. English words are separated by spaces, which act as natural delimiters, so English requires no segmentation. When a computer processes Chinese text, however, the text must first be segmented: automatic word segmentation requires the computer to split sentences into reasonable words according to their meaning. Natural language processing takes the word as its smallest unit, so segmentation accuracy directly affects the quality of text classification.
Feature selection. If a text were represented by all the feature words it contains, the feature space would usually have more than one hundred thousand dimensions; such a high-dimensional space makes computation very inefficient or even infeasible. In fact, some words contribute very weakly: function words that appear in almost all texts cannot characterize a particular text and are therefore meaningless for the subsequent classification. Words that can represent the text are therefore selected from it to form a new feature space, achieving dimensionality reduction.
Text representation. Humans read text as character codes while computers operate on binary codes, so text representation converts the text encoding into a form the computer can compute with. The choice of text representation directly affects the classification result. A commonly used text representation model is the vector space model; however, many feature words in the vector space model have weight zero, so the classification effect is not ideal.
Word vectors are vector representations of individual words obtained by training texts with a neural network natural language processing model. Word2Vec, developed by Google, uses such a neural network language model and can compress the data scale while capturing context information. Word2Vec actually includes two different approaches: Continuous Bag of Words (CBOW) and Skip-gram. CBOW aims to predict the probability of the current word from its context; Skip-gram does the opposite, predicting the probability of the context from the current word (as shown in FIG. 2). Both methods use artificial neural networks as their classification algorithms. Initially each word is a random N-dimensional vector; after training, the algorithm obtains the optimal vector of each word with the CBOW or Skip-gram method. The resulting word vectors capture context information and can be used, for example, to predict the sentiment of unseen data.
Disclosure of Invention
The invention aims to solve the text representation problem in prior-art text processing and to improve the accuracy of text classification. The invention provides a text representation processing method based on the bag-of-words model, which builds the text model from a vector space model together with a word vector method, so that whole text documents can be classified with improved accuracy. The technical scheme of the invention is as follows:
firstly, preprocessing;
performing word segmentation, stop word removal and low-frequency word removal on the text data set, and then performing feature word selection;
secondly, representing the preprocessed text data set by using a bag-of-words model; the bag-of-words model is a text representation model taking TFIDF (term frequency-inverse document frequency) as weight;
thirdly, training the preprocessed text data set by using a neural network natural language processing model to obtain word vectors;
and fourthly, modifying the feature-word weights of the bag-of-words model obtained in the second step according to the similarity of the word vectors obtained in the third step, so as to obtain a new text representation model. In the TFIDF weight matrix of the vector space model, each feature corresponds to one dimension of the feature space, each text is represented as one row of the matrix, and each column represents one feature word. Many feature words in this matrix have TFIDF weight zero, and these zero feature weights harm the classification effect. A zero entry is modified with the TFIDF values of n similar words, based on the similarity of the word vectors trained by the neural network. The specific modification is as follows: in the TFIDF-weighted text representation model obtained in the second step, consider a feature word t in some row of the corresponding feature weight matrix whose feature weight W_t is zero.

In one case, the feature weight W_t is approximated from the weights W_t1, W_t2, W_t3, ..., W_tn of the similar words t1, t2, t3, ..., tn of the feature word t; the number n of similar words is controlled by controlling the size of the similarity threshold m of the feature words:

W_t = ( Σ_{i=1..n} W_ti * S_(t,ti) ) / n    (1)

where S_(t,ti) is the similarity between the feature word t and the similar word ti.

In another case, the feature weight W_t is approximated from the weight W_i of the word i that is most similar to t among t1, t2, t3, ..., tn:

W_t = W_i * S_(t,i)    (2)

where S_(t,i) is the similarity between the feature word t and the feature word i.
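As a small worked illustration (the numbers are assumed, not from the patent): suppose W_t = 0 and two similar words exceed the threshold m, with similarities S_(t,t1) = 0.8 and S_(t,t2) = 0.7 and weights W_t1 = 0.30 and W_t2 = 0.20. Formula (1) gives W_t = (0.30*0.8 + 0.20*0.7) / 2 = 0.19, while formula (2) uses only the most similar word t1 and gives W_t = 0.30*0.8 = 0.24.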
Further, for a small data set, the preprocessed text data set is replicated L times, where L is a positive integer used to enlarge the data set, before training with the neural network natural language processing model to obtain word vectors; the resulting word vectors are of better quality.
The method has the advantage that it combines the vector space model with the word vector method to establish the text model, so that whole text documents are classified and the classification accuracy is improved.
Drawings
FIG. 1 is a schematic diagram of a text representation process based on a bag of words model and word vectors.
FIG. 2 shows the CBOW and Skip-gram models used to train word vectors.
FIG. 3 is a comparison graph of the classification effect using the RandomForest classifier.
Detailed Description
The specific embodiments described are merely illustrative of implementations of the invention and do not limit its scope. The embodiments of the invention are described in detail below with reference to the drawings and specifically include the following steps:
1. Formatting of the data set. Data sets come in different forms: some store data in txt files and others in pkl files. The implementation of the invention provides a text processing system that converts every data set uniformly into a CSV file. CSV is a universal and relatively simple plain-text file format that uses some character set, such as ASCII, Unicode, GB2312, or UTF-8; it consists of records (typically one record per line); each record is divided into fields by a delimiter (typical delimiters are commas, semicolons, tabs, or spaces); and every record has the same field sequence.
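A minimal sketch of this conversion under assumptions: records are taken to be (label, text) pairs, and the file names and storage layouts are hypothetical, since the patent does not specify them.

```python
# Illustrative sketch (not the patent's code): unify txt- and pkl-stored
# data sets into one CSV with a (label, text) record per line.
import pickle
import pandas as pd

def txt_to_rows(path):
    # Assumed layout: one "label<TAB>text" record per line in the txt file.
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            label, text = line.rstrip("\n").split("\t", 1)
            rows.append((label, text))
    return rows

def pkl_to_rows(path):
    # Assumed layout: the pkl file stores a list of (label, text) tuples.
    with open(path, "rb") as f:
        return pickle.load(f)

rows = txt_to_rows("dataset_a.txt") + pkl_to_rows("dataset_b.pkl")
pd.DataFrame(rows, columns=["label", "text"]).to_csv(
    "dataset.csv", index=False, encoding="utf-8")
```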
2. Preprocessing of the data. Word segmentation, stop-word removal, and low-frequency-word removal usually need to be performed on the text; a code sketch covering steps (1), (2), and (4) follows this list.
(1) Word segmentation. English words are separated by spaces as natural delimiters, so English needs no word segmentation beyond removing punctuation and digits. Each Chinese word, however, consists of a varying number of characters, so Chinese text must first be segmented: automatic word segmentation requires the computer to split sentences into reasonable words according to their meaning. Natural language processing takes the word as its smallest unit, and segmentation accuracy directly affects the quality of text classification; therefore the text is segmented first, using a Chinese word segmentation package.
(2) Stop-word removal. Words such as "of", "and", and "I" appear in every text and do not affect document classification, so they are removed. For English, the standard stopwords corpus in NLTK removes stop words easily and works well. For Chinese there is no standard stop-word library; a stop-word list must be found and downloaded before the stop words can be removed.
(3) Low-frequency-word removal. Low-frequency words have little influence on a document and in some cases need to be removed; in other cases, however, it is precisely these specific words that distinguish a document from others.
(4) For English, because of tense and morphology, words need to be stemmed or lemmatized to restore their base forms.
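The sketch below illustrates steps (1), (2), and (4) under stated assumptions: jieba stands in for the unnamed Chinese word segmentation package, and "cn_stopwords.txt" is a hypothetical downloaded Chinese stop-word list.

```python
# Illustrative preprocessing sketch; jieba and the Chinese stop-word file
# name are assumptions, not specified in the patent.
import jieba
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

# (1) Chinese word segmentation with jieba.
cn_tokens = jieba.lcut("文本表示的选择直接影响文本分类的效果")

# (2) Stop-word removal: NLTK's standard corpus for English, a downloaded
# list (hypothetical file "cn_stopwords.txt") for Chinese.
en_stop = set(stopwords.words("english"))
en_tokens = [w for w in "i do like this film".split() if w not in en_stop]
with open("cn_stopwords.txt", encoding="utf-8") as f:
    cn_stop = {line.strip() for line in f}
cn_tokens = [w for w in cn_tokens if w not in cn_stop]

# (4) English lemmatization to restore word base forms.
lemmatizer = WordNetLemmatizer()
en_tokens = [lemmatizer.lemmatize(w, pos="v") for w in en_tokens]
```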
3. Feature selection. The dimension of the feature space is typically more than one hundred thousand, and such a high-dimensional space makes computation very inefficient or even infeasible. Moreover, some words contribute very weakly: they appear in almost all texts, cannot characterize a particular text, and are therefore meaningless for the subsequent classification. Words that can represent the text are therefore selected to form a new feature space, achieving dimensionality reduction. Common feature selection methods include document frequency (DF), mutual information (MI), information gain (IG), and the chi-square statistic (CHI); information gain is the most widely used in text classification, and the invention uses it for feature selection.
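As an illustration of information-gain feature selection, a minimal sketch (not the patent's implementation) computes IG(t) = H(C) - H(C | t) from binary term occurrence; the toy corpus is assumed.

```python
# Minimal information-gain sketch: score a term by how much knowing its
# presence/absence reduces class-label entropy.
import math
from collections import Counter

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def information_gain(docs, labels, term):
    # docs: list of token lists; labels: parallel list of class labels.
    h_c = entropy(list(Counter(labels).values()))
    with_t = [l for d, l in zip(docs, labels) if term in d]
    without_t = [l for d, l in zip(docs, labels) if term not in d]
    h_cond = 0.0
    for part in (with_t, without_t):
        if part:
            h_cond += len(part) / len(docs) * entropy(list(Counter(part).values()))
    return h_c - h_cond

docs = [["good", "film"], ["bad", "film"], ["good", "plot"], ["bad", "plot"]]
labels = ["pos", "neg", "pos", "neg"]
print(information_gain(docs, labels, "good"))  # 1.0: 'good' fully separates classes
print(information_gain(docs, labels, "film"))  # 0.0: 'film' carries no class signal
```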
4. Text representation. Text representation formalizes the text as numbers a computer can compute with, so that the computer can work with natural language text. The most common text representation model today is the vector space model (VSM), which is the most effective in text classification; the choice of text representation directly affects the classification result. The basic idea of the VSM is to represent a large amount of text as a feature-word matrix, so that comparing text similarity reduces to comparing feature vectors in space, which is clear and easy to understand. In the feature-word matrix, each feature corresponds to one dimension of the feature space, the number of rows equals the number of texts to classify, each text is represented as one row, and each column represents one feature word. In practice the vector space model usually takes TFIDF as the weight value, computed as follows:
W(t, d) = TF(t, d) * log( N / DF(t) )

where TF(t, d) is the frequency of feature word t in text d, N is the total number of texts, and DF(t) is the number of texts containing t.
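A minimal sketch of building the TFIDF weight matrix with scikit-learn (this uses one standard TFIDF variant with smoothing; the patent's exact weighting formula may differ, and the toy corpus is assumed):

```python
# Build the (texts x feature words) TFIDF weight matrix of the VSM.
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["good film good plot", "bad film", "good plot"]  # toy corpus
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(texts)   # rows = texts, columns = feature words
print(vectorizer.get_feature_names_out())
print(tfidf.toarray())                    # note how many entries are zero
```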
5. Train the data set preprocessed in step 1 with a neural network language model (Google's open-source Word2Vec architecture). Because the data set used here is relatively small, it is enlarged by replicating it L times. Training yields a vocabulary in which each word is a vector, and these word vectors capture context information. The invention combines the vector space model with the word vectors, and this text representation improves the classification effect.
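A minimal sketch of this step using the gensim implementation of Word2Vec (gensim, the toy corpus, the value of L, and all parameters are illustrative assumptions):

```python
# Enlarge a small corpus by replicating it L times, then train word vectors.
from gensim.models import Word2Vec

corpus = [["good", "film"], ["bad", "plot"], ["good", "plot"]]
L = 10
enlarged = corpus * L                       # replicate the data set L times

model = Word2Vec(enlarged, vector_size=100, window=5,
                 min_count=1, sg=0)         # sg=0: CBOW, sg=1: Skip-gram
vec = model.wv["good"]                      # trained word vector for "good"
similar = model.wv.most_similar("good", topn=3)  # similar words with scores
```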
6. Step 4 yields the TFIDF weight matrix of the vector space model: each feature in the feature-word matrix corresponds to one dimension of the feature space, the number of rows equals the number of texts to classify, each text is represented as one row, and each column represents one feature word. Many feature words in this matrix have TFIDF weight zero, and these zero feature weights harm the classification effect. Using the word vectors obtained in step 5, the invention takes each feature word whose TFIDF weight is zero, finds its similar words via the word vectors, and approximates the zero entry with the weight values of similar words whose TFIDF values are non-zero. The specific implementation is as follows: in the TFIDF weight matrix of the vector space model, for a feature word t in some row whose feature weight W_t is zero, either of the following can be used:

(1) The feature weight W_t is approximated from the weights W_t1, W_t2, W_t3, ..., W_tn of the similar words t1, t2, t3, ..., tn of the feature word t; the number n of similar words is controlled through the similarity threshold m of the feature words, as shown in formula (1).

(2) The feature weight W_t is approximated from the weight W_i of the word i that is most similar to t among t1, t2, t3, ..., tn, as shown in formula (2).
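A minimal sketch of this step under assumptions: `tfidf` is a dense (texts x features) numpy array from step 4, `features` lists the column words, and `wv` is a trained gensim KeyedVectors from step 5; the function name, the threshold default, and `topn=50` are illustrative choices, not the patent's code.

```python
def fill_zero_weights(tfidf, features, wv, m=0.6, strategy=1):
    """Approximate zero TFIDF entries from word-vector similar words,
    following formulas (1) and (2)."""
    index = {w: j for j, w in enumerate(features)}
    out = tfidf.copy()
    for i, row in enumerate(tfidf):
        for j, w_t in enumerate(row):
            if w_t != 0 or features[j] not in wv:
                continue
            # Similar words of t that are themselves feature words, have
            # non-zero weight in this text, and pass the threshold m.
            cands = [(s, index[w])
                     for w, s in wv.most_similar(features[j], topn=50)
                     if s >= m and w in index and row[index[w]] > 0]
            if not cands:
                continue
            if strategy == 1:        # formula (1): averaged, similarity-weighted
                out[i, j] = sum(s * row[k] for s, k in cands) / len(cands)
            else:                    # formula (2): most similar word only
                s, k = max(cands)
                out[i, j] = s * row[k]
    return out
```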
7. The text model established by the invention is classified with a random forest classifier. As its name suggests, a random forest builds a forest in a random manner; the forest contains many decision trees, and the decision trees are mutually independent. Once the forest is built, each new input sample is judged by every decision tree in the forest, each tree voting for the class the sample belongs to (for a classification algorithm); the class receiving the most votes becomes the prediction for the sample. On the SST (Stanford Sentiment Treebank) classification data set, the classification accuracy of the plain bag-of-words model is compared with that of the model modified by the invention; the text representation processing method based on the bag-of-words model achieves higher classification accuracy.
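A minimal sketch of this classification step with scikit-learn's RandomForestClassifier (random data stands in for the modified TFIDF matrix; SST loading is omitted):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Toy stand-ins for the modified TFIDF matrix and the class labels.
rng = np.random.default_rng(0)
X = rng.random((200, 50))
y = (X[:, 0] > 0.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```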

Claims (3)

1. A text representation processing method based on a bag-of-words model, characterized by comprising the following steps:
firstly, preprocessing;
performing word segmentation, stop word removal and low-frequency word removal on the text data set, and then performing feature word selection;
secondly, representing the preprocessed text data set by using a bag-of-words model; the bag-of-words model is a text representation model taking TFIDF as weight;
thirdly, training the preprocessed text data set by using a neural network natural language processing model to obtain word vectors;
fourthly, modifying the feature-word weights of the bag-of-words model obtained in the second step according to the similarity of the word vectors obtained in the third step to obtain a new text representation model; the specific modification is as follows: in the TFIDF-weighted text representation model obtained in the second step, for a feature word t in some row of the corresponding feature weight matrix whose feature weight W_t is zero, the feature weight W_t is approximated from the weights W_t1, W_t2, W_t3, ..., W_tn of the similar words t1, t2, t3, ..., tn of the feature word t, the number n of similar words being controlled by controlling the size of the similarity threshold m of the feature words.
2. The method as claimed in claim 1, wherein in the third step the preprocessed text data set is first replicated L times, where L is a positive integer used to enlarge the data set, and word vectors are then obtained by training with the neural network natural language processing model.
3. The method as claimed in claim 1 or 2, wherein in the fourth step W_t can also be obtained as follows: the feature-word weights of the bag-of-words model obtained in the second step are modified according to the similarity of the word vectors obtained in the third step to obtain a new text representation model; the specific modification is: in the TFIDF-weighted text representation model obtained in the second step, for a feature word t in some row of the corresponding feature weight matrix whose feature weight W_t is zero, the feature weight W_t is approximated from the weight W_i of the word i that is most similar to t among the similar words t1, t2, t3, ..., tn of the feature word t.
CN201710569638.0A 2017-01-05 2017-07-14 Text representation processing method based on bag-of-words model Expired - Fee Related CN107357895B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710005310 2017-01-05
CN2017100053106 2017-01-05

Publications (2)

Publication Number Publication Date
CN107357895A CN107357895A (en) 2017-11-17
CN107357895B true CN107357895B (en) 2020-05-19

Family

ID=60292842

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710569638.0A Expired - Fee Related CN107357895B (en) 2017-01-05 2017-07-14 Text representation processing method based on bag-of-words model

Country Status (1)

Country Link
CN (1) CN107357895B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110362815A (en) * 2018-04-11 2019-10-22 北京京东尚科信息技术有限公司 Text vector generation method and device
CN109284382B (en) * 2018-09-30 2021-05-28 武汉斗鱼网络科技有限公司 Text classification method and computing device
CN109543036A (en) * 2018-11-20 2019-03-29 四川长虹电器股份有限公司 Text Clustering Method based on semantic similarity
CN110096591A (en) * 2019-04-04 2019-08-06 平安科技(深圳)有限公司 Long text classification method, device, computer equipment and storage medium based on bag of words
CN111859901A (en) * 2020-07-15 2020-10-30 大连理工大学 English repeated text detection method, system, terminal and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150026104A1 (en) * 2013-07-17 2015-01-22 Christopher Tambos System and method for email classification

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102930063A (en) * 2012-12-05 2013-02-13 电子科技大学 Feature item selection and weight calculation based text classification method
CN103927302A (en) * 2013-01-10 2014-07-16 阿里巴巴集团控股有限公司 Text classification method and system
CN104809131A (en) * 2014-01-27 2015-07-29 董靖 Automatic classification system and method of electronic documents
CN104778158A (en) * 2015-03-04 2015-07-15 新浪网技术(中国)有限公司 Method and device for representing text
CN104881400A (en) * 2015-05-19 2015-09-02 上海交通大学 Semantic dependency calculating method based on associative network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Microblog recommendation based on Word2Vec topic extraction; 朱雪梅; China Master's Theses Full-text Database, Information Science and Technology, 2016 No. 03; 2016-03-15; full text *

Also Published As

Publication number Publication date
CN107357895A (en) 2017-11-17

Similar Documents

Publication Publication Date Title
CN107357895B (en) Text representation processing method based on bag-of-words model
Nguyen et al. Relation extraction: Perspective from convolutional neural networks
CN110287328B (en) Text classification method, device and equipment and computer readable storage medium
US11074412B1 (en) Machine learning classification system
CN109299270B (en) Text data unsupervised clustering method based on convolutional neural network
KR102217248B1 (en) Feature extraction and learning method for summarizing text documents
CN113434858B (en) Malicious software family classification method based on disassembly code structure and semantic features
CN112231477A (en) Text classification method based on improved capsule network
CN111291177A (en) Information processing method and device and computer storage medium
Farhoodi et al. Applying machine learning algorithms for automatic Persian text classification
CN109993216B (en) Text classification method and device based on K nearest neighbor KNN
CN109791570B (en) Efficient and accurate named entity recognition method and device
CN115098690B (en) Multi-data document classification method and system based on cluster analysis
CN112417153A (en) Text classification method and device, terminal equipment and readable storage medium
CN107832307B (en) Chinese word segmentation method based on undirected graph and single-layer neural network
CN110990676A (en) Social media hotspot topic extraction method and system
CN113032253A (en) Test data feature extraction method, test method and related device
Jayady et al. Theme Identification using Machine Learning Techniques
Wang et al. File fragment type identification with convolutional neural networks
Ghosh Sentiment analysis of IMDb movie reviews: a comparative study on performance of hyperparameter-tuned classification algorithms
CN109359090A (en) File fragmentation classification method and system based on convolutional neural networks
Elgeldawi et al. Hyperparameter Tuning for Machine Learning Algorithms Used for Arabic Sentiment Analysis. Informatics 2021, 8, 79
CN110348497B (en) Text representation method constructed based on WT-GloVe word vector
Pak et al. The impact of text representation and preprocessing on author identification
CN107729509B (en) Discourse similarity determination method based on recessive high-dimensional distributed feature representation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200519

Termination date: 20210714