CN107357895A - Text representation processing method based on bag-of-words model - Google Patents

Publication number
CN107357895A
Authority
CN
China
Prior art keywords
words
feature
text
weight
bag
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710569638.0A
Other languages
Chinese (zh)
Other versions
CN107357895B (en)
Inventor
姚念民
牛世雄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University of Technology
Publication of CN107357895A
Application granted
Publication of CN107357895B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the field of computer applications and discloses a text representation processing method based on the bag-of-words model. The method preprocesses a collected text data set by word segmentation, stop-word removal, low-frequency-word removal, and feature selection; represents the processed text with a vector space model; and, in parallel, trains word vectors on the processed text with a neural network. The weights of the feature words in the bag-of-words model are then modified according to the similarity of the word vectors, yielding a new text representation model that addresses the text representation problem and improves classification accuracy.

Description

A text representation processing method based on the bag-of-words model
Technical field
The invention belongs to the field of computer applications and in particular relates to a text representation processing method based on the bag-of-words model.
Background
At present, text processing is widely used in many fields. A text generally needs to be segmented into words, stripped of stop words and low-frequency words, and reduced by feature selection; it is then represented in a form suitable for computation and finally classified. Research on text processing has progressed unevenly across countries. Relative to other countries, China started later, and its research in this area lags behind.
Word segmentation. English words are separated by spaces, which act as natural delimiters, so no segmentation is needed. When a computer processes Chinese text, however, the text must first be segmented: automatic word segmentation requires the computer to cut each sentence into reasonable words according to its meaning. Natural language processing treats the word as the smallest unit, so segmentation accuracy directly affects the quality of text classification.
Feature selection. If a text is represented by all of the feature words it contains, the feature space usually has more than 100,000 dimensions. Such a high-dimensional space makes computation very inefficient or even infeasible. In fact, some words contribute very little: common function words appear in nearly every text, cannot serve as features of any particular text, and are therefore of no use to the subsequent classification. Words that are representative of the text must therefore be selected to form a new feature space, which reduces dimensionality.
Text representation. Text that humans read is stored as character codes, whereas computers operate on binary data. The role of text representation is to convert text into a numerical form on which a computer can perform calculations. The choice of text representation directly affects classification performance. The most commonly used text representation model is the vector space model. In a vector space model, however, many feature words have a weight of zero, which degrades classification. The present invention proposes modifying the feature weights in the vector space model to improve classification accuracy.
Word vectors. A word vector is the vector representation of a word obtained by training a neural network natural language processing model on a text corpus. Word2Vec, a neural network language model developed by Google, captures contextual information while compressing the scale of the data. Word2Vec actually comprises two different methods: Continuous Bag of Words (CBOW) and Skip-gram. CBOW aims to predict the probability of the current word from its context. Skip-gram does the opposite: it predicts the probability of the context from the current word (as shown in Fig. 2). Both methods use an artificial neural network as their classification algorithm. Initially, each word is a random N-dimensional vector. After training, the algorithm has obtained an optimized vector for each word using CBOW or Skip-gram. These word vectors capture contextual information and can be used to predict the sentiment of unseen data.
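The difference between the two Word2Vec training objectives can be made concrete by listing the training examples each one draws from a sentence. The sketch below is illustrative only: the function names and the `window` parameter are assumptions introduced here, and no neural network is trained.

```python
def context_window(tokens, i, window):
    """Return the words within `window` positions of tokens[i], excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_examples(tokens, window=2):
    """CBOW: the context words are the input, the centre word is the target."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]


def skipgram_examples(tokens, window=2):
    """Skip-gram: the centre word is the input, each context word is a target."""
    return [(tokens[i], c)
            for i in range(len(tokens))
            for c in context_window(tokens, i, window)]


sentence = ["the", "film", "was", "great"]
print(cbow_examples(sentence, window=1))
print(skipgram_examples(sentence, window=1))
```

CBOW yields one example per word position, while Skip-gram yields one example per (centre, context) pair, so the same sentence produces more Skip-gram examples.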
Summary of the invention
To solve the text representation problem in prior-art text processing and to improve the accuracy of text classification, the present invention provides a text representation processing method based on the bag-of-words model. The method builds the text model by combining the vector space model with word vectors, so that whole text documents can be classified with higher accuracy. The technical scheme of the invention is as follows:
First step: preprocessing.
The text data set is segmented into words, stop words and low-frequency words are removed, and feature words are then selected.
Second step: the preprocessed text data set is represented with the bag-of-words model. The bag-of-words model here is a text representation model that uses TF-IDF (term frequency-inverse document frequency) as the weight.
Third step: a neural network natural language processing model is trained on the preprocessed text data set to obtain word vectors.
Fourth step: the weights of the feature words in the bag-of-words model obtained in the second step are modified according to the similarity of the word vectors obtained in the third step, yielding a new text representation model. In the TF-IDF weight matrix of the vector space model, each feature corresponds to one dimension of the feature space, each text is represented as one row of the matrix, and each column represents one feature word. Many feature words in this matrix have a TF-IDF weight of zero, and these zero weights degrade classification. A zero entry is modified using the TF-IDF values of the n words most similar to it, where similarity is measured on the word vectors trained by the neural network. Specifically: in the text representation model with TF-IDF weights obtained in the second step, consider a feature word t in some row of the corresponding feature weight matrix whose feature weight W_t is zero.
In the first case, W_t is approximated from the weights W_t1, W_t2, W_t3, ..., W_tn of the words t1, t2, t3, ..., tn that are similar to feature word t. The number n of similar words is controlled by the size of the feature-word similarity threshold m.
Here S(t,tn) is the similarity between feature word t and feature word tn.
In the second case, W_t is approximated from the weight W_i of the word i that is most similar to t among t1, t2, t3, ..., tn:
W_t = W_i * S(t,i)    (2)
Here S(t,i) is the similarity between feature word t and feature word i.
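As a worked illustration of formula (2) with made-up numbers (they do not come from the patent): suppose feature word t has zero weight in some row, its most similar word i has TF-IDF weight 0.35 in that row, and the similarity between the two is 0.8. Then:

```latex
W_t = W_i \cdot S_{(t,i)} = 0.35 \times 0.8 = 0.28
```

If S is a cosine similarity (an assumption; the text does not name the similarity measure), then S is at most 1, so the filled-in weight never exceeds the weight of the word it is borrowed from.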
Further, for a smaller data set, the preprocessed text data set is replicated n times (n being a positive integer) to enlarge it before the neural network natural language processing model is trained to obtain word vectors; the word vectors obtained in this way perform better.
The beneficial effect of the invention is that building the text model by combining the vector space model with word vectors allows whole text documents to be classified with higher accuracy.
Brief description of the drawings
Fig. 1 is a schematic diagram of the text representation processing pipeline based on the bag-of-words model and word vectors.
Fig. 2 shows the CBOW and Skip-gram models used to train word vectors.
Fig. 3 compares the classification results obtained with a RandomForest classifier.
Detailed description of the embodiments
The specific embodiment described here only illustrates an implementation of the invention and does not limit its scope. Embodiments of the invention are described in detail below with reference to the accompanying drawings and comprise the following steps:
1. Formatting the data set. Data sets come in different formats: some store data in txt files, others in pkl files. The text processing system provided by this embodiment converts all data sets to CSV files. CSV is a general-purpose, relatively simple plain-text file format. It uses some character set, such as ASCII, Unicode, GB2312, or UTF-8; it consists of records (typically one record per line); each record is divided into fields by a separator (typically a comma, semicolon, tab, or space); and every record has the same sequence of fields.
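A minimal sketch of this conversion using only the Python standard library is shown below. The input layouts are assumptions made for illustration, since the patent does not describe them: a txt file is taken to hold one `label<TAB>text` record per line, and a pkl file is taken to hold a pickled list of (label, text) pairs. The function names are likewise hypothetical.

```python
import csv
import pickle
from pathlib import Path


def load_records(path):
    """Load (label, text) records from a .txt or .pkl file (assumed layouts)."""
    path = Path(path)
    if path.suffix == ".pkl":
        # Assumed layout: a pickled list of (label, text) pairs.
        with path.open("rb") as fh:
            return [tuple(record) for record in pickle.load(fh)]
    # Assumed layout: one "label<TAB>text" record per non-empty line.
    with path.open(encoding="utf-8") as fh:
        return [tuple(line.rstrip("\n").split("\t", 1)) for line in fh if line.strip()]


def write_csv(records, csv_path):
    """Write records to a comma-separated file with a header row."""
    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["label", "text"])
        writer.writerows(records)
```

The `csv` module quotes fields that contain the separator, so text containing commas survives a round trip.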
2. Preprocessing the data. The text generally needs to be segmented, and stop words and low-frequency words removed.
(1) Word segmentation. English words are separated by spaces, which act as natural delimiters, so no segmentation is needed; it is enough to remove punctuation and digits. Chinese words, however, consist of varying numbers of characters, so Chinese text must first be segmented. Automatic word segmentation requires the computer to cut each sentence into reasonable words according to its meaning. Natural language processing treats the word as the smallest unit, and segmentation accuracy directly affects the quality of text classification. This embodiment uses the Jieba word segmentation package for Chinese word segmentation.
(2) Stop-word removal. Words such as "I" and common function words appear in almost every text and do not help distinguish document classes, so they are removed. For English, NLTK provides a standard stop-word list, so stop words are easily removed with good results. For Chinese there is no standard stop-word list, so a stop-word list must be found and downloaded before stop words can be removed.
(3) Low-frequency words usually have little influence on a document and in some cases need to be removed; in other cases, however, it is precisely these specific words that distinguish a document from the others.
(4) English words vary with tense and voice, so stemming is needed to reduce each word to its base form.
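The preprocessing steps for English text can be sketched with the standard library alone. The stop-word set and the `min_count` cut-off below are illustrative assumptions, not values from the patent; stemming is omitted, and for Chinese text a word segmenter would take the place of `tokenize`.

```python
import re
from collections import Counter

# Illustrative stop-word set only; a real system would load a full list.
STOP_WORDS = {"the", "a", "of", "is", "and", "i"}


def tokenize(text):
    """Lower-case and split on any non-letter, dropping punctuation and digits."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]


def preprocess(docs, min_count=2):
    """Tokenize, remove stop words, then drop words seen fewer than min_count times."""
    tokenized = [[t for t in tokenize(d) if t not in STOP_WORDS] for d in docs]
    freq = Counter(t for doc in tokenized for t in doc)
    return [[t for t in doc if freq[t] >= min_count] for doc in tokenized]


docs = ["The film is great, great fun!", "I liked the film 2 times."]
print(preprocess(docs))
```

Frequencies are counted over the whole corpus after stop-word removal, so a word that is rare in one document but common elsewhere is kept.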
3. Feature selection. The feature space usually has more than 100,000 dimensions; such a high-dimensional space makes computation very inefficient or even infeasible. Some words contribute very little: they appear in nearly every text, cannot serve as features of a particular text, and are of no use to the subsequent classification. Words that are representative of the text must therefore be selected to form a new feature space, which reduces dimensionality. Common feature selection methods include document frequency (DF), mutual information (MI), information gain (IG), and the χ² statistic (CHI). Information gain is the most widely used in text classification, and the invention uses it for feature selection.
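Information gain for a candidate word can be computed as the drop in class entropy once documents are split by whether they contain the word. The sketch below uses that standard definition; the patent does not spell out its exact formulation, and the function names and the top-k selection rule are assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(docs, labels, term):
    """Class entropy minus the entropy remaining after splitting on term presence."""
    with_t = [y for d, y in zip(docs, labels) if term in d]
    without_t = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(with_t) / n) * entropy(with_t) + (len(without_t) / n) * entropy(without_t)
    return entropy(labels) - conditional


def select_features(docs, labels, k):
    """Keep the k words with the highest information gain."""
    vocab = sorted({t for d in docs for t in d})
    return sorted(vocab, key=lambda t: information_gain(docs, labels, t), reverse=True)[:k]
```

A word that appears equally often in every class has an information gain of zero, which is why words that occur in nearly every text are discarded.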
4. Text representation. Text representation formalizes a text as numbers that a computer can calculate with, so that the computer can process natural-language text. The text representation model in general use today is the vector space model (VSM), which is also the most effective for text classification. The choice of text representation directly affects classification performance. The basic idea of the VSM is to represent a large collection of texts as a feature word matrix, so that comparing the similarity of texts becomes comparing the similarity of feature vectors in that space, which is clear and easy to understand. In this feature word matrix, each feature corresponds to one dimension of the feature space, the number of rows equals the number of texts to be classified, each text is represented as one row, and each column represents one feature word. In practice, the vector space model usually uses TF-IDF values as weights. The TF-IDF weight is calculated as follows:
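The formula itself does not survive in this text. A common TF-IDF variant, assumed here rather than taken from the patent, multiplies the relative frequency of a word in a document by log(N / df), where N is the number of documents and df is the number of documents containing the word. A minimal sketch of the resulting weight matrix:

```python
import math
from collections import Counter


def tfidf_matrix(docs, vocab):
    """Return one row per document and one column per vocabulary word.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(t for d in docs for t in set(d))
    matrix = []
    for d in docs:
        counts = Counter(d)
        row = []
        for t in vocab:
            tf = counts[t] / len(d) if d else 0.0
            idf = math.log(n_docs / df[t]) if df[t] else 0.0
            row.append(tf * idf)
        matrix.append(row)
    return matrix


docs = [["good", "film"], ["bad", "film"]]
vocab = ["bad", "film", "good"]
for row in tfidf_matrix(docs, vocab):
    print(row)
```

Every word absent from a document gets a zero in that document's row, so the matrix is mostly zeros for a realistic vocabulary. These are the entries the method later fills in.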
5. The neural network language model (the Word2vec framework open-sourced by Google) is trained on the data set preprocessed in the first step. The data set used in this embodiment is relatively small, so it is enlarged by replicating it n times. Training yields a dictionary in which each word is a vector, and these word vectors capture contextual information. The invention combines the vector space model with word vectors; this text representation method improves classification.
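The replication step amounts to repeating the tokenized corpus before it is handed to the word-vector trainer. A minimal sketch follows; the function name is hypothetical, and the training call itself is left out because the patent does not give its parameters.

```python
def replicate_corpus(sentences, n):
    """Return the tokenized corpus repeated n times (n a positive integer)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Copy each sentence so later edits to one copy do not affect the others.
    return [list(s) for _ in range(n) for s in sentences]


corpus = [["good", "film"], ["bad", "plot"]]
expanded = replicate_corpus(corpus, 3)
print(len(expanded))
```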
6. Take the TF-IDF weight matrix of the vector space model obtained in step 4. In this feature word matrix, each feature corresponds to one dimension of the feature space, the number of rows equals the number of texts to be classified, each text is represented as one row, and each column represents one feature word. Many feature words in this matrix have a TF-IDF weight of zero, and these zero weights degrade classification. Using the word vectors obtained in step 5, the invention proposes the following: for a feature word whose TF-IDF weight is zero, its similar words are looked up with the word vectors, and the weights of those similar words whose TF-IDF values are non-zero are used to approximate the zero-valued feature word. The specific implementation is as follows: in the TF-IDF weight matrix corresponding to the vector space model, for a feature word t in some row whose feature weight W_t is zero, either of the following can be used:
(1) W_t is approximated from the weights W_t1, W_t2, W_t3, ..., W_tn of the words t1, t2, t3, ..., tn that are similar to feature word t. The number n of similar words can be controlled through the size of the feature-word similarity threshold m, as shown in formula (1).
(2) W_t is approximated from the weight W_i of the word most similar to t among t1, t2, t3, ..., tn, as shown in formula (2).
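The two options can be sketched as a single function over one row of the weight matrix. Several points are assumptions rather than statements from the patent: cosine similarity is used as S, candidate words are restricted to those with a non-zero weight in the same row, and, because formula (1) is not reproduced in this text, the "mean" mode uses the average of W_tk * S(t, tk) over the similar words as one plausible reading. Only the "nearest" mode corresponds directly to formula (2).

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def fill_zero_weights(row, vocab, vectors, threshold=0.5, mode="nearest"):
    """Return a copy of one TF-IDF row with zero entries approximated from similar words."""
    filled = list(row)
    for j, t in enumerate(vocab):
        if row[j] != 0.0 or t not in vectors:
            continue
        # Candidates: words with a non-zero weight in this row whose
        # similarity to t reaches the threshold m.
        cands = [(cosine(vectors[t], vectors[u]), row[k])
                 for k, u in enumerate(vocab)
                 if k != j and row[k] != 0.0 and u in vectors]
        cands = [(s, w) for s, w in cands if s >= threshold]
        if not cands:
            continue
        if mode == "nearest":
            # Formula (2): W_t = W_i * S(t, i) for the most similar word i.
            s, w = max(cands)
            filled[j] = w * s
        else:
            # ASSUMED form of formula (1): mean of W_tk * S(t, tk).
            filled[j] = sum(w * s for s, w in cands) / len(cands)
    return filled


vocab = ["good", "great", "bad"]
vectors = {"good": [1.0, 0.0], "great": [0.8, 0.6], "bad": [-1.0, 0.0]}
print(fill_zero_weights([0.0, 0.5, 0.2], vocab, vectors))
```

Similarities are computed against the original row rather than the partially filled one, so a value filled in for one word is never reused to fill another.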
7. The text model built by the invention is classified with a RandomForest classifier. A random forest, as the name suggests, builds a forest in a random manner; the forest consists of many decision trees, and the decision trees are not correlated with one another. Once the forest has been built, when a new input sample arrives, each decision tree in the forest judges separately which class the sample should belong to (for classification), and the class chosen most often is the prediction for the sample. The SST (Stanford Sentiment Treebank) data set is used for classification. Comparing the classification accuracy of the bag-of-words model with that of the model modified by the invention shows that the text representation processing method based on the bag-of-words model proposed by the invention achieves higher classification accuracy.
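The voting step described above can be sketched as follows. This is a toy illustration, not the classifier used in the patent: the "trees" are hypothetical one-split stumps with hand-picked thresholds, and the training of a real random forest (bootstrap samples and random feature subsets for each tree) is left out.

```python
from collections import Counter


def make_stump(feature_index, threshold):
    """Hypothetical one-split 'tree': votes "pos" when one feature exceeds a threshold."""
    def stump(sample):
        return "pos" if sample[feature_index] > threshold else "neg"
    return stump


def forest_predict(trees, sample):
    """Each tree votes independently; the most frequent class is the prediction."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]


forest = [make_stump(0, 0.1), make_stump(1, 0.3), make_stump(2, 0.2)]
print(forest_predict(forest, [0.4, 0.0, 0.5]))
```

Here each sample is one row of the modified weight matrix, so filling in zero weights changes which side of a split a document falls on.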

Claims (3)

1. A text representation processing method based on the bag-of-words model, characterized by comprising the following steps:
First step: preprocessing;
The text data set is segmented into words, stop words and low-frequency words are removed, and feature words are then selected;
Second step: the preprocessed text data set is represented with the bag-of-words model; the bag-of-words model is a text representation model that uses TF-IDF as the weight;
Third step: a neural network natural language processing model is trained on the preprocessed text data set to obtain word vectors;
Fourth step: the weights of the feature words in the bag-of-words model obtained in the second step are modified according to the similarity of the word vectors obtained in the third step, yielding a new text representation model; specifically: in the text representation model with TF-IDF weights obtained in the second step, for a feature word t in some row of the corresponding feature weight matrix, if its feature weight W_t is zero, W_t is approximated from the weights W_t1, W_t2, W_t3, ..., W_tn of the words t1, t2, t3, ..., tn that are similar to feature word t; the number n of similar words is controlled by the size of the feature-word similarity threshold m.
2. The text representation processing method based on the bag-of-words model according to claim 1, characterized in that, in the second step, the preprocessed text data set is replicated n times, n being a positive integer, to enlarge the data set, and the neural network natural language processing model is then trained to obtain word vectors.
3. The text representation processing method based on the bag-of-words model according to claim 1 or 2, characterized in that, in the fourth step, the weights of the feature words in the bag-of-words model obtained in the second step are modified according to the similarity of the word vectors obtained in the third step, yielding a new text representation model; specifically: in the text representation model with TF-IDF weights obtained in the second step, for a feature word t in some row of the corresponding feature weight matrix, if its feature weight W_t is zero, W_t is approximated from the weight W_i of the word most similar to t among the words t1, t2, t3, ..., tn that are similar to feature word t.
CN201710569638.0A 2017-01-05 2017-07-14 Text representation processing method based on bag-of-words model Expired - Fee Related CN107357895B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710005310 2017-01-05
CN2017100053106 2017-01-05

Publications (2)

Publication Number Publication Date
CN107357895A (en) 2017-11-17
CN107357895B (en) 2020-05-19

Family

ID=60292842

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710569638.0A Expired - Fee Related CN107357895B (en) 2017-01-05 2017-07-14 Text representation processing method based on bag-of-words model

Country Status (1)

Country Link
CN (1) CN107357895B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102930063A (en) * 2012-12-05 2013-02-13 电子科技大学 Feature item selection and weight calculation based text classification method
CN103927302A (en) * 2013-01-10 2014-07-16 阿里巴巴集团控股有限公司 Text classification method and system
US20150026104A1 (en) * 2013-07-17 2015-01-22 Christopher Tambos System and method for email classification
CN104809131A (en) * 2014-01-27 2015-07-29 董靖 Automatic classification system and method of electronic documents
CN104778158A (en) * 2015-03-04 2015-07-15 新浪网技术(中国)有限公司 Method and device for representing text
CN104881400A (en) * 2015-05-19 2015-09-02 上海交通大学 Semantic dependency calculating method based on associative network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
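Zhu Xuemei (朱雪梅): "Microblog recommendation based on Word2Vec topic extraction" (基于Word2Vec主题提取的微博推荐), China Master's Theses Full-text Database, Information Science and Technology, 2016, No. 03 *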

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110362815A (en) * 2018-04-11 2019-10-22 北京京东尚科信息技术有限公司 Text vector generation method and device
CN109284382A (en) * 2018-09-30 2019-01-29 武汉斗鱼网络科技有限公司 A kind of file classification method and computing device
CN109284382B (en) * 2018-09-30 2021-05-28 武汉斗鱼网络科技有限公司 Text classification method and computing device
CN109543036A (en) * 2018-11-20 2019-03-29 四川长虹电器股份有限公司 Text Clustering Method based on semantic similarity
WO2020199595A1 (en) * 2019-04-04 2020-10-08 平安科技(深圳)有限公司 Long text classification method and device employing bag-of-words model, computer apparatus, and storage medium
CN111859901A (en) * 2020-07-15 2020-10-30 大连理工大学 English repeated text detection method, system, terminal and storage medium

Also Published As

Publication number Publication date
CN107357895B (en) 2020-05-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200519

Termination date: 20210714