CN105335446A - Short text classification model generation method and classification method based on word vector - Google Patents


Info

Publication number
CN105335446A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410398780.XA
Other languages
Chinese (zh)
Inventor
张艳
马成龙
潘接林
颜永红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Original Assignee
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS, Beijing Kexin Technology Co Ltd
Priority to CN201410398780.XA
Publication of CN105335446A


Abstract

The invention relates to a word-vector-based short text classification model generation method and classification method. The model generation method comprises the following steps: collecting data, labeling the collected data by field, and taking the labeled data as training data; preprocessing the training data; looking up a word vector dictionary, converting the text data contained in the training data into vector data, and separating the vector data by field; training a Gaussian model on the vector data of each field to obtain the optimal values of the Gaussian model parameters, thereby obtaining the Gaussian model corresponding to that field; and forming the classification model from the Gaussian models corresponding to all fields of the training data.

Description

Short text classification model generation method and classification method based on word vectors
Technical field
The present invention relates to the field of text mining, and in particular to a short text classification model generation method and classification method based on word vectors.
Background art
With the rapid development of Internet technology, large amounts of text information and data have emerged. To manage and use this information effectively, content-based information retrieval and data mining have gradually become fields of great interest. Text classification is an important foundation of information retrieval and text mining; its main task is to determine the category of a text from its content, given a predefined set of category labels. Text classification plays an important role in natural language processing and understanding, information organization and management, content filtering, and related fields.
Recently, however, with the development of social networks and e-commerce, text data in short form, such as microblog posts, instant messages, product reviews, and film reviews, has grown explosively. A short text is usually a single brief sentence; it contains few words and is therefore hard to analyze statistically. Extracting useful information from these short texts, and using that information to serve users better, has become a key issue for Internet services. For example, if a user often posts computer-related updates on a microblog, products, articles, and comments related to computers can be recommended to that user automatically, better meeting the user's needs. Traditional text classification methods usually classify text by counting how often each character, word, or phrase occurs in a specific field and computing the corresponding probability (that is, a simple counting mechanism). For new text data, however, characters or words that have not appeared before are often ignored. This simple counting mechanism does not adequately take the semantic information of the text into account.
Summary of the invention
The object of the present invention is to overcome the defect that text classification methods of the prior art are not suitable for short text, and thereby to provide a classification method suitable for short text.
To achieve this object, the invention provides a short text classification model generation method based on word vectors, comprising:
Step 101), collecting data, labeling the collected data by field, and taking the labeled data as training data;
Step 102), preprocessing the training data;
Step 103), looking up a word vector dictionary, converting the text data contained in the training data into vector data, and separating the vector data by field;
Step 104), training a Gaussian model on the vector data of each field to obtain the optimal values of the Gaussian model parameters, thereby obtaining the Gaussian model corresponding to that field; the Gaussian models corresponding to all fields of the training data form the classification model.
The above technical solution further comprises:
Step 105), collecting and labeling data, and taking the labeled data as test data; applying the test data to the trained model obtained in step 104), verifying the validity of the trained model from the results it produces, and, if the trained model is unsatisfactory, performing parameter tuning.
In the above technical solution, the following is further included before step 101):
Crawling a large number of web page text files from the Internet and training word vectors on the text data in those files, to obtain a dictionary containing the correspondence between words and vectors.
In the above technical solution, in step 102), the preprocessing comprises: removing invalid data from the training data, and removing stop words.
In the above technical solution, in step 102), the preprocessing further comprises performing word segmentation on Chinese data.
In the above technical solution, the parameters of the Gaussian model comprise the Gaussian mean and variance, and the optimal values of the Gaussian model parameters are the parameter values that yield the highest accuracy.
The present invention also provides a short text classification method based on word vectors, comprising:
Step 201), inputting the text data to be classified, and preprocessing that text data;
Step 202), inputting the text data to be classified into the Gaussian model corresponding to each field in the trained model obtained by the word-vector-based short text classification model generation method, obtaining the posterior probability of the text data being generated by each Gaussian model, and taking the field corresponding to the Gaussian model with the largest posterior probability as the classification result of the text data.
In the above technical solution, the preprocessing comprises: removing invalid data from the training data, and removing stop words.
In the above technical solution, the preprocessing further comprises performing word segmentation on Chinese data.
The advantages of the invention are as follows:
The method of the present invention classifies short text by building a classification model based on word vectors, and has the advantages of good classification performance and high discrimination.
Brief description of the drawings
Fig. 1 is a flowchart of the classification model generation method of the present invention;
Fig. 2 is a flowchart of the classification method of the present invention.
Detailed description
For ease of understanding, the concepts involved in the present invention are explained first.
Word vector: a word is represented by a mathematical column vector. The column vector corresponding to a word is obtained by training on a large corpus, using an open-source tool such as word2vec to process the corpus.
Word vector dictionary: a dictionary that records word vectors.
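As a minimal sketch of these two concepts, a word vector dictionary can be viewed as a mapping from each word to its vector. The three-dimensional vectors and the names `word_vectors` and `lookup` below are hypothetical and chosen only for illustration; vectors produced by a tool such as word2vec would normally have many more dimensions.

```python
from typing import Dict, List, Optional

# Hypothetical toy dictionary: each word maps to a small vector.
WordVectorDict = Dict[str, List[float]]

word_vectors: WordVectorDict = {
    "bank": [0.9, 0.1, 0.0],
    "loan": [0.8, 0.2, 0.1],
    "laptop": [0.1, 0.9, 0.3],
}


def lookup(word: str, dictionary: WordVectorDict) -> Optional[List[float]]:
    """Return the vector recorded for `word`, or None if the word is absent."""
    return dictionary.get(word)
```

Returning `None` for a missing word is a design choice of this sketch; the source does not say how words absent from the dictionary are handled.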
The invention is now further described with reference to the accompanying drawings.
The method of the present invention comprises a training stage and a classification stage. The training stage mainly uses labeled data to train the classification model; the classification stage then uses the trained classification model to classify the text data to be classified. The work to be done in each of these two stages is described below.
With reference to Fig. 1, the method of the present invention comprises the following steps in the training stage:
Step 101), collecting data, labeling the collected data, and taking the labeled data as training data.
When collecting data in this step, the type of data to collect can be determined by the needs of the application. For example, if the method of the present invention is to be used in an application related to the financial industry, as many short texts from the financial field as possible should be collected. The amount of data collected can be determined as required; in general, the more data is collected, the more accurate the trained classification model.
Labeling the collected data means attaching a field label to each collected short text; the field label reflects the field to which the data belongs. For example, the short text "Fitbit releases WP app, becoming the first smart wristband to support WP" can be given the field label "computer".
Step 102), preprocessing the training data. The preprocessing comprises: removing invalid data from the training data (such as punctuation and format characters), and removing stop words (words with no substantive meaning, such as "this" and "that").
In particular, Chinese data also requires word segmentation. How to segment Chinese data is well known to those skilled in the art and is not repeated here.
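The preprocessing of step 102) might be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the stop-word list `STOP_WORDS` is hypothetical, the text is assumed to be already separated by whitespace (Chinese word segmentation is outside the scope of the sketch), and lowercasing is an added convenience that the source does not mention.

```python
import string
from typing import List

# Hypothetical stop-word list; a real system would use a much larger one.
STOP_WORDS = {"this", "that", "the", "a", "of"}


def preprocess(text: str) -> List[str]:
    """Remove punctuation and format characters, then drop stop words."""
    # Replace every ASCII punctuation character with a space.
    table = str.maketrans({ch: " " for ch in string.punctuation})
    # split() with no arguments also discards tabs, newlines and repeated spaces.
    tokens = text.lower().translate(table).split()
    return [t for t in tokens if t not in STOP_WORDS]
```

For instance, `preprocess("Fitbit releases WP application: the first smart wristband!")` keeps the seven content words and drops the punctuation and the stop word "the".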
Step 103), looking up the word vector dictionary, converting the text data contained in the training data into vector data, and separating the vector data by field.
In the preceding step 101), the text data in the training data was given field labels. After the text data is converted into vector data, the vector data retains these labels, so the vector data can be separated by field according to the field information in the labels. The vector data (that is, the word vectors) belonging to the same field is aggregated, and all vector data of one field can be stored in one vector file.
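The conversion and separation of step 103) can be sketched as below. The function name `vectorize_by_field` and the in-memory dictionary-of-lists layout (standing in for one "vector file" per field) are assumptions of this sketch; skipping words that are missing from the word vector dictionary is likewise an assumption, since the source does not specify that case.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Vector = List[float]


def vectorize_by_field(
    samples: List[Tuple[List[str], str]],
    dictionary: Dict[str, Vector],
) -> Dict[str, List[Vector]]:
    """Map each token to its word vector and pool the vectors per field label.

    `samples` holds (tokens, field_label) pairs produced by labeling and
    preprocessing. Tokens absent from `dictionary` are skipped (assumption).
    """
    by_field: Dict[str, List[Vector]] = defaultdict(list)
    for tokens, field in samples:
        for token in tokens:
            vec = dictionary.get(token)
            if vec is not None:
                by_field[field].append(vec)
    return dict(by_field)
```

Each value of the returned mapping plays the role of the vector file for one field.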
Step 104), for the vector file of each field, training a Gaussian model to obtain the optimal values of the Gaussian model parameters; that is, a Gaussian model is fitted to the data of each field, and the Gaussian models of all fields form the classification model.
The parameters of the Gaussian model comprise the Gaussian mean and variance, and the optimal values of the Gaussian model parameters are the parameter values that yield the highest accuracy. The training data of each field produces one corresponding Gaussian model, and each Gaussian model comprises parameters such as the Gaussian mean and variance. The Gaussian models obtained from all the training data constitute the classification model that the training stage is to produce.
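One simple way to realize step 104) is sketched below, under assumptions the source does not confirm: each field is modeled by a single Gaussian with a diagonal covariance (one mean and one variance per vector dimension), and the parameters are estimated by maximum likelihood rather than by a search for the highest accuracy. The names `fit_gaussian`, `train_classifier` and the variance floor `var_floor` are hypothetical.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Gaussian = Tuple[Vector, Vector]  # (per-dimension mean, per-dimension variance)


def fit_gaussian(vectors: List[Vector], var_floor: float = 1e-3) -> Gaussian:
    """Maximum-likelihood mean and variance per dimension (diagonal Gaussian)."""
    n = len(vectors)
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    # Floor the variance so that a constant dimension does not yield zero.
    var = [
        max(sum((v[d] - mean[d]) ** 2 for v in vectors) / n, var_floor)
        for d in range(dim)
    ]
    return mean, var


def train_classifier(
    vectors_by_field: Dict[str, List[Vector]], var_floor: float = 1e-3
) -> Dict[str, Gaussian]:
    """Fit one Gaussian per field; together they form the classification model."""
    return {
        field: fit_gaussian(vectors, var_floor)
        for field, vectors in vectors_by_field.items()
    }
```

The variance floor is one candidate for the kind of parameter that could later be adjusted during parameter tuning.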
As a preferred implementation, the method of the present invention may further comprise, in the training stage:
Step 105), collecting and labeling data, and taking the labeled data as test data; applying the test data to the trained model obtained in step 104), verifying the validity of the trained model from the results it produces, and, if the trained model is unsatisfactory, performing parameter tuning.
The collecting and labeling operations in this step do not differ in substance from the corresponding operations in step 101) and are not repeated here.
In this step, the validity of the trained model can be verified by using the trained model to predict the field labels of the test data and comparing the predicted labels with the true field labels of the test data (the true labels are obtained when the test data is labeled). A prediction that matches the true label counts as correct. For example, if 100 test items are evaluated and the predicted labels of 75 of them match the true labels, the validity (accuracy) of the trained model is 75%.
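The validity check described above amounts to computing the fraction of matching labels. A minimal sketch, with the hypothetical function name `accuracy`:

```python
from typing import Sequence


def accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of test items whose predicted field label equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one test item is required")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```

With 100 test items of which 75 are predicted correctly, the function returns 0.75, matching the 75% figure in the example above.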
The parameter tuning in this step consists of traversing the parameters, trying as many values as possible, checking which parameter values give the highest accuracy, and selecting those values.
The word vector dictionary used in the training stage may be an existing word vector dictionary, or it may be generated before step 101). If a word vector dictionary needs to be generated, a large number of web page text files (for example, hundreds of gigabytes or terabytes of data) are crawled from the Internet with a web crawler tool. If these web page text files contain Chinese text, the Chinese text must first be segmented into words. Word vectors are then trained on the text data in the web page text files, for example with the existing word2vec tool, to obtain a dictionary containing the correspondence between words and vectors.
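The parameter traversal described for step 105) can be sketched generically as follows. The source does not say which parameters are traversed, so the sketch takes any iterable of candidate values together with an `evaluate` callback that is assumed to train a model with the candidate and return its accuracy on the test data; both names, and `tune` itself, are hypothetical.

```python
from typing import Callable, Iterable, Optional, Tuple, TypeVar

P = TypeVar("P")


def tune(candidates: Iterable[P], evaluate: Callable[[P], float]) -> Tuple[P, float]:
    """Try every candidate and return the one with the highest accuracy.

    Ties are resolved in favor of the candidate that was tried first.
    """
    best: Optional[Tuple[P, float]] = None
    for candidate in candidates:
        score = evaluate(candidate)
        if best is None or score > best[1]:
            best = (candidate, score)
    if best is None:
        raise ValueError("at least one candidate is required")
    return best
```

An exhaustive traversal like this is only practical for a small number of candidate values; the source does not discuss the cost of the search.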
As shown in Fig. 2, the method of the present invention comprises the following steps in the classification stage:
Step 201), inputting the text data to be classified and preprocessing it, which comprises: removing invalid data from the text data (such as punctuation and format characters), and removing stop words (words with no substantive meaning, such as "this" and "that"). In particular, Chinese data also requires word segmentation.
Step 202), inputting the text data to be classified into the Gaussian model corresponding to each field in the trained model, obtaining the posterior probability of the text data being generated by each Gaussian model, and taking the field corresponding to the Gaussian model with the largest posterior probability as the classification result of the text data.
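Step 202) might be sketched as follows. The source does not spell out how the posterior probability is computed, so the sketch makes three labeled assumptions: each field model is a diagonal Gaussian given as a (mean, variance) pair, the word vectors of a text are treated as independent given the field, and all fields have equal prior probability. The names `log_likelihood` and `classify` are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Gaussian = Tuple[Vector, Vector]  # (per-dimension mean, per-dimension variance)


def log_likelihood(vectors: List[Vector], model: Gaussian) -> float:
    """Sum of diagonal-Gaussian log densities over all word vectors of a text."""
    mean, var = model
    total = 0.0
    for v in vectors:
        for x, m, s2 in zip(v, mean, var):
            total += -0.5 * (math.log(2 * math.pi * s2) + (x - m) ** 2 / s2)
    return total


def classify(
    vectors: List[Vector], models: Dict[str, Gaussian]
) -> Tuple[str, Dict[str, float]]:
    """Return the field with the largest posterior, plus all posteriors."""
    # With equal priors (assumption) the posterior is the normalized likelihood.
    scores = {field: log_likelihood(vectors, m) for field, m in models.items()}
    top = max(scores.values())
    # Log-sum-exp normalization keeps the computation numerically stable.
    norm = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    posteriors = {field: math.exp(s - norm) for field, s in scores.items()}
    best = max(posteriors, key=lambda field: posteriors[field])
    return best, posteriors
```

Working in log space avoids underflow when a text contains many word vectors; if unequal priors were wanted, the logarithm of each field's prior could be added to its score before normalization.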
Finally, it should be noted that the above embodiments are intended only to illustrate, and not to limit, the technical solution of the present invention. Although the invention has been described in detail with reference to the embodiments, those of ordinary skill in the art should understand that modifications or equivalent replacements of the technical solution of the present invention that do not depart from its spirit and scope shall all be covered by the claims of the present invention.

Claims (9)

1. A short text classification model generation method based on word vectors, comprising:
Step 101), collecting data, labeling the collected data by field, and taking the labeled data as training data;
Step 102), preprocessing the training data;
Step 103), looking up a word vector dictionary, converting the text data contained in the training data into vector data, and separating the vector data by field;
Step 104), training a Gaussian model on the vector data of each field to obtain the optimal values of the Gaussian model parameters, thereby obtaining the Gaussian model corresponding to that field; the Gaussian models corresponding to all fields of the training data form the classification model.
2. The short text classification model generation method based on word vectors according to claim 1, characterized in that it further comprises:
Step 105), collecting and labeling data, and taking the labeled data as test data; applying the test data to the trained model obtained in step 104), verifying the validity of the trained model from the results it produces, and, if the trained model is unsatisfactory, performing parameter tuning.
3. The short text classification model generation method based on word vectors according to claim 1 or 2, characterized in that it further comprises, before step 101):
Crawling a large number of web page text files from the Internet and training word vectors on the text data in those files, to obtain a dictionary containing the correspondence between words and vectors.
4. The short text classification model generation method based on word vectors according to claim 1, 2 or 3, characterized in that, in step 102), the preprocessing comprises: removing invalid data from the training data, and removing stop words.
5. The short text classification model generation method based on word vectors according to claim 4, characterized in that, in step 102), the preprocessing further comprises performing word segmentation on Chinese data.
6. The short text classification model generation method based on word vectors according to claim 1, 2 or 3, characterized in that the parameters of the Gaussian model comprise the Gaussian mean and variance, and the optimal values of the Gaussian model parameters are the parameter values that yield the highest accuracy.
7. A short text classification method based on word vectors, comprising:
Step 201), inputting the text data to be classified, and preprocessing that text data;
Step 202), inputting the text data to be classified into the Gaussian model corresponding to each field in the trained model obtained by the word-vector-based short text classification model generation method according to any one of claims 1 to 6, obtaining the posterior probability of the text data being generated by each Gaussian model, and taking the field corresponding to the Gaussian model with the largest posterior probability as the classification result of the text data.
8. The short text classification method based on word vectors according to claim 7, characterized in that the preprocessing comprises: removing invalid data from the training data, and removing stop words.
9. The short text classification method based on word vectors according to claim 8, characterized in that the preprocessing further comprises performing word segmentation on Chinese data.
CN201410398780.XA 2014-08-13 2014-08-13 Short text classification model generation method and classification method based on word vector Pending CN105335446A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410398780.XA CN105335446A (en) 2014-08-13 2014-08-13 Short text classification model generation method and classification method based on word vector

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410398780.XA CN105335446A (en) 2014-08-13 2014-08-13 Short text classification model generation method and classification method based on word vector

Publications (1)

Publication Number Publication Date
CN105335446A true CN105335446A (en) 2016-02-17

Family

ID=55285974

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410398780.XA Pending CN105335446A (en) 2014-08-13 2014-08-13 Short text classification model generation method and classification method based on word vector

Country Status (1)

Country Link
CN (1) CN105335446A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1719436A (en) * 2004-07-09 2006-01-11 中国科学院自动化研究所 A kind of method and device of new proper vector weight towards text classification
US8131539B2 (en) * 2007-03-07 2012-03-06 International Business Machines Corporation Search-based word segmentation method and device for language without word boundary tag
CN102194013A (en) * 2011-06-23 2011-09-21 上海毕佳数据有限公司 Domain-knowledge-based short text classification method and text classification system
CN103455581A (en) * 2013-08-26 2013-12-18 北京理工大学 Mass short message information filtering method based on semantic extension

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Ning Yahui, Fan Xinghua, Wu Yu: "Short text classification based on domain word ontology" (基于领域词语本体的短文本分类), Computer Science (《计算机科学》) *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106528642A (en) * 2016-10-13 2017-03-22 广东广业开元科技有限公司 TF-IDF feature extraction based short text classification method
CN106528642B (en) * 2016-10-13 2018-05-25 广东广业开元科技有限公司 A kind of short text classification method based on TF-IDF feature extractions
CN106777361A (en) * 2017-01-20 2017-05-31 清华大学 Microblogging text mood sorting technique and categorizing system based on vector paragraph model
CN107844476A (en) * 2017-10-19 2018-03-27 广州索答信息科技有限公司 A kind of part-of-speech tagging method of enhancing
CN109977424A (en) * 2017-12-27 2019-07-05 北京搜狗科技发展有限公司 A kind of training method and device of Machine Translation Model
CN109977424B (en) * 2017-12-27 2023-08-08 北京搜狗科技发展有限公司 Training method and device for machine translation model
CN109635110A (en) * 2018-11-30 2019-04-16 北京百度网讯科技有限公司 Data processing method, device, equipment and computer readable storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160217