CN105335446A - Short text classification model generation method and classification method based on word vector

Info

Publication number: CN105335446A
Application number: CN201410398780.XA
Authority: CN
Inventors: 张艳; 马成龙; 潘接林; 颜永红
Original assignee: Institute of Acoustics CAS; Beijing Kexin Technology Co Ltd
Current assignee: Institute of Acoustics CAS; Beijing Kexin Technology Co Ltd
Priority date: 2014-08-13
Filing date: 2014-08-13
Publication date: 2016-02-17

Abstract

The invention relates to a short text classification model generation method and a classification method based on word vectors. The short text classification model generation method based on word vectors comprises the following steps: collecting data, labeling the collected data by field, and taking the labeled data as training data; preprocessing the training data; looking up a word vector dictionary, converting the text data contained in the training data into vector data, and separating the vector data by field; carrying out model training on the vector data of each field with a Gaussian model to obtain the optimal values of the Gaussian model parameters, thereby obtaining the Gaussian model corresponding to that field; and forming the classification model from the Gaussian models corresponding to all fields of the training data.

Description

A short text classification model generation method and classification method based on word vectors

Technical field

The present invention relates to the field of text mining, and in particular to a short text classification model generation method and a classification method based on word vectors.

Background art

With the rapid development of Internet technology, large amounts of text information and data have emerged. To manage and use this information effectively, content-based information retrieval and data mining have gradually become fields of great interest. Text classification is an important foundation of information retrieval and text mining; its main task is to determine the category of a text from its content, given a predefined set of category labels. Text classification plays an important role in natural language processing and understanding, information organization and management, content filtering, and related fields.

Recently, however, with the growth of social networks and e-commerce, text data in short-text form, such as microblog posts, instant messages, product reviews, and film reviews, has increased explosively. A short text is usually just a simple sentence or so; it contains few words and is therefore hard to characterize statistically. Extracting useful information from these short texts, and using it to serve users better, has become key to Internet services. For example, if a user often posts computer-related updates on a microblog, products, articles, and comments about computers can be recommended to that user automatically, better meeting the user's needs. Traditional text classification methods usually classify text by counting how often each character, word, or phrase occurs in a given field and computing the corresponding probability (that is, a simple counting mechanism). For new text data, however, characters or words that have not occurred before are often ignored. This simple counting mechanism does not take full account of the semantic information in the text.

Summary of the invention

The object of the invention is to overcome the defect that prior-art text classification methods are not suitable for short text, and thereby to provide a classification method suitable for short text.

To achieve this object, the invention provides a short text classification model generation method based on word vectors, comprising:

Step 101), collecting data and labeling the collected data by field, the labeled data serving as training data;

Step 102), preprocessing the training data;

Step 103), looking up a word vector dictionary, converting the text data contained in the training data into vector data, and separating the vector data by field;

Step 104), for the vector data of each field, carrying out model training with a Gaussian model to obtain the optimal values of the Gaussian model parameters, thereby obtaining the Gaussian model corresponding to that field; the Gaussian models corresponding to all fields of the training data form the classification model.

The above technical scheme further comprises:

Step 105), collecting and labeling data, the labeled data serving as test data; applying the test data to the trained model obtained in step 104) and verifying the validity of the trained model from the results it generates; if the trained model is unsatisfactory, carrying out parameter tuning.

In the above technical scheme, before step 101), the method further comprises:

crawling a large number of web page text files from the Internet and carrying out word vector training on the text data in those files, to obtain a dictionary recording the correspondence between words and vectors.

In the above technical scheme, in step 102), the preprocessing comprises: removing invalid data from the training data and removing stop words.

In the above technical scheme, in step 102), the preprocessing further comprises performing word segmentation on Chinese data.

In the above technical scheme, the parameters of the Gaussian model comprise the Gaussian mean and variance, and the optimal values of the Gaussian model parameters are the parameter values that yield the highest accuracy.
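For reference, one common form of a Gaussian model with a mean and a variance can be written as below. This is an illustrative sketch only: the text does not state whether a full covariance matrix or a per-dimension variance is used, so the per-dimension (diagonal) form, the dimension D, and the symbols x, mu and sigma are assumptions introduced here.

```latex
% Assumed diagonal Gaussian over a D-dimensional word vector x,
% with per-dimension mean \mu_d and variance \sigma_d^2.
\mathcal{N}(x;\mu,\sigma^{2})
  = \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi\sigma_d^{2}}}
    \exp\!\left(-\frac{(x_d-\mu_d)^{2}}{2\sigma_d^{2}}\right)
```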

The invention also provides a short text classification method based on word vectors, comprising:

Step 201), inputting the text data to be classified and preprocessing it;

Step 202), inputting the text data to be classified into the Gaussian model corresponding to each field in the trained model obtained by the above short text classification model generation method based on word vectors, obtaining the posterior probability of the text data being generated by each Gaussian model, and taking the field corresponding to the Gaussian model with the largest posterior probability as the classification result of the text data.

In the above technical scheme, the preprocessing comprises: removing invalid data from the training data and removing stop words.

In the above technical scheme, the preprocessing further comprises performing word segmentation on Chinese data.

The advantages of the invention are:

By building a classification model based on word vectors, the method of the invention classifies short texts with good classification performance and high discrimination.

Brief description of the drawings

Fig. 1 is a flowchart of the classification model generation method of the invention;

Fig. 2 is a flowchart of the classification method of the invention.

Detailed description of the embodiments

For ease of understanding, the concepts involved in the invention are explained first.

Word vector: a word represented as a mathematical column vector. The column vector for a word is obtained by collecting a large corpus and processing it with an open-source tool such as word2vec.

Word vector dictionary: a dictionary that records word vectors.
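The two concepts above can be pictured with a toy example. The sketch below is illustrative only: the three-dimensional vectors, the words, and the names `WORD_VECTORS`, `lookup` and `cosine` are hypothetical, and real vectors produced by a tool such as word2vec typically have many more dimensions. It shows a dictionary lookup and how semantically related words tend to have similar vectors.

```python
import math

# Hypothetical toy word vector dictionary: word -> fixed-length vector.
WORD_VECTORS = {
    "bank": [0.8, 0.1, 0.0],
    "loan": [0.7, 0.2, 0.1],
    "laptop": [0.1, 0.9, 0.3],
}


def lookup(word, dictionary=WORD_VECTORS):
    """Return the vector recorded for a word, or None if the word is absent."""
    return dictionary.get(word)


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# In this toy data, "bank" is closer to "loan" than to "laptop".
sim_finance = cosine(lookup("bank"), lookup("loan"))
sim_mixed = cosine(lookup("bank"), lookup("laptop"))
```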

The invention is now further described with reference to the accompanying drawings.

The method of the invention comprises a training stage and a classification stage. The training stage uses labeled data to train the classification model; the classification stage then uses the trained classification model to classify the text data to be classified. The work done in each of the two stages is described below.

With reference to Fig. 1, the method of the invention comprises the following steps in the training stage:

Step 101), collecting data and labeling the collected data, the labeled data serving as training data.

When collecting data in this step, the type of data to collect can be determined by the needs of the application. For example, if the method is to be used in a finance-related application, as many short texts from the financial field as possible should be collected. The amount of data collected can be set as needed; in general, the more data collected, the more accurate the trained classification model.

Labeling the collected data means attaching a field label to each collected short text; the field label reflects the field the data belongs to. For example, the short text "Fitbit releases WP app: becomes the first smart band to support WP" can be given the field label "computer".

Step 102), preprocessing the training data, the preprocessing comprising: removing invalid data (such as punctuation and format characters) from the training data, and removing stop words (words with no substantive meaning, such as "this" and "that").

In particular, Chinese data also requires word segmentation. How to segment Chinese text is well known to those skilled in the art and is not repeated here.
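A minimal sketch of the preprocessing in step 102) follows. It is illustrative only: the stop-word list and the name `preprocess` are hypothetical, lower-casing is an added assumption, and whitespace splitting stands in for the word segmentation that Chinese text would need.

```python
import re

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "this", "that", "to"}


def preprocess(text, stop_words=STOP_WORDS):
    """Strip punctuation, split into tokens, and drop stop words."""
    # Replace every character that is neither a word character nor
    # whitespace (punctuation, format characters) with a space.
    cleaned = re.sub(r"[^\w\s]", " ", text.lower())
    # Assumption: tokens are whitespace-separated. Chinese text would be
    # passed through a word segmenter here instead.
    tokens = cleaned.split()
    return [t for t in tokens if t not in stop_words]


tokens = preprocess(
    "Fitbit releases WP app: becomes the first smart band to support WP"
)
```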

Step 103), looking up the word vector dictionary, converting the text data contained in the training data into vector data, and separating the vector data by field.

In step 101), the text data contained in the training data carries field labels. After the text data is converted into vector data, the vector data still retains these labels, so it can be separated by field according to the field information in the labels. The vector data (that is, the word vectors) belonging to the same field is aggregated, and all vector data of one field can be described in a single vector file.
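The lookup and separation in step 103) could be sketched as follows. This is an assumption-laden illustration: the function name `vectorize_by_field`, the toy dictionary, and the choice to skip words missing from the dictionary are all hypothetical, and an in-memory mapping stands in for the per-field vector files.

```python
from collections import defaultdict


def vectorize_by_field(labeled_texts, dictionary):
    """Group word vectors by field label.

    labeled_texts: iterable of (tokens, field_label) pairs.
    dictionary: mapping from word to vector.
    Returns a dict mapping each field label to a list of word vectors.
    """
    by_field = defaultdict(list)
    for tokens, field in labeled_texts:
        for token in tokens:
            vector = dictionary.get(token)
            # Assumption: words absent from the dictionary are skipped.
            if vector is not None:
                by_field[field].append(vector)
    return dict(by_field)


toy_dictionary = {"bank": [0.8, 0.1], "laptop": [0.1, 0.9]}
training_data = [
    (["bank", "loan"], "finance"),
    (["laptop"], "computer"),
    (["bank"], "finance"),
]
vectors_by_field = vectorize_by_field(training_data, toy_dictionary)
```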

Step 104), for the vector file of each field, carrying out model training with a Gaussian model to obtain the optimal values of the Gaussian model parameters, that is, the Gaussian model of each field is fitted to that field's data; the Gaussian models of all fields form the classification model.

The parameters of the Gaussian model comprise the Gaussian mean and variance, and the optimal values of the Gaussian model parameters are the parameter values that yield the highest accuracy. The training data of each field generates one corresponding Gaussian model, each comprising parameters such as the Gaussian mean and variance. Together, the Gaussian models obtained from all the training data constitute the classification model that the training stage produces.
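One way to picture step 104) is the sketch below, which fits one Gaussian per field. It rests on assumptions not stated in the text: a per-dimension (diagonal) variance, maximum-likelihood estimates of the mean and variance as a stand-in for the accuracy-driven optimum, and a hypothetical `var_floor` parameter that keeps variances positive. The names `fit_gaussian` and `train_classifier` are likewise illustrative.

```python
def fit_gaussian(vectors, var_floor=1e-6):
    """Estimate the per-dimension mean and variance of a list of vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    # Maximum-likelihood variance, clamped from below (assumption) so that
    # a dimension with identical values does not get a zero variance.
    var = [
        max(sum((v[d] - mean[d]) ** 2 for v in vectors) / n, var_floor)
        for d in range(dim)
    ]
    return mean, var


def train_classifier(vectors_by_field, var_floor=1e-6):
    """Fit one Gaussian per field; together they form the classifier."""
    return {
        field: fit_gaussian(vectors, var_floor)
        for field, vectors in vectors_by_field.items()
    }


example_mean, example_var = fit_gaussian([[1.0, 2.0], [3.0, 2.0]])
```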

As a preferred implementation, the method of the invention may further comprise, in the training stage:

Step 105), collecting and labeling data, the labeled data serving as test data; applying the test data to the trained model obtained in step 104) and verifying the validity of the trained model from the results it generates; if the trained model is unsatisfactory, carrying out parameter tuning.

The collection and labeling operations in this step do not differ in substance from the corresponding operations in step 101) and are not repeated here.

In this step, to verify the validity of the trained model, the trained model is used to predict the field labels of the test data, and the predicted labels are compared with the true field labels of the test data (which are obtained when the test data is labeled). Where the two agree, the trained model is shown to be effective. For example, if 100 items are tested and the predicted label matches the true label for 75 of them, the validity of the trained model is 75%.
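The validity measure in the example above is simply the fraction of test items whose predicted label matches the true label. A minimal sketch, with the hypothetical function name `accuracy` and made-up label lists that reproduce the 75-out-of-100 case:

```python
def accuracy(predicted_labels, true_labels):
    """Fraction of items whose predicted label equals the true label."""
    if len(predicted_labels) != len(true_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one test item is required")
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)


# 100 test items, 75 of which are predicted correctly.
true_labels = ["finance"] * 100
predicted_labels = ["finance"] * 75 + ["computer"] * 25
validity = accuracy(predicted_labels, true_labels)
```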

The parameter tuning in this step consists of traversing the parameters: as many values as possible are tried, the accuracy under each is observed, and the parameter values giving the highest accuracy are chosen.

The word vector dictionary used in the training stage may be an existing word vector dictionary, or it may be generated before step 101). To generate a word vector dictionary, a large number of web page text files (for example, hundreds of gigabytes or even terabytes of data) are crawled from the Internet with a web crawler; if these files contain Chinese text, the Chinese text must be segmented. Word vector training is then carried out on the text data in the files, for example with the existing word2vec tool, to obtain a dictionary recording the correspondence between words and vectors.
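The parameter traversal described for the tuning in step 105) amounts to an exhaustive search over candidate values. The sketch below is illustrative only: the text does not say which parameters are traversed, so the candidate values, the accuracy figures, and the names `tune` and `evaluate` are hypothetical; in practice `evaluate` would retrain the model with the candidate value and measure accuracy on the test data.

```python
def tune(candidates, evaluate):
    """Try every candidate value and keep the one with the highest accuracy.

    candidates: iterable of parameter values to try.
    evaluate: function mapping a parameter value to an accuracy.
    Returns (best_value, best_accuracy).
    """
    best_value, best_score = None, float("-inf")
    for value in candidates:
        score = evaluate(value)
        if score > best_score:
            best_value, best_score = value, score
    return best_value, best_score


# Made-up accuracies for three hypothetical parameter values.
toy_accuracies = {0.001: 0.70, 0.01: 0.75, 0.1: 0.72}
best_value, best_score = tune(list(toy_accuracies), toy_accuracies.get)
```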

As shown in Fig. 2, the method of the invention comprises the following steps in the classification stage:

Step 201), inputting the text data to be classified and preprocessing it, the preprocessing comprising: removing invalid data (such as punctuation and format characters) from the training data, and removing stop words (words with no substantive meaning, such as "this" and "that"); in particular, Chinese data also requires word segmentation.

Step 202), inputting the text data to be classified into the Gaussian model corresponding to each field in the trained model, obtaining the posterior probability of the text data being generated by each Gaussian model, and taking the field corresponding to the Gaussian model with the largest posterior probability as the classification result of the text data.
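Step 202) can be sketched as scoring the text under each field's Gaussian and taking the best-scoring field. The sketch makes several assumptions that the text does not state: a diagonal Gaussian per field, word vectors treated as independent so that their log-likelihoods add, and a uniform prior over fields unless `priors` is given, in which case the field with the highest score is also the field with the highest posterior probability. The names `log_likelihood` and `classify` and the toy models are hypothetical.

```python
import math


def log_likelihood(vectors, mean, var):
    """Sum of diagonal-Gaussian log densities over all word vectors."""
    total = 0.0
    for vector in vectors:
        for x, m, s2 in zip(vector, mean, var):
            total += -0.5 * (math.log(2.0 * math.pi * s2) + (x - m) ** 2 / s2)
    return total


def classify(vectors, models, priors=None):
    """Return the field whose Gaussian gives the highest posterior score.

    models: mapping field -> (mean, var).
    priors: optional mapping field -> prior probability (assumed uniform
    when omitted).
    """
    scores = {}
    for field, (mean, var) in models.items():
        log_prior = math.log(priors[field]) if priors else 0.0
        scores[field] = log_prior + log_likelihood(vectors, mean, var)
    return max(scores, key=scores.get)


toy_models = {
    "finance": ([0.8, 0.1], [0.01, 0.01]),
    "computer": ([0.1, 0.9], [0.01, 0.01]),
}
predicted_field = classify([[0.7, 0.2]], toy_models)
```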

Finally, it should be noted that the above embodiments are intended only to illustrate the technical scheme of the invention, not to limit it. Although the invention has been described in detail with reference to the embodiments, those of ordinary skill in the art should understand that modifications or equivalent replacements of the technical scheme that do not depart from its spirit and scope are all covered by the claims of the invention.

Claims

1. A short text classification model generation method based on word vectors, comprising:

Step 102), preprocessing the training data;

2. The short text classification model generation method based on word vectors according to claim 1, characterized by further comprising:

3. The short text classification model generation method based on word vectors according to claim 1 or 2, characterized by further comprising, before step 101):

4. The short text classification model generation method based on word vectors according to claim 1, 2 or 3, characterized in that, in step 102), the preprocessing comprises: removing invalid data from the training data and removing stop words.

5. The short text classification model generation method based on word vectors according to claim 4, characterized in that, in step 102), the preprocessing further comprises performing word segmentation on Chinese data.

6. The short text classification model generation method based on word vectors according to claim 1, 2 or 3, characterized in that the parameters of the Gaussian model comprise the Gaussian mean and variance, and the optimal values of the Gaussian model parameters are the parameter values that yield the highest accuracy.

7. A short text classification method based on word vectors, comprising:

Step 202), inputting the text data to be classified into the Gaussian model corresponding to each field in the trained model obtained by the short text classification model generation method based on word vectors of any one of claims 1-6, obtaining the posterior probability of the text data being generated by each Gaussian model, and taking the field corresponding to the Gaussian model with the largest posterior probability as the classification result of the text data.

8. The short text classification method based on word vectors according to claim 7, characterized in that the preprocessing comprises: removing invalid data from the training data and removing stop words.

9. The short text classification method based on word vectors according to claim 8, characterized in that the preprocessing further comprises performing word segmentation on Chinese data.