CN105005589A

CN105005589A - Text classification method and text classification device

Info

Publication number: CN105005589A
Application number: CN201510364152.4A
Authority: CN
Inventors: 邹缘孙
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd; Tencent Cloud Computing Beijing Co Ltd
Priority date: 2015-06-26
Filing date: 2015-06-26
Publication date: 2015-10-28
Anticipated expiration: 2035-06-26
Also published as: CN105005589B

Abstract

The invention discloses a text classification method and a text classification device, and belongs to the technical field of Internet. The method comprises the following steps of: obtaining the term vector, the term frequency, the weight and the inverse document frequency of each term included by a text to be classified; respectively calculating the first membership degree between each term and a first class according to the term vector of each term and the term vector of the first class, wherein the first class is any one class in a class set; calculating the second membership degree between the text and the first class according to the first membership degree between each term and the first class and the term frequency, the weight and the inverse document frequency of each term; and selecting the class whose second membership degree with the text meets the preset condition from the class set, and determining the selected class as the text class. The device comprises a first obtaining module, a first calculation module, a second calculation module and a classification module. The method and the device have the advantage that the text classification accuracy is improved.

Description

A method and apparatus for text classification

Technical field

The present invention relates to the field of Internet technology, and in particular to a method and apparatus for text classification.

Background

With the development of Internet technology, the amount of text on the Internet keeps growing. This large volume of text is convenient for users, but it also makes searching very inconvenient. Text classification was proposed to address this problem: a category is determined for each text according to predefined subject categories, and texts are organized by category, which makes it easier for users to search.

The prior art provides a text classification method, which may be as follows: a server obtains a large number of manually labeled text samples, obtains the features of these text samples, and trains a classifier on those features. After the classifier has been trained, the server can use it to classify a text to be classified. Specifically, the server obtains the features of the text to be classified and, according to those features, classifies the text with the trained classifier.

In the course of implementing the present invention, the inventor found that the prior art has at least the following problem:

The feature of a text to be classified is often a single keyword in that text, and classifying the text according to only one of its keywords is clearly inaccurate. For example, for a text describing the capital consumption of game development, the feature obtained by the server may be "game", and the category of the text is determined to be "game" according to this feature. However, the text is mainly about capital consumption, so "finance and economics" would be a more suitable category. The accuracy of classifying a text by its feature is therefore low.

Summary of the invention

To solve the problem in the prior art, the present invention provides a method and apparatus for text classification. The technical solution is as follows:

A method for text classification, the method comprising:

obtaining the term vector, word frequency, weight and inverse document frequency of each word included in a text to be classified;

calculating, according to the term vector of each word and the term vector of a first category, a first degree of membership between each word and the first category, the first category being any category in a category set;

calculating, according to the first degree of membership between each word and the first category and the word frequency, weight and inverse document frequency of each word, a second degree of membership between the text and the first category;

selecting, from the category set, a category whose second degree of membership with the text meets a preset condition, and determining the selected category as the category of the text.

An apparatus for text classification, the apparatus comprising:

a first obtaining module, configured to obtain the term vector, word frequency, weight and inverse document frequency of each word included in a text to be classified;

a first calculation module, configured to calculate, according to the term vector of each word and the term vector of a first category, a first degree of membership between each word and the first category, the first category being any category in a category set;

a second calculation module, configured to calculate, according to the first degree of membership between each word and the first category and the word frequency, weight and inverse document frequency of each word, a second degree of membership between the text and the first category;

a classification module, configured to select, from the category set, a category whose second degree of membership with the text meets a preset condition, and to determine the selected category as the category of the text.

In the embodiments of the present invention, the second degree of membership between the text and a first category is calculated according to the term vector, word frequency, weight and inverse document frequency of each word included in the text to be classified and the term vector of the first category, the first category being any category in the category set; a category is then selected from the category set according to its second degree of membership with the text. Because the present invention takes every word included in the text into account when classifying the text, the accuracy of classification is improved.

Brief description of the drawings

Fig. 1 is a flowchart of a text classification method provided by Embodiment 1 of the present invention;

Fig. 2-1 is a flowchart of a text classification method provided by Embodiment 2 of the present invention;

Fig. 2-2 is a schematic diagram of generating the word set of each category, provided by Embodiment 2 of the present invention;

Fig. 3 is a schematic structural diagram of a text classification apparatus provided by Embodiment 3 of the present invention;

Fig. 4 is a schematic structural diagram of a server provided by Embodiment 4 of the present invention.

Detailed description of the embodiments

To make the objectives, technical solutions and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.

Embodiment 1

An embodiment of the present invention provides a text classification method. Referring to Fig. 1, the method includes:

Step 101: obtain the term vector, word frequency, weight and inverse document frequency of each word included in the text to be classified;

Step 102: according to the term vector of each word and the term vector of a first category, calculate the first degree of membership between each word and the first category, the first category being any category in the category set;

Step 103: according to the first degree of membership between each word and the first category and the word frequency, weight and inverse document frequency of each word, calculate the second degree of membership between the text and the first category;

Step 104: select, from the category set, a category whose second degree of membership with the text meets a preset condition, and determine the selected category as the category of the text.

Embodiment 2

An embodiment of the present invention provides a text classification method. When a server classifies a text to be classified, it can use the method provided by this embodiment to improve the accuracy of classification. The method is executed by the server. Referring to Fig. 2-1, the method includes:

Step 201: obtain multiple text samples;

The text samples are used to train the word set corresponding to each category in the category set. Each text sample corresponds to a category, and in this embodiment the multiple text samples may belong to any category; to improve the accuracy of classification, they may include text samples corresponding to every category in the category set. For example, the category set includes finance and economics, entertainment, sports, fashion, automobile, real estate, science and technology, education, and so on. When text samples are selected, the multiple text samples may include text samples whose category is finance and economics, entertainment, sports, fashion, automobile, real estate, science and technology, and education.

In this embodiment, a user can select multiple text samples and then input them to the server; the server receives the multiple text samples input by the user.

Step 202: perform word segmentation on each of the multiple text samples, and form a training set from the words obtained;

Using an existing word segmentation tool, segment each of the multiple text samples to obtain the words that each text sample includes; the words included in all text samples form the training set.

Segmenting a text sample with a word segmentation tool is prior art and is not described in detail here.

After the training set has been obtained, step 203 is performed, in which an existing clustering method is used to cluster the words in the training set.

Step 203: cluster the words in the training set to obtain multiple word sets and the category of each of the word sets;

This step can be implemented through the following steps (1) to (3):

(1): obtain the term vector of each word in the training set;

The term vector of a word is a vector representation that describes the characteristics of the word. In this embodiment, it refers specifically to a word vector representation built with word embedding technology.

Any method of obtaining term vectors can be used in this embodiment to obtain the term vector of each word in the training set, for example the word2vec word embedding method of a neural network language model. Obtaining term vectors with word2vec is prior art and is not described in detail here.

The term vector of each word in the training set is an n-dimensional vector and can be expressed as W_i = (W_{i,1}, W_{i,2}, ..., W_{i,n}), where W_i is the term vector of the i-th word and W_{i,n} is its value in the n-th dimension.

Modal particles and similar words do not play a key role when a text is classified. Therefore, to reduce the amount of computation and improve the accuracy of classification, such words can be removed in this step so that term vectors are obtained only for the remaining words in the training set. In that case this step can be:

Obtain the words of a preset type from the training set, remove them from the training set to obtain the remaining words, and obtain the term vectors of the remaining words.

The words of the preset type can be modal particles, auxiliary words and the like. The process of obtaining the term vectors of the remaining words is the same as the process of obtaining the term vector of each word in the training set and is not repeated here.

Further, after the term vector of each word in the training set has been obtained, each word and its term vector are stored in a correspondence between words and term vectors. When a text to be classified is later processed, the term vector of each word it includes can be read directly from this correspondence, which saves the time needed to obtain term vectors and improves the efficiency of text classification.
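The removal of preset-type words and the stored correspondence between words and term vectors can be sketched as follows. This is only an illustrative sketch: the function names `remaining_words` and `build_vector_table` and the `embed` callable are hypothetical and do not come from the source; `embed` stands in for whatever term vector method is used (for example a trained word2vec model).

```python
from typing import Callable, Dict, Iterable, List, Sequence, Set

Vector = Sequence[float]


def remaining_words(training_set: Iterable[str], preset_type_words: Set[str]) -> List[str]:
    """Drop words of the preset type (e.g. modal particles) from the training set."""
    return [word for word in training_set if word not in preset_type_words]


def build_vector_table(words: Iterable[str], embed: Callable[[str], Vector]) -> Dict[str, Vector]:
    """Store each word with its term vector so later lookups avoid recomputation.

    `embed` is a hypothetical callable supplied by the caller; each distinct
    word is embedded only once.
    """
    table: Dict[str, Vector] = {}
    for word in words:
        if word not in table:
            table[word] = embed(word)
    return table
```

A dictionary is used here because the text only requires a lookup from word to term vector; a real deployment might keep this correspondence in a database or key-value store instead.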

(2): according to the term vector of each word, calculate the distance between any two of the words;

For any two of the words, the distance between them is calculated from their term vectors according to the following formula (1).

dist(W_i, W_j) = \frac{\sum_{k=1}^{n} W_{i,k} \cdot W_{j,k}}{|W_i| \cdot |W_j|} \qquad (1)

Here W_i is the term vector of the i-th word and |W_i| is the modulus of that vector; W_j is the term vector of the j-th word and |W_j| is the modulus of that vector; dist(W_i, W_j) is the distance between the i-th word and the j-th word.
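Formula (1) can be sketched in a few lines of code. As written, the formula divides the dot product of two term vectors by the product of their moduli, which has the form of cosine similarity. The function name `dist` mirrors the formula; the handling of zero vectors is an assumption, since the source does not address it.

```python
import math
from typing import Sequence


def dist(w_i: Sequence[float], w_j: Sequence[float]) -> float:
    """Formula (1): dot product of two term vectors over the product of their moduli."""
    if len(w_i) != len(w_j):
        raise ValueError("term vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(w_i, w_j))
    modulus_i = math.sqrt(sum(a * a for a in w_i))
    modulus_j = math.sqrt(sum(b * b for b in w_j))
    if modulus_i == 0.0 or modulus_j == 0.0:
        # Assumption: return 0.0 for a zero vector; the source does not cover this case.
        return 0.0
    return dot / (modulus_i * modulus_j)
```

For example, `dist([3.0, 4.0], [4.0, 3.0])` evaluates 24 / (5 * 5), that is, 0.96.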

If only the term vectors of the remaining words in the training set were obtained in step (1), then this step can be:

According to the term vector of each remaining word in the training set, calculate the distance between any two of the remaining words.

(3): form a word set from multiple words whose distance is less than a preset distance, and obtain the category of the word set as labeled by the user.

The distance between two words represents the similarity between them. If the distance between two words is less than the preset distance, the two words are determined to be close words, are placed in the same word set, and are determined to belong to the same category. In this way the words in the training set can be grouped into multiple word sets. The user determines the category of each word set according to the words it includes, labels each word set to obtain its category, and then inputs the category of each word set to the server; the server receives the category of each word set input by the user.

The preset distance can be set and changed as required and is not specifically limited in this embodiment; for example, the preset distance can be 0.2 or 0.5.
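One way to form word sets from a distance threshold is sketched below. It is an assumption-laden illustration rather than the clustering method of the source, which leaves the method open: the function `group_words` is hypothetical, it links words transitively (single-link grouping with a union-find structure), and the distance function is passed in as a parameter. The `euclidean` helper is only a stand-in used for the example and is not formula (1).

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def group_words(
    vectors: Dict[str, Vector],
    distance: Callable[[Vector, Vector], float],
    preset_distance: float,
) -> List[List[str]]:
    """Place two words in the same word set when their distance is below preset_distance."""
    words = list(vectors)
    parent = {word: word for word in words}

    def find(word: str) -> str:
        # Follow parent links to the representative, shortening the path on the way.
        while parent[word] != word:
            parent[word] = parent[parent[word]]
            word = parent[word]
        return word

    for index, first in enumerate(words):
        for second in words[index + 1:]:
            if distance(vectors[first], vectors[second]) < preset_distance:
                parent[find(first)] = find(second)

    groups: Dict[str, List[str]] = {}
    for word in words:
        groups.setdefault(find(word), []).append(word)
    return [sorted(group) for group in groups.values()]


def euclidean(u: Vector, v: Vector) -> float:
    """Illustrative distance only; any pairwise distance can be supplied instead."""
    return math.dist(u, v)
```

The category labels are not produced by this code: as the text describes, a user inspects each resulting word set and labels it.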

It should be noted that any clustering method can be used in this embodiment to cluster the words in the training set into multiple word sets. For example, hierarchical clustering yields both the word sets and the relationships among them. As shown in Fig. 2-2, each circle represents a word and the different levels represent the hierarchy of the clustering; in the clustering result, the word set corresponding to each level is labeled manually by browsing the words that the level includes.

In this embodiment, clustering is applied to the words included in the multiple text samples, so that labeling multiple text samples is replaced by labeling multiple word sets to obtain the word set of each category. The present invention therefore needs only a small amount of labeling, which saves human resources, shortens labeling time and improves classification efficiency. Moreover, obtaining the word set of each category requires only a small number of text samples, and the text samples themselves need not be labeled, which saves further time and human resources. This matters especially in the Internet industry, where text categories are usually numerous and the volume of text is enormous: the method provided by this embodiment can be used to classify text quickly, shortening classification time and improving classification efficiency.

In this embodiment, the classification model can be migrated by configuring the correspondence between categories and word sets. In different business scenarios the text may be a long news article, or a short text such as a video title or a user's microblog post, and different businesses may care about different categories. Based on the idea described here, it is only necessary to add a category to the category set and build the word set of the added category in order to migrate the classification model. This solves the problem of adapting the model to new scenarios and lets the classification model respond quickly to the classification needs of different business scenarios.

Further, the word set corresponding to each category can also be obtained without clustering, by direct user labeling. In that case steps 201-203 can be replaced with the following: the user obtains multiple words to form a training set, classifies the words in the training set into multiple word sets, labels each word set to obtain its category, and then inputs each word set and its category to the server; the server receives each word set and the category of each word set input by the user.

Further, once the category of each word set has been obtained, the term vector of each category is calculated from the word set of that category, and each category and its term vector are stored in a correspondence between categories and term vectors. When the term vector of a category is needed later, it can be read directly from this correspondence according to the category, without being recalculated.

For each category, the term vector of the category can be calculated as follows:

Obtain the term vector of each word in the word set of the category; calculate the average of the obtained term vectors and use this average term vector as the term vector of the category.
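The averaging step can be sketched as follows. The function names `category_vector` and `build_category_vectors` are hypothetical, and the sketch assumes that every word in a word set already has an entry in the word-to-term-vector correspondence.

```python
from typing import Dict, Iterable, List, Sequence

Vector = Sequence[float]


def category_vector(word_set: Iterable[str], word_vectors: Dict[str, Vector]) -> List[float]:
    """Average the term vectors of the words in one category's word set."""
    vectors = [word_vectors[word] for word in word_set]
    if not vectors:
        raise ValueError("a category needs at least one word")
    dimension = len(vectors[0])
    return [sum(vector[k] for vector in vectors) / len(vectors) for k in range(dimension)]


def build_category_vectors(
    word_sets: Dict[str, Iterable[str]], word_vectors: Dict[str, Vector]
) -> Dict[str, List[float]]:
    """Compute each category's term vector once and keep it for later lookups."""
    return {category: category_vector(words, word_vectors) for category, words in word_sets.items()}
```

The dictionary returned by `build_category_vectors` plays the role of the stored correspondence between categories and term vectors.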

It should be noted that steps 201-203 are the process of training the word set of each category and therefore need to be performed only once. When a text to be classified is later classified according to the word sets of the categories, steps 201-203 need not be performed again; only steps 204 to 208 are performed.

Step 204: according to the word set of a first category, obtain the term vector of the first category, the first category being any category in the category set;

Specifically, obtain the term vector of each word in the word set of the first category, calculate the average of the obtained term vectors, and use this average term vector as the term vector of the first category. Alternatively, obtain the term vector of the first category from the correspondence between categories and term vectors according to the first category.

The term vector of each word in the word set of the first category is obtained with the word2vec word embedding method of a neural network language model, or is looked up in the correspondence between words and term vectors. The term vector of each word in the word set of every category in the category set is obtained in the same way.

Step 205: obtain the term vector, word frequency, weight and inverse document frequency of each word included in the text to be classified;

Segment the text to be classified with an existing word segmentation tool to obtain each word the text includes. Obtain the term vector of each word with the word2vec word embedding method of a neural network language model, or look up the term vector of each word in the correspondence between words and term vectors. For each word the text includes, count the number of times the word occurs in the text as the word frequency of the word; obtain the position of the word in the text and, according to that position, obtain the weight of the word; and obtain the inverse document frequency of the word in the training set.

Inverse document frequency, also called reverse document frequency, is the reciprocal of the document frequency. The server may obtain the inverse document frequency of the word in the training set as follows:

Obtain the number of times the word occurs in the training set, and obtain the number of words the training set includes; calculate the ratio of the word frequency of the word to that number of times to obtain a first value, and calculate the inverse document frequency of the word in the training set from the first value and that number of words.

The server sets a correspondence between the position of a word in a text and a weight, so the step of obtaining the weight of the word according to its position in the text can be:

According to the position of the word in the text, obtain the weight of the word from the correspondence between positions and weights.

When the server sets the correspondence between word positions and weights, it can assign higher weights to words in the title, the summary or other important positions of the text, and lower weights to words in the body of the text.
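The word frequency count and the position-based weight lookup can be sketched as follows. The weight values in `POSITION_WEIGHTS`, the function name `term_features`, and the choice to keep the highest weight when a word occurs in several positions are all assumptions made for illustration; the source only states that title and summary positions receive higher weights than the body.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical correspondence between position and weight.
POSITION_WEIGHTS: Dict[str, float] = {"title": 3.0, "summary": 2.0, "body": 1.0}


def term_features(segments: Dict[str, List[str]]) -> Dict[str, Tuple[int, float]]:
    """Return (word frequency, weight) for every word of an already segmented text.

    `segments` maps a position name to the words found at that position.
    """
    frequency: Counter = Counter()
    weight: Dict[str, float] = {}
    for position, words in segments.items():
        for word in words:
            frequency[word] += 1
            # Assumption: a word seen in several positions keeps its highest weight.
            weight[word] = max(weight.get(word, 0.0), POSITION_WEIGHTS[position])
    return {word: (frequency[word], weight[word]) for word in frequency}
```

The inverse document frequency is left out of this sketch because the text does not fully specify how the first value and the word count are combined.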

Step 206: according to the term vector of each word and the term vector of the first category, calculate the first degree of membership between each word and the first category;

In this embodiment, the first degree of membership between each word and the first category is measured by the distance between the term vector of the word and the term vector of the first category, so this step can be:

Calculate the distance between the term vector of each word and the term vector of the first category, and use that distance as the first degree of membership between the word and the first category.

It should be noted that the first degree of membership between each word and each category in the category set is calculated by the above method. The distance between the term vector of each word and the term vector of the first category is calculated in the same way as the distance between two vectors in step 203, which is not repeated here.

Step 207: according to the first degree of membership between each word and the first category and the word frequency, weight and inverse document frequency of each word, calculate the second degree of membership between the text and the first category;

This step can be implemented through the following steps (1) and (2):

(1): for each word, calculate the product of its word frequency, weight, inverse document frequency and first degree of membership with the first category, to obtain the third degree of membership between the word and the first category;

For each word, the third degree of membership between the word and the first category is calculated from its word frequency, weight, inverse document frequency and first degree of membership with the first category according to the following formula (2):

f_{wi} = p_{wi} * tf_{wi} * idf_{wi} * b_{wi,c} \qquad (2)

Here f_{wi} is the third degree of membership between the i-th word and the first category, p_{wi} is the weight of the i-th word, tf_{wi} is the word frequency of the i-th word, idf_{wi} is the inverse document frequency of the i-th word, and b_{wi,c} is the first degree of membership between the i-th word and the first category.

(2): add up the third degrees of membership between each word and the first category to obtain the second degree of membership between the text and the first category.

According to the third degree of membership between each word and the first category, the second degree of membership between the text and the first category is calculated according to the following formula (3):

F = f_{1} + f_{2} + \cdots + f_{n} = \sum_{wi \in c} p_{wi} * tf_{wi} * idf_{wi} * b_{wi,c} \qquad (3)

It should be noted that the second degree of membership between the text and each category in the category set is calculated by the above method.
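Formulas (2) and (3) can be sketched as follows. The `WordFeatures` container and the function names are hypothetical; the arithmetic follows the two formulas directly, taking the first degrees of membership as given input.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class WordFeatures:
    weight: float  # p_wi, weight from the word's position
    tf: int        # tf_wi, word frequency in the text
    idf: float     # idf_wi, inverse document frequency


def third_membership(features: WordFeatures, first_membership: float) -> float:
    """Formula (2): f_wi = p_wi * tf_wi * idf_wi * b_wi,c."""
    return features.weight * features.tf * features.idf * first_membership


def second_membership(
    text_features: Dict[str, WordFeatures], first_memberships: Dict[str, float]
) -> float:
    """Formula (3): sum the third degrees of membership over the words of the text."""
    return sum(
        third_membership(features, first_memberships[word])
        for word, features in text_features.items()
    )
```

With two words whose products are 3.0 * 2 * 0.5 * 0.2 = 0.6 and 1.0 * 3 * 1.0 * 0.9 = 2.7, the second degree of membership for that category is 3.3. Repeating the call with the first degrees of membership of each category gives one score per category.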

Step 208: select, from the category set, a category whose second degree of membership with the text meets a preset condition, and determine the selected category as the category of the text.

The second degree of membership represents the similarity between the text and a category. The preset condition can be that the second degree of membership is the largest, or that the second degree of membership is greater than a first preset value. When the preset condition is the largest second degree of membership, this step can be: select, from the category set, the category with the largest second degree of membership with the text, and determine the selected category as the category of the text.

When the preset condition is a second degree of membership greater than the first preset value, this step can be: obtain, from the category set, the categories whose second degree of membership with the text is greater than the first preset value, randomly select one category from the obtained categories, and determine the selected category as the category of the text.
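Both forms of the preset condition can be sketched in one small function. The name `select_category`, the `rng` parameter and the choice to return `None` when no category exceeds the preset value are assumptions; the source does not say what happens in that case.

```python
import random
from typing import Dict, Optional


def select_category(
    scores: Dict[str, float],
    preset_value: Optional[float] = None,
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Pick a category from second degrees of membership.

    With no preset value, the category with the largest score is chosen.
    With a preset value, one category is drawn at random from those above it.
    """
    if preset_value is None:
        return max(scores, key=scores.get)
    candidates = [category for category, score in scores.items() if score > preset_value]
    if not candidates:
        # Assumption: no category qualifies, so nothing is selected.
        return None
    return (rng or random).choice(candidates)
```

Passing a seeded `random.Random` instance as `rng` makes the random selection reproducible, which can help when testing.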

Further, after the second degree of membership between the text and each category in the category set has been obtained, the second degrees of membership can also be normalized according to the following formula (4), to obtain the normalized second degree of membership between each category and the text:

F = \frac{\sum_{wi \in c,\, c \in m(T)} p_{wi} * tf_{wi} * idf_{wi} * b_{wi,c}}{\sum_{c \in C} \sum_{wi \in c,\, c \in m(T)} p_{wi} * tf_{wi} * idf_{wi} * b_{wi,c}} \qquad (4)

In that case this step can be: select, from the category set, a category whose normalized second degree of membership with the text meets a preset condition, and determine the selected category as the category of the text.

Here the preset condition can be that the normalized second degree of membership is the largest, or that the normalized second degree of membership is greater than a second preset value.

When the preset condition is the largest normalized second degree of membership, this step can be: select, from the category set, the category with the largest normalized second degree of membership with the text, and determine the selected category as the category of the text.

When the preset condition is a normalized second degree of membership greater than the second preset value, this step can be: obtain, from the category set, the categories whose normalized second degree of membership with the text is greater than the second preset value, randomly select one category from the obtained categories, and determine the selected category as the category of the text.

The first preset value and the second preset value can be set and changed as required and are not specifically limited in this embodiment.
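The normalization of formula (4) divides each category's second degree of membership by the sum over all categories. A minimal sketch follows; the function name `normalize_memberships` and the handling of an all-zero total are assumptions.

```python
from typing import Dict


def normalize_memberships(scores: Dict[str, float]) -> Dict[str, float]:
    """Divide each second degree of membership by the total over all categories."""
    total = sum(scores.values())
    if total == 0:
        # Assumption: with no evidence for any category, every normalized score is 0.
        return {category: 0.0 for category in scores}
    return {category: score / total for category, score in scores.items()}
```

After normalization the scores of one text sum to 1, so a single second preset value (for example 0.5) can be applied to texts of very different lengths.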

Embodiment 3

An embodiment of the present invention provides a text classification apparatus. Referring to Fig. 3, the apparatus includes:

First obtaining module 301, configured to obtain the term vector, word frequency, weight and inverse document frequency of each word included in the text to be classified;

First calculation module 302, configured to calculate, according to the term vector of each word and the term vector of a first category, the first degree of membership between each word and the first category, the first category being any category in the category set;

Second calculation module 303, configured to calculate, according to the first degree of membership between each word and the first category and the word frequency, weight and inverse document frequency of each word, the second degree of membership between the text and the first category;

Classification module 304, configured to select, from the category set, a category whose second degree of membership with the text meets a preset condition, and to determine the selected category as the category of the text.

Further, the first calculation module 302 includes:

a first obtaining unit, configured to obtain the term vector of each word in the word set corresponding to the first category;

a first calculation unit, configured to calculate the average of the obtained term vectors and use this average term vector as the term vector of the first category;

a second calculation unit, configured to calculate the distance between the term vector of each word and the term vector of the first category, and to use that distance as the first degree of membership between the word and the first category.

Further, the apparatus also includes:

a second obtaining module, configured to obtain multiple text samples;

a word segmentation module, configured to segment each of the multiple text samples and form a training set from the words obtained;

a clustering module, configured to cluster the words in the training set to obtain multiple word sets and the category of each of the word sets.

Further, the clustering module includes:

a second obtaining unit, configured to obtain the term vector of each word in the training set;

a third calculation unit, configured to calculate, according to the term vector of each word, the distance between any two of the words;

a clustering unit, configured to form a word set from multiple words whose distance is less than a preset distance;

a third obtaining unit, configured to obtain the category of the word set as labeled by the user.

Further, the second computing module 303 comprises:

A fourth computing unit, configured to calculate, for each word, the product of the word's term frequency, weight, inverse document frequency, and first membership degree with the first category, to obtain a third membership degree between each word and the first category;

An accumulation unit, configured to sum the third membership degrees between each word and the first category, to obtain the second membership degree between the text and the first category.
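The product-and-sum performed by these two units can be sketched as follows. The record layout (keys `word`, `tf`, `weight`, `idf`) and the function name are hypothetical. How the weight of each word is obtained is not covered in this passage, so it is treated here as a given input.

```python
def second_membership_degree(words, first_membership):
    """Second membership degree between a text and one category.

    words: list of dicts with keys "word", "tf", "weight", "idf".
    first_membership: dict mapping each word to its first membership
    degree with the category.
    """
    total = 0.0
    for record in words:
        # Third membership degree: product of the four per-word factors.
        third = (
            record["tf"]
            * record["weight"]
            * record["idf"]
            * first_membership[record["word"]]
        )
        total += third
    return total
```

Running this once per category in the category set gives the values from which the category of the text is then selected.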

Embodiment 4

Fig. 4 is a schematic structural diagram of the server provided by an embodiment of the present invention. The server 1900 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 1922 (for example, one or more processors), a memory 1932, and one or more storage media 1930 (for example, one or more mass storage devices) storing application programs 1942 or data 1944. The memory 1932 and the storage medium 1930 may provide transient or persistent storage. A program stored in the storage medium 1930 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations for the server. Further, the central processing unit 1922 may be configured to communicate with the storage medium 1930 and to execute, on the server 1900, the series of instruction operations in the storage medium 1930.

The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, or FreeBSD™.

The server 1900 may include a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs containing instructions for performing the following operations:

Further, calculating the first membership degree between each word and the first category according to the term vector of each word and the term vector of the first category comprises:

Obtaining the term vector of each word in the word set corresponding to the first category;

Calculating the average of the obtained term vectors, and using this average term vector as the term vector of the first category;

Calculating the distance between the term vector of each word and the term vector of the first category, and using that distance as the first membership degree between each word and the first category.

Further, the method also comprises:

Obtaining multiple text samples;

Segmenting each of the multiple text samples into words, and forming a training set from the resulting words;

Clustering the words in the training set to obtain multiple word sets and the category of each of the word sets.

Further, clustering the words in the training set to obtain multiple word sets and the category of each of the word sets comprises:

Obtaining the term vector of each word in the training set;

Calculating, according to the term vector of each word, the distance between any two of the words;

Forming a word set from multiple words whose mutual distance is less than a preset distance, and obtaining the category of the word set as annotated by a user.

Further, calculating the second membership degree between the text and the first category according to the first membership degree between each word and the first category and the term frequency, weight, and inverse document frequency of each word comprises:

Calculating, for each word, the product of the word's term frequency, weight, inverse document frequency, and first membership degree with the first category, to obtain a third membership degree between each word and the first category;

Summing the third membership degrees between each word and the first category, to obtain the second membership degree between the text and the first category.
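Written as formulas, the two steps above take the following form. The notation is introduced here only for illustration and does not appear in the source: d is the text, c is the first category, w ranges over the words of d, D_1 is the first membership degree, D_3 the third, and D_2 the second.

```latex
D_3(w, c) = \mathrm{tf}(w) \cdot \mathrm{weight}(w) \cdot \mathrm{idf}(w) \cdot D_1(w, c)

D_2(d, c) = \sum_{w \in d} D_3(w, c)
```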

It should be noted that, when the text classification device provided by the above embodiment classifies text, the division into the functional modules described above is used only as an example. In practical applications, the functions may be assigned to different functional modules as required; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the text classification device provided by the above embodiment and the text classification method embodiments belong to the same concept; for the specific implementation process, refer to the method embodiments, which is not repeated here.

A person of ordinary skill in the art will understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, which may be a read-only memory, a magnetic disk, an optical disc, or the like.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A text classification method, characterized in that the method comprises:

2. The method of claim 1, characterized in that calculating the first membership degree between each word and the first category according to the term vector of each word and the term vector of the first category comprises:

3. The method of claim 2, characterized in that the method further comprises:

Obtaining multiple text samples;

4. The method of claim 3, characterized in that clustering the words in the training set to obtain multiple word sets and the category of each of the word sets comprises:

Obtaining the term vector of each word in the training set;

5. The method of claim 1, characterized in that calculating the second membership degree between the text and the first category according to the first membership degree between each word and the first category and the term frequency, weight, and inverse document frequency of each word comprises:

6. A text classification device, characterized in that the device comprises:

7. The device of claim 6, characterized in that the first computing module comprises:

A first computing unit, configured to calculate the average of the obtained term vectors and to use this average term vector as the term vector of the first category;

A second computing unit, configured to calculate the distance between the term vector of each word and the term vector of the first category, and to use that distance as the first membership degree between each word and the first category.

8. The device of claim 7, characterized in that the device further comprises:

A second acquisition module, configured to obtain multiple text samples;

A word segmentation module, configured to segment each of the multiple text samples into words and to form a training set from the resulting words;

A clustering module, configured to cluster the words in the training set to obtain multiple word sets and the category of each of the word sets.

9. The device of claim 8, characterized in that the clustering module comprises:

A second acquiring unit, configured to obtain the term vector of each word in the training set;

A third computing unit, configured to calculate, according to the term vector of each word, the distance between any two of the words;

A third acquiring unit, configured to obtain the category of the word set as annotated by a user.

10. The device of claim 6, characterized in that the second computing module comprises:

A fourth computing unit, configured to calculate, for each word, the product of the word's term frequency, weight, inverse document frequency, and first membership degree with the first category, to obtain a third membership degree between each word and the first category;

An accumulation unit, configured to sum the third membership degrees between each word and the first category, to obtain the second membership degree between the text and the first category.