CN105005589A - Text classification method and text classification device - Google Patents


Publication number
CN105005589A
Authority
CN
China
Prior art date
Legal status
Granted
Application number
CN201510364152.4A
Other languages
Chinese (zh)
Other versions
CN105005589B (en)
Inventor
邹缘孙
Current Assignee
Tencent Technology Shenzhen Co Ltd
Tencent Cloud Computing Beijing Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201510364152.4A
Publication of CN105005589A
Application granted
Publication of CN105005589B
Active legal status
Anticipated expiration


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification


Abstract

The invention discloses a text classification method and a text classification device, and belongs to the technical field of the Internet. The method comprises the following steps: obtaining the term vector, term frequency, weight, and inverse document frequency of each term included in a text to be classified; calculating the first membership degree between each term and a first class according to the term vector of each term and the term vector of the first class, wherein the first class is any class in a class set; calculating the second membership degree between the text and the first class according to the first membership degree between each term and the first class and the term frequency, weight, and inverse document frequency of each term; and selecting, from the class set, a class whose second membership degree with the text meets a preset condition, and determining the selected class as the class of the text. The device comprises a first obtaining module, a first calculation module, a second calculation module, and a classification module. The method and the device improve the accuracy of text classification.

Description

Text classification method and apparatus
Technical field
The present invention relates to the technical field of the Internet, and in particular to a text classification method and apparatus.
Background technology
With the development of Internet technology, there is more and more text on the Internet. While this large amount of text brings convenience to users, it also makes searching very inconvenient. To address this problem, text classification has been proposed: according to predefined subject categories, a category is determined for each text, and texts are organized by category, making it easier for users to search.
The prior art provides a text classification method, which may be as follows: a server obtains a large number of manually labeled text samples, extracts the features of these text samples, and trains a classifier on these features. Once the classifier has been trained, the server can use it to classify texts that need classification. The detailed process is: the server extracts the features of the text to be classified and, according to these features, classifies the text with the trained classifier.
In the process of realizing the present invention, the inventor found that the prior art has at least the following problem:
The feature of the text to be classified is often a single key word in the text, and classifying the text according to only one key word in the text is obviously inaccurate. For example, for a text describing the capital consumption of game development, the feature of the text obtained by the server may be "game", and according to this feature the category of the text would be determined to be "game"; however, the emphasis of the text is mainly the capital consumption problem, so "finance" would be a more suitable category. Therefore, classifying a text by a single feature has low accuracy.
Summary of the invention
In order to solve the problems of the prior art, the present invention provides a text classification method and apparatus. The technical scheme is as follows:
A text classification method, the method comprising:
obtaining the term vector, term frequency, weight, and inverse document frequency of each word included in a text to be classified;
calculating, according to the term vector of each word and the term vector of a first category, a first membership degree between each word and the first category, the first category being any category in a category set;
calculating a second membership degree between the text and the first category according to the first membership degree between each word and the first category and the term frequency, weight, and inverse document frequency of each word;
selecting, from the category set, a category whose second membership degree with the text meets a preset condition, and determining the selected category as the category of the text.
A text classification apparatus, the apparatus comprising:
a first obtaining module, configured to obtain the term vector, term frequency, weight, and inverse document frequency of each word included in a text to be classified;
a first computing module, configured to calculate, according to the term vector of each word and the term vector of a first category, the first membership degree between each word and the first category, the first category being any category in a category set;
a second computing module, configured to calculate the second membership degree between the text and the first category according to the first membership degree between each word and the first category and the term frequency, weight, and inverse document frequency of each word;
a classification module, configured to select, from the category set, a category whose second membership degree with the text meets a preset condition, and determine the selected category as the category of the text.
In the embodiments of the present invention, the second membership degree between a text to be classified and a first category is calculated according to the term vector, term frequency, weight, and inverse document frequency of each word included in the text and the term vector of the first category, the first category being any category in a category set; a category is then selected from the category set according to its second membership degree with the text. Because every word included in the text is considered when classifying the text, the accuracy of classification is improved.
Accompanying drawing explanation
Fig. 1 is a flowchart of the text classification method provided by Embodiment 1 of the present invention;
Fig. 2-1 is a flowchart of the text classification method provided by Embodiment 2 of the present invention;
Fig. 2-2 is a schematic diagram of generating the word set of each category, provided by Embodiment 2 of the present invention;
Fig. 3 is a schematic structural diagram of the text classification apparatus provided by Embodiment 3 of the present invention;
Fig. 4 is a schematic structural diagram of a server provided by Embodiment 4 of the present invention.
Embodiment
To make the objects, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
Embodiment 1
An embodiment of the present invention provides a text classification method; see Fig. 1. The method comprises:
Step 101: obtain the term vector, term frequency, weight, and inverse document frequency of each word included in the text to be classified;
Step 102: according to the term vector of each word and the term vector of a first category, calculate the first membership degree between each word and the first category, the first category being any category in a category set;
Step 103: according to the first membership degree between each word and the first category and the term frequency, weight, and inverse document frequency of each word, calculate the second membership degree between the text and the first category;
Step 104: select from the category set a category whose second membership degree with the text meets a preset condition, and determine the selected category as the category of the text.
In the embodiments of the present invention, the second membership degree between a text to be classified and a first category is calculated according to the term vector, term frequency, weight, and inverse document frequency of each word included in the text and the term vector of the first category, the first category being any category in a category set; a category is then selected from the category set according to its second membership degree with the text. Because every word included in the text is considered when classifying the text, the accuracy of classification is improved.
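Steps 101-104 above amount to scoring every category in the category set and keeping the best one. The following is a minimal sketch, assuming the per-word features and the category term vectors are already available; cosine similarity stands in for the membership computation detailed in Embodiment 2, and all names and numbers are illustrative:

```python
import math

def cosine(u, v):
    # Formula (1) of Embodiment 2: cosine similarity between two term vectors
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def classify(word_features, category_vectors):
    """word_features: {word: (vector, tf, weight, idf)} for the text to be classified.
    category_vectors: {category: term vector}. Returns the category with the
    largest second membership degree (formulas (2) and (3) of Embodiment 2)."""
    best_category, best_score = None, float("-inf")
    for category, cat_vec in category_vectors.items():
        score = 0.0
        for vec, tf, weight, idf in word_features.values():
            b = cosine(vec, cat_vec)           # first membership degree
            score += weight * tf * idf * b     # formula (2), summed per formula (3)
        if score > best_score:
            best_category, best_score = category, score
    return best_category

# Toy example with 2-dimensional term vectors (made-up values)
features = {"game": ((1.0, 0.0), 2, 1.0, 1.0), "capital": ((0.0, 1.0), 5, 2.0, 1.5)}
cats = {"game": (1.0, 0.1), "finance": (0.1, 1.0)}
print(classify(features, cats))  # → finance
```

Note that although "game" appears in the text, the heavier-weighted "capital" words dominate the sum, so the text lands in "finance" — exactly the case the Background section argues single-feature classification gets wrong.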
Embodiment 2
An embodiment of the present invention provides a text classification method. When a server classifies a text that needs classification, in order to improve the accuracy of classification, the server can classify the text to be classified using the text classification method provided by this embodiment. The executive agent of the method is the server; see Fig. 2-1. The method comprises:
Step 201: obtain multiple text samples;
The text samples are used to train the word set corresponding to each category in the category set, and each of the multiple text samples corresponds to a category. In the embodiments of the present invention, the multiple text samples may be text samples of any category; in order to improve the accuracy of classification, the multiple text samples may include text samples corresponding to each category in the category set. For example, the category set comprises: finance, entertainment, sports, fashion, automobile, real estate, science and technology, education, and so on. When selecting text samples, the multiple text samples may include text samples of the category finance, text samples of the category entertainment, text samples of the category sports, text samples of the category fashion, text samples of the category automobile, text samples of the category real estate, text samples of the category science and technology, and text samples of the category education.
In the embodiments of the present invention, a user may select multiple text samples and then input them to the server; the server receives the multiple text samples input by the user.
Step 202: perform word segmentation on each of the multiple text samples, and form a training set from the obtained words;
Using an existing word segmentation tool, perform word segmentation on each of the multiple text samples to obtain the words included in each text sample; the words included in each text sample form the training set.
The process of segmenting a text sample with a word segmentation tool is prior art and is not described in detail here.
After the training set is obtained, step 203 is performed: an existing clustering method is used to cluster the words in the training set.
Step 203: cluster the words in the training set to obtain multiple word sets and the category of each of the multiple word sets;
This step can be realized through the following steps (1) to (3):
(1): obtain the term vector of each word in the training set;
The term vector of a word is a vector representation describing the characteristics of the word; in the embodiments of the present invention, the term vector of a word specifically refers to a vector representation of the word constructed with a word-embedding technique.
Any method of obtaining term vectors can be adopted in the embodiments of the present invention to obtain the term vector of each word in the training set; for example, the word2vec method, a word-embedding technique based on a neural network language model, can be used to obtain the term vector of a word. The detailed process of obtaining a word's term vector with word2vec is prior art and is not described in detail here.
The term vector of each word in the training set is an n-dimensional vector, which can be expressed as W_i = (w_1, w_2, ..., w_n), where W_i is the term vector of the i-th word and w_n is the value of the n-th dimension.
Since modal particles do not play a key role in classifying a text, in order to reduce the amount of computation and improve the accuracy of classification, such modal particles can be removed in this step, and term vectors are obtained only for the words remaining in the training set. This step can then be:
Obtain the words of a preset kind from the training set, remove the obtained words from the training set to obtain the remaining words in the training set, and obtain the term vectors of the remaining words.
The words of the preset kind can be modal particles, auxiliary words, and so on. The process of obtaining the term vectors of the remaining words is the same as the process of obtaining the term vector of each word in the training set, and is not repeated here.
Further, after the term vector of each word in the training set is obtained, each word and its term vector are stored in a correspondence of words and term vectors, so that when a text to be classified is being classified and the term vectors of the words it includes are needed, the term vectors can be obtained directly from the correspondence of words and term vectors; this saves the time of obtaining term vectors and improves the efficiency of text classification.
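The filtering and caching described above can be sketched as follows. This is a minimal illustration: the preset word list is a hypothetical stand-in for modal/auxiliary words, and `compute_vector` stands in for whatever embedding method (e.g. word2vec) actually produces the vectors:

```python
PRESET_WORDS = {"the", "of", "le"}  # hypothetical stand-in for modal/auxiliary words

def remaining_words(training_set):
    # Remove words of the preset kind; only the remaining words get term vectors.
    return [w for w in training_set if w not in PRESET_WORDS]

def build_vector_cache(words, compute_vector):
    # Store the "correspondence of words and term vectors" so that later
    # classification looks vectors up directly instead of recomputing them.
    return {w: compute_vector(w) for w in set(words)}

words = remaining_words(["price", "of", "the", "fund", "price"])
cache = build_vector_cache(words, lambda w: (float(len(w)), 1.0))  # dummy embedding
print(sorted(cache))  # → ['fund', 'price']
```

At classification time the server consults `cache` first and only falls back to the embedding model for unseen words.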
(2): calculate the distance between any two words according to the term vector of each word;
For any two words, calculate the distance between them according to their term vectors using the following formula (1):
dist(W_i, W_j) = (Σ_{k=1}^{n} W_{i,k} · W_{j,k}) / (|W_i| * |W_j|)    (1)
where W_i is the term vector of the i-th word and |W_i| is the magnitude of the vector of the i-th word; W_j is the term vector of the j-th word and |W_j| is the magnitude of the vector of the j-th word; and dist(W_i, W_j) is the distance between the i-th word and the j-th word.
If only the term vectors of the remaining words in the training set are obtained in step (1), this step can be:
calculate, according to the term vectors of the remaining words in the training set, the distance between any two of the remaining words.
(3): form a word set from multiple words whose mutual distances are less than a preset distance, and obtain the category of this word set as annotated by the user.
The distance between two words represents the similarity between them. If the distance between two words is less than the preset distance, the two words are determined to be close words; they are put into one word set and determined to belong to the same category. By this method, the words in the training set can be classified to form multiple word sets. The user determines the category of each word set according to the words it includes, annotates each word set to obtain its category, and then inputs the category of each word set to the server; the server receives the categories of the word sets input by the user.
The preset distance can be set and changed as required; the embodiments of the present invention place no specific restriction on it. For example, the preset distance can be 0.2 or 0.5.
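Steps (2) and (3) can be sketched as a simple threshold grouping. This is a minimal illustration under the assumption that one minus the cosine similarity of formula (1) serves as the distance (so that a small value means similar words); a real implementation would use a proper clustering method, and all words and vectors are made up:

```python
import math

def distance(u, v):
    # 1 minus the cosine similarity of formula (1): small distance = similar words
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def group_words(vectors, preset_distance):
    """vectors: {word: term vector}. Greedily puts each word into the first
    group whose representative word is within preset_distance (single-pass sketch)."""
    groups = []  # each group is a list of words; its first word is the representative
    for word, vec in vectors.items():
        for group in groups:
            if distance(vectors[group[0]], vec) < preset_distance:
                group.append(word)
                break
        else:
            groups.append([word])
    return groups

vecs = {"stock": (1.0, 0.0), "fund": (0.9, 0.1), "goal": (0.0, 1.0)}
print(group_words(vecs, 0.2))  # → [['stock', 'fund'], ['goal']]
```

The resulting groups would then be shown to the user, who annotates each one with a category.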
It should be noted that any clustering method can be adopted in the embodiments of the present invention to cluster the words in the training set into multiple word sets. For example, if hierarchical clustering is adopted, the multiple word sets and the relations among them can be obtained, as shown in Fig. 2-2: each circle represents a word, and the different levels represent the hierarchical structure of the clustering; in the clustering result, the word set corresponding to each level is annotated by manually browsing the words that the level includes.
In the embodiments of the present invention, clustering is applied to the words included in the multiple text samples, so that instead of annotating the multiple text samples, only the multiple word sets need to be annotated to obtain the word set of each category. Therefore, the present invention requires only a small amount of annotation, which saves human resources, shortens annotation time, and improves classification efficiency. Further, when obtaining the word set of each category, only a small number of text samples need to be obtained, and the text samples do not need to be annotated, which saves time and human resources and leads to faster classification. In the Internet industry in particular, there are usually many text categories and a huge number of texts; in order to classify texts quickly, the method provided by the embodiments of the present invention can be adopted to shorten classification time and improve classification efficiency.
In the embodiments of the present invention, by configuring the correspondence of categories and word sets, migration of the classification model is realized. The texts in different business scenarios may be longer news articles, or shorter texts such as video titles or user microblog posts, and different businesses may be concerned with different categories. Based on this idea, it is only necessary to add a category to the category set and establish the word set of the added category to realize migration of the classification model; this solves the problem of adapting the model to new scenarios, so that the classification model can respond quickly to the classification demands of different business scenarios.
Further, instead of using clustering to obtain the word set corresponding to each category, the embodiments of the present invention can obtain the word set corresponding to each category through direct annotation by the user. Steps 201-203 can then be replaced with: the user obtains multiple words to form the training set, classifies the words in the training set to obtain multiple word sets, annotates each of the multiple word sets to obtain its category, and then inputs each word set and its category to the server; the server receives the word sets and their categories input by the user.
Further, when the category of each word set has been obtained, the term vector of each category is calculated according to its word set, and each category and its term vector are stored in a correspondence of categories and term vectors, so that when the term vector of a category is needed later, no repeated calculation is required: the term vector of the category is obtained directly from the correspondence of categories and term vectors.
For each category, the process of calculating the category's term vector can be:
obtain the term vector of each word in the category's word set; calculate the average of the obtained term vectors and use the average term vector as the category's term vector.
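The averaging just described can be sketched as follows (a minimal illustration; the word vectors are made up):

```python
def category_vector(word_vectors):
    """Average the term vectors of the words in a category's word set
    and use the average as the category's term vector."""
    n = len(word_vectors)
    dims = len(word_vectors[0])
    return tuple(sum(vec[d] for vec in word_vectors) / n for d in range(dims))

# Hypothetical 2-dimensional vectors for three words in a "finance" word set
finance_words = [(1.0, 0.0), (0.8, 0.2), (0.6, 0.4)]
print(tuple(round(x, 3) for x in category_vector(finance_words)))  # → (0.8, 0.2)
```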
It should be noted that steps 201-203 are the process of training the word set of each category; therefore, steps 201-203 only need to be performed once. When texts that need classification are later classified according to the word sets of the categories, steps 201-203 do not need to be performed again; only steps 204 to 208 need to be performed to classify the texts.
Step 204: according to the word set of a first category, obtain the term vector of the first category, the first category being any category in the category set;
Specifically, obtain the term vector of each word in the word set of the first category; calculate the average of the obtained term vectors and use the average term vector as the term vector of the first category. Alternatively, obtain the term vector of the first category from the correspondence of categories and term vectors according to the first category.
The word2vec word-embedding method based on a neural network language model can be used to obtain the term vector of each word in the word set of the first category; alternatively, the term vector of each word can be obtained from the correspondence of words and term vectors according to each word in the word set of the first category. By this method, the term vector of each word in the word set of each category in the category set is obtained.
Step 205: obtain the term vector, term frequency, weight, and inverse document frequency of each word included in the text to be classified;
Using an existing word segmentation tool, perform word segmentation on the text to be classified to obtain the words it includes. Use the word2vec word-embedding method based on a neural network language model to obtain the term vector of each word, or obtain the term vector of each word from the correspondence of words and term vectors. For each word included in the text, count the number of times the word occurs in the text as the word's term frequency; obtain the position of the word in the text and, according to this position, obtain the word's weight; and obtain the word's inverse document frequency in the training set.
The inverse document frequency, also known as the anti-document frequency, is the inverse of the document frequency. The process by which the server obtains a word's inverse document frequency in the training set can be:
obtain the number of times the word occurs in the training set and the number of words included in the training set; calculate the ratio of the word's term frequency to this number of occurrences to obtain a first value, and calculate the word's inverse document frequency in the training set from the first value and this number.
The server sets the correspondence of word positions in the text and weights; then, according to the word's position in the text, the step of obtaining the word's weight can be:
according to the word's position in the text, obtain the word's weight from the correspondence of positions and weights.
When the server sets the correspondence of word positions and weights, higher weights can be assigned to words in the title, the summary, or other important positions of the text, and lower weights to words in the body of the text.
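The per-word features of step 205 — the term frequency count, the position-to-weight lookup, and the inverse document frequency — can be sketched as follows. This is a minimal illustration: the weight table is hypothetical, and the log-based idf shown is the common textbook formulation rather than necessarily the patent's exact computation:

```python
import math

# Hypothetical "correspondence of positions and weights": words in the title
# or summary get higher weight than words in the body.
POSITION_WEIGHTS = {"title": 3.0, "summary": 2.0, "body": 1.0}

def term_frequency(word, text_words):
    # Number of times the word occurs in the text to be classified
    return text_words.count(word)

def inverse_document_frequency(word, documents):
    # Common formulation: idf = log(N / number of documents containing the word)
    containing = sum(1 for doc in documents if word in doc)
    return math.log(len(documents) / containing)

text = ["game", "capital", "cost", "capital"]
docs = [["game", "fun"], ["capital", "market"], ["capital", "cost"], ["sports"]]
print(term_frequency("capital", text), POSITION_WEIGHTS["title"],
      round(inverse_document_frequency("capital", docs), 4))  # → 2 3.0 0.6931
```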
Step 206: according to the term vector of each word and the term vector of the first category, calculate the first membership degree between each word and the first category;
In the embodiments of the present invention, the first membership degree between each word and the first category is measured by the distance between the word's term vector and the first category's term vector; this step can then be:
calculate the distance between each word's term vector and the first category's term vector, and use this distance as the first membership degree between the word and the first category.
It should be noted that, by the above method, the first membership degree between each word and each category in the category set is calculated. Further, the process of calculating the distance between a word's term vector and the first category's term vector is the same as the process of calculating the distance between two vectors in step 203, and is not repeated here.
Step 207: according to the first membership degree between each word and the first category and the term frequency, weight, and inverse document frequency of each word, calculate the second membership degree between the text and the first category;
This step can be realized through the following steps (1) and (2):
(1): for each word, calculate the product of the word's term frequency, weight, inverse document frequency, and first membership degree with the first category, obtaining the third membership degree between the word and the first category;
For each word, calculate the third membership degree between the word and the first category according to the following formula (2):
f_{w_i} = p_{w_i} * tf_{w_i} * idf_{w_i} * b_{w_i,c}    (2)
where f_{w_i} is the third membership degree between the i-th word and the first category, p_{w_i} is the weight of the i-th word, tf_{w_i} is the term frequency of the i-th word, idf_{w_i} is the inverse document frequency of the i-th word, and b_{w_i,c} is the first membership degree between the i-th word and the first category.
(2): add up the third membership degrees between the words and the first category to obtain the second membership degree between the text and the first category.
According to the third membership degree between each word and the first category, calculate the second membership degree between the text and the first category according to the following formula (3):
F = f_1 + f_2 + ... + f_n = Σ_{w_i ∈ c} p_{w_i} * tf_{w_i} * idf_{w_i} * b_{w_i,c}    (3)
It should be noted that, for each category in the category set, the second membership degree between the text and that category is calculated by the above method.
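Formulas (2) and (3) can be written out directly. This is a minimal sketch with made-up numbers; `b` is the first membership degree already computed in step 206:

```python
def third_membership(weight, tf, idf, b):
    # Formula (2): f_wi = p_wi * tf_wi * idf_wi * b_wi,c
    return weight * tf * idf * b

def second_membership(word_features):
    # Formula (3): F = sum of the third membership degrees of all words
    return sum(third_membership(p, tf, idf, b) for p, tf, idf, b in word_features)

features = [
    (1.0, 2, 0.5, 0.9),   # (weight, tf, idf, first membership) for word 1
    (2.0, 1, 1.0, 0.4),   # (weight, tf, idf, first membership) for word 2
]
print(round(second_membership(features), 6))  # → 1.7
```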
Step 208: select from the category set a category whose second membership degree with the text meets a preset condition, and determine the selected category as the category of the text.
The second membership degree represents the similarity between the text and a category. The preset condition can be the maximum second membership degree, or a second membership degree greater than a first preset value. When the preset condition is the maximum second membership degree, this step can be: select from the category set the category with the maximum second membership degree with the text, and determine the selected category as the category of the text.
When the preset condition is a second membership degree greater than the first preset value, this step can be: obtain from the category set the categories whose second membership degrees with the text are greater than the first preset value, randomly select a category from the obtained categories, and determine the selected category as the category of the text.
Further, when the second membership degree between the text and each category in the category set has been obtained, the following formula (4) can also be adopted to normalize the second membership degrees, obtaining the normalized second membership degree between the text and each category:
F = (Σ_{w_i ∈ c} p_{w_i} * tf_{w_i} * idf_{w_i} * b_{w_i,c}) / (Σ_{c' ∈ C} Σ_{w_i ∈ c'} p_{w_i} * tf_{w_i} * idf_{w_i} * b_{w_i,c'})    (4)
This step can then be: select from the category set a category whose normalized second membership degree with the text meets the preset condition, and determine the selected category as the category of the text.
In this case, the preset condition can be the maximum normalized second membership degree, or a normalized second membership degree greater than a second preset value.
When the preset condition is the maximum normalized second membership degree, this step can be: select from the category set the category with the maximum normalized second membership degree with the text, and determine the selected category as the category of the text.
When the preset condition is a normalized second membership degree greater than the second preset value, this step can be: obtain from the category set the categories whose normalized second membership degrees with the text are greater than the second preset value, randomly select a category from the obtained categories, and determine the selected category as the category of the text.
The first preset value and the second preset value can be set and changed as required; the embodiments of the present invention place no specific restriction on them.
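The normalization of formula (4) and the selection rules above can be sketched as follows (a minimal illustration; the category scores are made up, and the threshold branch simply returns the first qualifying category as a stand-in for random selection):

```python
def normalize(memberships):
    # Formula (4): divide each category's second membership degree by the
    # sum over all categories, so the normalized scores sum to 1.
    total = sum(memberships.values())
    return {c: f / total for c, f in memberships.items()}

def select_category(memberships, threshold=None):
    """Preset condition: either the maximum normalized second membership degree,
    or any normalized degree greater than a preset value."""
    scores = normalize(memberships)
    if threshold is not None:
        candidates = [c for c, s in scores.items() if s > threshold]
        return candidates[0] if candidates else None  # stand-in for random choice
    return max(scores, key=scores.get)

scores = {"game": 3.0, "finance": 6.0, "sports": 1.0}
print(select_category(scores))  # → finance
```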
In the embodiments of the present invention, the second membership degree between a text to be classified and a first category is calculated according to the term vector, term frequency, weight, and inverse document frequency of each word included in the text and the term vector of the first category, the first category being any category in a category set; a category is then selected from the category set according to its second membership degree with the text. Because every word included in the text is considered when classifying the text, the accuracy of classification is improved.
Embodiment 3
An embodiment of the present invention provides a text classification device. Referring to Fig. 3, the device comprises:
First acquisition module 301, for acquiring the term vector, word frequency, weight and inverse document frequency of each word comprised in a text to be classified;
First computing module 302, for respectively calculating a first degree of membership between each word and a first category according to the term vector of each word and the term vector of the first category, the first category being any category in a category set;
Second computing module 303, for calculating a second degree of membership between the text and the first category according to the first degree of membership between each word and the first category and the word frequency, weight and inverse document frequency of each word;
Classification module 304, for selecting, from the category set, the category whose second degree of membership with the text meets a preset condition, and determining the selected category as the category of the text.
Further, the first computing module 302 comprises:
First acquiring unit, for acquiring the term vector of each word in the word set corresponding to the first category;
First computing unit, for calculating the average term vector of the acquired term vectors and using the average term vector as the term vector of the first category;
Second computing unit, for respectively calculating the distance between the term vector of each word and the term vector of the first category, and using the distance between the term vector of each word and the term vector of the first category as the first degree of membership between each word and the first category.
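As a sketch of the two computing units above, assuming term vectors are represented as NumPy arrays (all names are hypothetical; the patent does not prescribe an implementation):

```python
import numpy as np

def category_term_vector(word_vectors):
    """First computing unit: average the term vectors of the words in the
    word set corresponding to the category; the mean is used as the
    category's term vector."""
    return np.mean(np.asarray(word_vectors, dtype=float), axis=0)

def first_memberships(text_word_vectors, category_vector):
    """Second computing unit: the distance between each word's term vector
    and the category's term vector is taken as the first degree of
    membership between that word and the category."""
    vectors = np.asarray(text_word_vectors, dtype=float)
    return np.linalg.norm(vectors - category_vector, axis=1)

# Toy 2-dimensional term vectors for a category's word set.
cat_vec = category_term_vector([[1.0, 0.0], [3.0, 0.0]])
dists = first_memberships([[2.0, 0.0], [2.0, 3.0]], cat_vec)
```

Euclidean distance is assumed here; the patent leaves the distance measure open.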
Further, the device also comprises:
Second acquisition module, for acquiring multiple text samples;
Word segmentation module, for performing word segmentation on each text sample of the multiple text samples, and forming a training set from the obtained words;
Clustering module, for clustering the words in the training set to obtain multiple word sets and the category of each word set of the multiple word sets.
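A minimal sketch of the sample-acquisition and word-segmentation modules, assuming a whitespace tokenizer stands in for a real word segmenter (a Chinese segmenter such as jieba would be substituted in practice; the function names are hypothetical):

```python
def build_training_set(text_samples, tokenize=str.split):
    """Word segmentation module: segment each text sample and pool the
    resulting words into a training set (here, a set of distinct words)."""
    training_set = set()
    for sample in text_samples:
        training_set.update(tokenize(sample))
    return training_set

samples = ["the match score", "the market index"]
train = build_training_set(samples)
```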
Further, the clustering module comprises:
Second acquiring unit, for acquiring the term vector of each word in the training set;
Third computing unit, for calculating the distance between any two of the words according to the term vector of each word;
Clustering unit, for forming a word set from multiple words whose distance is less than a preset distance;
Third acquiring unit, for acquiring the category of the word set annotated by a user.
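The clustering units above can be sketched as a single-linkage grouping: any two words whose term-vector distance is below the preset distance land in the same word set. The union-find approach and all names are assumptions; the category label of each resulting word set would then come from user annotation:

```python
import numpy as np

def cluster_words(words, vectors, preset_distance):
    """Group words whose pairwise term-vector distance is below
    preset_distance into word sets (single-linkage via union-find)."""
    parent = list(range(len(words)))

    def find(i):
        # Path-halving find for the union-find structure.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    vecs = np.asarray(vectors, dtype=float)
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            if np.linalg.norm(vecs[i] - vecs[j]) < preset_distance:
                parent[find(i)] = find(j)  # merge the two groups

    clusters = {}
    for idx, word in enumerate(words):
        clusters.setdefault(find(idx), set()).add(word)
    return list(clusters.values())

word_sets = cluster_words(
    ["goal", "match", "stock"],
    [[0.0, 0.0], [0.5, 0.0], [5.0, 5.0]],
    preset_distance=1.0,
)
```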
Further, the second computing module 303 comprises:
Fourth computing unit, for respectively calculating the product of the word frequency, weight and inverse document frequency of each word and the first degree of membership between the word and the first category, to obtain a third degree of membership between each word and the first category;
Summing unit, for accumulating the third degrees of membership between each word and the first category, to obtain the second degree of membership between the text and the first category.
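Under the same assumed representation, the fourth computing unit and the summing unit reduce to one weighted sum. The per-word word frequency (tf), weight and inverse document frequency (idf) are taken as given inputs; all names are illustrative:

```python
def second_membership(words, tf, weight, idf, first_membership):
    """For each word, multiply its word frequency, weight, inverse document
    frequency and first degree of membership to the category (giving the
    third degree of membership), then accumulate over all words to obtain
    the second degree of membership between the text and the category."""
    return sum(
        tf[w] * weight[w] * idf[w] * first_membership[w] for w in words
    )

words = ["goal", "match"]
score = second_membership(
    words,
    tf={"goal": 2, "match": 1},
    weight={"goal": 1.0, "match": 0.5},
    idf={"goal": 1.5, "match": 2.0},
    first_membership={"goal": 0.2, "match": 0.4},
)
# 2*1.0*1.5*0.2 + 1*0.5*2.0*0.4 = 0.6 + 0.4 = 1.0
```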
In the embodiments of the present invention, the second degree of membership between the text to be classified and a first category is calculated according to the term vector, word frequency, weight and inverse document frequency of each word the text comprises and the term vector of the first category, the first category being any category in the category set; a category is then selected from the category set according to the second degree of membership with the text. Because each word the text comprises is considered when classifying the text to be classified, the accuracy of classification is improved.
Embodiment 4
Fig. 4 is a schematic structural diagram of a server provided by an embodiment of the present invention. The server 1900 may vary considerably with configuration or performance, and may comprise one or more central processing units (CPU) 1922 (for example, one or more processors), a memory 1932, and one or more storage media 1930 (for example, one or more mass storage devices) storing application programs 1942 or data 1944. The memory 1932 and the storage medium 1930 may be transient or persistent storage. The program stored on the storage medium 1930 may comprise one or more modules (not shown), each of which may comprise a series of instruction operations on the server. Further, the central processing unit 1922 may be configured to communicate with the storage medium 1930, so that the server 1900 executes the series of instruction operations in the storage medium 1930.
The server 1900 may also comprise one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
The server 1900 may include a memory, and one or more programs stored in the memory and configured to be executed by one or more processors, the one or more programs containing instructions for performing the following operations:
Acquiring the term vector, word frequency, weight and inverse document frequency of each word comprised in a text to be classified;
According to the term vector of each word and the term vector of a first category, respectively calculating a first degree of membership between each word and the first category, the first category being any category in a category set;
According to the first degree of membership between each word and the first category and the word frequency, weight and inverse document frequency of each word, calculating a second degree of membership between the text and the first category;
Selecting, from the category set, the category whose second degree of membership with the text meets a preset condition, and determining the selected category as the category of the text.
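The four operations above can be sketched end-to-end as follows. Note one ambiguity the patent leaves open: the first degree of membership is defined as a distance, so this sketch assumes a smaller distance indicates stronger membership and selects the minimum-scoring category; all names and data are illustrative assumptions:

```python
import numpy as np

def classify(text_words, vecs, tf, weight, idf, categories):
    """End-to-end sketch: per-category average term vector, per-word
    distance (first membership), tf*weight*idf-weighted accumulation
    (second membership), then selection of the closest category."""
    scores = {}
    for name, word_set_vecs in categories.items():
        # Category term vector = mean of its word set's term vectors.
        cat_vec = np.mean(np.asarray(word_set_vecs, dtype=float), axis=0)
        total = 0.0
        for w in text_words:
            d = float(np.linalg.norm(np.asarray(vecs[w], dtype=float) - cat_vec))
            total += tf[w] * weight[w] * idf[w] * d
        scores[name] = total
    # Smaller accumulated distance = stronger membership (assumption).
    return min(scores, key=scores.get)

cats = {"sports": [[1.0, 0.0]], "finance": [[0.0, 1.0]]}
label = classify(
    ["goal"],
    {"goal": [0.9, 0.1]},
    tf={"goal": 1}, weight={"goal": 1.0}, idf={"goal": 1.0},
    categories=cats,
)
```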
Further, respectively calculating the first degree of membership between each word and the first category according to the term vector of each word and the term vector of the first category comprises:
Acquiring the term vector of each word in the word set corresponding to the first category;
Calculating the average term vector of the acquired term vectors, and using the average term vector as the term vector of the first category;
Respectively calculating the distance between the term vector of each word and the term vector of the first category, and using the distance between the term vector of each word and the term vector of the first category as the first degree of membership between each word and the first category.
Further, the method also comprises:
Acquiring multiple text samples;
Performing word segmentation on each text sample of the multiple text samples, and forming a training set from the obtained words;
Clustering the words in the training set to obtain multiple word sets and the category of each word set of the multiple word sets.
Further, clustering the words in the training set to obtain the multiple word sets and the category of each word set of the multiple word sets comprises:
Acquiring the term vector of each word in the training set;
According to the term vector of each word, calculating the distance between any two of the words;
Forming a word set from multiple words whose distance is less than a preset distance, and acquiring the category of the word set annotated by a user.
Further, calculating the second degree of membership between the text and the first category according to the first degree of membership between each word and the first category and the word frequency, weight and inverse document frequency of each word comprises:
Respectively calculating the product of the word frequency, weight and inverse document frequency of each word and the first degree of membership between the word and the first category, to obtain a third degree of membership between each word and the first category;
Accumulating the third degrees of membership between each word and the first category, to obtain the second degree of membership between the text and the first category.
In the embodiments of the present invention, the second degree of membership between the text to be classified and a first category is calculated according to the term vector, word frequency, weight and inverse document frequency of each word the text comprises and the term vector of the first category, the first category being any category in the category set; a category is then selected from the category set according to the second degree of membership with the text. Because each word the text comprises is considered when classifying the text to be classified, the accuracy of classification is improved.
It should be noted that when the text classification device provided by the above embodiment classifies text, the division into the above functional modules is merely illustrative; in practical applications, the above functions can be allocated to different functional modules as required, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above. In addition, the text classification device provided by the above embodiment and the embodiments of the text classification method belong to the same concept; for its specific implementation process, refer to the method embodiments, which will not be repeated here.
A person of ordinary skill in the art will appreciate that all or part of the steps of the above embodiments can be implemented by hardware, or by a program instructing the relevant hardware; the program can be stored in a computer-readable storage medium, and the storage medium mentioned above can be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing is only a preferred embodiment of the present invention and is not intended to limit the present invention; any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims (10)

1. A text classification method, characterized in that the method comprises:
Acquiring the term vector, word frequency, weight and inverse document frequency of each word comprised in a text to be classified;
According to the term vector of each word and the term vector of a first category, respectively calculating a first degree of membership between each word and the first category, the first category being any category in a category set;
According to the first degree of membership between each word and the first category and the word frequency, weight and inverse document frequency of each word, calculating a second degree of membership between the text and the first category;
Selecting, from the category set, the category whose second degree of membership with the text meets a preset condition, and determining the selected category as the category of the text.
2. The method of claim 1, characterized in that respectively calculating the first degree of membership between each word and the first category according to the term vector of each word and the term vector of the first category comprises:
Acquiring the term vector of each word in the word set corresponding to the first category;
Calculating the average term vector of the acquired term vectors, and using the average term vector as the term vector of the first category;
Respectively calculating the distance between the term vector of each word and the term vector of the first category, and using the distance between the term vector of each word and the term vector of the first category as the first degree of membership between each word and the first category.
3. The method of claim 2, characterized in that the method further comprises:
Acquiring multiple text samples;
Performing word segmentation on each text sample of the multiple text samples, and forming a training set from the obtained words;
Clustering the words in the training set to obtain multiple word sets and the category of each word set of the multiple word sets.
4. The method of claim 3, characterized in that clustering the words in the training set to obtain the multiple word sets and the category of each word set of the multiple word sets comprises:
Acquiring the term vector of each word in the training set;
According to the term vector of each word, calculating the distance between any two of the words;
Forming a word set from multiple words whose distance is less than a preset distance, and acquiring the category of the word set annotated by a user.
5. The method of claim 1, characterized in that calculating the second degree of membership between the text and the first category according to the first degree of membership between each word and the first category and the word frequency, weight and inverse document frequency of each word comprises:
Respectively calculating the product of the word frequency, weight and inverse document frequency of each word and the first degree of membership between the word and the first category, to obtain a third degree of membership between each word and the first category;
Accumulating the third degrees of membership between each word and the first category, to obtain the second degree of membership between the text and the first category.
6. A text classification device, characterized in that the device comprises:
A first acquisition module, for acquiring the term vector, word frequency, weight and inverse document frequency of each word comprised in a text to be classified;
A first computing module, for respectively calculating a first degree of membership between each word and a first category according to the term vector of each word and the term vector of the first category, the first category being any category in a category set;
A second computing module, for calculating a second degree of membership between the text and the first category according to the first degree of membership between each word and the first category and the word frequency, weight and inverse document frequency of each word;
A classification module, for selecting, from the category set, the category whose second degree of membership with the text meets a preset condition, and determining the selected category as the category of the text.
7. The device of claim 6, characterized in that the first computing module comprises:
A first acquiring unit, for acquiring the term vector of each word in the word set corresponding to the first category;
A first computing unit, for calculating the average term vector of the acquired term vectors and using the average term vector as the term vector of the first category;
A second computing unit, for respectively calculating the distance between the term vector of each word and the term vector of the first category, and using the distance between the term vector of each word and the term vector of the first category as the first degree of membership between each word and the first category.
8. The device of claim 7, characterized in that the device further comprises:
A second acquisition module, for acquiring multiple text samples;
A word segmentation module, for performing word segmentation on each text sample of the multiple text samples, and forming a training set from the obtained words;
A clustering module, for clustering the words in the training set to obtain multiple word sets and the category of each word set of the multiple word sets.
9. The device of claim 8, characterized in that the clustering module comprises:
A second acquiring unit, for acquiring the term vector of each word in the training set;
A third computing unit, for calculating the distance between any two of the words according to the term vector of each word;
A clustering unit, for forming a word set from multiple words whose distance is less than a preset distance;
A third acquiring unit, for acquiring the category of the word set annotated by a user.
10. The device of claim 6, characterized in that the second computing module comprises:
A fourth computing unit, for respectively calculating the product of the word frequency, weight and inverse document frequency of each word and the first degree of membership between the word and the first category, to obtain a third degree of membership between each word and the first category;
A summing unit, for accumulating the third degrees of membership between each word and the first category, to obtain the second degree of membership between the text and the first category.
CN201510364152.4A 2015-06-26 2015-06-26 A kind of method and apparatus of text classification Active CN105005589B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510364152.4A CN105005589B (en) 2015-06-26 2015-06-26 A kind of method and apparatus of text classification


Publications (2)

Publication Number Publication Date
CN105005589A true CN105005589A (en) 2015-10-28
CN105005589B CN105005589B (en) 2017-12-29

Family

ID=54378265

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510364152.4A Active CN105005589B (en) 2015-06-26 2015-06-26 A kind of method and apparatus of text classification

Country Status (1)

Country Link
CN (1) CN105005589B (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101794311A (en) * 2010-03-05 2010-08-04 南京邮电大学 Fuzzy data mining based automatic classification method of Chinese web pages
US20110213777A1 (en) * 2010-02-01 2011-09-01 Alibaba Group Holding Limited Method and Apparatus of Text Classification
US20120288207A1 (en) * 2010-02-02 2012-11-15 Alibaba Group Holding Limited Method and System for Text Classification
CN104679860A (en) * 2015-02-27 2015-06-03 北京航空航天大学 Classifying method for unbalanced data


Cited By (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105608068A (en) * 2014-11-17 2016-05-25 三星电子株式会社 Display apparatus and method for summarizing of document
WO2017084267A1 (en) * 2015-11-18 2017-05-26 乐视控股(北京)有限公司 Method and device for keyphrase extraction
CN106874295A (en) * 2015-12-11 2017-06-20 腾讯科技(深圳)有限公司 A kind of method and device for determining service parameter
WO2017101342A1 (en) * 2015-12-15 2017-06-22 乐视控股(北京)有限公司 Sentiment classification method and apparatus
CN105631009A (en) * 2015-12-25 2016-06-01 广州视源电子科技股份有限公司 Retrieval method and system based on word vector similarity
WO2017107566A1 (en) * 2015-12-25 2017-06-29 广州视源电子科技股份有限公司 Retrieval method and system based on word vector similarity
CN107229636B (en) * 2016-03-24 2021-08-13 腾讯科技(深圳)有限公司 Word classification method and device
CN107229636A (en) * 2016-03-24 2017-10-03 腾讯科技(深圳)有限公司 A kind of method and device of word's kinds
CN106021578B (en) * 2016-06-01 2019-07-23 南京邮电大学 A kind of modified text classification algorithm based on cluster and degree of membership fusion
CN106021578A (en) * 2016-06-01 2016-10-12 南京邮电大学 Improved text classification algorithm based on integration of cluster and membership degree
CN106469192A (en) * 2016-08-30 2017-03-01 北京奇艺世纪科技有限公司 A kind of determination method and device of text relevant
CN106469192B (en) * 2016-08-30 2021-07-30 北京奇艺世纪科技有限公司 Text relevance determining method and device
CN106503153A (en) * 2016-10-21 2017-03-15 江苏理工学院 Computer text classification system, system and text classification method thereof
CN106503153B (en) * 2016-10-21 2019-05-10 江苏理工学院 Computer text classification system
CN108062954B (en) * 2016-11-08 2020-12-08 科大讯飞股份有限公司 Speech recognition method and device
CN108062954A (en) * 2016-11-08 2018-05-22 科大讯飞股份有限公司 Audio recognition method and device
CN106815310A (en) * 2016-12-20 2017-06-09 华南师范大学 A kind of hierarchy clustering method and system to magnanimity document sets
CN106611054A (en) * 2016-12-26 2017-05-03 电子科技大学 Method for extracting enterprise behavior or event from massive texts
CN108628875A (en) * 2017-03-17 2018-10-09 腾讯科技(北京)有限公司 A kind of extracting method of text label, device and server
CN107766426A (en) * 2017-09-14 2018-03-06 北京百分点信息科技有限公司 A kind of file classification method, device and electronic equipment
CN107894986A (en) * 2017-09-26 2018-04-10 北京纳人网络科技有限公司 A kind of business connection division methods, server and client based on vectorization
CN107894986B (en) * 2017-09-26 2021-03-30 北京纳人网络科技有限公司 Enterprise relation division method based on vectorization, server and client
CN108363716A (en) * 2017-12-28 2018-08-03 广州索答信息科技有限公司 Realm information method of generating classification model, sorting technique, equipment and storage medium
CN108363716B (en) * 2017-12-28 2020-04-24 广州索答信息科技有限公司 Domain information classification model generation method, classification method, device and storage medium
CN108415903B (en) * 2018-03-12 2021-09-07 武汉斗鱼网络科技有限公司 Evaluation method, storage medium, and apparatus for judging validity of search intention recognition
CN108415903A (en) * 2018-03-12 2018-08-17 武汉斗鱼网络科技有限公司 Judge evaluation method, storage medium and the equipment of search intention identification validity
CN108959453A (en) * 2018-06-14 2018-12-07 中南民族大学 Information extracting method, device and readable storage medium storing program for executing based on text cluster
CN108959453B (en) * 2018-06-14 2021-08-27 中南民族大学 Information extraction method and device based on text clustering and readable storage medium
CN109388693B (en) * 2018-09-13 2021-04-27 武汉斗鱼网络科技有限公司 Method for determining partition intention and related equipment
CN109388693A (en) * 2018-09-13 2019-02-26 武汉斗鱼网络科技有限公司 A kind of method and relevant device of determining subregion intention
CN110968690A (en) * 2018-09-30 2020-04-07 百度在线网络技术(北京)有限公司 Clustering division method and device for words, equipment and storage medium
CN110968690B (en) * 2018-09-30 2023-05-23 百度在线网络技术(北京)有限公司 Clustering division method and device for words, equipment and storage medium
CN109740152A (en) * 2018-12-25 2019-05-10 腾讯科技(深圳)有限公司 Determination method, apparatus, storage medium and the computer equipment of text classification
CN109740152B (en) * 2018-12-25 2023-02-17 腾讯科技(深圳)有限公司 Text category determination method and device, storage medium and computer equipment
CN110083828A (en) * 2019-03-29 2019-08-02 珠海远光移动互联科技有限公司 A kind of Text Clustering Method and device
CN112149414A (en) * 2020-09-23 2020-12-29 腾讯科技(深圳)有限公司 Text similarity determination method, device, equipment and storage medium
CN112149414B (en) * 2020-09-23 2023-06-23 腾讯科技(深圳)有限公司 Text similarity determination method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN105005589B (en) 2017-12-29

Similar Documents

Publication Publication Date Title
CN105005589A (en) Text classification method and text classification device
CN109213863B (en) Learning style-based adaptive recommendation method and system
Zhu et al. Heterogeneous hypergraph embedding for document recommendation
CN107506480B (en) Double-layer graph structure recommendation method based on comment mining and density clustering
Zhang et al. Comparison of text sentiment analysis based on machine learning
CN106251174A (en) Information recommendation method and device
CN104252456B (en) A kind of weight method of estimation, apparatus and system
CN105224699A (en) A kind of news recommend method and device
CN113139134B (en) Method and device for predicting popularity of user-generated content in social network
CN104750798A (en) Application program recommendation method and device
CN102141977A (en) Text classification method and device
CN107357793A (en) Information recommendation method and device
Gao et al. Text classification research based on improved Word2vec and CNN
CN111191099B (en) User activity type identification method based on social media
CN105138577A (en) Big data based event evolution analysis method
CN103473128A (en) Collaborative filtering method for mashup application recommendation
Lim et al. Mitigating online product rating biases through the discovery of optimistic, pessimistic, and realistic reviewers
WO2020147259A1 (en) User portait method and apparatus, readable storage medium, and terminal device
CN101789000A (en) Method for classifying modes in search engine
Park et al. Phrase embedding and clustering for sub-feature extraction from online data
CN103544299A (en) Construction method for commercial intelligent cloud computing system
CN107943947A (en) A kind of parallel KNN network public-opinion sorting algorithms of improvement based on Hadoop platform
CN118093962A (en) Data retrieval method, device, system, electronic equipment and readable storage medium
Yuan Big data recommendation research based on travel consumer sentiment analysis
CN103761246A (en) Link network based user domain identifying method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20190802

Address after: Shenzhen Futian District City, Guangdong province 518000 Zhenxing Road, SEG Science Park 2 East Room 403

Co-patentee after: Tencent cloud computing (Beijing) limited liability company

Patentee after: Tencent Technology (Shenzhen) Co., Ltd.

Address before: Shenzhen Futian District City, Guangdong province 518000 Zhenxing Road, SEG Science Park 2 East Room 403

Patentee before: Tencent Technology (Shenzhen) Co., Ltd.

TR01 Transfer of patent right