CN110083828A - Text clustering method and device - Google Patents
- Publication number
- CN110083828A (application number CN201910250896.1A)
- Authority
- CN
- China
- Prior art keywords
- text
- clustered
- feature words
- vector
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Abstract
The present invention relates to a text clustering method and device, which solve the long clustering time, low efficiency, and poor results of existing text clustering. The text clustering method of the present invention comprises the following steps: collect data to build a text library; extract all feature words in the text library; from the frequency with which each feature word occurs among all feature words in the library, compute the weight of each feature word; and save the feature words and their weights to a database. Then collect each text to be clustered and extract its feature words. From the feature words in each text to be clustered and their weights in the database, obtain the word vector of each feature word, the sentence vector of each text to be clustered, and the feature vector of all texts to be clustered. Finally, cluster the texts to be clustered using this feature vector. The method of the present invention effectively shortens clustering time, improves clustering efficiency, and achieves a good clustering result.
Description
Technical field
The present invention relates to the field of intelligent analysis of natural-language text, and in particular to a text clustering method and device.
Background art

Text clustering is an application in the field of intelligent natural-language text analysis: it groups similar texts together based on the similarity between them, helping users analyze and process text data of the same category.

Current text clustering methods mainly comprise supervised learning and unsupervised learning. Supervised methods require the category of every text in the training set to be known in advance; a model is built to capture the relationship between training texts and their categories, and texts of unknown category are then classified. The drawback of this approach is that a text belonging to none of the predefined categories cannot be assigned one.

If, on the other hand, no labeled text data exists, problems such as text classification and sentiment analysis can only fall back on traditional unsupervised methods. These largely compute sentence vectors from word vectors and then cluster by sentence similarity, yielding labeled sets of text data as the clustering result. However, existing clustering methods must recount the term frequencies of the feature words in the texts to be clustered every time in order to derive their weights; when the texts to be clustered are large in scale, this lengthens clustering time and lowers efficiency. Moreover, in existing weighting schemes, high-frequency feature words receive high relative weights, so the influence of the feature words other than the dominant ones on the whole text to be clustered is not adequately considered, and the clustering result is relatively poor.
Summary of the invention
In view of the above analysis, the present invention aims to provide a text clustering method and device that solve the long clustering time, low efficiency, and poor results of existing text clustering.

The objective of the present invention is mainly achieved through the following technical solutions:

In one aspect, a text clustering method is provided, comprising the following steps:

collecting data to build a text library, extracting all feature words in the text library, computing the weight of each feature word from the frequency with which it occurs among all feature words in the library, and saving the feature words and their weights to a database;

collecting each text to be clustered and extracting the feature words in each text to be clustered;

from the feature words in each text to be clustered and their weights in the database, obtaining the word vector of each feature word, the sentence vector of each text to be clustered, and the feature vector of all texts to be clustered;

clustering the texts to be clustered using their feature vector.
On the basis of the above scheme, the present invention also provides the following improvements:

Further, computing the weight of each feature word from the frequency with which it occurs among all feature words in the text library specifically comprises the following operations:

if the frequency of a feature word is below a frequency threshold, the feature word is discarded;

the reciprocal of each remaining feature word's frequency is taken as that feature word's weight.
Further, after the text library or a text to be clustered is obtained, its data is segmented into words and stop words are removed, yielding all feature words in the text library or the text to be clustered.
Further, obtaining the word vector of each feature word from the feature words in each text to be clustered and their weights in the database specifically comprises the following operation:

a word2vec model is trained on the feature words in the texts to be clustered, and the trained word2vec model is used to obtain the word vector of each feature word. The word vector of each feature word is denoted v_{1×D}, where D is the dimensionality of the word-vector space.
Further, the sentence vector of each text to be clustered is obtained by the following operation: from the feature words contained in each text to be clustered, its sentence vector is computed, where the sentence vector V_s of the s-th text to be clustered is expressed as

V_s = (1/N_s) · Σ_{i=1}^{N_s} w_{s,i} · v_{s,i}    (1)

where N_s is the number of word vectors contained in the s-th sentence to be clustered; v_{s,i} is the i-th word vector of the s-th sentence to be clustered; and w_{s,i}, the weight of the i-th word vector of the s-th sentence, is the weight of that feature word in the database.
Further, the feature vector of all texts to be clustered is obtained in the following manner: from the sentence vector of each sentence to be clustered, the feature vector S_{N×D} of the texts to be clustered is constructed as

S_{N×D} = [V_1, V_2, ..., V_N]^T    (2)

where N is the number of all sentences to be clustered and D is the dimensionality of the sentence vectors, equal to the dimensionality of the word vectors.
Further, clustering the texts to be clustered using their feature vector comprises the following operations:

singular value decomposition is applied to the feature vector S_{N×D} of the texts to be clustered, yielding the smoothed sentence-vector matrix S'_{N×D} of the whole text;

the texts to be clustered are then clustered by applying a clustering algorithm to the smoothed matrix S'_{N×D}.
Further, a hierarchical clustering algorithm realizes the clustering of the texts to be clustered:

each vector in the matrix S'_{N×D} is treated as an individual cluster;

the cosine distance between different clusters is computed, and sentence vectors whose cosine distance is below a given threshold are merged into one cluster; this step is repeated until all vectors of the texts to be clustered are classified.
In another aspect, a text clustering device corresponding to the above text clustering method is provided, the device comprising:

a term-weight computation module, which collects data to form a text library, extracts all feature words in the library, computes the weight of each feature word from the frequency with which it occurs among all feature words in the library, and saves the feature words and their weights to a database;

a feature-word acquisition module, which collects each text to be clustered and extracts the feature words in it;

a feature-vector acquisition module, which, from the feature words in each text to be clustered and their weights in the database, obtains the word vector of each feature word, the sentence vector of each text to be clustered, and the feature vector of all texts to be clustered;

a text clustering module, which clusters the texts to be clustered using their feature vector.

Further, computing the weight of each feature word from its frequency among all feature words in the library specifically comprises the following operations: if the frequency of a feature word is below a frequency threshold, the feature word is discarded; the reciprocal of each remaining feature word's frequency is taken as that feature word's weight.
Beneficial effects of the present invention: by collecting a large amount of varied network data in advance, the invention obtains a large vocabulary of feature words whose weights effectively characterize the probability of each word appearing in an ordinary sentence. Using these weights directly as the weights of the corresponding word vectors in the texts to be clustered effectively shortens computation time, and the larger the scale of the texts to be clustered, the more pronounced the saving. Meanwhile, under the weighting scheme of the present method, the higher a word's frequency of occurrence, the smaller its weight, so that when the sentence vectors of the texts to be clustered are computed, the weight of the dominant feature words is reduced and the influence of the other feature words on the whole text to be clustered is fully considered, effectively improving the clustering result.

In the present invention, the above technical solutions may also be combined with one another to realize further preferred combinations. Other features and advantages of the invention will be set forth in the following description; some advantages will become apparent from the description or be learned by practicing the invention. The objectives and other advantages of the invention can be realized and obtained through what is specifically pointed out in the description, the claims, and the drawings.
Brief description of the drawings

The drawings serve only to illustrate specific embodiments and are not to be construed as limiting the invention; throughout the drawings, identical reference symbols denote identical components.

Fig. 1 is a flow chart of the text clustering method in the first embodiment of the invention;

Fig. 2 shows part of the texts to be clustered in the second embodiment of the invention;

Fig. 3 shows part of the clustering result in the second embodiment of the invention;

Fig. 4 is a schematic diagram of the text clustering device in the third embodiment of the invention.
Detailed description of the embodiments

Preferred embodiments of the present invention are described below with reference to the drawings, which form part of the application and, together with the embodiments, serve to explain the principles of the invention; they are not intended to limit its scope.

The first embodiment of the present invention discloses a text clustering method whose flow chart is shown in Fig. 1, comprising the following steps:
Step S1: collect various data from the network to form a text library, extract all feature words in the library, compute the weight of each feature word from the frequency with which it occurs among all feature words in the library, and save the feature words and their weights to a database.

In this embodiment, a web crawler collects varied network data, such as news and encyclopedia data, to form the text library. Such data has wide coverage, large volume, and good representativeness, which guarantees that the computed feature-word frequencies represent how often the words occur in an ordinary natural-language environment.

After the text library is obtained, its data is segmented into words and stop words are removed, yielding all feature words in the library.
Here, word segmentation means dividing a text into morphemes. The present invention places no restriction on the segmentation method, as long as the feature words in the texts to be clustered can be obtained.

Stop words are function words without substantive meaning, such as "the"; removing them improves the quality of the feature words and the efficiency of text processing.
After all feature words in the text library are obtained, the frequency with which each feature word occurs among all feature words in the library is computed:

if a feature word's frequency is below the frequency threshold, the word is rarely used and is discarded. This, on one hand, shrinks the vocabulary; on the other hand, sentence-vector computation can then ignore these words, preventing their large weights from distorting the sentence vectors;

the reciprocal of each remaining feature word's frequency is taken as that word's weight.

The frequency thus determines the word's weight in the text library: the larger the weight, the greater the word's importance in the library, and the smaller the weight, the lesser its importance.
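The weighting step described above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the threshold value and toy corpus are assumptions:

```python
from collections import Counter

def feature_word_weights(corpus_words, freq_threshold=2):
    """Discard feature words occurring fewer than `freq_threshold` times,
    then use the reciprocal of each surviving word's count as its weight."""
    counts = Counter(corpus_words)
    return {word: 1.0 / count
            for word, count in counts.items()
            if count >= freq_threshold}

words = ["news", "news", "news", "market", "market", "rare"]
weights = feature_word_weights(words)
# The frequent word "news" gets a smaller weight than "market";
# "rare" falls below the threshold and is dropped.
```

Because the weights are computed once over the whole library and stored in the database, later clustering runs never need to recount frequencies, which is the source of the claimed speedup.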
Step S2: collect each text to be clustered and extract the feature words in each text to be clustered.

The texts to be clustered collected in the present invention are a subset of the data in the text library; that is, the library is guaranteed to contain every feature word of the texts to be clustered. The invention places no restriction on the concrete form of the texts to be clustered: they may be of any subject matter, for example Internet news data obtained by a web crawler or Chinese Wikipedia data. Nor is any particular file format required, as long as the text data to be clustered can be read normally.

The same word segmentation and stop-word removal as above yields all feature words in each text to be clustered.
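The segmentation and stop-word step can be sketched as below. Whitespace splitting stands in for a real morpheme-level segmenter (which the patent deliberately leaves unspecified), and the stop-word set is an illustrative fragment:

```python
STOP_WORDS = {"the", "a", "of", "and"}  # illustrative fragment, not a real list

def extract_feature_words(text, stop_words=STOP_WORDS):
    """Split a text into tokens and drop stop words.
    Whitespace splitting is a placeholder for a proper word segmenter."""
    return [token for token in text.lower().split()
            if token not in stop_words]

print(extract_feature_words("The price of oil and gas rose"))
# → ['price', 'oil', 'gas', 'rose']
```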
Step S3: from the feature words in each text to be clustered and their weights in the database, obtain the word vector of each feature word, the sentence vector of each text to be clustered, and the feature vector of all texts to be clustered.

Step S31: train a word2vec model on the feature words in the texts to be clustered, and use the trained word2vec model to obtain the word vector of each feature word.

word2vec is an open-source software tool for generating word vectors. Given a corpus, its optimized training models map each word of every sentence quickly and efficiently to a real-valued vector in a D-dimensional space, and these vectors capture syntactic and semantic features. Its core architectures are CBOW and Skip-gram.

The word vector of each feature word obtained by the present invention is denoted v_{1×D}, where D is the dimensionality of the word-vector space.
Step S32: from the feature words contained in each text to be clustered, compute the sentence vector of each text to be clustered, where the sentence vector V_s of the s-th text to be clustered is expressed as

V_s = (1/N_s) · Σ_{i=1}^{N_s} w_{s,i} · v_{s,i}    (1)

where N_s is the number of word vectors contained in the s-th sentence to be clustered; v_{s,i} is the i-th word vector of the s-th sentence to be clustered; and w_{s,i}, the weight of the i-th word vector of the s-th sentence, is the weight of that feature word in the database.
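The sentence vector of step S32 amounts to a weighted average of the sentence's word vectors. A pure-Python sketch with toy two-dimensional vectors and assumed weights:

```python
def sentence_vector(word_vectors, weights):
    """Compute V_s = (1/N_s) * sum_i w_{s,i} * v_{s,i}, the weighted
    average of a sentence's word vectors."""
    n = len(word_vectors)
    dim = len(word_vectors[0])
    total = [0.0] * dim
    for vec, w in zip(word_vectors, weights):
        for d in range(dim):
            total[d] += w * vec[d]
    return [component / n for component in total]

vecs = [[1.0, 0.0], [0.0, 1.0]]   # toy word vectors of one sentence
ws = [0.5, 0.25]                  # their stored database weights
print(sentence_vector(vecs, ws))  # → [0.25, 0.125]
```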
Step S33: from the sentence vector of each sentence to be clustered, construct the feature vector S_{N×D} of the texts to be clustered:

S_{N×D} = [V_1, V_2, ..., V_N]^T    (2)

where N is the number of all sentences to be clustered and D is the dimensionality of the sentence vectors, equal to the dimensionality of the word vectors.
Step S4: cluster the texts to be clustered using their feature vector.

Step S41: apply singular value decomposition to the feature vector S_{N×D} of the texts to be clustered, obtaining the smoothed sentence-vector matrix S'_{N×D} of the whole text.

The singular value decomposition finds the partial principal axes of the feature vector of the texts to be clustered; removing these principal axes from the feature vector achieves the smoothing effect.
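A NumPy sketch of this smoothing step. Here the single dominant right-singular vector is removed from every row, in the style of SIF sentence-embedding smoothing; treating the "partial principal axes" as exactly one component is an assumption, since the patent does not fix their number:

```python
import numpy as np

def remove_principal_component(S):
    """Subtract each row's projection onto the first right-singular
    vector of S, yielding the smoothed matrix S'."""
    _, _, vt = np.linalg.svd(S, full_matrices=False)
    u1 = vt[0]                          # dominant direction, unit norm
    return S - np.outer(S @ u1, u1)

S = np.array([[1.0, 1.0], [1.0, 1.1], [0.9, 1.0]])
S_smooth = remove_principal_component(S)
# Every smoothed row is orthogonal to the removed dominant direction.
```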
Step S42: cluster the texts to be clustered by applying a hierarchical clustering algorithm to the smoothed sentence-vector matrix S'_{N×D}:

treat each vector in the matrix S'_{N×D} as an individual cluster;

compute the cosine distance between different clusters, and merge sentence vectors whose cosine distance is below a given threshold into one cluster; repeat this step until all vectors of the texts to be clustered are classified.
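The merge loop of step S42 can be sketched in pure Python as below (a naive single-linkage-style variant; the threshold value and toy vectors are illustrative):

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def hierarchical_cluster(vectors, threshold=0.1):
    """Start from singleton clusters and repeatedly merge any two clusters
    containing a pair of vectors closer than `threshold`."""
    clusters = [[i] for i in range(len(vectors))]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(cosine_distance(vectors[a], vectors[b]) < threshold
                       for a in clusters[i] for b in clusters[j]):
                    clusters[i] += clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters

vecs = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
print(hierarchical_cluster(vecs, threshold=0.05))  # → [[0, 1], [2]]
```

This naive loop is O(n³); production code would use an optimized agglomerative implementation, but the stopping behavior matches the patent's description: merging stops once no pair of clusters is within the threshold.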
By collecting a large amount of varied network data in advance, the present invention obtains a large vocabulary of feature words whose weights effectively characterize the probability of each word appearing in an ordinary sentence. Using these weights directly as the weights of the corresponding word vectors in the texts to be clustered effectively shortens computation time, and the larger the scale of the texts to be clustered, the more pronounced the saving. Meanwhile, under the weighting scheme of the present method, the higher a word's frequency of occurrence, the smaller its weight, so that when the sentence vectors of the texts to be clustered are computed, the weight of the dominant feature words is reduced, the influence of the other feature words on the whole text to be clustered is fully considered, and the clustering result is effectively improved.
The second embodiment of the present invention discloses an application example of the text clustering method, with the following steps:

first, the database storing the feature words and their weights is obtained with the method described above;

using a web crawler, data is crawled from Sohu News as the texts to be clustered of this embodiment; part of their content is shown in Fig. 2;

the texts to be clustered are then classified with the above text clustering method, producing the clustering result, part of which is shown in Fig. 3.

This application example demonstrates that the text clustering method of the present application can cluster similar texts and that the clustering result is accurate.
The third embodiment of the present invention provides a text clustering device, shown schematically in Fig. 4 and corresponding to the above text clustering method. The device comprises:

a term-weight computation module, which collects data to form a text library, extracts all feature words in the library, computes the weight of each feature word from the frequency with which it occurs among all feature words in the library, and saves the feature words and their weights to a database;

a feature-word acquisition module, which collects each text to be clustered and extracts the feature words in it;

a feature-vector acquisition module, which, from the feature words in each text to be clustered and their weights in the database, obtains the word vector of each feature word, the sentence vector of each text to be clustered, and the feature vector of all texts to be clustered;

a text clustering module, which clusters the texts to be clustered using their feature vector.
Further, computing the weight of each feature word from its frequency among all feature words in the library specifically comprises the following operations: if the frequency of a feature word is below a frequency threshold, the feature word is discarded; the reciprocal of each remaining feature word's frequency is taken as that feature word's weight.
For the specific implementation of the device embodiment of the present invention, refer to the method embodiment above; it is not repeated here. Since this embodiment shares its principle with the method embodiment, the device also has the corresponding technical effects.

Those skilled in the art will understand that all or part of the processes of the above embodiments can be carried out by a computer program instructing the relevant hardware, and that the program can be stored in a computer-readable storage medium such as a magnetic disk, an optical disc, a read-only memory, or a random access memory.
The foregoing are only preferred embodiments of the present invention, but the scope of protection of the invention is not limited thereto; any change or substitution readily conceivable by a person skilled in the art within the technical scope disclosed by the invention shall be covered by the protection scope of the invention.
Claims (10)
1. A text clustering method, characterized by comprising the following steps:
collecting data to build a text library, extracting all feature words in the text library, computing the weight of each feature word from the frequency with which it occurs among all feature words in the library, and saving the feature words and their weights to a database;
collecting each text to be clustered and extracting the feature words in each text to be clustered;
from the feature words in each text to be clustered and their weights in the database, obtaining the word vector of each feature word, the sentence vector of each text to be clustered, and the feature vector of all texts to be clustered;
clustering the texts to be clustered using their feature vector.
2. The method according to claim 1, characterized in that computing the weight of each feature word from the frequency with which it occurs among all feature words in the text library specifically comprises the following operations:
if the frequency of a feature word is below a frequency threshold, the feature word is discarded;
the reciprocal of each remaining feature word's frequency is taken as that feature word's weight.
3. The method according to claim 1 or 2, characterized in that after the text library or a text to be clustered is obtained, its data is segmented into words and stop words are removed, yielding all feature words in the text library or the text to be clustered.
4. The method according to claim 3, characterized in that obtaining the word vector of each feature word from the feature words in each text to be clustered and their weights in the database specifically comprises the following operation:
a word2vec model is trained on the feature words in the texts to be clustered, and the trained word2vec model is used to obtain the word vector of each feature word, denoted v_{1×D}, where D is the dimensionality of the word-vector space.
5. The method according to claim 4, characterized in that the sentence vector of each text to be clustered is obtained by the following operation:
from the feature words contained in each text to be clustered, the sentence vector of each text to be clustered is computed, where the sentence vector V_s of the s-th text to be clustered is expressed as
V_s = (1/N_s) · Σ_{i=1}^{N_s} w_{s,i} · v_{s,i}    (1)
where N_s is the number of word vectors contained in the s-th sentence to be clustered; v_{s,i} is the i-th word vector of the s-th sentence to be clustered; and w_{s,i}, the weight of the i-th word vector of the s-th sentence, is the weight of that feature word in the database.
6. The method according to claim 5, characterized in that the feature vector of all texts to be clustered is obtained in the following manner:
from the sentence vector of each sentence to be clustered, the feature vector S_{N×D} of the texts to be clustered is constructed as
S_{N×D} = [V_1, V_2, ..., V_N]^T    (2)
where N is the number of all sentences to be clustered and D is the dimensionality of the sentence vectors, equal to the dimensionality of the word vectors.
7. The method according to claim 1 or 6, characterized in that clustering the texts to be clustered using their feature vector comprises the following operations:
singular value decomposition is applied to the feature vector S_{N×D} of the texts to be clustered, yielding the smoothed sentence-vector matrix S'_{N×D};
the texts to be clustered are then clustered by applying a clustering algorithm to the smoothed matrix S'_{N×D}.
8. The method according to claim 7, characterized in that a hierarchical clustering algorithm realizes the clustering of the texts to be clustered:
each vector in the matrix S'_{N×D} is treated as an individual cluster;
the cosine distance between different clusters is computed, and sentence vectors whose cosine distance is below a given threshold are merged into one cluster; this step is repeated until all vectors of the texts to be clustered are classified.
9. A text clustering device using the text clustering method of any one of claims 1-8, characterized in that the device comprises:
a term-weight computation module, which collects data to form a text library, extracts all feature words in the library, computes the weight of each feature word from the frequency with which it occurs among all feature words in the library, and saves the feature words and their weights to a database;
a feature-word acquisition module, which collects each text to be clustered and extracts the feature words in it;
a feature-vector acquisition module, which, from the feature words in each text to be clustered and their weights in the database, obtains the word vector of each feature word, the sentence vector of each text to be clustered, and the feature vector of all texts to be clustered;
a text clustering module, which clusters the texts to be clustered using their feature vector.
10. The device according to claim 9, characterized in that computing the weight of each feature word from the frequency with which it occurs among all feature words in the text library specifically comprises the following operations:
if the frequency of a feature word is below a frequency threshold, the feature word is discarded;
the reciprocal of each remaining feature word's frequency is taken as that feature word's weight.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910250896.1A CN110083828A (en) | 2019-03-29 | 2019-03-29 | A kind of Text Clustering Method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110083828A (en) | 2019-08-02
Family
ID=67413950
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910250896.1A Pending CN110083828A (en) | 2019-03-29 | 2019-03-29 | A kind of Text Clustering Method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110083828A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111368081A (en) * | 2020-03-03 | 2020-07-03 | 支付宝(杭州)信息技术有限公司 | Method and system for determining selected text content |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104778158A (en) * | 2015-03-04 | 2015-07-15 | 新浪网技术(中国)有限公司 | Method and device for representing text |
CN104834735A (en) * | 2015-05-18 | 2015-08-12 | 大连理工大学 | Automatic document summarization extraction method based on term vectors |
CN105005589A (en) * | 2015-06-26 | 2015-10-28 | 腾讯科技(深圳)有限公司 | Text classification method and text classification device |
CN105022840A (en) * | 2015-08-18 | 2015-11-04 | 新华网股份有限公司 | News information processing method, news recommendation method and related devices |
CN106599072A (en) * | 2016-11-21 | 2017-04-26 | 东软集团股份有限公司 | Text clustering method and device |
CN108595706A (en) * | 2018-05-10 | 2018-09-28 | 中国科学院信息工程研究所 | A kind of document semantic representation method, file classification method and device based on theme part of speech similitude |
CN109101479A (en) * | 2018-06-07 | 2018-12-28 | 苏宁易购集团股份有限公司 | A kind of clustering method and device for Chinese sentence |
CN109508456A (en) * | 2018-10-22 | 2019-03-22 | 网易(杭州)网络有限公司 | A kind of text handling method and device |
Similar Documents
Publication | Title |
---|---|
Jiang et al. | Text classification using novel term weighting scheme-based improved TF-IDF for internet media reports |
CN102662931B | Semantic role labeling method based on synergetic neural network |
CN107229610A | The analysis method and device of a kind of affection data |
CN110427463A | Search statement response method, device and server and storage medium |
WO2018086401A1 | Cluster processing method and device for questions in automatic question and answering system |
Lin et al. | Deep structured scene parsing by learning with image descriptions |
CN106897262A | A kind of file classification method and device and treating method and apparatus |
CN113961705A | Text classification method and server |
EP3799640A1 | Semantic parsing of natural language query |
CN108090178A | A kind of text data analysis method, device, server and storage medium |
CN109960791A | Judge the method and storage medium, terminal of text emotion |
CN103020167A | Chinese text classification method for computer |
CN110929028A | Log classification method and device |
CN114997288A | Design resource association method |
CN112989813A | Scientific and technological resource relation extraction method and device based on pre-training language model |
Gavval et al. | CUDA-Self-Organizing feature map based visual sentiment analysis of bank customer complaints for Analytical CRM |
CN108846142A | A kind of Text Clustering Method, device, equipment and readable storage medium storing program for executing |
CN110083828A | A kind of Text Clustering Method and device |
CN110309513B | Text dependency analysis method and device |
JP2019082860A | Generation program, generation method and generation device |
CN110209895A | Vector index method, apparatus and equipment |
CN112069322B | Text multi-label analysis method and device, electronic equipment and storage medium |
Li et al. | Evaluating BERT on cloud-edge time series forecasting and sentiment analysis via prompt learning |
CN114462673A | Methods, systems, computing devices, and readable media for predicting future events |
Xiao et al. | Domain ontology learning enhanced by optimized relation instance in dbpedia |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
TA01 | Transfer of patent application right | ||
Effective date of registration: 2021-07-29
Address after: 519000 Guangdong Zhuhai science and technology innovation coastal high beam Software Park
Applicant after: YUANGANG SOFTWARE Co.,Ltd.
Applicant after: Zhuhai Yuanguang Mobile Interconnection Technology Co.,Ltd.
Address before: 519000 room 105-4675, No. 6, Baohua Road, Hengqin new area, Zhuhai, Guangdong
Applicant before: Zhuhai Yuanguang Mobile Interconnection Technology Co.,Ltd.