CN107180075A - Method for automatically generating labels by text classification integrated with hierarchical clustering - Google Patents

Method for automatically generating labels by text classification integrated with hierarchical clustering

Info

Publication number
CN107180075A
CN107180075A
Authority
CN
China
Prior art keywords
text
cluster
clustering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710249462.0A
Other languages
Chinese (zh)
Inventor
刘东升
许翀寰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Gongshang University
Original Assignee
Zhejiang Gongshang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Gongshang University
Priority to CN201710249462.0A
Publication of CN107180075A
Legal status: Pending

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G06F16/3335 Syntactic pre-processing, e.g. stopword elimination, stemming
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for automatically generating labels by text classification integrated with hierarchical clustering, comprising the following steps: text preprocessing; text representation; feature dimensionality reduction; candidate-set selection; cluster analysis, in which cluster analysis is performed on the candidates that occur most frequently to obtain clusters; and cluster ranking with representative-word selection, in which the highest-scoring word of each cluster is chosen as the cluster's representative word, the clusters are then ranked, and the corresponding string of cluster representative words is the automatically generated sequence of labels. The invention builds a training corpus manually and uses the target categories as the candidate set; the candidates are then treated as keywords for cluster analysis: similarities are computed, clusters are formed and ranked, and representative words are selected to obtain the final user labels. By integrating the classification-based and keyword-based methods, more accurate labels can be generated, with a particularly significant effect on large-scale, sparse, or complex data.

Description

Method for automatically generating labels by text classification integrated with hierarchical clustering
Technical field
The present invention relates to the field of big-data algorithms, and more particularly to a method for automatically generating labels by text classification integrated with hierarchical clustering.
Background art
In the big-data era, more and more Internet services, such as microblogs and QQ, have emerged, pouring massive heterogeneous information into the network. The "label" arose as a form of information description intended to strengthen the management and use of this information. Labels help us grasp the topics and content of all kinds of resources more effectively, and they also facilitate the discovery, management, dissemination, and utilization of information. Two elements are key when using labels to describe information resources: obtaining labels and controlling label quality. The quality and quantity of labels strongly affect their descriptive power. A method for automatic label generation is useless if it cannot actually produce labels, and label quality is an equally important index. Label quality can be judged along two dimensions: first, whether the generated result reflects the intrinsic attributes or interests of the article or person in question; second, whether the generated result is suitable for use as a label. The more widely used baseline systems can, in any case, basically accomplish this goal, but because of certain one-sided aspects of those methods (for example, the inability to avoid the accumulation of synonymous labels), they cannot generate sufficiently accurate labels. Moreover, some traditional data analysis and mining techniques cannot meet current technical requirements, which makes label generation difficult to realize.
Existing label generation methods include classification-based methods, generation methods based on resources such as Baidu Baike, and keyword-based TextRank methods. Most of them extract the more important words in order to generate labels; some exploit encyclopedia entry information and choose fine-grained categories that reflect certain attributes as labels. Almost all of these methods suffer from the accumulation of synonymous labels.
Summary of the invention
The present invention overcomes the drawback of the prior art that synonymous labels easily accumulate, and provides a method for automatically generating labels by text classification integrated with hierarchical clustering.
To solve the above technical problem, the present invention is achieved through the following technical solution:
A method for automatically generating labels by text classification integrated with hierarchical clustering comprises the following steps:
Text preprocessing: preprocess the English and/or Chinese text to obtain words;
Text representation: determine features from the words obtained by preprocessing, and then build a text representation model that can describe the text;
Feature dimensionality reduction: perform feature selection, and reduce the dimensionality of the selected features;
Candidate-set selection: after dimensionality reduction, extract the corresponding categories as features according to their discrimination, extract a text collection, run predictions on it, and choose the categories that occur most frequently as the candidate set;
Cluster analysis: perform cluster analysis on the candidates that occur most frequently to obtain clusters;
Cluster ranking and representative-word selection: choose the highest-scoring word of each cluster as the cluster's representative word, then rank the clusters; the corresponding string of cluster representative words is the automatically generated sequence of labels.
In one embodiment, the text representation model is obtained by normalization and is expressed as

w(u,\bar{d}) = \frac{tf(u,\bar{d}) \cdot \log(N/n_u + 0.01)}{\sqrt{\sum_{u=1}^{n} \left[ tf(u,\bar{d}) \cdot \log(N/n_u + 0.01) \right]^2}}    (1)

In formula (1), tf(u,\bar{d}) is the term frequency of term u in the text, n_u is the number of texts in the training corpus that contain term u, N is the total number of texts in the training corpus, and w(u,\bar{d}) is the text representation model, i.e. the weight of term u.
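For illustration only, the following Python sketch computes the normalized TF-IDF weights of formula (1); the toy corpus, the tokenized input, and all function names are assumptions of this example, not part of the patent.

```python
import math

def tfidf_weights(doc_terms, corpus):
    """Formula (1): weight each term u by tf(u,d) * log(N/n_u + 0.01),
    then L2-normalize the weights over the document."""
    N = len(corpus)                              # total number of texts in the corpus
    df = {}                                      # n_u: number of texts containing term u
    for doc in corpus:
        for u in set(doc):
            df[u] = df.get(u, 0) + 1
    raw = {}
    for u in set(doc_terms):
        tf = doc_terms.count(u)                  # term frequency of u in this text
        raw[u] = tf * math.log(N / df.get(u, 1) + 0.01)
    norm = math.sqrt(sum(x * x for x in raw.values()))
    return {u: x / norm for u, x in raw.items()} if norm else raw

# Toy usage:
corpus = [["label", "text", "cluster"], ["text", "classify"], ["cluster", "rank"]]
print(tfidf_weights(["text", "cluster", "cluster"], corpus))
```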
In one embodiment, the feature selection and the dimensionality reduction of the selected features proceed as follows:
Compute the degree of correlation between a feature item and a text category with the following formula:

\chi^2(u,e) = \frac{N \cdot (WZ - XY)^2}{(W+Y)(X+Z)(W+X)(Y+Z)}    (2)

Formula (2) computes the degree of correlation, where \chi^2(u,e) denotes the correlation between feature item u and text category e. Here W is the number of texts that contain feature item u and belong to category e, X is the number of texts that contain u but do not belong to category e, Y is the number of texts that belong to category e but do not contain u, Z is the number of texts that neither belong to category e nor contain u, and N is the total number of texts in the training corpus;
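A minimal sketch of the \chi^2 statistic of formula (2), assuming one category label per text; the function signature and variable names are illustrative.

```python
def chi_square(corpus, labels, u, e):
    """Formula (2): chi-square correlation between feature item u and category e.
    corpus: list of token lists; labels: the category of each text."""
    W = X = Y = Z = 0
    for doc, cat in zip(corpus, labels):
        has_u, in_e = u in doc, cat == e
        if has_u and in_e:
            W += 1          # contains u and belongs to e
        elif has_u:
            X += 1          # contains u but is outside e
        elif in_e:
            Y += 1          # belongs to e but lacks u
        else:
            Z += 1          # neither contains u nor belongs to e
    N = W + X + Y + Z
    denom = (W + Y) * (X + Z) * (W + X) * (Y + Z)
    return N * (W * Z - X * Y) ** 2 / denom if denom else 0.0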
Then perform category processing on the determined text category e; the process is:
for any document w in the text collection of category e, generate a document length L, where L obeys a Poisson distribution;
for document w, sample a multinomial distribution over the k latent topics of the document;
consider each word in the document, thereby further obtaining more accurate text categories.
In one embodiment, the candidate-set selection proceeds as follows (see the sketch after this passage):
according to the magnitude of the discrimination of the refined text categories, find the salient features;
then extract a text collection V = {v_1, ..., v_n} of n texts and predict with the trained text classifier, obtaining the predicted category list L = {l_1, ..., l_n} for the n texts; over the predicted category list define a counter count(x, L), x \in C, where C denotes the candidate set, which returns the number of times x occurs in the list,

Rank(c) = count(c, L), c \in C    (3)

where n is a natural number and c is a candidate word; sort the ranks from high to low and choose the top n entries; the chosen Top(n) is exactly the candidate set C.
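A sketch of the counting-and-ranking step of formula (3); the classifier interface (classifier.predict) and the Top(n) cutoff are assumptions of this example.

```python
from collections import Counter

def select_candidates(texts, classifier, top_n):
    """Formula (3): predict a category for each text, count how often each
    category appears in the prediction list L (Rank(c) = count(c, L)),
    and keep the Top(n) most frequent categories as the candidate set C."""
    predicted = [classifier.predict(t) for t in texts]   # prediction list L
    rank = Counter(predicted)                            # count(c, L) for every c
    return [c for c, _ in rank.most_common(top_n)]       # candidate set C
```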
In one embodiment, the detailed process of the cluster analysis is: perform hierarchical clustering on the resulting Top(n) candidate set.
In one embodiment, depending on the distance measure used, the hierarchical clustering of the Top(n) candidate set includes a single-linkage algorithm, a complete-linkage algorithm, and an average-distance algorithm;
The single-linkage algorithm is expressed as:

d(P_1, P_2) = \min_{r_1 \in P_1, r_2 \in P_2} d(r_1, r_2)    (4)

The single-linkage algorithm takes the distance between the nearest objects in two clusters as the inter-cluster distance; merging terminates when the distance exceeds the set threshold, where r_1, r_2 are objects belonging to clusters P_1, P_2 respectively.
The complete-linkage algorithm is expressed as:

d(P_1, P_2) = \max_{r_1 \in P_1, r_2 \in P_2} d(r_1, r_2)    (5)

The complete-linkage algorithm takes the distance between the farthest objects in two clusters as the inter-cluster distance; merging terminates when the distance exceeds the set threshold, where r_1, r_2 are objects belonging to clusters P_1, P_2 respectively.
The average-distance algorithm is expressed as:

d(P_1, P_2) = d(q_1, q_2), \quad q_1 = \frac{1}{n_1} \sum_{r \in P_1} r, \quad q_2 = \frac{1}{n_2} \sum_{r \in P_2} r    (6)

where q_1, q_2 are the means of the two clusters and n_1, n_2 are the numbers of objects in the two clusters respectively.
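A sketch of agglomerative clustering under the three linkages of formulas (4) to (6), using SciPy; the word vectors and the distance threshold are assumptions (note that SciPy's 'average' linkage averages pairwise distances, while its 'centroid' linkage matches the cluster-mean distance of formula (6)).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_candidates(vectors, method="single", threshold=1.0):
    """Agglomerative clustering of candidate-word vectors.
    method: 'single' (formula 4), 'complete' (formula 5),
    or 'centroid' for the cluster-mean distance of formula (6).
    Merging stops once the inter-cluster distance exceeds the threshold."""
    Z = linkage(vectors, method=method, metric="euclidean")
    return fcluster(Z, t=threshold, criterion="distance")   # cluster id per word

# Toy usage with random vectors standing in for candidate-word embeddings:
rng = np.random.default_rng(0)
print(cluster_candidates(rng.random((10, 5)), method="single", threshold=0.8))
```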
In one embodiment, the cluster ranking and the selection of representative words that yield the labels proceed as follows:
Let H = (S, F) denote the directed graph formed by the words in the text, where S is the set of word nodes and F the set of edges; S_i denotes the i-th node and S_j the j-th node. The score of node S_i is computed by the following formula:

D(S_i) = (1 - v) + v \cdot \sum_{S_j \in In(S_i)} \frac{\omega_{ji} \cdot D(S_j)}{\sum_{S_k \in Out(S_j)} \omega_{jk}}    (7)

where v is the probability of jumping from a given node to a random node in the graph, taking a value between 0 and 1; In(S_i) is the set of nodes pointing to S_i; Out(S_j) is the set of nodes that S_j points to; and \omega_{ij} is the weight of the edge between two nodes. In effect a text contains k words; the word with the highest score is selected as the cluster's representative word, and the string of representative words corresponding to the clusters is the label.
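A minimal sketch of the iterative scoring of formula (7); how the word graph is built (for example, from a co-occurrence window) and the iteration count are assumptions of this example.

```python
def textrank_scores(edges, v=0.85, iters=50):
    """Formula (7): iterate D(S_i) = (1 - v) + v * sum over in-neighbors S_j
    of w_ji * D(S_j) / sum_k w_jk.
    edges: dict mapping (j, i) -> weight w_ji of the directed edge j -> i."""
    nodes = {n for e in edges for n in e}
    out_sum = {n: 0.0 for n in nodes}             # sum_k w_jk for each source node j
    for (j, _), w in edges.items():
        out_sum[j] += w
    D = {n: 1.0 for n in nodes}
    for _ in range(iters):
        D = {i: (1 - v) + v * sum(w * D[j] / out_sum[j]
                                  for (j, t), w in edges.items() if t == i)
             for i in nodes}
    return D

# Toy usage: the highest-scoring word of a cluster becomes its representative.
edges = {("text", "label"): 1.0, ("label", "text"): 1.0, ("cluster", "label"): 2.0}
scores = textrank_scores(edges)
print(max(scores, key=scores.get))
```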
By adopting the above technical solution, the present invention achieves significant technical effects:
Building on text-based label generation methods, the present invention proposes a method for automatically generating labels by cluster analysis based on text classification. The method builds a training corpus manually and uses the target categories as the candidate set; the candidates are then treated as keywords for cluster analysis: similarities are computed, clusters are formed and ranked, and representative words are selected to obtain the final user labels. By integrating the classification-based and keyword-based methods in this way, more accurate and higher-quality labels can be generated, with a particularly significant effect on large-scale, sparse, or complex data.
Brief description of the drawings
To explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of the overall flow of the present invention.
Detailed description of the embodiments
The present invention is described in further detail below with reference to embodiments; the following embodiments explain the invention, but the invention is not limited to them.
Embodiment 1:
A method for automatically generating labels by text classification integrated with hierarchical clustering, as shown in Fig. 1, comprises the following steps:
S1, text preprocessing: preprocess the English and/or Chinese text to obtain words;
S2, text representation: determine features from the words obtained by preprocessing, and then build a text representation model that can describe the text;
S3, feature dimensionality reduction: perform feature selection, and reduce the dimensionality of the selected features;
S4, candidate-set selection: after dimensionality reduction, extract the corresponding categories as features according to their discrimination, extract a text collection, run predictions on it, and choose the categories that occur most frequently as the candidate set;
S5, cluster analysis: perform cluster analysis on the candidates that occur most frequently to obtain clusters;
S6, cluster ranking and representative-word selection: choose the highest-scoring word of each cluster as the cluster's representative word, then rank the clusters; the corresponding string of cluster representative words is the automatically generated sequence of labels.
In S2, the text representation model is obtained by normalization and is expressed as

w(u,\bar{d}) = \frac{tf(u,\bar{d}) \cdot \log(N/n_u + 0.01)}{\sqrt{\sum_{u=1}^{n} \left[ tf(u,\bar{d}) \cdot \log(N/n_u + 0.01) \right]^2}}    (1)

In formula (1), tf(u,\bar{d}) is the term frequency of term u in the text, n_u is the number of texts in the training corpus that contain term u, N is the total number of texts in the training corpus, and w(u,\bar{d}) is the text representation model.
Text classification commonly suffers from excessively high feature dimensionality and data sparseness, so the features are first reduced in dimensionality and the processed features are then classified. The detailed process of step S3 is:
Compute the degree of correlation between a feature item and a text category with the following formula:

\chi^2(u,e) = \frac{N \cdot (WZ - XY)^2}{(W+Y)(X+Z)(W+X)(Y+Z)}    (2)

Formula (2) computes the degree of correlation, where \chi^2(u,e) denotes the correlation between feature item u and text category e. Here W is the number of texts that contain feature item u and belong to category e, X is the number of texts that contain u but do not belong to category e, Y is the number of texts that belong to category e but do not contain u, Z is the number of texts that neither belong to category e nor contain u, and N is the total number of texts in the training corpus;
Then perform category processing on the determined text category e. The processing applies a three-layer Bayesian probability model to words, topics, and documents (see the sketch after this list), and proceeds as follows:
for any document w in the text collection of category e, generate a document length L, where L obeys a Poisson distribution;
for document w, sample a multinomial distribution over the k latent topics of the document;
consider each word in the document, thereby further obtaining more accurate text categories.
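The three-layer word/topic/document model described here matches the generative process of latent Dirichlet allocation; the sketch below, with assumed hyperparameters, draws one document under that process.

```python
import numpy as np

def generate_document(beta, alpha=0.1, mean_length=50, rng=None):
    """Three-layer Bayesian sketch: draw a Poisson document length L, a
    multinomial mixture theta over k latent topics, then a topic and a
    word for each position. beta: (k, vocab_size) topic-word distributions."""
    rng = rng or np.random.default_rng()
    k = beta.shape[0]
    L = rng.poisson(mean_length)           # document length obeys a Poisson distribution
    theta = rng.dirichlet([alpha] * k)     # multinomial distribution over k latent topics
    words = []
    for _ in range(L):
        z = rng.choice(k, p=theta)         # latent topic of this word position
        words.append(rng.choice(beta.shape[1], p=beta[z]))
    return words

# Toy usage: 3 topics over a 20-word vocabulary.
beta = np.random.default_rng(1).dirichlet([0.5] * 20, size=3)
print(generate_document(beta)[:10])
```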
In step S4, the candidate-set selection proceeds as follows:
according to the magnitude of the discrimination of the refined text categories, find the salient features;
then extract a text collection V = {v_1, ..., v_n} of n texts and predict with the trained text classifier, obtaining the predicted category list L = {l_1, ..., l_n} for the n texts; over the predicted category list define a counter count(x, L), x \in C, where C denotes the candidate set, which returns the number of times x occurs in the list,

Rank(c) = count(c, L), c \in C    (3)

where n is a natural number and c is a candidate word; sort the ranks from high to low and choose the top n entries; the chosen Top(n) is exactly the candidate set C. Hierarchical clustering is then performed on the resulting Top(n) candidate set;
In the hierarchical clustering, depending on the distance measure used, the hierarchical clustering includes a single-linkage algorithm, a complete-linkage algorithm, and an average-distance algorithm;
The single-linkage algorithm is expressed as:

d(P_1, P_2) = \min_{r_1 \in P_1, r_2 \in P_2} d(r_1, r_2)    (4)

The single-linkage algorithm takes the distance between the nearest objects in two clusters as the inter-cluster distance; merging terminates when the distance exceeds the set threshold, where r_1, r_2 are objects belonging to clusters P_1, P_2 respectively.
The complete-linkage algorithm is expressed as:

d(P_1, P_2) = \max_{r_1 \in P_1, r_2 \in P_2} d(r_1, r_2)    (5)

The complete-linkage algorithm takes the distance between the farthest objects in two clusters as the inter-cluster distance; merging terminates when the distance exceeds the set threshold, where r_1, r_2 are objects belonging to clusters P_1, P_2 respectively.
The average-distance algorithm is expressed as:

d(P_1, P_2) = d(q_1, q_2), \quad q_1 = \frac{1}{n_1} \sum_{r \in P_1} r, \quad q_2 = \frac{1}{n_2} \sum_{r \in P_2} r    (6)

where q_1, q_2 are the means of the two clusters and n_1, n_2 are the numbers of objects in the two clusters respectively.
In step S6, the cluster ranking and the selection of representative words that yield the labels proceed as follows:
Let H = (S, F) denote the directed graph formed by the words in the text, where S is the set of word nodes and F the set of edges; S_i denotes the i-th node and S_j the j-th node. The score of node S_i is computed by the following formula:

D(S_i) = (1 - v) + v \cdot \sum_{S_j \in In(S_i)} \frac{\omega_{ji} \cdot D(S_j)}{\sum_{S_k \in Out(S_j)} \omega_{jk}}    (7)

where v is the probability of jumping from a given node to a random node in the graph, taking a value between 0 and 1; In(S_i) is the set of nodes pointing to S_i; Out(S_j) is the set of nodes that S_j points to; and \omega_{ij} is the weight of the edge between two nodes. In effect a text contains k words; the word with the highest score is selected as the cluster's representative word, and the string of representative words corresponding to the clusters is the label.
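To show how steps S1 to S6 fit together, the following structural sketch wires up the illustrative helpers from the earlier sketches (select_candidates, cluster_candidates, textrank_scores); the two stubs defined here and every parameter value are placeholders assumed for this example, not part of the patent, and steps S1 to S3 are taken as already folded into training the classifier.

```python
import numpy as np

# Assumes select_candidates, cluster_candidates and textrank_scores from the
# sketches above are in scope; embed and build_word_graph are toy stubs.
def embed(words):
    return np.array([[len(w), hash(w) % 7] for w in words], float)    # toy word vectors

def build_word_graph(words):
    return {(a, b): 1.0 for a in words for b in words if a != b}      # complete digraph

def generate_labels(texts, classifier, top_n=20, threshold=0.8):
    candidates = select_candidates(texts, classifier, top_n)          # S4, formula (3)
    ids = cluster_candidates(embed(candidates), "single", threshold)  # S5 clustering
    labels = []
    for cid in sorted(set(ids)):                                      # S6: rank clusters
        members = [c for c, k in zip(candidates, ids) if k == cid]
        if len(members) == 1:
            labels.append(members[0])
            continue
        scores = textrank_scores(build_word_graph(members))           # formula (7)
        labels.append(max(scores, key=scores.get))                    # representative word
    return labels
```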
Furthermore, it should be noted that the specific embodiments described in this specification may differ in, for example, the shapes and names of their parts and components. All equivalent or simple changes made according to the construction, features, and principles described in the inventive concept of the present invention are included in the protection scope of the present patent. Those skilled in the art may make various modifications, additions, or substitutions to the described specific embodiments in a similar manner; provided they do not depart from the structure of the present invention or go beyond the scope defined in the claims, they shall all fall within the protection scope of the present invention.

Claims (7)

1. A method for automatically generating labels by text classification integrated with hierarchical clustering, characterized by comprising the following steps:
text preprocessing: preprocess the English and/or Chinese text to obtain words;
text representation: determine features from the words obtained by preprocessing, and then build a text representation model that can describe the text;
feature dimensionality reduction: perform feature selection, and reduce the dimensionality of the selected features;
candidate-set selection: after dimensionality reduction, extract the corresponding categories as features according to their discrimination, extract a text collection, run predictions on it, and choose the categories that occur most frequently as the candidate set;
cluster analysis: perform cluster analysis on the candidates that occur most frequently to obtain clusters;
cluster ranking and representative-word selection: choose the highest-scoring word of each cluster as the cluster's representative word, then rank the clusters; the corresponding string of cluster representative words is the automatically generated sequence of labels.
2. The method for automatically generating labels by text classification integrated with hierarchical clustering according to claim 1, characterized in that the text representation model is obtained by normalization and is expressed as

w(u,\bar{d}) = \frac{tf(u,\bar{d}) \cdot \log(N/n_u + 0.01)}{\sqrt{\sum_{u=1}^{n} \left[ tf(u,\bar{d}) \cdot \log(N/n_u + 0.01) \right]^2}}    (1)

In formula (1), tf(u,\bar{d}) is the term frequency of term u in the text, n_u is the number of texts in the training corpus that contain term u, N is the total number of texts in the training corpus, and w(u,\bar{d}) is the text representation model.
3. The method for automatically generating labels by text classification integrated with hierarchical clustering according to claim 2, characterized in that the feature selection and the dimensionality reduction of the selected features proceed as follows:
compute the degree of correlation between a feature item and a text category with the following formula:

\chi^2(u,e) = \frac{N \cdot (WZ - XY)^2}{(W+Y)(X+Z)(W+X)(Y+Z)}    (2)

formula (2) computes the degree of correlation, where \chi^2(u,e) denotes the correlation between feature item u and text category e; W is the number of texts that contain feature item u and belong to category e, X is the number of texts that contain u but do not belong to category e, Y is the number of texts that belong to category e but do not contain u, Z is the number of texts that neither belong to category e nor contain u, and N is the total number of texts in the training corpus;
then perform category processing on the determined text category e; the process is:
for any document w in the text collection of category e, generate a document length L, where L obeys a Poisson distribution;
for document w, sample a multinomial distribution over the k latent topics of the document;
consider each word in the document, thereby further obtaining more accurate text categories.
4. The method for automatically generating labels by text classification integrated with hierarchical clustering according to claim 3, characterized in that the candidate-set selection proceeds as follows:
according to the magnitude of the discrimination of the refined text categories, find the salient features;
then extract a text collection V = {v_1, ..., v_n} of n texts and predict with the trained text classifier, obtaining the predicted category list L = {l_1, ..., l_n} for the n texts; over the predicted category list define a counter count(x, L), x \in C, where C denotes the candidate set, which returns the number of times x occurs in the list,

Rank(c) = count(c, L), c \in C    (3)

where n is a natural number and c is a candidate word; sort the ranks from high to low and choose the top n entries; the chosen Top(n) is exactly the candidate set C.
5. The method for automatically generating labels by text classification integrated with hierarchical clustering according to claim 4, characterized in that the detailed process of the cluster analysis is: perform hierarchical clustering on the resulting Top(n) candidate set.
6. The method for automatically generating labels by text classification integrated with hierarchical clustering according to claim 5, characterized in that, depending on the distance measure used, the hierarchical clustering of the Top(n) candidate set includes a single-linkage algorithm, a complete-linkage algorithm, and an average-distance algorithm;
the single-linkage algorithm is expressed as:

d(P_1, P_2) = \min_{r_1 \in P_1, r_2 \in P_2} d(r_1, r_2)    (4)

the single-linkage algorithm takes the distance between the nearest objects in two clusters as the inter-cluster distance; merging terminates when the distance exceeds the set threshold, where r_1, r_2 are objects belonging to clusters P_1, P_2 respectively;
the complete-linkage algorithm is expressed as:

d(P_1, P_2) = \max_{r_1 \in P_1, r_2 \in P_2} d(r_1, r_2)    (5)

the complete-linkage algorithm takes the distance between the farthest objects in two clusters as the inter-cluster distance; merging terminates when the distance exceeds the set threshold, where r_1, r_2 are objects belonging to clusters P_1, P_2 respectively;
the average-distance algorithm is expressed as:

d(P_1, P_2) = d(q_1, q_2), \quad q_1 = \frac{1}{n_1} \sum_{r \in P_1} r, \quad q_2 = \frac{1}{n_2} \sum_{r \in P_2} r    (6)

where q_1, q_2 are the means of the two clusters and n_1, n_2 are the numbers of objects in the two clusters respectively.
7. The method for automatically generating labels by text classification integrated with hierarchical clustering according to claim 6, characterized in that the cluster ranking and the selection of representative words that yield the labels proceed as follows:
let H = (S, F) denote the directed graph formed by the words in the text, where S is the set of word nodes and F the set of edges; S_i denotes the i-th node and S_j the j-th node; the score of node S_i is computed by the following formula:

D(S_i) = (1 - v) + v \cdot \sum_{S_j \in In(S_i)} \frac{\omega_{ji} \cdot D(S_j)}{\sum_{S_k \in Out(S_j)} \omega_{jk}}    (7)

where v is the probability of jumping from a given node to a random node in the graph, taking a value between 0 and 1; In(S_i) is the set of nodes pointing to S_i; Out(S_j) is the set of nodes that S_j points to; and \omega_{ij} is the weight of the edge between two nodes; in effect a text contains k words; the word with the highest score is selected as the cluster's representative word, and the string of representative words corresponding to the clusters is the label.
CN201710249462.0A 2017-04-17 2017-04-17 Method for automatically generating labels by text classification integrated with hierarchical clustering Pending CN107180075A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710249462.0A CN107180075A (en) 2017-04-17 2017-04-17 Method for automatically generating labels by text classification integrated with hierarchical clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710249462.0A CN107180075A (en) 2017-04-17 2017-04-17 Method for automatically generating labels by text classification integrated with hierarchical clustering

Publications (1)

Publication Number Publication Date
CN107180075A true CN107180075A (en) 2017-09-19

Family

ID=59831984

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710249462.0A Pending CN107180075A (en) Method for automatically generating labels by text classification integrated with hierarchical clustering

Country Status (1)

Country Link
CN (1) CN107180075A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107784105A * 2017-10-26 2018-03-09 平安科技(深圳)有限公司 Knowledge base construction method, electronic apparatus and storage medium based on massive questions
CN108062377A * 2017-12-12 2018-05-22 百度在线网络技术(北京)有限公司 Method, apparatus, device and medium for establishing a labeled picture set and determining labels
CN108446276A * 2018-03-21 2018-08-24 腾讯音乐娱乐科技(深圳)有限公司 Method and apparatus for determining song keywords
CN108595585A * 2018-04-18 2018-09-28 平安科技(深圳)有限公司 Sample data classification method, model training method, electronic device and storage medium
CN110188189A * 2019-05-21 2019-08-30 浙江工商大学 Knowledge-based method for extracting document abstracts with an adaptive event-index cognitive model
CN110297901A * 2019-05-14 2019-10-01 广州数说故事信息科技有限公司 Large-scale text clustering method based on distance parameters
CN111738009A * 2019-03-19 2020-10-02 百度在线网络技术(北京)有限公司 Method and device for generating entity word labels, computer equipment and readable storage medium
CN111797945A (en) * 2020-08-21 2020-10-20 成都数联铭品科技有限公司 Text classification method
CN112860900A (en) * 2021-03-23 2021-05-28 上海壁仞智能科技有限公司 Text classification method and device, electronic equipment and storage medium
CN114443850A * 2022-04-06 2022-05-06 杭州费尔斯通科技有限公司 Label generation method, system, device and medium based on a semantic similarity model
CN114676796A (en) * 2022-05-27 2022-06-28 浙江清大科技有限公司 Clustering acquisition and identification system based on big data

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101174273A (en) * 2007-12-04 2008-05-07 清华大学 News event detecting method based on metadata analysis
CN104391835A (en) * 2014-09-30 2015-03-04 中南大学 Method and device for selecting feature words in texts
CN104834940A (en) * 2015-05-12 2015-08-12 杭州电子科技大学 Medical image inspection disease classification method based on support vector machine (SVM)
CN106104512A * 2013-09-19 2016-11-09 西斯摩斯公司 System and method for actively obtaining social data

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101174273A (en) * 2007-12-04 2008-05-07 清华大学 News event detecting method based on metadata analysis
CN106104512A * 2013-09-19 2016-11-09 西斯摩斯公司 System and method for actively obtaining social data
CN104391835A (en) * 2014-09-30 2015-03-04 中南大学 Method and device for selecting feature words in texts
CN104834940A (en) * 2015-05-12 2015-08-12 杭州电子科技大学 Medical image inspection disease classification method based on support vector machine (SVM)

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
吕海燕 et al., "Automatic generation of microblog user tags based on cluster analysis", 《电子设计工程》 *
宋巍 et al., "User interest identification based on microblog classification", 《智能计算机与应用》 *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107784105A * 2017-10-26 2018-03-09 平安科技(深圳)有限公司 Knowledge base construction method, electronic apparatus and storage medium based on massive questions
CN108062377A * 2017-12-12 2018-05-22 百度在线网络技术(北京)有限公司 Method, apparatus, device and medium for establishing a labeled picture set and determining labels
CN108446276A * 2018-03-21 2018-08-24 腾讯音乐娱乐科技(深圳)有限公司 Method and apparatus for determining song keywords
CN108595585A * 2018-04-18 2018-09-28 平安科技(深圳)有限公司 Sample data classification method, model training method, electronic device and storage medium
CN111738009A * 2019-03-19 2020-10-02 百度在线网络技术(北京)有限公司 Method and device for generating entity word labels, computer equipment and readable storage medium
CN111738009B * 2019-03-19 2023-10-20 百度在线网络技术(北京)有限公司 Method and device for generating entity word labels, computer equipment and readable storage medium
CN110297901A * 2019-05-14 2019-10-01 广州数说故事信息科技有限公司 Large-scale text clustering method based on distance parameters
CN110297901B * 2019-05-14 2023-11-17 广州数说故事信息科技有限公司 Large-scale text clustering method based on distance parameters
CN110188189B * 2019-05-21 2021-10-08 浙江工商大学 Knowledge-based method for extracting document abstracts with an adaptive event-index cognitive model
CN110188189A * 2019-05-21 2019-08-30 浙江工商大学 Knowledge-based method for extracting document abstracts with an adaptive event-index cognitive model
CN111797945B (en) * 2020-08-21 2020-12-15 成都数联铭品科技有限公司 Text classification method
CN111797945A (en) * 2020-08-21 2020-10-20 成都数联铭品科技有限公司 Text classification method
CN112860900A (en) * 2021-03-23 2021-05-28 上海壁仞智能科技有限公司 Text classification method and device, electronic equipment and storage medium
CN114443850A * 2022-04-06 2022-05-06 杭州费尔斯通科技有限公司 Label generation method, system, device and medium based on a semantic similarity model
CN114443850B * 2022-04-06 2022-07-22 杭州费尔斯通科技有限公司 Label generation method, system, device and medium based on a semantic similarity model
CN114676796A (en) * 2022-05-27 2022-06-28 浙江清大科技有限公司 Clustering acquisition and identification system based on big data
CN114676796B (en) * 2022-05-27 2022-09-06 浙江清大科技有限公司 Clustering acquisition and identification system based on big data

Similar Documents

Publication Publication Date Title
CN107180075A (en) Method for automatically generating labels by text classification integrated with hierarchical clustering
CN107609121B (en) News text classification method based on LDA and word2vec algorithm
Wang et al. A hybrid document feature extraction method using latent Dirichlet allocation and word2vec
Papagiannopoulou et al. Local word vectors guiding keyphrase extraction
Abdelhamid et al. Associative classification approaches: review and comparison
Santra et al. Genetic algorithm and confusion matrix for document clustering
US20060288275A1 (en) Method for classifying sub-trees in semi-structured documents
CN109325231A (en) Method for generating word vectors with a multi-task model
CN110209808A (en) Event generation method and related apparatus based on text information
CN103927302A (en) Text classification method and system
KR20190135129A (en) Apparatus and Method for Documents Classification Using Documents Organization and Deep Learning
CN105205163B (en) Multi-level binary classification method with incremental learning for science and technology news
CN106503153B (en) Computer text classification system
CN106339459B (en) Method for pre-classifying Chinese web pages based on keyword matching
CN108228612A (en) Method and device for extracting network event keywords and sentiment tendency
Du et al. Hierarchy construction and text classification based on the relaxation strategy and least information model
CN108733669A (en) Personalized digital media content recommendation system and method based on word vectors
Castano et al. Classifying and reusing conceptual schemas
Ding et al. The research of text mining based on self-organizing maps
Marath et al. Large-scale web page classification
Denzler et al. Granular knowledge cube
Kim et al. An intelligent information system for organizing online text documents
CN107391674B (en) New type mining method and device
Golam Sohrab et al. EDGE2VEC: Edge representations for large-scale scalable hierarchical learning
Avancini et al. Organizing digital libraries by automated text categorization

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170919

RJ01 Rejection of invention patent application after publication