CN107180075A - Method for automatically generating labels by text classification integrated with hierarchical clustering - Google Patents

Method for automatically generating labels by text classification integrated with hierarchical clustering

Info

Publication number
CN107180075A
CN107180075A
Authority
CN
China
Prior art keywords
text
cluster
clustering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710249462.0A
Other languages
Chinese (zh)
Inventor
刘东升
许翀寰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Gongshang University
Original Assignee
Zhejiang Gongshang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Gongshang University
Priority to CN201710249462.0A
Publication of CN107180075A
Legal status: Pending

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G06F16/3335 Syntactic pre-processing, e.g. stopword elimination, stemming
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for automatically generating labels by text classification integrated with hierarchical clustering, comprising the following steps: text preprocessing; text representation; feature dimensionality reduction; candidate-set selection; cluster analysis, in which cluster analysis is performed on the candidates that occur most frequently to obtain clusters; and cluster ranking with representative-word selection, in which the highest-scoring word of each cluster is chosen as the cluster's representative word, the clusters are then ranked, and the corresponding string of cluster representative words is the automatically generated sequence of labels. The invention builds a training corpus manually and uses the target categories as the candidate set; the candidates are then treated as keywords for cluster analysis: similarities are computed, clusters are formed and ranked, and representative words are selected to obtain the final user labels. By integrating the classification-based and keyword-based methods, more accurate labels can be generated, with a particularly significant effect on large-scale, sparse, or complex data.

Description

Method for automatically generating labels by text classification integrated with hierarchical clustering
Technical field
The present invention relates to the field of big-data algorithms, and more particularly to a method for automatically generating labels by text classification integrated with hierarchical clustering.
Background art
In the big-data era, more and more Internet services, such as microblogs and QQ, have emerged, pouring massive heterogeneous information into the network. The "label" arose as a form of information description intended to strengthen the management and use of this information. Labels help us grasp the topics and content of all kinds of resources more effectively, and they also facilitate the discovery, management, dissemination, and utilization of information. Two elements are key when using labels to describe information resources: obtaining labels and controlling label quality. The quality and quantity of labels strongly affect their descriptive power. A method for automatic label generation is useless if it cannot actually produce labels, and label quality is an equally important index. Label quality can be judged along two dimensions: first, whether the generated result reflects the intrinsic attributes or interests of the article or person in question; second, whether the generated result is suitable for use as a label. The more widely used baseline systems can, in any case, basically accomplish this goal, but because of certain one-sided aspects of those methods (for example, the inability to avoid the accumulation of synonymous labels), they cannot generate sufficiently accurate labels. Moreover, some traditional data analysis and mining techniques cannot meet current technical requirements, which makes label generation difficult to realize.
Existing label generation methods include classification-based methods, generation methods based on resources such as Baidu Baike, and keyword-based TextRank methods. Most of them extract the more important words in order to generate labels; some exploit encyclopedia entry information and choose fine-grained categories that reflect certain attributes as labels. Almost all of these methods suffer from the accumulation of synonymous labels.
Summary of the invention
The present invention overcomes the drawback of the prior art that synonymous labels easily accumulate, and provides a method for automatically generating labels by text classification integrated with hierarchical clustering.
To solve the above technical problem, the present invention is achieved through the following technical solution:
A method for automatically generating labels by text classification integrated with hierarchical clustering comprises the following steps:
Text preprocessing: preprocess the English and/or Chinese text to obtain words;
Text representation: determine features from the words obtained by preprocessing, and then build a text representation model that can describe the text;
Feature dimensionality reduction: perform feature selection, and reduce the dimensionality of the selected features;
Candidate-set selection: after dimensionality reduction, extract the corresponding categories as features according to their discrimination, extract a text collection, run predictions on it, and choose the categories that occur most frequently as the candidate set;
Cluster analysis: perform cluster analysis on the candidates that occur most frequently to obtain clusters;
Cluster ranking and representative-word selection: choose the highest-scoring word of each cluster as the cluster's representative word, then rank the clusters; the corresponding string of cluster representative words is the automatically generated sequence of labels.
In one embodiment, the text representation model is obtained by normalization and is expressed as

w(u,\bar{d}) = \frac{tf(u,\bar{d}) \cdot \log(N/n_u + 0.01)}{\sqrt{\sum_{u=1}^{n} \left[ tf(u,\bar{d}) \cdot \log(N/n_u + 0.01) \right]^2}}    (1)

In formula (1), tf(u,\bar{d}) is the term frequency of term u in the text, n_u is the number of texts in the training corpus that contain term u, N is the total number of texts in the training corpus, and w(u,\bar{d}) is the text representation model, i.e. the weight of term u.
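For illustration only, the following Python sketch computes the normalized TF-IDF weights of formula (1); the toy corpus, the tokenized input, and all function names are assumptions of this example, not part of the patent.

```python
import math

def tfidf_weights(doc_terms, corpus):
    """Formula (1): weight each term u by tf(u,d) * log(N/n_u + 0.01),
    then L2-normalize the weights over the document."""
    N = len(corpus)                              # total number of texts in the corpus
    df = {}                                      # n_u: number of texts containing term u
    for doc in corpus:
        for u in set(doc):
            df[u] = df.get(u, 0) + 1
    raw = {}
    for u in set(doc_terms):
        tf = doc_terms.count(u)                  # term frequency of u in this text
        raw[u] = tf * math.log(N / df.get(u, 1) + 0.01)
    norm = math.sqrt(sum(x * x for x in raw.values()))
    return {u: x / norm for u, x in raw.items()} if norm else raw

# Toy usage:
corpus = [["label", "text", "cluster"], ["text", "classify"], ["cluster", "rank"]]
print(tfidf_weights(["text", "cluster", "cluster"], corpus))
```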
In one embodiment, the feature selection and the dimensionality reduction of the selected features proceed as follows:
Compute the degree of correlation between a feature item and a text category with the following formula:

\chi^2(u,e) = \frac{N \cdot (WZ - XY)^2}{(W+Y)(X+Z)(W+X)(Y+Z)}    (2)

Formula (2) computes the degree of correlation, where \chi^2(u,e) denotes the correlation between feature item u and text category e. Here W is the number of texts that contain feature item u and belong to category e, X is the number of texts that contain u but do not belong to category e, Y is the number of texts that belong to category e but do not contain u, Z is the number of texts that neither belong to category e nor contain u, and N is the total number of texts in the training corpus;
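A minimal sketch of the \chi^2 statistic of formula (2), assuming one category label per text; the function signature and variable names are illustrative.

```python
def chi_square(corpus, labels, u, e):
    """Formula (2): chi-square correlation between feature item u and category e.
    corpus: list of token lists; labels: the category of each text."""
    W = X = Y = Z = 0
    for doc, cat in zip(corpus, labels):
        has_u, in_e = u in doc, cat == e
        if has_u and in_e:
            W += 1          # contains u and belongs to e
        elif has_u:
            X += 1          # contains u but is outside e
        elif in_e:
            Y += 1          # belongs to e but lacks u
        else:
            Z += 1          # neither contains u nor belongs to e
    N = W + X + Y + Z
    denom = (W + Y) * (X + Z) * (W + X) * (Y + Z)
    return N * (W * Z - X * Y) ** 2 / denom if denom else 0.0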
Then perform category processing on the determined text category e; the process is:
for any document w in the text collection of category e, generate a document length L, where L obeys a Poisson distribution;
for document w, sample a multinomial distribution over the k latent topics of the document;
consider each word in the document, thereby further obtaining more accurate text categories.
In one embodiment, the candidate-set selection proceeds as follows (see the sketch after this passage):
according to the magnitude of the discrimination of the refined text categories, find the salient features;
then extract a text collection V = {v_1, ..., v_n} of n texts and predict with the trained text classifier, obtaining the predicted category list L = {l_1, ..., l_n} for the n texts; over the predicted category list define a counter count(x, L), x \in C, where C denotes the candidate set, which returns the number of times x occurs in the list,

Rank(c) = count(c, L), c \in C    (3)

where n is a natural number and c is a candidate word; sort the ranks from high to low and choose the top n entries; the chosen Top(n) is exactly the candidate set C.
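A sketch of the counting-and-ranking step of formula (3); the classifier interface (classifier.predict) and the Top(n) cutoff are assumptions of this example.

```python
from collections import Counter

def select_candidates(texts, classifier, top_n):
    """Formula (3): predict a category for each text, count how often each
    category appears in the prediction list L (Rank(c) = count(c, L)),
    and keep the Top(n) most frequent categories as the candidate set C."""
    predicted = [classifier.predict(t) for t in texts]   # prediction list L
    rank = Counter(predicted)                            # count(c, L) for every c
    return [c for c, _ in rank.most_common(top_n)]       # candidate set C
```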
In one embodiment, the detailed process of the cluster analysis is: perform hierarchical clustering on the resulting Top(n) candidate set.
In one embodiment, depending on the distance measure used, the hierarchical clustering of the Top(n) candidate set includes a single-linkage algorithm, a complete-linkage algorithm, and an average-distance algorithm;
The single-linkage algorithm is expressed as:

d(P_1, P_2) = \min_{r_1 \in P_1, r_2 \in P_2} d(r_1, r_2)    (4)

The single-linkage algorithm takes the distance between the nearest objects in two clusters as the inter-cluster distance; merging terminates when the distance exceeds the set threshold, where r_1, r_2 are objects belonging to clusters P_1, P_2 respectively.
The complete-linkage algorithm is expressed as:

d(P_1, P_2) = \max_{r_1 \in P_1, r_2 \in P_2} d(r_1, r_2)    (5)

The complete-linkage algorithm takes the distance between the farthest objects in two clusters as the inter-cluster distance; merging terminates when the distance exceeds the set threshold, where r_1, r_2 are objects belonging to clusters P_1, P_2 respectively.
The average-distance algorithm is expressed as:

d(P_1, P_2) = d(q_1, q_2), \quad q_1 = \frac{1}{n_1} \sum_{r \in P_1} r, \quad q_2 = \frac{1}{n_2} \sum_{r \in P_2} r    (6)

where q_1, q_2 are the means of the two clusters and n_1, n_2 are the numbers of objects in the two clusters respectively.
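A sketch of agglomerative clustering under the three linkages of formulas (4) to (6), using SciPy; the word vectors and the distance threshold are assumptions (note that SciPy's 'average' linkage averages pairwise distances, while its 'centroid' linkage matches the cluster-mean distance of formula (6)).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_candidates(vectors, method="single", threshold=1.0):
    """Agglomerative clustering of candidate-word vectors.
    method: 'single' (formula 4), 'complete' (formula 5),
    or 'centroid' for the cluster-mean distance of formula (6).
    Merging stops once the inter-cluster distance exceeds the threshold."""
    Z = linkage(vectors, method=method, metric="euclidean")
    return fcluster(Z, t=threshold, criterion="distance")   # cluster id per word

# Toy usage with random vectors standing in for candidate-word embeddings:
rng = np.random.default_rng(0)
print(cluster_candidates(rng.random((10, 5)), method="single", threshold=0.8))
```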
In one embodiment, the cluster ranking and the selection of representative words that yield the labels proceed as follows:
Let H = (S, F) denote the directed graph formed by the words in the text, where S is the set of word nodes and F the set of edges; S_i denotes the i-th node and S_j the j-th node. The score of node S_i is computed by the following formula:

D(S_i) = (1 - v) + v \cdot \sum_{S_j \in In(S_i)} \frac{\omega_{ji} \cdot D(S_j)}{\sum_{S_k \in Out(S_j)} \omega_{jk}}    (7)

where v is the probability of jumping from a given node to a random node in the graph, taking a value between 0 and 1; In(S_i) is the set of nodes pointing to S_i; Out(S_j) is the set of nodes that S_j points to; and \omega_{ij} is the weight of the edge between two nodes. In effect a text contains k words; the word with the highest score is selected as the cluster's representative word, and the string of representative words corresponding to the clusters is the label.
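A minimal sketch of the iterative scoring of formula (7); how the word graph is built (for example, from a co-occurrence window) and the iteration count are assumptions of this example.

```python
def textrank_scores(edges, v=0.85, iters=50):
    """Formula (7): iterate D(S_i) = (1 - v) + v * sum over in-neighbors S_j
    of w_ji * D(S_j) / sum_k w_jk.
    edges: dict mapping (j, i) -> weight w_ji of the directed edge j -> i."""
    nodes = {n for e in edges for n in e}
    out_sum = {n: 0.0 for n in nodes}             # sum_k w_jk for each source node j
    for (j, _), w in edges.items():
        out_sum[j] += w
    D = {n: 1.0 for n in nodes}
    for _ in range(iters):
        D = {i: (1 - v) + v * sum(w * D[j] / out_sum[j]
                                  for (j, t), w in edges.items() if t == i)
             for i in nodes}
    return D

# Toy usage: the highest-scoring word of a cluster becomes its representative.
edges = {("text", "label"): 1.0, ("label", "text"): 1.0, ("cluster", "label"): 2.0}
scores = textrank_scores(edges)
print(max(scores, key=scores.get))
```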
By adopting the above technical solution, the present invention achieves significant technical effects:
Building on text-based label generation methods, the present invention proposes a method for automatically generating labels by cluster analysis based on text classification. The method builds a training corpus manually and uses the target categories as the candidate set; the candidates are then treated as keywords for cluster analysis: similarities are computed, clusters are formed and ranked, and representative words are selected to obtain the final user labels. By integrating the classification-based and keyword-based methods in this way, more accurate and higher-quality labels can be generated, with a particularly significant effect on large-scale, sparse, or complex data.
Brief description of the drawings
To explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of the overall flow of the present invention.
Detailed description of the embodiments
The present invention is described in further detail below with reference to embodiments; the following embodiments explain the invention, but the invention is not limited to them.
Embodiment 1:
A method for automatically generating labels by text classification integrated with hierarchical clustering, as shown in Fig. 1, comprises the following steps:
S1, text preprocessing: preprocess the English and/or Chinese text to obtain words;
S2, text representation: determine features from the words obtained by preprocessing, and then build a text representation model that can describe the text;
S3, feature dimensionality reduction: perform feature selection, and reduce the dimensionality of the selected features;
S4, candidate-set selection: after dimensionality reduction, extract the corresponding categories as features according to their discrimination, extract a text collection, run predictions on it, and choose the categories that occur most frequently as the candidate set;
S5, cluster analysis: perform cluster analysis on the candidates that occur most frequently to obtain clusters;
S6, cluster ranking and representative-word selection: choose the highest-scoring word of each cluster as the cluster's representative word, then rank the clusters; the corresponding string of cluster representative words is the automatically generated sequence of labels.
In S2, the text representation model is obtained by normalization and is expressed as

w(u,\bar{d}) = \frac{tf(u,\bar{d}) \cdot \log(N/n_u + 0.01)}{\sqrt{\sum_{u=1}^{n} \left[ tf(u,\bar{d}) \cdot \log(N/n_u + 0.01) \right]^2}}    (1)

In formula (1), tf(u,\bar{d}) is the term frequency of term u in the text, n_u is the number of texts in the training corpus that contain term u, N is the total number of texts in the training corpus, and w(u,\bar{d}) is the text representation model.
Text classification commonly suffers from excessively high feature dimensionality and data sparseness, so the features are first reduced in dimensionality and the processed features are then classified. The detailed process of step S3 is:
Compute the degree of correlation between a feature item and a text category with the following formula:

\chi^2(u,e) = \frac{N \cdot (WZ - XY)^2}{(W+Y)(X+Z)(W+X)(Y+Z)}    (2)

Formula (2) computes the degree of correlation, where \chi^2(u,e) denotes the correlation between feature item u and text category e. Here W is the number of texts that contain feature item u and belong to category e, X is the number of texts that contain u but do not belong to category e, Y is the number of texts that belong to category e but do not contain u, Z is the number of texts that neither belong to category e nor contain u, and N is the total number of texts in the training corpus;
Then perform category processing on the determined text category e. The processing applies a three-layer Bayesian probability model to words, topics, and documents (see the sketch after this list), and proceeds as follows:
for any document w in the text collection of category e, generate a document length L, where L obeys a Poisson distribution;
for document w, sample a multinomial distribution over the k latent topics of the document;
consider each word in the document, thereby further obtaining more accurate text categories.
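The three-layer word/topic/document model described here matches the generative process of latent Dirichlet allocation; the sketch below, with assumed hyperparameters, draws one document under that process.

```python
import numpy as np

def generate_document(beta, alpha=0.1, mean_length=50, rng=None):
    """Three-layer Bayesian sketch: draw a Poisson document length L, a
    multinomial mixture theta over k latent topics, then a topic and a
    word for each position. beta: (k, vocab_size) topic-word distributions."""
    rng = rng or np.random.default_rng()
    k = beta.shape[0]
    L = rng.poisson(mean_length)           # document length obeys a Poisson distribution
    theta = rng.dirichlet([alpha] * k)     # multinomial distribution over k latent topics
    words = []
    for _ in range(L):
        z = rng.choice(k, p=theta)         # latent topic of this word position
        words.append(rng.choice(beta.shape[1], p=beta[z]))
    return words

# Toy usage: 3 topics over a 20-word vocabulary.
beta = np.random.default_rng(1).dirichlet([0.5] * 20, size=3)
print(generate_document(beta)[:10])
```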
In step S4, the candidate-set selection proceeds as follows:
according to the magnitude of the discrimination of the refined text categories, find the salient features;
then extract a text collection V = {v_1, ..., v_n} of n texts and predict with the trained text classifier, obtaining the predicted category list L = {l_1, ..., l_n} for the n texts; over the predicted category list define a counter count(x, L), x \in C, where C denotes the candidate set, which returns the number of times x occurs in the list,

Rank(c) = count(c, L), c \in C    (3)

where n is a natural number and c is a candidate word; sort the ranks from high to low and choose the top n entries; the chosen Top(n) is exactly the candidate set C. Hierarchical clustering is then performed on the resulting Top(n) candidate set;
In the hierarchical clustering, depending on the distance measure used, the hierarchical clustering includes a single-linkage algorithm, a complete-linkage algorithm, and an average-distance algorithm;
The single-linkage algorithm is expressed as:

d(P_1, P_2) = \min_{r_1 \in P_1, r_2 \in P_2} d(r_1, r_2)    (4)

The single-linkage algorithm takes the distance between the nearest objects in two clusters as the inter-cluster distance; merging terminates when the distance exceeds the set threshold, where r_1, r_2 are objects belonging to clusters P_1, P_2 respectively.
The complete-linkage algorithm is expressed as:

d(P_1, P_2) = \max_{r_1 \in P_1, r_2 \in P_2} d(r_1, r_2)    (5)

The complete-linkage algorithm takes the distance between the farthest objects in two clusters as the inter-cluster distance; merging terminates when the distance exceeds the set threshold, where r_1, r_2 are objects belonging to clusters P_1, P_2 respectively.
The average-distance algorithm is expressed as:

d(P_1, P_2) = d(q_1, q_2), \quad q_1 = \frac{1}{n_1} \sum_{r \in P_1} r, \quad q_2 = \frac{1}{n_2} \sum_{r \in P_2} r    (6)

where q_1, q_2 are the means of the two clusters and n_1, n_2 are the numbers of objects in the two clusters respectively.
In step S6, the cluster ranking and the selection of representative words that yield the labels proceed as follows:
Let H = (S, F) denote the directed graph formed by the words in the text, where S is the set of word nodes and F the set of edges; S_i denotes the i-th node and S_j the j-th node. The score of node S_i is computed by the following formula:

D(S_i) = (1 - v) + v \cdot \sum_{S_j \in In(S_i)} \frac{\omega_{ji} \cdot D(S_j)}{\sum_{S_k \in Out(S_j)} \omega_{jk}}    (7)

where v is the probability of jumping from a given node to a random node in the graph, taking a value between 0 and 1; In(S_i) is the set of nodes pointing to S_i; Out(S_j) is the set of nodes that S_j points to; and \omega_{ij} is the weight of the edge between two nodes. In effect a text contains k words; the word with the highest score is selected as the cluster's representative word, and the string of representative words corresponding to the clusters is the label.
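To show how steps S1 to S6 fit together, the following structural sketch wires up the illustrative helpers from the earlier sketches (select_candidates, cluster_candidates, textrank_scores); the two stubs defined here and every parameter value are placeholders assumed for this example, not part of the patent, and steps S1 to S3 are taken as already folded into training the classifier.

```python
import numpy as np

# Assumes select_candidates, cluster_candidates and textrank_scores from the
# sketches above are in scope; embed and build_word_graph are toy stubs.
def embed(words):
    return np.array([[len(w), hash(w) % 7] for w in words], float)    # toy word vectors

def build_word_graph(words):
    return {(a, b): 1.0 for a in words for b in words if a != b}      # complete digraph

def generate_labels(texts, classifier, top_n=20, threshold=0.8):
    candidates = select_candidates(texts, classifier, top_n)          # S4, formula (3)
    ids = cluster_candidates(embed(candidates), "single", threshold)  # S5 clustering
    labels = []
    for cid in sorted(set(ids)):                                      # S6: rank clusters
        members = [c for c, k in zip(candidates, ids) if k == cid]
        if len(members) == 1:
            labels.append(members[0])
            continue
        scores = textrank_scores(build_word_graph(members))           # formula (7)
        labels.append(max(scores, key=scores.get))                    # representative word
    return labels
```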
Furthermore, it should be noted that the specific embodiments described in this specification may differ in, for example, the shapes and names of their parts and components. All equivalent or simple changes made according to the construction, features, and principles described in the inventive concept of the present invention are included in the protection scope of the present patent. Those skilled in the art may make various modifications, additions, or substitutions to the described specific embodiments in a similar manner; provided they do not depart from the structure of the present invention or go beyond the scope defined in the claims, they shall all fall within the protection scope of the present invention.

Claims (7)

1. A method for automatically generating labels by text classification integrated with hierarchical clustering, characterized by comprising the following steps:
text preprocessing: preprocess the English and/or Chinese text to obtain words;
text representation: determine features from the words obtained by preprocessing, and then build a text representation model that can describe the text;
feature dimensionality reduction: perform feature selection, and reduce the dimensionality of the selected features;
candidate-set selection: after dimensionality reduction, extract the corresponding categories as features according to their discrimination, extract a text collection, run predictions on it, and choose the categories that occur most frequently as the candidate set;
cluster analysis: perform cluster analysis on the candidates that occur most frequently to obtain clusters;
cluster ranking and representative-word selection: choose the highest-scoring word of each cluster as the cluster's representative word, then rank the clusters; the corresponding string of cluster representative words is the automatically generated sequence of labels.
2. The method for automatically generating labels by text classification integrated with hierarchical clustering according to claim 1, characterized in that the text representation model is obtained by normalization and is expressed as

w(u,\bar{d}) = \frac{tf(u,\bar{d}) \cdot \log(N/n_u + 0.01)}{\sqrt{\sum_{u=1}^{n} \left[ tf(u,\bar{d}) \cdot \log(N/n_u + 0.01) \right]^2}}    (1)

In formula (1), tf(u,\bar{d}) is the term frequency of term u in the text, n_u is the number of texts in the training corpus that contain term u, N is the total number of texts in the training corpus, and w(u,\bar{d}) is the text representation model.
3. The method for automatically generating labels by text classification integrated with hierarchical clustering according to claim 2, characterized in that the feature selection and the dimensionality reduction of the selected features proceed as follows:
compute the degree of correlation between a feature item and a text category with the following formula:

\chi^2(u,e) = \frac{N \cdot (WZ - XY)^2}{(W+Y)(X+Z)(W+X)(Y+Z)}    (2)

formula (2) computes the degree of correlation, where \chi^2(u,e) denotes the correlation between feature item u and text category e; W is the number of texts that contain feature item u and belong to category e, X is the number of texts that contain u but do not belong to category e, Y is the number of texts that belong to category e but do not contain u, Z is the number of texts that neither belong to category e nor contain u, and N is the total number of texts in the training corpus;
then perform category processing on the determined text category e; the process is:
for any document w in the text collection of category e, generate a document length L, where L obeys a Poisson distribution;
for document w, sample a multinomial distribution over the k latent topics of the document;
consider each word in the document, thereby further obtaining more accurate text categories.
4. The method for automatically generating labels by text classification integrated with hierarchical clustering according to claim 3, characterized in that the candidate-set selection proceeds as follows:
according to the magnitude of the discrimination of the refined text categories, find the salient features;
then extract a text collection V = {v_1, ..., v_n} of n texts and predict with the trained text classifier, obtaining the predicted category list L = {l_1, ..., l_n} for the n texts; over the predicted category list define a counter count(x, L), x \in C, where C denotes the candidate set, which returns the number of times x occurs in the list,

Rank(c) = count(c, L), c \in C    (3)

where n is a natural number and c is a candidate word; sort the ranks from high to low and choose the top n entries; the chosen Top(n) is exactly the candidate set C.
5. The method for automatically generating labels by text classification integrated with hierarchical clustering according to claim 4, characterized in that the detailed process of the cluster analysis is: perform hierarchical clustering on the resulting Top(n) candidate set.
6. The method for automatically generating labels by text classification integrated with hierarchical clustering according to claim 5, characterized in that, depending on the distance measure used, the hierarchical clustering of the Top(n) candidate set includes a single-linkage algorithm, a complete-linkage algorithm, and an average-distance algorithm;
the single-linkage algorithm is expressed as:

d(P_1, P_2) = \min_{r_1 \in P_1, r_2 \in P_2} d(r_1, r_2)    (4)

the single-linkage algorithm takes the distance between the nearest objects in two clusters as the inter-cluster distance; merging terminates when the distance exceeds the set threshold, where r_1, r_2 are objects belonging to clusters P_1, P_2 respectively;
the complete-linkage algorithm is expressed as:

d(P_1, P_2) = \max_{r_1 \in P_1, r_2 \in P_2} d(r_1, r_2)    (5)

the complete-linkage algorithm takes the distance between the farthest objects in two clusters as the inter-cluster distance; merging terminates when the distance exceeds the set threshold, where r_1, r_2 are objects belonging to clusters P_1, P_2 respectively;
the average-distance algorithm is expressed as:

d(P_1, P_2) = d(q_1, q_2), \quad q_1 = \frac{1}{n_1} \sum_{r \in P_1} r, \quad q_2 = \frac{1}{n_2} \sum_{r \in P_2} r    (6)

where q_1, q_2 are the means of the two clusters and n_1, n_2 are the numbers of objects in the two clusters respectively.
7. The method for automatically generating labels by text classification integrated with hierarchical clustering according to claim 6, characterized in that the cluster ranking and the selection of representative words that yield the labels proceed as follows:
let H = (S, F) denote the directed graph formed by the words in the text, where S is the set of word nodes and F the set of edges; S_i denotes the i-th node and S_j the j-th node; the score of node S_i is computed by the following formula:

D(S_i) = (1 - v) + v \cdot \sum_{S_j \in In(S_i)} \frac{\omega_{ji} \cdot D(S_j)}{\sum_{S_k \in Out(S_j)} \omega_{jk}}    (7)

where v is the probability of jumping from a given node to a random node in the graph, taking a value between 0 and 1; In(S_i) is the set of nodes pointing to S_i; Out(S_j) is the set of nodes that S_j points to; and \omega_{ij} is the weight of the edge between two nodes; in effect a text contains k words; the word with the highest score is selected as the cluster's representative word, and the string of representative words corresponding to the clusters is the label.
CN201710249462.0A 2017-04-17 2017-04-17 Method for automatically generating labels by text classification integrated with hierarchical clustering Pending CN107180075A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710249462.0A CN107180075A (en) 2017-04-17 2017-04-17 Method for automatically generating labels by text classification integrated with hierarchical clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710249462.0A CN107180075A (en) 2017-04-17 2017-04-17 Method for automatically generating labels by text classification integrated with hierarchical clustering

Publications (1)

Publication Number Publication Date
CN107180075A true CN107180075A (en) 2017-09-19

Family

ID=59831984

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710249462.0A Pending CN107180075A (en) Method for automatically generating labels by text classification integrated with hierarchical clustering

Country Status (1)

Country Link
CN (1) CN107180075A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107784105A * 2017-10-26 2018-03-09 平安科技(深圳)有限公司 Knowledge base construction method, electronic apparatus and storage medium based on massive questions
CN108062377A * 2017-12-12 2018-05-22 百度在线网络技术(北京)有限公司 Method, apparatus, device and medium for establishing a labeled picture set and determining labels
CN108446276A * 2018-03-21 2018-08-24 腾讯音乐娱乐科技(深圳)有限公司 Method and apparatus for determining song keywords
CN108595585A * 2018-04-18 2018-09-28 平安科技(深圳)有限公司 Sample data classification method, model training method, electronic device and storage medium
CN110188189A * 2019-05-21 2019-08-30 浙江工商大学 Knowledge-based method for extracting document abstracts with an adaptive event-index cognitive model
CN110297901A * 2019-05-14 2019-10-01 广州数说故事信息科技有限公司 Large-scale text clustering method based on distance parameters
CN111738009A * 2019-03-19 2020-10-02 百度在线网络技术(北京)有限公司 Method and device for generating entity word labels, computer equipment and readable storage medium
CN111797945A (en) * 2020-08-21 2020-10-20 成都数联铭品科技有限公司 Text classification method
CN112860900A (en) * 2021-03-23 2021-05-28 上海壁仞智能科技有限公司 Text classification method and device, electronic equipment and storage medium
CN114443850A * 2022-04-06 2022-05-06 杭州费尔斯通科技有限公司 Label generation method, system, device and medium based on a semantic similarity model
CN114676796A (en) * 2022-05-27 2022-06-28 浙江清大科技有限公司 Clustering acquisition and identification system based on big data

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101174273A (en) * 2007-12-04 2008-05-07 清华大学 News event detecting method based on metadata analysis
CN104391835A (en) * 2014-09-30 2015-03-04 中南大学 Method and device for selecting feature words in texts
CN104834940A (en) * 2015-05-12 2015-08-12 杭州电子科技大学 Medical image inspection disease classification method based on support vector machine (SVM)
CN106104512A * 2013-09-19 2016-11-09 西斯摩斯公司 System and method for actively obtaining social data

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101174273A (en) * 2007-12-04 2008-05-07 清华大学 News event detecting method based on metadata analysis
CN106104512A * 2013-09-19 2016-11-09 西斯摩斯公司 System and method for actively obtaining social data
CN104391835A (en) * 2014-09-30 2015-03-04 中南大学 Method and device for selecting feature words in texts
CN104834940A (en) * 2015-05-12 2015-08-12 杭州电子科技大学 Medical image inspection disease classification method based on support vector machine (SVM)

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
吕海燕 et al., "Automatic generation of microblog user tags based on cluster analysis", 《电子设计工程》 *
宋巍 et al., "User interest identification based on microblog classification", 《智能计算机与应用》 *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107784105A * 2017-10-26 2018-03-09 平安科技(深圳)有限公司 Knowledge base construction method, electronic apparatus and storage medium based on massive questions
CN108062377A * 2017-12-12 2018-05-22 百度在线网络技术(北京)有限公司 Method, apparatus, device and medium for establishing a labeled picture set and determining labels
CN108446276A * 2018-03-21 2018-08-24 腾讯音乐娱乐科技(深圳)有限公司 Method and apparatus for determining song keywords
CN108595585A * 2018-04-18 2018-09-28 平安科技(深圳)有限公司 Sample data classification method, model training method, electronic device and storage medium
CN111738009A * 2019-03-19 2020-10-02 百度在线网络技术(北京)有限公司 Method and device for generating entity word labels, computer equipment and readable storage medium
CN111738009B * 2019-03-19 2023-10-20 百度在线网络技术(北京)有限公司 Method and device for generating entity word labels, computer equipment and readable storage medium
CN110297901A * 2019-05-14 2019-10-01 广州数说故事信息科技有限公司 Large-scale text clustering method based on distance parameters
CN110297901B * 2019-05-14 2023-11-17 广州数说故事信息科技有限公司 Large-scale text clustering method based on distance parameters
CN110188189B * 2019-05-21 2021-10-08 浙江工商大学 Knowledge-based method for extracting document abstracts with an adaptive event-index cognitive model
CN110188189A * 2019-05-21 2019-08-30 浙江工商大学 Knowledge-based method for extracting document abstracts with an adaptive event-index cognitive model
CN111797945B (en) * 2020-08-21 2020-12-15 成都数联铭品科技有限公司 Text classification method
CN111797945A (en) * 2020-08-21 2020-10-20 成都数联铭品科技有限公司 Text classification method
CN112860900A (en) * 2021-03-23 2021-05-28 上海壁仞智能科技有限公司 Text classification method and device, electronic equipment and storage medium
CN114443850A * 2022-04-06 2022-05-06 杭州费尔斯通科技有限公司 Label generation method, system, device and medium based on a semantic similarity model
CN114443850B * 2022-04-06 2022-07-22 杭州费尔斯通科技有限公司 Label generation method, system, device and medium based on a semantic similarity model
CN114676796A (en) * 2022-05-27 2022-06-28 浙江清大科技有限公司 Clustering acquisition and identification system based on big data
CN114676796B (en) * 2022-05-27 2022-09-06 浙江清大科技有限公司 Clustering acquisition and identification system based on big data

Similar Documents

Publication Publication Date Title
CN107180075A (en) Method for automatically generating labels by text classification integrated with hierarchical clustering
CN107609121B (en) News text classification method based on LDA and word2vec algorithm
Wang et al. A hybrid document feature extraction method using latent Dirichlet allocation and word2vec
Papagiannopoulou et al. Local word vectors guiding keyphrase extraction
Abdelhamid et al. Associative classification approaches: review and comparison
Santra et al. Genetic algorithm and confusion matrix for document clustering
US20060288275A1 (en) Method for classifying sub-trees in semi-structured documents
CN109325231A (en) Method for generating word vectors with a multi-task model
CN110209808A (en) Event generation method and related apparatus based on text information
CN103927302A (en) Text classification method and system
KR20190135129A (en) Apparatus and Method for Documents Classification Using Documents Organization and Deep Learning
CN105205163B (en) Multi-level binary classification method with incremental learning for science and technology news
CN106503153B (en) Computer text classification system
CN106339459B (en) Method for pre-classifying Chinese web pages based on keyword matching
CN108228612A (en) Method and device for extracting network event keywords and sentiment tendency
Du et al. Hierarchy construction and text classification based on the relaxation strategy and least information model
CN108733669A (en) Personalized digital media content recommendation system and method based on word vectors
Castano et al. Classifying and reusing conceptual schemas
Ding et al. The research of text mining based on self-organizing maps
Marath et al. Large-scale web page classification
Denzler et al. Granular knowledge cube
Kim et al. An intelligent information system for organizing online text documents
CN107391674B (en) New type mining method and device
Golam Sohrab et al. EDGE2VEC: Edge representations for large-scale scalable hierarchical learning
Avancini et al. Organizing digital libraries by automated text categorization

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170919

RJ01 Rejection of invention patent application after publication