CN107180075A - Automatic label generation method integrating text classification and hierarchical clustering - Google Patents
Automatic label generation method integrating text classification and hierarchical clustering
- Publication number
- CN107180075A CN107180075A CN201710249462.0A CN201710249462A CN107180075A CN 107180075 A CN107180075 A CN 107180075A CN 201710249462 A CN201710249462 A CN 201710249462A CN 107180075 A CN107180075 A CN 107180075A
- Authority
- CN
- China
- Prior art keywords
- text
- cluster
- clustering
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3334—Selection or weighting of terms from queries, including natural language queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3335—Syntactic pre-processing, e.g. stopword elimination, stemming
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an automatic label generation method that integrates text classification with hierarchical clustering, comprising the following steps: text preprocessing; text representation; feature dimension reduction; candidate set selection; cluster analysis, in which the most frequently predicted candidates are clustered to obtain clusters; and ranking the clusters and selecting cluster representative words to obtain labels, in which the highest-scoring word in each cluster is chosen as its representative, the clusters are ranked, and the string of corresponding representative words forms the automatically generated label sequence. The invention builds a training corpus manually and uses the target categories as the candidate set; the candidates are then treated as keywords for clustering, similarities are computed, cluster analysis is performed, the clusters are ranked, and representative words are selected to finally obtain the user labels. By integrating classification-based and keyword-based methods, more accurate labels can be generated, with particularly significant benefits for large-scale, sparse, or complex data.
Description
Technical field
The present invention relates to the field of big data algorithms, and in particular to an automatic label generation method integrating text classification and hierarchical clustering.
Background art
In the big data era, ever more Internet services are emerging, such as Weibo and QQ, and the Internet is flooded with massive heterogeneous information. The "label" (tag) arose as a form of information description to strengthen the management and use of that information. Labels help us grasp the themes and contents of all kinds of resources more effectively, and also aid the discovery, management, dissemination, and use of information. Describing an information resource with labels involves two key elements: obtaining the labels and controlling their quality. The quality and quantity of labels strongly affect their descriptive power. For an automatic generation method, producing labels at all is not enough; label quality is an equally important criterion. Quality can be judged along two dimensions: first, whether the generated result reflects the intrinsic attributes or interests of the article or person in question; second, whether the result is suitable for use as a label. Broadly applicable baseline systems can currently accomplish the basic task, but because of certain one-sided aspects of these methods (for example, the inability to avoid piling up synonymous labels), they cannot generate sufficiently accurate labels. Some traditional data analysis and mining techniques likewise fail to meet current requirements, making label generation difficult.
Existing label generation methods include classification-based methods, methods based on resources such as Baidu Baike, and keyword-based TextRank methods. Most of them extract the more important words to generate labels, or use entry information to choose fine-grained categories that reflect particular attributes as labels. Almost all of these methods suffer from the accumulation of synonymous labels.
Summary of the invention
To overcome the prior-art drawback that synonymous labels accumulate easily, the present invention provides an automatic label generation method integrating text classification and hierarchical clustering.
To solve the above technical problem, the present invention adopts the following technical solution:
An automatic label generation method integrating text classification and hierarchical clustering comprises the following steps:
Text preprocessing: preprocess the English and/or Chinese text to obtain words;
Text representation: determine features from the obtained words, then build a text representation model capable of describing the text;
Feature dimension reduction: perform feature selection and reduce the dimensionality of the selected features;
Candidate set selection: after dimension reduction, extract the category-discriminating features, extract a text collection, predict a category for each text, and take the categories predicted most often as the candidate set;
Cluster analysis: perform cluster analysis on the resulting candidate set to obtain clusters;
Ranking the clusters and selecting cluster representative words to obtain labels: choose the highest-scoring word in each cluster as its representative, rank the clusters, and concatenate the corresponding representative words to form the automatically generated label sequence.
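The steps above can be sketched end to end as follows. This is a minimal illustration in Python: all function names are my own rather than the patent's, and steps 2-5 (representation, dimension reduction, candidate selection, clustering) are collapsed into a frequency-based stand-in.

```python
# Minimal sketch of the claimed pipeline; names and simplifications are mine.

def preprocess(texts):
    # Step 1: tokenize and lowercase (a stand-in for full preprocessing).
    return [t.lower().split() for t in texts]

def pipeline(texts):
    tokenized = preprocess(texts)
    vocab = sorted({w for doc in tokenized for w in doc})
    # Steps 2-4 collapsed: pick the most frequent terms as the candidate set.
    counts = {w: sum(doc.count(w) for doc in tokenized) for w in vocab}
    candidates = sorted(counts, key=counts.get, reverse=True)[:3]
    # Steps 5-6 (clustering and ranking) would turn candidates into labels.
    return candidates

labels = pipeline(["big data text mining", "text clustering for big data"])
```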
In one embodiment, the text representation model is obtained by normalization and is expressed as
w(u, d̄) = [tf(u, d̄) · log(N/n_u + 0.01)] / √( Σ_{u=1}^{n} [tf(u, d̄) · log(N/n_u + 0.01)]² )    (1)
In formula (1), tf(u, d̄) is the frequency of term u in the text, n_u is the number of texts in the training corpus that contain term u, N is the total number of texts in the training corpus, and w(u, d̄) denotes the text representation model.
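A direct transcription of the normalized representation of formula (1) might look like this; `represent`, `doc_tf`, and `doc_freq` are names I have introduced, not terms from the patent, and the example counts are invented.

```python
import math

# Formula (1): tf-idf weight with cosine normalization.
# tf = term frequency in this text, n_u = texts containing term u, N = corpus size.

def represent(doc_tf, doc_freq, N):
    """doc_tf: term -> raw frequency in this text; doc_freq: term -> n_u."""
    raw = {u: tf * math.log(N / doc_freq[u] + 0.01)
           for u, tf in doc_tf.items()}
    # Divide by the root of the sum of squares so the vector has unit length.
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {u: v / norm for u, v in raw.items()} if norm else raw

w = represent({"cluster": 2, "label": 1}, {"cluster": 3, "label": 10}, N=100)
```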
In one embodiment, performing feature selection and reducing the dimensionality of the selected features proceeds as follows:
Compute the degree of correlation between a feature item and a text category with the formula
χ²(u, e) = N · (W·Z − X·Y)² / [(W + Y)(X + Z)(W + X)(Y + Z)]    (2)
Formula (2) computes the degree of correlation, where χ²(u, e) is the correlation between feature item u and text category e; W is the number of texts that contain feature item u and belong to category e, X the number that contain u but do not belong to e, Y the number that belong to e but do not contain u, Z the number that neither belong to e nor contain u, and N the total number of texts in the training corpus.
The determined text category e is then processed as follows:
For any document w in the text set of category e, generate a document length L, where L obeys a Poisson distribution;
For document w, sample a multinomial distribution over k latent topics of the document;
Consider each word in the document to obtain a more accurate text category.
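Formula (2) can be computed directly from the four contingency-table counts; the function name and example counts below are illustrative, not from the patent.

```python
# Chi-square statistic of formula (2): W, X, Y, Z are the four cells of the
# term/category contingency table; N = W + X + Y + Z.

def chi_square(W, X, Y, Z):
    N = W + X + Y + Z
    denom = (W + Y) * (X + Z) * (W + X) * (Y + Z)
    return N * (W * Z - X * Y) ** 2 / denom if denom else 0.0

# A term concentrated in category e scores higher than an evenly spread one.
focused = chi_square(W=40, X=5, Y=10, Z=45)
spread = chi_square(W=25, X=25, Y=25, Z=25)
```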
In one embodiment, candidate set selection proceeds as follows:
Find the salient features according to the degree of discrimination of the refined text categories;
Extract a text collection V = {v1, ..., vn} of n texts and predict each text with the trained text classifier, obtaining the predicted category list L = {l1, ..., ln} for the n texts. Define a counter count(x, L) over the predicted category list, x ∈ C, where C denotes the candidate set; it returns the number of times x occurs in the list:
rank(c) = count(c, L), c ∈ C    (3)
Here n is a natural number and c a candidate word. Sort the candidates from high to low by rank and choose the top n; the chosen Top(n) is exactly the candidate set C.
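Formula (3) amounts to counting predicted categories and keeping the most frequent ones. A minimal sketch, with hypothetical category names:

```python
from collections import Counter

# rank(c) = count(c, L): count each candidate's occurrences in the predicted
# category list L, then keep the top n as the candidate set C.

def top_candidates(predicted_labels, n):
    rank = Counter(predicted_labels)            # rank(c) = count(c, L)
    return [c for c, _ in rank.most_common(n)]  # Top(n) -> candidate set C

L = ["sports", "finance", "sports", "tech", "sports", "finance"]
C = top_candidates(L, n=2)
```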
In one embodiment, the cluster analysis proceeds as follows: hierarchical clustering is performed on the resulting Top(n) candidate set.
In one embodiment, the hierarchical clustering of the Top(n) candidate set uses one of several distance measures: the single-linkage algorithm, the complete-linkage algorithm, or the average-distance algorithm.
The single-linkage algorithm is expressed as
d(P_1, P_2) = min_{r_1 ∈ P_1, r_2 ∈ P_2} d(r_1, r_2)    (4)
It uses the distance between the nearest objects of two clusters as the inter-cluster distance; clustering terminates when the distance exceeds the set threshold. Here r_1 and r_2 are objects belonging to clusters P_1 and P_2 respectively.
The complete-linkage algorithm is expressed as
d(P_1, P_2) = max_{r_1 ∈ P_1, r_2 ∈ P_2} d(r_1, r_2)    (5)
It uses the distance between the farthest objects of two clusters as the inter-cluster distance; clustering terminates when the distance exceeds the set threshold.
The average-distance algorithm is expressed as
d(P_1, P_2) = d(q_1, q_2)    (6)
where q_1 and q_2 are the means of the two clusters and n_1 and n_2 are the numbers of objects in the two clusters respectively.
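The three linkage measures of formulas (4)-(6) can be sketched on one-dimensional points as follows; a real implementation would apply them in the candidate-word similarity space, and the metric `d` and the sample clusters here are stand-ins.

```python
# Inter-cluster distances of formulas (4)-(6) on 1-D points for brevity.

def d(r1, r2):
    return abs(r1 - r2)

def single_link(P1, P2):     # formula (4): nearest pair of objects
    return min(d(a, b) for a in P1 for b in P2)

def complete_link(P1, P2):   # formula (5): farthest pair of objects
    return max(d(a, b) for a in P1 for b in P2)

def average_link(P1, P2):    # formula (6): distance between cluster means
    q1, q2 = sum(P1) / len(P1), sum(P2) / len(P2)
    return d(q1, q2)

P1, P2 = [1.0, 2.0], [4.0, 6.0]
```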
In one embodiment, ranking the clusters and selecting representative words to obtain labels proceeds as follows:
Let H = (S, F) be the directed graph formed by the words of a text, where S is the set of word nodes and F the set of edges; S_i denotes the i-th node and S_j the j-th. The score of node S_i is computed as
D(S_i) = (1 − v) + v · Σ_{S_j ∈ In(S_i)} [ ω_ji · D(S_j) / Σ_{S_k ∈ Out(S_j)} ω_jk ]    (7)
where v is the probability of jumping from a given node to a random node of the graph, taking a value between 0 and 1; In(S_i) is the set of nodes pointing to S_i, Out(S_j) the set of nodes that S_j points to, and ω_ij the weight of the edge between two nodes. A text is thus treated as containing k words; the highest-scoring word of each cluster is selected as its representative, and the string of cluster representative words corresponding to the representatives is the label.
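Formula (7) is evaluated iteratively until the scores converge. The sketch below assumes a tiny hand-made weighted word graph and a damping value v = 0.85; both the graph and the parameter are illustrative.

```python
# Iterative evaluation of formula (7) on a small weighted word graph.

def textrank(nodes, edges, v=0.85, iters=50):
    """edges: dict (src, dst) -> weight; returns node -> score D(S_i)."""
    D = {s: 1.0 for s in nodes}
    # Denominator of formula (7): total outgoing weight of each node S_j.
    out_sum = {s: sum(w for (a, _), w in edges.items() if a == s)
               for s in nodes}
    for _ in range(iters):
        D = {i: (1 - v) + v * sum(w * D[j] / out_sum[j]
                                  for (j, t), w in edges.items() if t == i)
             for i in nodes}
    return D

scores = textrank(["cluster", "text", "label"],
                  {("text", "cluster"): 2.0, ("label", "cluster"): 1.0,
                   ("cluster", "text"): 1.0})
best = max(scores, key=scores.get)  # highest-scoring word -> representative
```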
By adopting the above technical solution, the present invention achieves significant technical effects:
Building on text-based label generation methods, the invention proposes an automatic label generation method based on cluster analysis of text classification results. The method builds a training corpus manually and uses the target categories as the candidate set; it then clusters the candidates as keywords, computes similarities, performs cluster analysis, ranks the clusters, and selects representative words to finally obtain the user labels. Integrating classification-based and keyword-based methods in this way generates more accurate, higher-quality labels, with particularly significant benefits for large-scale, sparse, or complex data.
Brief description of the drawings
To explain the embodiments of the present invention or the prior-art technical solutions more clearly, the accompanying drawings required by the description of the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show only some embodiments of the invention; those of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of the overall method of the present invention.
Embodiments
The present invention is described in further detail below with reference to the embodiments. The following embodiments explain the invention, but the invention is not limited to them.
Embodiment 1:
An automatic label generation method integrating text classification and hierarchical clustering, as shown in Fig. 1, comprises the following steps:
S1, text preprocessing: preprocess the English and/or Chinese text to obtain words;
S2, text representation: determine features from the obtained words, then build a text representation model capable of describing the text;
S3, feature dimension reduction: perform feature selection and reduce the dimensionality of the selected features;
S4, candidate set selection: after dimension reduction, extract the category-discriminating features, extract a text collection, predict a category for each text, and take the categories predicted most often as the candidate set;
S5, cluster analysis: perform cluster analysis on the resulting candidate set to obtain clusters;
S6, ranking the clusters and selecting cluster representative words to obtain labels: choose the highest-scoring word in each cluster as its representative, rank the clusters, and concatenate the corresponding representative words to form the automatically generated label sequence.
In S2, the text representation model is obtained by normalization as in formula (1): tf(u, d̄) is the frequency of term u in the text, n_u the number of texts in the training corpus containing term u, N the total number of texts in the training corpus, and w(u, d̄) the text representation model.
Text classification commonly suffers from excessive feature dimensionality and data sparseness, so the features are first reduced in dimension and the processed features are then classified. The detailed process of step S3 is:
Compute the degree of correlation between a feature item and a text category using formula (2), where χ²(u, e) is the correlation between feature item u and category e; W is the number of texts that contain feature item u and belong to category e, X the number that contain u but do not belong to e, Y the number that belong to e but do not contain u, Z the number that neither belong to e nor contain u, and N the total number of texts in the training corpus.
The determined text category e is then refined with a three-layer Bayesian probability model over words, topics, and documents:
For any document w in the text set of category e, generate a document length L, where L obeys a Poisson distribution;
For document w, sample a multinomial distribution over k latent topics of the document;
Consider each word in the document to obtain a more accurate text category.
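The generative process above (Poisson-distributed length, a per-document topic mixture, then one topic and one word per position) can be illustrated with a toy sampler. The topics, vocabularies, and parameters below are invented for illustration, not taken from the patent.

```python
import math
import random

random.seed(42)

def poisson(lam):
    # Knuth's inversion sampler for a Poisson-distributed draw.
    limit, k, p = math.exp(-lam), 0, 1.0
    while p > limit:
        k += 1
        p *= random.random()
    return k - 1

TOPIC_WORDS = {
    "sports": ["match", "team", "score"],
    "finance": ["stock", "market", "fund"],
}

def sample_document(mean_len=8):
    L = poisson(mean_len)                        # document length L ~ Poisson
    topics = list(TOPIC_WORDS)
    weights = [random.random() for _ in topics]  # stand-in topic mixture
    doc = []
    for _ in range(L):
        topic = random.choices(topics, weights)[0]  # draw a topic
        doc.append(random.choice(TOPIC_WORDS[topic]))  # then a word
    return doc

doc = sample_document()
```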
In step S4, candidate set selection proceeds as follows:
Find the salient features according to the degree of discrimination of the refined text categories;
Extract a text collection V = {v1, ..., vn} of n texts and predict each text with the trained text classifier, obtaining the predicted category list L = {l1, ..., ln}. Define a counter count(x, L) over the predicted category list, x ∈ C, where C denotes the candidate set; it returns the number of times x occurs in the list, as in formula (3). Sort the candidates from high to low by rank and choose the top n; the chosen Top(n) is exactly the candidate set C. Hierarchical clustering is then performed on the Top(n) candidate set.
Depending on the distance measure used, the hierarchical clustering may be the single-linkage algorithm, the complete-linkage algorithm, or the average-distance algorithm:
The single-linkage algorithm, formula (4), uses the distance between the nearest objects of two clusters as the inter-cluster distance; clustering terminates when the distance exceeds the set threshold, where r_1 and r_2 are objects belonging to clusters P_1 and P_2 respectively.
The complete-linkage algorithm, formula (5), uses the distance between the farthest objects of two clusters as the inter-cluster distance; clustering terminates when the distance exceeds the set threshold.
The average-distance algorithm, formula (6), uses the distance between the cluster means q_1 and q_2, where n_1 and n_2 are the numbers of objects in the two clusters respectively.
In step S6, ranking the clusters and selecting representative words to obtain labels proceeds as follows:
Let H = (S, F) be the directed graph formed by the words of a text, where S is the set of word nodes and F the set of edges; compute the score D(S_i) of node S_i as in formula (7), where v is the probability of jumping from a given node to a random node of the graph (a value between 0 and 1), In(S_i) is the set of nodes pointing to S_i, Out(S_j) the set of nodes S_j points to, and ω_ij the weight of the edge between two nodes. A text is thus treated as containing k words; the highest-scoring word of each cluster is selected as its representative, and the string of cluster representative words corresponding to the representatives is the label.
Furthermore, it should be noted that the specific embodiments described in this specification may differ in the shapes and names of their parts and components. All equivalent or simple changes made according to the constructions, features, and principles described in the inventive concept of the present invention are included in the protection scope of this patent. Those skilled in the art may make various modifications, supplements, or substitutions to the described specific embodiments in a similar manner; provided such changes do not depart from the structure of the invention or exceed the scope defined by the claims, they shall all fall within the protection scope of the present invention.
Claims (7)
1. An automatic label generation method integrating text classification and hierarchical clustering, characterized by comprising the following steps:
text preprocessing: preprocessing the English and/or Chinese text to obtain words;
text representation: determining features from the obtained words, then building a text representation model capable of describing the text;
feature dimension reduction: performing feature selection and reducing the dimensionality of the selected features;
candidate set selection: after dimension reduction, extracting the category-discriminating features, extracting a text collection, predicting a category for each text, and taking the categories predicted most often as the candidate set;
cluster analysis: performing cluster analysis on the resulting candidate set to obtain clusters;
ranking the clusters and selecting cluster representative words to obtain labels: choosing the highest-scoring word in each cluster as its representative, ranking the clusters, and concatenating the corresponding representative words to form the automatically generated label sequence.
2. The automatic label generation method integrating text classification and hierarchical clustering according to claim 1, characterized in that the text representation model is obtained by normalization and is expressed as
w(u, d̄) = [tf(u, d̄) · log(N/n_u + 0.01)] / √( Σ_{u=1}^{n} [tf(u, d̄) · log(N/n_u + 0.01)]² )    (1)
In formula (1), tf(u, d̄) is the frequency of term u in the text, n_u is the number of texts in the training corpus containing term u, N is the total number of texts in the training corpus, and w(u, d̄) denotes the text representation model.
3. The automatic label generation method integrating text classification and hierarchical clustering according to claim 2, characterized in that performing feature selection and reducing the dimensionality of the selected features comprises:
computing the degree of correlation between a feature item and a text category with the following formula:
χ²(u, e) = N · (W·Z − X·Y)² / [(W + Y)(X + Z)(W + X)(Y + Z)]    (2)
Formula (2) computes the degree of correlation, where χ²(u, e) is the correlation between feature item u and text category e; W is the number of texts that contain feature item u and belong to category e, X the number that contain u but do not belong to e, Y the number that belong to e but do not contain u, Z the number that neither belong to e nor contain u, and N the total number of texts in the training corpus;
the determined text category e then being processed as follows:
for any document w in the text set of category e, generating a document length L, where L obeys a Poisson distribution;
for document w, sampling a multinomial distribution over k latent topics of the document;
considering each word in the document to obtain a more accurate text category.
4. The automatic label generation method integrating text classification and hierarchical clustering according to claim 3, characterized in that the candidate set selection comprises:
finding the salient features according to the degree of discrimination of the refined text categories;
extracting a text collection V = {v1, ..., vn} of n texts and predicting each text with the trained text classifier to obtain the predicted category list L = {l1, ..., ln}; defining a counter count(x, L) over the predicted category list, x ∈ C, where C denotes the candidate set, which returns the number of times x occurs in the list,
rank(c) = count(c, L), c ∈ C    (3)
where n is a natural number and c a candidate word; sorting from high to low and choosing the top n, the chosen Top(n) being exactly the candidate set C.
5. The automatic label generation method integrating text classification and hierarchical clustering according to claim 4, characterized in that the cluster analysis comprises: performing hierarchical clustering on the resulting Top(n) candidate set.
6. The automatic label generation method integrating text classification and hierarchical clustering according to claim 5, characterized in that the hierarchical clustering of the Top(n) candidate set uses one of several distance measures: the single-linkage algorithm, the complete-linkage algorithm, or the average-distance algorithm;
the single-linkage algorithm being expressed as:
d(P_1, P_2) = min_{r_1 ∈ P_1, r_2 ∈ P_2} d(r_1, r_2)    (4)
the single-linkage algorithm using the distance between the nearest objects of two clusters as the inter-cluster distance, clustering terminating when the distance exceeds the set threshold, where r_1 and r_2 are objects belonging to clusters P_1 and P_2 respectively;
the complete-linkage algorithm being expressed as:
d(P_1, P_2) = max_{r_1 ∈ P_1, r_2 ∈ P_2} d(r_1, r_2)    (5)
the complete-linkage algorithm using the distance between the farthest objects of two clusters as the inter-cluster distance, clustering terminating when the distance exceeds the set threshold, where r_1 and r_2 are objects belonging to clusters P_1 and P_2 respectively;
the average-distance algorithm being expressed as:
d(P_1, P_2) = d(q_1, q_2)    (6)
where q_1 and q_2 are the means of the two clusters and n_1 and n_2 are the numbers of objects in the two clusters respectively.
7. The automatic label generation method integrating text classification and hierarchical clustering according to claim 6, characterized in that ranking the clusters and selecting representative words to obtain labels comprises:
letting H = (S, F) be the directed graph formed by the words of a text, where S is the set of word nodes and F the set of edges, S_i denoting the i-th node and S_j the j-th node, and computing the score of node S_i with the following formula:
D(S_i) = (1 − v) + v · Σ_{S_j ∈ In(S_i)} [ ω_ji · D(S_j) / Σ_{S_k ∈ Out(S_j)} ω_jk ]    (7)
where v is the probability of jumping from a given node to a random node of the graph, taking a value between 0 and 1; In(S_i) is the set of nodes pointing to S_i, Out(S_j) the set of nodes that S_j points to, and ω_ij the weight of the edge between two nodes; a text thus being treated as containing k words, the highest-scoring word being selected as the cluster representative word, and the string of cluster representative words corresponding to the representatives being the label.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710249462.0A CN107180075A (en) | 2017-04-17 | 2017-04-17 | The label automatic generation method of text classification integrated level clustering |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710249462.0A CN107180075A (en) | 2017-04-17 | 2017-04-17 | The label automatic generation method of text classification integrated level clustering |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107180075A true CN107180075A (en) | 2017-09-19 |
Family
ID=59831984
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710249462.0A Pending CN107180075A (en) | 2017-04-17 | 2017-04-17 | The label automatic generation method of text classification integrated level clustering |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107180075A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101174273A (en) * | 2007-12-04 | 2008-05-07 | 清华大学 | News event detecting method based on metadata analysis |
CN106104512A (en) * | 2013-09-19 | 2016-11-09 | 西斯摩斯公司 | System and method for active obtaining social data |
CN104391835A (en) * | 2014-09-30 | 2015-03-04 | 中南大学 | Method and device for selecting feature words in texts |
CN104834940A (en) * | 2015-05-12 | 2015-08-12 | 杭州电子科技大学 | Medical image inspection disease classification method based on support vector machine (SVM) |
Non-Patent Citations (2)
Title |
---|
Lü Haiyan et al.: "Automatic Generation of Microblog User Tags Based on Cluster Analysis", Electronic Design Engineering * |
Song Wei et al.: "User Interest Identification Based on Microblog Classification", Intelligent Computer and Applications * |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107784105A (en) * | 2017-10-26 | 2018-03-09 | 平安科技(深圳)有限公司 | Construction of knowledge base method, electronic installation and storage medium based on magnanimity problem |
CN108062377A (en) * | 2017-12-12 | 2018-05-22 | 百度在线网络技术(北京)有限公司 | The foundation of label picture collection, definite method, apparatus, equipment and the medium of label |
CN108446276A (en) * | 2018-03-21 | 2018-08-24 | 腾讯音乐娱乐科技(深圳)有限公司 | The method and apparatus for determining the single keyword of song |
CN108595585A (en) * | 2018-04-18 | 2018-09-28 | 平安科技(深圳)有限公司 | Sample data sorting technique, model training method, electronic equipment and storage medium |
CN111738009A (en) * | 2019-03-19 | 2020-10-02 | 百度在线网络技术(北京)有限公司 | Method and device for generating entity word label, computer equipment and readable storage medium |
CN111738009B (en) * | 2019-03-19 | 2023-10-20 | 百度在线网络技术(北京)有限公司 | Entity word label generation method, entity word label generation device, computer equipment and readable storage medium |
CN110297901A (en) * | 2019-05-14 | 2019-10-01 | 广州数说故事信息科技有限公司 | Extensive Text Clustering Method based on distance parameter |
CN110297901B (en) * | 2019-05-14 | 2023-11-17 | 广州数说故事信息科技有限公司 | Large-scale text clustering method based on distance parameters |
CN110188189B (en) * | 2019-05-21 | 2021-10-08 | 浙江工商大学 | Knowledge-based method for extracting document abstract by adaptive event index cognitive model |
CN110188189A (en) * | 2019-05-21 | 2019-08-30 | 浙江工商大学 | A kind of method that Knowledge based engineering adaptive event index cognitive model extracts documentation summary |
CN111797945B (en) * | 2020-08-21 | 2020-12-15 | 成都数联铭品科技有限公司 | Text classification method |
CN111797945A (en) * | 2020-08-21 | 2020-10-20 | 成都数联铭品科技有限公司 | Text classification method |
CN112860900A (en) * | 2021-03-23 | 2021-05-28 | 上海壁仞智能科技有限公司 | Text classification method and device, electronic equipment and storage medium |
CN114443850A (en) * | 2022-04-06 | 2022-05-06 | 杭州费尔斯通科技有限公司 | Label generation method, system, device and medium based on semantic similar model |
CN114443850B (en) * | 2022-04-06 | 2022-07-22 | 杭州费尔斯通科技有限公司 | Label generation method, system, device and medium based on semantic similar model |
CN114676796A (en) * | 2022-05-27 | 2022-06-28 | 浙江清大科技有限公司 | Clustering acquisition and identification system based on big data |
CN114676796B (en) * | 2022-05-27 | 2022-09-06 | 浙江清大科技有限公司 | Clustering acquisition and identification system based on big data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107180075A (en) | The label automatic generation method of text classification integrated level clustering | |
CN107609121B (en) | News text classification method based on LDA and word2vec algorithm | |
Wang et al. | A hybrid document feature extraction method using latent Dirichlet allocation and word2vec | |
Papagiannopoulou et al. | Local word vectors guiding keyphrase extraction | |
Abdelhamid et al. | Associative classification approaches: review and comparison | |
Santra et al. | Genetic algorithm and confusion matrix for document clustering | |
US20060288275A1 (en) | Method for classifying sub-trees in semi-structured documents | |
CN109325231A (en) | A kind of method that multi task model generates term vector | |
CN110209808A (en) | A kind of event generation method and relevant apparatus based on text information | |
CN103927302A (en) | Text classification method and system | |
KR20190135129A (en) | Apparatus and Method for Documents Classification Using Documents Organization and Deep Learning | |
CN105205163B (en) | A kind of multi-level two sorting technique of the incremental learning of science and technology news | |
CN106503153B (en) | A kind of computer version classification system | |
CN106339459B (en) | The method that Chinese web page is presorted is carried out based on Keywords matching | |
CN108228612A (en) | A kind of method and device for extracting network event keyword and mood tendency | |
Du et al. | Hierarchy construction and text classification based on the relaxation strategy and least information model | |
CN108733669A (en) | A kind of personalized digital media content recommendation system and method based on term vector | |
Castano et al. | Classifying and reusing conceptual schemas | |
Ding et al. | The research of text mining based on self-organizing maps | |
Marath et al. | Large-scale web page classification | |
Denzler et al. | Granular knowledge cube | |
Kim et al. | An intelligent information system for organizing online text documents | |
CN107391674B (en) | New type mining method and device | |
Golam Sohrab et al. | EDGE2VEC: Edge representations for large-scale scalable hierarchical learning | |
Avancini et al. | Organizing digital libraries by automated text categorization |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20170919 |
|
RJ01 | Rejection of invention patent application after publication |