CN110955773A - Discriminant text clustering method and system based on minimum normalized information distance - Google Patents

Discriminant text clustering method and system based on minimum normalized information distance Download PDF

Info

Publication number
CN110955773A
CN110955773A CN201911079897.0A CN201911079897A CN110955773A CN 110955773 A CN110955773 A CN 110955773A CN 201911079897 A CN201911079897 A CN 201911079897A CN 110955773 A CN110955773 A CN 110955773A
Authority
CN
China
Prior art keywords
text
data set
parameter set
clustering
discriminant
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911079897.0A
Other languages
Chinese (zh)
Other versions
CN110955773B (en
Inventor
秦家虎
朱英达
付维明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN201911079897.0A priority Critical patent/CN110955773B/en
Publication of CN110955773A publication Critical patent/CN110955773A/en
Application granted granted Critical
Publication of CN110955773B publication Critical patent/CN110955773B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a discriminant text clustering method and system based on minimum normalized information distance, wherein the method comprises the following steps: vectorizing a text data set, wherein the text data set comprises a plurality of texts, and each text comprises a plurality of keywords; initializing a model parameter set for the vectorized text data set; calculating and updating the parameter set by a gradient descent method through the minimum normalized information distance; setting a termination condition and outputting a final parameter set; and (4) designing a discriminant text clustering algorithm by using the final parameter set to realize text clustering. The discriminant text clustering method and system based on the minimum normalized information distance provided by the invention provide a method using normalized information measure as an objective function aiming at the problem of model selection of the existing discriminant clustering algorithm, so that the algorithm has automatic model selection capability, thereby improving the capability of obtaining better clustering results under the condition that the artificially selected initial model order is unreasonable.

Description

Discriminant text clustering method and system based on minimum normalized information distance
Technical Field
The invention relates to the field of natural language processing and text mining, in particular to a discriminant text clustering method and system based on minimum normalized information distance.
Background
The existing text clustering mostly adopts a K-means algorithm, and the discriminant clustering algorithm mostly adopts a method of maximizing mutual information (or a variant thereof), which easily causes that a model order (the number of clusters, such as K of the K-means) is always equal to an initial value in the clustering process, so that the algorithm does not have the capability of automatic model selection, and the model order of a final clustering result is determined manually to a great extent. However, in text clustering, people hardly give the most reasonable model order, and the model orders with smaller sizes easily cause poor clustering results.
The existing discriminant clustering algorithm based on the maximum mutual information and the k-means algorithm commonly used in the field of text clustering have the following disadvantages:
1. if the initial model order selection is not reasonable, an undesirable clustering result is caused, for example, the selection is too large, an overfitting model is easily generated, namely, the similarity is high, an example which originally belongs to a cluster is further subdivided into models of different clusters, an extreme example is that a training example is divided into a cluster, and the clustering result has no meaning; however, a model that is too small to easily generate under-fitting, i.e., "a model in which examples having low similarity are not sufficiently separated" is selected.
2. For the situation that the number of potential examples in each cluster is very different, the discriminant clustering algorithm based on the maximum mutual information is easy to generate a poor clustering model, namely, a model for dividing data with high similarity into different clusters.
Disclosure of Invention
The present invention provides a discriminant text clustering method and system based on minimum normalized information distance to at least partially solve the above problems.
In view of the above, an aspect of the present invention provides a discriminant text clustering method based on a minimum normalized information distance, including:
vectorizing a text data set, wherein the text data set comprises a plurality of texts, and each text comprises a plurality of keywords;
initializing a model parameter set for the vectorized text data set;
calculating and updating the parameter set by a gradient descent method through the minimum normalized information distance;
setting a termination condition and outputting a final parameter set;
and (4) designing a discriminant text clustering algorithm by using the final parameter set to realize text clustering.
Wherein:
in some embodiments, vectorizing the text data set includes:
performing programmed processing on the text data set, further performing the programmed processing by using a word frequency-inverse document frequency algorithm to obtain the relationship between each keyword in the text data set and a programmed processing value corresponding to each keyword, and marking the relationship as < key, value >;
sorting each keyword according to the dictionary sequence and establishing an index;
arranging the programmed processing value of each text in the text data set into a vector according to the index sequence corresponding to the keyword, taking the vector as the feature vector of the text, integrating the feature vectors of all texts, and recording as:
xi=[value1,value2,…,valueM],
wherein i represents a text serial number, and M is the total number of keywords under corresponding indexes in the text data set;
vectorized text data set { x1,...,xi,...,xNPerforming dimension reduction processing, wherein N is the number of texts in the text data set, and xiA feature vector representing the ith text.
In some embodiments, initializing the set of model parameters includes:
executing a K-means algorithm with the clustering number of K on the vectorized text data set to obtain K clusters { C1,C2,...,CKWill belong to CkMarking the category of the data to be K, wherein K is more than or equal to 1 and less than or equal to K, and obtaining a labeled data set;
aiming at the labeled data set, executing a multi-classification logistic regression method to obtain an initialization model parameter set
Figure BDA0002262267900000022
And, the initialization model parameter set corresponds to a condition model as follows:
Figure BDA0002262267900000021
Figure BDA0002262267900000031
wherein the content of the first and second substances,
Figure BDA0002262267900000032
x*T=[xT,1]∈RD+1
Figure BDA0002262267900000033
representing parameters
Figure BDA0002262267900000034
Transpose of (x)*TRepresenting a vector x*Transpose of RD+1Where D represents the data dimension, R is a set of real numbers, RD+1Representing a real space of dimension (D +1), i.e. representing w*The spatial dimension of (a).
In some embodiments, the calculating and updating the parameter set by the gradient descent method through the minimum normalized information distance includes:
based on parameters in the initial parameter set
Figure BDA0002262267900000035
The empirical distribution of cluster labels in the labeled dataset is calculated by the conditional model:
Figure BDA0002262267900000036
wherein K is more than or equal to 1 and less than or equal to K;
initializing the value of F, recording F2 ═ F;
calculating the value of the objective function F and the parameters of the objective function F
Figure BDA0002262267900000037
Updating the set of parameters, further comprising:
based on initial parameters
Figure BDA0002262267900000038
Calculate the value of the objective function F:
Figure BDA0002262267900000039
calculating parameters of an objective function F
Figure BDA00022622679000000310
Wherein K ∈ { 1.., K }:
Figure BDA00022622679000000311
wherein the content of the first and second substances,
Figure BDA00022622679000000312
to represent
Figure BDA00022622679000000313
The number d element of (a) is,
Figure BDA00022622679000000314
to represent
Figure BDA00022622679000000315
The d element of (2), pk,pkiRespectively represent p (k) and p (k | x)i);
Updating parameters
Figure BDA00022622679000000316
Figure BDA00022622679000000317
Wherein η is a learning step size, η > 0, and is set artificially.
In some embodiments, setting the termination condition to output the final set of parameters comprises:
setting a parameter E, wherein E is more than 0;
if | F-F2| < E, the parameter set is output
Figure BDA00022622679000000318
If | F-F2| ≧ E, F2 ≧ F is recorded, and the above process of updating parameter sets is re-executed until | F-F2| < E, the parameter set is output
Figure BDA00022622679000000319
Based on the above method, another aspect of the present invention further provides a discriminant text clustering system based on a minimum normalized information distance, including:
a text input unit which inputs a text data set to be clustered;
the text processing unit is internally provided with the discriminant text clustering method based on the minimum normalized information distance to realize text clustering of the text data set;
and the text output unit outputs the text clustering result.
The discriminant text clustering method and system based on the minimum normalized information distance provided by the invention have the following beneficial effects:
(1) by initializing the model parameter set, the problem that a poor clustering result is easily generated when the number of examples in a potential cluster is unbalanced in the conventional clustering algorithm is solved;
(2) by the minimum normalized information distance, the problem of overfitting generated when the set model order is too large in the conventional clustering algorithm is solved.
Drawings
FIG. 1 is a flowchart of a discriminant text clustering method for minimum normalized information distance according to an embodiment of the present invention.
Detailed Description
In order that the objects, technical solutions and advantages of the present invention will become more apparent, the present invention will be further described in detail with reference to the accompanying drawings in conjunction with the following specific embodiments.
The invention provides a method for using normalized information measure as a target function aiming at the problem of model selection of the existing discriminant clustering algorithm, so that the algorithm has automatic model selection capability, thereby improving the capability of obtaining better clustering results under the condition that the artificially selected initial model order is unreasonable.
Data clustering divides a data object set into a plurality of different classes or clusters, the similarity between data objects in each cluster is higher than the similarity between data objects in other clusters, and the method is widely applied to the fields of text processing, customer group grouping, image segmentation and the like. Text clustering is an application of a clustering algorithm in a text direction, and a common method is to firstly perform word segmentation on a text, divide a continuous text into a string of word sequences, convert word segmentation into word vectors to form high-dimensional space points, then reduce the high-dimensional data space to a relatively low-dimensional data space by using a dimension reduction algorithm, wherein each data point corresponds to a text, and finally cluster the vectorized texts by using the clustering algorithm.
Data clustering techniques can be broadly divided into two categories: generative clustering and discriminant clustering. Generative clustering implements clustering by reconstructing a generation pattern of data; discriminant clustering is achieved by finding the boundaries between classes. Common text clustering algorithms, such as the K-means algorithm, belong to generative clustering, and have a model selection problem (i.e., a selection problem of the number of clusters K), a problem of difficulty in finding clusters of non-gaussian shapes, and a problem of difficulty in handling cases of imbalance in the number of examples within a potential cluster. In the discriminant clustering field, information theory measure is often used as an objective function, and a commonly used information measure is mutual information shared by data and a clustering label, and is specifically defined as follows:
is provided with a data set { x1,x2,...,xNIn which xi∈RD1, 2.. times, N, and a set of cluster labels {1, 2.. times, K }. Random variables X are set and taken from the data set according to uniform distribution, random variables Y are set and taken from the clustering label set according to uniform distribution, and then mutual information shared by the data and the clustering labels is as follows:
I(X,Y;W)=H(Y;W)-H(Y|X;W)
wherein
Figure BDA0002262267900000051
Is information entropy of clustering labels
Figure BDA0002262267900000052
Is the conditional information entropy of the cluster label under the condition of data,
Figure BDA0002262267900000053
represents the empirical distribution of the cluster labels and W represents the model parameters.
The existing maximum mutual information discrimination clustering algorithm is used for determining p (k | x)i(ii) a W), the mutual information can be expressed as a function of the parameter W, and then the gradient ascent method is used to maximize the mutual information to obtain an optimized model consisting of p (k | x)i(ii) a Form of W) and parameter W are completely described]. After obtaining the model, for any test case xiCan substitute for p (k | x)i(ii) a W) calculates the probability that the example belongs to class k, and after calculating the probability that the example belongs to each class, classifies the example into the class with the highest corresponding probability. This is the process of the existing discriminant clustering algorithm based on the maximum mutual information.
The invention can solve two problems of the discriminant clustering algorithm based on the maximum mutual information and the k-means algorithm. For the problem 1, the method for normalizing the information theoretical measure is utilized, and a deviation mechanism for a simple model is provided for the algorithm, namely the algorithm is more prone to training to obtain the simple model. This biasing mechanism may avoid the occurrence of overfitting to some extent. Thus, when setting the model order, if we set the model order larger, both under-fitting and over-fitting can be prevented. For the problem 2, the invention adopts an initialization method which firstly utilizes a k-means algorithm to carry out initial clustering and then utilizes a multi-classification logistic regression method to obtain initial parameters based on clustering labels obtained by the initial clustering, thereby effectively solving the problem.
It is therefore an object of the present invention to provide a method and system for clustering discriminative text based on minimum normalized information distance with automatic model order selection capability, the ability to discover non-gaussian shaped clusters and the ability to generate good clustering models in the event of imbalances in the number of instances within a potential cluster.
In view of this, an embodiment of the present invention provides a discriminant text clustering method based on a minimum normalized information distance, including the following steps:
vectorizing a text data set, the text data set including a plurality of texts, each text including a plurality of keywords;
initializing a model parameter set for the vectorized text data set;
calculating and updating the parameter set by a gradient descent method through the minimum normalized information distance;
setting a termination condition and outputting a final parameter set;
and (4) designing a discriminant text clustering algorithm by using the final parameter set to realize text clustering.
Wherein the model parameter initialization further comprises: executing a k-means algorithm on a training data set to obtain an initial clustering label, and executing a multi-classification logistic regression method to obtain initial parameters based on the initial clustering label.
Fig. 1 is a flowchart of an algorithm of the minimum normalized information distance-based discriminant text clustering method according to this embodiment, and the following is described in detail with reference to the flowchart:
step 1, vectorizing the text data set, and converting each text into a vector with the length of D.
Step 1.1, performing programmed processing on a text data set by using a word frequency-inverse document frequency algorithm (TF-IDF method), and obtaining the corresponding relation between each keyword in the text data set and the TF-IDF value thereof, and marking as (key, value >;
step 1.2, ordering the obtained keywords according to the dictionary sequence, and establishing an index according to the ordering;
step 1.3, for each text in the text data set, arranging TF-IDF values into vectors according to the index sequence corresponding to the keywords of the text, taking the vectors as the feature vectors of the text, and integrating the feature vectors of the texts and marking the vectors as xi=[value1,value2,...,valueM]I represents a text serial number, and M is the total number of keywords under corresponding indexes in the data set of the text;
step 1.4, utilizing PCA technology to carry out vectorization on the text data set { x1,...,xNPerforming dimension reduction, wherein N is the number of texts in the text data set, and xiAnd representing the feature vector of the ith text, and assuming that the dimension of the vector after dimensionality reduction is D < M.
Step 2, initializing a model parameter set aiming at the text data set after vectorization processing
Figure BDA00022622679000000711
Wherein K is the maximum model order, the values of w and b can be initialized randomly, and then optimized and updated by the formula in step 3, and the corresponding condition model is:
Figure BDA0002262267900000071
wherein the content of the first and second substances,
Figure BDA0002262267900000072
x*T=[xT,1]∈RD+1
Figure BDA0002262267900000073
representing parameters
Figure BDA0002262267900000074
Transpose of (x)*TRepresenting a vector x*Transpose of RD+1Where D represents the data dimension, R is a set of real numbers, RD+1Representing a real space of dimension (D +1), i.e. w*E.g. D-3, then w*Is 4, i.e. a 4-dimensional vector is to be initialized.
Step 2.1, executing a K-means algorithm with the clustering number of K on the vectorized text data set to obtain K clusters { C }1,C2,...,CKWill belong to CkMarking the category of the data to be K, wherein K is more than or equal to 1 and less than or equal to K, so as to obtain a labeled data set;
step 2.2, aiming at the labeled data set, executing a multi-classification logistic regression (MLR) method to obtain an initial parameter set
Figure BDA00022622679000000712
Step 3, calculating and updating parameters in the parameter set by a gradient descent method through the minimum normalized information distance
Figure BDA0002262267900000075
Step 3.1, based on the current parameters
Figure BDA0002262267900000076
Passing through type
Figure BDA0002262267900000077
Figure BDA0002262267900000078
Calculating the empirical distribution p (K) of the clustering labels in the labeled data set, wherein K is more than or equal to 1 and less than or equal to K;
step 3.2, initializing the value of F, and recording F2 as F;
step 3.3, calculating the value of the objective function F and the parameters of the objective function F
Figure BDA0002262267900000079
The parameter set is updated.
Step 3.3.1, based on the current parameters
Figure BDA00022622679000000710
The value of the objective function F is calculated by:
Figure BDA0002262267900000081
step 3.3.2, calculate the parameters of the objective function F using the following formula
Figure BDA0002262267900000082
Gradient (2):
Figure BDA0002262267900000083
wherein the content of the first and second substances,
Figure BDA0002262267900000084
to represent
Figure BDA0002262267900000085
The number d element of (a) is,
Figure BDA0002262267900000086
to represent
Figure BDA0002262267900000087
The d element of (2), pk,pkiRespectively represent p (k) and p (k | x)i);
Step 3.3.3 updating parameters by the following equation
Figure BDA0002262267900000088
Figure BDA0002262267900000089
Wherein η (> 0) is the learning step length and is set by human;
and 4, setting a termination condition and outputting a final parameter set.
Step 4.1, if | F-F2| < E, wherein E is a parameter set by human, if 0.001 is taken, stopping the algorithm and outputting the parameter set
Figure BDA00022622679000000810
Class to which ith text belongsRespectively let p (k | x)i) A maximum k designation;
if step 4.2 and step 4.1 are not satisfied, the record F2 is F, and the process returns to step 3.3, and the parameter set is output until | F-F2| < E
Figure BDA00022622679000000811
And 5, designing a discriminant text clustering algorithm by using the final parameter set to realize text clustering.
Based on the above embodiment, another aspect of the present invention provides a discriminant text clustering system based on a minimum normalized information distance, including:
a text input unit which inputs a text data set to be clustered;
the text processing unit is internally provided with the discriminant text clustering method based on the minimum normalized information distance to realize text clustering of the text data set;
and the text output unit outputs the text clustering result.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention and are not intended to limit the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A discriminant text clustering method based on minimum normalized information distance is characterized by comprising the following steps:
vectorizing a text data set, wherein the text data set comprises a plurality of texts, and each text comprises a plurality of keywords;
initializing a model parameter set for the vectorized text data set;
calculating and updating the parameter set by a gradient descent method through a minimum normalized information distance;
setting a termination condition to output the final parameter set;
and designing a discriminant text clustering algorithm by using the final parameter set to realize text clustering.
2. The minimum normalized information distance-based discriminative text clustering method according to claim 1, wherein the vectorizing a text data set comprises:
performing programmed processing on the text data set to obtain the relationship between each keyword in the text data set and a programmed processing value corresponding to each keyword, and marking as < key, value >;
ordering each keyword according to the dictionary sequence and establishing an index;
arranging the programmed processing value of each text in the text data set into a vector according to the index sequence corresponding to the keyword, taking the vector as the feature vector of the text, and integrating the feature vectors of the texts and recording the vectors as:
xi=[value1,value2,…,valueM],
wherein i represents a text sequence number, and M is the total number of the keywords under the corresponding index in the text data set;
vectorizing the vectorized text data set { x1,...,xi,...,xNPerforming dimension reduction processing, wherein N is the number of texts in the text data set, and xiThe feature vector representing the ith text.
3. The method of claim 2, wherein the procedural process is a word frequency-inverse document frequency algorithm process.
4. The method according to claim 1, wherein initializing the set of model parameters comprises:
performing on the text data set of the vectorization processObtaining K clusters { C by using a K-means algorithm with K cluster numbers1,C2,...,CKWill belong to CkMarking the category of the data to be K, wherein K is more than or equal to 1 and less than or equal to K, and obtaining a labeled data set;
aiming at the labeled data set, executing a multi-classification logistic regression method to obtain an initialization model parameter set
Figure FDA0002262267890000021
5. The minimum normalized information distance-based discriminant text clustering method according to claim 1 or 4, wherein the initialized model parameter set corresponds to a condition model:
Figure FDA0002262267890000022
Figure FDA0002262267890000023
wherein the content of the first and second substances,
Figure FDA0002262267890000024
x*T=[xT,1]∈RD+1
Figure FDA0002262267890000025
representing parameters
Figure FDA0002262267890000026
Transpose of (x)*TRepresenting a vector x*Transpose of RD+1Where D represents the data dimension, R is a set of real numbers, RD+1Representing a real space of dimension (D +1), i.e. representing w*The spatial dimension of (a).
6. The minimum normalized information distance based discriminative text clustering method according to claim 5 wherein the calculating and updating the parameter set by the minimum normalized information distance in a gradient descent method comprises:
based on parameters in the initial parameter set
Figure FDA0002262267890000027
Calculating an empirical distribution of cluster labels in the labeled dataset by the conditional model:
Figure FDA0002262267890000028
wherein K is more than or equal to 1 and less than or equal to K;
initializing the value of F, recording F2 ═ F;
calculating the value of the objective function F and the parameters of the objective function F
Figure FDA0002262267890000029
Updating the parameter set.
7. The minimum normalized information distance-based discriminative text clustering method according to claim 6, wherein updating the parameter set comprises:
based on initial parameters
Figure FDA00022622678900000210
Calculate the value of the objective function F:
Figure FDA00022622678900000211
calculating parameters of an objective function F
Figure FDA00022622678900000212
Wherein K ∈ { 1.., K }:
Figure FDA00022622678900000213
wherein the content of the first and second substances,
Figure FDA0002262267890000031
to represent
Figure FDA0002262267890000032
The number d element of (a) is,
Figure FDA0002262267890000033
to represent
Figure FDA0002262267890000034
The d element of (2), pk,pkiRespectively represent p (k) and p (k | x)i);
Updating parameters
Figure FDA0002262267890000035
Figure FDA0002262267890000036
Wherein η is a learning step size, η > 0, and is set artificially.
8. The method according to claim 7, wherein the setting of the termination condition to output the final parameter set comprises:
setting a parameter E, wherein E is more than 0;
if | F-F2| < E, the parameter set is output
Figure FDA0002262267890000037
9. The method according to claim 8, wherein the setting an end condition to output the final parameter set further comprises:
if | F-F2| ≧ E, F2 ≧ F is recorded, and the process of updating the parameter set is re-executed until | F-F2| <E, the parameter set is output
Figure FDA0002262267890000038
10. A discriminant text clustering system based on minimum normalized information distance, comprising:
a text input unit which inputs a text data set to be clustered;
a text processing unit, which is internally provided with the discriminant text clustering method based on the minimum normalized information distance according to any one of claims 1 to 8, and realizes text clustering of the text data set;
and the text output unit is used for outputting the text clustering result.
CN201911079897.0A 2019-11-06 2019-11-06 Discriminant text clustering method and system based on minimum normalized information distance Active CN110955773B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911079897.0A CN110955773B (en) 2019-11-06 2019-11-06 Discriminant text clustering method and system based on minimum normalized information distance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911079897.0A CN110955773B (en) 2019-11-06 2019-11-06 Discriminant text clustering method and system based on minimum normalized information distance

Publications (2)

Publication Number Publication Date
CN110955773A true CN110955773A (en) 2020-04-03
CN110955773B CN110955773B (en) 2023-03-31

Family

ID=69976143

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911079897.0A Active CN110955773B (en) 2019-11-06 2019-11-06 Discriminant text clustering method and system based on minimum normalized information distance

Country Status (1)

Country Link
CN (1) CN110955773B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050234955A1 (en) * 2004-04-15 2005-10-20 Microsoft Corporation Clustering based text classification
CN104915386A (en) * 2015-05-25 2015-09-16 中国科学院自动化研究所 Short text clustering method based on deep semantic feature learning
CN110309302A (en) * 2019-05-17 2019-10-08 江苏大学 A kind of uneven file classification method and system of combination SVM and semi-supervised clustering

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050234955A1 (en) * 2004-04-15 2005-10-20 Microsoft Corporation Clustering based text classification
CN104915386A (en) * 2015-05-25 2015-09-16 中国科学院自动化研究所 Short text clustering method based on deep semantic feature learning
CN110309302A (en) * 2019-05-17 2019-10-08 江苏大学 A kind of uneven file classification method and system of combination SVM and semi-supervised clustering

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李钊;李晓;王春梅;李诚;杨春;: "一种基于MapReduce的文本聚类方法研究" *

Also Published As

Publication number Publication date
CN110955773B (en) 2023-03-31

Similar Documents

Publication Publication Date Title
Mensink et al. Distance-based image classification: Generalizing to new classes at near-zero cost
Bouguila Count data modeling and classification using finite mixtures of distributions
Huang et al. Object-location-aware hashing for multi-label image retrieval via automatic mask learning
CN112528928B (en) Commodity identification method based on self-attention depth network
US11210555B2 (en) High-dimensional image feature matching method and device
Champ et al. A comparative study of fine-grained classification methods in the context of the LifeCLEF plant identification challenge 2015
CN105631416A (en) Method for carrying out face recognition by using novel density clustering
CN104112018A (en) Large-scale image retrieval method
Chu et al. Stacked Similarity-Aware Autoencoders.
Bouguila A model-based approach for discrete data clustering and feature weighting using MAP and stochastic complexity
Gu et al. Image-based hot pepper disease and pest diagnosis using transfer learning and fine-tuning
Ihou et al. A new latent generalized dirichlet allocation model for image classification
CN105389588B (en) Based on multi-semantic meaning code book image feature representation method
Alalyan et al. Model-based hierarchical clustering for categorical data
CN112214570A (en) Cross-modal retrieval method and device based on counterprojection learning hash
Nakayama Linear distance metric learning for large-scale generic image recognition
Liu et al. Flower classification using fusion descriptor and SVM
Bassiou et al. Greek folk music classification into two genres using lyrics and audio via canonical correlation analysis
Foumani et al. A probabilistic topic model using deep visual word representation for simultaneous image classification and annotation
Gilbert et al. Image and video mining through online learning
CN110955773B (en) Discriminant text clustering method and system based on minimum normalized information distance
Hajigholam et al. Using sparse representation classifier (SRC) to calculate dynamic coefficients for multitask joint spatial pyramid matching
Passalis et al. Spatial bag of features learning for large scale face image retrieval
Liang et al. Random forest with suppressed leaves for Hough voting
Alalyan et al. An improved k-medoids algorithm based on binary sequences similarity measures

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant