CN111898704A - Method and device for clustering content samples


Info

Publication number
CN111898704A
Authority
CN
China
Prior art keywords
content
sample
clustering
samples
classifier
Prior art date
Legal status
Granted
Application number
CN202010824726.2A
Other languages
Chinese (zh)
Other versions
CN111898704B (en)
Inventor
卢东焕 (Lu Donghuan)
赵俊杰 (Zhao Junjie)
马锴 (Ma Kai)
郑冶枫 (Zheng Yefeng)
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010824726.2A
Publication of CN111898704A
Application granted
Publication of CN111898704B
Legal status: Active

Classifications

    • G06F 18/23213: Pattern recognition; clustering techniques; non-hierarchical techniques using statistics or function optimisation (e.g. modelling of probability density functions) with a fixed number of clusters, e.g. K-means clustering
    • G06F 18/214: Pattern recognition; design or setup of recognition systems or techniques; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/2431: Pattern recognition; classification techniques relating to the number of classes; multiple classes
    • G06N 3/045: Computing arrangements based on biological models; neural networks; architectures, e.g. interconnection topology; combinations of networks
    • G06N 3/084: Computing arrangements based on biological models; neural networks; learning methods; backpropagation, e.g. using gradient descent


Abstract

The application describes a method of clustering content samples, comprising: obtaining a data set comprising a plurality of unlabeled content samples; clustering the content samples with a plurality of mutually different clustering methods to determine, for each content sample under each clustering method, the category corresponding to the highest confidence in the confidence distribution over categories; for each content sample, in response to determining that the categories corresponding to the highest confidences into which the sample is clustered under the plurality of clustering methods are all the same, and that all of those highest confidences are greater than a confidence threshold, labeling the content sample to form a labeled content sample; training a content sample classifier with the labeled and unlabeled content samples to obtain a trained content sample classifier; and clustering content samples to be clustered with the trained classifier to determine their category.

Description

Method and device for clustering content samples
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a method and an apparatus for clustering content samples.
Background
Currently, when clustering content samples such as image data samples, voice data samples, and text data samples, a two-stage clustering method is generally employed. In the first stage, features are extracted from the content samples using an encoder; in the second stage, the extracted features are clustered using a basic clustering algorithm, such as the K-means algorithm, to obtain a category for each sample. However, such clustering methods are limited by the feature extraction capability of the encoder and often perform poorly, and they cannot obtain the categories of the content samples end to end (i.e., directly from the content samples, without using an encoder to first extract features). In addition, the basic clustering algorithm itself affects the accuracy of the clustering.
With the development of artificial intelligence, classifier-based clustering methods can obtain the category of a content sample end to end, but training a classifier with good clustering accuracy is difficult: training samples with accurately known categories are severely lacking, and the process of obtaining training samples is itself affected by the feature extraction capability of the encoder and by the basic clustering algorithm. Some studies attempt to jointly consider the clustering results of multiple basic clustering algorithms, but fuse them with simple weighted voting, which is quite ineffective.
Disclosure of Invention
In view of the above, the present disclosure provides methods and apparatus for determining a training set, methods and apparatus for training a content sample classifier, and methods and apparatus for clustering content samples, which desirably overcome some or all of the above-mentioned deficiencies, and possibly others.
According to a first aspect of the present disclosure, there is provided a method of clustering content samples, comprising: obtaining a data set comprising a plurality of unlabeled content samples; clustering the plurality of content samples with each of a plurality of mutually different clustering methods to determine, for each content sample under each clustering method, the category corresponding to the highest confidence in the confidence distribution over categories into which the content sample is clustered; for each content sample of the plurality of content samples, in response to determining that the categories corresponding to all of the highest confidences into which the content sample is clustered under the plurality of clustering methods are the same, and that all of the highest confidences are greater than a confidence threshold, labeling the content sample to form a labeled content sample, the label indicating the confidence of the labeled content sample for each of a plurality of categories corresponding to the output of a content sample classifier; training the content sample classifier with the labeled content samples and the unlabeled content samples in the data set to obtain a trained content sample classifier; and clustering content samples to be clustered with the trained content sample classifier to determine their category.
In some embodiments, the method further comprises: for every two clustering methods of the plurality of clustering methods, in response to the category corresponding to the highest confidence into which each content sample is clustered under the first of the two clustering methods sharing, among all categories produced by the first clustering method, the largest number of identical content samples with the category corresponding to the highest confidence into which that content sample is clustered under the second of the two clustering methods, determining that the categories corresponding to all of the highest confidences into which the content sample is clustered under the plurality of clustering methods are the same.
In some embodiments, labeling each content sample comprises: determining the average of the highest confidences of the categories into which the content sample is respectively clustered under the plurality of clustering methods; setting the first confidence in the label, for the category corresponding to the highest confidence, to that average; and setting the second confidence in the label for each of the other categories such that the sum of the first confidence and all of the second confidences is 1.
In some embodiments, labeling each content sample comprises: setting the first confidence in the label, for the category corresponding to the highest confidence, to 1, and setting the second confidence in the label for each of the other categories to 0.
In some embodiments, at least one of the plurality of clustering methods comprises a classifier-based clustering method, and, before training the content sample classifier with the labeled and unlabeled content samples in the data set, the method further comprises: training the classifier on which that clustering method is based with the labeled and unlabeled content samples; clustering the plurality of content samples with the clustering method based on the trained classifier to determine an updated confidence distribution over categories for each content sample under that method; and re-forming the labeled content samples based on the updated confidence distributions and the confidence distributions into which each content sample is respectively clustered under the other clustering methods of the plurality of clustering methods.
In some embodiments, at least one of the plurality of clustering methods is a K-means clustering method, and clustering the plurality of content samples with the K-means clustering method further comprises: extracting feature data of a plurality of dimensions for each of the plurality of content samples; reducing the dimensionality of the feature data; and clustering the plurality of content samples based on the reduced-dimension feature data.
In some embodiments, training the content sample classifier with the labeled and unlabeled content samples in the data set comprises: training the classifier by adjusting its parameters so that a total loss function is minimized, where the total loss function is the sum of the loss functions for each content sample. The loss function for each labeled content sample constrains the confidence distribution over categories that the classifier outputs for that sample to be close to the sample's label. The loss function for each unlabeled content sample constrains the confidence distributions the classifier outputs for the sample after different random transformations to be invariant, and constrains the output confidence distribution to be similar to a one-hot vector. A confidence distribution over categories includes a confidence for each of the plurality of categories.
In some embodiments, training the content sample classifier by adjusting its parameters so that the total loss function is minimized comprises: training the classifier by back propagation, wherein the learning rate for each parameter of the classifier is dynamically adjusted during back propagation by computing first and second moment estimates of the gradient of the loss function.
In some embodiments, training the content sample classifier with the labeled and unlabeled content samples in the data set comprises: training the classifier in batches, selecting the same number of labeled and unlabeled content samples each time.
In some embodiments, clustering content samples to be clustered with the trained content sample classifier to determine their category comprises: inputting the content samples to be clustered into the trained classifier, so that it outputs confidence distributions over the plurality of categories for those samples; and determining the category with the highest confidence in the confidence distribution as the category of the content sample to be clustered.
In some embodiments, each of the plurality of content samples comprises an image data sample, and the structure of the content sample classifier comprises a convolutional neural network.
In some embodiments, each of the plurality of content samples comprises one of a text data sample and a speech data sample, and the structure of the content sample classifier comprises one of a recurrent neural network, a long short-term memory network, and Bidirectional Encoder Representations from Transformers (BERT).
According to a second aspect of the present disclosure, there is provided an apparatus for clustering content samples, comprising: an acquisition module configured to acquire a data set comprising a plurality of unlabeled content samples; a clustering module configured to cluster the plurality of content samples with each of a plurality of mutually different clustering methods to determine, for each content sample under each clustering method, the category corresponding to the highest confidence in the confidence distribution over categories into which the content sample is clustered; a tagging module configured to, for each content sample of the plurality of content samples, in response to determining that the categories corresponding to all of the highest confidences into which the content sample is clustered under the plurality of clustering methods are the same, and that all of the highest confidences are greater than a confidence threshold, label the content sample to form a labeled content sample, the label indicating the confidence of the labeled content sample for each of a plurality of categories corresponding to the output of a content sample classifier; a training module configured to train the content sample classifier with the labeled and unlabeled content samples in the data set to obtain a trained content sample classifier; and a determination module configured to cluster content samples to be clustered with the trained content sample classifier to determine their category.
According to a third aspect of the present disclosure, there is provided a method of determining a training set, comprising: obtaining a data set comprising a plurality of unlabeled content samples; clustering the plurality of content samples with each of a plurality of mutually different clustering methods to determine, for each content sample under each clustering method, the category corresponding to the highest confidence in the confidence distribution over categories into which the content sample is clustered; for each content sample of the plurality of content samples, in response to determining that the categories corresponding to all of the highest confidences into which the content sample is clustered under the plurality of clustering methods are the same, and that all of the highest confidences are greater than a confidence threshold, labeling the content sample to form a labeled content sample, the label indicating the confidence of the labeled content sample for each of a plurality of categories corresponding to the output of a content sample classifier; and determining the labeled content samples and the unlabeled content samples in the data set together as a training set for the content sample classifier.
According to a fourth aspect of the present disclosure, there is provided an apparatus for determining a training set, comprising: a data set acquisition module configured to acquire a data set comprising a plurality of unlabeled content samples; a sample clustering module configured to cluster the plurality of content samples with each of a plurality of mutually different clustering methods to determine, for each content sample under each clustering method, the category corresponding to the highest confidence in the confidence distribution over categories into which the content sample is clustered; a sample tagging module configured to, for each content sample of the plurality of content samples, in response to determining that the categories corresponding to all of the highest confidences into which the content sample is clustered under the plurality of clustering methods are the same, and that all of the highest confidences are greater than a confidence threshold, label the content sample to form a labeled content sample, the label indicating the confidence of the labeled content sample for each of a plurality of categories corresponding to the output of a content sample classifier; and a training set determination module configured to determine the labeled and unlabeled content samples in the data set together as a training set for the content sample classifier.
According to a fifth aspect of the present disclosure, there is provided a method of training a content sample classifier, comprising: obtaining a training set for a content sample classifier, wherein the training set is determined according to the method of the third aspect of the present disclosure and comprises labeled content samples and unlabeled content samples; training the content sample classifier using the labeled content samples and unlabeled content samples in a training set.
According to a sixth aspect of the present disclosure, there is provided an apparatus for training a content sample classifier, comprising: a training set obtaining module configured to obtain a training set for a content sample classifier, wherein the training set is determined by the apparatus for determining a training set according to the fourth aspect of the present disclosure and includes labeled content samples and unlabeled content samples; a classifier training module configured to train the content sample classifier using the labeled and unlabeled content samples in a training set.
According to a seventh aspect of the present disclosure, there is provided a computing device comprising a processor; and a memory configured to have computer-executable instructions stored thereon that, when executed by the processor, perform any of the methods described above.
According to an eighth aspect of the present disclosure, there is provided a computer-readable storage medium storing computer-executable instructions that, when executed, perform any of the methods described above.
In the methods and apparatus for determining a training set, for training a content sample classifier, and for clustering content samples claimed by the present disclosure, the "clean" samples and "noise" samples in the data set can be accurately identified by fully exploiting the results of clustering the content samples in the data set with a plurality of clustering methods. Labeling the "clean" samples, and training the content sample classifier with both the labeled "clean" samples and the "noise" samples, greatly improves the clustering accuracy and generalization of the trained classifier, while allowing it to obtain the category of a content sample end to end.
These and other advantages of the present disclosure will become apparent from and elucidated with reference to the embodiments described hereinafter.
Drawings
Embodiments of the present disclosure will now be described in more detail and with reference to the accompanying drawings, in which:
FIG. 1 illustrates a schematic flow diagram of a method of determining a training set according to one embodiment of the present disclosure;
FIG. 2 illustrates a schematic flow diagram of a method of training a content sample classifier according to one embodiment of the present disclosure;
FIG. 3 illustrates a schematic flow chart diagram of a method of clustering content samples according to one embodiment of the present disclosure;
FIG. 4 illustrates an exemplary flow chart of a method of updating a "clean" sample according to one embodiment of the present disclosure;
FIG. 5 illustrates an exemplary schematic diagram of clustering content samples according to one embodiment of the present disclosure;
FIG. 6 illustrates an exemplary block diagram of an apparatus for determining a training set according to one embodiment of the present disclosure;
FIG. 7 illustrates an exemplary block diagram of an apparatus for training a content sample classifier according to one embodiment of the present disclosure;
FIG. 8 illustrates an exemplary block diagram of an apparatus for clustering content samples according to one embodiment of the present disclosure;
FIG. 9 illustrates an example system that includes an example computing device that represents one or more systems and/or devices that may implement the various techniques described herein.
Detailed Description
The following description provides specific details of various embodiments of the disclosure so that those skilled in the art can fully understand and practice them. It is understood that aspects of the disclosure may be practiced without some of these details. In some instances, well-known structures or functions are not shown or described in detail to avoid obscuring the description of the embodiments. The terminology used in the present disclosure should be interpreted in its broadest reasonable manner, even when it is used in conjunction with a particular embodiment of the present disclosure.
First, some terms referred to in the embodiments of the present application are explained to facilitate understanding by those skilled in the art.
Clustering: the process of dividing a collection of physical or abstract objects into classes composed of similar objects. As an important method of unsupervised learning, the idea of clustering is to group samples or objects with similar attributes into one class. A class generated by clustering is a collection of objects that are similar to objects in the same class and distinct from objects in other classes. Common clustering methods include K-means clustering, mean-shift clustering, density-based clustering methods, and the like.
Classifier: the conventional task of a classifier is to learn classification rules from known training data of given classes and then classify or predict unknown data. Classifier is a general term for methods that classify samples in data mining, and includes algorithms such as decision trees, logistic regression, naive Bayes, and neural networks.
Semi-supervised learning: a training/learning mode in machine learning that lies between supervised and unsupervised learning and combines the two. In semi-supervised learning, one part of the training data is labeled and the other part is unlabeled, and the amount of unlabeled data is often much larger than the amount of labeled data (which is also realistic). The premise underlying semi-supervised learning is that the distribution of the data is not completely random: acceptable, or even very good, classification results can be obtained from local features of the labeled data combined with the overall distribution of the much larger body of unlabeled data.
Back propagation: a gradient descent algorithm with a recursive structure, widely used as the basic training method for deep neural networks.
Artificial Intelligence (AI): the theory, methods, technology, and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning, and decision making. It is a comprehensive discipline spanning a wide range of fields, covering both hardware-level and software-level technology. Basic artificial intelligence technology generally includes sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technology mainly comprises computer vision, speech processing, natural language processing, and machine learning/deep learning.
Fig. 1 illustrates a schematic flow diagram of a method 100 of determining a training set according to one embodiment of the present disclosure. The training set may be used to train a content sample classifier for classifying content samples to derive categories of content samples. As shown in fig. 1, the method 100 includes the following steps 101-104.
In step 101, a data set comprising a plurality of unlabeled content samples is obtained. Each of the plurality of content samples may be, for example, an image data sample, a text data sample, or a voice data sample; the type of content sample is not limited thereto.
In step 102, the plurality of content samples are clustered with each of a plurality of mutually different clustering methods to determine, for each content sample under each clustering method, the category corresponding to the highest confidence in the confidence distribution over the categories into which the sample is clustered. Each clustering method may be any suitable method, such as K-means clustering, DEC (Deep Embedded Clustering), IDEC (Improved Deep Embedded Clustering), DCEC (Deep Convolutional Embedded Clustering), or an artificial-intelligence classifier-based clustering method, although this is not limiting.
As an example, given $N$ unlabeled content samples $\{x_i\}_{i=1}^{N}$, clustering them with the mutually different clustering methods yields confidence distributions $\{p^{(m)}_{ik}\}$, where $M$ is the number of clustering methods, $K$ is the number of categories produced by each clustering method, and $p^{(m)}_{ik}$ denotes the confidence that the $i$-th sample belongs to the $k$-th category when the $m$-th clustering method is used. It should be noted that the number of categories formed by each clustering method is generally set in advance and is typically set the same for all clustering methods.
In some embodiments, for a classifier-based clustering method, the classifier can directly output the confidence of each content sample for each of the plurality of categories, i.e., the confidence distribution over categories. The highest confidence can then be determined from each content sample's confidence distribution, and the corresponding category derived from it. For example, if under the 1st clustering method the classifier outputs for the 1st sample a confidence distribution $p^{(1)}_{1}$ whose largest entry is 0.90 at the 2nd position, then the highest confidence of that sample is 0.90 and the category corresponding to it is the 2nd category.
In some embodiments, for a clustering method such as K-means that directly yields the cluster category (i.e., the category corresponding to the highest confidence) rather than a confidence for each category, the confidence of a sample for each category (i.e., the sample's confidence distribution over categories) can be computed with the Student's t-distribution:

$$q_{ik} = \frac{\left(1 + \lVert z_i - \mu_k \rVert^2 / \alpha \right)^{-\frac{\alpha+1}{2}}}{\sum_{k'} \left(1 + \lVert z_i - \mu_{k'} \rVert^2 / \alpha \right)^{-\frac{\alpha+1}{2}}}$$

where $\alpha$ is the degree of freedom of the Student's t-distribution, usually set to 1, $z_i$ is the feature of the $i$-th content sample, and $\mu_k$ is the center point of the $k$-th category obtained by the clustering (typically the mean of the spatial coordinates corresponding to the features of the content samples in that category).
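A minimal sketch of this soft-assignment computation, assuming per-sample features `z` and cluster centers `mu` have already been obtained (e.g., from K-means):

```python
import numpy as np

def student_t_confidence(z, mu, alpha=1.0):
    """Soft cluster assignment via the Student's t-kernel.

    z:  (N, d) array of per-sample features.
    mu: (K, d) array of cluster center points.
    Returns an (N, K) array whose rows are confidence distributions.
    """
    # Squared distance of every sample to every center: shape (N, K)
    dist2 = ((z[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
    # Unnormalized t-kernel with `alpha` degrees of freedom
    q = (1.0 + dist2 / alpha) ** (-(alpha + 1.0) / 2.0)
    # Normalize each row so the confidences over the K categories sum to 1
    return q / q.sum(axis=1, keepdims=True)
```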
In some embodiments, at least one of the plurality of clustering methods is a K-means clustering method. In that case, when the plurality of content samples are clustered with the K-means method, feature data of a plurality of dimensions can be extracted for each content sample; the dimensionality of that feature data is then reduced; finally, the content samples are clustered based on the reduced-dimension feature data. Extracting multi-dimensional feature data from the content samples improves the accuracy of the clustering, and reducing the dimensionality of the feature data reduces the computation required by the clustering process. As an example, the feature data can be reduced with Principal Component Analysis (PCA), which converts a group of possibly correlated variables into a group of linearly uncorrelated variables, called principal components, through an orthogonal transformation. It should be noted that PCA is only an example; virtually any method that can convert multidimensional feature data into data of fewer dimensions can be used, and no limitation is intended here.
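A sketch of this pipeline with scikit-learn; the choice of 50 principal components is an illustrative assumption, not fixed by the method:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def kmeans_with_pca(features, n_clusters, n_components=50):
    """Reduce multi-dimensional feature data with PCA, then K-means cluster it.

    features: (N, D) feature data extracted from the N content samples.
    Returns (labels, centers) in the reduced feature space.
    """
    # Orthogonal transform to linearly uncorrelated principal components
    reduced = PCA(n_components=n_components).fit_transform(features)
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(reduced)
    return km.labels_, km.cluster_centers_
```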
At step 103, for each content sample of the plurality of content samples, in response to determining that the categories corresponding to all of the highest confidences into which the content sample is clustered under the plurality of clustering methods are the same, and that all of those highest confidences are greater than the confidence threshold, the content sample is labeled to form a labeled content sample. The label indicates the confidence of the labeled content sample for each of a plurality of categories corresponding to the output of a content sample classifier. In the embodiments of the present disclosure, if a content sample simultaneously satisfies the two conditions that (a) the categories corresponding to all of the highest confidences into which it is respectively clustered under the plurality of clustering methods are the same and (b) all of those highest confidences are greater than the confidence threshold, it can be concluded that the content sample is clustered into the same category under the different clustering methods; such a content sample can be determined to be a "clean" sample and can therefore be labeled. Correspondingly, content samples in the data set that do not satisfy both conditions (a) and (b) are determined to be "noise" samples and are not labeled, because the category into which such a content sample is clustered is likely to be inaccurate.
Taking two different clustering methods as an example: if the two categories corresponding to the highest confidences into which a content sample is clustered under the two methods are the same, and each of the two highest confidences is greater than the confidence threshold, the content sample can be determined to be a "clean" sample and can therefore be labeled. This can be expressed as:

$$c_i = \begin{cases} \text{True}, & \text{if } k^{(1)}_i = k^{(2)}_i \ \text{and} \ p^{(1)}_i > \tau \ \text{and} \ p^{(2)}_i > \tau \\ \text{False}, & \text{otherwise} \end{cases}$$

where $c_i$ indicates whether the $i$-th content sample is a "clean" sample ($\text{True}$ meaning it is, $\text{False}$ meaning it is not), $\tau$ is the confidence threshold, $p^{(1)}_i$ and $p^{(2)}_i$ are the highest confidences of the categories into which the $i$-th content sample is clustered under the 1st and 2nd clustering methods, respectively, and $k^{(1)}_i$ and $k^{(2)}_i$ are the two categories corresponding to those highest confidences (their equality being judged after the category alignment described below).
Since the categories produced by a clustering method are generated randomly, the categories produced by different clustering methods do not correspond to one another, which makes condition (a) nontrivial to check. In some embodiments of the present disclosure, for every two of the plurality of clustering methods, the category into which a content sample is clustered with the highest confidence under the first method is considered the same as the category into which it is clustered with the highest confidence under the second method if, among all categories produced by the second method, that second-method category shares the largest number of identical content samples with the first-method category. This provides an efficient and accurate way to decide, when two or more clustering methods are used, whether the categories corresponding to the highest confidences into which each content sample is clustered under the different methods are the same.
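A sketch of this matching rule for two clustering methods; the greedy maximum-overlap matching and the helper names are assumptions about one way to realize the criterion:

```python
import numpy as np

def align_clusters(labels_a, labels_b, k):
    """Map each category id of method A to the method-B category that
    shares the most content samples with it (maximum-overlap criterion)."""
    overlap = np.zeros((k, k), dtype=int)
    for a, b in zip(labels_a, labels_b):
        overlap[a, b] += 1
    return {a: int(overlap[a].argmax()) for a in range(k)}

def clean_mask(conf_a, conf_b, k, tau=0.95):
    """Flag "clean" samples: the aligned top-1 categories agree and both
    highest confidences exceed the threshold tau."""
    top_a, top_b = conf_a.argmax(axis=1), conf_b.argmax(axis=1)
    mapping = align_clusters(top_a, top_b, k)
    same_cat = np.array([mapping[a] == b for a, b in zip(top_a, top_b)])
    return same_cat & (conf_a.max(axis=1) > tau) & (conf_b.max(axis=1) > tau)
```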
In some embodiments, when labeling each of the unlabeled content samples, the label $y_i = (y_{i1}, \dots, y_{iK})$ of the $i$-th content sample can be set as

$$y_{ik} = \begin{cases} \dfrac{1}{J} \sum_{j=1}^{J} p^{(j)}_i, & k = m \\[1ex] \dfrac{1 - y_{im}}{K - 1}, & k \neq m \end{cases}$$

where $i$ is the sequence number of the content sample in the data set, $K$ is the number of categories produced by each clustering method, $y_{ik}$ is the confidence of the $i$-th content sample for the $k$-th category ($k \le K$), $m$ is the category corresponding to the highest confidence of the $i$-th content sample, $J$ is the total number of clustering methods, and $p^{(j)}_i$ is the highest confidence of the $i$-th content sample under the $j$-th clustering method. It should be noted that the specific value or identification of the category $m$ corresponding to the highest confidence of each content sample may be predetermined (e.g., determined as the 2nd category), as long as the determined categories are consistent with the categories corresponding to the confidences output by the content sample classifier.
In other words, when labeling a content sample, the average of the highest confidences of the categories into which it is respectively clustered under the plurality of clustering methods is determined, and the first confidence in the label, for the category corresponding to the highest confidence (i.e., the confidence corresponding to the expected category of the content sample), is set to that average; the second confidence in the label for each of the other categories is then set so that the sum of the first confidence and all of the second confidences is 1, i.e., each second confidence is set to the ratio of the difference between 1 and the average to the number of other categories.
In some embodiments, when labeling a content sample, the first confidence in the label, for the category corresponding to the highest confidence (i.e., the confidence corresponding to the expected category of the content sample), may instead be set to 1 (i.e., 100% confidence), and the second confidence for each of the other categories set to 0. Of course, this is not limiting, as long as the category corresponding to the highest confidence is clearly distinguished from the remaining categories.
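A sketch of both labeling variants described above (soft labels built from the averaged highest confidences, and the hard one-hot alternative):

```python
import numpy as np

def soft_label(confs, i, m, k):
    """Label sample i: the category m corresponding to the highest confidence
    receives the average of the highest confidences over the J clustering
    methods; the remaining K-1 categories share the rest so the label sums to 1.

    confs: list of J (N, K) confidence arrays, one per clustering method.
    """
    avg = np.mean([c[i].max() for c in confs])  # average highest confidence
    label = np.full(k, (1.0 - avg) / (k - 1))
    label[m] = avg
    return label

def hard_label(m, k):
    """One-hot variant: confidence 1 for category m, 0 elsewhere."""
    label = np.zeros(k)
    label[m] = 1.0
    return label
```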
In some embodiments, at least one of the plurality of clustering methods is a classifier-based clustering method as described above. In that case, the "clean" samples can be updated, making the obtained "clean" samples more accurate and improving the clustering accuracy of the finally trained content sample classifier. As an example, FIG. 4 illustrates an exemplary flow chart of a method of updating the "clean" samples. As shown in FIG. 4, in step 401, the labeled and unlabeled content samples are used to train the classifier on which the clustering method is based. In step 402, the plurality of content samples are clustered with the clustering method based on the trained classifier to determine an updated confidence distribution over categories for each content sample under that method. In step 403, the labeled content samples are re-formed based on the updated confidence distributions and the confidence distributions into which each content sample is clustered under the other clustering methods; step 403 amounts to performing step 103 once more. Optionally, the update procedure may be repeated until the set of "clean" samples no longer changes, although this is not limiting.
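A sketch of this update loop; the helper callables `train_classifier_method`, `cluster_all`, and `make_labels` are hypothetical stand-ins for steps 401, 402, and 403:

```python
def update_clean_samples(samples, other_confs, train_classifier_method,
                         cluster_all, make_labels, max_rounds=10):
    """Repeat steps 401-403: retrain the classifier-based clustering method
    on the current labeled/unlabeled split, recluster to get updated
    confidence distributions, and re-form the labeled set, until the set
    of "clean" samples no longer changes (or max_rounds is reached)."""
    labeled = {}  # sample index -> label vector
    for _ in range(max_rounds):
        model = train_classifier_method(samples, labeled)   # step 401
        clf_confs = cluster_all(model, samples)             # step 402
        new_labeled = make_labels(clf_confs, other_confs)   # step 403
        if set(new_labeled) == set(labeled):                # "clean" set stable
            break
        labeled = new_labeled
    return labeled
```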
At step 104, the labeled content samples and the unlabeled content samples in the data set are together determined as the training set for the content sample classifier. The labeled content samples are those labeled in step 103; the unlabeled content samples are the remaining unlabeled samples in the data set. Together they form the training set with which the content sample classifier can be trained.
It should be noted that the embodiments of the present disclosure do not limit the specific structure of the content sample classifier, which can be adapted to the type of content sample. For example, when each of the plurality of content samples is an image data sample or a voice data sample, the structure of the content sample classifier may be a CNN (Convolutional Neural Network) or the like. When each content sample is a text data sample, the structure may be an RNN (Recurrent Neural Network), an LSTM (Long Short-Term Memory) network, BERT (Bidirectional Encoder Representations from Transformers), or the like.
It should be noted that the term "plurality" in the embodiments of the present disclosure includes two as well as more than two, unless otherwise specified. For example, the plurality of clustering methods may include two clustering methods and more than two clustering methods.
In the method for determining a training set described in the embodiments of the present disclosure, all content samples are divided into a set of "clean" samples and a set of "noise" samples using the results of a plurality of clustering methods on the content samples in the data set. The "clean" samples are then labeled to form labeled content samples, with the labels constructed from the outputs of the plurality of clustering methods, and the "noise" samples are kept as unlabeled content samples. After this division, the labeled and unlabeled content samples can together serve as the training set with which the content sample classifier is trained. The method makes full use of the clustering results on the data set and can accurately identify the "clean" and "noise" samples. Labeling the "clean" samples and training the classifier on them together with the "noise" samples greatly improves the clustering accuracy and generalization of the trained content sample classifier, which can moreover obtain the category of a content sample end to end.
Fig. 2 illustrates a schematic flow diagram of a method 200 of training a content sample classifier according to one embodiment of the present disclosure. As shown in fig. 2, the method 200 includes the following steps.
In step 201, a training set for a content sample classifier is obtained. The content sample classifier is used for clustering the content samples to obtain the category of the content samples. The training set is determined, for example, according to the method 100 described with reference to fig. 1 and includes labeled content samples and unlabeled content samples. Each of the labeled content sample and the unlabeled content sample may be an image data sample, a text data sample, or a voice data sample, and the type of the content sample is not limited herein.
It should be noted that the embodiments of the present disclosure do not limit the specific structure of the content sample classifier, which can be adapted to the type of content sample. For example, when each of the plurality of content samples is an image data sample or a voice data sample, the structure of the content sample classifier may be a CNN (Convolutional Neural Network) or the like. When each content sample is a text data sample, the structure may be an RNN (Recurrent Neural Network), an LSTM (Long Short-Term Memory) network, BERT (Bidirectional Encoder Representations from Transformers), or the like.
At step 202, the content sample classifier is trained with the labeled and unlabeled content samples in the training set. As an example, the classifier may be trained in a semi-supervised learning manner. Using the labeled content samples improves the clustering accuracy of the trained classifier, and using the unlabeled content samples improves its clustering generalization (i.e., generalization capability).
In some embodiments, when training the content sample classifier with the labeled and unlabeled content samples in the training set, the classifier may be trained by adjusting its parameters so that a total loss function is minimized. The total loss function is the sum of the loss functions for each content sample (both labeled and unlabeled). As an example, the loss function for each labeled content sample constrains the confidence distribution over categories that the classifier outputs for the sample to be close to the sample's label; the loss function for each unlabeled content sample constrains the confidence distributions the classifier outputs for the sample after different random transformations to be invariant, and constrains the output distribution to be similar to a one-hot vector. A confidence distribution over categories includes a confidence for each of the plurality of categories.
As an example, the total loss function can be expressed as:

$$\mathcal{L} = \sum_{i} \ell_i, \qquad \ell_i = \begin{cases} \ell_{ce}\left(f(x_i),\, y_i\right), & c_i = \text{True} \\ \ell_{ce}\left(f(T_1(x_i)),\, f(T_2(x_i))\right) + \ell_{ent}\left(f(T_1(x_i))\right), & c_i = \text{False} \end{cases}$$

where $i$ is the sequence number of the content sample in the data set, $\ell_i$ is the loss function of the $i$-th content sample, $\ell_{ce}$ denotes the cross-entropy loss function, $\ell_{ent}$ denotes the entropy loss function, $f(\cdot)$ denotes the confidence distribution over categories output by the classifier, $c_i$ indicates whether the $i$-th content sample is a "clean" sample ($\text{True}$ meaning it is, $\text{False}$ meaning it is a "noise" sample), and $T_1$ and $T_2$ denote a first and a second random transformation, different from each other, applied to the $i$-th content sample.
In the above expression, when the $i$-th content sample is a "clean" sample (i.e., $c_i = \text{True}$), the loss function for the sample is $\ell_{ce}(f(x_i), y_i)$, which constrains the confidence distribution over categories that the classifier outputs for the $i$-th content sample to be close to that sample's label: the parameters of the content sample classifier are adjusted so that the output distribution matches the label of the $i$-th content sample as closely as possible.
When the $i$-th content sample is a "noise" sample (i.e., $c_i = \text{False}$), the loss function for the sample is $\ell_{ce}(f(T_1(x_i)), f(T_2(x_i))) + \ell_{ent}(f(T_1(x_i)))$. The term $\ell_{ce}(f(T_1(x_i)), f(T_2(x_i)))$ constrains the invariance between the confidence distribution the classifier outputs for the $i$-th content sample after the first random transformation and the distribution it outputs after the second random transformation. For "noise" samples, whose labels cannot be determined, the embodiments of the present disclosure train on the basis of data-enhancement invariance: the confidence distribution the classifier outputs for a content sample should not be affected by random transformations. The term $\ell_{ent}(f(T_1(x_i)))$ constrains the similarity of the classifier's output distribution to a one-hot vector: by adjusting the parameters of the content sample classifier, the confidence distribution output for the $i$-th content sample is made as close as possible to the form of a one-hot vector. A one-hot vector is a vector formed by one-hot encoding, also known as one-bit-effective encoding, which uses an N-bit status register to encode N states, each state having its own independent register bit, with only one bit valid at any time. For example, six one-hot vectors formed by one-hot encoding could be 000001, 000010, 000100, 001000, 010000, and 100000, each with exactly one valid bit.
It should be noted that the random transformations described above may be any suitable random transformations. For example, when the content sample is an image data sample, the random transformation may be random cropping, random horizontal flipping, color jittering, random recombination of color channels, or the like. When the content sample is a text data sample, the random transformation may be translating the content sample into another language and then back into the original language (the semantics are unchanged, but the text changes). It should also be noted that the loss function described above is not limiting; any other suitable loss function may be used.
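A PyTorch sketch of this loss under stated assumptions: the classifier `f` is assumed to output a softmax confidence distribution, `t1` and `t2` are assumed random-transform callables, and detaching one view as the consistency target is a common implementation choice rather than something the method specifies:

```python
import torch

def sample_loss(f, x, is_clean, y=None, t1=None, t2=None):
    """Loss for one batch of content samples.

    Clean samples: cross-entropy between the classifier's output f(x)
    and the constructed label y.
    Noise samples: cross-entropy between the outputs for two different
    random transforms of x (data-enhancement invariance) plus the
    entropy of the output (pushing it toward a one-hot vector).
    """
    eps = 1e-8
    if is_clean:
        return -(y * torch.log(f(x) + eps)).sum(dim=1).mean()
    p1, p2 = f(t1(x)), f(t2(x))
    invariance = -(p2.detach() * torch.log(p1 + eps)).sum(dim=1).mean()
    one_hotness = -(p1 * torch.log(p1 + eps)).sum(dim=1).mean()  # entropy
    return invariance + one_hotness
```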
In some embodiments, when training the content sample classifier with the labeled and unlabeled content samples in the training set, the classifier can be trained in batches, selecting the same number of labeled and unlabeled content samples from the training set each time. For example, each sampling may draw $B$ "clean" samples and $B$ "noise" samples, $B$ being a positive integer, whose losses are then summed as the total loss function for that training step. Using the same number of labeled and unlabeled content samples balances the clustering accuracy and the generalization of the trained content sample classifier, giving it a better clustering effect.
As an example, the parameters of the content sample classifier can be adjusted by back propagation according to the total loss function described above. During back propagation, the learning rate for each parameter of the classifier is dynamically adjusted by computing first and second moment estimates of the gradient of the loss function. Back propagation is essentially based on gradient descent; in training, the step length can be 0.001, the exponential decay rate $\beta_1$ of the first moment estimate can be set to 0.9, and the exponential decay rate $\beta_2$ of the second moment estimate can be set to 0.999. The batch size $B$ during optimization can be set to 128, and L2 regularization with a regularization coefficient of 0.0001 can be used for all parameters. The step length is decayed to 0.1 times its previous value for every 50 passes over the content samples in the training set. Generally, the confidence threshold $\tau$ used when partitioning the "clean" samples can be set to 0.95. A content sample classifier trained in this way shows a good clustering effect.
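These hyperparameters match the update rule of the Adam optimizer; a sketch of the corresponding configuration, in which `model` is a placeholder for the content sample classifier:

```python
import torch

model = torch.nn.Linear(50, 10)  # placeholder for the content sample classifier

# Adam adapts each parameter's learning rate from first- and second-moment
# estimates of the gradient, matching the description above.
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-3,             # step length 0.001
    betas=(0.9, 0.999),  # exponential decay rates of the moment estimates
    weight_decay=1e-4,   # L2 regularization coefficient 0.0001
)
# Decay the step length to 0.1x after every 50 passes over the training set
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)

TAU = 0.95  # confidence threshold for partitioning "clean" samples
B = 128     # batch size
```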
In the method of training a content sample classifier described in the embodiments of the present disclosure, the classifier is trained with the labeled and unlabeled content samples as the training set, which greatly improves the clustering accuracy and generalization of the trained classifier and hence its clustering effect, while the trained classifier obtains the category of a content sample end to end.
Fig. 3 illustrates a schematic flow diagram of a method 300 of clustering content samples according to one embodiment of the present disclosure. As shown in fig. 3, the method 300 includes the following steps.
In step 301, a data set comprising a plurality of unlabeled content samples is obtained. Each of the plurality of content samples may be, for example, an image data sample, a text data sample, or a voice data sample; the type of content sample is not limited thereto.
In step 302, the plurality of content samples are clustered with each of a plurality of mutually different clustering methods to determine, for each content sample under each clustering method, the category corresponding to the highest confidence in the confidence distribution over the categories into which the sample is clustered. Each clustering method may be any suitable method, such as K-means clustering, DEC (Deep Embedded Clustering), IDEC (Improved Deep Embedded Clustering), DCEC (Deep Convolutional Embedded Clustering), or an artificial-intelligence classifier-based clustering method, although this is not limiting.
As an example, given $N$ unlabeled content samples $\{x_i\}_{i=1}^{N}$, clustering them with the mutually different clustering methods yields confidence distributions $\{p^{(m)}_{ik}\}$, where $M$ is the number of clustering methods, $K$ is the number of categories produced by each clustering method, and $p^{(m)}_{ik}$ denotes the confidence that the $i$-th sample belongs to the $k$-th category when the $m$-th clustering method is used. It should be noted that the number of categories formed by each clustering method is generally set in advance and is typically set the same for all clustering methods.
In some embodiments, for classifier-based clustering methods, the classifier can directly output the confidence of each content sample belonging to each category. The highest confidence may then be determined from these confidences, and the category corresponding to the highest confidence obtained accordingly. In some embodiments, for a clustering method such as K-means that directly yields the cluster category (i.e., the category corresponding to the highest confidence) rather than a confidence for each category, the confidence of a sample for each category (i.e., the confidence distribution of the category of the sample) may be calculated using Student's t-distribution, as described with reference to step 102 of fig. 1.
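As a concrete illustration of the Student's t-distribution computation, the following sketch assumes the common DEC-style kernel with one degree of freedom; the function name and the NumPy formulation are assumptions for illustration.

```python
# A minimal sketch: confidence distribution over K-means clusters via a
# Student's t-kernel. z is an (n, d) array of sample features and centers
# is a (K, d) array of cluster centers.
import numpy as np

def t_distribution_confidences(z: np.ndarray, centers: np.ndarray) -> np.ndarray:
    # Squared distance from every sample to every cluster center: (n, K).
    d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    # Student's t-kernel with one degree of freedom: similarity decays with distance.
    q = 1.0 / (1.0 + d2)
    # Normalize per sample so each row is a confidence distribution summing to 1.
    return q / q.sum(axis=1, keepdims=True)
```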
In some embodiments, at least one of the plurality of clustering methods is a K-means clustering method. In this case, when the plurality of content samples are clustered using the K-means clustering method, feature data of a plurality of dimensions may first be extracted for each of the plurality of content samples; the feature data of the plurality of dimensions is then reduced in dimension; and finally, the plurality of content samples are clustered based on the reduced-dimension feature data. Extracting feature data of multiple dimensions from a content sample improves the accuracy of clustering, while reducing the dimension of that feature data simplifies the computation of the clustering process.
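A minimal sketch of this K-means path, assuming scikit-learn; the choice of PCA for the dimension reduction and the target of 50 dimensions are illustrative assumptions.

```python
# A minimal sketch: reduce multi-dimensional features, then run K-means.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def kmeans_with_reduction(features: np.ndarray, n_clusters: int):
    # Reduce the feature data to cut the cost of the clustering process.
    reduced = PCA(n_components=50).fit_transform(features)
    # Cluster the samples in the reduced feature space.
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(reduced)
    return km.labels_, km.cluster_centers_, reduced
```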
At step 303, for each content sample of the plurality of content samples, in response to determining that the categories corresponding to all the highest confidences into which the content sample is respectively clustered under the plurality of clustering methods are the same, and that all the highest confidences are greater than a confidence threshold, the content sample is labeled to form a labeled content sample, where the label indicates the confidence of the labeled content sample for each category in a plurality of categories corresponding to the output of the content sample classifier.
In the embodiments of the present disclosure, if a content sample simultaneously satisfies the two conditions that (a) the categories corresponding to all the highest confidences into which the content sample is respectively clustered under the multiple clustering methods are the same and (b) all the highest confidences are greater than the confidence threshold, it may be concluded that the content sample is clustered into the same category with high accuracy under the different clustering methods; such a content sample may be determined to be a "clean" sample and may therefore be labeled. Correspondingly, a content sample in the data set that fails either of the two conditions (a) and (b) may be determined to be a "noise" sample and is therefore not labeled, since the category into which such a content sample is clustered is likely to be inaccurate.
Taking the example of using two different clustering methods, if the two categories corresponding to the highest confidence levels into which a content sample is clustered under the two different clustering methods, respectively, are the same, and each of the two highest confidence levels is greater than a confidence level threshold, the content sample may be determined to be a "clean" sample and may therefore be labeled.
Since the categories produced by a clustering method are generated arbitrarily, the categories produced by different clustering methods do not directly correspond to one another, so checking condition (a) involves a certain difficulty. In some embodiments of the present disclosure, for every two clustering methods of the plurality of clustering methods, the category corresponding to the highest confidence into which each content sample is clustered under the first clustering method is deemed the same as the category corresponding to the highest confidence into which the content sample is clustered under the second clustering method if, among all the categories formed by the first clustering method, that category shares the largest number of identical content samples with the category from the second clustering method. This provides an efficient and accurate way to determine, when two or more clustering methods are used, whether the categories corresponding to the highest confidences into which each content sample is respectively clustered under the plurality of clustering methods are the same.
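One hedged way to realize this matching is to count, for each pair of category indices, how many samples the two methods place together, and treat the pair with the largest overlap as the same category. The sketch below assumes per-sample hard category indices from each method; the function name and array layout are illustrative.

```python
# A minimal sketch of matching category indices between two clustering
# methods by maximum overlap. labels_a and labels_b are per-sample category
# indices from methods A and B; k is the (shared) number of categories.
import numpy as np

def matched_category(labels_a: np.ndarray, labels_b: np.ndarray, k: int) -> np.ndarray:
    # overlap[a, b] = number of samples placed in category a by method A
    # and in category b by method B.
    overlap = np.zeros((k, k), dtype=int)
    for a, b in zip(labels_a, labels_b):
        overlap[a, b] += 1
    # For each category b of method B, the method-A category with the
    # largest overlap is treated as the "same" category.
    return overlap.argmax(axis=0)

# A sample i is then consistent across the two methods when
# matched_category(labels_a, labels_b, k)[labels_b[i]] == labels_a[i].
```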
In some embodiments, when labeling a content sample, an average of the highest confidences of the categories into which the content sample is respectively clustered under the plurality of clustering methods may be determined; the first confidence in the label, for the category corresponding to the highest confidence (i.e., the confidence of the expected category of the content sample), is set to this average, and the second confidence for each of the other categories is set so that the first confidence and all the second confidences sum to 1, i.e., each second confidence equals the difference between 1 and the average divided by the number of other categories. In other embodiments, when labeling a content sample, the first confidence for the category corresponding to the highest confidence may simply be labeled 1 (i.e., 100% confidence), and the second confidence of each other category labeled 0. Of course, this is not limiting, as long as the category corresponding to the highest confidence is clearly distinguished from the remaining categories.
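The two labeling schemes might look as follows; confs, cls, and k are illustrative names for the per-method highest confidences of a "clean" sample, its matched category index, and the number of categories.

```python
# A minimal sketch of the two labeling schemes described above.
import numpy as np

def soft_label(confs: list[float], cls: int, k: int) -> np.ndarray:
    avg = float(np.mean(confs))                 # first confidence: the average
    label = np.full(k, (1.0 - avg) / (k - 1))   # remaining mass split evenly
    label[cls] = avg                            # so the label sums to 1
    return label

def hard_label(cls: int, k: int) -> np.ndarray:
    label = np.zeros(k)
    label[cls] = 1.0                            # one-hot: full confidence
    return label
```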
At step 304, the content sample classifier is trained using the labeled content samples and unlabeled content samples in the dataset to obtain a trained content sample classifier. As an example, the content sample classifier can be trained in a semi-supervised learning manner using the labeled and unlabeled content samples. The clustering accuracy of the trained content sample classifier can be improved by using the labeled content samples, and the clustering generalization (i.e., generalization capability) of the trained content sample classifier can be improved by using the unlabeled content samples.
In some embodiments, when training the content sample classifier with the labeled and unlabeled content samples in a training set, the classifier may be trained by adjusting its parameters so that a total loss function is minimized. The total loss function is the sum of the loss functions for each content sample (both labeled and unlabeled). As an example, the loss function for each labeled content sample constrains how close the confidence distribution of categories output by the content sample classifier for that sample is to its label. The loss function for each unlabeled content sample constrains both the invariance of the confidence distribution of categories (which includes the confidence for each of the plurality of categories) output by the content sample classifier for that sample after a random transformation, and the similarity of the confidence distribution of categories output by the content sample classifier to a one-hot vector. The specific total loss function may be as described in step 202 with reference to fig. 2.
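A minimal sketch of such a total loss, assuming PyTorch; the equal weighting of the three terms, the squared-error consistency penalty, and the entropy penalty used to encourage one-hot-like outputs are illustrative assumptions rather than the disclosure's exact formulation.

```python
# A minimal sketch of a semi-supervised total loss under stated assumptions.
import torch
import torch.nn.functional as F

def total_loss(classifier, x_labeled, labels, x_unlabeled, transform):
    # Labeled term: keep the predicted distribution close to the (soft) label.
    log_p = classifier(x_labeled).log_softmax(dim=1)
    loss_labeled = -(labels * log_p).sum(dim=1).mean()

    # Unlabeled term 1: invariance under a random transformation.
    p_u = classifier(x_unlabeled).softmax(dim=1)
    p_t = classifier(transform(x_unlabeled)).softmax(dim=1)
    loss_consistency = ((p_u - p_t) ** 2).sum(dim=1).mean()

    # Unlabeled term 2: similarity to a one-hot vector, via low entropy.
    entropy = -(p_u * (p_u + 1e-8).log()).sum(dim=1).mean()

    return loss_labeled + loss_consistency + entropy
```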
In some embodiments, in training the content sample classifier with the labeled and unlabeled content samples, the classifier may be trained in batches by selecting the same number of labeled and unlabeled content samples from the training set each time. For example, each sampling may draw B "clean" samples and B "noise" samples, B being a positive integer, and their loss functions are then summed as the total loss function for that training pass. By using the same number of labeled content samples and unlabeled content samples, the clustering accuracy and the generalization of the trained content sample classifier can be balanced, so that the trained content sample classifier achieves a better clustering effect.
As an example, the parameters of the content sample classifier may be adjusted by back propagation according to the total loss function described above. In the back propagation process, the learning rate for each parameter of the content sample classifier can be dynamically adjusted by computing first and second moment estimates of the gradient of the loss function. The specific training parameters may be as described in step 202 with reference to fig. 2. Training the content sample classifier in a semi-supervised manner with both labeled and unlabeled content samples greatly improves the clustering accuracy and generalization of the trained classifier, thereby improving the clustering effect, while the trained classifier obtains the categories of content samples end to end.
The embodiments of the present disclosure do not limit the specific structure of the content sample classifier, which may be adapted to the type of the content sample. For example, in the case that the content samples to be clustered are image data samples or voice data samples, the structure of the content sample classifier may be a CNN (Convolutional Neural Network) or the like. In the case that the content samples to be clustered are text data samples, the structure of the content sample classifier may be an RNN (Recurrent Neural Network), an LSTM (Long Short-Term Memory) network, BERT (Bidirectional Encoder Representations from Transformers), or the like.
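For instance, an image-oriented classifier could be as small as the following sketch, assuming PyTorch and 32x32 RGB inputs; all layer sizes are illustrative assumptions.

```python
# A minimal sketch of a CNN-structured content sample classifier.
import torch.nn as nn

def make_cnn_classifier(num_categories: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),                         # 32x32 -> 16x16
        nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),                         # 16x16 -> 8x8
        nn.Flatten(),
        nn.Linear(64 * 8 * 8, num_categories),   # one confidence per category
    )
```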
In some embodiments, at least one of the plurality of clustering methods is a classifier-based clustering method as described above. In this case, the set of "clean" samples can be updated so that the obtained "clean" samples are more accurate, which improves the clustering accuracy of the finally trained content sample classifier.
Fig. 4 illustrates an exemplary flow chart of a method of updating the "clean" samples. As shown in fig. 4, in step 401, the labeled content samples and unlabeled content samples may be used to train the classifier on which the clustering method is based. In step 402, the plurality of content samples may be clustered using the clustering method based on the trained classifier, to determine an updated confidence distribution of categories into which each content sample is clustered under that method. In step 403, the labeled content samples are re-formed based on the updated confidence distributions and the confidence distributions of categories into which each content sample is clustered under the other clustering methods of the plurality of clustering methods. Optionally, the update procedure may be repeated until the set of "clean" samples no longer changes, although this is not limiting.
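The update procedure of fig. 4 can be sketched as the loop below; the callables train and split stand in for step 401 and steps 402-403 respectively and are assumptions for illustration, not APIs from the disclosure.

```python
# A minimal sketch of the "clean"-sample update loop of Fig. 4.
from typing import Callable, FrozenSet

def refine_clean_samples(
    clean: FrozenSet[int],                        # ids of current "clean" samples
    train: Callable[[FrozenSet[int]], None],      # step 401: retrain the classifier
    split: Callable[[], FrozenSet[int]],          # steps 402-403: re-cluster and re-partition
) -> FrozenSet[int]:
    while True:
        train(clean)             # fit the classifier-based method on the current split
        new_clean = split()      # re-derive "clean" samples from updated confidences
        if new_clean == clean:   # optional stopping criterion: set no longer changes
            return clean
        clean = new_clean
```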
In step 305, content samples to be clustered are clustered using the trained content sample classifier to determine their categories. In some embodiments, the content samples to be clustered may be input to the trained content sample classifier, which outputs their confidence distributions over the plurality of categories; the category with the highest confidence in the distribution is then determined as the category of the content sample to be clustered. Selecting the category corresponding to the highest confidence from the classifier's output gives the highest clustering accuracy and yields a clustering process that directly produces the cluster category.
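Inference with the trained classifier then reduces to an argmax over the output confidences, as in this sketch (assuming a PyTorch module whose output is one value per category):

```python
# A minimal sketch of clustering new samples with the trained classifier.
import torch

@torch.no_grad()
def predict_category(classifier: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
    conf = classifier(x).softmax(dim=1)   # confidence distribution per sample
    return conf.argmax(dim=1)             # category with the highest confidence
```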
In the method for clustering content samples described in the embodiments of the present disclosure, the clustering results of a plurality of clustering methods on the content samples in a data set are fully utilized to accurately divide all content samples into a set of "clean" samples and a set of "noise" samples. The "clean" samples are then labeled to form labeled content samples. After the division, the content sample classifier can be trained using the labeled and unlabeled content samples, which greatly improves the clustering accuracy and generalization of the trained content sample classifier; at the same time, the categories of content samples to be clustered can be obtained end to end through the trained content sample classifier, improving the clustering effect.
Fig. 5 illustrates an exemplary schematic diagram of clustering content samples according to one embodiment of the present disclosure. As shown in fig. 5, two clustering methods (a first clustering method and a second clustering method) are each used to cluster a plurality of unlabeled content samples in a data set, producing a clustering result that includes, for each content sample, the category corresponding to the highest confidence into which that sample is clustered under each clustering method. The clustering results obtained with the first and second clustering methods are then matched to divide the content samples in the data set into "clean" samples and "noise" samples, and the "clean" samples are labeled to form labeled content samples. The label indicates the confidence of the labeled content sample for each of a plurality of categories corresponding to the output of a content sample classifier. In the division, a content sample is determined to be a "clean" sample if it satisfies both conditions: (a) the categories corresponding to all the highest confidences into which it is respectively clustered under the clustering methods are the same, and (b) all the highest confidences are greater than a confidence threshold; otherwise it is a "noise" sample. All unlabeled content samples and the labeled content samples together form a training set used to train the content sample classifier in a semi-supervised learning manner, yielding a trained content sample classifier. Content samples to be clustered may then be input into the trained content sample classifier, which outputs their confidence distributions over the plurality of categories; the category with the highest confidence in the distribution is determined as the category of the content sample, so that the category of a content sample to be clustered is obtained end to end. A content sample to be clustered may be one of the unlabeled plurality of content samples, or may be a content sample of the same type (e.g., an image data sample, a text data sample, or a voice data sample) as the plurality of content samples.
Fig. 6 illustrates an exemplary block diagram of an apparatus 600 for determining a training set according to an embodiment of the present disclosure. As shown in fig. 6, the apparatus 600 includes a data set acquisition module 601, a sample clustering module 602, a sample labeling module 603, and a training set determination module 604.
The dataset acquisition module 601 is configured to acquire a dataset comprising a plurality of content samples without tags. Each of the plurality of content samples may be, but is not limited to, an image data sample, a text data sample, or a voice data sample, and in fact the type of content sample is not limited thereto.
The sample clustering module 602 is configured to cluster the plurality of content samples using each of a plurality of clustering methods that are different from each other to determine a category corresponding to a highest confidence in a confidence distribution of categories in which each of the plurality of content samples is clustered under the each clustering method. The plurality of clustering methods are different from each other, and the clustering method may be any suitable clustering method, such as K-means clustering, DEC, IDEC, DCEC, classifier-based clustering methods using artificial intelligence, and the like, although this is not limitative.
The sample tagging module 603 is configured to, for each content sample of the plurality of content samples, in response to determining that the categories corresponding to all the highest confidences into which the content sample is respectively clustered under the plurality of clustering methods are the same, and that all the highest confidences are greater than a confidence threshold, label the content sample to form a labeled content sample, wherein the label indicates the confidence of the labeled content sample for each category in a plurality of categories corresponding to the output of a content sample classifier.
The training set determination module 604 is configured to determine the labeled content samples and the unlabeled content samples in the data set together as a training set for the content sample classifier. The labeled content samples are the content samples labeled by the sample tagging module 603, and the unlabeled content samples are the remaining unlabeled content samples in the data set. Determining the labeled and unlabeled content samples in the data set together as a training set allows the content sample classifier to be trained.
Fig. 7 illustrates an exemplary block diagram of an apparatus 700 for training a content sample classifier according to an embodiment of the present disclosure. As shown in fig. 7, the apparatus 700 includes a training set acquisition module 701 and a classifier training module 702.
The training set obtaining module 701 is configured to obtain a training set for a content sample classifier, wherein the training set is determined by the apparatus 600 for determining a training set described with reference to fig. 6 and comprises labeled content samples and unlabeled content samples. The content sample classifier is used for classifying the content samples to obtain the categories of the content samples. Each of the labeled content sample and the unlabeled content sample may be an image data sample, a text data sample, or a voice data sample, and the type of the content sample is not limited herein.
Classifier training module 702 is configured to train the content sample classifier using the labeled and unlabeled content samples in a training set. The clustering accuracy of the trained content sample classifier can be improved by using the labeled content samples, and the clustering generalization (i.e., generalization capability) of the trained content sample classifier can be improved by using the unlabeled content samples.
Fig. 8 shows an exemplary block diagram of an apparatus 800 for clustering content samples according to an embodiment of the present disclosure. As shown in fig. 8, the apparatus 800 includes an obtaining module 801, a clustering module 802, a labeling module 803, a training module 804, and a determining module 805.
The acquisition module 801 is configured to acquire a data set comprising a plurality of content samples without tags. Each of the plurality of content samples may be, but is not limited to, an image data sample, a text data sample, or a voice data sample, and in fact the type of content sample is not limited thereto.
The clustering module 802 is configured to cluster the plurality of content samples using each of a plurality of clustering methods that are different from each other to determine a category corresponding to a highest confidence in a confidence distribution of categories in which each content sample is clustered under the each clustering method. The plurality of clustering methods are different from each other, and the clustering method may be any suitable clustering method, such as K-means clustering, DEC, IDEC, DCEC, classifier-based clustering methods using artificial intelligence, and the like, although this is not limitative.
The tagging module 803 is configured to, for each content sample of the plurality of content samples, in response to determining that the categories corresponding to all the highest confidences into which the content sample is respectively clustered under the plurality of clustering methods are the same, and that all the highest confidences are greater than a confidence threshold, label the content sample to form a labeled content sample, wherein the label indicates the confidence of the labeled content sample for each category in a plurality of categories corresponding to the output of the content sample classifier.
The training module 804 is configured to train the content sample classifier using the labeled content samples and the unlabeled content samples in the dataset to obtain a trained content sample classifier. As an example, the training module 804 is configured to train the content sample classifier in a semi-supervised learning manner using the labeled content samples and unlabeled content samples. The clustering accuracy of the trained content sample classifier can be improved by using the labeled content samples, and the clustering generalization (i.e., generalization capability) of the trained content sample classifier can be improved by using the unlabeled content samples.
The determining module 805 is configured to cluster the content samples to be clustered using the trained content sample classifier to determine a category of the content samples to be clustered. The output of the content sample classifier is a confidence distribution for the content sample corresponding to a plurality of categories, from which, in some embodiments, the category of highest confidence may be selected as the category for the content sample. This has the highest clustering accuracy and enables a clustering process that directly gets the cluster class.
Fig. 9 illustrates an example system 900 that includes an example computing device 910 that represents one or more systems and/or devices that can implement the various techniques described herein. The computing device 910 may be, for example, a server of a service provider, a device associated with a server, a system on a chip, and/or any other suitable computing device or computing system. The means for determining a training set 600 described above with reference to fig. 6, the means for training a content sample classifier 700 described with reference to fig. 7, and the means for clustering content samples 800 described with reference to fig. 8 may all take the form of a computing device 910. Alternatively, the means for determining a training set 600, the means for training a content sample classifier 700, and the means for clustering content samples 800 may each be implemented as a computer program in the form of an application 916.
The example computing device 910 as illustrated includes a processing system 911, one or more computer-readable media 912, and one or more I/O interfaces 913 communicatively coupled to each other. Although not shown, the computing device 910 may also include a system bus or other data and command transfer system that couples the various components to one another. A system bus can include any one or combination of different bus structures, such as a memory bus or memory controller, a peripheral bus, a universal serial bus, and/or a processor or local bus that utilizes any of a variety of bus architectures. Various other examples are also contemplated, such as control and data lines.
The processing system 911 represents functionality to perform one or more operations using hardware. Accordingly, the processing system 911 is illustrated as including hardware elements 914 that may be configured as processors, functional blocks, and the like. This may include implementation in hardware as an application specific integrated circuit or other logic device formed using one or more semiconductors. Hardware element 914 is not limited by the material from which it is formed or the processing mechanisms employed therein. For example, a processor may be comprised of semiconductor(s) and/or transistors (e.g., electronic Integrated Circuits (ICs)). In such a context, processor-executable instructions may be electronically-executable instructions.
The computer-readable medium 912 is illustrated as including a memory/storage 915. Memory/storage 915 represents memory/storage capacity associated with one or more computer-readable media. The memory/storage 915 may include volatile media (such as Random Access Memory (RAM)) and/or nonvolatile media (such as Read Only Memory (ROM), flash memory, optical disks, magnetic disks, and so forth). The memory/storage 915 may include fixed media (e.g., RAM, ROM, a fixed hard drive, etc.) as well as removable media (e.g., flash memory, a removable hard drive, an optical disk, and so forth). The computer-readable medium 912 may be configured in various other ways as further described below.
One or more I/O interfaces 913 represent functionality that allows a user to enter commands and information to computing device 910 using various input devices and optionally also allows information to be presented to the user and/or other components or devices using various output devices. Examples of input devices include a keyboard, a cursor control device (e.g., a mouse), a microphone (e.g., for voice input), a scanner, touch functionality (e.g., capacitive or other sensors configured to detect physical touch), a camera (e.g., motion that may not involve touch may be detected as gestures using visible or invisible wavelengths such as infrared frequencies), and so forth. Examples of output devices include a display device (e.g., a monitor or projector), speakers, a printer, a network card, a haptic response device, and so forth. Thus, the computing device 910 may be configured in various ways to support user interaction, as described further below.
The computing device 910 also includes an application 916. The application 916 may be, for example, a software instance of the apparatus 600 to determine a training set, the apparatus 700 to train a content sample classifier, and the apparatus 800 to cluster content samples, and in combination with other elements in the computing device 910 to implement the techniques described herein.
Various techniques may be described herein in the general context of software, hardware elements, or program modules. Generally, these modules include routines, programs, objects, elements, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The terms "module," "functionality," and "component" as used herein generally represent software, firmware, hardware, or a combination thereof. The features of the techniques described herein are platform-independent, meaning that the techniques may be implemented on a variety of computing platforms having a variety of processors.
As previously described, hardware element 914 and computer-readable medium 912 represent instructions, modules, programmable device logic, and/or fixed device logic implemented in hardware that, in some embodiments, may be used to implement at least some aspects of the techniques described herein. The hardware elements may include integrated circuits or systems-on-chips, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), Complex Programmable Logic Devices (CPLDs), and other implementations in silicon or components of other hardware devices. In this context, a hardware element may serve as a processing device that performs program tasks defined by instructions, modules, and/or logic embodied by the hardware element, as well as a hardware device for storing instructions for execution, such as the computer-readable storage medium described previously.
Combinations of the foregoing may also be used to implement the various techniques and modules described herein. Thus, software, hardware, or program modules and other program modules may be implemented as one or more instructions and/or logic embodied on some form of computer-readable storage medium and/or by one or more hardware elements 914. The computing device 910 may be configured to implement particular instructions and/or functions corresponding to software and/or hardware modules.
In various implementations, the computing device 910 may assume a variety of different configurations. For example, the computing device 910 may be implemented as a computer-like device including a personal computer, a desktop computer, a multi-screen computer, a laptop computer, a netbook, and so forth. The computing device 910 may also be implemented as a mobile device-like device including mobile devices such as mobile telephones, portable music players, portable gaming devices, tablet computers, multi-screen computers, and the like. The computing device 910 may also be implemented as a television-like device that includes or is connected to a device having a generally larger screen in a casual viewing environment. These devices include televisions, set-top boxes, game consoles, and the like.
The techniques described herein may be supported by these various configurations of the computing device 910 and are not limited to specific examples of the techniques described herein. Functionality may also be implemented in whole or in part on "cloud" 920 through the use of a distributed system, such as through platform 922 as described below.
Cloud 920 includes and/or is representative of a platform 922 for resources 924. The platform 922 abstracts underlying functionality of hardware (e.g., servers) and software resources of the cloud 920. The resources 924 may include other applications and/or data that may be used when executing computer processes on servers remote from the computing device 910. The resources 924 may also include services provided over the internet and/or over a subscriber network such as a cellular or Wi-Fi network.
It will be appreciated that embodiments of the disclosure have been described with reference to different functional units for clarity. However, it will be apparent that the functionality of each functional unit may be implemented in a single unit, in a plurality of units or as part of other functional units without departing from the disclosure. For example, functionality illustrated to be performed by a single unit may be performed by a plurality of different units. Thus, references to specific functional units are only to be seen as references to suitable units for providing the described functionality rather than indicative of a strict logical or physical structure or organization. Thus, the present disclosure may be implemented in a single unit or may be physically and functionally distributed between different units and circuits.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various devices, elements, components or sections, these devices, elements, components or sections should not be limited by these terms. These terms are only used to distinguish one device, element, component or section from another device, element, component or section.
Although the present disclosure has been described in connection with some embodiments, it is not intended to be limited to the specific form set forth herein. Rather, the scope of the present disclosure is limited only by the accompanying claims. Additionally, although individual features may be included in different claims, these may possibly advantageously be combined, and the inclusion in different claims does not imply that a combination of features is not feasible and/or advantageous. The order of features in the claims does not imply any specific order in which the features must be worked. Furthermore, in the claims, the word "comprising" does not exclude other elements, and the terms "a" or "an" do not exclude a plurality. Reference signs in the claims are provided merely as a clarifying example and shall not be construed as limiting the scope of the claims in any way.

Claims (15)

1. A method of clustering content samples, comprising:
obtaining a data set comprising a plurality of content samples without tags;
clustering the plurality of content samples by using each clustering method in a plurality of clustering methods different from each other so as to determine a category corresponding to the highest confidence in confidence distributions of categories into which each content sample is clustered under each clustering method;
for each content sample of the plurality of content samples, in response to determining that the categories corresponding to all the highest confidences into which each content sample is respectively clustered under the plurality of clustering methods are the same, and that all the highest confidences are greater than a confidence threshold, labeling each content sample to form a labeled content sample, wherein the label indicates the confidence of the labeled content sample for each category in a plurality of categories corresponding to an output of a content sample classifier;
training the content sample classifier by using the labeled content samples and the unlabeled content samples in the data set to obtain a trained content sample classifier;
and clustering the content samples to be clustered by utilizing the trained content sample classifier so as to determine the category of the content samples to be clustered.
2. The method of claim 1, further comprising:
in response to the category corresponding to the highest confidence into which each content sample is clustered under a first clustering method of every two clustering methods of the plurality of clustering methods having, among all the categories clustered using the first clustering method, the largest number of identical content samples with the category corresponding to the highest confidence into which each content sample is clustered under a second clustering method of the every two clustering methods, determining that all the categories corresponding to the highest confidences into which each content sample is respectively clustered under the plurality of clustering methods are identical.
3. The method of claim 1, wherein tagging each of the content samples comprises:
determining an average value of the highest confidences of the categories into which each content sample is respectively clustered under the plurality of clustering methods, and marking a first confidence, in the label, of the category corresponding to the highest confidence as the average value, and
Marking a second confidence level of each category in the label in other categories except the category corresponding to the highest confidence level, so that the sum of the first confidence level and all the second confidence levels is 1.
4. The method of claim 1, wherein tagging each of the content samples comprises:
marking the first confidence, in the label, of the category corresponding to the highest confidence as 1, and
And respectively marking the second confidence degrees of each category in the labels in other categories except the category corresponding to the highest confidence degree as 0.
5. The method of claim 1, wherein at least one of the plurality of clustering methods comprises a classifier-based clustering method, and wherein prior to the training of the content sample classifier with labeled and unlabeled content samples in the dataset, the method further comprises:
training a classifier based on the clustering method by using the labeled content samples and the unlabeled content samples;
clustering the plurality of content samples by using a clustering method based on a trained classifier to determine the confidence degree distribution of an updated category obtained by clustering each content sample under the clustering method based on the trained classifier;
reformulating the labeled content samples based on the updated confidence distributions for the categories and the confidence distributions for the categories in which each content sample is clustered under the other clustering methods of the plurality of clustering methods, respectively.
6. The method of claim 1, wherein at least one of the plurality of clustering methods is a K-means clustering method, and wherein clustering the plurality of content samples using the K-means clustering method further comprises:
extracting feature data of a plurality of dimensions for each of the plurality of content samples;
reducing the dimensions of the feature data of the plurality of dimensions;
clustering the plurality of content samples based on the reduced-dimension feature data of the plurality of content samples.
7. The method of claim 1, wherein training the content sample classifier with labeled and unlabeled content samples in the dataset comprises:
training the content sample classifier by adjusting parameters of the content sample classifier such that a total loss function is minimized, wherein the total loss function is a sum of loss functions for each content sample, and,
wherein a loss function for each labeled content sample is used to constrain the proximity of the confidence distribution of the class of each labeled content sample output by the content sample classifier to the label of each labeled content sample;
wherein the loss function for each unlabeled content sample is used to constrain invariance of the confidence distribution of categories output by the content sample classifier for each unlabeled content sample after a random transformation, and similarity of the confidence distribution of categories output by the content sample classifier to a one-hot vector, wherein the confidence distribution of categories comprises a confidence for each of the plurality of categories.
8. The method of claim 7, wherein training the content sample classifier by adjusting parameters of the content sample classifier such that an overall loss function is minimized comprises:
training the content sample classifier in a back-propagation manner, wherein a learning rate for each parameter of the content sample classifier is dynamically adjusted in the back-propagation by calculating a first moment estimate and a second moment estimate of a gradient of a loss function.
9. The method of claim 1, wherein training the content sample classifier using labeled and unlabeled content samples in the dataset comprises:
the classifier is trained in separate runs each time the same number of labeled and unlabeled content samples are selected.
10. The method of claim 1, wherein clustering content samples to be clustered using a trained content sample classifier to determine categories of the content samples to be clustered comprises:
inputting the content samples to be clustered into a trained content sample classifier, so that the content sample classifier outputs confidence distributions of the content samples to be clustered, which correspond to the plurality of classes;
determining the category with the highest confidence in the confidence distribution as the category of the content sample to be clustered.
11. The method of claim 1, wherein each of the plurality of content samples comprises an image data sample and the structure of the content sample classifier comprises a convolutional neural network.
12. The method of claim 1, wherein each of the plurality of content samples comprises one of a text data sample and a speech data sample, and the structure of the content sample classifier comprises one of a recurrent neural network, a long short-term memory network, and a Bidirectional Encoder Representations from Transformers (BERT) network.
13. An apparatus for clustering content samples, comprising:
an acquisition module configured to acquire a dataset comprising a plurality of content samples without tags;
a clustering module configured to cluster the plurality of content samples using each of a plurality of clustering methods different from each other to determine a category corresponding to a highest confidence in a confidence distribution of categories in which each content sample is clustered under the each clustering method;
a tagging module configured to, for each content sample of the plurality of content samples, in response to determining: if the classes corresponding to all the highest confidence degrees into which each content sample is respectively clustered under the plurality of clustering methods are the same, and all the highest confidence degrees are greater than a confidence degree threshold value, labeling each content sample to form a labeled content sample, wherein the label indicates the confidence degree of the labeled content sample for each class in a plurality of classes corresponding to the output of a content sample classifier;
a training module configured to train the content sample classifier using the labeled content samples and the unlabeled content samples in the dataset to obtain a trained content sample classifier;
a determination module configured to cluster the content samples to be clustered using the trained content sample classifier to determine a category of the content samples to be clustered.
14. A computing device, comprising:
a memory configured to store computer-executable instructions; and
a processor configured to perform the method of any one of claims 1-12 when the computer-executable instructions are executed by the processor.
15. A computer-readable storage medium storing computer-executable instructions that, when executed, perform the method of any one of claims 1-12.
CN202010824726.2A 2020-08-17 2020-08-17 Method and device for clustering content samples Active CN111898704B (en)



