CN111898704A - Method and device for clustering content samples


Info

Publication number
CN111898704A
Authority
CN
China
Prior art keywords
content
sample
clustering
samples
classifier
Prior art date
Legal status
Granted
Application number
CN202010824726.2A
Other languages
Chinese (zh)
Other versions
CN111898704B (en)
Inventor
卢东焕 (Lu Donghuan)
赵俊杰 (Zhao Junjie)
马锴 (Ma Kai)
郑冶枫 (Zheng Yefeng)
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010824726.2A
Publication of CN111898704A
Application granted
Publication of CN111898704B
Legal status: Active

Classifications

    • G06F 18/23213: Pattern recognition; clustering techniques; non-hierarchical techniques using statistics or function optimisation (e.g. modelling of probability density functions) with a fixed number of clusters, e.g. K-means clustering
    • G06F 18/214: Pattern recognition; design or setup of recognition systems or techniques; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/2431: Pattern recognition; classification techniques relating to the number of classes; multiple classes
    • G06N 3/045: Computing arrangements based on biological models; neural networks; architectures, e.g. interconnection topology; combinations of networks
    • G06N 3/084: Computing arrangements based on biological models; neural networks; learning methods; backpropagation, e.g. using gradient descent


Abstract

The application describes a method of clustering content samples, comprising: obtaining a data set comprising a plurality of unlabeled content samples; clustering the content samples with a plurality of mutually different clustering methods to determine, for each content sample under each clustering method, the category corresponding to the highest confidence in the confidence distribution over categories; for each content sample, in response to determining that the categories corresponding to the highest confidences into which the sample is clustered under the plurality of clustering methods are all the same, and that all of those highest confidences are greater than a confidence threshold, labeling the content sample to form a labeled content sample; training a content sample classifier with the labeled and unlabeled content samples to obtain a trained content sample classifier; and clustering content samples to be clustered with the trained classifier to determine their category.

Description

Method and device for clustering content samples
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a method and an apparatus for clustering content samples.
Background
Currently, when clustering content samples such as image data samples, voice data samples, and text data samples, a two-stage clustering method is generally employed. In the first stage, features are extracted from the content samples using an encoder; in the second stage, the extracted features are clustered using a basic clustering algorithm, such as the K-means algorithm, to obtain a category for each sample. However, such clustering methods are limited by the feature extraction capability of the encoder and often perform poorly, and they cannot obtain the categories of the content samples end to end (i.e., directly from the content samples, without using an encoder to first extract features). In addition, the basic clustering algorithm itself affects the accuracy of the clustering.
With the development of artificial intelligence, classifier-based clustering methods can obtain the category of a content sample end to end, but training a classifier with good clustering accuracy is difficult: training samples with accurately known categories are severely lacking, and the process of obtaining training samples is itself affected by the feature extraction capability of the encoder and by the basic clustering algorithm. Some studies attempt to jointly consider the clustering results of multiple basic clustering algorithms, but fuse them with simple weighted voting, which is quite ineffective.
Disclosure of Invention
In view of the above, the present disclosure provides methods and apparatus for determining a training set, methods and apparatus for training a content sample classifier, and methods and apparatus for clustering content samples, which desirably overcome some or all of the above-mentioned deficiencies, and possibly others.
According to a first aspect of the present disclosure, there is provided a method of clustering content samples, comprising: obtaining a data set comprising a plurality of unlabeled content samples; clustering the plurality of content samples with each of a plurality of mutually different clustering methods to determine, for each content sample under each clustering method, the category corresponding to the highest confidence in the confidence distribution over categories into which the content sample is clustered; for each content sample of the plurality of content samples, in response to determining that the categories corresponding to all of the highest confidences into which the content sample is clustered under the plurality of clustering methods are the same, and that all of the highest confidences are greater than a confidence threshold, labeling the content sample to form a labeled content sample, the label indicating the confidence of the labeled content sample for each of a plurality of categories corresponding to the output of a content sample classifier; training the content sample classifier with the labeled content samples and the unlabeled content samples in the data set to obtain a trained content sample classifier; and clustering content samples to be clustered with the trained content sample classifier to determine their category.
In some embodiments, the method further comprises: for every two clustering methods of the plurality of clustering methods, in response to the category corresponding to the highest confidence into which each content sample is clustered under the first of the two clustering methods sharing, among all categories produced by the first clustering method, the largest number of identical content samples with the category corresponding to the highest confidence into which that content sample is clustered under the second of the two clustering methods, determining that the categories corresponding to all of the highest confidences into which the content sample is clustered under the plurality of clustering methods are the same.
In some embodiments, labeling each content sample comprises: determining the average of the highest confidences of the categories into which the content sample is respectively clustered under the plurality of clustering methods; setting the first confidence in the label, for the category corresponding to the highest confidence, to that average; and setting the second confidence in the label for each of the other categories such that the sum of the first confidence and all of the second confidences is 1.
In some embodiments, labeling each content sample comprises: setting the first confidence in the label, for the category corresponding to the highest confidence, to 1, and setting the second confidence in the label for each of the other categories to 0.
In some embodiments, at least one of the plurality of clustering methods comprises a classifier-based clustering method, and, before training the content sample classifier with the labeled and unlabeled content samples in the data set, the method further comprises: training the classifier on which that clustering method is based with the labeled and unlabeled content samples; clustering the plurality of content samples with the clustering method based on the trained classifier to determine an updated confidence distribution over categories for each content sample under that method; and re-forming the labeled content samples based on the updated confidence distributions and the confidence distributions into which each content sample is respectively clustered under the other clustering methods of the plurality of clustering methods.
In some embodiments, at least one of the plurality of clustering methods is a K-means clustering method, and clustering the plurality of content samples with the K-means clustering method further comprises: extracting feature data of a plurality of dimensions for each of the plurality of content samples; reducing the dimensionality of the feature data; and clustering the plurality of content samples based on the reduced-dimension feature data.
In some embodiments, training the content sample classifier with the labeled and unlabeled content samples in the data set comprises: training the classifier by adjusting its parameters so that a total loss function is minimized, where the total loss function is the sum of the loss functions for each content sample. The loss function for each labeled content sample constrains the confidence distribution over categories that the classifier outputs for that sample to be close to the sample's label. The loss function for each unlabeled content sample constrains the confidence distributions the classifier outputs for the sample after different random transformations to be invariant, and constrains the output confidence distribution to be similar to a one-hot vector. A confidence distribution over categories includes a confidence for each of the plurality of categories.
In some embodiments, training the content sample classifier by adjusting its parameters so that the total loss function is minimized comprises: training the classifier by back propagation, wherein the learning rate for each parameter of the classifier is dynamically adjusted during back propagation by computing first and second moment estimates of the gradient of the loss function.
In some embodiments, training the content sample classifier with the labeled and unlabeled content samples in the data set comprises: training the classifier in batches, selecting the same number of labeled and unlabeled content samples each time.
In some embodiments, clustering content samples to be clustered with the trained content sample classifier to determine their category comprises: inputting the content samples to be clustered into the trained classifier, so that it outputs confidence distributions over the plurality of categories for those samples; and determining the category with the highest confidence in the confidence distribution as the category of the content sample to be clustered.
In some embodiments, each of the plurality of content samples comprises an image data sample, and the structure of the content sample classifier comprises a convolutional neural network.
In some embodiments, each of the plurality of content samples comprises one of a text data sample and a speech data sample, and the structure of the content sample classifier comprises one of a recurrent neural network, a long short-term memory network, and Bidirectional Encoder Representations from Transformers (BERT).
According to a second aspect of the present disclosure, there is provided an apparatus for clustering content samples, comprising: an acquisition module configured to acquire a data set comprising a plurality of unlabeled content samples; a clustering module configured to cluster the plurality of content samples with each of a plurality of mutually different clustering methods to determine, for each content sample under each clustering method, the category corresponding to the highest confidence in the confidence distribution over categories into which the content sample is clustered; a tagging module configured to, for each content sample of the plurality of content samples, in response to determining that the categories corresponding to all of the highest confidences into which the content sample is clustered under the plurality of clustering methods are the same, and that all of the highest confidences are greater than a confidence threshold, label the content sample to form a labeled content sample, the label indicating the confidence of the labeled content sample for each of a plurality of categories corresponding to the output of a content sample classifier; a training module configured to train the content sample classifier with the labeled and unlabeled content samples in the data set to obtain a trained content sample classifier; and a determination module configured to cluster content samples to be clustered with the trained content sample classifier to determine their category.
According to a third aspect of the present disclosure, there is provided a method of determining a training set, comprising: obtaining a data set comprising a plurality of unlabeled content samples; clustering the plurality of content samples with each of a plurality of mutually different clustering methods to determine, for each content sample under each clustering method, the category corresponding to the highest confidence in the confidence distribution over categories into which the content sample is clustered; for each content sample of the plurality of content samples, in response to determining that the categories corresponding to all of the highest confidences into which the content sample is clustered under the plurality of clustering methods are the same, and that all of the highest confidences are greater than a confidence threshold, labeling the content sample to form a labeled content sample, the label indicating the confidence of the labeled content sample for each of a plurality of categories corresponding to the output of a content sample classifier; and determining the labeled content samples and the unlabeled content samples in the data set together as a training set for the content sample classifier.
According to a fourth aspect of the present disclosure, there is provided an apparatus for determining a training set, comprising: a data set acquisition module configured to acquire a data set comprising a plurality of unlabeled content samples; a sample clustering module configured to cluster the plurality of content samples with each of a plurality of mutually different clustering methods to determine, for each content sample under each clustering method, the category corresponding to the highest confidence in the confidence distribution over categories into which the content sample is clustered; a sample tagging module configured to, for each content sample of the plurality of content samples, in response to determining that the categories corresponding to all of the highest confidences into which the content sample is clustered under the plurality of clustering methods are the same, and that all of the highest confidences are greater than a confidence threshold, label the content sample to form a labeled content sample, the label indicating the confidence of the labeled content sample for each of a plurality of categories corresponding to the output of a content sample classifier; and a training set determination module configured to determine the labeled and unlabeled content samples in the data set together as a training set for the content sample classifier.
According to a fifth aspect of the present disclosure, there is provided a method of training a content sample classifier, comprising: obtaining a training set for a content sample classifier, wherein the training set is determined according to the method of the third aspect of the present disclosure and comprises labeled content samples and unlabeled content samples; training the content sample classifier using the labeled content samples and unlabeled content samples in a training set.
According to a sixth aspect of the present disclosure, there is provided an apparatus for training a content sample classifier, comprising: a training set obtaining module configured to obtain a training set for a content sample classifier, wherein the training set is determined by the apparatus for determining a training set according to the fourth aspect of the present disclosure and includes labeled content samples and unlabeled content samples; a classifier training module configured to train the content sample classifier using the labeled and unlabeled content samples in a training set.
According to a seventh aspect of the present disclosure, there is provided a computing device comprising a processor; and a memory configured to have computer-executable instructions stored thereon that, when executed by the processor, perform any of the methods described above.
According to an eighth aspect of the present disclosure, there is provided a computer-readable storage medium storing computer-executable instructions that, when executed, perform any of the methods described above.
In the methods and apparatus for determining a training set, for training a content sample classifier, and for clustering content samples claimed by the present disclosure, the "clean" samples and "noise" samples in the data set can be accurately identified by fully exploiting the results of clustering the content samples in the data set with a plurality of clustering methods. Labeling the "clean" samples, and training the content sample classifier with both the labeled "clean" samples and the "noise" samples, greatly improves the clustering accuracy and generalization of the trained classifier, while allowing it to obtain the category of a content sample end to end.
These and other advantages of the present disclosure will become apparent from and elucidated with reference to the embodiments described hereinafter.
Drawings
Embodiments of the present disclosure will now be described in more detail and with reference to the accompanying drawings, in which:
FIG. 1 illustrates a schematic flow diagram of a method of determining a training set according to one embodiment of the present disclosure;
FIG. 2 illustrates a schematic flow diagram of a method of training a content sample classifier according to one embodiment of the present disclosure;
FIG. 3 illustrates a schematic flow chart diagram of a method of clustering content samples according to one embodiment of the present disclosure;
FIG. 4 illustrates an exemplary flow chart of a method of updating a "clean" sample according to one embodiment of the present disclosure;
FIG. 5 illustrates an exemplary schematic diagram of clustering content samples according to one embodiment of the present disclosure;
FIG. 6 illustrates an exemplary block diagram of an apparatus for determining a training set according to one embodiment of the present disclosure;
FIG. 7 illustrates an exemplary block diagram of an apparatus for training a content sample classifier according to one embodiment of the present disclosure;
FIG. 8 illustrates an exemplary block diagram of an apparatus for clustering content samples according to one embodiment of the present disclosure;
FIG. 9 illustrates an example system that includes an example computing device that represents one or more systems and/or devices that may implement the various techniques described herein.
Detailed Description
The following description provides specific details of various embodiments of the disclosure so that those skilled in the art can fully understand and practice them. It is understood that aspects of the disclosure may be practiced without some of these details. In some instances, well-known structures or functions are not shown or described in detail to avoid obscuring the description of the embodiments. The terminology used in the present disclosure should be interpreted in its broadest reasonable manner, even when it is used in conjunction with a particular embodiment of the present disclosure.
First, some terms referred to in the embodiments of the present application are explained to facilitate understanding by those skilled in the art.
Clustering: the process of dividing a collection of physical or abstract objects into classes composed of similar objects. As an important method of unsupervised learning, the idea of clustering is to group samples or objects with similar attributes into one class. A class generated by clustering is a collection of objects that are similar to objects in the same class and distinct from objects in other classes. Common clustering methods include K-means clustering, mean-shift clustering, density-based clustering methods, and the like.
Classifier: the conventional task of a classifier is to learn classification rules from known training data of given classes and then classify or predict unknown data. Classifier is a general term for methods that classify samples in data mining, and includes algorithms such as decision trees, logistic regression, naive Bayes, and neural networks.
Semi-supervised learning: a training/learning mode in machine learning that lies between supervised and unsupervised learning and combines the two. In semi-supervised learning, one part of the training data is labeled and the other part is unlabeled, and the amount of unlabeled data is often much larger than the amount of labeled data (which is also realistic). The premise underlying semi-supervised learning is that the distribution of the data is not completely random: acceptable, or even very good, classification results can be obtained from local features of the labeled data combined with the overall distribution of the much larger body of unlabeled data.
Back propagation: a gradient descent algorithm with a recursive structure, widely used as the basic training method for deep neural networks.
Artificial Intelligence (AI): the theory, methods, technology, and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning, and decision making. It is a comprehensive discipline spanning a wide range of fields, covering both hardware-level and software-level technology. Basic artificial intelligence technology generally includes sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technology mainly comprises computer vision, speech processing, natural language processing, and machine learning/deep learning.
Fig. 1 illustrates a schematic flow diagram of a method 100 of determining a training set according to one embodiment of the present disclosure. The training set may be used to train a content sample classifier for classifying content samples to derive categories of content samples. As shown in fig. 1, the method 100 includes the following steps 101-104.
In step 101, a data set comprising a plurality of unlabeled content samples is obtained. Each of the plurality of content samples may be, for example, an image data sample, a text data sample, or a voice data sample; the type of content sample is not limited thereto.
In step 102, the plurality of content samples are clustered with each of a plurality of mutually different clustering methods to determine, for each content sample under each clustering method, the category corresponding to the highest confidence in the confidence distribution over the categories into which the sample is clustered. Each clustering method may be any suitable method, such as K-means clustering, DEC (Deep Embedded Clustering), IDEC (Improved Deep Embedded Clustering), DCEC (Deep Convolutional Embedded Clustering), or an artificial-intelligence classifier-based clustering method, although this is not limiting.
As an example, given $N$ unlabeled content samples $\{x_i\}_{i=1}^{N}$, clustering them with the mutually different clustering methods yields confidence distributions $\{p^{(m)}_{ik}\}$, where $M$ is the number of clustering methods, $K$ is the number of categories produced by each clustering method, and $p^{(m)}_{ik}$ denotes the confidence that the $i$-th sample belongs to the $k$-th category when the $m$-th clustering method is used. It should be noted that the number of categories formed by each clustering method is generally set in advance and is typically set the same for all clustering methods.
In some embodiments, for a classifier-based clustering method, the classifier can directly output the confidence of each content sample for each of the plurality of categories, i.e., the confidence distribution over categories. The highest confidence can then be determined from each content sample's confidence distribution, and the corresponding category derived from it. For example, if under the 1st clustering method the classifier outputs for the 1st sample a confidence distribution $p^{(1)}_{1}$ whose largest entry is 0.90 at the 2nd position, then the highest confidence of that sample is 0.90 and the category corresponding to it is the 2nd category.
In some embodiments, for a clustering method such as K-means that directly yields the cluster category (i.e., the category corresponding to the highest confidence) rather than a confidence for each category, the confidence of a sample for each category (i.e., the sample's confidence distribution over categories) can be computed with the Student's t-distribution:

$$q_{ik} = \frac{\left(1 + \lVert z_i - \mu_k \rVert^2 / \alpha \right)^{-\frac{\alpha+1}{2}}}{\sum_{k'} \left(1 + \lVert z_i - \mu_{k'} \rVert^2 / \alpha \right)^{-\frac{\alpha+1}{2}}}$$

where $\alpha$ is the degree of freedom of the Student's t-distribution, usually set to 1, $z_i$ is the feature of the $i$-th content sample, and $\mu_k$ is the center point of the $k$-th category obtained by the clustering (typically the mean of the spatial coordinates corresponding to the features of the content samples in that category).
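A minimal sketch of this soft-assignment computation, assuming per-sample features `z` and cluster centers `mu` have already been obtained (e.g., from K-means):

```python
import numpy as np

def student_t_confidence(z, mu, alpha=1.0):
    """Soft cluster assignment via the Student's t-kernel.

    z:  (N, d) array of per-sample features.
    mu: (K, d) array of cluster center points.
    Returns an (N, K) array whose rows are confidence distributions.
    """
    # Squared distance of every sample to every center: shape (N, K)
    dist2 = ((z[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
    # Unnormalized t-kernel with `alpha` degrees of freedom
    q = (1.0 + dist2 / alpha) ** (-(alpha + 1.0) / 2.0)
    # Normalize each row so the confidences over the K categories sum to 1
    return q / q.sum(axis=1, keepdims=True)
```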
In some embodiments, at least one of the plurality of clustering methods is a K-means clustering method. In that case, when the plurality of content samples are clustered with the K-means method, feature data of a plurality of dimensions can be extracted for each content sample; the dimensionality of that feature data is then reduced; finally, the content samples are clustered based on the reduced-dimension feature data. Extracting multi-dimensional feature data from the content samples improves the accuracy of the clustering, and reducing the dimensionality of the feature data reduces the computation required by the clustering process. As an example, the feature data can be reduced with Principal Component Analysis (PCA), which converts a group of possibly correlated variables into a group of linearly uncorrelated variables, called principal components, through an orthogonal transformation. It should be noted that PCA is only an example; virtually any method that can convert multidimensional feature data into data of fewer dimensions can be used, and no limitation is intended here.
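A sketch of this pipeline with scikit-learn; the choice of 50 principal components is an illustrative assumption, not fixed by the method:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def kmeans_with_pca(features, n_clusters, n_components=50):
    """Reduce multi-dimensional feature data with PCA, then K-means cluster it.

    features: (N, D) feature data extracted from the N content samples.
    Returns (labels, centers) in the reduced feature space.
    """
    # Orthogonal transform to linearly uncorrelated principal components
    reduced = PCA(n_components=n_components).fit_transform(features)
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(reduced)
    return km.labels_, km.cluster_centers_
```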
At step 103, for each content sample of the plurality of content samples, in response to determining that the categories corresponding to all of the highest confidences into which the content sample is clustered under the plurality of clustering methods are the same, and that all of those highest confidences are greater than the confidence threshold, the content sample is labeled to form a labeled content sample. The label indicates the confidence of the labeled content sample for each of a plurality of categories corresponding to the output of a content sample classifier. In the embodiments of the present disclosure, if a content sample simultaneously satisfies the two conditions that (a) the categories corresponding to all of the highest confidences into which it is respectively clustered under the plurality of clustering methods are the same and (b) all of those highest confidences are greater than the confidence threshold, it can be concluded that the content sample is clustered into the same category under the different clustering methods; such a content sample can be determined to be a "clean" sample and can therefore be labeled. Correspondingly, content samples in the data set that do not satisfy both conditions (a) and (b) are determined to be "noise" samples and are not labeled, because the category into which such a content sample is clustered is likely to be inaccurate.
Taking two different clustering methods as an example: if the two categories corresponding to the highest confidences into which a content sample is clustered under the two methods are the same, and each of the two highest confidences is greater than the confidence threshold, the content sample can be determined to be a "clean" sample and can therefore be labeled. This can be expressed as:

$$c_i = \begin{cases} \text{True}, & \text{if } k^{(1)}_i = k^{(2)}_i \ \text{and} \ p^{(1)}_i > \tau \ \text{and} \ p^{(2)}_i > \tau \\ \text{False}, & \text{otherwise} \end{cases}$$

where $c_i$ indicates whether the $i$-th content sample is a "clean" sample ($\text{True}$ meaning it is, $\text{False}$ meaning it is not), $\tau$ is the confidence threshold, $p^{(1)}_i$ and $p^{(2)}_i$ are the highest confidences of the categories into which the $i$-th content sample is clustered under the 1st and 2nd clustering methods, respectively, and $k^{(1)}_i$ and $k^{(2)}_i$ are the two categories corresponding to those highest confidences (their equality being judged after the category alignment described below).
Since the categories produced by a clustering method are generated randomly, the categories produced by different clustering methods do not correspond to one another, which makes condition (a) nontrivial to check. In some embodiments of the present disclosure, for every two of the plurality of clustering methods, the category into which a content sample is clustered with the highest confidence under the first method is considered the same as the category into which it is clustered with the highest confidence under the second method if, among all categories produced by the second method, that second-method category shares the largest number of identical content samples with the first-method category. This provides an efficient and accurate way to decide, when two or more clustering methods are used, whether the categories corresponding to the highest confidences into which each content sample is clustered under the different methods are the same.
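A sketch of this matching rule for two clustering methods; the greedy maximum-overlap matching and the helper names are assumptions about one way to realize the criterion:

```python
import numpy as np

def align_clusters(labels_a, labels_b, k):
    """Map each category id of method A to the method-B category that
    shares the most content samples with it (maximum-overlap criterion)."""
    overlap = np.zeros((k, k), dtype=int)
    for a, b in zip(labels_a, labels_b):
        overlap[a, b] += 1
    return {a: int(overlap[a].argmax()) for a in range(k)}

def clean_mask(conf_a, conf_b, k, tau=0.95):
    """Flag "clean" samples: the aligned top-1 categories agree and both
    highest confidences exceed the threshold tau."""
    top_a, top_b = conf_a.argmax(axis=1), conf_b.argmax(axis=1)
    mapping = align_clusters(top_a, top_b, k)
    same_cat = np.array([mapping[a] == b for a, b in zip(top_a, top_b)])
    return same_cat & (conf_a.max(axis=1) > tau) & (conf_b.max(axis=1) > tau)
```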
In some embodiments, when labeling each of the unlabeled content samples, the label $y_i = (y_{i1}, \dots, y_{iK})$ of the $i$-th content sample can be set as

$$y_{ik} = \begin{cases} \dfrac{1}{J} \sum_{j=1}^{J} p^{(j)}_i, & k = m \\[1ex] \dfrac{1 - y_{im}}{K - 1}, & k \neq m \end{cases}$$

where $i$ is the sequence number of the content sample in the data set, $K$ is the number of categories produced by each clustering method, $y_{ik}$ is the confidence of the $i$-th content sample for the $k$-th category ($k \le K$), $m$ is the category corresponding to the highest confidence of the $i$-th content sample, $J$ is the total number of clustering methods, and $p^{(j)}_i$ is the highest confidence of the $i$-th content sample under the $j$-th clustering method. It should be noted that the specific value or identification of the category $m$ corresponding to the highest confidence of each content sample may be predetermined (e.g., determined as the 2nd category), as long as the determined categories are consistent with the categories corresponding to the confidences output by the content sample classifier.
In other words, when labeling a content sample, the average of the highest confidences of the categories into which it is respectively clustered under the plurality of clustering methods is determined, and the first confidence in the label, for the category corresponding to the highest confidence (i.e., the confidence corresponding to the expected category of the content sample), is set to that average; the second confidence in the label for each of the other categories is then set so that the sum of the first confidence and all of the second confidences is 1, i.e., each second confidence is set to the ratio of the difference between 1 and the average to the number of other categories.
In some embodiments, when labeling a content sample, the first confidence in the label, for the category corresponding to the highest confidence (i.e., the confidence corresponding to the expected category of the content sample), may instead be set to 1 (i.e., 100% confidence), and the second confidence for each of the other categories set to 0. Of course, this is not limiting, as long as the category corresponding to the highest confidence is clearly distinguished from the remaining categories.
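A sketch of both labeling variants described above (soft labels built from the averaged highest confidences, and the hard one-hot alternative):

```python
import numpy as np

def soft_label(confs, i, m, k):
    """Label sample i: the category m corresponding to the highest confidence
    receives the average of the highest confidences over the J clustering
    methods; the remaining K-1 categories share the rest so the label sums to 1.

    confs: list of J (N, K) confidence arrays, one per clustering method.
    """
    avg = np.mean([c[i].max() for c in confs])  # average highest confidence
    label = np.full(k, (1.0 - avg) / (k - 1))
    label[m] = avg
    return label

def hard_label(m, k):
    """One-hot variant: confidence 1 for category m, 0 elsewhere."""
    label = np.zeros(k)
    label[m] = 1.0
    return label
```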
In some embodiments, at least one of the plurality of clustering methods is a classifier-based clustering method as described above. In that case, the "clean" samples can be updated, making the obtained "clean" samples more accurate and improving the clustering accuracy of the finally trained content sample classifier. As an example, FIG. 4 illustrates an exemplary flow chart of a method of updating the "clean" samples. As shown in FIG. 4, in step 401, the labeled and unlabeled content samples are used to train the classifier on which the clustering method is based. In step 402, the plurality of content samples are clustered with the clustering method based on the trained classifier to determine an updated confidence distribution over categories for each content sample under that method. In step 403, the labeled content samples are re-formed based on the updated confidence distributions and the confidence distributions into which each content sample is clustered under the other clustering methods; step 403 amounts to performing step 103 once more. Optionally, the update procedure may be repeated until the set of "clean" samples no longer changes, although this is not limiting.
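A sketch of this update loop; the helper callables `train_classifier_method`, `cluster_all`, and `make_labels` are hypothetical stand-ins for steps 401, 402, and 403:

```python
def update_clean_samples(samples, other_confs, train_classifier_method,
                         cluster_all, make_labels, max_rounds=10):
    """Repeat steps 401-403: retrain the classifier-based clustering method
    on the current labeled/unlabeled split, recluster to get updated
    confidence distributions, and re-form the labeled set, until the set
    of "clean" samples no longer changes (or max_rounds is reached)."""
    labeled = {}  # sample index -> label vector
    for _ in range(max_rounds):
        model = train_classifier_method(samples, labeled)   # step 401
        clf_confs = cluster_all(model, samples)             # step 402
        new_labeled = make_labels(clf_confs, other_confs)   # step 403
        if set(new_labeled) == set(labeled):                # "clean" set stable
            break
        labeled = new_labeled
    return labeled
```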
At step 104, the labeled content samples and the unlabeled content samples in the data set are together determined as the training set for the content sample classifier. The labeled content samples are those labeled in step 103; the unlabeled content samples are the remaining unlabeled samples in the data set. Together they form the training set with which the content sample classifier can be trained.
It should be noted that the embodiments of the present disclosure do not limit the specific structure of the content sample classifier, which can be adapted to the type of content sample. For example, when each of the plurality of content samples is an image data sample or a voice data sample, the structure of the content sample classifier may be a CNN (Convolutional Neural Network) or the like. When each content sample is a text data sample, the structure may be an RNN (Recurrent Neural Network), an LSTM (Long Short-Term Memory) network, BERT (Bidirectional Encoder Representations from Transformers), or the like.
It should be noted that the term "plurality" in the embodiments of the present disclosure includes two as well as more than two, unless otherwise specified. For example, the plurality of clustering methods may include two clustering methods and more than two clustering methods.
In the method for determining a training set described in the embodiments of the present disclosure, all content samples are divided into a set of "clean" samples and a set of "noise" samples using the results of a plurality of clustering methods on the content samples in the data set. The "clean" samples are then labeled to form labeled content samples, with the labels constructed from the outputs of the plurality of clustering methods, and the "noise" samples are kept as unlabeled content samples. After this division, the labeled and unlabeled content samples can together serve as the training set with which the content sample classifier is trained. The method makes full use of the clustering results on the data set and can accurately identify the "clean" and "noise" samples. Labeling the "clean" samples and training the classifier on them together with the "noise" samples greatly improves the clustering accuracy and generalization of the trained content sample classifier, which can moreover obtain the category of a content sample end to end.
Fig. 2 illustrates a schematic flow diagram of a method 200 of training a content sample classifier according to one embodiment of the present disclosure. As shown in fig. 2, the method 200 includes the following steps.
In step 201, a training set for a content sample classifier is obtained. The content sample classifier is used for clustering the content samples to obtain the category of the content samples. The training set is determined, for example, according to the method 100 described with reference to fig. 1 and includes labeled content samples and unlabeled content samples. Each of the labeled content sample and the unlabeled content sample may be an image data sample, a text data sample, or a voice data sample, and the type of the content sample is not limited herein.
It should be noted that the embodiments of the present disclosure do not limit the specific structure of the content sample classifier, which can be adapted to the type of content sample. For example, when each of the plurality of content samples is an image data sample or a voice data sample, the structure of the content sample classifier may be a CNN (Convolutional Neural Network) or the like. When each content sample is a text data sample, the structure may be an RNN (Recurrent Neural Network), an LSTM (Long Short-Term Memory) network, BERT (Bidirectional Encoder Representations from Transformers), or the like.
At step 202, the content sample classifier is trained with the labeled and unlabeled content samples in the training set. As an example, the classifier may be trained in a semi-supervised learning manner. Using the labeled content samples improves the clustering accuracy of the trained classifier, and using the unlabeled content samples improves its clustering generalization (i.e., generalization capability).
In some embodiments, when training the content sample classifier with the labeled and unlabeled content samples in the training set, the classifier may be trained by adjusting its parameters so that a total loss function is minimized. The total loss function is the sum of the loss functions for each content sample (both labeled and unlabeled). As an example, the loss function for each labeled content sample constrains the confidence distribution over categories that the classifier outputs for the sample to be close to the sample's label; the loss function for each unlabeled content sample constrains the confidence distributions the classifier outputs for the sample after different random transformations to be invariant, and constrains the output distribution to be similar to a one-hot vector. A confidence distribution over categories includes a confidence for each of the plurality of categories.
As an example, the total loss function can be expressed as:

$$\mathcal{L} = \sum_{i} \ell_i, \qquad \ell_i = \begin{cases} \ell_{ce}\left(f(x_i),\, y_i\right), & c_i = \text{True} \\ \ell_{ce}\left(f(T_1(x_i)),\, f(T_2(x_i))\right) + \ell_{ent}\left(f(T_1(x_i))\right), & c_i = \text{False} \end{cases}$$

where $i$ is the sequence number of the content sample in the data set, $\ell_i$ is the loss function of the $i$-th content sample, $\ell_{ce}$ denotes the cross-entropy loss function, $\ell_{ent}$ denotes the entropy loss function, $f(\cdot)$ denotes the confidence distribution over categories output by the classifier, $c_i$ indicates whether the $i$-th content sample is a "clean" sample ($\text{True}$ meaning it is, $\text{False}$ meaning it is a "noise" sample), and $T_1$ and $T_2$ denote a first and a second random transformation, different from each other, applied to the $i$-th content sample.
In the above expression, when the $i$-th content sample is a "clean" sample (i.e., $c_i = \text{True}$), the loss function for the sample is $\ell_{ce}(f(x_i), y_i)$, which constrains the confidence distribution over categories that the classifier outputs for the $i$-th content sample to be close to that sample's label: the parameters of the content sample classifier are adjusted so that the output distribution matches the label of the $i$-th content sample as closely as possible.
When the $i$-th content sample is a "noise" sample (i.e., $c_i = \text{False}$), the loss function for the sample is $\ell_{ce}(f(T_1(x_i)), f(T_2(x_i))) + \ell_{ent}(f(T_1(x_i)))$. The term $\ell_{ce}(f(T_1(x_i)), f(T_2(x_i)))$ constrains the invariance between the confidence distribution the classifier outputs for the $i$-th content sample after the first random transformation and the distribution it outputs after the second random transformation. For "noise" samples, whose labels cannot be determined, the embodiments of the present disclosure train on the basis of data-enhancement invariance: the confidence distribution the classifier outputs for a content sample should not be affected by random transformations. The term $\ell_{ent}(f(T_1(x_i)))$ constrains the similarity of the classifier's output distribution to a one-hot vector: by adjusting the parameters of the content sample classifier, the confidence distribution output for the $i$-th content sample is made as close as possible to the form of a one-hot vector. A one-hot vector is a vector formed by one-hot encoding, also known as one-bit-effective encoding, which uses an N-bit status register to encode N states, each state having its own independent register bit, with only one bit valid at any time. For example, six one-hot vectors formed by one-hot encoding could be 000001, 000010, 000100, 001000, 010000, and 100000, each with exactly one valid bit.
It should be noted that the random transformations described above may be any suitable random transformations. For example, when the content sample is an image data sample, the random transformation may be random cropping, random horizontal flipping, color jittering, random recombination of color channels, or the like. When the content sample is a text data sample, the random transformation may be translating the content sample into another language and then back into the original language (the semantics are unchanged, but the text changes). It should also be noted that the loss function described above is not limiting; any other suitable loss function may be used.
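A PyTorch sketch of this loss under stated assumptions: the classifier `f` is assumed to output a softmax confidence distribution, `t1` and `t2` are assumed random-transform callables, and detaching one view as the consistency target is a common implementation choice rather than something the method specifies:

```python
import torch

def sample_loss(f, x, is_clean, y=None, t1=None, t2=None):
    """Loss for one batch of content samples.

    Clean samples: cross-entropy between the classifier's output f(x)
    and the constructed label y.
    Noise samples: cross-entropy between the outputs for two different
    random transforms of x (data-enhancement invariance) plus the
    entropy of the output (pushing it toward a one-hot vector).
    """
    eps = 1e-8
    if is_clean:
        return -(y * torch.log(f(x) + eps)).sum(dim=1).mean()
    p1, p2 = f(t1(x)), f(t2(x))
    invariance = -(p2.detach() * torch.log(p1 + eps)).sum(dim=1).mean()
    one_hotness = -(p1 * torch.log(p1 + eps)).sum(dim=1).mean()  # entropy
    return invariance + one_hotness
```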
In some embodiments, when training the content sample classifier with the labeled and unlabeled content samples in the training set, the classifier can be trained in batches, selecting the same number of labeled and unlabeled content samples from the training set each time. For example, each sampling may draw $B$ "clean" samples and $B$ "noise" samples, $B$ being a positive integer, whose losses are then summed as the total loss function for that training step. Using the same number of labeled and unlabeled content samples balances the clustering accuracy and the generalization of the trained content sample classifier, giving it a better clustering effect.
As an example, the parameters of the content sample classifier can be adjusted by back propagation according to the total loss function described above. During back propagation, the learning rate for each parameter of the classifier is dynamically adjusted by computing first and second moment estimates of the gradient of the loss function. Back propagation is essentially based on gradient descent; in training, the step length can be 0.001, the exponential decay rate $\beta_1$ of the first moment estimate can be set to 0.9, and the exponential decay rate $\beta_2$ of the second moment estimate can be set to 0.999. The batch size $B$ during optimization can be set to 128, and L2 regularization with a regularization coefficient of 0.0001 can be used for all parameters. The step length is decayed to 0.1 times its previous value for every 50 passes over the content samples in the training set. Generally, the confidence threshold $\tau$ used when partitioning the "clean" samples can be set to 0.95. A content sample classifier trained in this way shows a good clustering effect.
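These hyperparameters match the update rule of the Adam optimizer; a sketch of the corresponding configuration, in which `model` is a placeholder for the content sample classifier:

```python
import torch

model = torch.nn.Linear(50, 10)  # placeholder for the content sample classifier

# Adam adapts each parameter's learning rate from first- and second-moment
# estimates of the gradient, matching the description above.
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-3,             # step length 0.001
    betas=(0.9, 0.999),  # exponential decay rates of the moment estimates
    weight_decay=1e-4,   # L2 regularization coefficient 0.0001
)
# Decay the step length to 0.1x after every 50 passes over the training set
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)

TAU = 0.95  # confidence threshold for partitioning "clean" samples
B = 128     # batch size
```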
In the method of training a content sample classifier described in the embodiments of the present disclosure, the classifier is trained with the labeled and unlabeled content samples as the training set, which greatly improves the clustering accuracy and generalization of the trained classifier and hence its clustering effect, while the trained classifier obtains the category of a content sample end to end.
Fig. 3 illustrates a schematic flow diagram of a method 300 of clustering content samples according to one embodiment of the present disclosure. As shown in fig. 3, the method 300 includes the following steps.
In step 301, a data set comprising a plurality of unlabeled content samples is obtained. Each of the plurality of content samples may be, for example, an image data sample, a text data sample, or a voice data sample; the type of content sample is not limited thereto.
In step 302, the plurality of content samples are clustered with each of a plurality of mutually different clustering methods to determine, for each content sample under each clustering method, the category corresponding to the highest confidence in the confidence distribution over the categories into which the sample is clustered. Each clustering method may be any suitable method, such as K-means clustering, DEC (Deep Embedded Clustering), IDEC (Improved Deep Embedded Clustering), DCEC (Deep Convolutional Embedded Clustering), or an artificial-intelligence classifier-based clustering method, although this is not limiting.
As an example, given $N$ unlabeled content samples $\{x_i\}_{i=1}^{N}$, clustering them with the mutually different clustering methods yields confidence distributions $\{p^{(m)}_{ik}\}$, where $M$ is the number of clustering methods, $K$ is the number of categories produced by each clustering method, and $p^{(m)}_{ik}$ denotes the confidence that the $i$-th sample belongs to the $k$-th category when the $m$-th clustering method is used. It should be noted that the number of categories formed by each clustering method is generally set in advance and is typically set the same for all clustering methods.
In some embodiments, for classifier-based clustering methods, the classifier can directly output the confidence of each content sample belonging to each category. The highest confidence may then be determined from these confidences, and the category corresponding to the highest confidence obtained accordingly. In some embodiments, for a clustering method such as K-means that directly yields the cluster category (i.e., the category corresponding to the highest confidence) rather than a confidence for each category, the confidence of a sample for each category (i.e., the confidence distribution of the category of the sample) may be calculated using Student's t-distribution, as described with reference to step 102 of fig. 1.
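As a concrete illustration of the Student's t-distribution computation, the following sketch assumes the common DEC-style kernel with one degree of freedom; the function name and the NumPy formulation are assumptions for illustration.

```python
# A minimal sketch: confidence distribution over K-means clusters via a
# Student's t-kernel. z is an (n, d) array of sample features and centers
# is a (K, d) array of cluster centers.
import numpy as np

def t_distribution_confidences(z: np.ndarray, centers: np.ndarray) -> np.ndarray:
    # Squared distance from every sample to every cluster center: (n, K).
    d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    # Student's t-kernel with one degree of freedom: similarity decays with distance.
    q = 1.0 / (1.0 + d2)
    # Normalize per sample so each row is a confidence distribution summing to 1.
    return q / q.sum(axis=1, keepdims=True)
```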
In some embodiments, at least one of the plurality of clustering methods is a K-means clustering method. In this case, when the plurality of content samples are clustered using the K-means clustering method, feature data of a plurality of dimensions may first be extracted for each of the plurality of content samples; the feature data of the plurality of dimensions is then reduced in dimension; and finally, the plurality of content samples are clustered based on the reduced-dimension feature data. Extracting feature data of multiple dimensions from a content sample improves the accuracy of clustering, while reducing the dimension of that feature data simplifies the computation of the clustering process.
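A minimal sketch of this K-means path, assuming scikit-learn; the choice of PCA for the dimension reduction and the target of 50 dimensions are illustrative assumptions.

```python
# A minimal sketch: reduce multi-dimensional features, then run K-means.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def kmeans_with_reduction(features: np.ndarray, n_clusters: int):
    # Reduce the feature data to cut the cost of the clustering process.
    reduced = PCA(n_components=50).fit_transform(features)
    # Cluster the samples in the reduced feature space.
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(reduced)
    return km.labels_, km.cluster_centers_, reduced
```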
At step 303, for each content sample of the plurality of content samples, in response to determining that the categories corresponding to all the highest confidences into which the content sample is respectively clustered under the plurality of clustering methods are the same, and that all the highest confidences are greater than a confidence threshold, the content sample is labeled to form a labeled content sample, where the label indicates the confidence of the labeled content sample for each category in a plurality of categories corresponding to the output of the content sample classifier.
In the embodiments of the present disclosure, if a content sample simultaneously satisfies the two conditions that (a) the categories corresponding to all the highest confidences into which the content sample is respectively clustered under the multiple clustering methods are the same and (b) all the highest confidences are greater than the confidence threshold, it may be concluded that the content sample is clustered into the same category with high accuracy under the different clustering methods; such a content sample may be determined to be a "clean" sample and may therefore be labeled. Correspondingly, a content sample in the data set that fails either of the two conditions (a) and (b) may be determined to be a "noise" sample and is therefore not labeled, since the category into which such a content sample is clustered is likely to be inaccurate.
Taking the example of using two different clustering methods, if the two categories corresponding to the highest confidence levels into which a content sample is clustered under the two different clustering methods, respectively, are the same, and each of the two highest confidence levels is greater than a confidence level threshold, the content sample may be determined to be a "clean" sample and may therefore be labeled.
Since the categories produced by a clustering method are generated arbitrarily, the categories produced by different clustering methods do not directly correspond to one another, so checking condition (a) involves a certain difficulty. In some embodiments of the present disclosure, for every two clustering methods of the plurality of clustering methods, the category corresponding to the highest confidence into which each content sample is clustered under the first clustering method is deemed the same as the category corresponding to the highest confidence into which the content sample is clustered under the second clustering method if, among all the categories formed by the first clustering method, that category shares the largest number of identical content samples with the category from the second clustering method. This provides an efficient and accurate way to determine, when two or more clustering methods are used, whether the categories corresponding to the highest confidences into which each content sample is respectively clustered under the plurality of clustering methods are the same.
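One hedged way to realize this matching is to count, for each pair of category indices, how many samples the two methods place together, and treat the pair with the largest overlap as the same category. The sketch below assumes per-sample hard category indices from each method; the function name and array layout are illustrative.

```python
# A minimal sketch of matching category indices between two clustering
# methods by maximum overlap. labels_a and labels_b are per-sample category
# indices from methods A and B; k is the (shared) number of categories.
import numpy as np

def matched_category(labels_a: np.ndarray, labels_b: np.ndarray, k: int) -> np.ndarray:
    # overlap[a, b] = number of samples placed in category a by method A
    # and in category b by method B.
    overlap = np.zeros((k, k), dtype=int)
    for a, b in zip(labels_a, labels_b):
        overlap[a, b] += 1
    # For each category b of method B, the method-A category with the
    # largest overlap is treated as the "same" category.
    return overlap.argmax(axis=0)

# A sample i is then consistent across the two methods when
# matched_category(labels_a, labels_b, k)[labels_b[i]] == labels_a[i].
```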
In some embodiments, when labeling a content sample, an average of the highest confidences of the categories into which the content sample is respectively clustered under the plurality of clustering methods may be determined; the first confidence in the label, for the category corresponding to the highest confidence (i.e., the confidence of the expected category of the content sample), is set to this average, and the second confidence for each of the other categories is set so that the first confidence and all the second confidences sum to 1, i.e., each second confidence equals the difference between 1 and the average divided by the number of other categories. In other embodiments, when labeling a content sample, the first confidence for the category corresponding to the highest confidence may simply be labeled 1 (i.e., 100% confidence), and the second confidence of each other category labeled 0. Of course, this is not limiting, as long as the category corresponding to the highest confidence is clearly distinguished from the remaining categories.
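The two labeling schemes might look as follows; confs, cls, and k are illustrative names for the per-method highest confidences of a "clean" sample, its matched category index, and the number of categories.

```python
# A minimal sketch of the two labeling schemes described above.
import numpy as np

def soft_label(confs: list[float], cls: int, k: int) -> np.ndarray:
    avg = float(np.mean(confs))                 # first confidence: the average
    label = np.full(k, (1.0 - avg) / (k - 1))   # remaining mass split evenly
    label[cls] = avg                            # so the label sums to 1
    return label

def hard_label(cls: int, k: int) -> np.ndarray:
    label = np.zeros(k)
    label[cls] = 1.0                            # one-hot: full confidence
    return label
```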
At step 304, the content sample classifier is trained using the labeled content samples and unlabeled content samples in the dataset to obtain a trained content sample classifier. As an example, the content sample classifier can be trained in a semi-supervised learning manner using the labeled and unlabeled content samples. The clustering accuracy of the trained content sample classifier can be improved by using the labeled content samples, and the clustering generalization (i.e., generalization capability) of the trained content sample classifier can be improved by using the unlabeled content samples.
In some embodiments, when training the content sample classifier with the labeled and unlabeled content samples in a training set, the classifier may be trained by adjusting its parameters so that a total loss function is minimized. The total loss function is the sum of the loss functions for each content sample (both labeled and unlabeled). As an example, the loss function for each labeled content sample constrains how close the confidence distribution of categories output by the content sample classifier for that sample is to its label. The loss function for each unlabeled content sample constrains both the invariance of the confidence distribution of categories (which includes the confidence for each of the plurality of categories) output by the content sample classifier for that sample after a random transformation, and the similarity of the confidence distribution of categories output by the content sample classifier to a one-hot vector. The specific total loss function may be as described in step 202 with reference to fig. 2.
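A minimal sketch of such a total loss, assuming PyTorch; the equal weighting of the three terms, the squared-error consistency penalty, and the entropy penalty used to encourage one-hot-like outputs are illustrative assumptions rather than the disclosure's exact formulation.

```python
# A minimal sketch of a semi-supervised total loss under stated assumptions.
import torch
import torch.nn.functional as F

def total_loss(classifier, x_labeled, labels, x_unlabeled, transform):
    # Labeled term: keep the predicted distribution close to the (soft) label.
    log_p = classifier(x_labeled).log_softmax(dim=1)
    loss_labeled = -(labels * log_p).sum(dim=1).mean()

    # Unlabeled term 1: invariance under a random transformation.
    p_u = classifier(x_unlabeled).softmax(dim=1)
    p_t = classifier(transform(x_unlabeled)).softmax(dim=1)
    loss_consistency = ((p_u - p_t) ** 2).sum(dim=1).mean()

    # Unlabeled term 2: similarity to a one-hot vector, via low entropy.
    entropy = -(p_u * (p_u + 1e-8).log()).sum(dim=1).mean()

    return loss_labeled + loss_consistency + entropy
```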
In some embodiments, in training the content sample classifier with the labeled and unlabeled content samples, the classifier may be trained in batches by selecting the same number of labeled and unlabeled content samples from the training set each time. For example, each sampling may draw B "clean" samples and B "noise" samples, B being a positive integer, and their loss functions are then summed as the total loss function for that training pass. By using the same number of labeled content samples and unlabeled content samples, the clustering accuracy and the generalization of the trained content sample classifier can be balanced, so that the trained content sample classifier achieves a better clustering effect.
As an example, the parameters of the content sample classifier may be adjusted by back propagation according to the total loss function described above. In the back propagation process, the learning rate for each parameter of the content sample classifier can be dynamically adjusted by computing first and second moment estimates of the gradient of the loss function. The specific training parameters may be as described in step 202 with reference to fig. 2. Training the content sample classifier in a semi-supervised manner with both labeled and unlabeled content samples greatly improves the clustering accuracy and generalization of the trained classifier, thereby improving the clustering effect, while the trained classifier obtains the categories of content samples end to end.
The embodiments of the present disclosure do not limit the specific structure of the content sample classifier, which may be adapted to the type of the content sample. For example, in the case that the content samples to be clustered are image data samples or voice data samples, the structure of the content sample classifier may be a CNN (Convolutional Neural Network) or the like. In the case that the content samples to be clustered are text data samples, the structure of the content sample classifier may be an RNN (Recurrent Neural Network), an LSTM (Long Short-Term Memory) network, BERT (Bidirectional Encoder Representations from Transformers), or the like.
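For instance, an image-oriented classifier could be as small as the following sketch, assuming PyTorch and 32x32 RGB inputs; all layer sizes are illustrative assumptions.

```python
# A minimal sketch of a CNN-structured content sample classifier.
import torch.nn as nn

def make_cnn_classifier(num_categories: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),                         # 32x32 -> 16x16
        nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),                         # 16x16 -> 8x8
        nn.Flatten(),
        nn.Linear(64 * 8 * 8, num_categories),   # one confidence per category
    )
```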
In some embodiments, at least one of the plurality of clustering methods is a classifier-based clustering method as described above. In this case, the set of "clean" samples can be updated so that the obtained "clean" samples are more accurate, which improves the clustering accuracy of the finally trained content sample classifier.
Fig. 4 illustrates an exemplary flow chart of a method of updating the "clean" samples. As shown in fig. 4, in step 401, the labeled content samples and unlabeled content samples may be used to train the classifier on which the clustering method is based. In step 402, the plurality of content samples may be clustered using the clustering method based on the trained classifier, to determine an updated confidence distribution of categories into which each content sample is clustered under that method. In step 403, the labeled content samples are re-formed based on the updated confidence distributions and the confidence distributions of categories into which each content sample is clustered under the other clustering methods of the plurality of clustering methods. Optionally, the update procedure may be repeated until the set of "clean" samples no longer changes, although this is not limiting.
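The update procedure of fig. 4 can be sketched as the loop below; the callables train and split stand in for step 401 and steps 402-403 respectively and are assumptions for illustration, not APIs from the disclosure.

```python
# A minimal sketch of the "clean"-sample update loop of Fig. 4.
from typing import Callable, FrozenSet

def refine_clean_samples(
    clean: FrozenSet[int],                        # ids of current "clean" samples
    train: Callable[[FrozenSet[int]], None],      # step 401: retrain the classifier
    split: Callable[[], FrozenSet[int]],          # steps 402-403: re-cluster and re-partition
) -> FrozenSet[int]:
    while True:
        train(clean)             # fit the classifier-based method on the current split
        new_clean = split()      # re-derive "clean" samples from updated confidences
        if new_clean == clean:   # optional stopping criterion: set no longer changes
            return clean
        clean = new_clean
```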
In step 305, content samples to be clustered are clustered using the trained content sample classifier to determine their categories. In some embodiments, the content samples to be clustered may be input to the trained content sample classifier, which outputs their confidence distributions over the plurality of categories; the category with the highest confidence in the distribution is then determined as the category of the content sample to be clustered. Selecting the category corresponding to the highest confidence from the classifier's output gives the highest clustering accuracy and yields a clustering process that directly produces the cluster category.
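Inference with the trained classifier then reduces to an argmax over the output confidences, as in this sketch (assuming a PyTorch module whose output is one value per category):

```python
# A minimal sketch of clustering new samples with the trained classifier.
import torch

@torch.no_grad()
def predict_category(classifier: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
    conf = classifier(x).softmax(dim=1)   # confidence distribution per sample
    return conf.argmax(dim=1)             # category with the highest confidence
```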
In the method for clustering content samples described in the embodiments of the present disclosure, the clustering results of a plurality of clustering methods on the content samples in a data set are fully utilized to accurately divide all content samples into a set of "clean" samples and a set of "noise" samples. The "clean" samples are then labeled to form labeled content samples. After the division, the content sample classifier can be trained using the labeled and unlabeled content samples, which greatly improves the clustering accuracy and generalization of the trained content sample classifier; at the same time, the categories of content samples to be clustered can be obtained end to end through the trained content sample classifier, improving the clustering effect.
Fig. 5 illustrates an exemplary schematic diagram of clustering content samples according to one embodiment of the present disclosure. As shown in fig. 5, two clustering methods (a first clustering method and a second clustering method) are each used to cluster a plurality of unlabeled content samples in a data set, producing a clustering result that includes, for each content sample, the category corresponding to the highest confidence into which that sample is clustered under each clustering method. The clustering results obtained with the first and second clustering methods are then matched to divide the content samples in the data set into "clean" samples and "noise" samples, and the "clean" samples are labeled to form labeled content samples. The label indicates the confidence of the labeled content sample for each of a plurality of categories corresponding to the output of a content sample classifier. In the division, a content sample is determined to be a "clean" sample if it satisfies both conditions: (a) the categories corresponding to all the highest confidences into which it is respectively clustered under the clustering methods are the same, and (b) all the highest confidences are greater than a confidence threshold; otherwise it is a "noise" sample. All unlabeled content samples and the labeled content samples together form a training set used to train the content sample classifier in a semi-supervised learning manner, yielding a trained content sample classifier. Content samples to be clustered may then be input into the trained content sample classifier, which outputs their confidence distributions over the plurality of categories; the category with the highest confidence in the distribution is determined as the category of the content sample, so that the category of a content sample to be clustered is obtained end to end. A content sample to be clustered may be one of the unlabeled plurality of content samples, or may be a content sample of the same type (e.g., an image data sample, a text data sample, or a voice data sample) as the plurality of content samples.
Fig. 6 illustrates an exemplary block diagram of an apparatus 600 for determining a training set according to an embodiment of the present disclosure. As shown in fig. 6, the apparatus 600 includes a data set acquisition module 601, a sample clustering module 602, a sample labeling module 603, and a training set determination module 604.
The dataset acquisition module 601 is configured to acquire a dataset comprising a plurality of content samples without tags. Each of the plurality of content samples may be, but is not limited to, an image data sample, a text data sample, or a voice data sample, and in fact the type of content sample is not limited thereto.
The sample clustering module 602 is configured to cluster the plurality of content samples using each of a plurality of clustering methods that are different from each other to determine a category corresponding to a highest confidence in a confidence distribution of categories in which each of the plurality of content samples is clustered under the each clustering method. The plurality of clustering methods are different from each other, and the clustering method may be any suitable clustering method, such as K-means clustering, DEC, IDEC, DCEC, classifier-based clustering methods using artificial intelligence, and the like, although this is not limitative.
The sample tagging module 603 is configured to, for each content sample of the plurality of content samples, in response to determining that the categories corresponding to all the highest confidences into which the content sample is respectively clustered under the plurality of clustering methods are the same, and that all the highest confidences are greater than a confidence threshold, label the content sample to form a labeled content sample, wherein the label indicates the confidence of the labeled content sample for each category in a plurality of categories corresponding to the output of a content sample classifier.
The training set determination module 604 is configured to determine the labeled content samples and the unlabeled content samples in the data set together as a training set for the content sample classifier. The labeled content samples are the content samples labeled by the sample tagging module 603, and the unlabeled content samples are the remaining unlabeled content samples in the data set. Determining the labeled and unlabeled content samples in the data set together as a training set allows the content sample classifier to be trained.
Fig. 7 illustrates an exemplary block diagram of an apparatus 700 for training a content sample classifier according to an embodiment of the present disclosure. As shown in fig. 7, the apparatus 700 includes a training set acquisition module 701 and a classifier training module 702.
The training set obtaining module 701 is configured to obtain a training set for a content sample classifier, wherein the training set is determined by the apparatus 600 for determining a training set described with reference to fig. 6 and comprises labeled content samples and unlabeled content samples. The content sample classifier is used for classifying the content samples to obtain the categories of the content samples. Each of the labeled content sample and the unlabeled content sample may be an image data sample, a text data sample, or a voice data sample, and the type of the content sample is not limited herein.
Classifier training module 702 is configured to train the content sample classifier using the labeled and unlabeled content samples in a training set. The clustering accuracy of the trained content sample classifier can be improved by using the labeled content samples, and the clustering generalization (i.e., generalization capability) of the trained content sample classifier can be improved by using the unlabeled content samples.
Fig. 8 shows an exemplary block diagram of an apparatus 800 for clustering content samples according to an embodiment of the present disclosure. As shown in fig. 8, the apparatus 800 includes an obtaining module 801, a clustering module 802, a labeling module 803, a training module 804, and a determining module 805.
The acquisition module 801 is configured to acquire a data set comprising a plurality of content samples without tags. Each of the plurality of content samples may be, but is not limited to, an image data sample, a text data sample, or a voice data sample, and in fact the type of content sample is not limited thereto.
The clustering module 802 is configured to cluster the plurality of content samples using each of a plurality of clustering methods that are different from each other to determine a category corresponding to a highest confidence in a confidence distribution of categories in which each content sample is clustered under the each clustering method. The plurality of clustering methods are different from each other, and the clustering method may be any suitable clustering method, such as K-means clustering, DEC, IDEC, DCEC, classifier-based clustering methods using artificial intelligence, and the like, although this is not limitative.
The tagging module 803 is configured to, for each content sample of the plurality of content samples, in response to determining that the categories corresponding to all the highest confidences into which the content sample is respectively clustered under the plurality of clustering methods are the same, and that all the highest confidences are greater than a confidence threshold, label the content sample to form a labeled content sample, wherein the label indicates the confidence of the labeled content sample for each category in a plurality of categories corresponding to the output of the content sample classifier.
The training module 804 is configured to train the content sample classifier using the labeled content samples and the unlabeled content samples in the dataset to obtain a trained content sample classifier. As an example, the training module 804 is configured to train the content sample classifier in a semi-supervised learning manner using the labeled content samples and unlabeled content samples. The clustering accuracy of the trained content sample classifier can be improved by using the labeled content samples, and the clustering generalization (i.e., generalization capability) of the trained content sample classifier can be improved by using the unlabeled content samples.
The determining module 805 is configured to cluster the content samples to be clustered using the trained content sample classifier to determine a category of the content samples to be clustered. The output of the content sample classifier is a confidence distribution for the content sample corresponding to a plurality of categories, from which, in some embodiments, the category of highest confidence may be selected as the category for the content sample. This has the highest clustering accuracy and enables a clustering process that directly gets the cluster class.
Fig. 9 illustrates an example system 900 that includes an example computing device 910 that represents one or more systems and/or devices that can implement the various techniques described herein. The computing device 910 may be, for example, a server of a service provider, a device associated with a server, a system on a chip, and/or any other suitable computing device or computing system. The means for determining a training set 600 described above with reference to fig. 6, the means for training a content sample classifier 700 described with reference to fig. 7, and the means for clustering content samples 800 described with reference to fig. 8 may all take the form of a computing device 910. Alternatively, the means for determining a training set 600, the means for training a content sample classifier 700, and the means for clustering content samples 800 may each be implemented as a computer program in the form of an application 916.
The example computing device 910 as illustrated includes a processing system 911, one or more computer-readable media 912, and one or more I/O interfaces 913 communicatively coupled to each other. Although not shown, the computing device 910 may also include a system bus or other data and command transfer system that couples the various components to one another. A system bus can include any one or combination of different bus structures, such as a memory bus or memory controller, a peripheral bus, a universal serial bus, and/or a processor or local bus that utilizes any of a variety of bus architectures. Various other examples are also contemplated, such as control and data lines.
The processing system 911 represents functionality to perform one or more operations using hardware. Accordingly, the processing system 911 is illustrated as including hardware elements 914 that may be configured as processors, functional blocks, and the like. This may include implementation in hardware as an application specific integrated circuit or other logic device formed using one or more semiconductors. Hardware element 914 is not limited by the material from which it is formed or the processing mechanisms employed therein. For example, a processor may be comprised of semiconductor(s) and/or transistors (e.g., electronic Integrated Circuits (ICs)). In such a context, processor-executable instructions may be electronically-executable instructions.
The computer-readable medium 912 is illustrated as including a memory/storage 915. Memory/storage 915 represents memory/storage capacity associated with one or more computer-readable media. The memory/storage 915 may include volatile media (such as Random Access Memory (RAM)) and/or nonvolatile media (such as Read Only Memory (ROM), flash memory, optical disks, magnetic disks, and so forth). The memory/storage 915 may include fixed media (e.g., RAM, ROM, a fixed hard drive, etc.) as well as removable media (e.g., flash memory, a removable hard drive, an optical disk, and so forth). The computer-readable medium 912 may be configured in various other ways as further described below.
One or more I/O interfaces 913 represent functionality that allows a user to enter commands and information to computing device 910 using various input devices and optionally also allows information to be presented to the user and/or other components or devices using various output devices. Examples of input devices include a keyboard, a cursor control device (e.g., a mouse), a microphone (e.g., for voice input), a scanner, touch functionality (e.g., capacitive or other sensors configured to detect physical touch), a camera (e.g., motion that may not involve touch may be detected as gestures using visible or invisible wavelengths such as infrared frequencies), and so forth. Examples of output devices include a display device (e.g., a monitor or projector), speakers, a printer, a network card, a haptic response device, and so forth. Thus, the computing device 910 may be configured in various ways to support user interaction, as described further below.
The computing device 910 also includes an application 916. The application 916 may be, for example, a software instance of the apparatus 600 to determine a training set, the apparatus 700 to train a content sample classifier, and the apparatus 800 to cluster content samples, and in combination with other elements in the computing device 910 to implement the techniques described herein.
Various techniques may be described herein in the general context of software, hardware elements, or program modules. Generally, these modules include routines, programs, objects, elements, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The terms "module," "functionality," and "component" as used herein generally represent software, firmware, hardware, or a combination thereof. The features of the techniques described herein are platform-independent, meaning that the techniques may be implemented on a variety of computing platforms having a variety of processors.
As previously described, hardware element 914 and computer-readable medium 912 represent instructions, modules, programmable device logic, and/or fixed device logic implemented in hardware that, in some embodiments, may be used to implement at least some aspects of the techniques described herein. The hardware elements may include integrated circuits or systems-on-chips, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), Complex Programmable Logic Devices (CPLDs), and other implementations in silicon or components of other hardware devices. In this context, a hardware element may serve as a processing device that performs program tasks defined by instructions, modules, and/or logic embodied by the hardware element, as well as a hardware device for storing instructions for execution, such as the computer-readable storage medium described previously.
Combinations of the foregoing may also be used to implement the various techniques and modules described herein. Thus, software, hardware, or program modules and other program modules may be implemented as one or more instructions and/or logic embodied on some form of computer-readable storage medium and/or by one or more hardware elements 914. The computing device 910 may be configured to implement particular instructions and/or functions corresponding to software and/or hardware modules.
In various implementations, the computing device 910 may assume a variety of different configurations. For example, the computing device 910 may be implemented as a computer-like device including a personal computer, a desktop computer, a multi-screen computer, a laptop computer, a netbook, and so forth. The computing device 910 may also be implemented as a mobile device-like device including mobile devices such as mobile telephones, portable music players, portable gaming devices, tablet computers, multi-screen computers, and the like. The computing device 910 may also be implemented as a television-like device that includes or is connected to a device having a generally larger screen in a casual viewing environment. These devices include televisions, set-top boxes, game consoles, and the like.
The techniques described herein may be supported by these various configurations of the computing device 910 and are not limited to specific examples of the techniques described herein. Functionality may also be implemented in whole or in part on "cloud" 920 through the use of a distributed system, such as through platform 922 as described below.
Cloud 920 includes and/or is representative of a platform 922 for resources 924. The platform 922 abstracts underlying functionality of hardware (e.g., servers) and software resources of the cloud 920. The resources 924 may include other applications and/or data that may be used when executing computer processes on servers remote from the computing device 910. The resources 924 may also include services provided over the internet and/or over a subscriber network such as a cellular or Wi-Fi network.
It will be appreciated that embodiments of the disclosure have been described with reference to different functional units for clarity. However, it will be apparent that the functionality of each functional unit may be implemented in a single unit, in a plurality of units or as part of other functional units without departing from the disclosure. For example, functionality illustrated to be performed by a single unit may be performed by a plurality of different units. Thus, references to specific functional units are only to be seen as references to suitable units for providing the described functionality rather than indicative of a strict logical or physical structure or organization. Thus, the present disclosure may be implemented in a single unit or may be physically and functionally distributed between different units and circuits.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various devices, elements, components or sections, these devices, elements, components or sections should not be limited by these terms. These terms are only used to distinguish one device, element, component or section from another device, element, component or section.
Although the present disclosure has been described in connection with some embodiments, it is not intended to be limited to the specific form set forth herein. Rather, the scope of the present disclosure is limited only by the accompanying claims. Additionally, although individual features may be included in different claims, these may possibly advantageously be combined, and the inclusion in different claims does not imply that a combination of features is not feasible and/or advantageous. The order of features in the claims does not imply any specific order in which the features must be worked. Furthermore, in the claims, the word "comprising" does not exclude other elements, and the terms "a" or "an" do not exclude a plurality. Reference signs in the claims are provided merely as a clarifying example and shall not be construed as limiting the scope of the claims in any way.

Claims (15)

1. A method of clustering content samples, comprising:
obtaining a data set comprising a plurality of content samples without tags;
clustering the plurality of content samples by using each clustering method in a plurality of clustering methods different from each other so as to determine a category corresponding to the highest confidence in confidence distributions of categories into which each content sample is clustered under each clustering method;
for each content sample of the plurality of content samples, in response to determining that the categories corresponding to all the highest confidences into which each content sample is respectively clustered under the plurality of clustering methods are the same, and that all the highest confidences are greater than a confidence threshold, labeling each content sample to form a labeled content sample, wherein the label indicates the confidence of the labeled content sample for each category in a plurality of categories corresponding to an output of a content sample classifier;
training the content sample classifier by using the labeled content samples and the unlabeled content samples in the data set to obtain a trained content sample classifier;
and clustering the content samples to be clustered by utilizing the trained content sample classifier so as to determine the category of the content samples to be clustered.
2. The method of claim 1, further comprising:
in response to the category corresponding to the highest confidence into which each content sample is clustered under a first clustering method of every two clustering methods of the plurality of clustering methods having, among all the categories clustered using the first clustering method, the largest number of identical content samples with the category corresponding to the highest confidence into which each content sample is clustered under a second clustering method of the every two clustering methods, determining that all the categories corresponding to the highest confidences into which each content sample is respectively clustered under the plurality of clustering methods are identical.
3. The method of claim 1, wherein tagging each of the content samples comprises:
determining an average value of the highest confidences of the categories into which each content sample is respectively clustered under the plurality of clustering methods, and marking a first confidence, in the label, of the category corresponding to the highest confidence as the average value, and
Marking a second confidence level of each category in the label in other categories except the category corresponding to the highest confidence level, so that the sum of the first confidence level and all the second confidence levels is 1.
4. The method of claim 1, wherein tagging each of the content samples comprises:
marking the first confidence, in the label, of the category corresponding to the highest confidence as 1, and
And respectively marking the second confidence degrees of each category in the labels in other categories except the category corresponding to the highest confidence degree as 0.
5. The method of claim 1, wherein at least one of the plurality of clustering methods comprises a classifier-based clustering method, and wherein prior to the training of the content sample classifier with labeled and unlabeled content samples in the dataset, the method further comprises:
training a classifier based on the clustering method by using the labeled content samples and the unlabeled content samples;
clustering the plurality of content samples by using a clustering method based on a trained classifier to determine the confidence degree distribution of an updated category obtained by clustering each content sample under the clustering method based on the trained classifier;
reformulating the labeled content samples based on the updated confidence distributions for the categories and the confidence distributions for the categories in which each content sample is clustered under the other clustering methods of the plurality of clustering methods, respectively.
6. The method of claim 1, wherein at least one of the plurality of clustering methods is a K-means clustering method, and wherein clustering the plurality of content samples using the K-means clustering method further comprises:
extracting feature data of a plurality of dimensions for each of the plurality of content samples;
reducing the dimensions of the feature data of the plurality of dimensions;
clustering the plurality of content samples based on the reduced-dimension feature data of the plurality of content samples.
7. The method of claim 1, wherein training the content sample classifier with labeled and unlabeled content samples in the dataset comprises:
training the content sample classifier by adjusting parameters of the content sample classifier such that a total loss function is minimized, wherein the total loss function is a sum of loss functions for each content sample, and,
wherein a loss function for each labeled content sample is used to constrain the proximity of the confidence distribution of the class of each labeled content sample output by the content sample classifier to the label of each labeled content sample;
wherein the loss function for each unlabeled content sample is used to constrain invariance of the confidence distribution of categories output by the content sample classifier for each unlabeled content sample after a random transformation, and similarity of the confidence distribution of categories output by the content sample classifier to a one-hot vector, wherein the confidence distribution of categories comprises a confidence for each of the plurality of categories.
8. The method of claim 7, wherein training the content sample classifier by adjusting parameters of the content sample classifier such that an overall loss function is minimized comprises:
training the content sample classifier in a back-propagation manner, wherein a learning rate for each parameter of the content sample classifier is dynamically adjusted in the back-propagation by calculating a first moment estimate and a second moment estimate of a gradient of a loss function.
9. The method of claim 1, wherein training the content sample classifier using labeled and unlabeled content samples in the dataset comprises:
the classifier is trained in separate runs each time the same number of labeled and unlabeled content samples are selected.
10. The method of claim 1, wherein clustering content samples to be clustered using a trained content sample classifier to determine categories of the content samples to be clustered comprises:
inputting the content samples to be clustered into a trained content sample classifier, so that the content sample classifier outputs confidence distributions of the content samples to be clustered, which correspond to the plurality of classes;
determining the category with the highest confidence in the confidence distribution as the category of the content sample to be clustered.
11. The method of claim 1, wherein each of the plurality of content samples comprises an image data sample and the structure of the content sample classifier comprises a convolutional neural network.
12. The method of claim 1, wherein each of the plurality of content samples comprises one of a text data sample and a speech data sample, and the structure of the content sample classifier comprises one of a recurrent neural network, a long short-term memory network, and a Bidirectional Encoder Representations from Transformers (BERT) network.
13. An apparatus for clustering content samples, comprising:
an acquisition module configured to acquire a dataset comprising a plurality of content samples without tags;
a clustering module configured to cluster the plurality of content samples using each of a plurality of clustering methods different from each other to determine a category corresponding to a highest confidence in a confidence distribution of categories in which each content sample is clustered under the each clustering method;
a tagging module configured to, for each content sample of the plurality of content samples, in response to determining: if the classes corresponding to all the highest confidence degrees into which each content sample is respectively clustered under the plurality of clustering methods are the same, and all the highest confidence degrees are greater than a confidence degree threshold value, labeling each content sample to form a labeled content sample, wherein the label indicates the confidence degree of the labeled content sample for each class in a plurality of classes corresponding to the output of a content sample classifier;
a training module configured to train the content sample classifier using the labeled content samples and the unlabeled content samples in the dataset to obtain a trained content sample classifier;
a determination module configured to cluster the content samples to be clustered using the trained content sample classifier to determine a category of the content samples to be clustered.
14. A computing device, comprising:
a memory configured to store computer-executable instructions; and
a processor configured to perform the method of any one of claims 1-12 when the computer-executable instructions are executed by the processor.
15. A computer-readable storage medium storing computer-executable instructions that, when executed, perform the method of any one of claims 1-12.
CN202010824726.2A 2020-08-17 2020-08-17 Method and device for clustering content samples Active CN111898704B (en)



