CN111898704B - Method and device for clustering content samples


Info

Publication number: CN111898704B
Application number: CN202010824726.2A
Authority: CN (China)
Prior art keywords: content, clustering, sample, samples, content sample
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN111898704A
Inventors: 卢东焕, 赵俊杰, 马锴, 郑冶枫
Current assignee: Tencent Technology Shenzhen Co Ltd
Original assignee: Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010824726.2A
Publication of CN111898704A
Application granted; publication of CN111898704B


Classifications

    • G06F 18/23213: Pattern recognition; clustering techniques; non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions, with a fixed number of clusters, e.g. K-means clustering
    • G06F 18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/2431: Pattern recognition; classification techniques relating to the number of classes; multiple classes
    • G06N 3/045: Neural networks; architectures, e.g. interconnection topology; combinations of networks
    • G06N 3/084: Neural networks; learning methods; backpropagation, e.g. using gradient descent


Abstract

The application describes a method for clustering content samples, comprising: obtaining a dataset comprising a plurality of unlabeled content samples; clustering the content samples with a plurality of mutually different clustering methods to determine, for each content sample under each clustering method, the category corresponding to the highest confidence in the confidence distribution over categories; for each content sample, in response to determining that the categories corresponding to the highest confidences under the respective clustering methods are all the same and that all of those highest confidences are greater than a confidence threshold, labeling the content sample to form a labeled content sample; training a content sample classifier with the labeled content samples and the unlabeled content samples to obtain a trained content sample classifier; and clustering content samples to be clustered using the trained content sample classifier to determine their categories.

Description

Method and device for clustering content samples
Technical Field
The present disclosure relates to the field of data processing, and in particular, to a method and apparatus for clustering content samples.
Background
Currently, when clustering content samples such as image data samples, voice data samples, and text data samples, a two-stage clustering method is generally employed. In a first stage, features are extracted from the content samples with an encoder, and in a second stage the extracted features are clustered using a basic clustering algorithm such as the K-means algorithm to obtain the categories of the respective samples. However, such clustering methods are limited by the feature extraction capability of the encoder and often do not work well, nor can they obtain the category of a content sample end-to-end (i.e., obtain the category directly from the content sample, without first extracting features from it with an encoder). In addition, the basic clustering algorithm used can itself affect the accuracy of clustering.
With the development of artificial intelligence, classifier-based clustering methods can obtain the category of a content sample end-to-end, but training a classifier with good clustering accuracy faces great difficulty, because training samples whose categories are accurately known are in severely short supply, and the process of obtaining training samples is itself affected by the feature extraction capability of the encoder and by the basic clustering algorithm. Some studies have attempted to comprehensively consider the clustering results of multiple basic clustering algorithms, but they perform the fusion with a simple weighted voting method, which is highly inefficient.
Disclosure of Invention
In view of the foregoing, the present disclosure provides methods and apparatus for determining a training set, methods and apparatus for training a content sample classifier, and methods and apparatus for clustering content samples, with which it is desirable to overcome some or all of the above drawbacks, as well as other possible drawbacks.
According to a first aspect of the present disclosure, there is provided a method of clustering content samples, comprising: obtaining a dataset comprising a plurality of unlabeled content samples; clustering the plurality of content samples using each of a plurality of mutually different clustering methods, so as to determine the category corresponding to the highest confidence in the confidence distribution over categories obtained for each content sample under each clustering method; for each content sample of the plurality of content samples, in response to determining that all categories corresponding to the highest confidences into which the content sample is clustered under the plurality of clustering methods are the same and that all of the highest confidences are greater than a confidence threshold, labeling the content sample to form a labeled content sample, wherein the label indicates a confidence of the labeled content sample for each of a plurality of categories corresponding to the output of a content sample classifier; training the content sample classifier with the labeled content samples and the unlabeled content samples in the dataset to obtain a trained content sample classifier; and clustering content samples to be clustered using the trained content sample classifier to determine the categories of the content samples to be clustered.
In some embodiments, the method further comprises: for every two clustering methods of the plurality of clustering methods, determining that the category corresponding to the highest confidence into which each content sample is clustered under the first of the two clustering methods and the category corresponding to the highest confidence into which it is clustered under the second of the two clustering methods are the same, in response to those two categories sharing the largest number of identical content samples among all pairs of categories produced by the two clustering methods.
In some embodiments, labeling each content sample with a label includes: determining an average of the highest confidences of the categories into which the content sample is clustered under the plurality of clustering methods; marking the first confidence of the category corresponding to the highest confidence in the label as that average; and marking the second confidence of each of the other categories in the label, apart from the category corresponding to the highest confidence, such that the sum of the first confidence and all of the second confidences is 1.
In some embodiments, labeling each content sample with a label includes: the first confidence of the category corresponding to the highest confidence in the label is marked as 1, and the second confidence of each of the other categories except the category corresponding to the highest confidence in the label is respectively marked as 0.
In some embodiments, at least one of the plurality of clustering methods comprises a classifier-based clustering method, and before the training of the content sample classifier with the labeled content samples and the unlabeled content samples in the dataset, the method further comprises: training the classifier on which that clustering method is based using the labeled content samples and the unlabeled content samples; clustering the plurality of content samples using the clustering method based on the trained classifier, so as to determine an updated confidence distribution over categories for each content sample under that clustering method; and re-forming the labeled content samples based on the updated confidence distributions together with the confidence distributions over categories into which each content sample is clustered under the other clustering methods of the plurality of clustering methods.
In some embodiments, at least one of the plurality of clustering methods is a K-means clustering method, and clustering the plurality of content samples using the K-means clustering method further comprises: extracting feature data of a plurality of dimensions for each of the plurality of content samples; performing dimension reduction on the feature data of the plurality of dimensions; and clustering the plurality of content samples based on the reduced-dimension feature data of the plurality of content samples.
In some embodiments, training the content sample classifier with the labeled content samples and the unlabeled content samples in the dataset comprises: training the content sample classifier by adjusting its parameters such that a total loss function is minimized, wherein the total loss function is the sum of the loss functions for each content sample. The loss function for each labeled content sample is used to constrain the closeness of the confidence distribution over categories output by the content sample classifier for that sample to the label of that sample. The loss function for each unlabeled content sample is used to constrain the invariance of the confidence distribution output by the content sample classifier for that sample under random transformation, and the similarity of the confidence distribution over categories output by the classifier to a one-hot vector, wherein the confidence distribution over categories includes a confidence for each of the plurality of categories.
In some embodiments, training the content sample classifier by adjusting parameters of the content sample classifier such that a total loss function is minimized comprises: the content sample classifier is trained in a back-propagation manner, wherein a learning rate for each parameter of the content sample classifier is dynamically adjusted by computing a first moment estimate and a second moment estimate of a gradient of a loss function in back-propagation.
In some embodiments, training the content sample classifier with the labeled content samples and the unlabeled content samples in the dataset comprises: training the classifier in batches, with the same number of labeled content samples and unlabeled content samples selected each time.
In some embodiments, clustering content samples to be clustered using a trained content sample classifier to determine a category of the content samples to be clustered comprises: inputting the content samples to be clustered into a trained content sample classifier, so that the content sample classifier outputs confidence distribution of the content samples to be clustered, which corresponds to the plurality of categories; the category with the highest confidence in the confidence distribution is determined as the category of the content sample to be clustered.
In some embodiments, each content sample of the plurality of content samples comprises an image data sample, and the structure of the content sample classifier comprises a convolutional neural network.
In some embodiments, each of the plurality of content samples comprises one of a text data sample and a speech data sample, and the structure of the content sample classifier comprises one of a recurrent neural network, a long short-term memory network, and a Bidirectional Encoder Representations from Transformers (BERT) model.
According to a second aspect of the present disclosure, there is provided an apparatus for clustering content samples, comprising: an acquisition module configured to acquire a dataset comprising a plurality of unlabeled content samples; a clustering module configured to cluster the plurality of content samples using each of a plurality of clustering methods different from each other, so as to determine a category corresponding to a highest confidence in a confidence distribution of a category obtained by clustering each content sample under each of the clustering methods; a tagging module configured to, for each content sample of the plurality of content samples, in response to determining: labeling each content sample to form a labeled content sample if all categories corresponding to highest confidence levels into which the each content sample is clustered under the plurality of clustering methods, respectively, are the same and the all highest confidence levels are greater than a confidence level threshold, wherein the label indicates a confidence level of the labeled content sample for each of a plurality of categories corresponding to an output of a content sample classifier; a training module configured to train the content sample classifier with the labeled content samples and the unlabeled content samples in the dataset to obtain a trained content sample classifier; a determination module configured to cluster content samples to be clustered using a trained content sample classifier to determine a category of the content samples to be clustered.
According to a third aspect of the present disclosure, there is provided a method of determining a training set, comprising: obtaining a dataset comprising a plurality of unlabeled content samples; clustering the plurality of content samples by using each of a plurality of clustering methods different from each other to determine a category corresponding to a highest confidence in a confidence distribution of the categories obtained by clustering each content sample under each clustering method; for each content sample of the plurality of content samples, in response to determining: labeling each content sample to form a labeled content sample if all categories corresponding to highest confidence levels into which the each content sample is clustered under the plurality of clustering methods, respectively, are the same and the all highest confidence levels are greater than a confidence level threshold, wherein the label indicates a confidence level of the labeled content sample for each of a plurality of categories corresponding to an output of a content sample classifier; a labeled content sample and an unlabeled content sample in the dataset are together determined as a training set for the content sample classifier.
According to a fourth aspect of the present disclosure, there is provided an apparatus for determining a training set, comprising: a dataset acquisition module configured to acquire a dataset including a plurality of unlabeled content samples; a sample clustering module configured to cluster the plurality of content samples using each of a plurality of mutually different clustering methods, so as to determine the category corresponding to the highest confidence in the confidence distribution over categories obtained for each content sample under each clustering method; a sample tagging module configured to, for each content sample of the plurality of content samples, in response to determining that all categories corresponding to the highest confidences into which the content sample is clustered under the plurality of clustering methods are the same and that all of the highest confidences are greater than a confidence threshold, label the content sample to form a labeled content sample, wherein the label indicates a confidence of the labeled content sample for each of a plurality of categories corresponding to the output of a content sample classifier; and a training set determination module configured to determine the labeled content samples and the unlabeled content samples in the dataset together as a training set for the content sample classifier.
According to a fifth aspect of the present disclosure, there is provided a method of training a content sample classifier, comprising: obtaining a training set for a content sample classifier, wherein the training set is determined according to the method of the third aspect of the present disclosure and includes labeled content samples and unlabeled content samples; the content sample classifier is trained using the labeled content samples and unlabeled content samples in a training set.
According to a sixth aspect of the present disclosure, there is provided an apparatus for training a content sample classifier, comprising: a training set acquisition module configured to acquire a training set for a content sample classifier, wherein the training set is determined according to the apparatus for determining a training set of the fourth aspect of the present disclosure and includes a labeled content sample and an unlabeled content sample; a classifier training module configured to train the content sample classifier with the labeled content samples and unlabeled content samples in a training set.
According to a seventh aspect of the present disclosure, there is provided a computing device comprising a processor; and a memory configured to store computer-executable instructions thereon that, when executed by the processor, perform any of the methods described above.
According to an eighth aspect of the present disclosure, there is provided a computer readable storage medium storing computer executable instructions that, when executed, perform any of the methods as described above.
In the methods and apparatus for determining a training set, for training a content sample classifier, and for clustering content samples claimed in this disclosure, by fully utilizing the clustering results of a plurality of clustering methods on the content samples in the dataset, the clean samples and the noise samples in the dataset can be accurately determined. Labeling the clean samples and training the content sample classifier on them together with the noise samples greatly improves the clustering accuracy and generalization of the trained content sample classifier, and enables the trained classifier to obtain the category of a content sample end-to-end.
These and other advantages of the present disclosure will become apparent from and elucidated with reference to the embodiments described hereinafter.
Drawings
Embodiments of the present disclosure will now be described in more detail and with reference to the accompanying drawings, in which:
FIG. 1 illustrates a schematic flow diagram of a method of determining a training set in accordance with one embodiment of the present disclosure;
FIG. 2 illustrates a schematic flow diagram of a method of training a content sample classifier in accordance with one embodiment of the present disclosure;
FIG. 3 illustrates a schematic flow diagram of a method of clustering content samples according to one embodiment of the present disclosure;
FIG. 4 illustrates an exemplary flow chart of a method of updating a "clean" sample in accordance with one embodiment of the present disclosure;
FIG. 5 illustrates an exemplary schematic diagram of clustering content samples according to one embodiment of the present disclosure;
FIG. 6 illustrates an exemplary block diagram of an apparatus for determining a training set in accordance with one embodiment of the present disclosure;
FIG. 7 illustrates an exemplary block diagram of an apparatus for training a content sample classifier in accordance with one embodiment of the present disclosure;
FIG. 8 illustrates an exemplary block diagram of an apparatus for clustering content samples according to one embodiment of the present disclosure;
FIG. 9 illustrates an example system including an example computing device that represents one or more systems and/or devices that can implement the various techniques described herein.
Detailed Description
The following description provides specific details of various embodiments of the disclosure so that those skilled in the art may fully understand and practice them. It should be understood that the technical solutions of the present disclosure may be practiced without some of these details. In some instances, well-known structures or functions are not shown or described in detail, to avoid obscuring the description of the embodiments with unnecessary detail. The terminology used in the present disclosure should be understood in its broadest reasonable manner, even when used in conjunction with a particular embodiment of the present disclosure.
First, some terms related to the embodiments of the present application will be described so as to be easily understood by those skilled in the art.
Clustering: the process of dividing a collection of physical or abstract objects into classes composed of similar objects is called clustering. As an important method of unsupervised learning, the idea of clustering is to group samples or objects with similar properties into one class. A class generated by clustering is a collection of objects that are similar to one another and dissimilar to objects in other classes. Common clustering methods include K-means clustering, mean-shift clustering, density-based clustering methods, and the like.
A classifier: the conventional task of a classifier is to learn classification rules with given classes, known training data, and then classify or predict unknown data. The classifier is a generic term of a method for classifying samples in data mining, and comprises algorithms such as decision trees, logistic regression, naive Bayes, neural networks and the like.
Semi-supervised learning: a training/learning paradigm in machine learning that lies between supervised learning and unsupervised learning, combining the two. In semi-supervised learning, part of the data used for training is labeled and the other part is unlabeled, and the amount of unlabeled data is often much larger than the amount of labeled data (which also matches practical reality). The basic law underlying semi-supervised learning is that the distribution of the data is not completely random; acceptable, or even very good, classification results can be obtained from the local features of some labeled data combined with the overall distribution of a larger amount of unlabeled data.
Back propagation: essentially a gradient descent algorithm with a recursive structure, widely used as the basic learning/training method for deep neural networks.
Artificial intelligence (AI): the theory, methods, techniques, and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines can perceive, reason, and make decisions. Artificial intelligence technology is a comprehensive discipline spanning a wide range of fields, at both the hardware level and the software level. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Fig. 1 illustrates a schematic flow diagram of a method 100 of determining a training set according to one embodiment of the present disclosure. The training set may be used to train a content sample classifier for classifying content samples to obtain categories of content samples. As shown in fig. 1, the method 100 includes the following steps 101-104.
In step 101, a dataset comprising a plurality of unlabeled content samples is acquired. Each of the plurality of content samples may be, but is not limited to, an image data sample, a text data sample, or a voice data sample, and in fact the type of content sample is not limited herein.
In step 102, the plurality of content samples are clustered using each of a plurality of mutually different clustering methods, so as to determine the category corresponding to the highest confidence in the confidence distribution over categories obtained for each content sample under each clustering method. The clustering methods are different from one another, and each may be any suitable clustering method, such as K-means clustering, DEC (Deep Embedded Clustering), IDEC (Improved Deep Embedded Clustering), DCEC (Deep Convolutional Embedded Clustering), a classifier-based clustering method using artificial intelligence, etc., although this is not limiting.
As an example, given N unlabeled content samples $\{x_i : i \in \{1, \dots, N\}\}$, clustering these content samples using multiple mutually different clustering methods yields confidence distributions over the categories $\{s_{ik}^{(j)} : i \le N,\ j \le M,\ k \le K\}$, where M is the number of clustering methods, K is the number of categories formed by each clustering method, and $s_{ik}^{(j)}$ denotes the confidence that the i-th sample belongs to the k-th category under the j-th clustering method. It should be noted that the number of categories formed by each clustering method is generally set in advance, and is generally set to be the same for all clustering methods.
In some embodiments, for a classifier-based clustering method, the classifier may directly output the confidence of each content sample belonging to each of the plurality of categories, i.e., the confidence distribution over the categories. The highest confidence may then be determined from the confidence distribution of each content sample, and the category corresponding to it derived. For example, if, under the 1st clustering method, the classifier outputs for the 1st sample a confidence distribution whose largest entry is 0.90 at the 2nd category, then the highest confidence is determined to be 0.90 and the category corresponding to the highest confidence of the 1st sample is the 2nd category.
In some embodiments, for a clustering method such as the K-means clustering method that directly yields the cluster category (i.e., the category corresponding to the highest confidence is obtained directly) rather than a confidence for each category, the Student t-distribution may be used to compute the confidence of a sample for each category (i.e., the confidence distribution over the categories for that sample). This soft assignment can be expressed as

$$q_{ik} = \frac{\left(1 + \lVert z_i - \mu_k \rVert^2 / \alpha\right)^{-\frac{\alpha + 1}{2}}}{\sum_{k'} \left(1 + \lVert z_i - \mu_{k'} \rVert^2 / \alpha\right)^{-\frac{\alpha + 1}{2}}}$$

where α is the degrees of freedom of the Student t-distribution, generally set to 1; $z_i$ is the feature of the i-th content sample; and $\mu_k$ is the center of the k-th cluster obtained by clustering (generally the average of the spatial coordinates corresponding to the features of the content samples in that cluster).
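As an illustrative, non-authoritative sketch of this soft assignment (the function name soft_assign and the array shapes are assumptions made for illustration), in Python:

    import numpy as np

    def soft_assign(z, mu, alpha=1.0):
        # Student-t soft assignment (illustrative sketch).
        # z:  (N, D) array of content-sample features.
        # mu: (K, D) array of cluster centers.
        # Returns an (N, K) array whose i-th row is the confidence
        # distribution of sample i over the K categories.
        d2 = ((z[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)  # squared distances, (N, K)
        q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
        return q / q.sum(axis=1, keepdims=True)  # normalize over categories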
In some embodiments, at least one of the plurality of clustering methods is a K-means clustering method. In this case, when the plurality of content samples are clustered using the K-means clustering method, feature data of a plurality of dimensions may first be extracted for each of the plurality of content samples; dimension reduction is then performed on the feature data of the plurality of dimensions; and finally the plurality of content samples are clustered based on the reduced-dimension feature data. Extracting feature data of multiple dimensions of a content sample improves the accuracy of clustering, while reducing the dimension of the feature data reduces the amount of computation in the clustering process. As an example, the feature data of the multiple dimensions may be reduced using a principal component analysis (PCA) method. Principal component analysis converts a set of possibly correlated variables into a set of linearly uncorrelated variables by an orthogonal transformation; the converted variables are referred to as principal components. It should be noted that the principal component analysis method is merely an example; virtually any method that can convert multidimensional feature data into data of fewer dimensions is possible, and this is not a limitation.
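The following is a minimal sketch of this pipeline (PCA followed by K-means) using scikit-learn; the number of retained components is an assumed illustrative value, not prescribed by the disclosure:

    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    def kmeans_with_pca(features, n_clusters, n_components=50, seed=0):
        # Reduce the multi-dimensional feature data, then cluster it.
        reduced = PCA(n_components=n_components, random_state=seed).fit_transform(features)
        km = KMeans(n_clusters=n_clusters, random_state=seed).fit(reduced)
        return km.labels_, km.cluster_centers_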
At step 103, for each content sample of the plurality of content samples, in response to determining that all categories corresponding to the highest confidences into which the content sample is clustered under the plurality of clustering methods are the same and that all of the highest confidences are greater than a confidence threshold, the content sample is labeled to form a labeled content sample. The label indicates the confidence of the labeled content sample for each of a plurality of categories corresponding to the output of a content sample classifier. In an embodiment of the present disclosure, if a content sample simultaneously satisfies both conditions, namely (a) all of the categories corresponding to the highest confidences into which it is clustered under the plurality of clustering methods are the same and (b) all of those highest confidences are greater than a confidence threshold, it may be determined that the categories into which the content sample is clustered under the different clustering methods are all the same; such a content sample may be determined to be a "clean" sample and may therefore be labeled. Correspondingly, content samples in the dataset that do not simultaneously satisfy both of conditions (a) and (b) may be determined to be "noise" samples, and are therefore not labeled, since the categories into which such content samples are clustered are likely to be inaccurate.
Taking the use of two different clustering methods as an example, if the two highest confidences into which a content sample is clustered under the two clustering methods correspond to the same category, and each of the two highest confidences is greater than a confidence threshold, then the content sample may be determined to be a "clean" sample and may therefore be labeled. This can be expressed as:

$$\mathrm{clean}_i = \begin{cases} \text{True}, & c_i^{(1)} = c_i^{(2)} \ \text{and} \ s_i^{(1)} > t \ \text{and} \ s_i^{(2)} > t \\ \text{False}, & \text{otherwise} \end{cases}$$

where $\mathrm{clean}_i$ indicates whether the i-th content sample is a "clean" sample (True means it is, False means it is not), t is the confidence threshold, $s_i^{(1)}$ and $s_i^{(2)}$ denote the highest confidences of the categories into which the i-th content sample is clustered under the 1st and the 2nd clustering method respectively, and $c_i^{(1)}$ and $c_i^{(2)}$ denote the categories corresponding to those two highest confidences.
Since the categories produced by a clustering method are generated randomly, the categories generated by different clustering methods do not correspond to one another, which makes condition (a) difficult to evaluate. In some embodiments of the present disclosure, for every two clustering methods of the plurality of clustering methods, the category corresponding to the highest confidence into which a content sample is clustered under the first of the two methods and the category corresponding to the highest confidence into which it is clustered under the second are determined to be the same when, among all pairs of categories produced by the two methods, those two categories share the largest number of identical content samples. This provides an efficient and accurate way to determine, when two or more clustering methods are used, whether the categories corresponding to the highest confidences into which each content sample is clustered under the plurality of clustering methods are the same.
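One way to realize this matching is to count, for every pair of categories from two clustering methods, the number of shared content samples, and then align the categories by maximum overlap. The sketch below uses the Hungarian algorithm to make the per-pair rule globally consistent; this is an implementation choice assumed here for illustration, not prescribed by the disclosure:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def match_categories(labels_a, labels_b, n_classes):
        # overlap[a, b] = number of samples assigned to category a by
        # method A and to category b by method B.
        overlap = np.zeros((n_classes, n_classes), dtype=np.int64)
        for a, b in zip(labels_a, labels_b):
            overlap[a, b] += 1
        # Assignment maximizing the total number of shared samples.
        rows, cols = linear_sum_assignment(-overlap)
        mapping = np.empty(n_classes, dtype=np.int64)
        mapping[cols] = rows  # mapping[j] = category of A matched to category j of B
        return mapping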
In some embodiments, when labeling each unlabeled content sample, the label $q_{ik}$ may be assigned as

$$q_{ik} = \begin{cases} \dfrac{1}{J} \sum_{j=1}^{J} s_i^{(j)}, & k = m \\[2ex] \dfrac{1}{K - 1} \left( 1 - \dfrac{1}{J} \sum_{j=1}^{J} s_i^{(j)} \right), & k \neq m \end{cases}$$

where i is the index of the content sample in the dataset, K is the number of categories produced by each clustering method, $q_{ik}$ (with k ≤ K) is the confidence of the i-th content sample for the k-th category, m is the category corresponding to the highest confidence of the i-th content sample, J is the total number of clustering methods, and $s_i^{(j)}$ is the highest confidence of the i-th content sample under the j-th clustering method. It should be noted that the specific value or identification of the category m corresponding to the highest confidence of the i-th content sample may be predetermined (e.g., determined to be the 2nd category), as long as the determined categories are kept consistent with the categories corresponding to the confidences output by the content sample classifier.
In other words, when labeling a content sample, an average of the highest confidences of the categories into which the content sample is clustered under the plurality of clustering methods may be determined, and the first confidence of the category corresponding to the highest confidence in the label (i.e., the confidence corresponding to the expected category of the content sample) may be marked as that average; the second confidence of each of the other categories in the label, apart from the category corresponding to the highest confidence, is then marked such that the sum of the first confidence and all second confidences is 1, i.e., each second confidence is set to the difference between 1 and the average, divided by the number of other categories.
In some embodiments, when labeling a content sample, a first confidence level of a category corresponding to the highest confidence level in the label (i.e., a confidence level corresponding to an expected category of the content sample) may be labeled 1 (i.e., a confidence level of 100%), and a second confidence level of each of the other categories in the label, except the category corresponding to the highest confidence level, may be labeled 0, respectively. Of course, this is not limiting as long as the category corresponding to the highest confidence level can be significantly distinguished from the remaining categories.
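The two labeling schemes just described can be sketched as follows (function names are illustrative assumptions):

    import numpy as np

    def make_soft_label(highest_confs, m, K):
        # Category m gets the average of the highest confidences from the
        # J clustering methods; the other K-1 categories share the rest,
        # so the label sums to 1.
        avg = float(np.mean(highest_confs))
        q = np.full(K, (1.0 - avg) / (K - 1))
        q[m] = avg
        return q

    def make_onehot_label(m, K):
        # Hard variant: 1 for category m, 0 for every other category.
        q = np.zeros(K)
        q[m] = 1.0
        return q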
In some embodiments, at least one of the plurality of clustering methods is the classifier-based clustering method described above. In this case, the "clean" samples can be updated, so that the obtained "clean" samples are more accurate, which in turn improves the clustering accuracy of the finally trained content sample classifier. By way of example, fig. 4 illustrates an exemplary flowchart of a method of updating "clean" samples. As shown in fig. 4, in step 401, the labeled content samples and unlabeled content samples may be used to train the classifier on which the clustering method is based. At step 402, the plurality of content samples may be clustered using the clustering method based on the trained classifier, to determine an updated confidence distribution over categories for each content sample under that method. At step 403, the labeled content samples are re-formed based on the updated confidence distributions and the confidence distributions over categories into which each content sample is clustered under the other clustering methods of the plurality of clustering methods. Step 403 corresponds to performing step 103 once again. Optionally, this update procedure may be repeated until the set of "clean" samples no longer changes, although this is not limiting.
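A sketch of the update loop of fig. 4 follows; all helper names (split_clean_noise, train_semi_supervised, predict_proba) are hypothetical stand-ins for the steps described above, and the stopping rule follows the optional repetition just mentioned:

    def refine_clean_set(samples, other_dists, classifier, t=0.95):
        # split_clean_noise: step 103/403; train_semi_supervised: step 401;
        # classifier.predict_proba: step 402 (all assumed helpers).
        clean_idx, labels = split_clean_noise(other_dists, classifier.predict_proba(samples), t)
        while True:
            train_semi_supervised(classifier, samples, clean_idx, labels)       # step 401
            updated = classifier.predict_proba(samples)                         # step 402
            new_clean_idx, labels = split_clean_noise(other_dists, updated, t)  # step 403
            if set(new_clean_idx) == set(clean_idx):  # "clean" set stable: stop
                return clean_idx, labels
            clean_idx = new_clean_idx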
At step 104, the labeled content samples and the unlabeled content samples in the dataset are together determined as a training set for the content sample classifier. The labeled content samples are the content samples labeled in step 103, and the unlabeled content samples are the remaining content samples in the dataset that were not labeled. Determining the labeled and unlabeled content samples together as a training set makes it possible to train the content sample classifier.
It should be noted that embodiments of the present disclosure are not limited to a specific structure of the content sample classifier, which may be adapted to the type of content sample. For example, where each of the plurality of content samples is an image data sample or a voice data sample, the structure of the content sample classifier may be a CNN (Convolutional Neural Network) or the like. Where each of the plurality of content samples is a text data sample, the structure of the content sample classifier may be an RNN (Recurrent Neural Network), an LSTM (Long Short-Term Memory network), a BERT (Bidirectional Encoder Representations from Transformers) model, or the like.
It should be noted that the term "plurality" in the embodiments of the present disclosure includes two as well as more than two unless specified otherwise. For example, the plurality of clustering methods includes two clustering methods and more than two clustering methods may be included.
In the method of determining a training set described in the embodiments of the present disclosure, all content samples are divided into a set of "clean" samples and a set of "noise" samples using the clustering results of a plurality of clustering methods on the content samples in the dataset. The "clean" samples are labeled to form labeled content samples, with the labels constructed from the output results of the plurality of clustering methods, while the "noise" samples are kept as unlabeled content samples. After this division, the labeled content samples and the unlabeled content samples can together be used as a training set to train the content sample classifier. By fully utilizing the clustering results of the plurality of clustering methods on the content samples in the dataset, the "clean" samples and "noise" samples can be accurately determined. Labeling the "clean" samples and using them together with the "noise" samples as a training set greatly improves the clustering accuracy and generalization of the trained content sample classifier, and enables the trained classifier to obtain the category of a content sample end-to-end.
Fig. 2 illustrates a schematic flow diagram of a method 200 of training a content sample classifier according to one embodiment of the present disclosure. As shown in fig. 2, the method 200 includes the following steps.
In step 201, a training set for a content sample classifier is obtained. The content sample classifier is used for clustering the content samples to obtain categories of the content samples. The training set is determined, for example, according to the method 100 described with reference to fig. 1, and includes labeled content samples and unlabeled content samples. Each of the tagged content samples and untagged content samples may be an image data sample, a text data sample, or a voice data sample, and the type of content sample is not limited herein.
It should be noted that embodiments of the present disclosure are not limited to a specific structure of the content sample classifier, which may be adapted to the type of content sample. For example, where each of the plurality of content samples is an image data sample or a voice data sample, the structure of the content sample classifier may be a CNN (Convolutional Neural Network) or the like. Where each of the plurality of content samples is a text data sample, the structure of the content sample classifier may be an RNN (Recurrent Neural Network), an LSTM (Long Short-Term Memory network), a BERT (Bidirectional Encoder Representations from Transformers) model, or the like.
At step 202, the content sample classifier is trained using the labeled content samples and unlabeled content samples in a training set. As an example, the content sample classifier may be trained in a semi-supervised learning manner using the labeled content samples and unlabeled content samples in a training set. The clustering accuracy of the trained content sample classifier can be improved by using the labeled content samples, and the clustering generalization (namely, generalization capability) of the trained content sample classifier can be improved by using the unlabeled content samples.
In some embodiments, when the content sample classifier is trained with the labeled content samples and unlabeled content samples in the training set, the classifier may be trained by adjusting its parameters such that a total loss function is minimized. The total loss function is the sum of the loss functions for each content sample (including both labeled and unlabeled content samples). As an example, the loss function for each labeled content sample is used to constrain the closeness of the confidence distribution over categories output by the content sample classifier for that sample to its label; the loss function for each unlabeled content sample is used to constrain the invariance of the confidence distribution output by the classifier for that sample under random transformation, and the similarity of the confidence distribution over categories output by the classifier to a one-hot vector, where the confidence distribution over categories includes a confidence for each of the plurality of categories.
As an example, the total loss function may be expressed as:

$$L = \sum_i \ell_i, \qquad \ell_i = \begin{cases} H\left(q_i \,\|\, p_\theta(y|x_i)\right), & \mathrm{clean}_i = \text{True} \\ H\left(p_\theta(y|T_1(x_i)) \,\|\, p_\theta(y|T_2(x_i))\right) + H\left(p_\theta(y|x_i)\right), & \mathrm{clean}_i = \text{False} \end{cases}$$

where i is the index of the content sample in the dataset, $\ell_i$ is the loss function of the i-th content sample, H(·||·) denotes the cross-entropy loss function, H(·) denotes the entropy loss function, $p_\theta(y|x_i)$ denotes the confidence distribution over categories output by the classifier for $x_i$, $\mathrm{clean}_i$ indicates whether the i-th content sample is a "clean" sample (True) or a "noise" sample (False), and $T_1$ and $T_2$ denote the mutually different first and second random transforms applied to the i-th content sample.
In the above expression, when the i-th content sample is a "clean" sample (i.e., clean_i = True), the loss function for that sample is $H(q_i \| p_\theta(y|x_i))$, which constrains the closeness of the confidence distribution over categories output by the content sample classifier for the i-th content sample to the label of the i-th content sample, namely: by adjusting the parameters of the content sample classifier, the confidence distribution over categories output for the i-th content sample is made as close as possible to the label of the i-th content sample.
In the above expression, when the i-th content sample is a "noise" sample (i.e., clean_i = False), the loss function for that sample is $H(p_\theta(y|T_1(x_i)) \| p_\theta(y|T_2(x_i))) + H(p_\theta(y|x_i))$. The first term constrains the invariance between the confidence distribution over categories output by the content sample classifier after the first random transform of the i-th content sample and the confidence distribution output after the second random transform. Since the labels of "noise" samples cannot be determined, embodiments of the present disclosure train on them from the standpoint of invariance under data augmentation, namely: the confidence distribution over categories output by the content sample classifier for a content sample should not be affected by the random transform. The second term, $H(p_\theta(y|x_i))$, constrains the similarity of the confidence distribution over categories output by the classifier to a one-hot vector, namely: by adjusting the parameters of the content sample classifier, the confidence distribution over categories output for the i-th content sample is made as close as possible to the form of a one-hot vector. A one-hot vector is a vector formed by one-hot encoding. One-hot encoding, also known as one-bit-effective encoding, uses an N-bit state register to encode N states; each state has its own register bit, and at any time only one of the bits is valid. For example, the six one-hot vectors of length 6 are 000001, 000010, 000100, 001000, 010000, and 100000, where only one bit of each vector is valid.
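A minimal PyTorch sketch of the two per-sample losses follows (assuming the classifier outputs logits; detaching one branch of the invariance term is an implementation assumption, not stated in the disclosure):

    import torch
    import torch.nn.functional as F

    def clean_loss(logits, q):
        # H(q || p_theta): cross entropy between the label q and the
        # classifier's confidence distribution.
        return -(q * F.log_softmax(logits, dim=-1)).sum(dim=-1)

    def noise_loss(logits_t1, logits_t2, logits):
        # H(p(T1 x) || p(T2 x)): invariance under the two random transforms.
        p1 = F.softmax(logits_t1, dim=-1).detach()
        invariance = -(p1 * F.log_softmax(logits_t2, dim=-1)).sum(dim=-1)
        # H(p): entropy, pushing the distribution toward a one-hot vector.
        p = F.softmax(logits, dim=-1)
        entropy = -(p * F.log_softmax(logits, dim=-1)).sum(dim=-1)
        return invariance + entropy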
It should be noted that the random transformation described above may be any suitable random transformation, and is not limited in this regard. For example, when the content sample is an image data sample, the random transformation may be random cropping, random horizontal transformation, color dithering, or random combination of color channels, etc. When the content sample is a text data sample, the stochastic transformation may be to translate the content sample into another language and then back into the original language (semantically unchanged, but the text changed). It should also be noted that the above described loss functions are not limiting and any other suitable loss function may be used.
In some embodiments, when the content sample classifier is trained with the labeled content samples and unlabeled content samples in the training set, the classifier may be trained in batches, with the same number of labeled and unlabeled content samples selected from the training set each time. For example, B "clean" samples and B "noise" samples may be sampled each time, B being a positive integer, and the loss functions of these sampled samples are then added together as the total loss function at each training step. Using the same number of labeled and unlabeled content samples balances the clustering accuracy and the generalization of the trained content sample classifier, so that it achieves a better clustering effect.
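Such balanced sampling might be sketched as follows (an assumed illustrative helper, not the disclosure's own code):

    import numpy as np

    def balanced_batches(clean_idx, noise_idx, B, seed=0):
        # Yield B "clean" and B "noise" sample indices per training step.
        rng = np.random.default_rng(seed)
        while True:
            yield (rng.choice(clean_idx, size=B, replace=False),
                   rng.choice(noise_idx, size=B, replace=False))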
As an example, the parameters of the content sample classifier may be adjusted by back propagation (backpropagation) according to the total loss function described above. During back propagation, the learning rate for each parameter of the content sample classifier may be dynamically adjusted by computing first moment and second moment estimates of the gradient of the loss function. Back propagation is essentially based on a gradient descent method; during training, the step size may be 0.001, the exponential decay rate β1 of the first moment estimate set to 0.9, and the exponential decay rate β2 of the second moment estimate set to 0.999. For optimization, the batch size B may be set to 128, and L2 regularization with a regularization coefficient of 0.0001 is applied to all parameters. Each time the content samples in the training set have been used for training 50 times, the step size is decayed by a factor of 0.1. Typically, the confidence threshold t may be set to 0.95 when dividing "clean" samples. The content sample classifier obtained by such training shows a good clustering effect.
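The per-parameter learning-rate scheme described here (first and second moment estimates of the gradient) corresponds to the Adam optimizer. Under the stated hyperparameters, a PyTorch configuration might look as follows (classifier is assumed to be defined elsewhere; this is a sketch, not the disclosure's own code):

    import torch

    optimizer = torch.optim.Adam(
        classifier.parameters(),
        lr=0.001,                 # step size
        betas=(0.9, 0.999),       # exponential decay rates of the moment estimates
        weight_decay=0.0001,      # L2 regularization coefficient
    )
    # Decay the step size by a factor of 0.1 every 50 passes over the training set.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)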
In the method of training a content sample classifier described in the embodiments of the present disclosure, the content sample classifier is trained using the labeled content samples and the unlabeled content samples together as a training set, which greatly improves the clustering accuracy and generalization of the trained classifier, improving the clustering effect while also enabling the trained content sample classifier to obtain the category of a content sample end-to-end.
Fig. 3 illustrates a schematic flow diagram of a method 300 of clustering content samples according to one embodiment of the present disclosure. As shown in fig. 3, the method 300 includes the following steps.
In step 301, a dataset comprising a plurality of unlabeled content samples is acquired. Each of the plurality of content samples may be, but is not limited to, an image data sample, a text data sample, or a voice data sample, and in fact the type of content sample is not limited herein.
In step 302, the plurality of content samples are clustered using each of a plurality of mutually different clustering methods, so as to determine the category corresponding to the highest confidence in the confidence distribution over categories obtained for each content sample under each clustering method. Each clustering method may be any suitable clustering method, such as K-means clustering, DEC (Deep Embedded Clustering), IDEC (Improved Deep Embedded Clustering), DCEC (Deep Convolutional Embedded Clustering), a classifier-based clustering method using artificial intelligence, etc., although this is not limiting.
As an example, given N unlabeled content samples $\{x_i : i \in \{1, \dots, N\}\}$, clustering these content samples using multiple mutually different clustering methods yields confidence distributions over the categories $\{s_{ik}^{(j)} : i \le N,\ j \le M,\ k \le K\}$, where M is the number of clustering methods, K is the number of categories formed by each clustering method, and $s_{ik}^{(j)}$ denotes the confidence that the i-th sample belongs to the k-th category under the j-th clustering method. It should be noted that the number of categories formed by each clustering method is generally set in advance, and is generally set to be the same for all clustering methods.
In some embodiments, for a classifier-based clustering approach, the classifier may directly output the confidence of each content sample as belonging to each category. The highest confidence may then be determined from the confidence of each content sample as belonging to each category, and the category corresponding to the highest confidence may be derived therefrom. In some embodiments, in a clustering method such as the K-means clustering method that directly reflects the clustering class (i.e., the class to which the highest confidence corresponds is directly obtained) rather than the confidence for each class, student t-distribution may be employed to calculate the confidence of a sample for each class (i.e., the confidence distribution of the class of the sample), as described with reference to step 102 of fig. 1.
In some embodiments, at least one of the plurality of clustering methods is a K-means clustering method. In this case, when the plurality of content samples are clustered using the K-means clustering method, feature data of a plurality of dimensions may first be extracted for each of the plurality of content samples; dimension reduction is then performed on the feature data of the plurality of dimensions; and finally the plurality of content samples are clustered based on the reduced-dimension feature data. Extracting feature data of multiple dimensions of a content sample improves the accuracy of clustering, and reducing the dimension of the feature data reduces the amount of computation in the clustering process.
In step 303, for each content sample of the plurality of content samples, in response to determining: and labeling each content sample to form a labeled content sample if all the categories corresponding to the highest confidence levels into which each content sample is clustered under the plurality of clustering methods are the same, and if all the highest confidence levels are greater than a confidence level threshold, wherein the label indicates the confidence level of the labeled content sample for each of a plurality of categories corresponding to the output of a content sample classifier.
In an embodiment of the present disclosure, if one content sample simultaneously satisfies both conditions, namely (a) the categories corresponding to the highest confidences into which the content sample is clustered under the plurality of clustering methods are all the same, and (b) all of those highest confidences are greater than a confidence threshold, it may be determined that the categories into which the content sample is clustered under the different clustering methods agree and have high accuracy; such a content sample may be determined to be a "clean" sample and may therefore be labeled. Correspondingly, content samples in the dataset that fail either of the two conditions (a) and (b) may be determined to be "noise" samples and therefore not labeled, as the categories into which such content samples are clustered are likely to be inaccurate.
Taking two different clustering methods as an example: if the categories corresponding to the two highest confidences into which one content sample is clustered under the two methods are the same, and each of the two highest confidences is greater than the confidence threshold, then the content sample may be determined to be a "clean" sample and may therefore be labeled, as sketched below.
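A sketch of this two-method "clean"/"noise" split, assuming NumPy and assuming the category indices of the second method have already been aligned to those of the first (see the matching sketch after the next paragraph); all names are illustrative:

```python
import numpy as np

def select_clean_samples(conf_a, conf_b, mapping, threshold):
    """Return indices of 'clean' samples under two clustering methods.

    conf_a, conf_b : (N, K) confidence distributions from methods A and B.
    mapping        : dict aligning method-B category indices to method-A's.
    threshold      : confidence threshold for condition (b).
    """
    cat_a = conf_a.argmax(axis=1)
    cat_b = np.array([mapping[int(c)] for c in conf_b.argmax(axis=1)])
    same = cat_a == cat_b                                    # condition (a)
    confident = (conf_a.max(axis=1) > threshold) & \
                (conf_b.max(axis=1) > threshold)             # condition (b)
    return np.where(same & confident)[0]
```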
Since the categories produced by a clustering method are indexed arbitrarily, the categories generated by different clustering methods do not correspond to one another, which makes condition (a) difficult to check directly. In some embodiments of the present disclosure, for every two clustering methods of the plurality of clustering methods, if the category corresponding to the highest confidence into which a content sample is clustered by the first clustering method is, among all the categories of the first clustering method, the one sharing the largest number of identical content samples with the category corresponding to the highest confidence into which the sample is clustered by the second clustering method, then the categories corresponding to the highest confidences under the plurality of clustering methods are determined to be all the same. This provides an efficient and accurate way to check, when two or more clustering methods are used, whether the categories corresponding to the highest confidences into which each content sample is clustered under the plurality of clustering methods are the same.
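One way to realize this maximal-overlap matching is sketched below, under the assumption that hard category assignments are available for both methods; the helper name is illustrative:

```python
import numpy as np

def match_categories(labels_a, labels_b, n_categories):
    """For each category of method B, find the method-A category with
    which it shares the largest number of identical content samples."""
    overlap = np.zeros((n_categories, n_categories), dtype=int)
    for a, b in zip(labels_a, labels_b):
        overlap[a, b] += 1
    # mapping[b] = the method-A category overlapping category b the most
    return {b: int(overlap[:, b].argmax()) for b in range(n_categories)}
```

A greedy argmax can map two method-B categories to the same method-A category; where a strict one-to-one matching is wanted, the Hungarian algorithm (e.g., scipy.optimize.linear_sum_assignment on the negated overlap matrix) is a common alternative.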
In some embodiments, when labeling a content sample, an average of the highest confidences of the categories into which the content sample is clustered under the plurality of clustering methods may be determined; the first confidence in the label, for the category corresponding to the highest confidence (i.e., the confidence for the expected category of the content sample), is then set to this average, and the second confidence of each other category in the label, other than the category corresponding to the highest confidence, is set so that the sum of the first confidence and all the second confidences is 1, i.e., each second confidence is set to the difference between 1 and the average, divided by the number of other categories. In some embodiments, when labeling a content sample, the first confidence in the label for the category corresponding to the highest confidence (i.e., the confidence for the expected category) may instead be set to 1 (i.e., a confidence of 100%), and the second confidence of each other category set to 0. Of course, this is not limiting, as long as the category corresponding to the highest confidence is clearly distinguished from the remaining categories.
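Both labeling schemes admit a direct implementation; the sketch below assumes NumPy, and the function names are illustrative:

```python
import numpy as np

def soft_label(top_confidences, category, n_categories):
    """First confidence = average of the per-method highest confidences;
    the remaining mass is split evenly so the label sums to 1."""
    avg = float(np.mean(top_confidences))
    label = np.full(n_categories, (1.0 - avg) / (n_categories - 1))
    label[category] = avg
    return label

def one_hot_label(category, n_categories):
    """Alternative scheme: 1 for the expected category, 0 elsewhere."""
    label = np.zeros(n_categories)
    label[category] = 1.0
    return label
```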
At step 304, the content sample classifier is trained using the labeled content samples and the unlabeled content samples in the dataset to obtain a trained content sample classifier. As an example, the content sample classifier may be trained in a semi-supervised learning manner using the labeled content samples and unlabeled content samples. The clustering accuracy of the trained content sample classifier can be improved by using the labeled content samples, and the clustering generalization (namely, generalization capability) of the trained content sample classifier can be improved by using the unlabeled content samples.
In some embodiments, the content sample classifier may be trained by adjusting its parameters so that a total loss function is minimized when the classifier is trained with the labeled and unlabeled content samples in the training set. The total loss function is the sum of the loss functions for each content sample (both labeled and unlabeled content samples). As an example, the loss function for each labeled content sample constrains the confidence distribution of the categories that the classifier outputs for that sample to be close to the sample's label; the loss function for each unlabeled content sample constrains both the invariance of the confidence distribution the classifier outputs for that sample under a random transformation and the similarity of the output confidence distribution to a one-hot vector, where the confidence distribution of the categories includes a confidence for each of the plurality of categories. The specific total loss function may be as described in step 202 with reference to fig. 2.
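A hedged PyTorch sketch of such a total loss is given below. The disclosure defers the exact form to step 202 of fig. 2; the particular distance measures, the entropy term standing in for "similarity to a one-hot vector", and the equal weighting of the terms are all assumptions.

```python
import torch
import torch.nn.functional as F

def batch_total_loss(classifier, labeled_x, labels, unlabeled_x, transform):
    """Sum of per-sample losses: a supervised term for labeled samples,
    plus invariance and one-hot-similarity terms for unlabeled samples."""
    # Labeled: pull the predicted distribution toward the (soft) label.
    log_p = torch.log_softmax(classifier(labeled_x), dim=1)
    supervised = -(labels * log_p).sum(dim=1).mean()

    # Unlabeled: prediction should be invariant under a random transform...
    p = torch.softmax(classifier(unlabeled_x), dim=1)
    p_t = torch.softmax(classifier(transform(unlabeled_x)), dim=1)
    invariance = F.mse_loss(p_t, p.detach())

    # ...and close to a one-hot vector, i.e. low-entropy.
    one_hot_similarity = -(p * torch.log(p + 1e-8)).sum(dim=1).mean()

    return supervised + invariance + one_hot_similarity
```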
In some embodiments, when training the content sample classifier with the labeled and unlabeled content samples, the classifier may be trained in batches, selecting the same number of labeled and unlabeled content samples from the training set each time. For example, B "clean" samples and B "noise" samples may be sampled each time, B being a positive integer, and the loss functions of these sampled samples are summed as the total loss function for that training step. Using the same number of labeled and unlabeled content samples balances the clustering accuracy and generalization of the trained content sample classifier, giving it a better clustering effect; a sketch follows.
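A minimal balanced sampler in this spirit, assuming NumPy; the generator form and the default seed are illustrative choices:

```python
import numpy as np

def balanced_batches(clean_idx, noisy_idx, B, seed=0):
    """Endlessly yield (labeled, unlabeled) index pairs of equal size B,
    so every training step sees B 'clean' and B 'noise' samples."""
    rng = np.random.default_rng(seed)
    while True:
        yield (rng.choice(clean_idx, size=B, replace=False),
               rng.choice(noisy_idx, size=B, replace=False))
```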
As an example, the parameters of the content sample classifier may be adjusted by back propagation according to the total loss function described above. During back propagation, the learning rate for each parameter of the content sample classifier may be dynamically adjusted by computing a first moment estimate and a second moment estimate of the gradient of the loss function. Specific training parameters may be as described in step 202 with reference to fig. 2. Training the content sample classifier in a semi-supervised learning manner with the labeled and unlabeled content samples greatly improves the clustering accuracy and generalization of the trained classifier, improving the clustering effect, and allows the trained classifier to obtain the categories of content samples end-to-end.
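Adjusting per-parameter learning rates from first and second moment estimates of the gradient is the update rule of the Adam optimizer, so a PyTorch training step might look like the following sketch (the learning rate is an assumption):

```python
import torch

def train_step(classifier, optimizer, loss):
    """One back-propagation step on the total loss for the batch."""
    optimizer.zero_grad()
    loss.backward()    # back propagation
    optimizer.step()   # Adam: moment-estimate-based per-parameter update

# optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
```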
Embodiments of the present disclosure are not limited to a specific structure of the content sample classifier, which may be adapted to the type of content sample. For example, where the content samples to be clustered are image data samples or voice data samples, the structure of the content sample classifier may be a CNN (Convolutional Neural Network) or the like. Where the content samples to be clustered are text data samples, the structure of the content sample classifier may be an RNN (Recurrent Neural Network), an LSTM (Long Short-Term Memory) network, BERT (Bidirectional Encoder Representations from Transformers), or the like.
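For image data samples, for instance, the classifier could be as simple as the illustrative CNN below; the architecture is an assumption, not one fixed by this disclosure:

```python
import torch.nn as nn

class SmallCnnClassifier(nn.Module):
    """Toy CNN content sample classifier: outputs one logit per category,
    from which a confidence distribution is obtained via softmax."""
    def __init__(self, n_categories, in_channels=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, n_categories)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))
```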
In some embodiments, at least one of the plurality of clustering methods is the classifier-based clustering method described above. In this case, the "clean" samples can be updated so that they become more accurate, thereby improving the clustering accuracy of the finally trained content sample classifier.
FIG. 4 illustrates an exemplary flow chart of a method of updating the "clean" samples. As shown in fig. 4, in step 401, the labeled content samples and the unlabeled content samples may be used to train the classifier on which the clustering method is based. At step 402, the plurality of content samples may be clustered using the trained classifier-based clustering method to determine the updated confidence distribution of the categories into which each content sample is clustered under that method. At step 403, the labeled content samples are re-formed based on the updated confidence distribution and the confidence distributions of the categories into which each content sample is clustered under the other clustering methods of the plurality of clustering methods. Optionally, the update procedure may be repeated until the set of "clean" samples no longer changes, although this is not limiting.
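Structurally, the update loop of fig. 4 can be sketched as below; the callables are assumptions standing in for steps 401-403:

```python
def refine_clean_set(train, cluster, reselect, sample_ids, clean, max_rounds=10):
    """Repeat: train the classifier-based method (step 401), re-cluster
    (step 402), re-form the labeled set (step 403), until the 'clean'
    set stops changing or a round limit is hit."""
    for _ in range(max_rounds):
        unlabeled = [s for s in sample_ids if s not in clean]
        train(clean, unlabeled)                    # step 401
        updated_confidences = cluster(sample_ids)  # step 402
        new_clean = reselect(updated_confidences)  # step 403
        if set(new_clean) == set(clean):  # fixed point reached
            break
        clean = new_clean
    return clean
```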
At step 305, content samples to be clustered are clustered using the trained content sample classifier to determine the categories of the content samples to be clustered. In some embodiments, a content sample to be clustered may be input into the trained content sample classifier, so that the classifier outputs the confidence distribution of the sample over the plurality of categories; the category with the highest confidence in that distribution is then determined to be the category of the sample. Since the classifier's output is the confidence distribution over the plurality of categories, selecting the category corresponding to the highest confidence as the sample's category gives the highest clustering accuracy and implements a clustering process that directly yields the clustered category.
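End-to-end inference then reduces to a softmax followed by an argmax; a PyTorch sketch with illustrative names:

```python
import torch

def predict_category(classifier, x):
    """Return the highest-confidence category and the full confidence
    distribution for each content sample to be clustered."""
    with torch.no_grad():
        confidences = torch.softmax(classifier(x), dim=1)
    return confidences.argmax(dim=1), confidences
```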
In the method for clustering content samples described in the embodiments of the present disclosure, all content samples are accurately divided into a set of "clean" samples and a set of "noise" samples by fully utilizing the clustering results of the content samples in the dataset under a plurality of clustering methods. The "clean" samples are then labeled to form labeled content samples. After this division, the content sample classifier can be trained with the labeled content samples and the unlabeled content samples, which greatly improves the clustering accuracy and generalization of the trained classifier, and the categories of content samples to be clustered can be obtained end-to-end through the trained classifier, improving the clustering effect.
Fig. 5 illustrates an exemplary schematic diagram of clustering content samples according to one embodiment of the present disclosure. As shown in fig. 5, the unlabeled plurality of content samples in the dataset are clustered using two clustering methods (a first clustering method and a second clustering method), respectively, to obtain a clustering result comprising, for each of the plurality of content samples, the category corresponding to the highest confidence into which that sample is clustered under each clustering method. The clustering results obtained with the first and second clustering methods are then matched to divide the content samples in the dataset into "clean" samples and "noise" samples, and the "clean" samples are labeled to form labeled content samples. The label indicates the confidence of the labeled content sample for each of a plurality of categories corresponding to the output of a content sample classifier. During the division, each content sample is determined to be a "clean" sample if two conditions are met, namely (a) the categories corresponding to the highest confidences into which the sample is clustered under the clustering methods are all the same, and (b) all of those highest confidences are greater than a confidence threshold; otherwise, the content sample is a "noise" sample. All unlabeled content samples and the labeled content samples together form a training set used to train the content sample classifier in a semi-supervised learning manner, yielding a trained content sample classifier. Content samples to be clustered may then be input into the trained content sample classifier, so that the classifier outputs the confidence distribution of each such sample over the plurality of categories; the category with the highest confidence in that distribution is determined to be the category of the sample, thereby obtaining the categories of the content samples to be clustered end-to-end. The content samples to be clustered may be content samples from the unlabeled plurality of content samples, or may be content samples of the same type (e.g., image data samples, text data samples, voice data samples, etc.) as the plurality of content samples.
Fig. 6 illustrates an exemplary block diagram of an apparatus 600 for determining a training set according to one embodiment of the present disclosure. As shown in fig. 6, the apparatus 600 includes a data set acquisition module 601, a sample clustering module 602, a sample labeling module 603, and a training set determination module 604.
The data set acquisition module 601 is configured to acquire a data set comprising a plurality of unlabeled content samples. Each of the plurality of content samples may be, but is not limited to, an image data sample, a text data sample, or a voice data sample, and in fact the type of content sample is not limited herein.
The sample clustering module 602 is configured to cluster the plurality of content samples using each of a plurality of mutually different clustering methods to determine the category corresponding to the highest confidence in the confidence distribution of the categories into which each content sample is clustered under each clustering method. The plurality of clustering methods are different from each other, and each may be any suitable clustering method, such as K-means clustering, DEC, IDEC, DCEC, an artificial-intelligence classifier-based clustering method, etc., although this is not limiting.
The sample tagging module 603 is configured to, for each content sample of the plurality of content samples, in response to determining that the categories corresponding to the highest confidences into which the content sample is clustered under the plurality of clustering methods are all the same and that all of the highest confidences are greater than a confidence threshold, label the content sample to form a labeled content sample, wherein the label indicates the confidence of the labeled content sample for each of a plurality of categories corresponding to the output of the content sample classifier.
The training set determination module 604 is configured to determine the labeled content samples and the unlabeled content samples in the dataset together as a training set for the content sample classifier. The labeled content samples are content samples labeled by the labeling module, and the unlabeled content samples are the rest content samples without labels in the data set. The labeled content samples and unlabeled content samples in the dataset are together determined as a training set for the content sample classifier to enable training of the content sample classifier.
Fig. 7 illustrates an exemplary block diagram of an apparatus 700 for training a content sample classifier according to one embodiment of the present disclosure. As shown in fig. 7, the apparatus 700 includes a training set acquisition module 701 and a classifier training module 702.
The training set acquisition module 701 is configured to acquire a training set for the content sample classifier, wherein the training set is determined by the apparatus 600 for determining a training set described with reference to fig. 6 and comprises labeled content samples and unlabeled content samples. The content sample classifier is used for classifying the content samples to obtain categories of the content samples. Each of the tagged content samples and untagged content samples may be an image data sample, a text data sample, or a voice data sample, and the type of content sample is not limited herein.
Classifier training module 702 is configured to train the content sample classifier with the labeled content samples and unlabeled content samples in a training set. The clustering accuracy of the trained content sample classifier can be improved by using the labeled content samples, and the clustering generalization (namely, generalization capability) of the trained content sample classifier can be improved by using the unlabeled content samples.
Fig. 8 illustrates an exemplary block diagram of an apparatus 800 for clustering content samples according to one embodiment of the present disclosure. As shown in fig. 8, the apparatus 800 includes an acquisition module 801, a clustering module 802, a labeling module 803, a training module 804, and a determination module 805.
The acquisition module 801 is configured to acquire a dataset comprising a plurality of unlabeled content samples. Each of the plurality of content samples may be, but is not limited to, an image data sample, a text data sample, or a voice data sample, and in fact the type of content sample is not limited herein.
The clustering module 802 is configured to cluster the plurality of content samples using each of a plurality of mutually different clustering methods to determine the category corresponding to the highest confidence in the confidence distribution of the categories into which each content sample is clustered under each clustering method. The plurality of clustering methods are different from each other, and each may be any suitable clustering method, such as K-means clustering, DEC, IDEC, DCEC, an artificial-intelligence classifier-based clustering method, etc., although this is not limiting.
The tagging module 803 is configured to, for each content sample of the plurality of content samples, in response to determining that the categories corresponding to the highest confidences into which the content sample is clustered under the plurality of clustering methods are all the same and that all of the highest confidences are greater than a confidence threshold, label the content sample to form a labeled content sample, wherein the label indicates the confidence of the labeled content sample for each of a plurality of categories corresponding to the output of a content sample classifier.
The training module 804 is configured to train the content sample classifier with the labeled content samples and the unlabeled content samples in the dataset to obtain a trained content sample classifier. As an example, the training module 804 is configured to train the content sample classifier in a semi-supervised learning manner using the labeled content samples and unlabeled content samples. The clustering accuracy of the trained content sample classifier can be improved by using the labeled content samples, and the clustering generalization (namely, generalization capability) of the trained content sample classifier can be improved by using the unlabeled content samples.
The determination module 805 is configured to cluster content samples to be clustered using the trained content sample classifier to determine the categories of the content samples to be clustered. The content sample classifier outputs a confidence distribution of each content sample over a plurality of categories; in some embodiments, the category with the highest confidence may be selected from this distribution as the category of the content sample, which gives the highest clustering accuracy and enables a clustering process that directly yields the clustered category.
FIG. 9 illustrates an example system 900 that includes an example computing device 910 that represents one or more systems and/or devices that can implement the various techniques described herein. Computing device 910 may be, for example, a server of a service provider, a device associated with a server, a system-on-a-chip, and/or any other suitable computing device or computing system. The means 600 for determining a training set described above with reference to fig. 6, the means 700 for training a content sample classifier described with reference to fig. 7, and the means 800 for clustering content samples described with reference to fig. 8 may all take the form of a computing device 910. Alternatively, each of the means 600 for determining a training set, the means 700 for training a content sample classifier, and the means 800 for clustering content samples may be implemented as a computer program in the form of an application 916.
The example computing device 910 as illustrated includes a processing system 911, one or more computer-readable media 912, and one or more I/O interfaces 913 communicatively coupled to each other. Although not shown, the computing device 910 may also include a system bus or other data and command transfer system that couples the various components to one another. A system bus may include any one or combination of different bus structures, such as a memory bus or memory controller, a peripheral bus, a universal serial bus, and/or a processor or local bus that utilizes any of a variety of bus architectures. Various other examples are also contemplated, such as control and data lines.
The processing system 911 is representative of functionality that performs one or more operations using hardware. Thus, the processing system 911 is illustrated as including hardware elements 914 that may be configured as processors, functional blocks, and the like. This may include implementation in hardware as application-specific integrated circuits or other logic devices formed using one or more semiconductors. The hardware elements 914 are not limited by the material from which they are formed or the processing mechanism employed therein. For example, a processor may be comprised of semiconductor(s) and/or transistors (e.g., electronic integrated circuits (ICs)). In such a context, processor-executable instructions may be electronically-executable instructions.
The computer-readable medium 912 is illustrated as including a memory/storage 915. Memory/storage 915 represents memory/storage capacity associated with one or more computer-readable media. The memory/storage 915 may include volatile media (such as Random Access Memory (RAM)) and/or nonvolatile media (such as Read Only Memory (ROM), flash memory, optical disks, magnetic disks, and so forth). The memory/storage 915 may include fixed media (e.g., RAM, ROM, a fixed hard drive, etc.) and removable media (e.g., flash memory, a removable hard drive, an optical disk, and so forth). The computer-readable medium 912 may be configured in a variety of other ways as described further below.
One or more I/O interfaces 913 represent functionality that allows a user to input commands and information to computing device 910 using various input devices, and optionally also allows information to be presented to the user and/or other components or devices using various output devices. Examples of input devices include keyboards, cursor control devices (e.g., mice), microphones (e.g., for voice input), scanners, touch functions (e.g., capacitive or other sensors configured to detect physical touches), cameras (e.g., motion that does not involve touches may be detected as gestures using visible or invisible wavelengths such as infrared frequencies), and so forth. Examples of output devices include a display device (e.g., a display or projector), speakers, a printer, a network card, a haptic response device, and so forth. Accordingly, the computing device 910 may be configured in a variety of ways as described further below to support user interaction.
Computing device 910 also includes application 916. The application 916 may be, for example, a software instance of the apparatus 600 for determining a training set, the apparatus 700 for training a content sample classifier, and the apparatus 800 for clustering content samples, and implement the techniques described herein in combination with other elements in the computing device 910.
Various techniques may be described herein in the general context of software, hardware elements, or program modules. Generally, these modules include routines, programs, objects, elements, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The terms "module," "functionality," and "component" as used herein generally represent software, firmware, hardware, or a combination thereof. The features of the techniques described herein are platform-independent, meaning that the techniques may be implemented on a variety of computing platforms having a variety of processors.
As previously described, the hardware elements 914 and computer-readable media 912 represent instructions, modules, programmable device logic, and/or fixed device logic implemented in hardware that may be used in some embodiments to implement at least some aspects of the techniques described herein. The hardware elements may include integrated circuits or components of a system on a chip, application Specific Integrated Circuits (ASICs), field Programmable Gate Arrays (FPGAs), complex Programmable Logic Devices (CPLDs), and other implementations in silicon or other hardware devices. In this context, the hardware elements may be implemented as processing devices that perform program tasks defined by instructions, modules, and/or logic embodied by the hardware elements, as well as hardware devices that store instructions for execution, such as the previously described computer-readable storage media.
Combinations of the foregoing may also be used to implement the various techniques and modules described herein. Thus, software, hardware, or program modules, and other program modules may be implemented as one or more instructions and/or logic embodied on some form of computer-readable storage medium and/or by one or more hardware elements 914. Computing device 910 may be configured to implement particular instructions and/or functions corresponding to software and/or hardware modules.
In various implementations, the computing device 910 may take on a variety of different configurations. For example, the computing device 910 may be implemented as a computer-like device including a personal computer, desktop computer, multi-screen computer, laptop computer, netbook, and the like. Computing device 910 may also be implemented as a mobile apparatus-like device including a mobile device such as a mobile phone, portable music player, portable gaming device, tablet computer, multi-screen computer, or the like. Computing device 910 may also be implemented as a television-like device that includes devices having or connected to generally larger screens in casual viewing environments. Such devices include televisions, set-top boxes, gaming machines, and the like.
The techniques described herein may be supported by these various configurations of computing device 910 and are not limited to the specific examples of techniques described herein. The functionality may also be implemented in whole or in part on the "cloud" 920 through the use of a distributed system, such as through platform 922 as described below.
Cloud 920 includes and/or represents a platform 922 for resources 924. Platform 922 abstracts underlying functionality of hardware (e.g., servers) and software resources of cloud 920. Resources 924 may include other applications and/or data that may be used when executing computer processing on a server remote from computing device 910. Resources 924 may also include services provided over the internet and/or over subscriber networks such as cellular or Wi-Fi networks.
It should be understood that for clarity, embodiments of the present disclosure have been described with reference to different functional units. However, it will be apparent that the functionality of each functional unit may be implemented in a single unit, in a plurality of units or as part of other functional units without departing from the present disclosure. For example, functionality illustrated to be performed by a single unit may be performed by multiple different units. Thus, references to specific functional units are only to be seen as references to suitable units for providing the described functionality rather than indicative of a strict logical or physical structure or organization. Thus, the present disclosure may be implemented in a single unit or may be physically and functionally distributed between different units and circuits.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various devices, elements, components or sections, these devices, elements, components or sections should not be limited by these terms. These terms are only used to distinguish one device, element, component, or section from another device, element, component, or section.
Although the present disclosure has been described in connection with some embodiments, it is not intended to be limited to the specific form set forth herein. Rather, the scope of the present disclosure is limited only by the appended claims. Additionally, although individual features may be included in different claims, these may possibly be advantageously combined, and the inclusion in different claims does not imply that a combination of features is not feasible and/or advantageous. The order of features in the claims does not imply any specific order in which the features must be worked. Furthermore, in the claims, the word "comprising" does not exclude other elements, and the term "a" or "an" does not exclude a plurality. Reference signs in the claims are provided merely as a clarifying example and shall not be construed as limiting the scope of the claims in any way.

Claims (14)

1. A method of clustering content samples, comprising:
Obtaining a dataset comprising a plurality of unlabeled content samples;
clustering the plurality of content samples by using each of a plurality of clustering methods different from each other to determine a category corresponding to a highest confidence in a confidence distribution of the categories obtained by clustering each content sample under each clustering method;
in response to the category corresponding to the highest confidence into which each content sample is clustered by the first clustering method of every two clustering methods of the plurality of clustering methods being, among all categories clustered by the first clustering method, the category having the largest number of identical content samples with the category corresponding to the highest confidence into which the content sample is clustered by the second clustering method of the every two clustering methods, determining that the categories corresponding to the highest confidences into which the content sample is clustered under the plurality of clustering methods are all the same;
for each content sample of the plurality of content samples, in response to determining that the categories corresponding to the highest confidences into which the content sample is clustered under the plurality of clustering methods are all the same and that all of the highest confidences are greater than a confidence threshold, labeling the content sample to form a labeled content sample, wherein the label indicates a confidence of the labeled content sample for each of a plurality of categories corresponding to an output of a content sample classifier;
training the content sample classifier with the labeled content samples and the unlabeled content samples in the dataset to obtain a trained content sample classifier;
clustering content samples to be clustered by using a trained content sample classifier to determine categories of the content samples to be clustered.
2. The method of claim 1, wherein labeling each content sample comprises:
determining an average of the highest confidences of the categories into which each content sample is clustered under the plurality of clustering methods, marking the first confidence in the label, for the category corresponding to the highest confidence, as the average, and marking the second confidence of each of the other categories in the label, except the category corresponding to the highest confidence, such that the sum of the first confidence and all the second confidences is 1.
3. The method of claim 1, wherein labeling each content sample comprises:
marking the first confidence of the category corresponding to the highest confidence in the label as 1, and
marking the second confidence of each of the other categories in the label, except the category corresponding to the highest confidence, as 0, respectively.
4. The method of claim 1, wherein at least one of the plurality of clustering methods comprises a classifier-based clustering method, and wherein prior to the training the content sample classifier with the labeled content samples and unlabeled content samples in the dataset, the method further comprises:
training a classifier on which the clustering method is based by using the labeled content samples and the unlabeled content samples;
clustering the plurality of content samples by using a clustering method based on a trained classifier to determine a confidence distribution of an updated category obtained by clustering each content sample under the clustering method based on the trained classifier;
re-forming the labeled content samples based on the updated confidence distribution of the categories and the confidence distributions of the categories into which each content sample is clustered under the other clustering methods of the plurality of clustering methods.
5. The method of claim 1, wherein at least one of the plurality of clustering methods is a K-means clustering method, and wherein clustering the plurality of content samples using the K-means clustering method further comprises:
extracting feature data of a plurality of dimensions for each of the plurality of content samples;
Performing dimension reduction on the feature data of the multiple dimensions;
and clustering the plurality of content samples based on the feature data of the plurality of content samples after dimension reduction.
6. The method of claim 1, wherein training the content sample classifier with labeled content samples and unlabeled content samples in the dataset comprises:
training the content sample classifier by adjusting parameters of the content sample classifier such that a total loss function is minimized, wherein the total loss function is a sum of loss functions for each content sample, and
wherein the loss function for each labeled content sample is used to constrain the confidence distribution of the categories of each labeled content sample output by the content sample classifier to be close to the label of the labeled content sample; and
wherein the loss function for each unlabeled content sample is used to constrain the invariance of the confidence distribution that the content sample classifier outputs for each unlabeled content sample under a random transformation and the similarity of the output confidence distribution of the categories to a one-hot vector, wherein the confidence distribution of the categories includes a confidence for each of the plurality of categories.
7. The method of claim 6, wherein training the content sample classifier by adjusting parameters of the content sample classifier such that a total loss function is minimized comprises:
the content sample classifier is trained in a back-propagation manner, wherein a learning rate for each parameter of the content sample classifier is dynamically adjusted by computing a first moment estimate and a second moment estimate of a gradient of a loss function in back-propagation.
8. The method of claim 1, wherein training the content sample classifier with labeled content samples and unlabeled content samples in the dataset comprises:
the classifier is trained in batches, with the same number of labeled content samples and unlabeled content samples being selected each time.
9. The method of claim 1, wherein clustering content samples to be clustered using a trained content sample classifier to determine a category of the content samples to be clustered comprises:
inputting the content samples to be clustered into a trained content sample classifier, so that the content sample classifier outputs confidence distribution of the content samples to be clustered, which corresponds to the plurality of categories;
the category with the highest confidence in the confidence distribution is determined as the category of the content sample to be clustered.
10. The method of claim 1, wherein each content sample of the plurality of content samples comprises an image data sample and the structure of the content sample classifier comprises a convolutional neural network.
11. The method of claim 1, wherein each of the plurality of content samples comprises one of a text data sample and a speech data sample, and the structure of the content sample classifier comprises one of a recurrent neural network, a long short-term memory network, and Bidirectional Encoder Representations from Transformers.
12. An apparatus for clustering content samples, comprising:
An acquisition module configured to acquire a dataset comprising a plurality of unlabeled content samples;
A clustering module configured to cluster the plurality of content samples using each of a plurality of clustering methods different from each other, so as to determine a category corresponding to a highest confidence in a confidence distribution of a category obtained by clustering each content sample under each of the clustering methods;
A marking module configured to: in response to the category corresponding to the highest confidence into which each content sample is clustered by the first clustering method of every two clustering methods of the plurality of clustering methods being, among all categories clustered by the first clustering method, the category having the largest number of identical content samples with the category corresponding to the highest confidence into which the content sample is clustered by the second clustering method of the every two clustering methods, determine that the categories corresponding to the highest confidences into which the content sample is clustered under the plurality of clustering methods are all the same; and, for each content sample of the plurality of content samples, in response to determining that the categories corresponding to the highest confidences into which the content sample is clustered under the plurality of clustering methods are all the same and that all of the highest confidences are greater than a confidence threshold, label the content sample to form a labeled content sample, wherein the label indicates a confidence of the labeled content sample for each of a plurality of categories corresponding to an output of a content sample classifier;
a training module configured to train the content sample classifier with the labeled content samples and the unlabeled content samples in the dataset to obtain a trained content sample classifier;
A determination module configured to cluster content samples to be clustered using a trained content sample classifier to determine a category of the content samples to be clustered.
13. A computing device, comprising
A memory configured to store computer-executable instructions;
A processor configured to perform the method of any of claims 1-11 when the computer executable instructions are executed by the processor.
14. A computer readable storage medium storing computer executable instructions which, when executed, perform the method of any one of claims 1-11.
CN202010824726.2A 2020-08-17 2020-08-17 Method and device for clustering content samples Active CN111898704B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010824726.2A CN111898704B (en) 2020-08-17 2020-08-17 Method and device for clustering content samples

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010824726.2A CN111898704B (en) 2020-08-17 2020-08-17 Method and device for clustering content samples

Publications (2)

Publication Number Publication Date
CN111898704A CN111898704A (en) 2020-11-06
CN111898704B true CN111898704B (en) 2024-05-10

Family

ID=73230390

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010824726.2A Active CN111898704B (en) 2020-08-17 2020-08-17 Method and device for clustering content samples

Country Status (1)

Country Link
CN (1) CN111898704B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112329883A (en) * 2020-11-25 2021-02-05 Oppo广东移动通信有限公司 Model training system, method, device and storage medium
CN113361563B (en) * 2021-04-22 2022-11-25 重庆大学 Parkinson's disease voice data classification system based on sample and feature double transformation
CN114299194B (en) * 2021-12-23 2023-06-02 北京百度网讯科技有限公司 Training method of image generation model, image generation method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104156438A (en) * 2014-08-12 2014-11-19 德州学院 Unlabeled sample selection method based on confidence coefficients and clustering
CN109543756A (en) * 2018-11-26 2019-03-29 重庆邮电大学 A kind of tag queries based on Active Learning and change method
CN109800744A (en) * 2019-03-18 2019-05-24 深圳市商汤科技有限公司 Image clustering method and device, electronic equipment and storage medium
CN110110802A (en) * 2019-05-14 2019-08-09 南京林业大学 Airborne laser point cloud classification method based on high-order condition random field
CN110263804A (en) * 2019-05-06 2019-09-20 杭州电子科技大学 A kind of medical image dividing method based on safe semi-supervised clustering
CN111209929A (en) * 2019-12-19 2020-05-29 平安信托有限责任公司 Access data processing method and device, computer equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130097103A1 (en) * 2011-10-14 2013-04-18 International Business Machines Corporation Techniques for Generating Balanced and Class-Independent Training Data From Unlabeled Data Set
US10614373B1 (en) * 2013-12-23 2020-04-07 Groupon, Inc. Processing dynamic data within an adaptive oracle-trained learning system using curated training data for incremental re-training of a predictive model
US11258813B2 (en) * 2019-06-27 2022-02-22 Intel Corporation Systems and methods to fingerprint and classify application behaviors using telemetry

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A novel multi-clustering method for hierarchical clusterings based on boosting; Elaheh Rashedi; 2011 19th Iranian Conference on Electrical Engineering; full text *
A self-training method using clustering information of unlabeled samples; Liu Weitao; Xu Xinshun; Application Research of Computers (09); 147-150 *
Multi-label classification algorithm based on label clustering; Shen Chaobo; Software; Vol. 35, No. 8; 16-21 *

Also Published As

Publication number Publication date
CN111898704A (en) 2020-11-06

Similar Documents

Publication Publication Date Title
CN111897908B (en) Event extraction method and system integrating dependency information and pre-training language model
CN107688821B (en) Cross-modal image natural language description method based on visual saliency and semantic attributes
CN110059217B (en) Image text cross-media retrieval method for two-stage network
CN109598231B (en) Video watermark identification method, device, equipment and storage medium
CN112084327B (en) Classification of sparsely labeled text documents while preserving semantics
CN111898704B (en) Method and device for clustering content samples
Qu et al. Joint hierarchical category structure learning and large-scale image classification
CN110069709B (en) Intention recognition method, device, computer readable medium and electronic equipment
CN113657425B (en) Multi-label image classification method based on multi-scale and cross-modal attention mechanism
CN111324769A (en) Training method of video information processing model, video information processing method and device
CN113591483A (en) Document-level event argument extraction method based on sequence labeling
CN112749274B (en) Chinese text classification method based on attention mechanism and interference word deletion
CN110019795B (en) Sensitive word detection model training method and system
Rabby et al. Bangla handwritten digit recognition using convolutional neural network
CN116610803B (en) Industrial chain excellent enterprise information management method and system based on big data
CN113987187A (en) Multi-label embedding-based public opinion text classification method, system, terminal and medium
Qin et al. Machine learning basics
CN115131613A (en) Small sample image classification method based on multidirectional knowledge migration
CN114328934B (en) Attention mechanism-based multi-label text classification method and system
CN110941958A (en) Text category labeling method and device, electronic equipment and storage medium
Dong et al. A supervised dictionary learning and discriminative weighting model for action recognition
CN117216617A (en) Text classification model training method, device, computer equipment and storage medium
CN113592045B (en) Model adaptive text recognition method and system from printed form to handwritten form
CN115098681A (en) Open service intention detection method based on supervised contrast learning
CN110532384B (en) Multi-task dictionary list classification method, system, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant