CN111598153B - Data clustering processing method and device, computer equipment and storage medium

Info

Publication number: CN111598153B (application CN202010400391.1A; also published as CN111598153A)
Authority: CN (China)
Prior art keywords: sample, data, prior distribution, clustering, class
Legal status: Active (granted)
Inventors: 卢东焕, 赵俊杰, 马锴, 郑冶枫
Current and original assignee: Tencent Technology Shenzhen Co Ltd
Other languages: Chinese (zh)

Classifications

    • G06F18/232 Clustering techniques; non-hierarchical techniques (Physics; Computing; Electric digital data processing; Pattern recognition; Analysing)
    • G06F18/214 Generating training patterns; bootstrap methods, e.g. bagging or boosting (Design or setup of recognition systems or techniques; Extraction of features in feature space)
    • G06F18/22 Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a data clustering processing method and device, computer equipment and a storage medium in the field of artificial intelligence. The method comprises the following steps: acquiring a data sample, the data sample being a sample of clustering objects in a clustering service; mapping the data sample into sample features through a clustering model, the sample features comprising sample class features and sample intra-class style features; determining a correlation between the data sample and the sample features; determining a score value with which the sample features obey a prior distribution, the prior distribution comprising a category prior distribution corresponding to the sample class features and an intra-class style prior distribution corresponding to the sample intra-class style features; adjusting the clustering model at least according to the correlation and the score value; and clustering the data to be clustered in the clustering service by using the adjusted clustering model. By adopting the method, the accuracy of data clustering can be effectively improved without manual labeling.

Description

Data clustering processing method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for processing data clusters, a computer device, and a storage medium.
Background
Artificial Intelligence (AI) is a comprehensive discipline covering a wide range of fields and involving both hardware and software technologies. One of the important directions of artificial intelligence software technology is machine learning, and clustering analysis is a common machine learning technique. Data types such as images, text and speech can all serve as clustering objects. Through clustering, similar objects can be grouped into the same category and dissimilar objects into different categories.
In a conventional approach, the label features of data samples are learned and taken as the clustering result. However, manually annotating the massive data on the Internet would consume a large amount of human resources. Therefore, how to accurately complete data clustering without manual labeling is a technical problem to be solved.
Disclosure of Invention
In view of the above, it is necessary to provide a data clustering processing method, apparatus, computer device and storage medium capable of accurately completing data clustering without manual labeling.
A method of processing data clusters, the method comprising:
acquiring a data sample; the data samples are samples of clustering objects in clustering services;
mapping the data samples into sample features through a clustering model; the sample features comprise sample class features and sample intra-class style features;
determining a correlation of the data sample and the sample features;
determining a score value for the sample characteristic subject to a prior distribution; the prior distribution comprises category prior distribution corresponding to the sample category characteristics and intra-class style prior distribution corresponding to the sample intra-class style characteristics;
adjusting the clustering model based at least on the relevance and the score value;
and clustering the data to be clustered in the clustering service by using the adjusted clustering model.
A processing apparatus for data clustering, the apparatus comprising:
the first acquisition module is used for acquiring a data sample; the data samples are samples of clustering objects in clustering services;
the characteristic mapping module is used for mapping the data sample into a sample characteristic through a clustering model; the sample features comprise sample class features and sample intra-class style features;
a correlation identification module for determining a correlation of the data sample and the sample features;
a prior distribution scoring module for determining a score value at which the sample feature obeys a prior distribution; the prior distribution comprises category prior distribution corresponding to the sample category characteristics and intra-class style prior distribution corresponding to the sample intra-class style characteristics;
a cluster training module for adjusting the cluster model at least according to the relevance and the score value; and clustering the data to be clustered in the clustering service by using the adjusted clustering model.
A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor when executing the computer program performs the steps of:
acquiring a data sample; the data samples are samples of clustering objects in clustering services;
mapping the data samples into sample features through a clustering model; the sample features comprise sample class features and sample intra-class style features;
determining a correlation of the data sample and the sample features;
determining a score value for which the sample feature obeys a prior distribution; the prior distribution comprises category prior distribution corresponding to the sample category characteristics and intra-class style prior distribution corresponding to the sample intra-class style characteristics;
adjusting the clustering model based at least on the relevance and the score value;
and clustering the data to be clustered in the clustering service by using the adjusted clustering model.
A computer-readable storage medium storing a computer program, wherein the computer program when executed by a processor performs the steps of:
acquiring a data sample; the data samples are samples of clustering objects in clustering services;
mapping the data samples into sample features through a clustering model; the sample features comprise sample class features and sample intra-class style features;
determining a correlation of the data sample and the sample features;
determining a score value for which the sample feature obeys a prior distribution; the prior distribution comprises category prior distribution corresponding to the sample category characteristics and intra-class style prior distribution corresponding to the sample intra-class style characteristics;
adjusting the clustering model based at least on the relevance and the score value;
and clustering the data to be clustered in the clustering service by using the adjusted clustering model.
According to the above data clustering processing method and device, computer equipment and storage medium, for the data samples of the clustering objects in the clustering service, no additional clustering algorithm needs to be executed and no realistic image needs to be generated for comparison with the original image. By determining the correlation between the data samples and the sample features, introducing the category prior distribution for the sample class features and the intra-class style prior distribution for the sample intra-class style features, and determining the score value of the sample features obeying the prior distribution, the clustering model is trained using the correlation and the score, which can effectively improve the clustering model's learning of the sample features. The feature distribution learned by the clustering model approaches the prior distribution, and the sample class features and the sample intra-class style features are effectively decoupled, so that the adjusted clustering model can quickly and accurately obtain the corresponding cluster category according to the class features of the data to be clustered. Therefore, the accuracy of data clustering is effectively improved without manual labeling.
A process of data clustering, the method comprising:
acquiring data to be clustered in clustering services;
encoding the data to be clustered into data characteristics through an encoder; the encoder is obtained by training at least according to the relevance and the score value; the correlation is the result of carrying out correlation discrimination on the data sample and the sample characteristics through a discriminator on the sample characteristics obtained by encoding the data sample by a data encoder; the scoring value is a scoring result of subjecting the sample characteristics to prior distribution through an evaluator; the sample features comprise sample class features and sample intra-class style features; the prior distribution comprises category prior distribution corresponding to the sample category characteristics and intra-class style prior distribution corresponding to the sample intra-class style characteristics;
and clustering corresponding data to be clustered according to the category characteristics in the data characteristics.
A processing apparatus for data clustering, the apparatus comprising:
the second acquisition module is used for acquiring data to be clustered in the clustering service;
the characteristic coding module is used for coding the data to be clustered into data characteristics through a coder; the encoder is trained according to at least the relevance and the score value; the correlation is the result of carrying out correlation discrimination on the data sample and the sample characteristics through a discriminator on the sample characteristics obtained by encoding the data sample by a data encoder; the scoring value is a scoring result of subjecting the sample characteristics to prior distribution through an evaluator; the sample features comprise sample class features and sample intra-class style features; the prior distribution comprises category prior distribution corresponding to the sample category characteristics and intra-class style prior distribution corresponding to the sample intra-class style characteristics;
and the clustering module is used for clustering corresponding data to be clustered according to the category characteristics in the data characteristics.
A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor when executing the computer program implements the steps of:
acquiring data to be clustered in clustering services;
encoding the data to be clustered into data characteristics through an encoder; the encoder is trained according to at least the relevance and the score value; the correlation is a result of carrying out correlation discrimination on the data sample and the sample characteristics through a discriminator on the sample characteristics obtained by encoding the data sample by a data encoder; the scoring value is a scoring result of the sample characteristics subjected to prior distribution through an evaluator; the sample features comprise sample class features and sample intra-class style features; the prior distribution comprises category prior distribution corresponding to the sample category characteristics and intra-class style prior distribution corresponding to the sample intra-class style characteristics;
and clustering the corresponding data to be clustered according to the category characteristics in the data characteristics.
A computer-readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the steps of:
acquiring data to be clustered in clustering services;
encoding the data to be clustered into data characteristics through an encoder; the encoder is obtained by training at least according to the relevance and the score value; the correlation is the result of carrying out correlation discrimination on the data sample and the sample characteristics through a discriminator on the sample characteristics obtained by encoding the data sample by a data encoder; the scoring value is a scoring result of the sample characteristics subjected to prior distribution through an evaluator; the sample features comprise sample class features and sample intra-class style features; the prior distribution comprises category prior distribution corresponding to the sample category characteristics and intra-class style prior distribution corresponding to the sample intra-class style characteristics;
and clustering corresponding data to be clustered according to the category characteristics in the data characteristics.
According to the above data clustering processing method and device, computer equipment and storage medium, by determining the correlation between the data samples and the sample features, introducing the category prior distribution for the sample class features and the intra-class style prior distribution for the sample intra-class style features, and determining the score value of the sample features obeying the prior distribution, the encoder is trained using the correlation and the score, which can effectively improve the encoder's learning of the sample features. The feature distribution learned by the encoder approaches the prior distribution, and the sample class features and the sample intra-class style features are effectively decoupled, so that the cluster category corresponding to the data to be clustered can be obtained according to the class features in the data features. Therefore, the accuracy of data clustering is effectively improved without manual labeling.
Drawings
FIG. 1 is a diagram of an exemplary environment in which a method for clustering data may be implemented;
FIG. 2 is a flow diagram illustrating a method for processing data clusters in one embodiment;
FIG. 3 is a diagram illustrating an overall network structure for training a clustering model according to an embodiment;
FIG. 4 is a schematic diagram of a self-encoding network architecture in one embodiment;
FIG. 5 is a schematic diagram of a network architecture of a countermeasure network in one embodiment;
FIG. 6-1 is a schematic diagram showing non-overlapping clusters in the t-SNE graph in one embodiment;
FIG. 6-2 is a diagram illustrating an example of overlap of cluster occurrences in a t-SNE graph;
FIG. 7 is a flow chart illustrating a method for processing data clusters in another embodiment;
FIG. 8 is a block diagram showing the structure of a processing means for clustering data in one embodiment;
FIG. 9 is a block diagram showing the structure of a data clustering processing apparatus in another embodiment;
FIG. 10 is a block diagram showing the construction of a processing means for data clustering according to still another embodiment;
FIG. 11 is a diagram illustrating an internal structure of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The data clustering processing method provided by this application can be applied in the application environment shown in fig. 1, in which the terminal 102 communicates with the server 104 via a network. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer or portable wearable device, and the server 104 may be an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server providing basic cloud computing services such as cloud databases, cloud storage, cloud communication, and big data and artificial intelligence platforms. A large number of data samples are stored in the server 104; the data samples are samples of clustering objects in a clustering service. The terminal 102 sends a sample acquisition request to the server 104, and the server 104 returns data samples for clustering model training to the terminal 102 according to the request. The terminal 102 maps the data samples to sample features through a clustering model and determines the correlation between the data samples and the sample features; the sample features include sample class features and sample intra-class style features. It then determines the score value with which the sample features obey a prior distribution, where the prior distribution includes a category prior distribution corresponding to the sample class features and an intra-class style prior distribution corresponding to the sample intra-class style features. The terminal 102 adjusts the clustering model at least according to the correlation and the score value, and clusters the data to be clustered in the clustering service using the adjusted clustering model. Therefore, the accuracy of data clustering is effectively improved without manual labeling.
In one embodiment, as shown in fig. 2, a processing method for data clustering is provided, which is described by taking the method as an example of being applied to a computer device (terminal or server) in fig. 1, and includes the following steps:
step 202, obtaining a data sample; the data samples are samples of clustered objects in a clustering service.
Step 204, mapping the data sample into a sample characteristic through a clustering model; the sample features include sample class features and intra-sample class style features.
A computer device obtains a set of data samples. The data samples may be of data types such as image, text or speech. A clustering model is established in advance in the computer device; the clustering model may be an encoder. Different neural networks are used by the clustering model for data samples of different data types. For example, when the data sample is an image, the clustering model may employ a convolutional neural network whose convolution blocks can be adjusted according to the image size: the larger the image, the more convolution blocks are used. For example, 2 convolution blocks may be used for a 32×32 image and 4 convolution blocks for a 96×96 image. When the data samples are text or speech, the clustering model may use a neural network such as LSTM (Long Short-Term Memory) or BERT (Bidirectional Encoder Representations from Transformers). The computer device inputs the data samples into the clustering model, which maps the data samples to corresponding sample features. The sample features include sample class features and sample intra-class style features: the elements in the sample class feature are the probabilities that the data sample belongs to each cluster category, and the sample intra-class style feature describes the intra-class style information of the data sample. The computer device may train the clustering model together with other network models: a discriminator may be used to determine the correlation between the data sample and the sample features, and an evaluator may be used to determine the score value with which the sample features obey the prior distribution. The overall network structure used by the computer device to train the clustering model may be as shown in fig. 3.
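As a purely illustrative sketch of such a clustering model (PyTorch is used here and in the later examples; the layer sizes, channel counts and 32×32 input are assumptions, not the patent's exact architecture):

```python
import torch
import torch.nn as nn

class ClusterEncoder(nn.Module):
    """Maps a data sample to (class feature z_c, intra-class style feature z_s)."""
    def __init__(self, num_classes=10, style_dim=50):
        super().__init__()
        # Two convolution blocks for a 32x32 image, as suggested in the text.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),    # 32x32 -> 16x16
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),  # 16x16 -> 8x8
            nn.Flatten(),
        )
        self.class_head = nn.Linear(128 * 8 * 8, num_classes)  # Softmax-activated
        self.style_head = nn.Linear(128 * 8 * 8, style_dim)    # linearly activated

    def forward(self, x):
        h = self.backbone(x)
        z_c = torch.softmax(self.class_head(h), dim=1)  # probability per cluster
        z_s = self.style_head(h)                        # intra-class style vector
        return z_c, z_s
```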
At step 206, the correlation between the data sample and the sample characteristics is determined.
The discriminator is a deep neural network composed of multiple fully connected layers, for example three or more. The discriminator judges whether a data sample is correlated with sample features, so as to maximize the mutual information between the data sample and the sample features. The computer device may input the data sample and the extracted sample features to the discriminator simultaneously. The data samples include a first sample and a second sample. When the data sample input to the discriminator is the first sample and the extracted sample features are derived from the second sample, with the first sample different from the second sample, the pair forms a negative sample and the discriminator judges the two to be uncorrelated. When the data sample input to the discriminator is the first sample and the extracted sample features come from the first sample, the pair forms a positive sample and the discriminator judges the two to be correlated. In fig. 3, a shoe image may serve as the first sample and a clothing image as the second sample: the first sample is correlated with the first sample features and uncorrelated with the second sample features. When the discriminator can correctly judge whether a data sample is correlated with the sample features, the sample features contain the information related to the data sample, and the purpose of maximizing the mutual information is achieved.
Step 208, determining the scoring value of the sample characteristics which obey the prior distribution; the prior distribution comprises category prior distribution corresponding to the sample category characteristics and intra-class style prior distribution corresponding to the intra-sample style characteristics.
The evaluator introduces a prior distribution for the sample features. The evaluator is also a deep neural network composed of multiple fully connected layers, for example three or more. The prior distribution includes a category prior distribution and an intra-class style prior distribution: the category prior distribution may be referred to as the categorical distribution for short, and the intra-class style prior distribution may be a Gaussian distribution. The evaluator introduces the categorical distribution

$$z_c \sim \mathrm{Cat}(K,\, p = 1/K)$$

for the sample class feature z_c and the Gaussian distribution

$$z_s \sim \mathcal{N}(0,\, \sigma^2)$$

for the sample intra-class style feature z_s, so that the sample class features and the sample intra-class style features can be effectively decoupled.
When the sample features obey the prior distribution, the output class feature part is a one-hot vector, and the element with the largest value in the one-hot vector can directly represent the category of the data sample, avoiding a further clustering operation. Meanwhile, the data samples can be prevented from collapsing into only one or a few classes, guaranteeing that they are grouped into the required number of classes, for example 10.
Step 210, the clustering model is adjusted based on at least the relevance and the score value.
And 212, clustering the data to be clustered in the clustering service by using the adjusted clustering model.
The computer device may reversely optimize the network parameters of the clustering model using the correlation between the data samples and the sample features and the score values of the sample features obeying the prior distribution. Each network parameter in the clustering model can be optimized by a back-propagation method, for example an Adam-based gradient descent. When the clustering model is reversely optimized, the weights of the network parameters of the clustering model, the discriminator and the evaluator can all be updated. During training, the learning rate is 0.0001, the Adam parameters controlling the convergence of the loss function are set to β₁ = 0.5 and β₂ = 0.9, and the batch size is set to 64. In the reverse optimization process, the evaluator, the clustering model and the discriminator can be optimized alternately, each time using the same batch of data samples. When the loss function of the evaluator starts to converge, the feature distribution learned by the clustering model is close to the prior distribution, and training can be stopped.
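Under the stated hyperparameters, the optimizer setup might look like the following sketch, assuming `encoder`, `discriminator` and `evaluator` are instances of the three networks described above:

```python
import torch

# encoder, discriminator, evaluator are assumed nn.Module instances.
opt_q = torch.optim.Adam(encoder.parameters(),       lr=1e-4, betas=(0.5, 0.9))
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4, betas=(0.5, 0.9))
opt_c = torch.optim.Adam(evaluator.parameters(),     lr=1e-4, betas=(0.5, 0.9))
batch_size = 64
```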
In current image clustering methods based on deep neural networks, one approach learns image features based on self-encoding and then improves the clustering effect by optimizing a clustering loss function; its network structure is shown in fig. 4. The image x is mapped to corresponding image features by the clustering model, the image features are input to a decoder, the decoder generates a reconstructed image, and the reconstructed image is compared with the original image. In this method, an additional clustering algorithm such as K-means is required to obtain the clustering result. Another clustering method is based on a generative adversarial network, whose structure is shown in fig. 5. A generator G maps the image features Zn and Zc to a generated image Xg, and a clustering model E encodes it to recover the encoded image features Zn and Zc. A discriminator D performs correlation judgment between the generated image Xg and the original image Xr, from which the clustering result is obtained. In this method, the network is difficult to train because realistic images need to be generated.
In this embodiment, for the data samples of the clustering objects in the clustering service, no additional clustering algorithm needs to be executed and no realistic image needs to be generated for comparison with the original image. By determining the correlation between the data samples and the sample features, introducing the category prior distribution for the sample class features and the intra-class style prior distribution for the sample intra-class style features, and determining the score value of the sample features obeying the prior distribution, the clustering model is trained using the correlation and the score, which can effectively improve the clustering model's learning of the sample features. The feature distribution learned by the clustering model approaches the prior distribution, and the sample class features and the sample intra-class style features are effectively decoupled, so that the adjusted clustering model can quickly and accurately obtain the corresponding cluster categories according to the class features of the data to be clustered. Therefore, the accuracy of data clustering is effectively improved without manual labeling.
In one embodiment, the method further comprises: enhancing the data sample, and mapping to obtain enhanced sample characteristics through a clustering model; the enhanced sample features comprise enhanced sample category features and enhanced sample in-class style features; determining the category characteristic difference between the sample category characteristic and the enhanced sample category characteristic; training the clustering model based at least on the relevance and score values comprises: and adjusting the clustering model according to the relevance, the category characteristic difference and the grade value.
The computer device inputs the data samples into the clustering model and maps them to corresponding sample features. The sample features include sample class features and sample intra-class style features. The sample class feature is a vector activated by the Softmax function of the clustering model; the elements of the vector represent the probabilities that the data sample belongs to each cluster category, and the vector dimension is set to the number of cluster categories. The sample intra-class style feature is a linearly activated vector describing the intra-class style information of the data sample; its dimension may be a preset number, for example 50. Although the sample class feature and the sample intra-class style feature are activated differently and take different values, part of their information may still be mixed together. By introducing the category prior distribution for the sample class features and the intra-class style prior distribution for the sample intra-class style features, the two can be effectively decoupled.
Because different styles exist within the same class of data samples, and a style change does not alter the original data category, data enhancement can be applied without changing the original data category. The computer device performs enhancement processing on the data samples, with different enhancement processing for different data types. For example, when the data samples are images, the enhancement processing includes random cropping, random horizontal flipping, color jittering, and randomly combining the color channels of the image. When the data sample is text or speech, the enhancement processing includes random clipping, random position transformation and the like. The enhanced data samples are input into the clustering model and mapped to enhanced sample features. The computer device extracts the sample class feature from the sample features and the enhanced sample class feature from the enhanced sample features, inputs both to the evaluator, and identifies the class feature difference between them through the evaluator. The elements in the vector of the sample class feature are the probabilities that the data sample belongs to each cluster category. The class feature difference between the sample class feature and the enhanced sample class feature may be measured by a divergence.
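For image samples, the listed enhancement operations could be assembled as in the following sketch; the torchvision transform magnitudes are assumptions:

```python
import torch
import torchvision.transforms as T

# Illustrative augmentation pipeline; parameter values are assumptions.
augment = T.Compose([
    T.RandomResizedCrop(32, scale=(0.8, 1.0)),   # random cropping
    T.RandomHorizontalFlip(),                    # random horizontal flipping
    T.ColorJitter(0.4, 0.4, 0.4, 0.1),           # color jittering
    T.ToTensor(),
    T.Lambda(lambda x: x[torch.randperm(3)]),    # randomly permute color channels
])
```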
The computer device may reversely optimize the network parameters of the clustering model using the correlation between the data samples and the sample features, the score values of the sample features obeying the prior distribution, and the class feature difference between the sample class features and the enhanced sample class features. In the back-propagation process of the network, the weights corresponding to the network parameters of the clustering model, the discriminator and the evaluator are updated using gradient descent. As a result, the sample features learned by the clustering model are related to the data samples, the learned sample class features can represent the cluster categories of the data samples, and the learned intra-class style features can represent the differences among data samples of the same class. After the data enhancement processing, the sample class feature of a data sample remains unchanged; that is, the style of the data sample may change to some extent, but it still belongs to the same class. Moreover, due to the prior distribution constraint, the sample class feature can be made as close as possible to a one-hot vector, i.e. the values of most elements are close to 0 and the value of only one element is close to 1, so that the cluster category corresponding to the data sample can be determined directly from the one-hot sample class feature.
In one embodiment, the data samples include a first sample and a second sample; determining the correlation of the data sample and the sample characteristics comprises: obtaining a first sample vector, and splicing the first sample vector by using the sample characteristics of the first sample to generate a spliced first sample vector; splicing the sample characteristics of the second sample with the first sample vector to generate a spliced second sample vector; and identifying the correlation between the spliced first sample vector and the spliced second sample vector through a discriminator to obtain the correlation between the sample characteristics of the first sample and the first sample.
The data samples include a first sample and a second sample, wherein the first sample and the second sample may be completely different or the same. The first sample is input into the clustering model, and the sample characteristic corresponding to the first sample is obtained through mapping and can also be called as the first sample characteristic. And inputting the second sample into the clustering model, and mapping to obtain a sample characteristic corresponding to the second sample, which can also be called as a second sample characteristic. The first sample feature and the second sample feature may be multidimensional vectors, for example 50 dimensions. The computer device converts the first sample into a first sample vector. And the computer equipment splices the first sample characteristic and the first sample vector to generate a spliced first sample vector. The way of stitching may be to add the first sample vector after the first sample feature. The first sample feature may also be added after the first sample vector. The computer device may adopt the above splicing manner to splice the second sample feature with the first sample vector, and generate a spliced second sample vector. And inputting the spliced first sample vector and the spliced second sample vector into a discriminator, comparing the two vectors by the discriminator, outputting 1 if the two vectors are correlated, and outputting 0 if the two vectors are uncorrelated. When the discriminator can correctly judge whether the data sample is related to the sample characteristics, the information related to the data sample is contained in the sample characteristics, and the purpose of maximizing mutual information is achieved, so that the sample characteristics learned by the clustering model are related to the data sample.
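A minimal sketch of this concatenation-based discriminator, with assumed layer widths and the negative pairs built by shuffling a batch:

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Judges whether a (sample, feature) pair is correlated (three FC layers)."""
    def __init__(self, sample_dim, feature_dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(sample_dim + feature_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # logit: 1 = correlated, 0 = uncorrelated
        )

    def forward(self, x_vec, z):
        # Append the flattened sample vector after the sample feature.
        return self.net(torch.cat([z, x_vec], dim=1))

# Positive pairs: each sample with its own features, z = cat([z_c, z_s]).
# Negative pairs: each sample with the features of a shuffled sample:
#   neg_z = z[torch.randperm(z.size(0))]
```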
In one embodiment, determining the score value that the sample feature obeys the prior distribution comprises: determining a category prior distribution result corresponding to the sample category characteristics through an evaluator; determining an intra-class style prior distribution result corresponding to the intra-class style characteristics of the sample through an evaluator; and grading the category distribution result and the intra-category style prior distribution result through an evaluator to obtain a grading value of the sample characteristics obeying the prior distribution.
The evaluator introduces a prior distribution for the sample features. The prior distribution includes a category prior distribution and an intra-class style prior distribution: the category prior distribution may be referred to as the categorical distribution for short, and the intra-class style prior distribution may be a Gaussian distribution. The categorical distribution may be

$$z_c \sim \mathrm{Cat}(K,\, p = 1/K)$$

where z_c is the sample class feature, Cat is the categorical distribution (a one-hot vector), K is the number of cluster categories, and p is the reciprocal of K. The sample intra-class style feature may follow

$$z_s \sim \mathcal{N}(0,\, \sigma^2)$$

where z_s is the sample intra-class style feature, N is the Gaussian distribution, and σ is the standard deviation, which may be a preset value such as 0.1.
The computer device inputs the sample class feature and the sample intra-class style feature to the evaluator simultaneously, and the evaluator outputs a categorical distribution result corresponding to the sample class feature and a Gaussian distribution result corresponding to the sample intra-class style feature, respectively. The categorical distribution result may be a category vector, which may be a one-hot vector; the Gaussian distribution result may be a style vector.
In one embodiment, the scoring the category distribution result and the intra-category style prior distribution result by the evaluator comprises: splicing the category distribution vector of the sample category characteristics and the Gaussian distribution vector of the style characteristics in the sample category to generate a prior distribution vector; and scoring the prior distribution vector through an evaluator to obtain a score value of the sample characteristic obeying to the prior distribution.
The computer device concatenates the categorical distribution result and the Gaussian distribution result, i.e. concatenates the corresponding category vector and style vector. The concatenation may append the elements of the style vector after the last element of the category vector, or append the elements of the category vector after the last element of the style vector. The evaluator scores the concatenated vector to obtain a corresponding score, which is the probability that the sample features obey the prior distribution: the higher the probability, the more closely the sample features obey the prior distribution. When the sample features obey the prior distribution, the output sample class feature can be made as close as possible to a one-hot vector, so that the element with the largest value can directly represent the category of the data sample, avoiding a further clustering operation. Furthermore, obeying the prior distribution prevents the data samples from being grouped into only one or a few classes, guaranteeing that they are grouped into the desired number of classes.
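A sketch of drawing one vector from this prior — a one-hot category vector from Cat(K, p = 1/K) concatenated with a Gaussian style vector — using the K = 10, 50-dimensional style and σ = 0.1 values mentioned elsewhere in this description:

```python
import torch
import torch.nn.functional as F

def sample_prior(batch, num_classes=10, style_dim=50, sigma=0.1):
    """Draw (one-hot category vector ++ Gaussian style vector) from the prior."""
    cat = F.one_hot(torch.randint(num_classes, (batch,)),
                    num_classes=num_classes).float()   # Cat(K, p = 1/K)
    style = sigma * torch.randn(batch, style_dim)      # N(0, sigma^2)
    return torch.cat([cat, style], dim=1)              # style appended after category
```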
In one embodiment, the method further comprises: determining the correlation between the data sample and the sample characteristics through a discriminator; determining, by an evaluator, a score value of a sample characteristic subject to a prior distribution; and alternately optimizing the clustering model, the discriminator and the evaluator at least according to the relevance and the score.
Correlations between the data samples and sample features are identified by the discriminator. The loss function with which the discriminator identifies correlations between data samples and sample features may be referred to as the mutual information loss function, and the discriminator may be trained by it. The mutual information loss function can be expressed by the following formula (1):

$$L_{MI} = -\,\mathbb{E}_{(X,Z)\sim Q(Z|X)P_X}\big[\log S(D(X,Z))\big] \;-\; \mathbb{E}_{X\sim P_X,\,Z\sim Q_Z}\big[\log\big(1 - S(D(X,Z))\big)\big] \tag{1}$$

wherein X is a data sample, Z is a sample feature, S is the sigmoid function, E denotes expectation, D is the discriminator used to judge whether X and Z are correlated, Q(Z|X) is the posterior distribution of Z obtained by the mapping of the clustering model, P_X is the prior distribution of the input samples, Q_Z is the aggregated posterior distribution of Z, and the first expectation is taken with X, Z obeying Q(Z|X)P_X. In the first term X and Z are positive samples (Z is the feature of X); in the second term X and Z are negative samples (Z is drawn from the aggregated posterior independently of X).
in the process of training the discriminator through the mutual information loss function, the smaller the loss function value is, the more accurate the correlation judgment is, and the smaller the influence on the weight of each layer in the discriminator network is during reverse optimization. When the discriminator can correctly judge whether the data sample is related to the features, the information related to the data sample is contained in the features, and the purpose of maximizing mutual information is achieved.
The class feature difference between the sample class feature and the enhanced sample class feature may be measured by a divergence, for example the KL divergence. The corresponding loss function may be referred to as the class difference loss function, given by the following equation (2):

$$L_{Aug} = \mathrm{KL}\big(Q(Z_c|X)\,\big\|\,Q(Z_c|T(X))\big) \tag{2}$$

wherein KL is the KL divergence, Q is the clustering model, Z_c is the sample class feature, X is the data sample, T is the data enhancement, Q(Z_c|X) is the posterior distribution of Z_c, and Q(Z_c|T(X)) is the posterior distribution of the enhanced sample features.
The smaller the function value of the class difference loss function is, the smaller the class feature difference between the sample class feature and the enhanced sample class feature is, and correspondingly, the smaller the probability of the change of the sample class feature after the data sample is subjected to data enhancement processing is.
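Equation (2) could be computed as in the following sketch; the `eps` guard against taking the log of zero is an implementation assumption:

```python
import torch

def class_difference_loss(z_c, z_c_aug, eps=1e-8):
    """L_Aug = KL(Q(Zc|X) || Q(Zc|T(X))), averaged over the batch."""
    # z_c and z_c_aug are the Softmax outputs of the clustering model for
    # the original and the enhanced samples respectively.
    kl = (z_c * (torch.log(z_c + eps) - torch.log(z_c_aug + eps))).sum(dim=1)
    return kl.mean()
```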
The sample features are scored by the evaluator for obedience to the prior distribution. The loss function that introduces the prior distribution for the sample features may be referred to as the prior distribution loss function. Different prior distribution loss functions may be defined for the clustering model and the evaluator, respectively; through them, the sample features mapped by the clustering model can be brought as close to the prior distribution as possible. The prior distribution loss function of the clustering model may be as shown in equation (3), and that of the evaluator as shown in equation (4):

$$L_{Adv}^{Q} = -\,\mathbb{E}_{Z\sim Q_Z}\big[C(Z)\big] \tag{3}$$

wherein Q is the clustering model, Z is the sample feature of a data sample, C(Z) is the probability value of whether the sample feature obeys the prior distribution, Q_Z is the aggregated posterior distribution of Z, and the expectation is taken with Z obeying Q_Z.

$$L_{Adv}^{C} = \mathbb{E}_{Z\sim Q_Z}\big[C(Z)\big] - \mathbb{E}_{Z\sim P_Z}\big[C(Z)\big] + \lambda\,\mathbb{E}_{\hat{Z}}\Big[\big(\big\|\nabla_{\hat{Z}}\, C(\hat{Z})\big\|_2 - 1\big)^2\Big] \tag{4}$$

wherein C is the evaluator, P_Z is the prior distribution, Ẑ is a feature sampled on the line segment between features drawn from the prior distribution P_Z and the aggregated posterior distribution Q_Z, the last term is a gradient penalty used to make the evaluator C satisfy the Lipschitz constraint so that its evaluation score, i.e. the probability of obeying the prior distribution, does not change too drastically, and λ is the gradient penalty coefficient, set to 10.
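Equations (3) and (4) have the shape of a critic with a gradient penalty; a sketch under that reading, with λ = 10 as stated:

```python
import torch

def evaluator_loss(evaluator, z_real, z_fake, lam=10.0):
    """Eq. (4): score prior samples high, encoder features low, plus penalty."""
    # z_real ~ prior P_Z, z_fake ~ aggregated posterior Q_Z (encoder output).
    loss = evaluator(z_fake).mean() - evaluator(z_real).mean()
    # Gradient penalty at points interpolated between prior and posterior samples.
    alpha = torch.rand(z_real.size(0), 1)
    z_hat = (alpha * z_real + (1 - alpha) * z_fake).requires_grad_(True)
    grad = torch.autograd.grad(evaluator(z_hat).sum(), z_hat, create_graph=True)[0]
    return loss + lam * ((grad.norm(2, dim=1) - 1) ** 2).mean()

def encoder_adv_loss(evaluator, z_fake):
    """Eq. (3): the clustering model tries to make its features score high."""
    return -evaluator(z_fake).mean()
```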
In one embodiment, a mutual information loss function, a class difference loss function, a priori distribution loss function of the clustering model may be used as the sub-loss functions to define the total loss function of the clustering model. Each of the sub-loss functions may have a corresponding weight. The total loss function of the discriminator may be defined by the mutual information loss function and its corresponding weights. The overall loss function of the evaluator may be defined by its a priori distributed loss function and its weights.
The total loss function of the clustering model is as shown in equation (5), that of the discriminator in equation (6), and that of the evaluator in equation (7):

$$L_{Q} = \beta_{MI}\, L_{MI} + \beta_{Aug}\, L_{Aug} + \beta_{Adv}\, L_{Adv}^{Q} \tag{5}$$

$$L_{D} = \beta_{MI}\, L_{MI} \tag{6}$$

$$L_{C} = \beta_{Adv}\, L_{Adv}^{C} \tag{7}$$

wherein L_Q is the total loss function of the clustering model, L_MI is the mutual information loss function, L_Aug is the class difference loss function, L_Adv^Q is the prior distribution loss function of the clustering model, and β_MI, β_Aug and β_Adv are the weights of L_MI, L_Aug and the prior distribution loss, respectively. β_MI and β_Adv can be set to corresponding fixed values, e.g. β_MI to 0.5 and β_Adv to 1. β_Aug is related to the data set of data samples and may be set as follows. Specifically, the computer device may perform nonlinear dimensionality reduction on the sample features to generate a corresponding visual dimensionality-reduction map, and select the weight of the class difference loss function according to that map. The visual dimensionality-reduction map is the result of reducing high-dimensional data to low-dimensional data, for example two or three dimensions, so that the result can be inspected visually. For example, t-SNE can be used to perform the nonlinear dimensionality reduction on the sample features, and a visual dimensionality-reduction map, i.e. a t-SNE map, is generated from the processing result. In the t-SNE map, the data samples gather into clusters. When the value of β_Aug is low, the clusters of data samples are dispersed; as β_Aug grows, the resulting features tend to aggregate and the clusters may even overlap. The clustering results also differ between data sets of different data types. Taking image data samples as an example, at β_Aug = 2 the clusters in the t-SNE map have no overlap, as shown in fig. 6-1, while at β_Aug = 3 the clusters overlap, as shown in fig. 6-2. The largest value between 2 and 3 at which the clusters do not yet overlap can thus be selected as β_Aug, making the total loss function of the clustering model more accurate and the clustering result of the trained clustering model more accurate.
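As an illustrative aid for the β_Aug selection just described (not part of the patented method itself), a t-SNE map of the sample features could be produced as follows; `features.npy` is a hypothetical file holding an (N, D) array of encoder outputs:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

features = np.load("features.npy")                  # hypothetical feature dump
emb = TSNE(n_components=2).fit_transform(features)  # nonlinear reduction to 2-D
plt.scatter(emb[:, 0], emb[:, 1], s=2)
plt.title("t-SNE map: check cluster overlap for the current beta_Aug")
plt.show()
```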
In another embodiment, the computer device may replace the discriminator with a decoder, reconstruct the data sample from the sample features through the decoder to generate a reconstructed sample, and obtain the correlation between the data sample and the sample features by judging whether the reconstructed sample is the same as the data sample. The mutual information loss function corresponding to the decoder can be as shown in the following equation (8):

$$L_{r} = \left\|x - x'\right\|^{2} \tag{8}$$

wherein x is the sample data and x' is the reconstructed sample.

The total loss function of the clustering model is then as shown in equation (9):

$$L_{Q} = \beta_{r}\, L_{r} + \beta_{Aug}\, L_{Aug} + \beta_{Adv}\, L_{Adv}^{Q} \tag{9}$$

wherein β_r is the weight of L_r, β_Aug is the weight of L_Aug, and β_Adv is the weight of the prior distribution loss. β_r and β_Adv may be set to corresponding fixed values, and β_Aug can be determined in the manner described above.
Training of the clustering model can be performed by reverse optimization, in which the evaluator, the clustering model and the discriminator are optimized alternately: the evaluator first, then the clustering model and the discriminator. Specifically, the evaluator is first reversely optimized using its total loss function, so that the probability the evaluator assigns to sample features that obey the prior distribution approaches 1 and the probability it assigns to sample features that do not obey the prior distribution approaches 0. Then the clustering model is reversely optimized using the total loss function of the clustering model and the discriminator using the total loss function of the discriminator, so that the sample features output by the clustering model obtain scores as high as possible, i.e. the probability that the sample features obey the prior distribution is as high as possible. This alternating optimization process is repeated until the sample features output by the clustering model obtain high scores, i.e. the probability that they obey the prior distribution approaches 1, so that they obey the prior distribution.
In one embodiment, the alternately optimizing the clustering model, the discriminator and the evaluator based on at least the relevance and the score comprises: firstly, optimizing the network parameters of the evaluator at least once according to the scoring value; and optimizing the network parameters of the clustering model at least according to the correlation and the score, and optimizing the network parameters of the discriminator according to the correlation.
Specifically, due to the large number of data samples, all the data samples cannot be input into the clustering model at once for training. In reverse optimization, the data samples may be randomly divided into multiple batches, each batch using a fixed number of data samples, which may also be referred to as batch samples. For example, a batch sample may be set to 64 data samples, i.e. the batch size is set to 64.
During training, the computer device determines the score value of the sample features obeying the prior distribution and the correlation between the data samples and the sample features, and updates the weights corresponding to the network parameters as the clustering model, the discriminator and the evaluator are alternately optimized. The network parameters of the evaluator are first optimized at least once according to the score value of the sample features obeying the prior distribution and the total loss function of the evaluator; then the network parameters of the clustering model are optimized according to the correlation between the data samples and the sample features, the score value, the class feature difference and the total loss function of the clustering model, and the network parameters of the discriminator are optimized according to the correlation and the total loss function of the discriminator. For example, the evaluator is optimized four times, followed by one optimization of the clustering model and the discriminator. The clustering model and the discriminator can be reversely optimized in sequence or simultaneously.
When the evaluator is reversely optimized, the closer its output for a prior-distribution input is to 1, the smaller the loss function value and the smaller the parameter changes during back-propagation; likewise, the closer its output for features that do not obey the prior distribution is to 0, the smaller the loss function and the smaller the parameter changes during back-propagation. When the clustering model is reversely optimized, the closer the evaluator's output for the features of a data sample is to 1, the smaller the loss function value and the smaller the parameter changes during back-propagation; the prior-distribution input itself is not considered when the clustering model is reversely optimized. During the reverse optimization of the clustering model, the difference between the feature distribution currently learned by the clustering model and the prior distribution can be indicated by the total loss function of the evaluator: when this loss starts to converge, the feature distribution learned by the clustering model is close to the prior distribution, and training can be stopped.
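Putting the pieces together, one alternating round on a batch `x` — several evaluator updates followed by one joint update of the clustering model and the discriminator — might read as below. The helper functions, optimizers, `augmented_x` and `beta_aug` are the illustrative ones from the earlier sketches, not the patent's own code:

```python
import torch

with torch.no_grad():                       # freeze encoder output for critic steps
    z_c, z_s = encoder(x)
    z_fixed = torch.cat([z_c, z_s], dim=1)

for _ in range(4):                          # optimize the evaluator first
    loss_c = evaluator_loss(evaluator, sample_prior(x.size(0)), z_fixed)
    opt_c.zero_grad(); loss_c.backward(); opt_c.step()

z_c, z_s = encoder(x)                       # then clustering model + discriminator
z = torch.cat([z_c, z_s], dim=1)
z_c_aug, _ = encoder(augmented_x)           # features of the enhanced samples
loss_q = 0.5 * mutual_info_loss(discriminator, x.flatten(1), z) \
       + beta_aug * class_difference_loss(z_c, z_c_aug) \
       + 1.0 * encoder_adv_loss(evaluator, z)   # beta_MI = 0.5, beta_Adv = 1
opt_q.zero_grad(); opt_d.zero_grad()
loss_q.backward()
opt_q.step(); opt_d.step()
```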
In one embodiment, a processing method for data clustering is provided, which can be applied to a computer device, as shown in fig. 7, and includes the following steps:
step 702, acquiring data to be clustered in the clustering service.
Step 704, encoding the data to be clustered into data features through an encoder, the encoder being trained at least according to the correlation and the score value. The correlation is the result of correlation discrimination, by a discriminator, between the data sample and the sample features obtained by encoding the data sample with the encoder; the score value is the result of the evaluator scoring the sample features for obedience to the prior distribution. The sample features comprise sample class features and sample intra-class style features; the prior distribution comprises a category prior distribution corresponding to the sample class features and an intra-class style prior distribution corresponding to the sample intra-class style features.
And step 706, clustering the corresponding data to be clustered according to the category characteristics in the data characteristics.
The computer device may obtain the data to be clustered in a variety of ways, where a variety means two or more. For example, the computer device may obtain the data to be clustered in the clustering service through the Internet, or, upon receiving a clustering task, obtain the data to be clustered from a database according to that task. The data to be clustered may include a variety of data types, such as image, text and voice.
A pre-trained encoder may be used as the clustering model. The encoder is a neural network model. For example, when the data sample is an image, the encoder may use a convolutional neural network; when the data sample is text or speech, the encoder may use a neural network such as an LSTM (Long Short-Term Memory) or BERT (Bidirectional Encoder Representations from Transformers). The encoder may be trained with the aid of neural networks such as the discriminator and the evaluator. The discriminator is a deep neural network composed of a plurality of fully connected layers, for example three or more; the evaluator is likewise a deep neural network composed of multiple fully connected layers, for example three or more.
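One plausible reading of "a deep neural network composed of three or more fully connected layers" is sketched below; the widths, activations and input dimensions are assumptions chosen for illustration.

```python
import torch.nn as nn

def make_scorer(in_dim, hidden=256):
    # Three fully connected layers ending in a scalar score in (0, 1).
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, 1), nn.Sigmoid(),
    )

discriminator = make_scorer(in_dim=512 + 138)  # data vector + sample feature
evaluator = make_scorer(in_dim=138)            # class part + intra-class style part
```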
In the training process of the encoder, no manual labeling of the data samples is needed in advance. The computer device inputs the data samples to the encoder, which maps each data sample to its corresponding sample features. The sample features include sample category features and intra-sample-class style features: each element of the sample category features gives the probability that the data sample belongs to the corresponding cluster category, and the intra-sample-class style features describe the intra-class style information of the data sample. The discriminator determines whether the data sample is correlated with the sample features; the degree of correlation between the two can be judged by computing the mutual information between them. Maximizing the mutual information between the data samples and the sample features effectively improves the quality of the sample feature representation and strengthens the encoder's learning of the feature distribution, thereby promoting higher clustering precision.
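The mutual-information maximization can be approximated by training the discriminator to tell matched (sample, feature) pairs from mismatched ones; the pairing-by-permutation scheme below is a common estimator used as an assumption here, not a quotation of the patent.

```python
import torch
import torch.nn.functional as F

def mutual_info_loss(discriminator, x_vec, feats):
    # Matched pairs (a sample with its own features) should score near 1,
    # mismatched pairs (shuffled samples with those features) near 0.
    joint = torch.cat([x_vec, feats], dim=1)
    perm = torch.randperm(x_vec.size(0))
    marginal = torch.cat([x_vec[perm], feats], dim=1)
    s_j, s_m = discriminator(joint), discriminator(marginal)
    return (F.binary_cross_entropy(s_j, torch.ones_like(s_j)) +
            F.binary_cross_entropy(s_m, torch.zeros_like(s_m)))
```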
During the training process, the evaluator introduces a prior distribution for the sample features. The prior distribution includes a category distribution and a Gaussian distribution: the evaluator introduces the category distribution for the sample category features and the Gaussian distribution for the intra-sample-class style features, where the category distribution is a one-hot vector. When the sample features obey the prior distribution, the output category-feature part is a one-hot vector, so the element with the largest value in that vector can directly indicate the category of the data sample, and a further clustering operation is avoided. At the same time, the data samples are prevented from collapsing into only one or a few classes, and grouping into the required number of classes, such as 10, can be guaranteed. Introducing the category distribution for the sample category features and the Gaussian distribution for the intra-sample-class style features decouples the two parts of the sample features.
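Sampling from such a decoupled prior might look as follows; the 10-class / 128-dimensional split is an assumed example.

```python
import torch
import torch.nn.functional as F

def sample_prior(batch, n_classes=10, style_dim=128):
    labels = torch.randint(0, n_classes, (batch,))
    one_hot = F.one_hot(labels, n_classes).float()   # category distribution
    style = torch.randn(batch, style_dim)            # Gaussian intra-class style
    return torch.cat([one_hot, style], dim=1)        # spliced prior vector
```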
The computer device can use the correlation between the data samples and the sample features, the score values of the sample features subject to the prior distribution, and the category feature difference to reversely optimize the network parameters of the encoder; each network parameter in the encoder can be optimized by back propagation. Maximizing the mutual information, introducing the prior distribution and similar measures improve the encoder's learning of the sample features.
Further, the encoder may be trained in the manner described in the above embodiments. After training, the sample category features and the intra-sample-class style features are effectively decoupled. The data to be clustered is input to the encoder, which maps it to the corresponding category features; since the categories follow a one-hot distribution, the element with the largest value in the one-hot vector can be used directly to indicate the cluster category of the data to be clustered. The data to be clustered can thus be grouped into one class, several classes, or the required number of classes.
In this embodiment, the correlation between the data sample and the sample features is determined, the category prior distribution is introduced for the sample category features, and the score value with which the sample features obey the prior distribution is determined, so that the encoder is trained using the correlation and the score value, which effectively improves the encoder's learning of the sample features. The feature distribution learned by the encoder approaches the prior distribution, and the sample category features are effectively decoupled from the intra-sample-class style features, so that the cluster category of the data to be clustered can be obtained directly from the category features in the data features. The precision of data clustering is thus effectively improved without any manual labeling.
It should be understood that, although the steps in the flowcharts of figs. 2 and 7 are shown in the sequence indicated by the arrows, they are not necessarily performed in that sequence; unless explicitly stated herein, the steps may be performed in other orders. Moreover, at least some of the steps in figs. 2 and 7 may comprise multiple sub-steps or stages, which need not be performed at the same moment but may be performed at different moments, and which need not be performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
The clustering process is described below taking the data sample to be, in turn, an image sample, a text sample and a voice sample.
For image samples, the clustering model is an image clustering model. The image clustering model may be an image encoder employing a convolutional neural network, whose convolution blocks can be adjusted to the image size: the larger the image, the correspondingly larger the convolution block. The image sample is input to the image encoder, which encodes it into image sample features comprising image category features and intra-image-class style features. The image category is the classification result of the image, and different feature vectors may be used for different classes of images; for example, face images can be classified according to feature vectors of the facial features, while natural scene images can be classified using color feature vectors. The intra-image-class styles are different representations of the same class of image: several images may be recognized as the same category yet have different styles, such as different colors, different shooting postures or different shooting backgrounds, or the same category of image may appear in different styles such as cartoon, comic, painting or ink-wash. The correlation between the image sample and the image features is determined through a discriminator, and the score value with which the image features obey the prior distribution is determined through an evaluator, where the prior distribution includes an image category prior distribution corresponding to the image category features and an intra-image-class style prior distribution corresponding to the intra-image-class style features. Image category features related to a given category appear frequently in the image category prior distribution, while those of other categories appear rarely; the intra-image-class style prior distribution may be a Gaussian distribution over intra-image-class styles. The image sample is then enhanced, for example by random cropping, random horizontal flipping, color dithering or random recombination of color channels. The enhanced image sample is encoded by the image encoder into enhanced image features, and the category feature difference between the image category features and the enhanced image category features is determined. The image encoder is trained using the correlation between the image sample and the image features, the score value of the image features subject to the prior distribution, and the category feature difference. The trained encoder clusters the images to be clustered in the clustering service, yielding the image cluster category of each image to be clustered.
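A torchvision sketch of the image enhancements just listed (random cropping, random horizontal flipping, color dithering); the crop size and jitter strengths are assumptions, and the random recombination of color channels is left out here.

```python
from torchvision import transforms

augment_image = transforms.Compose([
    transforms.RandomResizedCrop(32),                   # random cropping
    transforms.RandomHorizontalFlip(p=0.5),             # random horizontal flip
    transforms.ColorJitter(brightness=0.4, contrast=0.4,
                           saturation=0.4, hue=0.1),    # color dithering
    transforms.ToTensor(),
])
```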
For text samples, the clustering model is a text clustering model. The text clustering model may be a text encoder employing a neural network such as an LSTM or BERT. The text sample is input to the text encoder, which encodes it into text sample features comprising text category features and intra-text-class style features. The text category features are phrase elements representing the meaning of the text; text category feature vectors can be generated from words, phrases, common concepts and the like. The intra-text-class styles are different expressions of the same category of text, for example different fonts, different background colors, different font effects or different text rotation angles. The correlation between the text sample and the text features is determined through a discriminator, and the score value with which the text features obey the prior distribution is determined through an evaluator, where the prior distribution includes a text category prior distribution corresponding to the text category features and an intra-text-class style prior distribution corresponding to the intra-text-class style features. Phrase elements related to a given class appear frequently in the text category prior distribution, while those of other classes appear rarely; the intra-text-class style prior distribution may be a Gaussian distribution over intra-text-class styles. The text sample is then enhanced, for example by synonym replacement, random deletion or insertion of certain words, random shuffling of the text order, or random masking of certain words. The enhanced text sample is encoded by the text encoder into enhanced text features, and the category feature difference between the text category features and the enhanced text category features is determined. The text encoder is trained using the correlation between the text sample and the text features, the score value of the text features subject to the prior distribution, and the category feature difference. The trained encoder clusters the texts to be clustered in the clustering service, yielding the text cluster category of each text to be clustered.
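A few of these text enhancements, sketched over a token list; synonym replacement needs an external lexicon and is omitted, and the probabilities are assumptions.

```python
import random

def augment_text(tokens, p=0.1, mask_token="[MASK]"):
    out = [t for t in tokens if random.random() > p]    # random deletion
    if random.random() < 0.5:
        random.shuffle(out)                             # random reordering
    return [mask_token if random.random() < p else t
            for t in out]                               # random masking
```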
For voice samples, the clustering model is a voice clustering model. The voice clustering model may be a speech encoder employing a neural network such as an LSTM or BERT. The voice sample is input to the speech encoder, which encodes it into voice sample features comprising voice category features and intra-voice-class style features. The voice category features may be MFCC (Mel-Frequency Cepstral Coefficient) features extracted from the speech; for example, the extracted MFCC features differ between voice recordings of different age categories. The intra-voice-class styles are different expressions of the same category of speech, for example different speeds, moods, intonations or accents. The correlation between the voice sample and the voice features is determined through a discriminator, and the score value with which the voice features obey the prior distribution is determined through an evaluator, where the prior distribution includes a voice category prior distribution corresponding to the voice category features and an intra-voice-class style prior distribution corresponding to the intra-voice-class style features. MFCC features related to a given class appear frequently in the voice category prior distribution, while those of other classes appear rarely; the intra-voice-class style prior distribution may be a Gaussian distribution over intra-voice-class styles. Enhancement may be applied to the time-frequency spectrum of the audio in the voice sample, including same-class enhancement, time-shift enhancement, pitch-transform enhancement and the like; alternatively, the spectrogram of the voice sample may be enhanced by time-shift transformation, speed adjustment, mixing in background sounds such as human-voice noise, music background noise or real noise, volume adjustment, stretching of the audio signal, and so on. The enhanced voice sample is encoded by the speech encoder into enhanced voice features, and the category feature difference between the voice category features and the enhanced voice category features is determined. The speech encoder is trained using the correlation between the voice sample and the voice features, the score value of the voice features subject to the prior distribution, and the category feature difference. The trained encoder clusters the voices to be clustered in the clustering service, yielding the voice cluster category of each voice to be clustered.
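Waveform-level sketches of some of these audio enhancements; the shift range, gain range and noise level are assumptions, and `noise` is assumed to be at least as long as the waveform.

```python
import numpy as np

def time_shift(wave, max_shift=1600):
    return np.roll(wave, np.random.randint(-max_shift, max_shift + 1))

def adjust_volume(wave, low=0.5, high=1.5):
    return wave * np.random.uniform(low, high)

def mix_background(wave, noise, level=0.1):
    return wave + level * noise[:len(wave)]   # human voice / music / real noise
```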
In one embodiment, as shown in fig. 8, there is provided a processing apparatus for data clustering, which may be implemented, as all or part of a computer device, by software modules, hardware modules, or a combination of both. The apparatus specifically includes: a first obtaining module 802, a feature mapping module 804, a correlation identification module 806, a prior distribution scoring module 808, and a cluster training module 810, wherein:
a first obtaining module 802, configured to obtain a data sample; the data samples are samples of clustered objects in a clustering service.
A feature mapping module 804, configured to map the data samples into sample features through the clustering model; the sample features include sample class features and intra-sample class style features.
A correlation identification module 806 for determining a correlation of the data sample and the sample characteristics.
A prior distribution scoring module 808 for determining a score value at which the sample features obey the prior distribution; the prior distribution comprises category prior distribution corresponding to the sample category characteristics and intra-class style prior distribution corresponding to the intra-sample style characteristics.
A cluster training module 810 for adjusting the cluster model based at least on the relevance and the score value; and clustering the data to be clustered in the clustering service by using the adjusted clustering model.
In one embodiment, as shown in fig. 9, the processing apparatus for data clustering further includes: an enhancement processing module 812, configured to perform enhancement processing on the data sample; the feature mapping module 804, further configured to map the enhanced data sample into enhanced sample features through the clustering model, the enhanced sample features comprising enhanced sample category features and enhanced intra-sample-class style features; and a feature difference identification module 814, configured to determine the category feature difference between the sample category features and the enhanced sample category features. The cluster training module 810 is further configured to adjust the clustering model according to the correlation, the category feature difference and the score value.
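The category feature difference could, for instance, be measured as a mean-squared distance between the class features of a sample and of its enhanced version; the metric below is an assumption, since this passage does not fix one.

```python
import torch.nn.functional as F

def class_diff_loss(class_feats, enhanced_class_feats):
    # Smaller when the original and enhanced samples keep the same class.
    return F.mse_loss(class_feats, enhanced_class_feats)
```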
In one embodiment, the data samples include a first sample and a second sample. The correlation identification module 806 is further configured to obtain a first sample vector and splice it with the sample features of the first sample to generate a spliced first sample vector; splice the sample features of the second sample with the first sample vector to generate a spliced second sample vector; and identify, through the discriminator, the correlation between the spliced first sample vector and the spliced second sample vector, obtaining the correlation between the first sample and its sample features.
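The splicing scheme of this embodiment, written out under the assumption that all vectors are 2-D batches:

```python
import torch

def spliced_pairs(x1_vec, feats1, feats2):
    pos = torch.cat([x1_vec, feats1], dim=1)   # spliced first sample vector
    neg = torch.cat([x1_vec, feats2], dim=1)   # spliced second sample vector
    return pos, neg                            # both scored by the discriminator
```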
In one embodiment, the prior distribution scoring module 808 is further configured to determine, through the evaluator, a category prior distribution result corresponding to the sample category features; determine, through the evaluator, an intra-class style prior distribution result corresponding to the intra-sample-class style features; and score the category prior distribution result and the intra-class style prior distribution result through the evaluator to obtain the score value of the sample features subject to the prior distribution.
In one embodiment, the prior distribution scoring module 808 is further configured to splice the category distribution vector of the sample category features with the Gaussian distribution vector of the intra-sample-class style features to generate a prior distribution vector, and to score the prior distribution vector through the evaluator to obtain the score value of the sample features subject to the prior distribution.
In one embodiment, the correlation identification module 806 is further configured to determine a correlation between the data sample and the sample feature by the discriminator; the a priori distribution scoring module 808 is further configured to determine, by the evaluator, a score value at which the sample feature obeys the a priori distribution; the cluster training module 810 is further configured to perform an alternating optimization of the cluster model, the discriminator, and the evaluator based on at least the relevance and the score.
In one embodiment, the cluster training module 810 is further configured to first optimize the network parameters of the evaluator at least once according to the score value, then optimize the network parameters of the clustering model at least according to the correlation and the score value, and optimize the network parameters of the discriminator according to the correlation.
In one embodiment, the cluster training module 810 is further configured to obtain a mutual information loss function and weight, a prior distribution loss function and weight, and a category difference loss function; generate a corresponding t-SNE map from the sample features and select the weight of the category difference loss function according to that map; generate the total loss function of the clustering model from the mutual information loss function and weight, the prior distribution loss function and weight, and the category difference loss function and weight; and optimize the network parameters of the clustering model using the total loss function of the clustering model.
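Combining the three weighted terms, with scikit-learn's t-SNE used to visualize the sample features when choosing the category-difference weight (the manual inspection step itself cannot be coded); the weights shown are placeholder assumptions.

```python
from sklearn.manifold import TSNE

def total_loss(mi_term, prior_term, diff_term,
               w_mi=1.0, w_prior=1.0, w_diff=1.0):
    # Weighted sum of the mutual information, prior distribution
    # and category difference loss terms.
    return w_mi * mi_term + w_prior * prior_term + w_diff * diff_term

def tsne_map(sample_feats):
    # 2-D embedding of the sample features, inspected to pick w_diff.
    return TSNE(n_components=2).fit_transform(sample_feats)
```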
In one embodiment, as shown in fig. 10, there is provided a processing apparatus for data clustering, which may be implemented, as all or part of a computer device, by software modules, hardware modules, or a combination of both. The apparatus specifically includes: a second obtaining module 1002, a feature encoding module 1004, and a clustering module 1006, wherein:
the second obtaining module 1002 is configured to obtain data to be clustered in the clustering service.
The feature encoding module 1004 is configured to encode the data to be clustered into data features through an encoder trained at least according to the correlation and the score value. The correlation is the result of discriminating, through a discriminator, whether a data sample is correlated with the sample features obtained by encoding that data sample with the encoder; the score value is the result of scoring, through an evaluator, the degree to which the sample features obey the prior distribution. The sample features comprise sample category features and intra-sample-class style features; the prior distribution comprises a category prior distribution corresponding to the sample category features and an intra-class style prior distribution corresponding to the intra-sample-class style features.
The clustering module 1006 is configured to cluster the corresponding data to be clustered according to the category features in the data features.
Further, the encoder may be obtained by training in the manner provided in the above embodiments.
For specific limitations of the processing apparatus for data clustering, reference may be made to the above limitations on the processing method for data clustering, which are not repeated here. Each module in the processing apparatus for data clustering can be realized wholly or partially by software, hardware, or a combination thereof. The modules can be embedded, in hardware form, in or independently of a processor in the computer device, or stored, in software form, in the memory of the computer device, so that the processor can invoke and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 11. The computer device comprises a processor, a memory, a communication interface, a display screen and an input device which are connected through a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The communication interface of the computer device is used for communicating with an external terminal in a wired or wireless manner, and the wireless manner can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a processing method of data clustering. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 11 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, combine certain components, or have a different arrangement of components.
In an embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, in which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments express only several implementations of the present application, and although their description is relatively specific and detailed, they are not to be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these fall within the scope of protection of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (20)

1. A method for processing data clusters, the method comprising:
acquiring a data sample; the data samples are samples of clustering objects in clustering services; the data type of the data sample comprises at least one of an image, text or voice;
mapping the data sample into a sample characteristic through a clustering model matched with the data type of the data sample; the sample features comprise sample class features and sample intra-class style features;
determining a correlation of the data sample and the sample features; the correlation is used for characterizing whether the data sample is correlated with the sample characteristic;
determining a score value for which the sample feature obeys a prior distribution; the prior distribution comprises category prior distribution corresponding to the sample category characteristics and intra-class style prior distribution corresponding to the intra-sample style characteristics;
adjusting the clustering model based at least on the relevance and the score value;
clustering the data to be clustered in the clustering service by using the adjusted clustering model; and the data type of the data to be clustered is matched with the adjusted clustering model.
2. The method of claim 1, further comprising:
enhancing the data sample, and mapping to obtain enhanced sample characteristics through the clustering model; the enhanced sample features comprise enhanced sample class features and enhanced sample intra-class style features;
determining a class feature difference between the sample class feature and the enhanced sample class feature;
said adjusting said clustering model based on at least said relevance and said score value comprises:
and adjusting the clustering model according to the relevance, the category characteristic difference and the score value.
3. The method of claim 1, wherein the data samples comprise a first sample and a second sample; the determining the correlation of the data sample and the sample characteristic comprises:
obtaining a first sample vector, and splicing the first sample vector by using the sample characteristics of the first sample to generate a spliced first sample vector;
splicing the sample characteristics of the second sample with the first sample vector to generate a spliced second sample vector;
and identifying the correlation between the spliced first sample vector and the spliced second sample vector through a discriminator to obtain the correlation between the sample characteristics of the first sample and the first sample.
4. The method of claim 1, wherein said determining a score value that the sample feature obeys a prior distribution comprises:
determining a category prior distribution result corresponding to the sample category characteristics through an evaluator;
determining an intra-class style prior distribution result corresponding to the intra-class style characteristics of the sample through the evaluator;
and scoring the category prior distribution result and the intra-class style prior distribution result through the evaluator to obtain a score value of the sample feature obeying the prior distribution.
5. The method of claim 4, wherein the intra-class style prior distribution comprises a Gaussian distribution, and wherein the scoring, through the evaluator, of the category prior distribution result and the intra-class style prior distribution result comprises:
splicing the class distribution vector of the sample class characteristics with the Gaussian distribution vector of the style characteristics in the sample class to generate a prior distribution vector;
and scoring the prior distribution vector through the evaluator to obtain a score value of the sample characteristic obeying to the prior distribution.
6. The method of claim 1, further comprising:
determining, by a discriminator, a correlation of the data sample and the sample features;
determining, by an evaluator, a score value for which the sample feature obeys a prior distribution;
said adjusting said clustering model based on at least said relevance and said score value comprises:
alternately optimizing the clustering model, the discriminator, and the evaluator based at least on the correlation and the score.
7. The method of claim 6, wherein the alternately optimizing the clustering model, the discriminator, and the evaluator at least according to the correlation and the score comprises:
firstly, optimizing the network parameters of the evaluator at least once according to the scoring value;
and optimizing the network parameters of the clustering model at least according to the correlation and the score value, and optimizing the network parameters of the discriminator according to the correlation.
8. The method of claim 6, further comprising:
obtaining mutual information loss functions and weights, prior distribution loss functions and weights and category difference loss functions;
generating a corresponding visual dimension reduction graph by using the sample characteristics, and selecting the weight of the category difference loss function according to the visual dimension reduction graph;
generating a total loss function of the clustering model by using the mutual information loss function and weight, the prior distribution loss function and weight and the category difference loss function and weight;
and optimizing the network parameters of the clustering model by using the total loss function of the clustering model.
9. A method for processing data clusters, the method comprising:
acquiring data to be clustered in clustering services; the data type of the data to be clustered comprises at least one of images, texts or voice;
encoding the data to be clustered into data characteristics through an encoder matched with the data type of the data to be clustered; the encoder is obtained by training at least according to the relevance and the score value; the correlation is a result of carrying out correlation discrimination on the data sample and the sample characteristic through a discriminator by using the sample characteristic obtained by encoding the data sample which belongs to the same data type as the data to be clustered by a data encoder; the scoring value is a scoring result of subjecting the sample characteristics to prior distribution through an evaluator; the sample features comprise sample class features and sample intra-class style features; the prior distribution comprises category prior distribution corresponding to the sample category characteristics and intra-class style prior distribution corresponding to the sample intra-class style characteristics;
and clustering corresponding data to be clustered according to the category characteristics in the data characteristics.
10. A processing apparatus for clustering data, the apparatus comprising:
the first acquisition module is used for acquiring a data sample; the data samples are samples of clustering objects in clustering services; the data type of the data sample comprises at least one of image, text or voice;
the characteristic mapping module is used for mapping the data sample into a sample characteristic through a clustering model matched with the data type of the data sample; the sample features comprise sample class features and sample intra-class style features;
a correlation identification module for determining a correlation of the data sample and the sample features; the correlation is used for characterizing whether the data sample is correlated with the sample characteristic or not;
a prior distribution scoring module for determining a score value at which the sample feature obeys a prior distribution; the prior distribution comprises category prior distribution corresponding to the sample category characteristics and intra-class style prior distribution corresponding to the intra-sample style characteristics;
a cluster training module for adjusting the cluster model at least according to the relevance and the score value; clustering the data to be clustered in the clustering service by using the adjusted clustering model; and the data type of the data to be clustered is matched with the adjusted clustering model.
11. The apparatus of claim 10, further comprising:
the enhancement processing module is used for carrying out enhancement processing on the data sample;
the characteristic mapping module is also used for mapping to obtain enhanced sample characteristics through the clustering model; the enhanced sample features comprise enhanced sample category features and enhanced sample in-class style features;
a feature difference identification module for determining a class feature difference between the sample class feature and the enhanced sample class feature;
the cluster training module is further used for adjusting the cluster model according to the relevance, the category feature difference and the score value.
12. The apparatus of claim 10, wherein the data samples comprise a first sample and a second sample;
the correlation identification module is further configured to obtain a first sample vector, and splice the first sample vector with the sample features of the first sample to generate a spliced first sample vector; splicing the sample characteristics of the second sample with the first sample vector to generate a spliced second sample vector; and identifying the correlation between the spliced first sample vector and the spliced second sample vector through a discriminator to obtain the correlation between the sample characteristics of the first sample and the first sample.
13. The apparatus of claim 10,
the prior distribution scoring module is further used for determining a category prior distribution result corresponding to the sample category feature through an evaluator; determining an intra-class style prior distribution result corresponding to the intra-class style features of the sample through the evaluator; and scoring the category prior distribution result and the intra-class style prior distribution result through the evaluator to obtain a score value of the sample feature obeying the prior distribution.
14. The apparatus of claim 13, wherein the intra-class style prior distribution comprises a Gaussian distribution;
the prior distribution scoring module is further used for splicing the class distribution vector of the sample class characteristics with the Gaussian distribution vector of the style characteristics in the sample class to generate a prior distribution vector; and scoring the prior distribution vector through the evaluator to obtain a score value of the sample characteristic obeying to the prior distribution.
15. The apparatus of claim 10,
the correlation identification module is further used for determining the correlation between the data sample and the sample characteristics through a discriminator;
the prior distribution scoring module is further used for determining a scoring value of the sample characteristic obeying to the prior distribution through an evaluator;
the cluster training module is further used for performing alternate optimization on the cluster model, the discriminator and the evaluator at least according to the correlation and the score.
16. The apparatus of claim 15,
the cluster training module is further used for optimizing the network parameters of the evaluator at least once according to the score value; and optimizing the network parameters of the clustering model at least according to the correlation and the score value, and optimizing the network parameters of the discriminator according to the correlation.
17. The apparatus of claim 15,
the clustering training module is also used for acquiring a mutual information loss function and weight, a prior distribution loss function and weight and a category difference loss function; generating a corresponding visual dimension reduction graph by using the sample characteristics, and selecting the weight of the category difference loss function according to the visual dimension reduction graph; generating a total loss function of the clustering model by using the mutual information loss function and weight, the prior distribution loss function and weight and the category difference loss function and weight; and optimizing the network parameters of the clustering model by using the total loss function of the clustering model.
18. An apparatus for clustering data, the apparatus comprising:
the second acquisition module is used for acquiring data to be clustered in the clustering service; the data type of the data to be clustered comprises at least one of images, texts or voice;
the characteristic coding module is used for coding the data to be clustered into data characteristics through a coder matched with the data type of the data to be clustered; the encoder is trained according to at least the relevance and the score value; the correlation is a result obtained by encoding a data sample which belongs to the same data type as the data to be clustered by a data encoder to obtain a sample characteristic, and performing correlation discrimination between the data sample and the sample characteristic by a discriminator; the scoring value is a scoring result of subjecting the sample characteristics to prior distribution through an evaluator; the sample features comprise sample class features and sample intra-class style features; the prior distribution comprises category prior distribution corresponding to the sample category characteristics and intra-class style prior distribution corresponding to the intra-sample style characteristics;
and the clustering module is used for clustering corresponding data to be clustered according to the category characteristics in the data characteristics.
19. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor realizes the steps of the method of any one of claims 1 to 9 when executing the computer program.
20. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 9.
CN202010400391.1A 2020-05-13 2020-05-13 Data clustering processing method and device, computer equipment and storage medium Active CN111598153B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010400391.1A CN111598153B (en) 2020-05-13 2020-05-13 Data clustering processing method and device, computer equipment and storage medium


Publications (2)

Publication Number Publication Date
CN111598153A CN111598153A (en) 2020-08-28
CN111598153B true CN111598153B (en) 2023-02-24

Family

ID=72188754

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010400391.1A Active CN111598153B (en) 2020-05-13 2020-05-13 Data clustering processing method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111598153B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11830476B1 (en) * 2021-06-08 2023-11-28 Amazon Technologies, Inc. Learned condition text-to-speech synthesis
CN115083442B (en) * 2022-04-29 2023-08-08 马上消费金融股份有限公司 Data processing method, device, electronic equipment and computer readable storage medium
CN116361671B (en) * 2023-06-01 2023-08-22 浪潮通用软件有限公司 Post-correction-based high-entropy KNN clustering method, equipment and medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8024193B2 (en) * 2006-10-10 2011-09-20 Apple Inc. Methods and apparatus related to pruning for concatenative text-to-speech synthesis
JP5346756B2 (en) * 2009-09-25 2013-11-20 Kddi株式会社 Image classification device
EP3035274A1 (en) * 2014-12-17 2016-06-22 Tata Consultancy Services Limited Interpretation of a dataset
CN110020078B (en) * 2017-12-01 2021-08-20 北京搜狗科技发展有限公司 Method and related device for generating relevance mapping dictionary and verifying relevance
CN109145978A (en) * 2018-08-15 2019-01-04 大连海事大学 A kind of weak relevant cluster method of the feature of shoe sole print image

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109871896A (en) * 2019-02-26 2019-06-11 北京达佳互联信息技术有限公司 Data classification method, device, electronic equipment and storage medium
CN110490306A (en) * 2019-08-22 2019-11-22 北京迈格威科技有限公司 A kind of neural metwork training and object identifying method, device and electronic equipment

Also Published As

Publication number Publication date
CN111598153A (en) 2020-08-28


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40027938; Country of ref document: HK)
GR01 Patent grant