CN111160468B - Data processing method and device, processor, electronic equipment and storage medium - Google Patents


Info

Publication number
CN111160468B
Authority
CN
China
Prior art keywords
trained
cluster
purity
data
pair
Prior art date
Legal status
Active
Application number
CN201911395340.8A
Other languages
Chinese (zh)
Other versions
CN111160468A (en)
Inventor
黄厚钧
何悦
李�诚
王贵杰
王子彬
Current Assignee
Shenzhen Sensetime Technology Co Ltd
Original Assignee
Shenzhen Sensetime Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Sensetime Technology Co Ltd filed Critical Shenzhen Sensetime Technology Co Ltd
Priority to CN201911395340.8A
Publication of CN111160468A
Application granted
Publication of CN111160468B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The application discloses a data processing method and device, a processor, electronic equipment and a storage medium. The method comprises the following steps: acquiring a data set to be processed and a clustering network, wherein the clustering network is obtained through training with the purity of cluster pairs as supervision information, the purity of a cluster pair represents the purity of the reference category in the cluster pair, and the reference category is the category with the largest amount of data in the cluster pair; and processing the data set to be processed by using the clustering network to obtain a clustering result of the data set to be processed.

Description

Data processing method and device, processor, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a data processing method and apparatus, a processor, an electronic device, and a storage medium.
Background
Clustering is one of the key technologies in fields such as data mining and machine learning. A clustering process partitions similar data in a dataset into the same cluster and dissimilar data into different clusters. Current clustering technology mainly comprises several families of methods, such as partitional clustering, agglomerative (merging) clustering, density-based clustering, grid clustering and spectral clustering. The agglomerative clustering method is widely applied in various fields because its concept is simple, its results are easy to interpret, and its clustering results have an obvious hierarchical structure. However, the traditional agglomerative clustering method has low merging accuracy.
Disclosure of Invention
The application provides a data processing method and device, a processor, electronic equipment and a storage medium.
In a first aspect, a data processing method is provided, the method comprising:
acquiring a data set to be processed and a clustering network; the clustering network is obtained by taking the purity of cluster pairs as supervision information; the purity of the cluster pairs is used for representing the purity of reference categories in the cluster pairs, wherein the reference categories are the categories with the largest amount of data in the cluster pairs;
and processing the data set to be processed by using the clustering network to obtain a clustering result of the data set to be processed.
In this aspect, the clustering network, which is obtained through training with purity as the supervision information, is used to process the data set to be processed, so that the purity of the cluster pairs in the data set to be processed can be obtained, and a merging order of the cluster pairs can be derived from their purities. Merging the cluster pairs in the data set to be processed according to this merging order reduces the probability of erroneous merging and thereby improves the merging accuracy.
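The merging order described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: `score_pair` is a hypothetical stand-in for the trained clustering network's purity prediction (here it simply computes the true purity of the merged pair), and the candidate pairs are toy label lists.

```python
def score_pair(pair):
    # Stands in for the trained clustering network's predicted purity of
    # a cluster pair; here it computes the reference-category ratio of
    # the merged pair directly.
    cluster_a, cluster_b = pair
    merged = cluster_a + cluster_b
    counts = {}
    for label in merged:
        counts[label] = counts.get(label, 0) + 1
    return max(counts.values()) / len(merged)

def merge_order(candidate_pairs):
    """Sort candidate cluster pairs so the purest pair is merged first."""
    return sorted(candidate_pairs, key=score_pair, reverse=True)

pairs = [(["a", "b"], ["b", "c"]), (["a", "a"], ["a", "b"])]
ordered = merge_order(pairs)   # the purer pair comes first
```

Merging greedily in this order postpones low-purity (likely erroneous) merges, which is the effect the paragraph above attributes to the purity supervision.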
In combination with any embodiment of the present application, training to obtain the clustering network by using the purity of the cluster pairs as the supervision information includes:
acquiring a network to be trained, a first cluster pair to be trained and a second cluster pair to be trained;
processing the first cluster pair to be trained and the second cluster pair to be trained through the network to be trained to obtain a first probability of merging the first cluster pair to be trained first;
obtaining a second probability of merging the first cluster pair to be trained first according to the first purity of the first cluster pair to be trained and the second purity of the second cluster pair to be trained;
obtaining the loss of the network to be trained according to the difference between the first probability and the second probability;
and adjusting parameters of the network to be trained based on the loss to obtain the clustering network.
In the training process of the network to be trained, the purity of the cluster pairs to be trained is used as the supervision information of the network to be trained, so that the clustering network obtained through training has the capability of determining whether to merge the cluster pairs according to the purity of the cluster pairs, and the accuracy of merging the cluster pairs of the clustering network is further improved.
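The training steps above can be sketched as follows. The squared-error loss and the single-parameter logistic "network" are illustrative assumptions: the text only requires a loss built from the difference between the first probability (the network's prediction) and the second probability (the purity-derived target).

```python
import math

def loss(first_probability, second_probability):
    # Step 4: the loss measures the difference between the network's
    # prediction and the purity-derived target (squared error assumed).
    return (first_probability - second_probability) ** 2

# Step 5 sketch: adjust a single parameter w of a toy logistic "network"
# whose prediction is p1 = sigmoid(w * x), pulling it toward target p2.
w, x, p2, lr = 0.0, 1.0, 1.0, 0.5
for _ in range(200):
    p1 = 1.0 / (1.0 + math.exp(-w * x))
    grad = 2 * (p1 - p2) * p1 * (1 - p1) * x   # d(loss)/dw by the chain rule
    w -= lr * grad
final_p1 = 1.0 / (1.0 + math.exp(-w * x))      # prediction after training
```

After the updates the toy prediction moves close to the target, mirroring how the clustering network is driven toward the purity-based supervision.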
In combination with any one of the embodiments of the present application, before the second probability of merging the first cluster pair first is obtained according to the first purity of the first cluster pair to be trained and the second purity of the second cluster pair to be trained, the method further includes:
determining the quantity of data contained in at least one category of the first cluster to be trained as a first quantity set according to the labeling data of the data in the first cluster to be trained;
determining the quantity of data contained in at least one category of the second cluster to be trained as a second quantity set according to the labeling data of the data in the second cluster to be trained; the labeling data carries class information of the data;
obtaining the first purity according to the maximum value in the first quantity set and the quantity of data in the first cluster to be trained;
and obtaining the second purity according to the maximum value in the second quantity set and the quantity of data in the second cluster to be trained.
A first quantity set and a second quantity set are determined according to the categories of the data in the first cluster pair to be trained and the second cluster pair to be trained. The first purity, i.e. the purity of the reference category in the first cluster pair to be trained, is then determined according to the maximum value in the first quantity set and the quantity of data in the first cluster pair to be trained, and the second purity is determined in the same way from the maximum value in the second quantity set and the quantity of data in the second cluster pair to be trained.
In combination with any one of the embodiments of the present application, the obtaining the first purity according to the maximum value in the first number set and the number of the data in the first cluster to be trained, and obtaining the second purity according to the maximum value in the second number set and the number of the data in the second cluster to be trained includes:
taking the ratio of the maximum value in the first quantity set to the quantity of data in the first cluster to be trained as the first purity;
and taking the ratio of the maximum value in the second quantity set to the quantity of data in the second cluster to be trained as the second purity.
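As a minimal sketch of this ratio (the function name and the toy label lists are assumptions), the purity is simply the size of the most frequent category divided by the pair's total amount of data:

```python
from collections import Counter

def cluster_pair_purity(labels):
    """Purity of a cluster pair: the ratio of the size of the reference
    category (the most frequent category) to the total amount of data."""
    counts = Counter(labels)              # the "quantity set" of the pair
    return max(counts.values()) / len(labels)

first_purity = cluster_pair_purity(["a", "a", "a", "b"])        # 3 / 4
second_purity = cluster_pair_purity(["a", "b", "b", "b", "b"])  # 4 / 5
```

A purity of 1.0 means every datum in the pair shares one category, i.e. the merge is certainly correct under the labeling.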
In combination with any one of the embodiments of the present application, before the second probability of merging the first cluster pair first is obtained according to the first purity of the first cluster pair to be trained and the second purity of the second cluster pair to be trained, the method further includes:
determining the quantity of data contained in each category in the first cluster pair to be trained as a third quantity set according to the labeling data of the data in the first cluster pair to be trained, and determining the quantity of data contained in each category in the second cluster pair to be trained as a fourth quantity set according to the labeling data of the data in the second cluster pair to be trained; the labeling data carries class information of the data;
and obtaining the first purity according to the elements in the third quantity set and the quantity of data in the first cluster pair to be trained, and obtaining the second purity according to the elements in the fourth quantity set and the quantity of data in the second cluster pair to be trained.
A third quantity set and a fourth quantity set are determined according to the categories of the data in the first cluster pair to be trained and the second cluster pair to be trained. The first purity, i.e. the purity of the categories in the first cluster pair to be trained, is then determined from the elements in the third quantity set and the quantity of data in the first cluster pair to be trained, and the second purity from the elements in the fourth quantity set and the quantity of data in the second cluster pair to be trained.
In combination with any one of the embodiments of the present application, the obtaining the first purity according to the elements in the third quantity set and the quantity of data in the first cluster pair to be trained, and obtaining the second purity according to the elements in the fourth quantity set and the quantity of data in the second cluster pair to be trained includes:
determining the sum of squares of each element in the third quantity set to obtain a first intermediate number, and determining the sum of squares of each element in the fourth quantity set to obtain a second intermediate number;
determining the square of the number of the data in the first cluster to be trained to obtain a third intermediate number, and determining the square of the number of the data in the second cluster to be trained to obtain a fourth intermediate number;
the ratio of the first intermediate number to the third intermediate number is taken as the first purity, and the ratio of the second intermediate number to the fourth intermediate number is taken as the second purity.
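The squared-count measure above can be sketched as follows (function name assumed). The ratio equals the probability that two data points drawn independently and uniformly from the pair share a category, so it rewards pairs dominated by a single category:

```python
def squared_count_purity(labels):
    """Purity based on squared category counts: (sum of squared category
    sizes) / (squared total size)."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    intermediate = sum(n * n for n in counts.values())  # first/second intermediate number
    return intermediate / len(labels) ** 2              # third/fourth intermediate number

purity = squared_count_purity(["a", "a", "b"])  # (4 + 1) / 9
```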
In combination with any one of the embodiments of the present application, before the second probability of merging the first cluster pair first is obtained according to the first purity of the first cluster pair to be trained and the second purity of the second cluster pair to be trained, the method further includes:
determining the quantity of data contained in at least one category of the first cluster to be trained as a fifth quantity set according to the labeling data of the data in the first cluster to be trained;
determining the quantity of data contained in each category in the second cluster to be trained as a sixth quantity set according to the labeling data of the data in the second cluster to be trained;
obtaining the first purity according to the maximum value in the fifth quantity set and the quantity of data in the first cluster to be trained;
and obtaining the second purity according to the elements in the sixth quantity set and the quantity of data in the second cluster pair to be trained.
The first purity obtained in this embodiment focuses more on reflecting the proportion of data belonging to the reference category, while the second purity obtained in this embodiment focuses more on reflecting the overall category purity of the second cluster pair to be trained.
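A sketch of this mixed scheme (function names and toy labels are assumptions): the first cluster pair is scored by the reference-category ratio, while the second is scored by the squared-count measure:

```python
def max_count_purity(labels):
    # first purity: size of the reference category over total size
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return max(counts.values()) / len(labels)

def squared_count_purity(labels):
    # second purity: sum of squared category sizes over squared total size
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return sum(n * n for n in counts.values()) / len(labels) ** 2

first_purity = max_count_purity(["a", "a", "b"])        # 2 / 3
second_purity = squared_count_purity(["a", "a", "b"])   # (4 + 1) / 9
```

On the same labels the squared-count measure is never larger than the reference-category ratio, which is why it reflects overall category purity rather than the dominant category alone.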
In combination with any one of the embodiments of the present application, the obtaining, according to the first purity of the first cluster pair to be trained and the second purity of the second cluster pair to be trained, the second probability of merging the first cluster pair to be trained first includes:
determining that the second probability is a first value if the first purity is greater than the second purity;
determining that the second probability is a second value if the first purity is equal to the second purity;
and determining that the second probability is a third value if the first purity is less than the second purity.
The second probability is determined according to the magnitude relation between the first purity and the second purity, so that through training the network to be trained can acquire the capability of determining whether to merge cluster pairs according to the purity of the cluster pairs.
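The three-way rule above can be sketched as follows. The three fixed values (1.0, 0.5, 0.0) are illustrative assumptions; the text only requires three distinct values:

```python
def second_probability(first_purity, second_purity):
    # The three fixed values below are illustrative assumptions.
    if first_purity > second_purity:
        return 1.0   # first value: prefer merging the first pair first
    if first_purity == second_purity:
        return 0.5   # second value: no preference between the pairs
    return 0.0       # third value: prefer merging the second pair first

p = second_probability(0.75, 0.5)
```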
In combination with any one of the embodiments of the present application, the obtaining, according to the first purity of the first cluster pair to be trained and the second purity of the second cluster pair to be trained, the second probability of merging the first cluster pair to be trained first includes:
determining a difference between the first purity and the second purity to obtain a fourth value;
determining that the second probability is a fifth value when the fourth value is within the first value range;
determining that the second probability is a sixth value when the fourth value is within a second value range; the fifth value and the sixth value are both non-negative numbers less than or equal to 1, and the fifth value is different from the sixth value; there is no intersection between the first range of values and the second range of values.
The second probability is determined according to the relation between the fourth value and the value ranges (comprising the first value range and the second value range), so that through training the network to be trained can acquire the capability of determining whether to merge cluster pairs according to the purity of the cluster pairs.
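A sketch of this range-based rule; the concrete non-intersecting ranges and the fifth/sixth values below are assumptions, since the text leaves them open:

```python
def second_probability_from_difference(first_purity, second_purity):
    fourth_value = first_purity - second_purity   # difference of purities
    # Two illustrative, non-intersecting value ranges (an assumption):
    if fourth_value >= 0:   # first value range: [0, 1]
        return 1.0          # fifth value
    return 0.0              # sixth value, for the second range [-1, 0)

p = second_probability_from_difference(0.9, 0.4)
```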
In combination with any one of the embodiments of the present application, the processing, by the network to be trained, the first cluster to be trained and the second cluster to be trained to obtain a first probability of merging the first cluster to be trained first includes:
processing the first cluster pair to be trained and the second cluster pair to be trained through the network to be trained to obtain a third probability of merging the first cluster pair to be trained and a fourth probability of merging the second cluster pair to be trained;
and obtaining the first probability according to the third probability and the fourth probability.
The third probability and the fourth probability are obtained through the network to be trained, and the first probability is then obtained according to the third probability and the fourth probability, so that the probability of merging the first cluster pair to be trained first is obtained through the network to be trained.
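The text does not fix how the first probability is derived from the third and fourth probabilities; one plausible reading (an assumption, not the patent's stated formula) is a softmax normalization of the two per-pair merge scores:

```python
import math

def first_probability(third_probability, fourth_probability):
    # Assumption: normalize the two per-pair merge scores with a softmax
    # so they form a distribution over which pair should be merged first.
    e3 = math.exp(third_probability)
    e4 = math.exp(fourth_probability)
    return e3 / (e3 + e4)

p = first_probability(1.0, 1.0)   # equal scores -> no preference
```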
In a second aspect, there is provided a data processing apparatus, the apparatus comprising:
the acquisition unit is used for acquiring the data set to be processed and the clustering network; the clustering network is obtained through training with the purity of cluster pairs as supervision information; the purity of a cluster pair is used for representing the purity of a reference class in the cluster pair, wherein the reference class is the class with the largest amount of data in the cluster pair;
and the processing unit is used for processing the data set to be processed by using the clustering network to obtain a clustering result of the data set to be processed.
In combination with any embodiment of the present application, training to obtain the clustering network by using the purity of the cluster pairs as the supervision information includes:
acquiring a network to be trained, a first cluster pair to be trained and a second cluster pair to be trained;
processing the first cluster pair to be trained and the second cluster pair to be trained through the network to be trained to obtain a first probability of merging the first cluster pair to be trained;
obtaining a second probability of merging the first cluster pair to be trained first according to the first purity of the first cluster pair to be trained and the second purity of the second cluster pair to be trained;
obtaining the loss of the network to be trained according to the difference between the first probability and the second probability;
and adjusting parameters of the network to be trained based on the loss to obtain the clustering network.
In combination with any of the embodiments of the present application, the apparatus is further configured to:
before the second probability of merging the first cluster pair is obtained according to the first purity of the first cluster pair to be trained and the second purity of the second cluster pair to be trained, determining the quantity of data contained in at least one category of the first cluster pair to be trained as a first quantity set according to the labeling data of the data in the first cluster pair to be trained;
determining the quantity of data contained in at least one category of the second cluster to be trained as a second quantity set according to the labeling data of the data in the second cluster to be trained; the labeling data carries class information of the data;
obtaining the first purity according to the maximum value in the first quantity set and the quantity of data in the first cluster to be trained;
and obtaining the second purity according to the maximum value in the second quantity set and the quantity of data in the second cluster to be trained.
In combination with any one of the embodiments of the present application, the obtaining the first purity according to the maximum value in the first number set and the number of the data in the first cluster to be trained, and obtaining the second purity according to the maximum value in the second number set and the number of the data in the second cluster to be trained includes:
taking the ratio of the maximum value in the first quantity set to the quantity of data contained in the first cluster to be trained as the first purity;
and taking the ratio of the maximum value in the second quantity set to the quantity of data contained in the second cluster to be trained as the second purity.
In combination with any of the embodiments of the present application, the apparatus is further configured to:
before the second probability of merging the first cluster pair to be trained first is obtained according to the first purity of the first cluster pair to be trained and the second purity of the second cluster pair to be trained, determining the quantity of data contained in each category in the first cluster pair to be trained as a third quantity set according to the labeling data of the data in the first cluster pair to be trained, and determining the quantity of data contained in each category in the second cluster pair to be trained as a fourth quantity set according to the labeling data of the data in the second cluster pair to be trained; the labeling data carries class information of the data;
and obtaining the first purity according to the elements in the third quantity set and the quantity of data in the first cluster pair to be trained, and obtaining the second purity according to the elements in the fourth quantity set and the quantity of data in the second cluster pair to be trained.
In combination with any one of the embodiments of the present application, the obtaining the first purity according to the elements in the third quantity set and the quantity of data in the first cluster pair to be trained, and obtaining the second purity according to the elements in the fourth quantity set and the quantity of data in the second cluster pair to be trained includes:
determining the sum of squares of each element in the third quantity set to obtain a first intermediate number, and determining the sum of squares of each element in the fourth quantity set to obtain a second intermediate number;
determining the square of the number of the data in the first cluster to be trained to obtain a third intermediate number, and determining the square of the number of the data in the second cluster to be trained to obtain a fourth intermediate number;
the ratio of the first intermediate number to the third intermediate number is taken as the first purity, and the ratio of the second intermediate number to the fourth intermediate number is taken as the second purity.
In combination with any of the embodiments of the present application, the apparatus is further configured to:
before the second probability of merging the first cluster pair is obtained according to the first purity of the first cluster pair to be trained and the second purity of the second cluster pair to be trained, determining the quantity of data contained in at least one category of the first cluster pair to be trained as a fifth quantity set according to the labeling data of the data in the first cluster pair to be trained;
determining the quantity of data contained in each category in the second cluster to be trained as a sixth quantity set according to the labeling data of the data in the second cluster to be trained;
obtaining the first purity according to the maximum value in the fifth quantity set and the quantity of data in the first cluster to be trained;
and obtaining the second purity according to the elements in the sixth quantity set and the quantity of data in the second cluster pair to be trained.
In combination with any one of the embodiments of the present application, the obtaining, according to the first purity of the first cluster pair to be trained and the second purity of the second cluster pair to be trained, the second probability of merging the first cluster pair to be trained first includes:
determining that the second probability is a first value if the first purity is greater than the second purity;
determining that the second probability is a second value if the first purity is equal to the second purity;
and determining that the second probability is a third value if the first purity is less than the second purity.
In combination with any one of the embodiments of the present application, the obtaining, according to the first purity of the first cluster pair to be trained and the second purity of the second cluster pair to be trained, the second probability of merging the first cluster pair to be trained first includes:
determining a difference between the first purity and the second purity to obtain a fourth value;
determining that the second probability is a fifth value when the fourth value is within the first value range;
determining that the second probability is a sixth value when the fourth value is within a second value range; the fifth value and the sixth value are both non-negative numbers less than or equal to 1, and the fifth value is different from the sixth value; there is no intersection between the first range of values and the second range of values.
In combination with any one of the embodiments of the present application, the processing, by the network to be trained, the first cluster to be trained and the second cluster to be trained to obtain a first probability of merging the first cluster to be trained first includes:
processing the first cluster pair to be trained and the second cluster pair to be trained through the network to be trained to obtain a third probability of merging the first cluster pair to be trained and a fourth probability of merging the second cluster pair to be trained;
and obtaining the first probability according to the third probability and the fourth probability.
In a third aspect, a processor is provided for performing the method of the first aspect and any one of its possible implementation manners described above.
In a fourth aspect, there is provided an electronic device comprising: a processor, a transmitting means, an input means, an output means and a memory for storing computer program code comprising computer instructions which, when executed by the processor, cause the electronic device to perform the method as described in the first aspect and any one of its possible implementation manners.
In a fifth aspect, a computer readable storage medium is provided, in which a computer program is stored, the computer program comprising program instructions which, when executed by a processor of an electronic device, cause the processor to carry out a method as in the first aspect and any one of the possible implementations thereof.
In a sixth aspect, there is provided a computer program product comprising a computer program or instructions which, when run on a computer, cause the computer to perform the method of the first aspect and any one of its possible implementations.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
In order to describe the technical solutions in the embodiments or the background of the present application more clearly, the following briefly describes the drawings required in the embodiments or the background of the present application.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the technical aspects of the application.
Fig. 1 is a schematic flow chart of a data processing method according to an embodiment of the present application;
fig. 2 is a flow chart of a network training method according to an embodiment of the present application;
FIG. 3 is a flowchart illustrating another data processing method according to an embodiment of the present disclosure;
FIG. 4 is a flow chart of a method for determining a first purity according to an embodiment of the present application;
FIG. 5 is a flow chart of another method for determining a first purity according to an embodiment of the present application;
FIG. 6 is a flow chart of a method for determining a first purity and a second purity provided in an embodiment of the present application;
FIG. 7 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;
fig. 8 is a schematic hardware structure of a data processing apparatus according to an embodiment of the present application.
Detailed Description
In order to make the present application solution better understood by those skilled in the art, the following description will clearly and completely describe the technical solution in the embodiments of the present application with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
The terms first, second and the like in the description and in the claims of the present application and in the above-described figures, are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
Embodiments of the present application are described below with reference to the accompanying drawings in the embodiments of the present application.
Referring to fig. 1, fig. 1 is a flow chart of a data processing method according to embodiment (one) of the present application.
101. Acquiring a data set to be processed and a clustering network.
The execution body of the embodiment is a first terminal. The first terminal may be a server, a mobile phone, a computer, a tablet computer, etc.
In the embodiment of the application, the data in the data set to be processed may be vector data. The vector data may be sentence vectors, feature vectors of images, or feature vectors of audio.
The clustering network is a network having the function of clustering the data in the data set to be processed. For example, the clustering network may be built by stacking or composing network layers such as convolutional layers, pooling layers, normalization layers, fully connected layers, downsampling layers, upsampling layers, and classifiers in a certain manner. The structure of the clustering network is not limited in the present application.
In one possible implementation, the clustering network comprises a plurality of convolutional layers, a pooling layer, and a fully connected layer. Clustering of the data set to be processed can be completed by sequentially performing convolution processing and normalization processing on the data set to be processed through the plurality of convolutional layers, the pooling layer, and the fully connected layer in the clustering network, thereby obtaining a clustering result of the data set to be processed.
In the embodiment of the application, the clustering network is obtained by training with the purity of the classes of the data in cluster pairs as supervision information. A cluster pair includes at least two clusters, and the number of data categories contained in a cluster pair is at least 1. Clearly, the fewer the data categories in a cluster pair, the higher the accuracy of merging the at least two clusters in the cluster pair. For example, cluster pair 1 includes cluster A and cluster B, where the class of the data in cluster A is a and the class of the data in cluster B is b; then cluster C, obtained by combining cluster A and cluster B, contains both data of class a and data of class b. In this case it is not reasonable to determine the category of cluster C as either a or b, i.e., the accuracy of merging cluster A and cluster B is low.
In addition, if the class with the largest amount of data in a cluster pair is referred to as the reference class, then the higher the ratio of the amount of data in the reference class to the amount of data in the cluster pair, the higher the accuracy of merging the at least two clusters in the cluster pair. For example (example 1), cluster pair 1 includes cluster A and cluster B, where the class of the data in cluster A is a, the class of the data in cluster B is b, the number of data in cluster A is 10000, and the number of data in cluster B is 1. Cluster C, obtained by combining cluster A and cluster B, then contains 10001 pieces of data. Although cluster C contains both data of class a and data of class b, the number of data of class a is much larger than the number of data of class b; determining the category of cluster C as a leaves only 1 piece of data with a wrong class while 10000 pieces of data have the correct class. The accuracy of the categories of the data in the merged cluster is therefore still high (10000/10001 = 99.99%), i.e., the accuracy of merging cluster A and cluster B is high.
Before proceeding with the following explanation, the purity of a cluster pair is first defined. In the embodiment of the application, the purity of a cluster pair is the purity of the reference class in the cluster pair. Assuming the reference class contains n pieces of data and the cluster pair contains m pieces of data, the purity of the reference class is n/m. In example 1, the category with the largest amount of data in cluster pair 1 is a, and the purity of category a is 10000/10001 = 99.99%.
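The purity calculation above can be sketched in a few lines; `labels` here is a hypothetical list holding the class label of every piece of data in a cluster pair.

```python
from collections import Counter

def cluster_pair_purity(labels):
    """Purity of a cluster pair: the amount of data in the reference
    class (the class with the most data, n in the text) divided by the
    total amount of data in the cluster pair (m in the text)."""
    class_counts = Counter(labels)
    n = max(class_counts.values())
    m = len(labels)
    return n / m

# Example 1: merging 10000 items of class "a" with 1 item of class "b".
print(round(cluster_pair_purity(["a"] * 10000 + ["b"]) * 100, 2))  # 99.99
```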
It can be seen from example 1 that the purity of a cluster pair is positively correlated with the accuracy of merging its clusters; that is, the purity of a cluster pair can be used as a basis for deciding whether to merge the clusters in the cluster pair. Based on this, the embodiment of the application trains the network with the purity of cluster pairs as supervision information to obtain the clustering network, so that when processing the data set to be processed, the clustering network can obtain the purity of the cluster pairs in the data set to be processed.
The first terminal may acquire the data set to be processed or the clustering network by receiving the data set to be processed or the clustering network input by a user through an input component, where the input component includes: a keyboard, a mouse, a touch screen, a touch pad, an audio input device, and the like. The first terminal may also acquire the data set to be processed or the clustering network by receiving the data set to be processed or the clustering network sent by a second terminal, where the second terminal includes a mobile phone, a computer, a tablet computer, a server, and the like.
102. Processing the data set to be processed using the clustering network to obtain a clustering result of the data set to be processed.
By processing the data set to be processed through the clustering network obtained in step 101, the purity of the cluster pairs in the data set to be processed can be obtained; the clustering of the data set to be processed can then be completed according to the purity of these cluster pairs, so as to obtain the clustering result of the data set to be processed.
In one possible implementation, the cluster pairs in the data set to be processed may be sorted in descending order of purity to obtain a merging order of the cluster pairs, and the cluster pairs may then be merged in that order. For example, the data set to be processed includes cluster A, cluster B, and cluster C, where the purity of cluster pair 1 (consisting of cluster A and cluster B) is 90%, the purity of cluster pair 2 (consisting of cluster A and cluster C) is 60%, and the purity of cluster pair 3 (consisting of cluster B and cluster C) is 78%. The merging order obtained by sorting the cluster pairs in descending order of purity is: 1. cluster pair 1; 2. cluster pair 3; 3. cluster pair 2. According to this merging order, cluster pair 1 may be merged first to obtain cluster D.
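As a sketch, the merging order described above can be obtained by sorting the cluster pairs on purity; the pair names and purity values below are the ones from the example.

```python
def merging_order(pair_purity):
    """Sort cluster pairs in descending order of purity; the result
    is the order in which the pairs should be merged."""
    return sorted(pair_purity, key=pair_purity.get, reverse=True)

purities = {"cluster pair 1": 0.90, "cluster pair 2": 0.60, "cluster pair 3": 0.78}
print(merging_order(purities))  # ['cluster pair 1', 'cluster pair 3', 'cluster pair 2']
```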
According to the embodiment of the application, processing the data set to be processed with the clustering network, which is obtained by training with purity as supervision information, yields the purity of the cluster pairs in the data set to be processed, from which the merging order of the cluster pairs can be obtained. Merging the cluster pairs in the data set to be processed according to this order reduces the probability of erroneous merging and improves the merging accuracy.
In embodiment (one), the clustering result of the data set to be processed can be obtained by processing the data set to be processed using the clustering network; the process of training to obtain the clustering network will be explained in detail next.
Referring to fig. 2, fig. 2 is a flow chart of a network training method according to embodiment (two) of the present application.
201. Acquiring a network to be trained, a first cluster pair to be trained, and a second cluster pair to be trained.
The execution body of the present embodiment may be the same as or different from that of embodiment (one); the execution body of the present embodiment is not limited in this application. For convenience of description, the execution body of the present embodiment will hereinafter be referred to as the training terminal.
In this embodiment, the structure of the network to be trained is the same as that of the clustering network in step 101. The cluster pairs to be trained (including the first cluster pair to be trained and the second cluster pair to be trained) each comprise at least two clusters; the categories of data in the same cluster are the same, and the categories of data in different clusters may be the same or different. All data in the cluster pairs to be trained have labeling data, and the labeling data carries the category information of the data.
The training terminal may acquire the network to be trained, the first cluster pair to be trained, or the second cluster pair to be trained by receiving them as input by a user through an input component, where the input component includes: a keyboard, a mouse, a touch screen, a touch pad, an audio input device, and the like. The training terminal may also acquire the network to be trained, the first cluster pair to be trained, or the second cluster pair to be trained by receiving them from a third terminal, where the third terminal includes a mobile phone, a computer, a tablet computer, a server, and the like.
202. Processing the first cluster pair to be trained and the second cluster pair to be trained through the network to be trained to obtain a first probability of merging the first cluster pair to be trained first.
In the embodiment of the application, the cluster pairs to be trained (including the first cluster pair to be trained and the second cluster pair to be trained) are processed through the network to be trained to obtain the first probability of merging the first cluster pair to be trained first, and the first probability can be used to represent the accuracy of merging the first cluster pair to be trained.

In one possible implementation, the network to be trained includes a convolutional layer, a pooling layer, and a fully connected layer. Feature data of the cluster pairs to be trained are obtained by processing the cluster pairs to be trained through the convolutional layer and the pooling layer; the feature data carry the feature information of each piece of data in the cluster pairs to be trained and the similarity information between different pieces of data. The first probability of merging the first cluster pair to be trained first can then be obtained by processing the feature data through the fully connected layer.

In another possible implementation, the network to be trained includes a convolutional layer, a pooling layer, and a fully connected layer. Feature data of the cluster pairs to be trained are obtained in the same way through the convolutional layer and the pooling layer. The probability of merging the first cluster pair to be trained and the probability of merging the second cluster pair to be trained are then obtained by processing the feature data through the fully connected layer, and the first probability of merging the first cluster pair to be trained first is obtained according to these two probabilities.
203. Obtaining a second probability of merging the first cluster pair to be trained first according to a first purity of the first cluster pair to be trained and a second purity of the second cluster pair to be trained.
In this embodiment, the purity of the cluster pairs to be trained (including the purity of the first cluster pair to be trained and the purity of the second cluster pair to be trained) has the same meaning as the purity of the cluster pairs in embodiment (one), and will not be described again here.
In this embodiment, the probability of merging the first cluster pair to be trained first (i.e., the second probability) is determined according to the first purity and the second purity.
In one possible implementation: if the first purity is greater than the second purity, preferentially merging the first cluster pair to be trained can improve the accuracy of the obtained clustering result; if the first purity is smaller than the second purity, preferentially merging the second cluster pair to be trained can improve the accuracy of the obtained clustering result; and if the first purity is equal to the second purity, the accuracy of the clustering result is the same whether the first cluster pair to be trained or the second cluster pair to be trained is merged first. Based on this, in the case where the first purity is greater than the second purity, the second probability is determined to be a first value, the first value being a positive number, optionally 1. In the case where the first purity is equal to the second purity, the second probability is determined to be a second value, the second value being a positive number smaller than the first value, optionally 1/2. In the case where the first purity is smaller than the second purity, the second probability is determined to be a third value, the third value being a non-negative number smaller than the second value, optionally 0.
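A minimal sketch of this three-valued supervision signal, using the optional values 1, 1/2, and 0 mentioned above:

```python
def second_probability(first_purity, second_purity):
    """Supervision signal for which cluster pair to merge first:
    1 when the first pair is purer, 1/2 on a tie, 0 otherwise."""
    if first_purity > second_purity:
        return 1.0   # first value
    if first_purity == second_purity:
        return 0.5   # second value
    return 0.0       # third value

print(second_probability(0.9, 0.6), second_probability(0.7, 0.7))  # 1.0 0.5
```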
In another possible implementation, the difference between the first purity and the second purity (i.e., the first purity minus the second purity) is determined to yield a fourth value. And determining the second probability as a fifth value under the condition that the fourth value is in the first value range. And determining the second probability as a sixth value when the fourth value is in the second value range. There is no intersection between the first value range and the second value range, and the union of the first value range and the second value range is greater than or equal to-1 and less than or equal to 1. The fifth value and the sixth value are both non-negative numbers less than or equal to 1, and the fifth value is different from the sixth value.
For convenience of description, the interval of values greater than or equal to a and less than or equal to b will be denoted by [a, b], the interval of values greater than c and less than d will be denoted by (c, d), and the interval of values greater than or equal to e and less than f will be denoted by [e, f).
In the above possible implementation, the range of the fourth value is [-1, 1] (hereinafter referred to as the reference interval). The first value range and the second value range have no intersection, and their union is the reference interval, i.e., the reference interval is divided into two sub-intervals (the first value range and the second value range). In practical application, the reference interval can also be divided into three or more sub-intervals, the second probability taking a different value when the fourth value falls in a different sub-interval. The number of sub-intervals is not limited in this application.
For example, the reference interval may be divided into five sub-intervals: [-1, 0), [0, 0], (0, 0.3), [0.3, 0.7), and [0.7, 1]. In the case where the fourth value is in [-1, 0), the first purity is smaller than the second purity, i.e., the second cluster pair to be trained should be merged first, so the second probability can take the value 0. In the case where the fourth value is 0, the first purity and the second purity are equal, i.e., either the first cluster pair to be trained or the second cluster pair to be trained may be merged first, so the second probability can take the value 0.5. In the case where the fourth value is in (0, 0.3), the first purity is greater than the second purity, but the difference between the first purity and the second purity (hereinafter referred to as the first difference) is small, so the second probability can take a positive value greater than 0.5, such as 0.6. In the case where the fourth value is in [0.3, 0.7), the first purity is greater than the second purity, and the difference between them (hereinafter referred to as the second difference) is greater than the first difference, so the second probability can take a positive value greater than 0.6, such as 0.8. In the case where the fourth value is in [0.7, 1], the first purity is greater than the second purity, and the difference between them is greater than the second difference, so the second probability can take a positive value greater than 0.8, such as 1.
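As a sketch, the sub-interval mapping of this example can be written as a simple lookup; the interval boundaries and probability values below are the ones chosen in the example.

```python
def second_probability(first_purity, second_purity):
    """Map the fourth value (the purity difference, in [-1, 1]) to a
    second probability through the sub-intervals of the example:
    [-1, 0) -> 0, [0, 0] -> 0.5, (0, 0.3) -> 0.6,
    [0.3, 0.7) -> 0.8, [0.7, 1] -> 1."""
    fourth_value = first_purity - second_purity
    if fourth_value < 0:
        return 0.0
    if fourth_value == 0:
        return 0.5
    if fourth_value < 0.3:
        return 0.6
    if fourth_value < 0.7:
        return 0.8
    return 1.0

print(second_probability(0.9, 0.4))  # 0.8, since the difference lies in [0.3, 0.7)
```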
The range of the second probability obtained from the magnitude relation between the first purity and the second purity contains only three values (1, 1/2, 0). Since the difference between the first purity and the second purity may be any number in [-1, 1], i.e., its distribution is wide, representing this difference with only three values is clearly inadequate. In contrast, the range of the second probability obtained by dividing the reference interval into at least two sub-intervals and using the relation between the fourth value and the sub-intervals contains at least two values, each of which may be any number in [0, 1]. Compared with determining the second probability from the magnitude relation between the first purity and the second purity, the second probability obtained through the sub-intervals is closer to the distribution of the difference between the first purity and the second purity. Therefore, dividing the reference interval into at least two sub-intervals and performing subsequent processing with the resulting second probability (for example, using it as supervision information) can improve the training effect of the network to be trained.
204. Obtaining the loss of the network to be trained according to the difference between the first probability and the second probability.
The second probability obtained based on the first purity and the second purity is used as supervision information to supervise the first probability output by the network to be trained, so that the network to be trained can learn to determine, from the purity of cluster pairs, the probability of merging a cluster pair first.
In one possible implementation, the cross entropy loss between the first probability and the second probability may be obtained as the loss of the network to be trained by substituting the first probability and the second probability into a cross entropy function.
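For a single pair of scalar probabilities, this loss can be sketched as follows; the `eps` clamp is a hypothetical addition for numerical stability, not part of the original description.

```python
import math

def cross_entropy_loss(first_probability, second_probability, eps=1e-12):
    """Binary cross entropy between the network output (first
    probability) and the supervision signal (second probability)."""
    p = min(max(first_probability, eps), 1.0 - eps)  # avoid log(0)
    y = second_probability
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

print(round(cross_entropy_loss(0.5, 1.0), 4))  # 0.6931, i.e. ln 2
```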
205. Adjusting parameters of the network to be trained based on the loss to obtain the clustering network.
Based on the loss of the network to be trained, the network to be trained is trained by back propagation of gradients until convergence, thereby completing the training of the network to be trained and obtaining the clustering network.
In this embodiment, in the training process of the network to be trained, the purity of the cluster pair to be trained is used as the supervision information of the network to be trained, so that the clustering network obtained by training can have the capability of determining whether to merge the cluster pairs according to the purity of the cluster pairs, and the accuracy of merging the cluster pairs by the clustering network is further improved.
Referring to fig. 3, fig. 3 is a flow chart of one possible implementation of step 202 according to embodiment (three) of the present application.
301. Processing the first cluster pair to be trained and the second cluster pair to be trained through the network to be trained to obtain a third probability of merging the first cluster pair to be trained and a fourth probability of merging the second cluster pair to be trained.
In the embodiment of the application, the to-be-trained cluster pairs (including the first to-be-trained cluster pair and the second to-be-trained cluster pair) are processed through the to-be-trained network, so that the probability of merging the to-be-trained cluster pairs can be obtained, and the probability can be used for representing the accuracy of merging the to-be-trained cluster pairs.
In one possible implementation, the network to be trained includes a convolutional layer, a pooling layer, and a fully connected layer. Feature data of the cluster pairs to be trained are obtained by processing the cluster pairs to be trained through the convolutional layer and the pooling layer; the feature data carry the feature information of each piece of data in the cluster pairs to be trained and the similarity information between different pieces of data. The probability of merging the cluster pairs to be trained can then be obtained by processing the feature data through the fully connected layer.
It should be understood that, in actual processing, the number of cluster pairs to be trained may be more than two; the number of cluster pairs to be trained is not limited in the present application. Assuming the number of cluster pairs to be trained is n, where n is a positive integer greater than or equal to 2, processing the n cluster pairs to be trained through the network to be trained yields n merging probabilities, the n merging probabilities being in one-to-one correspondence with the n cluster pairs to be trained.
The first cluster pair to be trained and the second cluster pair to be trained are processed through the network to be trained to obtain the probability of merging the first cluster pair to be trained (i.e., the third probability) and the probability of merging the second cluster pair to be trained (i.e., the fourth probability).
302. Obtaining the first probability of merging the first cluster pair to be trained first according to the third probability and the fourth probability.
If the data to be clustered contains at least three clusters, different clusters need to be combined when the data are clustered, and different combining sequences can generate different clustering results. For example (example 2), the data to be clustered includes a cluster a, a cluster B and a cluster C, wherein the data in the cluster a is classified into a type a, the data in the cluster B is classified into a type B, and the data in the cluster C is classified into a type C. The degree of similarity between cluster a and cluster B (hereinafter, will be referred to as a first degree of similarity) is 80%, the degree of similarity between cluster B and cluster C (hereinafter, will be referred to as a second degree of similarity) is 60%, and the degree of similarity between cluster a and cluster C (hereinafter, will be referred to as a third degree of similarity) is 70%. If the condition for performing the merging is that the similarity between two clusters is 55% or more, the cluster a may be merged with the cluster B, the cluster B may be merged with the cluster C, or the cluster a may be merged with the cluster C. If the cluster A and the cluster C are combined to obtain the cluster D, and the similarity between the cluster D and the cluster B is 45%, the final clustering result is the cluster B and the cluster D. If the cluster A and the cluster B are combined to obtain the cluster E, and the similarity between the cluster E and the cluster C is 50%, the final clustering result is the cluster C and the cluster E.
As can be seen from example 2, the merging order affects the clustering result. In addition, in example 2, since the first similarity is greater than both the second similarity and the third similarity, cluster A and cluster B should clearly be merged rather than cluster B with cluster C or cluster A with cluster C; that is, the accuracy of the clustering result obtained by merging cluster A and cluster B is higher than that obtained by merging cluster B and cluster C or by merging cluster A and cluster C. However, in example 2, if cluster A and cluster C are merged first to obtain cluster D, the similarity between cluster D and cluster B is 45%, which is less than 55%, and thus cluster A and cluster B can no longer be merged. That is, the merging order affects the accuracy of the clustering result.
In this embodiment, the network to be trained may obtain the probability of merging the first cluster pair to be trained first (i.e., the first probability) according to the third probability of merging the first cluster pair to be trained and the fourth probability of merging the second cluster pair to be trained. The order in which the first cluster pair to be trained and the second cluster pair to be trained are merged can then be determined according to the first probability.
In one possible implementation, assume the probability of merging the first cluster pair to be trained is s1 and the probability of merging the second cluster pair to be trained is s2; the probability P1 of merging the first cluster pair to be trained first satisfies the following formula:
Similarly, the probability P2 of merging the second cluster pair to be trained first satisfies the following formula:
the third probability and the fourth probability are obtained through the neural network to be trained. And obtaining the first probability according to the third probability and the fourth probability so as to obtain the probability of combining the first cluster pairs to be trained through the neural network to be trained.
The embodiment of the application also provides two methods for determining the purity of the cluster pairs, and the description is given below taking the first purity of the first cluster pair to be trained as an example.
Referring to fig. 4, fig. 4 is a flow chart of a method for determining the first purity according to embodiment (four) of the present application.
401. Determining the number of data contained in at least one category of the first cluster pair to be trained as a first number set according to the labeling data of the data in the first cluster pair to be trained.
In this embodiment, the labeling data of each data in the first to-be-trained cluster pair carries class information of the data, for example, the labeling data of the data a in the first to-be-trained cluster pair carries information that the data a is class a, that is, the data a belongs to class a.
In this embodiment, the purity of the first cluster pair to be trained is the purity of the reference class in the first cluster pair to be trained. Before determining the purity of the first cluster pair to be trained, the number of data contained in at least one category of the first cluster pair to be trained is determined as a first number set, and the category corresponding to the maximum value in the first number set is taken as the reference category. For example (example 3), the first cluster pair to be trained includes data A, data B, data C, data D, and data E, where data A is of class a, data B is of class b, data C is of class c, data D is of class a, and data E is of class c. It may be determined that category a contains 2 pieces of data and category b contains 1 piece of data, giving the first number set {1, 2}. It may also be determined that category a contains 2 pieces of data and category c contains 2 pieces of data, giving the first number set {2, 2}. It may also be determined that category b contains 1 piece of data and category c contains 2 pieces of data, giving the first number set {1, 2}. It may also be determined that category a contains 2 pieces of data, category b contains 1 piece of data, and category c contains 2 pieces of data, giving the first number set {1, 2, 2}.
Alternatively, the number of data contained in each category in the first cluster pair to be trained may be determined separately as the first number set. Continuing example 3 as example 4: the number of data contained in category a is determined to be 2, the number in category b to be 1, and the number in category c to be 2, giving the first number set {1, 2, 2}.
402. Obtaining the first purity according to the maximum value in the first number set and the number of data in the first cluster pair to be trained.
In one possible implementation manner, the category corresponding to the maximum value in the first number set is a reference category, and the ratio of the number of data contained in the reference category (i.e., the maximum value in the first number set) to the number of data in the first cluster pair to be trained is calculated to obtain the purity of the first cluster pair to be trained, i.e., the first purity.
It should be understood that, if there are at least two maximum values in the first number set (e.g., the first number set in example 4 contains 2 maximum values), the category corresponding to any one of the at least two maximum values is taken as the reference category (e.g., the reference category in example 4 may be either category a or category c).
Taking example 4 as an example, the maximum value in the first number set is 2 and the number of data in the first cluster pair to be trained is 5, so the first purity of the first cluster pair to be trained is 2/5 = 40%.
In another possible implementation, where α represents the maximum value in the first set of numbers and β represents the number of data in the first cluster to be trained, the first purity C satisfies the following equation:
where a is a real number.
In another possible implementation, where α represents the maximum value in the first set of numbers and β represents the number of data in the first cluster to be trained, the first purity C satisfies the following equation:
wherein b is a real number.
It should be understood that the method for determining the purity of the first cluster pair to be trained provided in this embodiment may also be used to determine the purity of other cluster pairs (including the second cluster pair to be trained and the cluster pairs in the data set to be processed).
For example, the number of data contained in at least one category in the second cluster pair to be trained may be determined as a second number set according to the labeling data of the data in the second cluster pair to be trained. The second purity is then obtained according to the maximum value in the second number set and the number of data in the second cluster pair to be trained. Alternatively, the ratio of the maximum value in the second number set to the number of data in the second cluster pair to be trained may be taken as the second purity.
According to the technical scheme provided by this embodiment, the purity of the reference category in a cluster pair, namely the purity of the cluster pair, can be determined from the number of data contained in the reference category and the number of data in the cluster pair.
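As an illustrative sketch (function and variable names are assumptions, not taken from the patent), the ratio-based purity of embodiment (four) could be computed as:

```python
from collections import Counter

def max_ratio_purity(labels):
    # The first number set: number of data contained in each category.
    counts = Counter(labels)
    # Purity = data count of the reference category / data count of the pair.
    return max(counts.values()) / len(labels)

# Five samples whose categories are a, c, a, c, b (cf. example 4):
print(max_ratio_purity(["a", "c", "a", "c", "b"]))  # 2/5 = 0.4
```

Note that when there are ties for the maximum (here categories a and c both contain 2 data), the ratio is the same whichever tied category is taken as the reference category.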
The purity of a cluster pair obtained by the technical solution provided in embodiment (four) can characterize the purity of the reference category (i.e., the ratio of the amount of data contained in the reference category to the amount of data in the cluster pair), but cannot characterize the purity of the categories in the cluster pair. The purity of the categories in a cluster pair relates to the number of categories it contains: the fewer the categories, the higher the purity, i.e., the purity of the categories is negatively correlated with the number of categories in the cluster pair.
For example (example 5), the first to-be-trained cluster pair 1 comprises data A, data B, data C, data D and data E, where the category of data A is a, the category of data B is b, the category of data C is c, the category of data D is a and the category of data E is c. In the first to-be-trained cluster pair 1, the number of data with category a is 2, the number of data with category b is 1 and the number of data with category c is 2. The first to-be-trained cluster pair 2 comprises data F, data G, data H, data I and data J, where the category of data F is a, the category of data G is b, the category of data H is c, the category of data I is d and the category of data J is a. In the first to-be-trained cluster pair 2, the number of data with category a is 2, the number of data with category b is 1, the number of data with category c is 1 and the number of data with category d is 1. Obviously, the number of categories in the first to-be-trained cluster pair 1 is smaller than that in the first to-be-trained cluster pair 2, i.e., the purity of the categories in the first to-be-trained cluster pair 1 is higher. However, if the purities are calculated according to the technical scheme provided in embodiment (four), the purity of the first to-be-trained cluster pair 1 is 2/5=40% and the purity of the first to-be-trained cluster pair 2 is also 2/5=40%. That is, the two purities obtained by the technical scheme of embodiment (four) are equal, and obviously cannot reflect the difference in category purity between the first to-be-trained cluster pair 1 and the first to-be-trained cluster pair 2.
To this end, embodiments of the present application also provide a method of determining purity of a class in a cluster pair. Referring to fig. 5, fig. 5 is a flow chart of another method for determining the first purity according to the fifth embodiment of the present application.
501. And determining the quantity of data contained in each category in the first cluster pair to be trained as a third quantity set according to the labeling data of the data in the first cluster pair to be trained.
In this embodiment, the labeling data of each data in the first to-be-trained cluster pair carries class information of the data, for example, the labeling data of the data a in the first to-be-trained cluster pair carries information that the data a is class a, that is, the data a belongs to class a.
According to the labeling data of each data in the first cluster to be trained, the quantity of the data contained in each category in the first cluster to be trained can be determined and used as a third quantity set. For example (example 6), the first cluster pair to be trained comprises: data A, data B, data C, data D and data E, wherein the category of the data A is a, the category of the data B is B, the category of the data C is a, the category of the data D is C, and the category of the data E is B. The number of data contained in the class a in the first cluster to be trained is 2, the number of data contained in the class b in the first cluster to be trained is 2, the number of data contained in the class c in the first cluster to be trained is 1, and the third number set is as follows: {2,2,1}.
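The counting step of example 6 can be sketched as follows (variable names are assumed for illustration):

```python
from collections import Counter

# Example 6: the first cluster pair to be trained and the category
# carried by each datum's labeling data.
labeling = {"A": "a", "B": "b", "C": "a", "D": "c", "E": "b"}

# Third number set: number of data contained in each category.
third_number_set = sorted(Counter(labeling.values()).values(), reverse=True)
print(third_number_set)  # [2, 2, 1]
```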
502. And obtaining the first purity according to the elements in the third quantity set and the quantity of the data in the first cluster to be trained.
In this embodiment, the elements in the third number set are the numbers of data contained in each category in the first cluster pair to be trained; for instance, the third number set in example 6 contains three elements: {2, 2, 1}.
In one possible implementation, the sum of the squares of each element in the third number set is determined to obtain a first intermediate number, and the square of the number of data in the first cluster pair to be trained is determined to obtain a third intermediate number. The ratio of the first intermediate number to the third intermediate number is taken as the first purity. Taking example 6 as an example, the third number set contains three elements: 2, 2 and 1, so the sum of squares of the elements is 2²+2²+1²=9, and the square of the number of data in the first cluster pair to be trained is 5²=25. The first purity is then 9/25=36%.
In another possible implementation, the sum of the cubes of each element in the third number set is determined to obtain a fifth intermediate number, and the cube of the number of data in the first cluster pair to be trained is determined to obtain a sixth intermediate number. The ratio of the fifth intermediate number to the sixth intermediate number is taken as the first purity. Taking example 6 as an example, the sum of cubes of the elements is 2³+2³+1³=17, and the cube of the number of data in the first cluster pair to be trained is 5³=125. The first purity is then 17/125=13.6%.
In yet another possible implementation, the sum of the squares of each element in the third number set is determined to obtain a seventh intermediate number, the sum of the seventh intermediate number and a seventh value (a real number) is determined to obtain an eighth intermediate number, and the square of the number of data in the first cluster pair to be trained is determined to obtain a ninth intermediate number. The ratio of the eighth intermediate number to the ninth intermediate number is taken as the first purity. Taking example 6 as an example, assume the seventh value is 0.1: the sum of squares of the elements is 2²+2²+1²=9, the eighth intermediate number is 9+0.1=9.1, and the square of the number of data in the first cluster pair to be trained is 5²=25. The first purity is then 9.1/25=36.4%.
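The three implementations above can be sketched as follows (function names and the 0.1 offset default are illustrative assumptions; the text only requires the seventh value to be a real number):

```python
def squared_sum_purity(counts, total):
    # Ratio of the first intermediate number (sum of squares)
    # to the third intermediate number (square of the data count).
    return sum(n ** 2 for n in counts) / total ** 2

def cubed_sum_purity(counts, total):
    # Ratio of the fifth intermediate number (sum of cubes)
    # to the sixth intermediate number (cube of the data count).
    return sum(n ** 3 for n in counts) / total ** 3

def offset_squared_purity(counts, total, seventh_value=0.1):
    # Eighth intermediate number (sum of squares plus a real offset)
    # over the ninth intermediate number (square of the data count).
    return (sum(n ** 2 for n in counts) + seventh_value) / total ** 2

counts = [2, 2, 1]  # third number set from example 6
print(squared_sum_purity(counts, 5))   # 9/25 = 0.36
print(cubed_sum_purity(counts, 5))     # 17/125 = 0.136
print(offset_squared_purity(counts, 5))
```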
It should be understood that the method for determining the purity of the first cluster pair to be trained provided in this embodiment may also be used to determine the purity of other cluster pairs (including the second cluster pair to be trained and the cluster pair in the data set to be processed).
For example, the number of data contained in each category in the second cluster pair to be trained may be determined as the fourth number set according to the labeling data of the data in the second cluster pair to be trained, and the second purity obtained according to the elements in the fourth number set and the number of data in the second cluster pair to be trained. Alternatively, the sum of the squares of each element in the fourth number set may be determined to obtain a second intermediate number, the square of the number of data in the second cluster pair to be trained determined to obtain a fourth intermediate number, and the ratio of the second intermediate number to the fourth intermediate number taken as the second purity.
According to the technical scheme provided by this embodiment, the purity of the categories in a cluster pair can be obtained. Taking example 5 as an example, the purities of the first to-be-trained cluster pair 1 and the first to-be-trained cluster pair 2 obtained by the technical scheme of embodiment (four) are both 40%. Based on the technical scheme of embodiment (five), the purity of the first to-be-trained cluster pair 1 is (2²+1²+2²)/5²=9/25=36%, while the purity of the first to-be-trained cluster pair 2 is (2²+1²+1²+1²)/5²=7/25=28%. Obviously, the purities obtained by the technical scheme of embodiment (five) better reflect the purity of the categories in the first to-be-trained cluster pair 1 and in the first to-be-trained cluster pair 2.
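The contrast between the two schemes can be checked with a short sketch (assumed names), computing both purities for the two pairs of example 5:

```python
from collections import Counter

def reference_category_purity(labels):
    # Embodiment (four): share of the most frequent category.
    counts = Counter(labels).values()
    return max(counts) / len(labels)

def category_purity(labels):
    # Embodiment (five): sum of squared category counts over the
    # squared number of data in the pair.
    counts = Counter(labels).values()
    return sum(n ** 2 for n in counts) / len(labels) ** 2

pair_1 = ["a", "b", "c", "a", "c"]  # three categories
pair_2 = ["a", "b", "c", "d", "a"]  # four categories

# Embodiment (four) cannot tell the pairs apart:
print(reference_category_purity(pair_1), reference_category_purity(pair_2))  # 0.4 0.4
# Embodiment (five) ranks the pair with fewer categories higher:
print(category_purity(pair_1), category_purity(pair_2))  # 0.36 0.28
```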
Alternatively, the first purity may be obtained according to the technical scheme provided in embodiment (four), and the second purity may be obtained according to the technical scheme provided in embodiment (five). Referring to fig. 6, fig. 6 is a flow chart of a method for determining a first purity and a second purity according to an embodiment (six) of the present application.
601. And determining the quantity of the data contained in at least one category of the first cluster to be trained as a fifth quantity set according to the labeling data of the data in the first cluster to be trained.
The implementation of this step may refer to step 401; the fifth number set in this step corresponds to the first number set in step 401.
602. And determining the quantity of the data contained in each category in the second cluster to be trained as a sixth quantity set according to the labeling data of the data in the second cluster to be trained.
The implementation of this step may refer to step 501; the sixth number set in this step corresponds to the third number set in step 501.
603. And obtaining the first purity according to the maximum value in the fifth quantity set and the quantity of the data in the first cluster to be trained.
The implementation of this step may refer to step 402; the fifth number set in this step corresponds to the first number set in step 402.
604. And obtaining the second purity according to the elements in the sixth quantity set and the quantity of the data in the second cluster to be trained.
The implementation process of this step may refer to step 502; the sixth number set in this step corresponds to the third number set in step 502, the second cluster pair to be trained in this step corresponds to the first cluster pair to be trained in step 502, and the second purity in this step corresponds to the first purity in step 502.
In this embodiment, the first purity focuses more on reflecting the purity of the reference category, while the second purity focuses more on reflecting the purity of the categories in the second cluster pair to be trained.
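Embodiment (six) can accordingly be sketched as one routine that applies the embodiment-(four) formula to the first pair and the embodiment-(five) formula to the second pair (function and variable names are assumed):

```python
from collections import Counter

def first_and_second_purity(first_pair_labels, second_pair_labels):
    # Steps 601/603: fifth number set -> max-ratio purity of the first pair.
    fifth_set = Counter(first_pair_labels).values()
    first_purity = max(fifth_set) / len(first_pair_labels)
    # Steps 602/604: sixth number set -> squared-sum purity of the second pair.
    sixth_set = Counter(second_pair_labels).values()
    second_purity = sum(n ** 2 for n in sixth_set) / len(second_pair_labels) ** 2
    return first_purity, second_purity

print(first_and_second_purity(["a", "c", "a", "c", "b"],
                              ["a", "b", "a", "c", "b"]))  # (0.4, 0.36)
```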
It will be appreciated by those skilled in the art that in the above-described method of the specific embodiments, the written order of steps is not meant to imply a strict order of execution but rather should be construed according to the function and possibly inherent logic of the steps.
The foregoing details the method of embodiments of the present application, and the apparatus of embodiments of the present application is provided below.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application, where the apparatus 1 includes: an acquisition unit 11 and a processing unit 12, wherein:
an acquisition unit 11 for acquiring a data set to be processed and a clustering network; the clustering network is obtained by training with the purity of cluster pairs as supervision information; the purity of a cluster pair is used for representing the purity of the reference category in the cluster pair, wherein the reference category is the category with the largest amount of data in the cluster pair;
and the processing unit 12 is used for processing the data set to be processed by using the clustering network to obtain a clustering result of the data set to be processed.
In combination with any embodiment of the present application, training to obtain the clustering network by using the purity of the cluster pairs as the supervision information includes:
acquiring a network to be trained, a first cluster pair to be trained and a second cluster pair to be trained;
processing the first cluster pair to be trained and the second cluster pair to be trained through the network to be trained to obtain a first probability of merging the first cluster pair to be trained;
Obtaining a second probability of merging the first cluster pair to be trained first according to the first purity of the first cluster pair to be trained and the second purity of the second cluster pair to be trained;
obtaining the loss of the network to be trained according to the difference between the first probability and the second probability;
and adjusting parameters of the network to be trained based on the loss to obtain the clustering network.
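The five training steps above might be sketched numerically as follows. The toy one-weight "network", the squared-difference loss and the finite-difference update are illustrative assumptions, since the text fixes neither a network architecture nor an optimizer:

```python
import math

def network_probability(w, pair_feature):
    # Toy stand-in for the network to be trained: squash a weighted
    # feature of the cluster pair into (0, 1) -> first probability.
    return 1.0 / (1.0 + math.exp(-w * pair_feature))

def purity_probability(first_purity, second_purity):
    # Second probability from the two purities (one assumed scheme:
    # merge the purer pair first).
    if first_purity > second_purity:
        return 1.0
    if first_purity == second_purity:
        return 0.5
    return 0.0

def train_step(w, pair_feature, target, lr=0.5, eps=1e-6):
    # Loss from the difference between first and second probability,
    # minimised by a crude finite-difference gradient step.
    def loss_at(wv):
        return (network_probability(wv, pair_feature) - target) ** 2
    grad = (loss_at(w + eps) - loss_at(w)) / eps
    return w - lr * grad, loss_at(w)

w = 0.0
target = purity_probability(0.40, 0.28)  # first pair is purer -> 1.0
initial_loss = (network_probability(w, 1.0) - target) ** 2
for _ in range(500):
    w, loss = train_step(w, pair_feature=1.0, target=target)
print(loss < initial_loss, w > 0.0)  # the loss shrinks as w grows
```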
In combination with any of the embodiments of the present application, the apparatus is further configured to:
before the second probability of merging the first cluster pair is obtained according to the first purity of the first cluster pair to be trained and the second purity of the second cluster pair to be trained, determining the quantity of data contained in at least one category of the first cluster pair to be trained as a first quantity set according to the labeling data of the data in the first cluster pair to be trained;
determining the quantity of data contained in at least one category of the second cluster to be trained as a second quantity set according to the labeling data of the data in the second cluster to be trained; the labeling data carries class information of the data;
obtaining the first purity according to the maximum value in the first quantity set and the quantity of data in the first cluster to be trained;
And obtaining the second purity according to the maximum value in the second quantity set and the quantity of data in the second cluster to be trained.
In combination with any one of the embodiments of the present application, the obtaining the first purity according to the maximum value in the first number set and the number of the data in the first cluster to be trained, and obtaining the second purity according to the maximum value in the second number set and the number of the data in the second cluster to be trained includes:
taking the ratio of the maximum value in the first quantity set to the quantity of data contained in the first cluster to be trained as the first purity;
and taking the ratio of the maximum value in the second quantity set to the quantity of data contained in the second cluster to be trained as the second purity.
In combination with any of the embodiments of the present application, the apparatus is further configured to:
before the second probability of merging the first cluster pair according to the first purity of the first cluster pair to be trained and the second purity of the second cluster pair to be trained is obtained, determining the quantity of data contained in each category in the first cluster pair to be trained according to the marking data of the data in the first cluster pair to be trained as a third quantity set, and determining the quantity of data contained in each category in the second cluster pair to be trained according to the marking data of the data in the second cluster pair to be trained as a fourth quantity set; the labeling data carries class information of the data;
And obtaining the first purity according to the elements in the third quantity set and the quantity of data in the first cluster pair to be trained, and obtaining the second purity according to the elements in the fourth quantity set and the quantity of data in the second cluster pair to be trained.
In combination with any one of the embodiments of the present application, the obtaining the first purity according to the elements in the third number set and the number of data in the first to-be-trained cluster pair, and the obtaining the second purity according to the elements in the fourth number set and the number of data in the second to-be-trained cluster pair, includes:
determining the sum of squares of each element in the third quantity set to obtain a first intermediate number, and determining the sum of squares of each element in the fourth quantity set to obtain a second intermediate number;
determining the square of the number of the data in the first cluster to be trained to obtain a third intermediate number, and determining the square of the number of the data in the second cluster to be trained to obtain a fourth intermediate number;
the ratio of the first intermediate number to the third intermediate number is taken as the first purity, and the ratio of the second intermediate number to the fourth intermediate number is taken as the second purity.
In combination with any of the embodiments of the present application, the apparatus is further configured to:
before the second probability of merging the first cluster pair is obtained according to the first purity of the first cluster pair to be trained and the second purity of the second cluster pair to be trained, determining the quantity of data contained in at least one category of the first cluster pair to be trained as a fifth quantity set according to the labeling data of the data in the first cluster pair to be trained;
determining the quantity of data contained in each category in the second cluster to be trained as a sixth quantity set according to the labeling data of the data in the second cluster to be trained;
obtaining the first purity according to the maximum value in the fifth quantity set and the quantity of data in the first cluster to be trained;
and obtaining the second purity according to the elements in the sixth quantity set and the quantity of data in the second cluster pair to be trained.
In combination with any one of the embodiments of the present application, the obtaining, according to the first purity of the first cluster pair to be trained and the second purity of the second cluster pair to be trained, the second probability of merging the first cluster pair to be trained first includes:
determining that the second probability is a first value if the first purity is greater than the second purity;
Determining that the second probability is a second value if the first purity is equal to the second purity;
and determining that the second probability is a third value if the first purity is less than the second purity.
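A sketch of this comparison scheme follows; the concrete first, second and third values (1, 0.5 and 0 below) are assumptions, since the text leaves them open:

```python
def second_probability(first_purity, second_purity,
                       first_value=1.0, second_value=0.5, third_value=0.0):
    # Merge the purer cluster pair first: a higher first purity gives a
    # higher probability of merging the first pair before the second.
    if first_purity > second_purity:
        return first_value
    if first_purity == second_purity:
        return second_value
    return third_value

print(second_probability(0.40, 0.36))  # 1.0
```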
In combination with any one of the embodiments of the present application, the obtaining, according to the first purity of the first cluster pair to be trained and the second purity of the second cluster pair to be trained, the second probability of merging the first cluster pair to be trained first includes:
determining a difference between the first purity and the second purity to obtain a fourth value;
determining that the second probability is a fifth value when the fourth value is within the first value range;
determining that the second probability is a sixth value when the fourth value is within a second value range; the fifth value and the sixth value are both non-negative numbers less than or equal to 1, and the fifth value is different from the sixth value; there is no intersection between the first range of values and the second range of values.
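Similarly, a sketch of the range-based scheme; the non-overlapping value ranges and the fifth and sixth values below are illustrative assumptions:

```python
def second_probability_by_range(first_purity, second_purity,
                                fifth_value=1.0, sixth_value=0.0):
    # Fourth value: difference between the first and second purity.
    fourth_value = first_purity - second_purity
    # Assumed first value range [0, +inf) and second value range (-inf, 0);
    # the two ranges do not intersect, as the text requires.
    return fifth_value if fourth_value >= 0 else sixth_value

print(second_probability_by_range(0.36, 0.28))  # 1.0
```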
In combination with any embodiment of the present application, the processing, by the network to be trained, the first cluster pair to be trained and the second cluster pair to be trained to obtain the first probability of merging the first cluster pair to be trained first includes:
Processing the first cluster pair to be trained and the second cluster pair to be trained through the network to be trained to obtain a third probability of merging the first cluster pair to be trained and a fourth probability of merging the second cluster pair to be trained;
and obtaining the first probability according to the third probability and the fourth probability.
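One plausible way (an assumption, not stated in the text) to obtain the first probability from the third and fourth probabilities is to normalise the two merge scores against each other:

```python
def first_probability(third_probability, fourth_probability):
    # Score for merging the first pair, renormalised against the
    # competing score for merging the second pair.
    return third_probability / (third_probability + fourth_probability)

print(first_probability(0.6, 0.2))
```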
According to the embodiment of the application, the clustering network obtained by training the purity as the supervision information is used for processing the data set to be processed, so that the purity of the cluster pairs in the data set to be processed can be obtained, and the merging sequence of the cluster pairs can be obtained according to the purity of the cluster pairs. The cluster pairs in the data set to be processed are combined according to the combination sequence, so that the probability of error combination can be reduced, and the combination accuracy is improved.
In some embodiments, functions or modules included in an apparatus provided by the embodiments of the present disclosure may be used to perform a method described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.
Fig. 8 is a schematic hardware structure of a data processing apparatus according to an embodiment of the present application. The data processing apparatus 2 comprises a processor 21, a memory 22, an input device 23 and an output device 24. The processor 21, the memory 22, the input device 23 and the output device 24 are coupled through connectors, which include various interfaces, transmission lines, buses and the like; this is not limited in this application. It should be understood that in the various embodiments of the present application, coupling means interconnection in a particular manner, including direct connection or indirect connection through other devices, for example through various interfaces, transmission lines, buses and the like.
The processor 21 may include one or more processors, for example one or more central processing units (central processing unit, CPU); in the case of a CPU, the CPU may be a single-core CPU or a multi-core CPU. Alternatively, the processor 21 may be a processor group formed by a plurality of GPUs coupled to each other through one or more buses. Alternatively, the processor may be another type of processor, which is not limited in the embodiments of the present application.
Memory 22 may be used to store computer program instructions as well as various types of computer program code for performing aspects of the present application. Optionally, the memory includes, but is not limited to, a random access memory (random access memory, RAM), a read-only memory (ROM), an erasable programmable read-only memory (erasable programmable read only memory, EPROM), or a portable read-only memory (compact disc read-only memory, CD-ROM) for associated instructions and data.
The input means 23 are for inputting data and/or signals and the output means 24 are for outputting data and/or signals. The input device 23 and the output device 24 may be separate devices or may be an integral device.
It will be appreciated that, in the embodiment of the present application, the memory 22 may be used to store not only the related instructions, but also the purity of the cluster pairs, for example, the memory 22 may be used to store the set of data to be processed acquired through the input device 23, or the memory 22 may also be used to store the clustering result obtained through the processor 21, etc., and the embodiment of the present application is not limited to the data specifically stored in the memory.
It will be appreciated that figure 8 shows only a simplified design of a data processing apparatus. In practical applications, the data processing apparatus may also include other necessary elements, including but not limited to any number of input/output devices, processors, memories, etc., and all data processing apparatuses capable of implementing the embodiments of the present application are within the scope of protection of the present application.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein. It will be further apparent to those skilled in the art that the descriptions of the various embodiments herein are provided with emphasis, and that the same or similar parts may not be explicitly described in different embodiments for the sake of convenience and brevity of description, and thus, parts not described in one embodiment or in detail may be referred to in the description of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, produces a flow or function in accordance with embodiments of the present application, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in or transmitted across a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by a wired (e.g., coaxial cable, fiber optic, digital subscriber line (digital subscriber line, DSL)), or wireless (e.g., infrared, wireless, microwave, etc.). The computer readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains an integration of one or more available media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a digital versatile disk (digital versatile disc, DVD)), or a semiconductor medium (e.g., a Solid State Disk (SSD)), or the like.
Those of ordinary skill in the art will appreciate that implementing all or part of the above-described method embodiments may be accomplished by a computer program to instruct related hardware, the program may be stored in a computer readable storage medium, and the program may include the above-described method embodiments when executed. And the aforementioned storage medium includes: a read-only memory (ROM) or a random access memory (random access memory, RAM), a magnetic disk or an optical disk, or the like.

Claims (12)

1. A method of data processing, the method comprising:
acquiring a data set to be processed and a clustering network; the clustering network is obtained by taking the purity of cluster pairs as supervision information; the purity of the cluster pairs is used for representing the purity of reference categories in the cluster pairs, wherein the reference categories are the categories with the largest amount of data in the cluster pairs; the clustering network is obtained by training the purity of cluster pairs as supervision information, and comprises the following steps: acquiring a network to be trained, a first cluster pair to be trained and a second cluster pair to be trained; processing the first cluster pair to be trained and the second cluster pair to be trained through the network to be trained to obtain a first probability of merging the first cluster pair to be trained; obtaining a second probability of merging the first cluster pair to be trained first according to the first purity of the first cluster pair to be trained and the second purity of the second cluster pair to be trained, wherein the first purity is the purity of the reference class in the first cluster pair to be trained, and the second purity is the purity of the reference class in the second cluster pair to be trained; obtaining the loss of the network to be trained according to the difference between the first probability and the second probability; adjusting parameters of the network to be trained based on the loss to obtain the clustering network;
And processing the data set to be processed by using the clustering network to obtain a clustering result of the data set to be processed.
2. The method of claim 1, wherein prior to the deriving the second probability of merging the first pair of clusters based on the first purity of the first pair of clusters to be trained and the second purity of the second pair of clusters to be trained, the method further comprises:
determining, according to the labeling data of the data in the first cluster pair to be trained, the quantity of data contained in at least one category of the first cluster pair to be trained, as a first quantity set;
determining, according to the labeling data of the data in the second cluster pair to be trained, the quantity of data contained in at least one category of the second cluster pair to be trained, as a second quantity set; the labeling data carries the category information of the data;
obtaining the first purity according to the maximum value in the first quantity set and the quantity of data in the first cluster pair to be trained;
and obtaining the second purity according to the maximum value in the second quantity set and the quantity of data in the second cluster pair to be trained.
3. The method of claim 2, wherein the obtaining the first purity according to the maximum value in the first quantity set and the quantity of data in the first cluster pair to be trained, and the obtaining the second purity according to the maximum value in the second quantity set and the quantity of data in the second cluster pair to be trained, comprises:
taking the ratio of the maximum value in the first quantity set to the quantity of data contained in the first cluster pair to be trained as the first purity;
and taking the ratio of the maximum value in the second quantity set to the quantity of data contained in the second cluster pair to be trained as the second purity.
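Claims 2 and 3 together define the purity as the share of the most frequent (reference) category. A minimal sketch, assuming the labeling data of a cluster pair is available as a plain list of class labels (a representation the patent does not prescribe):

```python
from collections import Counter

def max_count_purity(labels):
    """Claims 2-3 sketch: ratio of the largest per-category count (the maximum
    of the quantity set) to the total quantity of data in the cluster pair."""
    quantity_set = Counter(labels).values()   # data count per category
    return max(quantity_set) / len(labels)
```

For example, a cluster pair labeled `["cat", "cat", "cat", "dog"]` has purity 0.75: the reference category is "cat" with 3 of 4 items.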
4. The method of claim 3, wherein before obtaining the second probability of merging the first cluster pair to be trained first according to the first purity of the first cluster pair to be trained and the second purity of the second cluster pair to be trained, the method further comprises:
determining, according to the labeling data of the data in the first cluster pair to be trained, the quantity of data contained in each category of the first cluster pair to be trained, as a third quantity set, and determining, according to the labeling data of the data in the second cluster pair to be trained, the quantity of data contained in each category of the second cluster pair to be trained, as a fourth quantity set; the labeling data carries the category information of the data;
and obtaining the first purity according to the elements in the third quantity set and the quantity of data in the first cluster pair to be trained, and obtaining the second purity according to the elements in the fourth quantity set and the quantity of data in the second cluster pair to be trained.
5. The method of claim 4, wherein the obtaining the first purity according to the elements in the third quantity set and the quantity of data in the first cluster pair to be trained, and the obtaining the second purity according to the elements in the fourth quantity set and the quantity of data in the second cluster pair to be trained, comprises:
determining the sum of squares of the elements in the third quantity set to obtain a first intermediate number, and determining the sum of squares of the elements in the fourth quantity set to obtain a second intermediate number;
determining the square of the quantity of data in the first cluster pair to be trained to obtain a third intermediate number, and determining the square of the quantity of data in the second cluster pair to be trained to obtain a fourth intermediate number;
and taking the ratio of the first intermediate number to the third intermediate number as the first purity, and the ratio of the second intermediate number to the fourth intermediate number as the second purity.
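Claims 4 and 5 give an alternative purity: the sum of squared per-category counts divided by the squared total count (a Simpson-index-style concentration measure). A sketch under the same labeled-list assumption as above:

```python
from collections import Counter

def squared_purity(labels):
    """Claim 5 sketch: (sum of squares of the elements of the quantity set) /
    (square of the total quantity of data in the cluster pair)."""
    counts = Counter(labels).values()
    intermediate = sum(c * c for c in counts)   # first/second intermediate number
    return intermediate / len(labels) ** 2      # ratio to third/fourth intermediate number
```

For `["a", "a", "b"]` this gives (2² + 1²) / 3² = 5/9, whereas the claim-3 purity would be 2/3; the squared form penalizes fragmented clusters more heavily.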
6. The method according to any one of claims 1 to 5, wherein the obtaining the second probability of merging the first cluster pair to be trained first according to the first purity of the first cluster pair to be trained and the second purity of the second cluster pair to be trained comprises:
determining that the second probability is a first value if the first purity is greater than the second purity;
determining that the second probability is a second value if the first purity is equal to the second purity;
and determining that the second probability is a third value if the first purity is less than the second purity.
7. The method according to any one of claims 2 to 5, wherein the obtaining the second probability of merging the first cluster pair to be trained first according to the first purity of the first cluster pair to be trained and the second purity of the second cluster pair to be trained comprises:
determining a difference between the first purity and the second purity to obtain a fourth value;
determining that the second probability is a fifth value when the fourth value is within the first value range;
determining that the second probability is a sixth value when the fourth value is within a second value range; the fifth value and the sixth value are both non-negative numbers less than or equal to 1, and the fifth value is different from the sixth value; there is no intersection between the first range of values and the second range of values.
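Claim 7 maps the purity difference into one of two disjoint value ranges. The zero threshold and the two output values below are illustrative assumptions; the claim only requires the ranges to be disjoint and the fifth and sixth values to be distinct non-negative numbers no greater than 1.

```python
def range_based_probability(first_purity, second_purity, threshold=0.0):
    """Claim 7 sketch: the purity difference (fourth value) falls into one of
    two disjoint value ranges, each mapped to a fixed probability. The
    threshold and output values are hypothetical choices."""
    fourth_value = first_purity - second_purity
    if fourth_value > threshold:    # hypothetical first value range
        return 1.0                  # fifth value
    return 0.0                      # sixth value (second value range)
```

With more than two ranges, the same pattern would yield a smoother, step-wise target probability.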
8. The method of claim 1, wherein the processing the first cluster pair to be trained and the second cluster pair to be trained through the network to be trained to obtain the first probability of merging the first cluster pair to be trained first comprises:
processing the first cluster pair to be trained and the second cluster pair to be trained through the network to be trained to obtain a third probability of merging the first cluster pair to be trained first and a fourth probability of merging the second cluster pair to be trained first;
and obtaining the first probability according to the third probability and the fourth probability.
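Claim 8 derives the first probability from the two per-pair merge probabilities without fixing the combination rule. A softmax-style normalization of the two scores is one plausible reading, used here purely as an assumption:

```python
import math

def first_probability(third_probability, fourth_probability):
    """Claim 8 sketch: combine the two merge scores into the probability of
    merging the first cluster pair first. Softmax normalization is an
    assumption; the claim does not specify the combination rule."""
    ea = math.exp(third_probability)
    eb = math.exp(fourth_probability)
    return ea / (ea + eb)
```

Equal scores yield 0.5; a higher score for the first cluster pair pushes the probability above 0.5.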
9. A data processing apparatus, the apparatus comprising:
an acquisition unit, configured to acquire a data set to be processed and a clustering network; the clustering network is obtained by training with the purity of cluster pairs as supervision information; the purity of a cluster pair characterizes the purity of the reference category in the cluster pair, wherein the reference category is the category with the largest quantity of data in the cluster pair; the training of the clustering network with the purity of cluster pairs as supervision information comprises:
acquiring a network to be trained, a first cluster pair to be trained and a second cluster pair to be trained;
processing the first cluster pair to be trained and the second cluster pair to be trained through the network to be trained to obtain a first probability of merging the first cluster pair to be trained first;
obtaining a second probability of merging the first cluster pair to be trained first according to a first purity of the first cluster pair to be trained and a second purity of the second cluster pair to be trained, wherein the first purity is the purity of the reference category in the first cluster pair to be trained, and the second purity is the purity of the reference category in the second cluster pair to be trained;
obtaining the loss of the network to be trained according to the difference between the first probability and the second probability;
and adjusting the parameters of the network to be trained based on the loss to obtain the clustering network;
and a processing unit, configured to process the data set to be processed by using the clustering network to obtain a clustering result of the data set to be processed.
10. A processor for performing the method of any one of claims 1 to 8.
11. An electronic device, comprising: a processor, transmission means, input means, output means and memory for storing computer program code comprising computer instructions which, when executed by the processor, cause the electronic device to perform the method of any one of claims 1 to 8.
12. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program comprising program instructions which, when executed by a processor of an electronic device, cause the processor to perform the method of any of claims 1 to 8.
CN201911395340.8A 2019-12-30 2019-12-30 Data processing method and device, processor, electronic equipment and storage medium Active CN111160468B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911395340.8A CN111160468B (en) 2019-12-30 2019-12-30 Data processing method and device, processor, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN111160468A CN111160468A (en) 2020-05-15
CN111160468B true CN111160468B (en) 2024-01-12

Family

ID=70559237


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111507428B * 2020-05-29 2024-01-05 Shenzhen Sensetime Technology Co Ltd Data processing method and device, processor, electronic equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102737327A (en) * 2011-03-31 2012-10-17 International Business Machines Corp Computer implemented method and system for dividing customer clusters
CN109508748A (en) * 2018-11-22 2019-03-22 Beijing Qihoo Technology Co Ltd Clustering method and device
CN109697452A (en) * 2017-10-23 2019-04-30 Beijing Jingdong Shangke Information Technology Co Ltd Data object processing method, processing device and processing system
CN109978006A (en) * 2019-02-25 2019-07-05 Beijing University of Posts and Telecommunications Clustering method and device
CN110020022A (en) * 2019-01-03 2019-07-16 Alibaba Group Holding Ltd Data processing method, device, equipment and readable storage medium
CN110232373A (en) * 2019-08-12 2019-09-13 PCI-Suntek Technology Co Ltd Face clustering method, apparatus, equipment and storage medium
CN110598065A (en) * 2019-08-28 2019-12-20 Tencent Cloud Computing (Beijing) Co Ltd Data mining method and device and computer readable storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8838510B2 (en) * 2011-09-16 2014-09-16 International Business Machines Corporation Choosing pattern recognition algorithms and data features using a genetic algorithm




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant