CN112035663A

CN112035663A - Cluster analysis method, device, equipment and storage medium

Info

Publication number: CN112035663A
Application number: CN202010883445.4A
Authority: CN
Inventors: 岳小芬
Original assignee: JD Digital Technology Holdings Co Ltd
Current assignee: JD Digital Technology Holdings Co Ltd
Priority date: 2020-08-28
Filing date: 2020-08-28
Publication date: 2020-12-04
Anticipated expiration: 2040-08-28
Also published as: CN112035663B

Abstract

The embodiments of the application provide a cluster analysis method, device, equipment and storage medium. Based on a pre-configured clustering model, a first clustering result of a first object set of a target system at a first moment and a second clustering result of a second object set of the target system at a second moment are obtained respectively; purity information of the second clustering result relative to the first clustering result is determined according to the two clustering results; and whether the clustering model needs to be updated is finally determined according to the purity information. Because the purity information of the second clustering result relative to the first clustering result is used as the judgment index, a suitable time for retraining the model can be determined, which avoids both inaccurate clustering of users and wasted manpower and time, and improves the strategy implementation effect.

Description

Cluster analysis method, device, equipment and storage medium

Technical Field

The embodiment of the application relates to the technical field of natural language processing, in particular to a cluster analysis method, a device, equipment and a storage medium.

Background

Clustering is the process of dividing a group of physical or abstract objects into classes of similar objects, often called classes or clusters. In practice, to improve business conversion and promote transactions, enterprises need to formulate different strategy schemes for different user groups, so clustering users is a precondition for implementing those strategies.

At present, users are mainly clustered by means of a clustering model. Building such a model requires a large amount of crowd-labeled data, which professionals obtain by analyzing user behavior over a certain period. However, user behavior data may change considerably between periods, in which case the user behavior must be analyzed again and the clustering model retrained. Because crowd labeling involves a heavy workload and high cost, the prior art usually retrains the clustering model at fixed intervals in order to balance training cost against the applicability of the model.

However, in the process of implementing the present invention, the inventors found that the above solution has at least the following problem: in the prior art, the fixed time interval is set by a researcher according to the historical update times of the clustering model. It can only represent the interval between model updates over a past period, and cannot accurately indicate whether the clustering model currently in use will remain applicable in a future period.

For example, suppose that the user behavior data changes abnormally, far beyond previous results, before the preset fixed time interval has elapsed. Because the time point for retraining has not been reached, the model researchers are unaware that objects such as user groups are being misclassified, and the electronic equipment continues to cluster with the original clustering model. As a result, the strategy does not match the users and the strategy implementation effect is poor. As another example, suppose that the user behavior data has changed little and the model is running stably when the preset fixed time interval elapses. The model researchers will still re-analyze the model results and retrain the model at that time point, yet the retrained clustering model differs little from the original one, so resources such as manpower and time are wasted.

In summary, because the retraining time of the clustering model is inaccurate, existing clustering model analysis methods either misclassify objects, so that the strategy schemes corresponding to different object sets are implemented poorly, or retrain the model unnecessarily, so that resources such as manpower and time are wasted.

Disclosure of Invention

The embodiments of the application provide a cluster analysis method, device, equipment and storage medium, which are used to solve the problem that objects are misclassified or resources are wasted because existing clustering models are retrained at inappropriate times.

In a first aspect, an embodiment of the present application provides a cluster analysis method, including:

respectively acquiring a first clustering result of a first object set and a second clustering result of a second object set based on a pre-configured clustering model, wherein the first object set is a target object set of a target system at a first moment, and the second object set is a target object set of the target system at a second moment;

according to the first clustering result and the second clustering result, determining purity information of the second clustering result relative to the first clustering result, wherein the purity information is used for indicating object clustering change information of the first object set and the second object set;

and determining whether the clustering model needs to be updated or not according to the purity information.

In a possible design of the first aspect, the determining purity information of the second clustering result relative to the first clustering result according to the first clustering result and the second clustering result includes:

determining object clustering association information of the first object set and the second object set according to the first clustering result and the second clustering result;

determining an object clustering association matrix according to the object clustering association information, wherein rows in the object clustering association matrix are used for representing clustering information corresponding to the first clustering result, and columns in the object clustering association matrix are used for representing clustering information corresponding to the second clustering result;

and determining the purity information of the second clustering result relative to the first clustering result according to the object clustering association matrix and a preset purity calculation formula.
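The matrix construction described above might be sketched as follows. This is a minimal illustration, assuming each clustering result is available as a mapping from an object identifier to a zero-based cluster index. The function name `association_matrix` and the choice to count only objects present in both results are assumptions made for the example and are not taken from the application.

```python
from typing import Dict, Hashable, List


def association_matrix(
    first: Dict[Hashable, int],
    second: Dict[Hashable, int],
    k: int,
) -> List[List[int]]:
    """Count objects by (cluster in first result, cluster in second result).

    Rows index clusters of the first result, columns index clusters of the
    second result. Only objects present in both results are counted; this
    is an assumption of the sketch.
    """
    matrix = [[0] * k for _ in range(k)]
    for obj, i in first.items():
        j = second.get(obj)
        if j is not None:
            matrix[i][j] += 1
    return matrix


# Hypothetical data: four users at the first moment, five at the second.
first = {"u1": 0, "u2": 0, "u3": 1, "u4": 1}
second = {"u1": 0, "u2": 1, "u3": 1, "u4": 1, "u5": 0}
m = association_matrix(first, second, k=2)
```

In this toy data, user `u2` moves from cluster 0 to cluster 1, which shows up as an off-diagonal entry in the first row.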

Optionally, the purity calculation formula is as follows:

$$P = \sum_{i=1}^{k} \frac{m_i}{m}\, p_i, \qquad p_i = \max_{1 \le j \le k} \frac{m_{ij}}{m_i}$$

wherein P is the purity information of the second clustering result relative to the first clustering result, i is the index of a cluster in the first clustering result, j is the index of a cluster in the second clustering result, and k is the total number of clusters;

m_ij is the element in the ith row and the jth column of the object clustering association matrix, and represents the object intersection of the ith cluster in the first clustering result and the jth cluster in the second clustering result;

m is the total number of objects after combining and de-duplicating the objects in the first object set and the second object set;

m_i is the total number of objects belonging to the ith cluster in the first clustering result; and p_i is the purity information of the ith cluster in the first clustering result.
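A minimal sketch of a purity computation follows. It assumes the common definition in which the purity of each cluster is the largest entry in its row of the association matrix divided by the row total, and the overall purity is the average of these values weighted by cluster size. As a simplification, it also assumes that the total object count equals the sum of the matrix entries rather than the size of the de-duplicated union. The function names are hypothetical.

```python
from typing import List, Sequence


def cluster_purities(matrix: Sequence[Sequence[int]]) -> List[float]:
    """Per-cluster purity: largest entry of each row divided by the row total."""
    purities = []
    for row in matrix:
        row_total = sum(row)
        # An empty cluster is given purity 0.0 here (an assumption).
        purities.append(max(row) / row_total if row_total else 0.0)
    return purities


def overall_purity(matrix: Sequence[Sequence[int]]) -> float:
    """Size-weighted average of the per-cluster purities.

    Weighting each row purity by (row total / grand total) reduces to the
    sum of the row maxima divided by the grand total.
    """
    total = sum(sum(row) for row in matrix)
    if total == 0:
        return 0.0
    return sum(max(row) for row in matrix) / total


# Hypothetical 2 x 2 association matrix.
example = [[8, 2], [1, 9]]
per_cluster = cluster_purities(example)
p = overall_purity(example)
```

With the hypothetical matrix above, 17 of the 20 objects fall in the dominant cell of their row, so the overall value is 17/20.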

In another possible design of the first aspect, the obtaining a first clustering result of the first object set and a second clustering result of the second object set based on a pre-configured clustering model respectively includes:

obtaining clustering information of the clustering model, wherein the clustering information comprises: the number of clusters and the class center of each cluster;

and based on the cluster number and the class center of each cluster, performing cluster division on the objects in the first object set to obtain a first cluster result, and performing cluster division on the objects in the second object set to obtain a second cluster result.

Exemplarily, for any one target object set of the first object set and the second object set, performing cluster partitioning on the objects in the target object set based on the number of clusters and the class center of each cluster to obtain the target clustering result, including:

calculating the Euclidean distance between each object in the target object set and each class center;

and dividing each object in the target object set into a cluster to which a class center with the closest Euclidean distance belongs to obtain a target clustering result corresponding to the target object set.
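The two steps above can be sketched as a nearest-center assignment. This is an illustrative sketch only: objects and class centers are assumed to be numeric feature vectors of equal length, the function name is hypothetical, and ties are broken in favor of the lower cluster index.

```python
import math
from typing import List, Sequence


def assign_to_nearest_center(
    objects: Sequence[Sequence[float]],
    centers: Sequence[Sequence[float]],
) -> List[int]:
    """Return, for each object, the index of the class center that is
    closest in Euclidean distance."""
    labels = []
    for obj in objects:
        distances = [math.dist(obj, center) for center in centers]
        labels.append(distances.index(min(distances)))
    return labels


# Hypothetical class centers and objects in a two-dimensional feature space.
centers = [(0.0, 0.0), (10.0, 10.0)]
objects = [(1.0, 1.0), (9.0, 8.0), (4.0, 4.0)]
labels = assign_to_nearest_center(objects, centers)
```

Applying the same function with the same centers to the object sets of two different moments would yield the two clustering results that are later compared.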

In yet another possible design of the first aspect, the determining whether the clustering model needs to be updated according to the purity information includes:

when the value of the purity information is larger than or equal to a preset threshold value, determining that the clustering model does not need to be updated;

and when the value of the purity information is smaller than a preset threshold value, determining that the clustering model needs to be updated.

Optionally, when it is determined that the cluster model needs to be updated, the method further includes:

sending model update information, wherein the model update information is used for indicating that the clustering model needs to be updated;

and when a model updating instruction of a user is received, updating the clustering model according to the clustering information of the clustering model and the attribute information of each object in the second object set.
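The update procedure is not detailed at this point. One possible reading, consistent with the detailed description's mention of retraining with the original class centers as the initial class centers, is a k-means-style refinement on the second object set. The sketch below assumes that reading; the function name, the iteration limit and the handling of empty clusters are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def update_centers(
    objects: Sequence[Point],
    old_centers: Sequence[Point],
    iterations: int = 10,
) -> List[Point]:
    """Refine class centers with Lloyd-style iterations, starting from old_centers."""
    centers = [tuple(c) for c in old_centers]
    for _ in range(iterations):
        # Group objects by their nearest current center.
        groups: List[List[Point]] = [[] for _ in centers]
        for obj in objects:
            nearest = min(
                range(len(centers)), key=lambda i: math.dist(obj, centers[i])
            )
            groups[nearest].append(obj)
        # Move each center to the mean of its group; keep it if the group is empty.
        new_centers = [
            tuple(sum(dim) / len(group) for dim in zip(*group)) if group else center
            for group, center in zip(groups, centers)
        ]
        if new_centers == centers:
            break  # Converged: no center moved.
        centers = new_centers
    return centers


# Hypothetical old class centers and attribute vectors of the second object set.
old = [(0.0, 0.0), (10.0, 10.0)]
second_set = [(1.0, 1.0), (3.0, 1.0), (9.0, 11.0), (11.0, 9.0)]
updated = update_centers(second_set, old)
```

Starting from the old centers keeps the number of clusters unchanged and tends to preserve the correspondence between old and new clusters, which may reduce the relabeling effort after an update.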

In a second aspect, the present application provides a cluster analysis apparatus, comprising an acquisition module, a processing module and a determination module;

the acquisition module is used for respectively acquiring a first clustering result of a first object set and a second clustering result of a second object set based on a pre-configured clustering model, wherein the first object set is a target object set of a target system at a first moment, and the second object set is a target object set of the target system at a second moment;

the processing module is configured to determine, according to the first clustering result and the second clustering result, purity information of the second clustering result relative to the first clustering result, where the purity information is used to indicate object clustering change information of the first object set and the second object set;

and the determining module is used for determining whether the clustering model needs to be updated or not according to the purity information.

In a third aspect, embodiments of the present application further provide an electronic device, which includes a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the method according to the first aspect and each of its possible designs.

In a fourth aspect, embodiments of the present application further provide a computer-readable storage medium in which computer instructions are stored; when the computer instructions are executed on a computer, the computer is caused to execute the method according to the first aspect and each of its possible designs.

The application provides a cluster analysis method, a device, equipment and a storage medium, which are based on a pre-configured cluster model, respectively obtain a first cluster result of a first object set of a target system at a first moment and a second cluster result of a second object set of the target system at a second moment, determine purity information of the second cluster result relative to the first cluster result according to the first cluster result and the second cluster result, and finally determine whether the cluster model needs to be updated according to the purity information. In the technical scheme, the purity information of the second clustering result relative to the first clustering result is used as a judgment index, so that the suitable model retraining time can be determined, the problem of inaccurate clustering of users can be avoided, the problem of manpower and time waste can be avoided, and the strategy implementation effect is improved.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.

FIG. 1 is a schematic flow diagram of policy enforcement in a target system;

FIG. 2 is a schematic view of an application scenario of a cluster analysis method provided in an embodiment of the present application;

FIG. 3 is a schematic flow chart of a first embodiment of a cluster analysis method provided in the present application;

FIG. 4 is a schematic flow chart of a second embodiment of a cluster analysis method provided in the present application;

FIG. 5 is a schematic flow chart of a third embodiment of a cluster analysis method provided in the present application;

FIG. 6 is a schematic diagram illustrating a process of clustering a set of target objects using a predetermined clustering model;

FIG. 7 is a schematic flow chart of a fourth embodiment of a cluster analysis method provided in the present application;

FIG. 8 is a schematic diagram of object clustering distribution after clustering a model in the embodiment of the present application;

FIG. 9 is a schematic structural diagram of an embodiment of a cluster analysis apparatus provided in the present application;

FIG. 10 is a schematic structural diagram of an electronic device for performing a cluster analysis method according to an embodiment of the present application.

Certain embodiments of the disclosure are shown in the foregoing drawings and are described in more detail below. These drawings and the written description are not intended to limit the scope of the disclosed concepts in any way, but rather to illustrate the concepts of the disclosure to those skilled in the art by reference to specific embodiments.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.

With the rapid development of internet technology, an enterprise or e-commerce website can improve its competitiveness by taking customer needs as a starting point and formulating strategies according to the user behavior information it obtains. Implementing those strategies accurately can improve the business conversion rate and promote transactions.

Specifically, in a target system (client system), in order to improve the accuracy of policy enforcement, similar users often need to be grouped in advance and their commonalities identified before care messages or advertising information are periodically sent to them; this is what is commonly referred to as clustering. Clustering is the process of dividing a group of physical or abstract objects into classes of similar objects, which in practical applications are usually called classes or clusters.

For example, based on the registration time on the target website, a user who has been registered for about 1 month or less may be defined as a new user, and a user who has been registered for more than 1 month as an old user; this is a very simple clustering. In a policy implementation scenario, different policy schemes are often formulated for different user groups. How to cluster a large number of users of a target website into a plurality of clusters according to their behavior data, and then process each cluster according to a uniform policy, is a key part of the present application.

Illustratively, FIG. 1 is a flow chart illustrating policy enforcement in a target system. As shown in FIG. 1, a member of the target system is a relatively stable user whose behavior does not change greatly within a certain period of time, so the target object set of the present application generally refers to the member users of the target system. By clustering the member users with a pre-established distance model, a plurality of classes can be obtained, for example class 1, ..., class k, where k is an integer greater than 1. A policy is then formulated for the users of each class, namely policy 1, ..., policy k, so that the corresponding policy is enforced periodically for the users within the corresponding class.

In practical application, since the member users of the target system are usually stable, and after the clustering result is obtained, it is an extremely heavy task to perform inductive analysis, search for commonalities and label the population on the clustering result, in the user clustering stage, the users do not need to be clustered again every time the policy is implemented, that is, the clustering model does not need to be changed frequently.

However, in different time periods, the user behavior may also change greatly, and at this time, when the user clustering is performed by using the original clustering model, there is a case of wrong user classification, so when the user behavior changes greatly, the clustering model needs to be retrained, the commonality between users needs to be generalized again, and a policy is made for each cluster.

In the conventional practice, a researcher sets a fixed time interval, for example, 3 months or 5 months, according to the historical update time of the clustering model, and then trains the clustering model again after every fixed time interval. However, the strategy of updating the model at regular time intervals may have the problems of wrong classification of people due to too late training time, mismatching of strategy and user, or waste of manpower and time due to too early training time. In general, the existing scheme has the problem of inaccurate model training opportunity.

In view of the above problems, the technical idea of the application arose as follows: because the model training time in existing target systems is inaccurate, the inventor considered whether the stability of the running model could be characterized by a single index. When the index indicates that the original model is running stably, the original clustering model continues to be used; when the data change exceeds a threshold, the model is retrained and the users are clustered again. Specifically, an index for measuring model stability is provided in the user clustering stage. For the object sets corresponding to different moments, the original clustering model is used to obtain the clustering result at each moment, the change in object clustering between the two clustering results is derived, and the stability of the original clustering model is determined based on that change.

Based on the technical concept, the embodiment of the application provides a cluster analysis method, and the technical scheme is as follows: respectively acquiring a first clustering result of a first object set and a second clustering result of a second object set based on a pre-configured clustering model, wherein the first object set is a target object set of a target system at a first moment, the second object set is a target object set of the target system at a second moment, determining purity information of the second clustering result relative to the first clustering result according to the first clustering result and the second clustering result, and the purity information is used for indicating object clustering change information of the first object set and the second object set, and finally determining whether the clustering model needs to be updated according to the purity information. In the technical scheme, the purity information of the second clustering result relative to the first clustering result is used as a judgment index, so that the suitable model retraining time can be determined, the problem of inaccurate clustering of users can be avoided, the problem of manpower and time waste can be avoided, and the strategy implementation effect is improved.

Exemplarily, FIG. 2 is a schematic view of an application scenario of the cluster analysis method provided in the embodiment of the present application. As shown in FIG. 2, the application scenario may include a target system 11 and an electronic device 12 that are communicatively connected. The target system 11 may include a server 110 and a data center 111. The server 110 is the processing center of the target system 11 and may respond to various external requests; the data center 111 may be the data storage center of the target system 11 and may be used for storing the various requests, the operation results corresponding to the requests, and the like.

It is understood that the embodiments of the present application do not limit the specific functions of the server 110 and the data center 111, and may be determined according to actual situations. For example, if the target system 11 is an e-commerce system, the server 110 processes various requests such as a browsing request, an order request, a user classification request, and a product classification request externally issued, and outputs a processing result. Optionally, the data center 111 may be configured to store each request issued from the outside and a processing result of the server for each request, and the processing result may be represented by user behavior data, for example.

For example, in the application scenario shown in FIG. 2, the electronic device 12 may obtain the processing results from the data center 111 of the target system and, according to them, perform cluster analysis on the target object sets of the target system at different times by using a pre-configured clustering model, thereby obtaining cluster analysis results for the different object sets. The electronic device 12 may then execute the program code of the cluster analysis method on those results to determine whether to update the pre-configured clustering model.

Optionally, in the embodiment of the present application, the electronic device 12 may be a device independent of the target system 11, or may be integrated in the target system 11, for example implemented by the server 110. FIG. 2 is therefore only a schematic diagram of one application scenario provided in the embodiment of the present application; the devices included in an application scenario may be set according to actual requirements, which is not described herein again.

Optionally, the cluster analysis method provided in the embodiment of the present application is explained with the electronic device in the application scenario shown in FIG. 2 as the execution subject. The electronic device may be a terminal device or a server, which is not limited in the embodiment of the present application.

The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.

Exemplarily, FIG. 3 is a schematic flow chart of a first embodiment of a cluster analysis method provided in the present application. As shown in FIG. 3, the cluster analysis method may include the following steps:

S301, respectively obtaining a first clustering result of the first object set and a second clustering result of the second object set based on a pre-configured clustering model.

Wherein the first set of objects is a set of objects of the target system at a first time and the second set of objects is a set of objects of the target system at a second time.

Optionally, in the time dimension, the second time is later than the first time. For example, the target object set of the target system corresponds to a first object set at the first time and, after the target system has run for a period of time, to a second object set at the second time. For example, the first time may be a certain time on a given day, and the second time may be the same time on a later day.

In the embodiment of the application, a pre-configured clustering model is pre-loaded in the electronic device, and the clustering model is obtained by training a preset network by using labeled user behavior data. The electronic equipment utilizes the clustering model to perform clustering analysis on the target object set of the target system at the first moment and the target object set at the second moment respectively, so that a first clustering result corresponding to the first object set and a second clustering result corresponding to the second object set can be obtained respectively.

It is understood that the first time instant and the second time instant are times of two adjacent rounds of the electronic device clustering the set of target objects in the target system, respectively. And respectively carrying out clustering analysis on the target object sets of the target system at different moments by using the same clustering model, and determining whether the clustering model needs to be updated or not by analyzing two clustering results.

S302, according to the first clustering result and the second clustering result, determining the purity information of the second clustering result relative to the first clustering result.

Wherein the purity information is used to indicate object cluster variation information for the first set of objects and the second set of objects.

In the embodiment of the application, when the electronic device obtains the first clustering result and the second clustering result, the association relationship between the first clustering result and the second clustering result may be analyzed, so as to determine the object clustering change information in the first object set and the second object set, and by analyzing the object clustering change information, the purity information of the second clustering result relative to the first clustering result may be determined, so as to analyze whether the clustering model configured in advance in the electronic device is still applicable over time.

Illustratively, the index of purity information is used to characterize the change of object classification in different clustering results (e.g., new and old clustering results).

As an example, if the variation of the object classification result is small, it indicates that when the model is retrained by using the class center of the original clustering model as the initial class center, the offset of the class center should be small, the probability of the object being re-partitioned is also small, and at this time, the value of the purity information of the second clustering result relative to the first clustering result is greater than the preset threshold.

As another example, if a large number of objects in the object set are reclassified, the change of the object classification result is large, if the model is retrained by using the class center of the original clustering model as the initial class center, the offset of the class center should be large, and the original clustering model cannot accurately explain the user behavior of the current object set, so the original clustering model needs to be updated.

And S303, determining whether the clustering model needs to be updated according to the purity information.

In an embodiment of the application, according to the definition of the purity information in S302, the value of the purity information may be used to represent the variation of the object classification between different clustering results. Therefore, the electronic device determines whether to update the clustering model according to the value of the purity information of the second clustering result relative to the first clustering result.

Illustratively, when the value of the purity information of the second clustering result relative to the first clustering result is greater than or equal to a preset threshold, it indicates that the behavior change of the object in the second object set relative to the first object set in the current target system is small, and at this time, it is determined that the clustering model does not need to be updated.

Optionally, when the value of the purity information of the second clustering result relative to the first clustering result is smaller than a preset threshold, it is indicated that the behavior of the second object set in the current target system changes greatly relative to the behavior of the objects in the first object set, and it is determined that the clustering model needs to be updated.
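The decision rule in the two cases above can be sketched as a single comparison. The function name and the threshold value 0.9 are assumptions chosen for illustration; the application only states that the threshold is preset.

```python
def needs_update(purity: float, threshold: float) -> bool:
    """Return True when the purity is below the threshold, meaning the
    clustering model should be updated; a purity greater than or equal
    to the threshold means the model is kept."""
    return purity < threshold


PRESET_THRESHOLD = 0.9  # hypothetical value, for illustration only

keep_case = needs_update(0.95, PRESET_THRESHOLD)      # small behavior change
boundary_case = needs_update(0.90, PRESET_THRESHOLD)  # exactly at the threshold
update_case = needs_update(0.72, PRESET_THRESHOLD)    # large behavior change
```

Note that a purity exactly equal to the threshold falls on the "no update" side, matching the "greater than or equal to" wording above.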

The clustering analysis method provided by the embodiment of the application is based on a pre-configured clustering model, respectively obtains a first clustering result and a second clustering result corresponding to a target object set of a target system at different moments, determines the purity information of the second clustering result relative to the first clustering result according to the first clustering result and the second clustering result, and finally determines whether the clustering model needs to be updated according to the purity information. In the technical scheme, the purity information of the second clustering result relative to the first clustering result is used as a judgment index, so that the suitable model retraining time can be determined, the problem of inaccurate clustering of users can be avoided, the problem of manpower and time waste can be avoided, and the strategy implementation effect is improved.

Exemplarily, FIG. 4 is a schematic flow chart of a second embodiment of the cluster analysis method provided in the present application. As shown in FIG. 4, the above S302 may be implemented by the following steps:

S401, determining object clustering association information of the first object set and the second object set according to the first clustering result and the second clustering result.

In an embodiment of the present application, the first clustering result is used to characterize the plurality of clusters formed by clustering the objects in the first object set; that is, the first clustering result may include the number of objects assigned to each cluster. Similarly, the second clustering result is used to characterize the plurality of clusters formed by clustering the objects in the second object set; that is, the second clustering result may also include the number of objects assigned to each cluster.

It can be understood that, since the first clustering result and the second clustering result are obtained based on the preset clustering model, the number of clusters corresponding to the first clustering result and the second clustering result is the same.

Optionally, the electronic device may calculate object cluster association information of the first object set and the second object set respectively according to object data included in each cluster corresponding to the first clustering result and the second clustering result, that is, the number of objects belonging to the same cluster or different clusters.

Illustratively, Table 1 shows the distribution of object cluster association information between the first object set in the first clustering result and the second object set in the second clustering result, taking as an example the case where the first clustering result and the second clustering result each correspond to 4 clusters.

TABLE 1

First clustering result   Second clustering result
                          Cluster 1   Cluster 2   Cluster 3   Cluster 4
Cluster 1                 100         2           3           4
Cluster 2                 5           101         6           7
Cluster 3                 10          11          102         13
Cluster 4                 8           9           14          200

Illustratively, as shown in the above table, the first object set and the second object set are each divided into 4 clusters, and the values of the elements in the first row are interpreted as follows: the number of objects classified as cluster 1 in both the first object set and the second object set is 100; the number of objects classified as cluster 1 in the first object set but as cluster 2 in the second object set is 2; the number of objects classified as cluster 1 in the first object set but as cluster 3 in the second object set is 3; the number of objects classified as cluster 1 in the first object set but as cluster 4 in the second object set is 4. The remaining rows are interpreted in the same way and are not described again here.
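The counting described for Table 1 can be sketched in Python as follows. This is a minimal illustration rather than the implementation of the application: the function name `association_counts`, the representation of each clustering result as a mapping from object ID to cluster number, and the choice to count only objects present in both results are assumptions made for this example.

```python
from collections import Counter


def association_counts(first_labels, second_labels):
    """Count objects by (cluster in first result, cluster in second result).

    Both arguments are assumed to map object ID -> cluster number.
    Only objects that appear in both results are counted (an assumption).
    """
    counts = Counter()
    for obj_id, first_cluster in first_labels.items():
        if obj_id in second_labels:
            counts[(first_cluster, second_labels[obj_id])] += 1
    return counts


# Hypothetical toy data: five users clustered at two moments.
first = {"u1": 1, "u2": 1, "u3": 2, "u4": 2, "u5": 2}
second = {"u1": 1, "u2": 2, "u3": 2, "u4": 2, "u5": 1}

counts = association_counts(first, second)
print(counts[(1, 1)], counts[(1, 2)], counts[(2, 1)], counts[(2, 2)])
```

Each key of `counts` corresponds to one cell of a table like Table 1: the first component is the cluster in the first clustering result and the second component is the cluster in the second clustering result.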

S402, determining an object cluster incidence matrix according to the object cluster incidence information.

The rows in the object cluster incidence matrix are used for representing the clustering information corresponding to the first clustering result, and the columns in the object cluster incidence matrix are used for representing the clustering information corresponding to the second clustering result.

For example, the electronic device may represent the object cluster association information determined in S401 in the form of an object cluster association matrix for subsequent calculation based thereon.

Optionally, for the object cluster association information shown in table 1, the number of objects in each cluster of the first clustering result is represented by the rows of the object cluster association matrix; specifically, the sum of the values of the elements in each row is equal to the number of objects in the corresponding cluster of the first clustering result. The number of objects in each cluster of the second clustering result is represented by the columns of the object cluster association matrix; specifically, the sum of the values of the elements in each column is equal to the number of objects in the corresponding cluster of the second clustering result.

Correspondingly, each element in the object cluster incidence matrix is used for representing the number of objects belonging to the cluster corresponding to the row in the first clustering result and belonging to the cluster corresponding to the column in the second clustering result.
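The row and column interpretation above can be checked with a short Python sketch. The helper name `build_association_matrix` and the dictionary layout of the pair counts are hypothetical; the numbers are the example values used in this description.

```python
def build_association_matrix(pair_counts, k):
    """Arrange pair counts into a k x k matrix (list of lists).

    Row i holds cluster i+1 of the first clustering result,
    column j holds cluster j+1 of the second clustering result.
    Missing pairs are treated as zero.
    """
    return [[pair_counts.get((i, j), 0) for j in range(1, k + 1)]
            for i in range(1, k + 1)]


# Pair counts taken from the 4-cluster example in this description.
pair_counts = {
    (1, 1): 100, (1, 2): 2,   (1, 3): 3,   (1, 4): 4,
    (2, 1): 5,   (2, 2): 101, (2, 3): 6,   (2, 4): 7,
    (3, 1): 10,  (3, 2): 11,  (3, 3): 102, (3, 4): 13,
    (4, 1): 8,   (4, 2): 9,   (4, 3): 14,  (4, 4): 200,
}
matrix = build_association_matrix(pair_counts, 4)

row_sums = [sum(row) for row in matrix]        # objects per cluster, first result
col_sums = [sum(col) for col in zip(*matrix)]  # objects per cluster, second result
print(row_sums)
print(col_sums)
```

The row sums and the column sums both add up to the same total, since every counted object contributes to exactly one cell.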

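For example, using the object cluster association information shown in table 1, the object cluster association matrix can be determined as

$$\begin{pmatrix} 100 & 2 & 3 & 4 \\ 5 & 101 & 6 & 7 \\ 10 & 11 & 102 & 13 \\ 8 & 9 & 14 & 200 \end{pmatrix}$$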

S403, determining the purity information of the second clustering result relative to the first clustering result according to the object cluster association matrix and a preset purity calculation formula.

For example, a preset purity calculation formula is loaded in the electronic device, and therefore, the value of each element in the obtained object clustering incidence matrix is substituted into the preset purity calculation formula, so that the purity information of the second clustering result relative to the first clustering result can be obtained.

Optionally, in the embodiment of the present application, the purity calculation formula is as follows:

$$P = \sum_{i=1}^{k} \frac{m_i}{m}\, p_i, \qquad p_i = \frac{\max_{1 \le j \le k} m_{ij}}{m_i}$$

where $P$ is the purity information of the second clustering result relative to the first clustering result, $i$ is the index of a cluster in the first clustering result, $j$ is the index of a cluster in the second clustering result, and $k$ is the total number of clusters.

$m_{ij}$ is the element in the $i$-th row and $j$-th column of the object cluster association matrix, and represents the number of objects in the intersection of the $i$-th cluster in the first clustering result and the $j$-th cluster in the second clustering result.

$m$ is the total number of objects after the objects in the first object set and the second object set are merged and de-duplicated.

$m_i$ is the total number of objects belonging to the $i$-th cluster in the first clustering result; $p_i$ is the purity information of the $i$-th cluster in the first clustering result.
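The symbol definitions above can be turned into a small Python sketch. It assumes the purity formula has the standard form P = Σ (m_i / m) · p_i with p_i = max_j m_ij / m_i, which is consistent with the symbols defined here but is an assumption of this example; the function names are hypothetical. Exact fractions are used so that no rounding enters the result.

```python
from fractions import Fraction


def cluster_purities(matrix):
    """Return p_i = max_j m_ij / m_i for each row i (0 for an empty row)."""
    return [Fraction(max(row), sum(row)) if sum(row) else Fraction(0)
            for row in matrix]


def purity(matrix):
    """Return P = sum_i (m_i / m) * p_i for a square association matrix."""
    m = sum(sum(row) for row in matrix)
    if m == 0:
        raise ValueError("association matrix contains no objects")
    return sum(Fraction(sum(row), m) * p_i
               for row, p_i in zip(matrix, cluster_purities(matrix)))


# Example matrix built from the 4-cluster example values in this description.
matrix = [
    [100, 2, 3, 4],
    [5, 101, 6, 7],
    [10, 11, 102, 13],
    [8, 9, 14, 200],
]
P = purity(matrix)
print(P)         # 503/595
print(float(P))
```

Because m_i cancels in each term, the same value can be obtained by summing the largest entry of every row and dividing by m.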

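Illustratively, the object cluster association matrix determined in S402 is used for explanation.

Assume that the object cluster association matrix is

$$\begin{pmatrix} 100 & 2 & 3 & 4 \\ 5 & 101 & 6 & 7 \\ 10 & 11 & 102 & 13 \\ 8 & 9 & 14 & 200 \end{pmatrix}$$

Then $m_i$ is the total number of users in row $i$, for example $m_1 = 100 + 2 + 3 + 4 = 109$; $m$ is the total number of objects after de-duplication in the first object set and the second object set, that is, the total number of objects corresponding to the object cluster association matrix, namely the number of all users to be classified: $m = 100 + 2 + 3 + 4 + 5 + 101 + 6 + 7 + 10 + 11 + 102 + 13 + 8 + 9 + 14 + 200 = 595$. The purity information of the 1st cluster in the first clustering result is

$$p_1 = \frac{\max_j m_{1j}}{m_1} = \frac{100}{109} \approx 0.917$$

In the same way,

$$m_2 = 5 + 101 + 6 + 7 = 119, \qquad p_2 = \frac{101}{119} \approx 0.849$$

$$m_3 = 10 + 11 + 102 + 13 = 136, \qquad p_3 = \frac{102}{136} = 0.75$$

$$m_4 = 8 + 9 + 14 + 200 = 231, \qquad p_4 = \frac{200}{231} \approx 0.866$$

Accordingly, the purity information of the second clustering result relative to the first clustering result is

$$P = \sum_{i=1}^{4} \frac{m_i}{m}\, p_i = \frac{100 + 101 + 102 + 200}{595} = \frac{503}{595} \approx 0.845$$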

According to the cluster analysis method provided by the embodiment of the application, object cluster association information of the first object set and the second object set is determined according to the first clustering result and the second clustering result, and an object cluster association matrix is determined according to the object cluster association information, where the rows of the matrix represent the clustering information corresponding to the first clustering result and the columns represent the clustering information corresponding to the second clustering result. Finally, the purity information of the second clustering result relative to the first clustering result is determined according to the object cluster association matrix and a preset purity calculation formula. In this technical scheme, expressing the object cluster association information of the two clustering results in the form of a matrix simplifies the purity calculation, improves its accuracy, and lays a foundation for subsequently determining whether the clustering model needs to be updated.

Exemplarily, fig. 5 is a schematic flow chart of a third embodiment of the cluster analysis method provided in the present application. As shown in fig. 5, S301 may be implemented by:

S501, obtaining clustering information of the clustering model, wherein the clustering information comprises: the number of clusters and the class center of each cluster.

In the embodiment of the application, the clustering model is obtained by training a preset network with labeled object behavior data. Specifically, once the number of clusters to be formed is determined, the labeled object behavior data is used as the input of the preset network and the parameters of the preset network are adjusted, so that the preset network outputs a plurality of clusters, each of which meets the preset condition.

Optionally, after the clustering model is obtained, each cluster is numbered according to a rule, the class center of each cluster is determined, and the cluster information of each cluster is then stored in ascending order of cluster number. It can be understood that, since the cluster numbers are stored in ascending order, the class centers are stored in that same order.

Accordingly, in the embodiment of the present application, after the electronic device loads the pre-trained cluster model, first, the cluster information of the loaded cluster model, for example, the cluster number of the cluster model determined in the training process and the class center of each cluster (i.e., the position information of the class center), is obtained.

Illustratively, the number of clusters is used to indicate the number of clusters output when the target object set is clustered by using the clustering model, and the class center of each cluster is used to indicate a position that can be used as the class center in all objects corresponding to each cluster.
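One way to store and reload the clustering information described above is sketched below. The JSON layout, the `ClusterInfo` dataclass, and both function names are assumptions made for illustration; the source does not specify a storage format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ClusterInfo:
    number: int    # cluster number assigned after training
    center: list   # class center coordinates in feature space


def serialize_cluster_info(clusters):
    """Serialize cluster information in ascending order of cluster number."""
    ordered = sorted(clusters, key=lambda c: c.number)
    return json.dumps({
        "cluster_count": len(ordered),
        "clusters": [asdict(c) for c in ordered],
    })


def load_cluster_info(text):
    """Return (number of clusters, list of ClusterInfo) from serialized text."""
    data = json.loads(text)
    return data["cluster_count"], [ClusterInfo(**c) for c in data["clusters"]]


# Hypothetical class centers, deliberately given out of order.
stored = serialize_cluster_info([
    ClusterInfo(number=2, center=[40.0]),
    ClusterInfo(number=1, center=[24.0]),
    ClusterInfo(number=3, center=[60.0]),
])
cluster_count, clusters = load_cluster_info(stored)
print(cluster_count, [c.number for c in clusters])
```

Keeping the class centers in the same order as the cluster numbers means the position of a center in the list identifies its cluster, so results obtained at different moments can be compared by cluster number.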

For example, assuming that the objects in the target object set are users, the clustering information may describe a plurality of clusters determined by cluster analysis of the users. For example, in an e-commerce system, cluster 1: a group of users aged 18 or older and younger than 30, whose class center is, for example, a 24-year-old user; cluster 2: a group of users aged 30 to 50 inclusive, whose class center is, for example, a 40-year-old user; cluster 3: a group of users older than 50, whose class center is, for example, a 60-year-old user. The above is only an exemplary illustration; the specific clusters and their class centers may be determined according to actual needs and are not described again here.

S502, based on the cluster number and the class center of each cluster, carrying out cluster division on the objects in the first object set to obtain a first cluster result, and carrying out cluster division on the objects in the second object set to obtain a second cluster result.

In the embodiment of the application, once the electronic device has obtained the number of clusters of the clustering model and the class center of each cluster, it can use the clustering model, with the determined class centers, to perform cluster analysis on the target object set to be clustered, so as to obtain that number of clusters.

For example, the electronic device performs cluster analysis on the first object set by using the cluster number corresponding to the cluster model as a reference and using the class center of each cluster as an initial class center at a first time to obtain a first cluster result, and performs cluster analysis on the second object set by using the cluster number corresponding to the cluster model as the reference and using the class center of each cluster as the initial class center at a second time to obtain a second cluster result.

As an example, for any one target object set of the first object set and the second object set, based on the number of clusters and the class center of each cluster, performing cluster division on the objects in the target object set to obtain a target clustering result may be implemented by:

A1, calculating the Euclidean distance between each object in the target object set and each class center.

A2, dividing each object in the target object set into the cluster to which the class center with the smallest Euclidean distance belongs, to obtain the target clustering result corresponding to the target object set.

Optionally, the electronic device may perform cluster division on the target object set by using the preset clustering model. Specifically, it first calculates the Euclidean distance between each object and each class center of the clustering model, and then assigns each object to the cluster (class cluster) whose class center is closest to that object, so as to obtain the target clustering result corresponding to the target object set.
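Steps A1 and A2 can be sketched in a few lines of Python. The function name `assign_clusters`, the two-dimensional feature vectors, and the coordinates are hypothetical; clusters are numbered from 1 in the order in which the class centers are stored, and a tie is resolved in favor of the lowest cluster number (an assumption, since the source does not discuss ties).

```python
import math


def assign_clusters(objects, class_centers):
    """Assign each object to the cluster of its nearest class center.

    objects: dict mapping object ID -> feature vector.
    class_centers: list of feature vectors; index 0 is cluster 1.
    """
    result = {}
    for obj_id, features in objects.items():
        # A1: Euclidean distance from this object to every class center.
        distances = [math.dist(features, center) for center in class_centers]
        # A2: pick the cluster whose class center is closest.
        result[obj_id] = distances.index(min(distances)) + 1
    return result


# Hypothetical class centers and objects in a 2-D feature space.
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
objects = {
    "u11": (1.0, 0.5),
    "u21": (9.0, 1.0),
    "u31": (0.5, 8.0),
    "u41": (9.5, 9.0),
}
labels = assign_clusters(objects, centers)
print(labels)
```

Applying the same function with the same class centers to the object sets at the first moment and at the second moment yields two label mappings that can be compared cluster by cluster.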

Fig. 6 is a schematic diagram illustrating a process of clustering a set of target objects by using a preset clustering model. As shown in fig. 6, the target object set includes nine objects, i.e., u11, u12, u13, u21, u22, u31, u32, u41, and u42. The class center 1, the class center 2, the class center 3 and the class center 4 are all stored in the preset clustering model during training.

Optionally, the classification process for the object u11 is as follows: the euclidean distances from u11 to class center 1, class center 2, class center 3 and class center 4 are calculated respectively, u11 is divided into clusters to which the class center point with the shortest distance belongs, and as can be seen from fig. 6, u11 has the shortest distance from class center 1, so that u11 can be divided into cluster 1.

Similarly, the electronic device may cluster the objects u12, u13, …, and u42, respectively, following the clustering process used for u11, thereby completing the division into clusters. Referring to the right diagram of fig. 6, u11, u12, and u13 are classified into cluster 1 to which class center 1 belongs, u21 and u22 are classified into cluster 2 to which class center 2 belongs, u31 and u32 are classified into cluster 3 to which class center 3 belongs, and u41 and u42 are classified into cluster 4 to which class center 4 belongs.

In the cluster analysis method provided in the embodiment of the present application, the clustering information of the clustering model is obtained, where the clustering information includes the number of clusters and the class center of each cluster; then, based on the number of clusters and the class center of each cluster, the objects in the first object set are divided into clusters to obtain the first clustering result, and the objects in the second object set are divided into clusters to obtain the second clustering result. In this technical scheme, the first object set and the second object set are both clustered based on the number of clusters of the preset clustering model and the class center of each cluster, so that clustering results with a certain relevance can be obtained, which provides a precondition for subsequently obtaining the purity information of the second clustering result relative to the first clustering result.

Fig. 7 is a schematic flow chart of a fourth embodiment of the cluster analysis method provided by the present application. As shown in fig. 7, S303 may be implemented by:

S701, judging whether the value of the purity information is greater than or equal to a preset threshold; if yes, go to S702; if not, go to S703.

S702, determining that the clustering model does not need to be updated.

S703, determining that the clustering model needs to be updated.

In the embodiment of the application, since the purity information is used to represent the object clustering change information of the first object set and the second object set, it can be determined whether the clustering model is suitable for clustering analysis of the second object set based on the size relationship between the specific value of the purity information and the preset threshold.

As an example, in the first clustering result and the second clustering result, if the number of objects classified into the same cluster is larger, the value of the purity information of the second clustering result relative to the first clustering result is larger, the object variation is smaller, and the applicability of the preset clustering model to the second object set is higher. That is, if the value of the purity information is greater than or equal to the preset threshold, it is determined that the cluster model does not need to be updated.

As another example, in the first clustering result and the second clustering result, if the number of objects classified into the same cluster is smaller, the value of the purity information of the second clustering result relative to the first clustering result is smaller and the object variation is larger. In this case the preset clustering model is no longer applicable to the second object set, so the clustering model needs to be retrained, that is, updated, and the second object set is then clustered using the updated clustering model. That is, when the value of the purity information is smaller than the preset threshold, it is determined that the clustering model needs to be updated.

For example, in practical applications, the preset threshold may be 0.85, that is, when the value of the purity information is greater than or equal to 0.85, the electronic device may continue to perform cluster analysis on the second object set using the preset clustering model, and when the value of the purity information is less than 0.85, the electronic device determines that the preset clustering model is no longer suitable for the cluster analysis of the second object set.
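The decision in S701 to S703 reduces to a single comparison, sketched below. The function name `needs_update` is hypothetical, and the default threshold of 0.85 is the example value mentioned above rather than a required setting.

```python
def needs_update(purity, threshold=0.85):
    """Return True when the clustering model should be updated (S703),
    and False when it can continue to be used (S702)."""
    return purity < threshold


# A purity exactly at the threshold keeps the current model.
print(needs_update(0.85))
# A purity below the threshold triggers an update.
print(needs_update(0.84))
```

Note that the comparison is strict: a purity equal to the threshold counts as "greater than or equal to" and therefore does not trigger an update.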

Illustratively, referring to fig. 7, after S703, the method may further include the following steps:

S704, sending model update information, wherein the model update information is used for indicating that the clustering model needs to be updated.

In the embodiment of the application, when the electronic device determines that the preset clustering model needs to be updated, model update information can be sent out so as to inform a worker that the clustering model needs to be updated.

S705, when a model updating instruction of the user is received, updating the clustering model according to the clustering information of the clustering model and the attribute information of each object in the second object set.

Optionally, when the user issues a model update instruction, the instruction may direct the electronic device to perform the update process of the clustering model. For example, when the electronic device receives a model update instruction from a user, the class centers of the preset clustering model may be used as the initial class centers (instead of random points), the number of clusters of the preset clustering model is used as the number of clusters of the updated clustering model, and the original clustering model is then iterated according to the attribute information of each object in the second object set, so as to update the original preset clustering model.
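The update procedure above can be sketched as follows, assuming a k-means-style (Lloyd) iteration; the source describes iterating from the stored class centers but does not name the algorithm, so this choice, the function name `update_class_centers`, and the stopping parameters are all assumptions of this example.

```python
import math


def update_class_centers(initial_centers, points, max_iterations=100,
                         tolerance=1e-9):
    """Refine class centers by repeated assign-then-average steps,
    starting from the stored class centers instead of random points."""
    centers = [tuple(c) for c in initial_centers]
    for _ in range(max_iterations):
        # Assignment step: group points by their nearest current center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group;
        # a center with no points stays where it is.
        new_centers = []
        for center, group in zip(centers, groups):
            if group:
                new_centers.append(
                    tuple(sum(dim) / len(group) for dim in zip(*group)))
            else:
                new_centers.append(center)
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tolerance:
            break  # centers no longer move: converged
    return centers


# Hypothetical stored centers and attribute vectors of the second object set.
old_centers = [(0.0, 0.0), (10.0, 10.0)]
second_set = [(1.0, 1.0), (3.0, 1.0), (9.0, 9.0), (11.0, 9.0)]
new_centers = update_class_centers(old_centers, second_set)
print(new_centers)
```

Because the number of initial centers is fixed by the stored model, the number of clusters stays the same after the update, and only the positions of the class centers change.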

Exemplarily, fig. 8 is a schematic diagram of the object cluster distribution after the clustering model is updated in the embodiment of the present application. Referring to the left diagram of fig. 8, the schematic diagram again uses the target object set including nine objects, i.e., u11, u12, u13, u21, u22, u31, u32, u41, and u42, and the initial class centers are class center 1, class center 2, class center 3, and class center 4, respectively.

Optionally, as shown in the left diagram of fig. 8, the electronic device retrains the model using the 4 stored class centers as the initial class centers of the updated model; since the initial class centers are the original class centers, the number of clusters does not change. The right diagram of fig. 8 shows the result after the clustering model converges, and the class centers after convergence are class center 1', class center 2', class center 3', and class center 4', respectively. As shown, after convergence class center 1 has shifted considerably to class center 1', which changes the assignment of u12 and u13, originally assigned to cluster 1, and of u21, originally assigned to cluster 2. If the original definitions of cluster 1, cluster 2, cluster 3, and so on were still used, there would be a large error. Therefore, re-classifying with the updated clustering model improves the accuracy of classification.

In the embodiment of the application, the original class center is used as the initial class center, the algorithm convergence speed is greatly improved, the convergence can be realized only by several iterations, and the model training speed is improved.

According to the cluster analysis method provided by the embodiment of the application, when the value of the purity information is larger than or equal to the preset threshold, it is determined that the cluster model does not need to be updated, when the value of the purity information is smaller than the preset threshold, it is determined that the cluster model needs to be updated, and when it is determined that the cluster model needs to be updated, the cluster model is updated according to the cluster information of the cluster model and the attribute information of each object in the second object set. According to the technical scheme, the time when the clustering model needs to be updated can be determined in time through the purity information and the preset threshold value, object clustering errors are avoided, training cost is reduced, in addition, the preset clustering model is updated when the clustering model needs to be updated, and the accuracy of clustering analysis is improved.

As described in the above embodiments, a purity index that reflects changes relevant to the model is derived from the changes in how objects are classified between the first clustering result and the second clustering result, so that the applicability of the original clustering model can be evaluated in a timely and principled manner. Illustratively, the electronic device may execute the evaluation flow of the clustering model once every preset time period (for example, every day), and send alarm information to the developer of the model only when the value of the purity information is lower than the preset threshold. The developer therefore knows when the clustering model should be replaced, which saves training time and improves development efficiency.
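The periodic evaluation flow can be sketched end to end as follows. The function name `evaluate_snapshots`, the `alert` callback, the label mappings, and the snapshot names are hypothetical; purity is computed as the sum of the largest entry of each row divided by the total, which assumes the standard purity form, and scheduling (for example, once per day) is left to the caller.

```python
def evaluate_snapshots(baseline_labels, snapshots, k, threshold, alert):
    """Compare each later snapshot with the baseline clustering result and
    call alert(message) when the purity falls below the threshold.

    baseline_labels and each snapshot map object ID -> cluster number (1..k).
    """
    results = []
    for name, labels in snapshots:
        # Build the k x k association matrix for objects present in both.
        matrix = [[0] * k for _ in range(k)]
        for obj_id, first_cluster in baseline_labels.items():
            if obj_id in labels:
                matrix[first_cluster - 1][labels[obj_id] - 1] += 1
        total = sum(sum(row) for row in matrix)
        if total == 0:
            continue  # no shared objects, nothing to compare
        purity = sum(max(row) for row in matrix) / total
        results.append((name, purity))
        if purity < threshold:
            alert(f"{name}: purity {purity:.3f} is below {threshold}; "
                  "the clustering model may need to be updated")
    return results


# Hypothetical data: one baseline and two later evaluation runs.
baseline = {"a": 1, "b": 1, "c": 2, "d": 2}
snapshots = [
    ("day 1", {"a": 1, "b": 1, "c": 2, "d": 2}),
    ("day 2", {"a": 1, "b": 2, "c": 1, "d": 2}),
]
alerts = []
results = evaluate_snapshots(baseline, snapshots, k=2, threshold=0.85,
                             alert=alerts.append)
print(results)
print(alerts)
```

Passing the alert mechanism in as a callback keeps the evaluation logic independent of how the developer is actually notified.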

The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.

Fig. 9 is a schematic structural diagram of an embodiment of a cluster analysis apparatus according to the present application. As described with reference to fig. 9, the cluster analysis apparatus may include: an acquisition module 901, a processing module 902 and a determination module 903.

The obtaining module 901 is configured to obtain a first clustering result of a first object set and a second clustering result of a second object set based on a pre-configured clustering model, where the first object set is a target object set of a target system at a first time, and the second object set is a target object set of the target system at a second time;

a processing module 902, configured to determine, according to the first clustering result and the second clustering result, purity information of the second clustering result relative to the first clustering result, where the purity information is used to indicate object clustering change information of the first object set and the second object set;

a determining module 903, configured to determine whether the clustering model needs to be updated according to the purity information.

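In one possible design of the embodiment of the present application, the processing module 902 is specifically configured to:

determine object cluster association information of the first object set and the second object set according to the first clustering result and the second clustering result;

determine an object cluster association matrix according to the object cluster association information, where the rows of the matrix represent the clustering information corresponding to the first clustering result and the columns represent the clustering information corresponding to the second clustering result;

determine the purity information of the second clustering result relative to the first clustering result according to the object cluster association matrix and a preset purity calculation formula.

Optionally, the purity calculation formula is as follows:

$$P = \sum_{i=1}^{k} \frac{m_i}{m}\, p_i, \qquad p_i = \frac{\max_{1 \le j \le k} m_{ij}}{m_i}$$

where the symbols have the same meanings as in the method embodiment.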

In another possible design of the embodiment of the present application, the obtaining module 901 is specifically configured to obtain clustering information of the clustering model, where the clustering information includes: the number of clusters and the class center of each cluster;

the processing module 902 is further configured to perform cluster partitioning on the objects in the first object set based on the number of clusters and the class center of each cluster to obtain the first cluster result, and perform cluster partitioning on the objects in the second object set to obtain the second cluster result.

Optionally, for any one target object set of the first object set and the second object set, the processing module 902 is configured to perform cluster division on the objects in the target object set based on the cluster number and the class center of each cluster to obtain the target clustering result, and specifically:

the processing module 902 is specifically configured to calculate a euclidean distance between each object in the target object set and each class center, and divide each object in the target object set into clusters to which class centers closest to the euclidean distance belong, to obtain a target clustering result corresponding to the target object set.

In another possible design of the embodiment of the present application, the determining module 903 is specifically configured to determine that the clustering model does not need to be updated when the value of the purity information is greater than or equal to a preset threshold, and determine that the clustering model needs to be updated when the value of the purity information is smaller than the preset threshold.

Illustratively, in an embodiment of the present application, the apparatus further includes: a sending module 904;

a sending module 904, configured to send model update information when the determining module 903 determines that the clustering model needs to be updated, where the model update information is used to indicate that the clustering model needs to be updated;

the processing module 902 is further configured to, when a model update instruction of a user is received, update the clustering model according to the clustering information of the clustering model and the attribute information of each object in the second object set.

The apparatus provided in the embodiment of the present application may be used to execute the method in the embodiments shown in fig. 3 to fig. 8, and the implementation principle and the technical effect are similar, which are not described herein again.

It should be noted that the division of the modules of the above apparatus is only a logical division, and the actual implementation may be wholly or partially integrated into one physical entity, or may be physically separated. And these modules can be realized in the form of software called by processing element; or may be implemented entirely in hardware; and part of the modules can be realized in the form of calling software by the processing element, and part of the modules can be realized in the form of hardware. For example, the processing module may be a processing element separately set up, or may be implemented by being integrated in a chip of the apparatus, or may be stored in a memory of the apparatus in the form of program code, and a function of the processing module may be called and executed by a processing element of the apparatus. Other modules are implemented similarly. In addition, all or part of the modules can be integrated together or can be independently realized. The processing element described herein may be an integrated circuit having signal processing capabilities. In implementation, each step of the above method or each module above may be implemented by an integrated logic circuit of hardware in a processor element or an instruction in the form of software.

For example, the above modules may be one or more integrated circuits configured to implement the above methods, such as one or more Application Specific Integrated Circuits (ASICs), one or more digital signal processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs). For another example, when one of the above modules is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a Central Processing Unit (CPU) or another processor that can call program code. As another example, these modules may be integrated together and implemented in the form of a system-on-a-chip (SOC).

In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the application are generated in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.

Fig. 10 is a schematic structural diagram of an electronic device for performing a cluster analysis method according to an embodiment of the present application. As shown in fig. 10, the electronic device may include: the system comprises a processor 1001, a memory 1002, a communication interface 1003 and a system bus 1004, wherein the memory 1002 and the communication interface 1003 are connected with the processor 1001 through the system bus 1004 and are used for achieving mutual communication, the memory 1002 is used for storing computer execution instructions, the communication interface 1003 is used for communicating with other equipment, and the processor 1001 executes the computer execution instructions to achieve the scheme of the embodiment shown in the figures 3 to 8.

In fig. 10, the processor 1001 may be a general-purpose processor, including a central processing unit CPU, a Network Processor (NP), and the like; but also a digital signal processor DSP, an application specific integrated circuit ASIC, a field programmable gate array FPGA or other programmable logic device, discrete gate or transistor logic, discrete hardware components.

The memory 1002 may include a Random Access Memory (RAM), a read-only memory (ROM), and a non-volatile memory, such as at least one disk memory.

The communication interface 1003 is used for communication between the database access device and other devices (e.g., client, read-write library, and read-only library).

The system bus 1004 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The system bus may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.

Optionally, an embodiment of the present application further provides a computer-readable storage medium, where computer instructions are stored, and when the computer instructions are executed on a computer, the computer is caused to execute the method according to the embodiment shown in fig. 3 to 8.

Optionally, an embodiment of the present application further provides a chip for executing instructions, where the chip is configured to execute the method of the embodiments shown in fig. 3 to 8.

Embodiments of the present application further provide a program product, where the program product includes a computer program, where the computer program is stored in a computer-readable storage medium, and the computer program can be read by at least one processor from the computer-readable storage medium, and the at least one processor can implement the method in the embodiments shown in fig. 3 to 8 when executing the computer program.
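As a rough illustration of the kind of computer program such a product might contain, the following minimal sketch compares two clustering results and decides whether the clustering model should be updated. It is not the formula or procedure of the embodiments: the function names `purity` and `needs_update`, the use of the textbook purity measure (the fraction of objects that fall in the dominant first-moment class of their second-moment cluster), the restriction to objects present in both sets, the threshold value 0.8, and the rule that a low purity triggers an update are all assumptions made for illustration.

```python
from collections import Counter


def purity(first_labels, second_labels):
    """Textbook purity of the second clustering relative to the first.

    Both arguments map an object id to its cluster label.
    Assumption: only objects present in both object sets are compared.
    """
    common = first_labels.keys() & second_labels.keys()
    if not common:
        raise ValueError("the two object sets share no objects")

    # For every second-moment cluster, count how its members were
    # labelled at the first moment.
    overlap = {}
    for obj in common:
        counts = overlap.setdefault(second_labels[obj], Counter())
        counts[first_labels[obj]] += 1

    # Each second-moment cluster contributes the size of its largest
    # overlap with any single first-moment cluster.
    matched = sum(max(counts.values()) for counts in overlap.values())
    return matched / len(common)


def needs_update(first_labels, second_labels, threshold=0.8):
    """Hypothetical decision rule: update when purity drops below threshold."""
    return purity(first_labels, second_labels) < threshold


# Hypothetical data: four users clustered at two moments.
first = {"u1": 0, "u2": 0, "u3": 1, "u4": 1}
second = {"u1": 0, "u2": 0, "u3": 0, "u4": 1}
print(purity(first, second))
print(needs_update(first, second))
```

In this sketch, three of the four users stay with the dominant class of their new cluster, so the purity is 0.75 and the assumed threshold of 0.8 flags the model for retraining. A real implementation would take the purity formula and the decision condition from the embodiments themselves.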

In the present application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes the association relationship of the associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone, where A and B may each be singular or plural. The character "/" generally indicates that the preceding and following associated objects are in an "or" relationship; in a formula, the character "/" indicates that the preceding and following associated objects are in a "division" relationship. "At least one of the following" or similar expressions refer to any combination of the listed items, including any combination of single items or plural items.

It is to be understood that the various numerical references used in the embodiments of the present application are merely for descriptive convenience and are not intended to limit the scope of the embodiments of the present application. In the embodiments of the present application, the sequence numbers of the above-mentioned processes do not imply an execution order; the execution order of each process should be determined by its function and inherent logic, and the sequence numbers should not constitute any limitation on the implementation of the embodiments of the present application.

Other embodiments of the present disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims

1. A method of cluster analysis, comprising:

2. The method of claim 1, wherein determining the purity information of the second clustering result relative to the first clustering result based on the first clustering result and the second clustering result comprises:

3. The method of claim 2, wherein the purity calculation is as follows:

4. The method according to any one of claims 1 to 3, wherein the obtaining a first clustering result of a first object set and a second clustering result of a second object set based on a pre-configured clustering model respectively comprises:

5. The method of claim 4, wherein for any one of the first set of objects and the second set of objects, cluster partitioning the objects in the set of objects based on the number of clusters and the class center of each cluster to obtain the target clustering result comprises:

6. The method according to any one of claims 1-3, wherein said determining whether the clustering model needs to be updated according to the purity information comprises:

7. The method of claim 6, wherein when it is determined that the clustering model needs to be updated, the method further comprises:

8. A cluster analysis apparatus, comprising: an acquisition module, a processing module and a determination module;

9. An electronic device comprising a processor, a memory and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any one of claims 1-7 when executing the program.

10. A computer-readable storage medium having computer-executable instructions stored thereon, which when executed by a processor, perform the method of any one of claims 1-7.