CN112541530A - Data preprocessing method and device for clustering model - Google Patents

Data preprocessing method and device for clustering model

Info

Publication number
CN112541530A
Authority
CN
China
Prior art keywords
vector
mapping
node
clustering
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011409579.9A
Other languages
Chinese (zh)
Other versions
CN112541530B (en)
Inventor
熊涛
赵文龙
吴若凡
漆远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202011409579.9A
Publication of CN112541530A
Application granted
Publication of CN112541530B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques
    • G06F18/232: Non-hierarchical techniques
    • G06F18/2321: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213: Non-hierarchical techniques using statistics or function optimisation, with fixed number of clusters, e.g. K-means clustering
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval of structured data, e.g. relational data
    • G06F16/28: Databases characterised by their database models, e.g. relational or object models
    • G06F16/284: Relational databases
    • G06F16/288: Entity relationship models

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of this specification provide a method for preprocessing data for a clustering model and for clustering business entities using an attribute graph. The attribute graph is represented by characterization vectors, and the clustering model is trained, based on information theory, using the information loss incurred in the transfer between the characterization vectors and the prototype vectors of the cluster categories. This information loss is measured by the similarity between a characterization vector and a mapping vector determined from the prototype vectors. Further, substituting empirical probability distributions for the expectations over the overall distributions when determining mutual information yields an empirical approximation of mutual information. The method makes effective use of information theory and thereby provides a more effective way of clustering business entities using an attribute graph.

Description

Data preprocessing method and device for clustering model
Technical Field
One or more embodiments of the present disclosure relate to the field of computer technologies, and in particular, to a method and an apparatus for preprocessing data of a clustering model and clustering business entities by using an attribute map.
Background
With the development of computer technology, the application of graph data is becoming more and more extensive. Graph data is a form of data for describing association relationships among various entities; its visual representations include, for example, relationship networks and knowledge graphs. Graph data typically comprises a plurality of nodes, each corresponding to a business entity. Where the business entities have predefined associations, the corresponding nodes of the graph data may have corresponding association relationships between them. For example, in graph data represented by triples, the triple (a, r, b) represents that node a and node b have the association relationship r; in a visualized relationship network, node a and node b can be connected by an edge corresponding to the association relationship r.
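For illustration, a minimal sketch of how graph data of this kind might be held in memory is given below, assuming PyTorch tensors; the node count, feature dimension, and relationship labels are invented for the example.

```python
import torch

# Illustrative container for graph data G(V, E, X): N nodes, each with a
# d-dimensional feature vector, and edges given as triples (a, r, b).
N, d = 5, 8
X = torch.randn(N, d)                             # per-node attribute feature vectors
triples = [(0, "friend", 1), (1, "transfer", 3)]  # (node a, relationship r, node b)

# Dense adjacency matrix (relation types ignored), as a GCN-style encoder uses it.
A = torch.zeros(N, N)
for a, _, b in triples:
    A[a, b] = A[b, a] = 1.0
```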
An attribute graph is graph data in which each node is described by a plurality of attributes; some of its nodes may carry a large number of discrete attributes. As a result, business processing based on the attribute graph becomes complicated. Therefore, how to perform effective business processing on the attribute graph, especially on graph data containing nodes with discrete attributes, is a problem worth studying.
Disclosure of Invention
One or more embodiments of the present disclosure describe a method and apparatus for data preprocessing for a clustering model and clustering business entities using an attribute map, so as to solve one or more of the problems mentioned in the background art.
According to a first aspect, a data preprocessing method for a clustering model is provided, the clustering model being used for clustering business entities by using an attribute graph, wherein the attribute graph comprises a plurality of nodes corresponding one-to-one to a plurality of business entities, each node has a feature vector determined based on the attributes of the corresponding business entity, the clustering model comprises a coding module, a mapping module and a discrimination module, and the plurality of nodes comprises a first node. The method comprises: processing the attribute graph by using the coding module to obtain characterization vectors corresponding to the respective nodes, wherein the first node corresponds to a first characterization vector; determining, by the mapping module, using the first characterization vector, a first mapping vector that maps the first node to a plurality of cluster categories, wherein the first mapping vector is formed by combining prototype vectors respectively corresponding to the cluster categories, with the combination parameters determined based on the first characterization vector; detecting, based on the discrimination module, the degree of similarity between the first characterization vector and the first mapping vector so as to determine the clustering loss of the clustering model, wherein the degree of similarity between the first characterization vector and the first mapping vector is determined by constructing empirical mutual information based on a discriminant function, with the empirical distributions of the characterization vectors and mapping vectors substituted for the overall distributions, and the clustering loss is negatively correlated with the degree of similarity between the first characterization vector and the first mapping vector; and, with the aim of minimizing the clustering loss, adjusting the model parameters of the coding module, the prototype vectors, and the intermediate vector in the discriminant function of the discrimination module, thereby training the clustering model.
According to one embodiment, the encoding module is a graph neural network, and the first characterization vector is determined based on a fusion result of the feature vector of the first node and feature vectors of neighboring nodes.
According to one embodiment, the first mapping vector is determined by: determining each importance coefficient corresponding to each prototype vector based on the first characterization vector and each prototype vector; and combining the prototype vectors in a weighted summation mode according to the combination parameters determined by the importance coefficients to obtain the first mapping vector.
According to one embodiment, each importance coefficient is determined based on an attention mechanism, each prototype vector comprises a first prototype vector, and the first importance coefficient corresponding to the first prototype vector is positively correlated with the similarity of the first prototype vector and the first token vector and negatively correlated with the sum of the similarities of each prototype vector and the first token vector.
According to one embodiment, said detecting, based on said discrimination module, a degree of similarity of said first characterization vector to said first mapping vector comprises: determining a similarity of the first token vector and the first mapping vector based on a product of the first token vector, an intermediate vector of the discriminant function, and the first mapping vector.
According to one embodiment, the cluster loss is also positively correlated with the degree of similarity between the first token vector and other mapping vectors corresponding to other nodes.
According to one embodiment, said detecting, based on said discrimination module, a degree of similarity of said first characterization vector to said first mapping vector comprises: updating the first token vector with a weighting vector of the first token vector and the first mapping vector; and detecting the similarity degree of the updated first characterization vector and the first mapping vector based on the judging module.
According to a second aspect, a data preprocessing method for a clustering model is provided, the clustering model being used for clustering business entities by using an attribute graph, wherein the attribute graph comprises a plurality of nodes corresponding one-to-one to a plurality of business entities, each node has a feature vector determined based on the attributes of the corresponding business entity, the clustering model comprises a coding module, a mapping module and a discrimination module, and the plurality of nodes comprises a first node. The method comprises: processing the attribute graph by using the coding module to obtain characterization vectors corresponding to the respective nodes, wherein the first node corresponds to a first characterization vector; determining the coding loss of the coding module based on the degree of similarity between the first characterization vector and a first feature vector corresponding to the first node; adjusting the model parameters of the coding module with the aim of minimizing the coding loss; processing the attribute graph by using the coding module with adjusted model parameters to obtain a third characterization vector corresponding to the first node; determining, by the mapping module, using the third characterization vector, a first mapping vector that maps the first node to a plurality of cluster categories, wherein the first mapping vector is formed by combining prototype vectors respectively corresponding to the cluster categories, with the combination parameters determined based on the third characterization vector; detecting, based on the discrimination module, the degree of similarity between the third characterization vector and the first mapping vector so as to determine the clustering loss of the clustering model, wherein the degree of similarity between the third characterization vector and the first mapping vector is determined by constructing empirical mutual information based on a discriminant function, with the empirical distributions of the characterization vectors and mapping vectors substituted for the overall distributions, and the clustering loss is negatively correlated with the degree of similarity between the third characterization vector and the first mapping vector; and, with the aim of minimizing the clustering loss, adjusting the prototype vectors and the intermediate vector in the discriminant function, thereby training the mapping module and the discrimination module.
According to one embodiment, the encoding module is a graph neural network, and the characterization vector of the first node is determined based on a fusion result of the feature vector of the first node and the feature vectors of the neighboring nodes.
According to one embodiment, the degree of similarity between the first characterization vector and the first feature vector corresponding to the first node is measured via a first discriminant function, based on the product of the first characterization vector, the intermediate vector of the first discriminant function, and the first feature vector.
According to one embodiment, the cluster loss is also positively correlated with the degree of similarity between the first token vector and other mapping vectors corresponding to other nodes.
According to one embodiment, the attribute map corresponds to a variation map with randomly adjusted feature vectors, the variation map has a second node corresponding to the first node, and the second node corresponds to a second characterization vector obtained by processing the variation map through the encoding module; the coding loss further comprises a degree of similarity between the first feature vector and the second token vector.
According to a third aspect, a method for clustering business entities is provided, which is used for clustering business entities by using an attribute graph through a pre-trained clustering model, wherein the attribute graph comprises a plurality of nodes in one-to-one correspondence with a plurality of business entities, each node has a feature vector determined based on attributes of the corresponding business entities, and the clustering model comprises a coding module, a mapping module and a judging module; the method comprises the following steps: processing the attribute graph by using the coding module to obtain each characterization vector corresponding to each node; respectively determining mapping vectors obtained by mapping each node to a plurality of clustering categories by using each characterization vector through the mapping module, wherein a single mapping vector is formed by combining prototype vectors respectively corresponding to each clustering category, and combination parameters are determined based on corresponding characterization vectors; detecting the similarity degree of the characterization vector of the first node and the mapping vector of the second node based on the discrimination module; and under the condition that the similarity degree meets a preset condition, determining that the service entity corresponding to the first node and the service entity corresponding to the second node belong to the same clustering class.
According to a fourth aspect, a data preprocessing apparatus for a clustering model is provided, where the clustering model is configured to perform service entity clustering by using an attribute map, where the attribute map includes a plurality of nodes in one-to-one correspondence with a plurality of service entities, each node has a feature vector determined based on an attribute of a corresponding service entity, the clustering model includes a coding module, a mapping module, and a discrimination module, and the plurality of nodes includes a first node; the device comprises:
the coding unit is configured to process the attribute graph by using the coding module to obtain each characterization vector corresponding to each node, and the first node corresponds to the first characterization vector;
the mapping unit is configured to determine, by the mapping module, a first mapping vector for mapping the first node to a plurality of cluster categories by using the first characterization vector, wherein the first mapping vector is formed by combining prototype vectors respectively corresponding to the cluster categories, and the combination parameter is determined based on the first characterization vector;
a discrimination unit configured to detect a similarity degree between the first token vector and the first mapping vector based on the discrimination module, so as to determine a clustering loss of the clustering model, wherein the similarity degree between the first token vector and the first mapping vector is determined by constructing empirical mutual information based on a discriminant function by replacing an overall distribution with an empirical distribution of token vectors and mapping vectors, and the clustering loss is inversely related to the similarity degree between the first token vector and the first mapping vector;
and the adjusting unit is configured to adjust the model parameters of the coding module, each prototype vector and the intermediate vector in the discriminant function in the discriminant module to minimize the clustering loss, so as to train the clustering model.
According to a fifth aspect, a data preprocessing apparatus for a clustering model is provided, where the clustering model is configured to perform business entity clustering by using an attribute graph, where the attribute graph includes a plurality of nodes in one-to-one correspondence with a plurality of business entities, each node has a feature vector determined based on an attribute of the corresponding business entity, the clustering model includes a coding module, a mapping module, and a discrimination module, and the plurality of nodes includes a first node; the device comprises:
the coding unit is configured to process the attribute graph by using the coding module to obtain each characterization vector corresponding to each node, and the first node corresponds to the first characterization vector;
a first judging unit configured to determine a coding loss of the coding module based on a similarity degree between the first characterization vector and a first feature vector corresponding to the first node;
a first adjusting unit configured to adjust a model parameter of the encoding module with a goal of minimizing the encoding loss;
the encoding unit is further configured to process the attribute graph by using the coding module with adjusted model parameters to obtain a third characterization vector corresponding to the first node;
the mapping unit is configured to determine, through the mapping module, using the third characterization vector, a first mapping vector for mapping the first node to a plurality of cluster categories, wherein the first mapping vector is formed by combining prototype vectors respectively corresponding to the cluster categories, and the combination parameters are determined based on the third characterization vector;
a second judging unit, configured to detect the degree of similarity between the third characterization vector and the first mapping vector based on the judging module, so as to determine the clustering loss of the clustering model, wherein the degree of similarity between the third characterization vector and the first mapping vector is determined by constructing empirical mutual information based on a discriminant function, with the empirical distributions of the characterization vectors and mapping vectors substituted for the overall distributions, and the clustering loss is negatively correlated with the degree of similarity between the third characterization vector and the first mapping vector;
and the second adjusting unit is configured to adjust each prototype vector and the intermediate vector in the discriminant function by taking the minimization of the clustering loss as a target, so as to train the mapping module and the discriminant module.
According to a sixth aspect, there is provided an apparatus for clustering service entities, configured to perform service entity clustering by using an attribute graph through a pre-trained clustering model, where the attribute graph includes a plurality of nodes in one-to-one correspondence with a plurality of service entities, each node has a feature vector determined based on an attribute of a corresponding service entity, and the clustering model includes a coding module, a mapping module, and a discrimination module; the device comprises:
the coding unit is configured to process the attribute graph by using the coding module to obtain each characterization vector corresponding to each node;
the mapping unit is configured to respectively determine mapping vectors obtained by mapping each node to a plurality of clustering categories by using each characterization vector through the mapping module, wherein a single mapping vector is formed by combining prototype vectors respectively corresponding to each clustering category, and combination parameters are determined based on corresponding characterization vectors;
the judging unit is configured to detect the similarity degree of the characterization vector of the first node and the mapping vector of the second node based on the judging module;
and the determining unit is configured to determine that the service entity corresponding to the first node and the service entity corresponding to the second node belong to the same cluster category when the similarity meets a predetermined condition.
According to a seventh aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of the first, second or third aspect.
According to an eighth aspect, there is provided a computing device comprising a memory and a processor, wherein the memory has stored therein executable code, and the processor, when executing the executable code, implements the method of the first, second or third aspect.
The method and apparatus provided by the embodiments of this specification are based on information theory: the attribute graph is represented by characterization vectors, and the clustering model is trained using the information loss incurred in the transfer between the characterization vectors and the prototype vectors of the cluster categories. Further, substituting empirical probability distributions for the expectations over the overall distributions when determining mutual information yields an empirical approximation of mutual information. The method thus makes effective use of information theory and provides a more effective way of clustering business entities using an attribute graph.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings described below are only some embodiments of the present invention, and that those skilled in the art can obtain other drawings based on them without creative effort.
FIG. 1 is a schematic diagram of a specific implementation architecture under the technical concept of the present specification;
FIG. 2 is a diagram illustrating a specific architecture of a clustering model under the technical concept of the present specification;
FIG. 3 illustrates a flow diagram of a method of data pre-processing for a clustering model, according to one embodiment;
FIG. 4 shows a flow diagram of a method of data pre-processing for a clustering model according to another embodiment;
FIG. 5 illustrates a flow diagram of a method of clustering for business entities, according to another embodiment;
FIG. 6 shows a schematic block diagram of an apparatus for data pre-processing of a clustering model according to one embodiment;
FIG. 7 shows a schematic block diagram of an apparatus for data pre-processing of a clustering model according to another embodiment;
FIG. 8 shows a schematic block diagram of an apparatus for clustering business entities according to another embodiment.
Detailed Description
The scheme provided by the specification is described below with reference to the accompanying drawings.
First, a description is given with reference to the embodiment shown in fig. 1, which depicts a specific implementation scenario for node clustering based on graph data. In this implementation scenario, the computing platform may obtain pre-constructed graph data and then cluster the nodes in the graph data through a clustering model for graph data.
Each entity corresponding to each node of the graph data is associated with a specific service scenario. In the case that a specific service scenario is related to a user, such as community discovery, user grouping, etc., each service entity corresponding to each node in the graph data may be a user, for example. In a specific scenario of a paper classification, a social platform article classification, and the like, each business entity corresponding to each node in the graph data may be an article, for example. In other specific service scenarios, the service entity corresponding to the graph data may also be any other reasonable entity, which is not limited herein.
In graph data, the entity corresponding to a single node may have various attributes on which clustering depends; such graph data may therefore also be referred to as an attribute graph. For example, a business entity that is a user may correspond to attributes such as age, income, and place of residence. A business entity that is an article may correspond to attributes such as keywords, the field to which the article belongs, and article sections. In an optional embodiment, each pair of nodes having an association relationship may further have association attributes, which may serve as edge attributes of the corresponding connecting edge. For example, users associated through social behavior may have social attributes between them (such as chat frequency, transfer behavior, red packet behavior, etc.), i.e., association attributes between the two corresponding nodes, which may be the edge attributes of the connecting edge between them.
The computing platform may process the above attribute graph through a clustering model to classify the nodes into a plurality of categories. As shown in fig. 1, the business entities corresponding to nodes X1, X3, X8, X9, … fall into one cluster category, the business entities corresponding to nodes X2, X4, X6, X7, … fall into another cluster category, and so on.
When the business entities corresponding to the nodes have discrete attributes, clustering directly on node expression vectors determined from those attributes can hurt the accuracy of the clustering results. For this reason, under the implementation architecture of this specification, the clustering model is divided into two parts: characterization of the business entities, and clustering performed on that characterization. To this end, this specification provides a clustering model comprising the encoding module, the mapping module, and the discrimination module shown in fig. 1. The encoding module may be implemented as an encoder for determining the characterization vector of each node. The characterization vector can describe the corresponding node more fully and accurately, and can be seen as a hidden representation that makes the nodes easier to cluster well. The mapping module maps the characterization vectors onto the cluster categories and determines the corresponding mapping vectors. The discrimination module clusters the nodes based on discriminating the degree of similarity between characterization vectors and mapping vectors.
For ease of understanding, the technical idea process of the present specification is described below in conjunction with the model architecture shown in fig. 2.
Suppose a single node extracts features from its attributes and the resulting feature vector is denoted X, with each dimension of X denoted x_i. The graph data at this point is the "original graph", G(V, E, X) in the input graph data shown in fig. 2. Suppose each sample is biased toward some cluster category, and the cluster category onto which a single feature x_i maps is denoted y_i. Assuming there are K cluster categories, y_i takes values among the K cluster categories. According to information theory, the mutual information of two random variables is the KL divergence between their joint distribution and the product of their marginal distributions. The larger the divergence, the larger the difference between the product of the marginals and the joint distribution, implying a high correlation between the two random variables. For example, taking the sample feature vector X and the cluster category Y as the two random variables, the information transfer from X to Y can be expressed, taking KL divergence as an example, as:
$$I(X;Y) = D_{KL}\big(P_{XY}\,\|\,P_X P_Y\big) \qquad (1)$$
those skilled in the art will readily appreciate that in the processing of data by a model, it may be viewed as the transfer of information between two variables, input and output. This information transfer may be described in terms of mutual information. When the mutual information is maximum, the information loss may be considered to be 0. Therefore, in the clustering process, it is desirable to maximize mutual information between the input and output. To obtain an empirical estimate, p (X) is approximated by the empirical distribution of X, and a clustering model p is proposedθ(y | x). The mutual experience information thus generated
Figure BDA0002819259420000101
Can be as follows:
Figure BDA0002819259420000102
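As an illustration, the empirical mutual information of formula (2) can be computed directly from the soft assignments p_θ(y|x_i). The sketch below assumes an (N, K) matrix of assignment probabilities; it is an illustrative rendering, not the patent's reference implementation.

```python
import torch

def empirical_mutual_information(p_y_given_x: torch.Tensor) -> torch.Tensor:
    """Approximate I(X;Y) using the empirical distribution of X.

    p_y_given_x: (N, K) soft cluster assignments p_theta(y | x_i).
    """
    p_y = p_y_given_x.mean(dim=0)                  # marginal p(y) under the empirical p(x)
    ratio = p_y_given_x / p_y.clamp_min(1e-12)     # p(y|x) / p(y)
    return (p_y_given_x * ratio.clamp_min(1e-12).log()).sum(dim=1).mean()

# Example: 4 samples softly assigned to K = 3 cluster categories.
p = torch.softmax(torch.randn(4, 3), dim=1)
print(empirical_mutual_information(p))
```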
to better characterize the nodes, a characterization vector Z of the nodes is defined, which may be the result of the encoding of the respective feature. Suppose that a code pattern epsilon is usedθThen, there is a relationship Z ═ epsilon between the token vector Z and the feature vector Xθ(X). In this way, discrete features can be fused in an encoded manner. The weaves thereinThe coding method may be any suitable coding method, such as the GCN coding method shown in fig. 2, or the embedding method.
Taking a graph neural network as an example, the feature vectors of a node and its neighbor nodes can be fused during encoding, so as to obtain an encoding vector that expresses each node more comprehensively. A single-layer graph neural network can update a node's expression vector from the fusion of its own current expression vector and those of its neighbor nodes. The neighbor nodes of a single node may be the nodes connected to it by connecting edges. The neighbor nodes used in encoding may be predefined, such as neighbor nodes within the 3rd order, or a random predetermined number (e.g., 5) of first-order neighbor nodes, etc. The fusion of the current expression vectors (initially, the feature vectors) of the node itself and its neighbors may be, for example, one of summing, averaging, weighted summing, taking a maximum, and the like. Taking weighted summation as an example, the weights may be determined in various ways, e.g., negatively correlated with node degree (the number of first-order neighbors), or as importance values determined by an attention mechanism. In a specific example, for a node v, after fusing the neighbors' current expression vectors, a single-layer graph neural network (layer l) may update its node expression vector as:

$$h_v^{(l)} = \sigma\left(W^{(l)} \sum_{u \in N_v \cup \{v\}} \frac{h_u^{(l-1)}}{\sqrt{d_u\,d_v}}\right) \qquad (3)$$

where W is the model parameter of the l-th layer, d denotes node degree, N_v is the neighbor set of node v, and h is the current node expression vector determined by layer l-1; when l = 1, it is the feature vector of the corresponding node. Through the processing of several layers of graph neural networks, a corresponding characterization vector can be obtained for each node in the original graph, denoted for example ε_θ(X) in fig. 2.
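A fusion layer of the form of formula (3) can be sketched as follows, assuming a dense adjacency matrix and degree-based normalization; this is a simplified illustration rather than the patent's exact implementation.

```python
import torch

def gcn_layer(H: torch.Tensor, A: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    """One degree-normalized fusion layer in the spirit of formula (3).

    H: (N, d_in) current node expression vectors (feature vectors when l = 1).
    A: (N, N) adjacency matrix; self-loops are added so each node fuses itself.
    W: (d_in, d_out) layer-l model parameters.
    """
    A_hat = A + torch.eye(A.size(0))          # include the node itself in the fusion
    deg = A_hat.sum(dim=1)                    # node degrees d_v
    D_inv_sqrt = torch.diag(deg.pow(-0.5))    # weights negatively correlated with degree
    return torch.relu(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)
```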
According to the information-theoretic principle, suppose each cluster category corresponds to a vector, called a prototype vector, denoted for example μ, and that the characterization vector Z is randomly quantized onto the prototype vectors, i.e., one of the categories must be selected, whose mutual information should be maximized. The mapping vector obtained by randomly quantizing Z onto the prototype vectors is denoted Q*(Z); the mutual information of Z being mapped onto the cluster categories through the clustering model can then be recorded as:

$$I\big(Z,\,Q^*(Z)\big) \qquad (4)$$
The mapping vector Q*(Z) can be associated with each prototype vector μ_j, and this association can be determined in various reasonable ways. As one specific example, it may be determined using an attention mechanism. For example, let the attention value of each prototype vector be:

$$a_j = \frac{\exp(z^{\top}\mu_j)}{\sum_{k=1}^{K}\exp(z^{\top}\mu_k)} \qquad (5)$$

where z is a specific instance of the random variable Z, such as the characterization vector of a specific node. The attention value represents the importance of the corresponding cluster category relative to a single node and may also be referred to as an importance value. Further, the prototype vectors may be fused according to the importance values to obtain the mapping vector, for example by weighting with the importance values as weights:

$$Q_{att}(z,\mu) = \sum_{j=1}^{K} a_j\,\mu_j \qquad (6)$$

It should be noted that Q_att(z, μ) is only one specific instance of Q*; it may be defined in any reasonable manner and is not limited to this form. Here z is a specific value of the random variable Z, for example the characterization vector of node a.
The prototype vector mu can be used as a model parameter, can be randomly defined initially, and can be continuously adjusted in the mutual information maximization process.
In the above formula, K is the number of cluster categories. In common clustering methods such as K-means, the number of cluster categories is usually easy to determine. For information-based clustering methods, the complexity of the cluster embedding is usually penalized following the Regularized Information Maximization (RIM) principle, with the goal of minimizing the clustering loss. For example, one way to determine a clustering loss associated with the number of cluster categories is:

$$L_{rim}(K) = -\hat{I}\big(Z,\,Q^*(Z)\big) + \lambda\, g(K) \qquad (7)$$

where g is a non-negative regularization function and λ a weighting coefficient, both of which can be set according to the actual situation. The value of K within a predetermined range that minimizes L_rim is selected as the number of cluster categories.
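A sketch of this selection procedure follows, under the illustrative assumptions that g is instantiated as g(K) = K and that a callable returns the trained clustering loss for each candidate K.

```python
def select_k(candidate_ks, cluster_loss_for_k, lam=0.1):
    """Pick the number of cluster categories K minimizing the regularized loss.

    cluster_loss_for_k: callable returning the (negated) empirical mutual
    information for a given K; lam * g(K) penalizes cluster-embedding
    complexity, here with the illustrative choice g(K) = K.
    """
    losses = {k: cluster_loss_for_k(k) + lam * float(k) for k in candidate_ks}
    return min(losses, key=losses.get)

# Example with a stand-in loss (a real run would retrain the model per K).
print(select_k(range(2, 8), lambda k: 1.0 / k))
```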
Those skilled in the art will appreciate that maximizing an exact estimate of mutual information may learn entangled and useless representations. Therefore, the technical concept of this specification provides an approximate way of determining mutual information, which further generalizes the mutual-information principle to allow more flexible information-preservation measures. Since information preservation is usually an f-divergence between two distributions, of which mutual information is one choice, mutual information can be generalized to f-divergences and recorded as f-information. f-information generalizes the measure of distribution difference from the KL divergence to the f-divergence, which may be chosen among the commonly used divergences, such as the Kullback-Leibler divergence, the chi-square divergence, the Jensen-Shannon divergence, and so on. Thus, for two random variables X, Y, the generalized f-information can be, for example: I_f(X,Y) = D_f(P_XY ‖ P_X P_Y). Accordingly, the mutual information between the characterization vector Z and the mapping vector Q*(Z) is generalized as:

$$I_f\big(Z,\,Q^*(Z)\big) = \sup_{\phi}\; \mathbb{E}_{(z,z')\sim P_{Z\,Q^*(Z)}}\big[D_\phi(z,z')\big] - \mathbb{E}_{z\sim P_Z,\; z'\sim P_{Q^*(Z)}}\big[f^*\big(D_\phi(z,z')\big)\big] \qquad (8)$$

where sup denotes the supremum, D_φ is a discriminant function, for example defined as D_φ(h, h') = σ(h^T φ h'), which determines, via the intermediate vector φ, the degree of similarity of the distributions of two other vectors (e.g., h and h'), and f* is the convex conjugate of f. The first expectation is taken with Z and Q*(Z) following their joint distribution, and the second with Z following the overall distribution P_Z and Z' independently following the overall distribution P_{Q*(Z)}.
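The bilinear discriminant and the empirical form of the bound in formula (8) can be sketched as follows; the use of softplus for the convex conjugate f* is only a placeholder assumption, and shuffled pairings stand in for sampling from the product of the marginals.

```python
import torch
import torch.nn.functional as F

def d_phi(h: torch.Tensor, h2: torch.Tensor, Phi: torch.Tensor) -> torch.Tensor:
    """Bilinear discriminant D_phi(h, h') = sigma(h^T Phi h'), batched over rows."""
    return torch.sigmoid((h @ Phi * h2).sum(dim=-1))

def f_information_estimate(Z, Q, Phi, f_star=F.softplus):
    """Empirical lower-bound estimate of I_f(Z, Q*(Z)) per formula (8).

    Joint term: each characterization vector paired with its own mapping
    vector. Product-of-marginals term: characterization vectors paired with
    shuffled (other nodes') mapping vectors; f_star is a placeholder for the
    convex conjugate of f.
    """
    joint = d_phi(Z, Q, Phi).mean()
    Q_neg = Q[torch.randperm(Q.size(0))]       # empirical product of marginals
    marginal = f_star(d_phi(Z, Q_neg, Phi)).mean()
    return joint - marginal
```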
During model training, μ and φ are model parameters to be adjusted. Since the mathematical expectations over the overall distributions in the above equation are not available, the mutual information is not easy to compute directly. In order to obtain a sufficiently close approximation, this specification proposes to replace the overall distributions (such as P_Z, P_{Q*(Z)}) with empirical distributions, thereby obtaining an approximate representation of the mutual information.
Further, the distribution of a node's own characterization vector and its own mapping vector should be as consistent as possible, i.e., the mutual information determined by the discriminant function should be as large as possible, whereas the distributions of a node's characterization vector and the mapping vectors of other nodes should be as inconsistent as possible. Accordingly, the model loss can be determined by constructing samples from the nodes: it may, for example, be negatively correlated with the mutual information of a node with itself and positively correlated with the mutual information between different nodes. In reducing the model loss, the mutual information of each node with itself then tends to be maximized, and the mutual information between different nodes tends to be minimized.
As a specific example, the model loss may be:

$$L_{cluster} = -\frac{1}{N}\sum_{i=1}^{N} D_\phi\!\Big(\varepsilon_\theta(x_i^{+}),\, Q^*\big(\varepsilon_\theta(x_i^{+})\big)\Big) + \frac{1}{M}\sum_{(x,\,x^{-})\in C} f^*\!\Big(D_\phi\big(\varepsilon_\theta(x),\, Q^*(\varepsilon_\theta(x^{-}))\big)\Big) \qquad (9)$$

where ε_θ(x) is one encoding definition of Z, and Q*(·) is determined from the prototype vectors μ. N is, for example, the number of positive samples, M is the number of negative samples, and C represents the set of negative sample pairs. In one embodiment, M is a positive integer (e.g., 20), indicating that, for a given node, M nodes different from it are selected as negative samples. ε_θ(x⁺) represents the characterization vector of the current node, and Q*(ε_θ(x⁻)) may represent the mapping vector of a negative sample selected for that node.
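A loss of the shape of formula (9) might be sketched as follows, assuming the attention-based mapping of formulas (5) and (6) and the bilinear discriminant above; softplus again stands in for the f* term.

```python
import torch
import torch.nn.functional as F

def cluster_loss(Z, mu, Phi, M=20):
    """Sketch of a clustering loss in the spirit of formula (9).

    Negatively correlated with the same-node similarity D_phi(z, Q*(z)) and
    positively correlated with the similarity between each node's
    characterization vector and the mapping vectors of M other nodes.
    """
    attn = torch.softmax(Z @ mu.t(), dim=1)       # (N, K) importance values
    Q = attn @ mu                                 # (N, d) mapping vectors
    pos = torch.sigmoid((Z @ Phi * Q).sum(-1)).mean()   # same-node (positive) pairs
    neg = 0.0
    for _ in range(M):                            # M negative pairings per node
        # Approximate: a random permutation may rarely pair a node with itself.
        Q_neg = Q[torch.randperm(Q.size(0))]
        neg = neg + F.softplus(torch.sigmoid((Z @ Phi * Q_neg).sum(-1))).mean() / M
    return -(pos - neg)                           # minimizing maximizes f-information
```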
The model loss may also be determined in other manners, which are not described here again. The model parameters can then be adjusted in the direction that reduces the model loss, thereby training the clustering model. For example, the gradient of the model loss with respect to each model parameter can be computed, and the parameters adjusted via gradient descent, Newton's method, or the like.
Under the implementation framework of this specification, the characterization part (e.g., corresponding to the encoding module) and the clustering part (e.g., corresponding to the mapping module and the discrimination module) may be trained jointly (i.e., their model parameters adjusted together) or trained separately. In training the clustering part, the goal may be to minimize the information loss of mapping the characterization vectors obtained by the characterization part onto the cluster categories, or equivalently to maximize the mutual information. When the parameters of the characterization part are substituted into the clustering module, the model parameters of both parts, e.g., ε_θ and μ, may be adjusted simultaneously. In some possible designs, the characterization part may also measure its model loss alone so as to adjust its own parameters, which are then substituted into the clustering part so that the clustering part adjusts its parameters.
As an example, the encoding module of the characterization part may determine the characterization vectors based on Deep Graph Infomax (DGI). DGI is typically implemented by a mutual-information-based graph convolutional neural network, which can encode the node feature vectors, in an unsupervised manner, into hidden representations, i.e., characterization vectors, better suited to clustering. For DGI, the coding loss can be considered directly, for example:

$$L_{dgi} = -\frac{1}{N+M}\left(\sum_{i=1}^{N}\log D_\psi\big(\varepsilon_\theta(x_i),\,R(X)\big) + \sum_{j=1}^{M}\log\Big(1 - D_\psi\big(\varepsilon_\theta(\tilde{x}_j),\,R(X)\big)\Big)\right) \qquad (10)$$

where R is a readout operator that operates on a subset of X and outputs a vector of predetermined dimension, constrained by a discriminant function D_ψ parameterized by ψ. The first term in the equation corresponds to the "+" (positive sample) term in fig. 2, and the second term to the "-" (negative sample) term. The negative sample term can be constructed from a given node and nodes different from it; for example, for node v, M paired nodes are randomly selected from nodes other than v to construct negative samples x̃. It is understood that, for node v, x may be understood as a specific value of the random variable X.
According to one embodiment, negative samples may be constructed using the input graph data. As shown in fig. 2, the node features can be randomly shuffled without disturbing the connection structure; the graph data resulting from this change is called a change graph, e.g., G̃(V, E, X̃). Negative samples are then constructed by sampling from the node set of the change graph G̃: for node v, the node paired with v in the node set of G̃ is taken to construct a negative sample x̃. The number of negative samples is, for example, M.
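For illustration, the change-graph construction and a coding loss of the shape of formula (10) might look as follows, under the assumption that the readout R(X) is a simple mean over the characterization vectors.

```python
import torch

def change_graph_features(X: torch.Tensor) -> torch.Tensor:
    """Randomly permute the node feature rows while keeping the edge
    structure intact, yielding the change graph's features X-tilde."""
    return X[torch.randperm(X.size(0))]

def dgi_style_loss(Z: torch.Tensor, Z_neg: torch.Tensor, Psi: torch.Tensor) -> torch.Tensor:
    """Coding loss in the spirit of formula (10): positive ('+') pairs use the
    original graph's characterization vectors, negative ('-') pairs the change
    graph's; s = R(X) is taken here as the mean readout vector."""
    s = Z.mean(dim=0)
    pos = torch.sigmoid(Z @ Psi @ s)          # D_psi(z_i, s)
    neg = torch.sigmoid(Z_neg @ Psi @ s)      # D_psi(z~_j, s)
    return -(pos.clamp_min(1e-12).log().mean()
             + (1 - neg).clamp_min(1e-12).log().mean())
```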
In one embodiment, the model parameters of the encoding module may be adjusted based on the coding loss L_dgi, so as to determine the parameters of the encoder structure ε_θ. For example, the gradient of L_dgi with respect to ε_θ is computed, and the parameters are adjusted in the direction that decreases L_dgi by gradient descent, Newton's method, or the like. Once the parameters of the encoder ε_θ are well adjusted (e.g., tend to converge), they can be substituted into the training of the subsequent part. At that point, in the clustering loss L_cluster, ε_θ is determined and the only undetermined parameter is μ; μ can then be adjusted in the direction that decreases L_cluster, so as to train the mapping module and discrimination module of the clustering part.
In an alternative implementation, to improve the encoding module so that it better produces characterization vectors favorable for clustering, feedback from the mapping vectors can be added to the encoding module. For example, the characterization vector of the encoding result is modified to:

$$z \leftarrow (1-\epsilon)\,\varepsilon_\theta(x) + \epsilon\,Q^*(z) \qquad (11)$$

where ε is a predetermined hyperparameter, typically with a value between 0 and 1, e.g., 0.1. The improved encoding module can balance between the characterization part and the clustering part; ε may be understood as a balance weight parameter. Applying this improved encoding result to the foregoing process yields a clustering model based on deep graph representations and information theory, e.g., called Cluster-Aware Deep Graph Infomax (CADGI).
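The feedback update of formula (11) reduces to a one-line weighted mix; eps is the balance weight hyperparameter.

```python
import torch

def cluster_aware_update(z: torch.Tensor, q: torch.Tensor, eps: float = 0.1) -> torch.Tensor:
    """Formula (11): mix the encoding result z with its mapping vector q,
    with eps balancing the characterization and clustering parts."""
    return (1 - eps) * z + eps * q

z = torch.randn(16)     # characterization vector eps_theta(x)
q = torch.randn(16)     # mapping vector Q*(z), e.g. from the attention fusion
print(cluster_aware_update(z, q).shape)   # torch.Size([16])
```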
Based on the above theory, this specification provides a method for preprocessing data for a clustering model and for clustering business entities using an attribute graph. The attribute graph may include a plurality of nodes corresponding one-to-one to a plurality of business entities. Each node has a feature vector determined based on the attributes of the corresponding business entity. The attributes of a business entity are determined by actual business requirements; for example, where the business entity is a user, the corresponding attributes may include gender, action trajectory, and the like, and where the business entity is an article, the corresponding attributes may include keywords, the field to which it belongs, and the like. The values in the dimensions of the feature vector may quantitatively describe the attributes. Based on the theory described above, the clustering model proposed in this specification may include an encoding module, a mapping module, and a discrimination module.
FIG. 3 is a schematic flow diagram of data preprocessing for a clustering model according to one embodiment. The execution subject of the flow may be a device, apparatus, or server with certain computing power, such as the computing platform shown in fig. 1.
As shown in fig. 3, taking a first node of the plurality of nodes in the attribute map as an example, the process includes the following steps: step 301, processing the attribute map by using an encoding module to obtain each characterization vector corresponding to each node, wherein a first node corresponds to a first characterization vector; step 302, determining a first mapping vector for mapping a first node to a plurality of cluster categories by using the first characterization vector through a mapping module, wherein the first mapping vector is formed by combining prototype vectors respectively corresponding to the cluster categories, and combination parameters are determined based on the first characterization vector; step 303, based on the discrimination module, detecting a similarity degree between the first characterization vector and the first mapping vector, so as to determine a clustering loss of the clustering model, wherein the similarity degree between the first characterization vector and the first mapping vector is determined by substituting an empirical distribution of the characterization vector and the mapping vector for a total distribution, so that an empirical mutual information is constructed based on the discrimination function, and the clustering loss is negatively correlated with the similarity degree between the first characterization vector and the first mapping vector; and step 304, aiming at minimizing the clustering loss, adjusting the model parameters of the coding module, each prototype vector and the intermediate vector in the discrimination function in the discrimination module, thereby training the clustering model.
First, in step 301, the attribute graph is processed by the encoding module to obtain the characterization vectors corresponding to the respective nodes, where the first node corresponds to the first characterization vector. The encoding module here can be any reasonable model, such as a convolutional neural network or a graph neural network. Where the encoding module is a graph neural network, the characterization vector of each node may be determined based on the fusion result of the feature vector of the corresponding node and the feature vectors of its neighboring nodes. For example, the first characterization vector is determined based on the fusion of the feature vector of the first node and the feature vectors of its neighbors. The fusion of the feature vectors of a single node and its neighbors may be done, for example, in the manner given by formula (3), which is not repeated here. The relationship between the characterization vector Z and the feature vector X may be noted, for example, as Z = ε_θ(X).
Next, a first mapping vector for mapping the first node to the plurality of cluster categories is determined by the mapping module using the first token vector, via step 302. It will be appreciated that, in accordance with the principles of information theory described hereinbefore, the effect of each feature (as used herein in the token vector) on a single cluster class is considered as a transfer of information for a single node. The total information transfer between the token vector of a node and each cluster category can be represented by the mutual information between the mapping vector obtained by mapping the token vector to each cluster category and the token vector. And the mapping vector is determined based on the token vector and may be associated with the token vector.
Assuming the nodes are clustered into K cluster categories, each cluster category may correspond to a prototype vector; e.g., the prototype vector of the i-th cluster category may be denoted μ_i. For a single node, since certain information may be transferred to each cluster category, it is desirable that the information mapped from the node's characterization vector to the cluster categories is consistent with the characterization vector. Under the implementation architecture of this specification, the information mapping a single node's characterization vector to the cluster categories can be represented by a mapping vector. The mapping vector may be the result of a combination of the K prototype vectors; the combination may be linear or nonlinear, without limitation here. The combination parameters required in the combination process, alternatively called mapping parameters, may be associated with the corresponding characterization vector.
Taking the first node as an example, the corresponding first mapping vectors are combined based on the prototype vectors respectively corresponding to the cluster categories, and the combination parameters may be determined based on the association relationship between the first characterization vector and the prototype vectors. This is because the association between the first token vector and the prototype vector represents the amount of information that the first token vector assigns to the corresponding prototype vector. The first mapping vector corresponding to the first node may be determined by the formulas (5), (6), for example.
The number K of cluster categories may be predetermined or determined according to a machine learning method. In an alternative embodiment, the number K of cluster categories may be determined by a Regularization Information Maximization (RIM) principle as described in formula (7), which is not described herein again.
Then, in step 303, based on the discrimination module, the degree of similarity between the first characterization vector and the first mapping vector is detected, so as to determine the clustering loss of the clustering model. The degree of similarity between the first characterization vector and the first mapping vector may be described by mutual information. According to the foregoing principle, since mutual information is very difficult to determine directly, it is generalized to f-information to obtain an expression of empirical mutual information constructed on a discriminant function, as described in equation (8). After the conversion, however, the expression is still hard to evaluate because it involves expectations over the probability distributions of the vectors. For this purpose, empirical probability distributions are used in place of the overall distributions, so that the discriminant function approximates the corresponding mutual information. The discriminant function is, for example, D_φ(h, h') = σ(h^T φ h'), which distinguishes the distribution similarity between two vectors h and h' via the intermediate vector φ.
It can be understood that in the clustering process, the better the clustering result is, the greater the information retention degree between the characterization vector of a single node and its own mapping vector is, that is, the greater the mutual information is, the smaller the loss is. Therefore, for the token vector and the mapping vector of a single node, the clustering loss is inversely related to the corresponding mutual information. For example, the cluster loss is inversely related to the degree of similarity between the first token vector and the first mapping vector. As shown in equation (9). At this time, the current node and itself may also be considered to constitute a positive sample.
On the other hand, for the current node, its token vector should theoretically have a different distribution than the mapping vectors of the other nodes. Therefore, in an alternative embodiment, the cluster loss may also be positively correlated with the similarity between the token vector of the current node and other mapping vectors corresponding to other nodes. At this time, the current node and the other nodes may also be considered to constitute negative examples.
According to a possible design, the cluster loss may be a sum of losses corresponding to a plurality of nodes (e.g., all nodes) in the attribute map, as shown in equation (9).
In an alternative embodiment, the attribute map may also be changed to perform negative sample sampling. Such as randomly perturbing the feature vectors of the various nodes, etc.
Next, in step 304, the model parameters of the coding module, the prototype vectors and the intermediate vector in the discriminant function in the discriminant module are adjusted to train the clustering model with the objective of minimizing the clustering loss. It can be understood that, in the above process, the model parameters of the coding module, the prototype vectors in the mapping module, and the intermediate vectors in the discriminant function in the discriminant module are all the undetermined model parameters in the clustering model. The model loss is actually the difference between the actual situation and the expected result, and in order to make the clustering model have a better clustering effect, each model parameter can be adjusted to the direction of reducing the model loss. For example, the gradient of model loss versus each model parameter may be determined and then the model parameters adjusted using a gradient descent method, a Newton method, or the like.
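Putting the pieces together, one gradient step of such joint training might look as follows; the random initialization, stand-in single-layer encoder, and Adam optimizer are illustrative assumptions rather than the patent's prescribed setup.

```python
import torch
import torch.nn.functional as F

# One illustrative joint training step: encoder parameter W, prototype
# vectors mu, and the discriminant's intermediate matrix Phi are all
# adjusted toward minimizing the clustering loss (formula (9) sketch).
N, d, K = 32, 16, 3
X = torch.randn(N, d)                            # node feature vectors
W = torch.randn(d, d, requires_grad=True)        # stand-in encoder parameter
mu = torch.randn(K, d, requires_grad=True)       # prototype vectors
Phi = torch.randn(d, d, requires_grad=True)      # discriminant intermediate matrix
opt = torch.optim.Adam([W, mu, Phi], lr=1e-3)

Z = torch.relu(X @ W)                            # stand-in for the encoder eps_theta
Q = torch.softmax(Z @ mu.t(), dim=1) @ mu        # mapping vectors, formulas (5)-(6)
pos = torch.sigmoid((Z @ Phi * Q).sum(-1)).mean()             # same-node pairs
Q_neg = Q[torch.randperm(N)]                                  # other-node pairs
neg = F.softplus(torch.sigmoid((Z @ Phi * Q_neg).sum(-1))).mean()
loss = -(pos - neg)

opt.zero_grad()
loss.backward()
opt.step()
```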
In a possible design, feedback information can be added to the characterization vectors using the mapping vectors, so as to obtain characterization information better suited to clustering. For example, for the first node, the first characterization vector may be updated with a weighted combination of the first characterization vector and the first mapping vector, and the degree of similarity between the updated first characterization vector and the first mapping vector is then detected based on the discrimination module. The weighting weights may be manually set hyperparameters, e.g., values between 0 and 1; see the specific example of formula (11).
The flow shown in fig. 3 is an embodiment of adjusting the model parameters in the encoding process and the subsequent clustering process together. FIG. 4 illustrates a data pre-processing flow for a clustering model according to another embodiment. In the flow shown in fig. 4, the model parameters in the encoding process and the subsequent clustering process may also be adjusted separately. Still taking the first node as an example, the flow shown in fig. 4 includes the following steps:
step 401, processing the attribute map by using a coding module to obtain each characterization vector corresponding to each node. And the first node corresponds to the first characterization vector. In the case that the encoding module is a graph neural network, the characterization vector of the first node may be determined based on a fusion result of the feature vector of the first node and feature vectors of its neighboring nodes.
Step 402, determining the coding loss of the encoding module based on the degree of similarity between the first characterization vector and the first feature vector corresponding to the first node. The degree of similarity between the first characterization vector and the first feature vector may be measured, for example, via a first discriminant function, based on the product of the first characterization vector, an intermediate vector of the first discriminant function, and the first feature vector.
In an alternative embodiment, the comparison of a single node with itself may be regarded as a positive sample and the comparison of a single node with other nodes as negative samples, with the coding loss negatively correlated with the degree of similarity of the positive sample and positively correlated with the degree of similarity of the negative samples. In practice, the coding loss may comprise the sum of the losses determined for a plurality of nodes in the attribute graph, as shown in equation (10).
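As an illustrative sketch of such a positive/negative contrast, the following assumes a bilinear discriminant scored with a logistic loss, in the style of contrastive graph representation learning; equation (10) may be defined differently.

```python
import torch
import torch.nn.functional as F

def coding_loss(h: torch.Tensor, x: torch.Tensor, x_neg: torch.Tensor,
                w: torch.Tensor) -> torch.Tensor:
    """h: [n, dh] characterization vectors; x / x_neg: [n, dx] original
    and perturbed feature vectors; w: [dh, dx] intermediate matrix of
    the discriminant function. Positive pairs (a node against its own
    features) should score high; negative pairs (against perturbed
    features) should score low."""
    pos = (h @ w * x).sum(dim=1)       # bilinear score h_i^T W x_i
    neg = (h @ w * x_neg).sum(dim=1)   # score against corrupted features
    # -log sigmoid(pos) - log(1 - sigmoid(neg)), averaged over nodes:
    return (F.softplus(-pos) + F.softplus(neg)).mean()
```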
Step 403, adjusting the model parameters of the encoding module with the goal of minimizing the coding loss. For example, the gradient of the coding loss with respect to each model parameter may be determined, and the parameters then adjusted using gradient descent, Newton's method, or the like.
Step 404, processing the attribute graph by using the encoding module with the adjusted model parameters to obtain a third characterization vector corresponding to the first node. Once the encoding module has been adjusted, the trained encoding module can process the attribute graph to produce characterization vectors that are more favorable for clustering, for use in subsequent processing. At this point, the characterization vectors of the nodes can be computed once and reused while training the subsequent modules, reducing the amount of computation per training step.
Step 405, determining, by the mapping module, a first mapping vector that maps the first node to the plurality of cluster categories by using the third characterization vector. The first mapping vector is formed by combining the prototype vectors respectively corresponding to the cluster categories, with the combination parameters determined based on the third characterization vector.
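For illustration, an attention-style combination consistent with this description (and with the importance coefficients of claim 4) might be sketched as follows; the softmax form of the attention is an assumption of this sketch.

```python
import torch

def map_to_prototypes(h: torch.Tensor,
                      prototypes: torch.Tensor) -> torch.Tensor:
    """h: [n, d] characterization vectors; prototypes: [k, d], one per
    cluster category. Returns [n, d] mapping vectors: each node's
    mapping vector is a weighted combination of the prototypes, with
    weights (importance coefficients) from a softmax over similarities."""
    scores = h @ prototypes.t()            # [n, k] node-prototype similarity
    coeffs = torch.softmax(scores, dim=1)  # combination parameters
    return coeffs @ prototypes             # combine the prototype vectors
```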
Step 406, detecting the degree of similarity between the third characterization vector and the first mapping vector based on the discrimination module, so as to determine the clustering loss of the clustering model. The degree of similarity between the third characterization vector and the first mapping vector is determined by constructing empirical mutual information based on the discriminant function, using the empirical distribution of the characterization vectors and mapping vectors in place of the overall distribution. The clustering loss is negatively correlated with the degree of similarity between the third characterization vector and the first mapping vector.
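As a hedged sketch of estimating such empirical mutual information through a discriminant function, the following contrasts matched pairs (a node with its own mapping vector) against mismatched pairs (a node with another node's mapping vector); the precise estimator of equation (9) may differ.

```python
import torch
import torch.nn.functional as F

def clustering_loss(h: torch.Tensor, m: torch.Tensor,
                    w: torch.Tensor) -> torch.Tensor:
    """h: [n, d] characterization vectors; m: [n, d] mapping vectors;
    w: [d, d] intermediate matrix of the discriminant function. Matched
    pairs (h_i, m_i) stand in for the joint distribution; shuffled pairs
    (h_i, m_j) stand in for the product of the marginals."""
    perm = torch.randperm(m.size(0))
    pos = (h @ w * m).sum(dim=1)        # node with its own mapping vector
    neg = (h @ w * m[perm]).sum(dim=1)  # node with another node's mapping
    # Negated Jensen-Shannon-style lower bound on the mutual information:
    return (F.softplus(-pos) + F.softplus(neg)).mean()
```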
In an optional embodiment, the attribute graph corresponds to a variation graph in which the feature vectors are randomly adjusted; the variation graph has a second node corresponding to the first node, and the second node corresponds to a second characterization vector obtained by processing the variation graph with the encoding module. The coding loss then further includes a term based on the degree of similarity between the first feature vector and the second characterization vector.
In another optional embodiment, a third node different from the first node exists in the attribute graph, and the third node corresponds to a fourth characterization vector obtained by processing the variation graph with the encoding module. The coding loss then further includes a term based on the degree of similarity between the first feature vector and the fourth characterization vector.
Step 407, with the goal of minimizing the clustering loss, adjusting the prototype vectors and the intermediate vector in the discriminant function, thereby training the mapping module and the discrimination module.
It should be noted that, in the flow shown in fig. 4, steps 401 and 405-407 are similar to steps 301-304 in fig. 3, respectively. The difference is that, in the flow of fig. 4, the loss of the encoding module is determined separately and its model parameters are adjusted first, after which the trained encoding module is used to process the attribute graph and obtain the characterization vector of each node (corresponding to each business entity). These characterization vectors can then be reused for model training in the subsequent steps.
Fig. 5 illustrates a flow of clustering business entities according to an embodiment of the present specification, in which a trained clustering model clusters business entities by using an attribute graph. The clustering model may include an encoding module, a mapping module, and a discrimination module, and may be trained through a flow such as that of fig. 3 or fig. 4. The attribute graph may include a plurality of nodes in one-to-one correspondence with a plurality of business entities, each node having a feature vector determined based on the attributes of the corresponding business entity.
As shown in fig. 5, the process of clustering business entities may include the following steps:
Step 501, processing the attribute graph by using the encoding module to obtain the characterization vectors corresponding to the respective nodes.
Step 502, determining, by the mapping module, the mapping vectors obtained by mapping the respective nodes to the plurality of cluster categories by using their characterization vectors. A single mapping vector is formed by combining the prototype vectors respectively corresponding to the cluster categories, with the combination parameters determined based on the corresponding characterization vector.
Step 503, detecting, based on the discrimination module, the degree of similarity between the characterization vector of a first node and the mapping vector of a second node. In this step, for any two nodes, denoted the first node and the second node, the discrimination module may detect the degree of similarity between the characterization vector of the first node and the mapping vector of the second node, or between the characterization vector of the second node and the mapping vector of the first node. This detection can be performed for any pair of nodes.
Step 504, determining, when the degree of similarity satisfies a predetermined condition, that the business entity corresponding to the first node and the business entity corresponding to the second node belong to the same cluster category. Where the degree of similarity is defined by the discriminant function, the predetermined condition may be, for example, that the obtained degree of similarity is greater than a predetermined threshold. Taking the first node and the second node as an example, the predetermined condition may be that either of the two directional similarities (the characterization vector of the first node against the mapping vector of the second node, and the characterization vector of the second node against the mapping vector of the first node) is greater than a predetermined threshold; that both are greater than a predetermined threshold; or that, when one of them is greater than a first predetermined threshold (e.g., 0.8), the other is not less than a second predetermined threshold (e.g., 0.5).
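For illustration, the two-threshold variant of the predetermined condition might be checked as follows; the sigmoid-scored bilinear similarity is an assumption of this sketch, with the thresholds 0.8 and 0.5 taken from the example above.

```python
import torch

def same_cluster(h1: torch.Tensor, m1: torch.Tensor,
                 h2: torch.Tensor, m2: torch.Tensor,
                 w: torch.Tensor,
                 t1: float = 0.8, t2: float = 0.5) -> bool:
    """h1, h2: characterization vectors of the two nodes; m1, m2: their
    mapping vectors; w: intermediate matrix of the discriminant function.
    Rule: one directional similarity exceeds t1 while the other is at
    least t2."""
    s12 = float(torch.sigmoid(h1 @ w @ m2))  # node 1 vs node 2's mapping
    s21 = float(torch.sigmoid(h2 @ w @ m1))  # node 2 vs node 1's mapping
    return (s12 > t1 and s21 >= t2) or (s21 > t1 and s12 >= t2)
```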
It should be noted that the information-theoretic principles described above with reference to fig. 1 and fig. 2 form the theoretical basis of the clustering model training flows shown in fig. 3 and fig. 4 and of the clustering flow shown in fig. 5. Accordingly, the alternatives involved in those principles may also be applied to the embodiments described in fig. 3, fig. 4 and fig. 5, and are not repeated here.
Reviewing the above flows, the method provided in the embodiments of the present specification characterizes the attribute graph through characterization vectors and, grounded in information theory, trains the clustering model using the information loss of the transition between the characterization vectors and the prototype vectors of the cluster categories. This information loss is measured by the degree of similarity between a characterization vector and the mapping vector determined from the prototype vectors. Further, substituting the empirical probability distribution for the overall distribution when determining mutual information provides a way to approximate mutual information empirically. The method thus makes effective use of information theory and provides a more effective scheme for clustering business entities by using an attribute graph.
According to an embodiment of another aspect, a data preprocessing apparatus for a clustering model is also provided. Fig. 6 illustrates a data preprocessing apparatus for a clustering model according to an embodiment of the present specification. The apparatus 600 comprises:
the encoding unit 61 is configured to process the attribute graph by using the encoding module to obtain each characterization vector corresponding to each node, where the first node corresponds to the first characterization vector;
the mapping unit 62 is configured to determine, through the mapping module, a first mapping vector that maps the first node to a plurality of cluster categories by using the first characterization vector, where the first mapping vector is formed by combining the prototype vectors respectively corresponding to the cluster categories, and the combination parameters are determined based on the first characterization vector;
the discrimination unit 63 is configured to detect the degree of similarity between the first characterization vector and the first mapping vector based on the discrimination module, so as to determine the clustering loss of the clustering model, where the degree of similarity between the first characterization vector and the first mapping vector is determined by constructing empirical mutual information based on the discriminant function, using the empirical distribution of the characterization vectors and mapping vectors in place of the overall distribution, and the clustering loss is negatively correlated with the degree of similarity between the first characterization vector and the first mapping vector;
the adjusting unit 64 is configured to adjust, with the goal of minimizing the clustering loss, the model parameters of the encoding module, the prototype vectors, and the intermediate vector in the discriminant function of the discrimination module, so as to train the clustering model.
According to an embodiment of another aspect, another data preprocessing apparatus for a clustering model is also provided. Fig. 7 illustrates such a data preprocessing apparatus according to an embodiment of the present specification. The apparatus 700 comprises:
the encoding unit 71 is configured to process the attribute graph by using the encoding module to obtain each characterization vector corresponding to each node, where the first node corresponds to the first characterization vector;
the first discrimination unit 72 is configured to determine the coding loss of the encoding module based on the degree of similarity between the first characterization vector and the first feature vector corresponding to the first node;
the first adjusting unit 73 is configured to adjust the model parameters of the encoding module with the goal of minimizing the coding loss;
the encoding unit 71 is further configured to process the attribute graph by using the encoding module with the adjusted model parameters to obtain a third characterization vector corresponding to the first node;
the mapping unit 74 is configured to determine, through the mapping module, a first mapping vector that maps the first node to a plurality of cluster categories by using the third characterization vector, where the first mapping vector is formed by combining the prototype vectors respectively corresponding to the cluster categories, and the combination parameters are determined based on the third characterization vector;
the second discrimination unit 75 is configured to detect the degree of similarity between the third characterization vector and the first mapping vector based on the discrimination module, so as to determine the clustering loss of the clustering model, where the degree of similarity between the third characterization vector and the first mapping vector is determined by constructing empirical mutual information based on the discriminant function, using the empirical distribution of the characterization vectors and mapping vectors in place of the overall distribution, and the clustering loss is negatively correlated with the degree of similarity between the third characterization vector and the first mapping vector;
the second adjusting unit 76 is configured to adjust, with the goal of minimizing the clustering loss, the prototype vectors and the intermediate vector in the discriminant function, so as to train the mapping module and the discrimination module.
Fig. 8 shows an apparatus for clustering business entities according to one embodiment. The apparatus 800 comprises:
the encoding unit 81 is configured to process the attribute map by using an encoding module to obtain each characterization vector corresponding to each node;
the mapping unit 82 is configured to determine, through the mapping module, the mapping vectors obtained by mapping the respective nodes to a plurality of cluster categories by using their characterization vectors, where a single mapping vector is formed by combining the prototype vectors respectively corresponding to the cluster categories, and the combination parameters are determined based on the corresponding characterization vector;
the discrimination unit 83 is configured to detect the degree of similarity between the characterization vector of the first node and the mapping vector of the second node based on the discrimination module;
the determining unit 84 is configured to determine, when the degree of similarity satisfies a predetermined condition, that the business entity corresponding to the first node and the business entity corresponding to the second node belong to the same cluster category.
It should be noted that the apparatuses 600, 700, and 800 shown in fig. 6, fig. 7, and fig. 8 are apparatus embodiments corresponding to the method embodiments shown in fig. 3, fig. 4, and fig. 5, respectively; the corresponding descriptions of those method embodiments are also applicable to the apparatuses 600, 700, and 800 and are not repeated here.
According to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with fig. 3, 4 or 5.
According to an embodiment of yet another aspect, there is also provided a computing device comprising a memory having stored therein executable code, and a processor that, when executing the executable code, implements the method described in conjunction with fig. 3, fig. 4, or fig. 5.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in the embodiments of this specification may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The above-mentioned embodiments are intended to explain the technical idea, technical solutions, and advantages of the present specification in further detail. It should be understood that the foregoing is merely a description of specific embodiments of the technical idea of the present specification and is not intended to limit its scope; any modification, equivalent replacement, or improvement made on the basis of the technical solutions of the embodiments of the present specification shall fall within the scope of the technical idea of the present specification.

Claims (18)

1. A data preprocessing method for a clustering model, the clustering model being used for clustering service entities by utilizing an attribute graph, wherein the attribute graph comprises a plurality of nodes in one-to-one correspondence with a plurality of service entities, each node has a feature vector determined based on the attributes of the corresponding service entity, the clustering model comprises a coding module, a mapping module and a discrimination module, and the plurality of nodes comprise a first node; the method comprises the following steps:
processing the attribute graph by using the coding module to obtain each characterization vector corresponding to each node, wherein the first node corresponds to the first characterization vector;
determining, by the mapping module, a first mapping vector that maps the first node to a plurality of cluster categories by using the first characterization vector, wherein the first mapping vector is formed by combining prototype vectors respectively corresponding to the cluster categories, and the combination parameter is determined based on the first characterization vector;
based on the discrimination module, detecting the similarity degree of the first characterization vector and the first mapping vector so as to determine the clustering loss of the clustering model, wherein the similarity degree between the first characterization vector and the first mapping vector is determined by constructing empirical mutual information based on a discriminant function through the empirical distribution of the characterization vector and the mapping vector instead of the overall distribution, and the clustering loss is inversely related to the similarity degree between the first characterization vector and the first mapping vector;
and with the goal of minimizing the clustering loss, adjusting the model parameters of the coding module, each prototype vector, and the intermediate vector in the discriminant function of the discrimination module, thereby training the clustering model.
2. The method of claim 1, the encoding module being a graph neural network, the first characterization vector being determined based on a result of fusion of a feature vector of the first node with feature vectors of its neighboring nodes.
3. The method of claim 1, wherein the first mapping vector is determined by:
determining each importance coefficient corresponding to each prototype vector based on the first characterization vector and each prototype vector;
and combining the prototype vectors in a weighted summation mode according to the combination parameters determined by the importance coefficients to obtain the first mapping vector.
4. The method of claim 3, wherein each importance coefficient is determined based on an attention mechanism, the prototype vectors comprising a first prototype vector corresponding to a first importance coefficient that is positively correlated with the similarity between the first prototype vector and the first characterization vector and negatively correlated with the sum of the similarities between each prototype vector and the first characterization vector.
5. The method of claim 1, wherein the detecting, based on the discrimination module, a degree of similarity of the first characterization vector to the first mapping vector comprises:
determining the similarity between the first characterization vector and the first mapping vector based on the product of the first characterization vector, the intermediate vector of the discriminant function, and the first mapping vector.
6. The method of claim 1, wherein the clustering loss is also positively correlated with the degree of similarity between the first characterization vector and the other mapping vectors corresponding to other nodes.
7. The method of claim 1, wherein the detecting, based on the discrimination module, a degree of similarity of the first characterization vector to the first mapping vector comprises:
updating the first characterization vector to a weighted combination of the first characterization vector and the first mapping vector;
and detecting the degree of similarity between the updated first characterization vector and the first mapping vector based on the discrimination module.
8. A data preprocessing method for a clustering model, the clustering model being used for clustering service entities by utilizing an attribute graph, wherein the attribute graph comprises a plurality of nodes in one-to-one correspondence with a plurality of service entities, each node has a feature vector determined based on the attributes of the corresponding service entity, the clustering model comprises a coding module, a mapping module and a discrimination module, and the plurality of nodes comprise a first node; the method comprises the following steps:
processing the attribute graph by using the coding module to obtain each characterization vector corresponding to each node, wherein the first node corresponds to the first characterization vector;
determining the coding loss of the coding module based on the similarity degree between the first characterization vector and a first feature vector corresponding to the first node;
adjusting model parameters of the coding module with the aim of minimizing the coding loss;
processing the attribute graph by using the coding module with the adjusted model parameters to obtain a third characterization vector corresponding to the first node;
determining, by the mapping module, a first mapping vector that maps the first node to a plurality of cluster categories by using the third characterization vector, wherein the first mapping vector is formed by combining prototype vectors respectively corresponding to the cluster categories, and the combination parameters are determined based on the third characterization vector;
based on the discrimination module, detecting the degree of similarity between the third characterization vector and the first mapping vector so as to determine the clustering loss of the clustering model, wherein the degree of similarity between the third characterization vector and the first mapping vector is determined by constructing empirical mutual information based on a discriminant function, using the empirical distribution of the characterization vectors and mapping vectors in place of the overall distribution, and the clustering loss is negatively correlated with the degree of similarity between the third characterization vector and the first mapping vector;
and with the goal of minimizing the clustering loss, adjusting each prototype vector and the intermediate vector in the discriminant function, thereby training the mapping module and the discrimination module.
9. The method of claim 8, wherein the coding module is a graph neural network, and the characterization vector of the first node is determined based on a fusion result of the feature vector of the first node and feature vectors of neighboring nodes.
10. The method of claim 8, wherein the degree of similarity between the first characterization vector and the first feature vector corresponding to the first node is measured via a first discriminant function, and is determined based on the product of the first characterization vector, an intermediate vector of the first discriminant function, and the first feature vector.
11. The method of claim 8, wherein the clustering loss is also positively correlated with the degree of similarity between the first characterization vector and the other mapping vectors corresponding to other nodes.
12. The method according to claim 11, wherein the attribute graph corresponds to a variation graph in which the feature vectors are randomly adjusted, the variation graph has a second node corresponding to the first node, and the second node corresponds to a second characterization vector obtained by processing the variation graph through the coding module; the coding loss further comprises a term based on the degree of similarity between the first feature vector and the second characterization vector.
13. A method for clustering service entities, using a pre-trained clustering model to cluster the service entities by utilizing an attribute graph, wherein the attribute graph comprises a plurality of nodes in one-to-one correspondence with a plurality of service entities, each node has a feature vector determined based on the attributes of the corresponding service entity, and the clustering model comprises a coding module, a mapping module and a discrimination module; the method comprises the following steps:
processing the attribute graph by using the coding module to obtain each characterization vector corresponding to each node;
respectively determining mapping vectors obtained by mapping each node to a plurality of clustering categories by using each characterization vector through the mapping module, wherein a single mapping vector is formed by combining prototype vectors respectively corresponding to each clustering category, and combination parameters are determined based on corresponding characterization vectors;
detecting the similarity degree of the characterization vector of the first node and the mapping vector of the second node based on the discrimination module;
and under the condition that the similarity degree meets a preset condition, determining that the service entity corresponding to the first node and the service entity corresponding to the second node belong to the same clustering class.
14. A data preprocessing device for a clustering model, the clustering model being used for clustering service entities by utilizing an attribute graph, wherein the attribute graph comprises a plurality of nodes in one-to-one correspondence with a plurality of service entities, each node has a feature vector determined based on the attributes of the corresponding service entity, the clustering model comprises a coding module, a mapping module and a discrimination module, and the plurality of nodes comprise a first node; the device comprises:
the coding unit is configured to process the attribute graph by using the coding module to obtain each characterization vector corresponding to each node, and the first node corresponds to the first characterization vector;
the mapping unit is configured to determine, by the mapping module, a first mapping vector for mapping the first node to a plurality of cluster categories by using the first characterization vector, wherein the first mapping vector is formed by combining prototype vectors respectively corresponding to the cluster categories, and the combination parameter is determined based on the first characterization vector;
a discrimination unit configured to detect the degree of similarity between the first characterization vector and the first mapping vector based on the discrimination module, so as to determine the clustering loss of the clustering model, wherein the degree of similarity between the first characterization vector and the first mapping vector is determined by constructing empirical mutual information based on a discriminant function, using the empirical distribution of the characterization vectors and mapping vectors in place of the overall distribution, and the clustering loss is negatively correlated with the degree of similarity between the first characterization vector and the first mapping vector;
and the adjusting unit is configured to adjust, with the goal of minimizing the clustering loss, the model parameters of the coding module, each prototype vector, and the intermediate vector in the discriminant function of the discrimination module, so as to train the clustering model.
15. A data preprocessing device for a clustering model, the clustering model being used for clustering service entities by utilizing an attribute graph, wherein the attribute graph comprises a plurality of nodes in one-to-one correspondence with a plurality of service entities, each node has a feature vector determined based on the attributes of the corresponding service entity, the clustering model comprises a coding module, a mapping module and a discrimination module, and the plurality of nodes comprise a first node; the device comprises:
the coding unit is configured to process the attribute graph by using the coding module to obtain each characterization vector corresponding to each node, and the first node corresponds to the first characterization vector;
a first discrimination unit configured to determine the coding loss of the coding module based on the degree of similarity between the first characterization vector and the first feature vector corresponding to the first node;
a first adjusting unit configured to adjust a model parameter of the encoding module with a goal of minimizing the encoding loss;
the coding unit is further configured to process the attribute graph by using the coding module with the adjusted model parameters to obtain a third characterization vector corresponding to the first node;
the mapping unit is configured to determine, through the mapping module, a first mapping vector that maps the first node to a plurality of cluster categories by using the third characterization vector, wherein the first mapping vector is formed by combining prototype vectors respectively corresponding to the cluster categories, and the combination parameters are determined based on the third characterization vector;
a second discrimination unit configured to detect the degree of similarity between the third characterization vector and the first mapping vector based on the discrimination module, so as to determine the clustering loss of the clustering model, wherein the degree of similarity between the third characterization vector and the first mapping vector is determined by constructing empirical mutual information based on a discriminant function, using the empirical distribution of the characterization vectors and mapping vectors in place of the overall distribution, and the clustering loss is negatively correlated with the degree of similarity between the third characterization vector and the first mapping vector;
and the second adjusting unit is configured to adjust, with the goal of minimizing the clustering loss, each prototype vector and the intermediate vector in the discriminant function, so as to train the mapping module and the discrimination module.
16. A device for clustering service entities, using a pre-trained clustering model to cluster the service entities by utilizing an attribute graph, wherein the attribute graph comprises a plurality of nodes in one-to-one correspondence with a plurality of service entities, each node has a feature vector determined based on the attributes of the corresponding service entity, and the clustering model comprises a coding module, a mapping module and a discrimination module; the device comprises:
the coding unit is configured to process the attribute graph by using the coding module to obtain each characterization vector corresponding to each node;
the mapping unit is configured to respectively determine mapping vectors obtained by mapping each node to a plurality of clustering categories by using each characterization vector through the mapping module, wherein a single mapping vector is formed by combining prototype vectors respectively corresponding to each clustering category, and combination parameters are determined based on corresponding characterization vectors;
the discrimination unit is configured to detect the degree of similarity between the characterization vector of the first node and the mapping vector of the second node based on the discrimination module;
and the determining unit is configured to determine that the service entity corresponding to the first node and the service entity corresponding to the second node belong to the same cluster category when the similarity meets a predetermined condition.
17. A computer-readable storage medium, on which a computer program is stored which, when executed in a computer, causes the computer to carry out the method of any one of claims 1-13.
18. A computing device comprising a memory and a processor, wherein the memory has stored therein executable code that, when executed by the processor, performs the method of any of claims 1-13.