CN115859110A - Data processing method, device and equipment


Info

Publication number
CN115859110A
Authority
CN
China
Prior art keywords
node
sample
nodes
graph
encoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211591784.0A
Other languages
Chinese (zh)
Inventor
刘杰
孟昌华
王维强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202211591784.0A
Publication of CN115859110A
Legal status: Pending

Abstract

The embodiment of the specification discloses a data processing method, device, and equipment, where the method includes: acquiring a target graph to be processed; inputting relationship information of edges between different nodes in the target graph and attribute information of each node into a pre-trained encoder to obtain a hidden vector corresponding to each node in the target graph, where the encoder is obtained by removing one or more nodes from a historical graph sample to obtain a sub-graph sample, and jointly training the encoder and a decoder corresponding to the encoder using the sub-graph sample, based on determined local reconstruction loss information and/or global loss information; and finally, clustering the nodes contained in the target graph based on the hidden vector corresponding to each node in the target graph to obtain nodes of one or more different cluster categories.

Description

Data processing method, device and equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a data processing method, apparatus, and device.
Background
At present, with the continuous development of intelligent information service applications, graphs have been widely applied in fields such as intelligent search, intelligent question answering, personalized recommendation, intelligence analysis, and anti-fraud. In addition, information, data, and link relationships on the Web can be aggregated into knowledge through graphs, making information resources easier to compute, understand, and evaluate, and forming a Web semantic knowledge base. With their strong semantic processing capability and open interconnection capability, graphs can lay a solid foundation for knowledge interconnection on the network. Community discovery is an important way of partitioning the nodes in a graph: nodes with similar attributes or similar structures are found within communities and gathered together according to their similarity, so that nodes in the same community are relatively close while nodes in different communities are relatively distant. However, how to perform unsupervised community discovery, and thereby mine risks that may exist in a business, is an important problem that currently needs to be solved; therefore, a better method for unsupervised community discovery or risk discovery in a business needs to be provided.
Disclosure of Invention
An object of the embodiments of the present specification is to provide a better method for unsupervised community discovery or risk discovery in a business.
In order to implement the above technical solution, the embodiments of the present specification are implemented as follows:
The data processing method provided by the embodiments of the specification includes: acquiring a target graph to be processed, where the target graph includes a plurality of different nodes and attribute information of each node; inputting relationship information of edges between different nodes in the target graph and attribute information of each node into a pre-trained encoder to obtain a hidden vector corresponding to each node in the target graph, where the encoder is obtained by removing one or more nodes from a historical graph sample to obtain a sub-graph sample, and jointly training the encoder and a decoder corresponding to the encoder using the sub-graph sample, based on local reconstruction loss information corresponding to the sub-graph sample and/or global loss information corresponding to the sub-graph sample; and clustering the nodes contained in the target graph based on the hidden vector corresponding to each node in the target graph to obtain nodes of one or more different cluster categories.
An embodiment of this specification provides a data processing apparatus, the apparatus including: a graph acquisition module that acquires a target graph to be processed, where the target graph includes a plurality of different nodes and attribute information of each node; a hidden vector determination module that inputs relationship information of edges between different nodes in the target graph and attribute information of each node into a pre-trained encoder to obtain a hidden vector corresponding to each node in the target graph, where the encoder is obtained by removing one or more nodes from a historical graph sample to obtain a sub-graph sample, and jointly training the encoder and a decoder corresponding to the encoder using the sub-graph sample, based on local reconstruction loss information corresponding to the sub-graph sample and/or global loss information corresponding to the sub-graph sample; and a clustering module that clusters the nodes contained in the target graph based on the hidden vector corresponding to each node in the target graph to obtain nodes of one or more different cluster categories.
An embodiment of the present specification provides a data processing device, including: a processor; and a memory arranged to store computer-executable instructions that, when executed, cause the processor to: acquire a target graph to be processed, where the target graph includes a plurality of different nodes and attribute information of each node; input relationship information of edges between different nodes in the target graph and attribute information of each node into a pre-trained encoder to obtain a hidden vector corresponding to each node in the target graph, where the encoder is obtained by removing one or more nodes from a historical graph sample to obtain a sub-graph sample, and jointly training the encoder and a decoder corresponding to the encoder using the sub-graph sample, based on local reconstruction loss information corresponding to the sub-graph sample and/or global loss information corresponding to the sub-graph sample; and cluster the nodes contained in the target graph based on the hidden vector corresponding to each node in the target graph to obtain nodes of one or more different cluster categories.
The present specification also provides a storage medium for storing computer-executable instructions which, when executed by a processor, implement the following procedure: acquiring a target graph to be processed, where the target graph includes a plurality of different nodes and attribute information of each node; inputting relationship information of edges between different nodes in the target graph and attribute information of each node into a pre-trained encoder to obtain a hidden vector corresponding to each node in the target graph, where the encoder is obtained by removing one or more nodes from a historical graph sample to obtain a sub-graph sample, and jointly training the encoder and a decoder corresponding to the encoder using the sub-graph sample, based on local reconstruction loss information corresponding to the sub-graph sample and/or global loss information corresponding to the sub-graph sample; and clustering the nodes contained in the target graph based on the hidden vector corresponding to each node in the target graph to obtain nodes of one or more different cluster categories.
Drawings
In order to more clearly illustrate the embodiments of the present specification or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some of the embodiments described in the present specification, and those skilled in the art can obtain other drawings from these drawings without any creative effort.
FIG. 1 is a schematic diagram of an embodiment of a data processing method of the present specification;
FIG. 2 is a schematic diagram of another embodiment of the data processing method described herein;
FIG. 3 is a schematic representation of a graph according to the present specification;
FIG. 4 is a schematic diagram of an embodiment of a data processing apparatus according to the present specification;
FIG. 5 is a schematic diagram of an embodiment of a data processing device according to the present specification.
Detailed Description
The embodiment of the specification provides a data processing method, a data processing device and data processing equipment.
In order to make those skilled in the art better understand the technical solutions in the present specification, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only a part of the embodiments of the present specification, and not all of the embodiments. All other embodiments obtained by a person skilled in the art based on the embodiments in the present specification without any inventive step should fall within the scope of protection of the present specification.
Example one
As shown in FIG. 1, the execution subject of the method may be a terminal device or a server, where the terminal device may be a mobile phone, a tablet computer, or the like, a computer device such as a notebook computer or a desktop computer, or an IoT device (specifically, a smart watch, a vehicle-mounted device, etc.). The server may be an independent server or a server cluster formed by a plurality of servers, and the server may be a background server of a financial service or an online shopping service, or a background server of an application program. In this embodiment, a server is taken as an example for detailed description; for the execution process of the terminal device, reference may be made to the related contents below, which are not repeated here. The method may specifically include the following steps:
in step S102, a target graph to be processed is acquired, where the target graph includes a plurality of different nodes and attribute information of each node.
In practical applications, the target graph may be a knowledge graph or the like; the target graph may be a graph used to describe the various entities and concepts that exist in the real world and the association relationships between them, and such a graph may be a semantic network. The target graph may include a plurality of different nodes, and association relationships may exist between different nodes, where an association relationship may be represented by an edge connecting the nodes. An edge may be a directed edge with a direction or an undirected edge without a direction, which may be set according to the actual situation. The target graph may be a directed multi-edge graph, or a mixed graph of a directed graph and an undirected graph; specifically, if a bidirectionally connected edge exists between two nodes in a graph, that part of the graph may be regarded as locally undirected, while only unidirectionally connected edges may exist between other nodes, and so on. A node may represent any entity; for example, a node may represent a user, an account, a book, a TV series, or a movie, which may be set according to the actual situation and is not limited in this embodiment of the specification. The attribute information of a node may include multiple types, for example, the name of the node, the category to which the node belongs, numeric attributes (specifically, size, occupied space, height, and the like), location, and so on, which may be set according to the actual situation and are not limited in this embodiment of the specification.
In implementation, with the continuous development of intelligent information service applications, graphs are widely used in fields such as intelligent search, intelligent question answering, personalized recommendation, intelligence analysis, and anti-fraud. In addition, information, data, and link relationships on the Web can be aggregated into knowledge through graphs, making information resources easier to compute, understand, and evaluate, and forming a Web semantic knowledge base. Graphs can lay a solid foundation for knowledge interconnection on the network by virtue of their strong semantic processing capability and open interconnection capability. Community discovery is a way of partitioning the nodes in a graph: nodes with similar attributes or similar structures are found within communities and gathered together according to their similarity, so that nodes in the same community are relatively close while nodes in different communities are relatively distant. The embodiment of the present specification provides an achievable processing method, which may specifically include the following:
When a certain graph (i.e., the target graph) needs to be processed, the business corresponding to the target graph may be determined, and corresponding business data may then be obtained from the determined business. The business data may be analyzed to determine the various entities it contains (which may include nodes, specifically accounts, user identifiers, and the like) and the association relationships between the entities. Specifically, if user A participates in TV series K, the entities may include user A and TV series K, and an edge between user A and TV series K (which may be denoted "participated in") may further be included; as another example, if user 1 transfers 100 yuan to user 2, the entities may include user 1 and user 2, and an edge pointing from user 1 to user 2 (which may be denoted "transferred 100 yuan") may further be included. A corresponding graph may be constructed based on the determined entities and edges, and the target graph may finally be obtained. Alternatively, the target graph may have been constructed in advance and stored in a specified storage area, in which case the target graph may be acquired from that storage area. The target graph may also be acquired in a variety of other ways, which may be set according to the actual situation and are not limited in this embodiment of the specification.
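As an illustration of the construction step above, the following is a minimal sketch of assembling such a graph from business records; the networkx library and the record fields used here are assumptions for illustration, not part of the patented method.

```python
# A minimal sketch of building a target graph from business records.
# networkx and the record fields shown here are illustrative assumptions.
import networkx as nx

records = [
    {"from": "user_1", "to": "user_2", "relation": "transferred 100 yuan"},
    {"from": "user_A", "to": "tv_series_K", "relation": "participated in"},
]

graph = nx.DiGraph()
for r in records:
    # Each entity becomes a node; each association becomes a directed edge.
    graph.add_edge(r["from"], r["to"], relation=r["relation"])

# Attribute information can be attached to each node, e.g. a category label.
graph.nodes["user_1"]["category"] = "account"
```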
In step S104, the relationship information of the edges between different nodes in the target graph and the attribute information of each node are input into a pre-trained encoder to obtain a hidden vector corresponding to each node in the target graph, where the encoder is an encoder obtained by removing one or more nodes in the history graph sample and performing joint training on the encoder and a decoder corresponding to the encoder based on the local reconstruction loss information corresponding to the sub-graph sample and/or the global loss information corresponding to the sub-graph sample by using the sub-graph sample.
The encoder may be constructed in a plurality of different manners; for example, the encoder may be constructed through a neural network model, and the neural network model may be of multiple types, such as a convolutional neural network model or a recurrent neural network model, which may be set according to the actual situation. The decoder may likewise be constructed in a plurality of different manners, for example through a neural network model of one of multiple types, such as a convolutional neural network model or a recurrent neural network model, which may be set according to the actual situation. In practical applications, a neural network model (specifically, a recurrent neural network model or the like) may be split into two parts according to the functions that the encoder and the decoder respectively need to implement: one part may be used as the encoder (for example, the network layers in the neural network model that perform dimensionality reduction may be used as the encoder), and the other part may be used as the decoder (for example, the network layers in the neural network model that restore the dimensionality may be used as the decoder). Alternatively, the encoder and the decoder may be constructed separately through other algorithms, which may be set according to the actual situation and is not limited in this embodiment of the specification. The hidden vector corresponding to a node may be a latent vector underlying the original vector corresponding to the node.
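For concreteness, a minimal sketch of such an encoder/decoder pair follows, assuming PyTorch; the single-layer design and the simple adjacency-based propagation are illustrative choices, not the patented architecture.

```python
# A minimal sketch, assuming PyTorch; layer sizes and the adjacency-based
# propagation scheme are illustrative, not the patented design.
import torch
import torch.nn as nn

class GraphEncoder(nn.Module):
    """Maps edge structure plus node attributes to hidden vectors."""
    def __init__(self, in_dim: int, hidden_dim: int):
        super().__init__()
        self.lin = nn.Linear(in_dim, hidden_dim)

    def forward(self, adj: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # Aggregate neighbor attributes through the adjacency matrix,
        # then reduce dimensionality to obtain the hidden vectors.
        return torch.relu(self.lin(adj @ x))

class GraphDecoder(nn.Module):
    """Maps hidden vectors back to node characterization vectors."""
    def __init__(self, hidden_dim: int, out_dim: int):
        super().__init__()
        self.lin = nn.Linear(hidden_dim, out_dim)

    def forward(self, adj: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        return self.lin(adj @ h)
```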
In an implementation, a corresponding algorithm may be obtained, and the encoder and the corresponding decoder may be constructed based on the algorithm. For example, a corresponding recurrent neural network model may be constructed through a recurrent neural network algorithm; the recurrent neural network model may include the encoder and the corresponding decoder, where the input data of the encoder may be a graph and its output data may be the hidden vectors of each node in the graph, and the input data of the decoder may be the hidden vectors of each node in the graph and its output data may be the characterization vectors of each node in the graph. Then, training samples (i.e., historical graph samples) for training the encoder and the decoder may be obtained, and one or more nodes in a historical graph sample may be removed, i.e., a masking operation may be performed on these nodes, to obtain a masked sub-graph sample. Using the sub-graph sample, the local reconstruction loss information corresponding to the sub-graph sample from which the nodes were removed may be calculated, and/or the global loss information corresponding to the sub-graph sample may be calculated. Either piece of loss information, or both, may be used as the training loss function, and the sub-graph sample may be used to jointly train the encoder and the decoder until the encoder and the decoder converge, finally obtaining the trained encoder.
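A sketch of this joint training step under the same PyTorch assumptions; the masking-by-zeroing strategy, the MSE local loss, and the unweighted sum of the two loss terms are illustrative simplifications of what the text describes.

```python
# One joint training step; either loss term alone, or both, may serve as the
# training loss, as described above. All names here are illustrative.
import torch

def train_step(encoder, decoder, optimizer, adj, x, mask_idx):
    optimizer.zero_grad()
    x_masked = x.clone()
    x_masked[mask_idx] = 0.0           # mask (remove) the selected nodes

    h = encoder(adj, x_masked)         # hidden vectors of all nodes
    x_hat = decoder(adj, h)            # characterization vectors of all nodes

    # Local reconstruction loss: removed nodes vs. their reconstructions.
    local_loss = torch.nn.functional.mse_loss(x_hat[mask_idx], x[mask_idx])

    # Global loss: hidden vectors of similar (reconstructed) nodes stay close,
    # sketched here as the distance between two reconstructions.
    global_loss = torch.tensor(0.0)
    if len(mask_idx) >= 2:
        global_loss = torch.dist(h[mask_idx[0]], h[mask_idx[1]])

    loss = local_loss + global_loss
    loss.backward()
    optimizer.step()
    return loss.item()
```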
After the target graph is obtained in the above manner, the relationship information of the edges between different nodes in the target graph and the attribute information of each node may be input into the pre-trained encoder, and corresponding output data, that is, the hidden vector corresponding to each node in the target graph, may be obtained through processing by the encoder.
In step S106, based on the hidden vector corresponding to each node in the target graph, clustering is performed on the nodes included in the target graph to obtain one or more nodes of different clustering categories.
In implementation, after the hidden vector corresponding to each node in the target graph is obtained in the above manner, the distance between the hidden vectors corresponding to any two nodes in the target graph may be calculated. Two nodes whose distance is smaller than a preset threshold may be clustered into the same cluster category, and two nodes whose distance is greater than the preset threshold may be clustered into different cluster categories. In this manner, the nodes contained in the target graph may be clustered into one or more different cluster categories, thereby obtaining the nodes contained in each cluster category.
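A minimal sketch of this distance-threshold grouping; the Euclidean distance and the greedy single-pass assignment are assumptions made for illustration.

```python
# Group nodes whose hidden vectors lie within a preset distance threshold.
import numpy as np

def cluster_by_threshold(hidden: np.ndarray, threshold: float) -> list[int]:
    n = hidden.shape[0]
    labels = [-1] * n                  # -1 marks a not-yet-assigned node
    next_label = 0
    for i in range(n):
        if labels[i] != -1:
            continue
        labels[i] = next_label
        for j in range(i + 1, n):
            if labels[j] == -1 and np.linalg.norm(hidden[i] - hidden[j]) < threshold:
                labels[j] = next_label
        next_label += 1
    return labels
```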
Based on the above clustering result, the method may be applied to corresponding businesses. For example, the same recommendation information may be pushed to nodes of the same cluster category, and different recommendation information may be pushed to nodes of different cluster categories. Alternatively, if one node in a certain cluster category carries a certain risk (e.g., a fraud risk), it may be determined that the nodes in that cluster category carry the risk; the nodes in that cluster category may, for instance, be the nodes of a fraud group. This may be set according to actual business requirements and is not limited in this embodiment of the specification.
The embodiment of the specification provides a data processing method: a target graph to be processed, containing a plurality of different nodes and attribute information of each node, is acquired; the relationship information of edges between different nodes in the target graph and the attribute information of each node are input into a pre-trained encoder to obtain a hidden vector corresponding to each node in the target graph, where the encoder is obtained by removing one or more nodes from a historical graph sample to obtain a sub-graph sample, and jointly training the encoder and a decoder corresponding to the encoder using the sub-graph sample, based on local reconstruction loss information corresponding to the sub-graph sample and/or global loss information corresponding to the sub-graph sample; finally, the nodes contained in the target graph are clustered based on the hidden vector corresponding to each node in the target graph to obtain nodes of one or more different cluster categories. In this way, the information of a node member in a community is removed, the information of that node is then reconstructed using the information of the node and its neighbor nodes, and the local reconstruction loss information and/or the global loss information are determined accordingly; the encoder obtained through this training can therefore, together with a clustering algorithm, perform effective cluster classification on the nodes and quickly determine whether nodes carry risks, thereby providing a better method for unsupervised community discovery or risk discovery in a business.
Example two
As shown in FIG. 2, the execution subject of the method may be a terminal device or a server, where the terminal device may be a mobile phone, a tablet computer, or the like, a computer device such as a notebook computer or a desktop computer, or an IoT device (specifically, a smart watch, a vehicle-mounted device, etc.). The server may be an independent server or a server cluster formed by a plurality of servers, and the server may be a background server of a financial service or an online shopping service, or a background server of an application program. In this embodiment, a server is taken as an example for detailed description; for the execution process of the terminal device, reference may be made to the related contents below, which are not repeated here. The method may specifically include the following steps:
in step S202, a history map sample is acquired, the history map sample including a plurality of different nodes and attribute information of each node.
In implementation, the business corresponding to a historical graph sample may be determined, corresponding historical business data may then be obtained from the determined business, and the historical business data may be analyzed to determine the various entities it contains (which may include nodes, specifically accounts, user identifiers, and the like) and the association relationships between different entities. A corresponding graph may be constructed based on the determined entities and association relationships, and the historical graph sample may finally be obtained. Alternatively, the historical graph sample may have been constructed in advance and stored in a specified storage area, in which case the historical graph sample may be acquired from that storage area. The historical graph sample may also be acquired in a variety of other ways, which may be set according to the actual situation and are not limited in this embodiment of the specification.
In step S204, one or more different nodes in the history map sample are removed, and a first sub-map sample is obtained.
In implementation, one or more different nodes may be randomly selected from the history map sample, and the selected one or more different nodes may be removed from the history map sample to obtain the first sub-map sample.
In step S206, the relationship information of the edges between different nodes in the first sub-graph sample and the attribute information of each node are input into a pre-constructed encoder to obtain a first sample node hidden vector, where the first sample node hidden vector includes a hidden vector corresponding to a reconstructed node obtained after the removed node is subjected to node reconstruction processing.
The node reconstruction processing may be a processing manner in which the relevant information of a removed node is reconstructed through a certain algorithm or reconstruction method so as to construct the corresponding node.
In implementation, the relationship information of edges between different nodes in the first sub-graph sample and the attribute information of each node may be input into a pre-constructed encoder. The encoder reconstructs the removed nodes to obtain reconstructed nodes; the historical graph sample may be restored from the reconstructed nodes and the nodes that were not removed, and the encoder then processes the restored historical graph sample to obtain the first sample node hidden vectors, where the first sample node hidden vectors include the hidden vectors corresponding to the reconstructed nodes obtained after the removed nodes undergo node reconstruction processing, the hidden vectors corresponding to the nodes that were not removed, and the like.
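One simple way such a reconstruction could work is to aggregate the attribute information of the removed node's neighbors; the mean aggregation below is an illustrative assumption, not the patented reconstruction scheme.

```python
# Reconstruct a removed node's attributes from its neighbors' attributes.
import numpy as np

def reconstruct_removed(adj: np.ndarray, x_masked: np.ndarray, idx: int) -> np.ndarray:
    neighbors = np.nonzero(adj[idx])[0]
    if len(neighbors) == 0:
        return np.zeros(x_masked.shape[1])
    return x_masked[neighbors].mean(axis=0)   # mean of neighbor attributes
```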
In step S208, the first sample node hidden vector is input into a decoder corresponding to the encoder, and a representation vector of a node corresponding to the first sample node hidden vector is obtained.
The nodes corresponding to the first sample node hidden vectors may include the nodes that were not removed and the reconstructed nodes obtained after the removed nodes undergo node reconstruction processing; correspondingly, the characterization vectors of the nodes corresponding to the first sample node hidden vectors may include the characterization vectors of the nodes that were not removed and the characterization vectors of the reconstructed nodes obtained after the removed nodes undergo node reconstruction processing. The encoder and the decoder may be constructed through a neural network model, where the neural network model may include one or more of a recurrent neural network model, a convolutional neural network model, and an attention-based neural network model; in practical applications, the neural network model may also include a plurality of models other than the above, which may be set according to the actual situation.
In step S210, based on the historical map sample, the characterization vector of the node corresponding to the first sample node hidden vector, and the first sample node hidden vector, the encoder and the decoder are jointly trained according to a gradient descent algorithm, so as to obtain a trained encoder.
In implementation, a generation rule for the loss function required for jointly training the encoder and the decoder may be set in advance, and a corresponding loss function may be generated from the historical graph sample based on the generation rule. The loss function may be the local reconstruction loss information, the global loss information, or a loss function obtained through a specified calculation based on the local reconstruction loss information and the global loss information. Then, based on any one of these, and combining the characterization vectors of the nodes corresponding to the first sample node hidden vectors with the first sample node hidden vectors, the encoder and the decoder may be jointly trained according to a gradient descent algorithm to obtain the trained encoder.
The specific processing manner of step S210 may be various, and an alternative processing manner is provided below, and may specifically include the following processing of step A2 and step A4.
In step A2, the local reconstruction loss information is determined based on the characterization vector of the removed node and the characterization vector of the reconstructed node obtained after the removed node undergoes node reconstruction processing.
In implementation, the distance between the characterization vector of the removed node and the characterization vector of the reconstructed node obtained after the removed node undergoes node reconstruction processing may be calculated, and the calculated distance may be used as the local reconstruction loss information. The distance between the two characterization vectors for the same node (or its neighboring nodes) should be as small as possible, i.e., smaller than a first preset distance threshold.
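As a sketch, with Euclidean distance assumed as the distance measure:

```python
# Local reconstruction loss: distance between the removed node's
# characterization vector and that of its reconstruction.
import numpy as np

def local_reconstruction_loss(v_removed: np.ndarray, v_reconstructed: np.ndarray) -> float:
    return float(np.linalg.norm(v_removed - v_reconstructed))
```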
In step A4, based on the historical graph sample and the local reconstruction loss information, the encoder and the decoder are jointly trained according to the gradient descent algorithm to obtain the trained encoder.
In implementation, the local reconstruction loss information may be used as the loss information corresponding to the loss function required for jointly training the encoder and the decoder. Then, based on the local reconstruction loss information, and combining the characterization vectors of the nodes corresponding to the first sample node hidden vectors with the first sample node hidden vectors, the encoder and the decoder may be jointly trained according to a gradient descent algorithm to obtain the trained encoder.
The specific processing manner of step S210 may be various, and an alternative processing manner is provided below, and may specifically include the following processing from step B2 to step B6.
In step B2, the local reconstruction loss information is determined based on the characterization vector of the removed node and the characterization vector of the reconstructed node obtained after the removed node undergoes node reconstruction processing.
In step B4, global loss information is determined based on the sample node hidden vector corresponding to each of the two different nodes, where the two different nodes are reconstructed nodes obtained by performing node reconstruction processing on the removed node.
In implementation, the distance between the sample node hidden vectors corresponding to each of two different nodes (the two different nodes being reconstructed nodes obtained after removed nodes undergo node reconstruction processing) may be calculated, and the calculated distance may be used as the global loss information. The distance between the sample node hidden vectors corresponding to similar nodes (e.g., two nodes having a common neighbor structure, such as the black nodes in FIG. 3, the diagonally hatched nodes in FIG. 3, or the cross-hatched nodes in FIG. 3) should be as small as possible, i.e., smaller than a second preset distance threshold.
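A sketch of the global loss over pairs of similar reconstructed nodes; the pairing and the Euclidean distance are illustrative assumptions.

```python
# Global loss: sum of distances between hidden vectors of similar nodes.
import numpy as np

def global_loss(h: np.ndarray, similar_pairs: list[tuple[int, int]]) -> float:
    return float(sum(np.linalg.norm(h[i] - h[j]) for i, j in similar_pairs))
```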
In step B6, based on the historical map samples and the global loss information, the encoder and the decoder are jointly trained according to the gradient descent algorithm to obtain a trained encoder.
In implementation, the global loss information may be used as the loss information corresponding to the loss function required for jointly training the encoder and the decoder. Then, based on the global loss information, and combining the characterization vectors of the nodes corresponding to the first sample node hidden vectors with the first sample node hidden vectors, the encoder and the decoder may be jointly trained according to a gradient descent algorithm to obtain the trained encoder.
The specific processing manner of the step B6 may be various, and an alternative processing manner is provided below, and specifically, the following processing of the step B62 and the step B64 may be included.
In step B62, target loss information is determined based on the local reconstruction loss information and the global loss information.
In implementation, the local reconstruction loss information and the global loss information may be added together, and the resulting sum may be used as the target loss information.
In step B64, based on the historical graph sample and the target loss information, the encoder and the decoder are jointly trained according to the gradient descent algorithm to obtain the trained encoder.
In implementation, the target loss information may be used as the loss information corresponding to the loss function required for jointly training the encoder and the decoder. Then, based on the target loss information, and combining the characterization vectors of the nodes corresponding to the first sample node hidden vectors with the first sample node hidden vectors, the encoder and the decoder may be jointly trained according to a gradient descent algorithm to obtain the trained encoder.
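The combination in step B62 is a plain addition, as sketched below; a weighted sum is a common variant but is only an aside here, not something the text specifies.

```python
# Target loss per step B62: the sum of the two loss terms.
def target_loss(local_loss: float, glob_loss: float) -> float:
    # A weighted variant (alpha * local + beta * global) is a common
    # alternative, assumed only as an aside; the text uses plain addition.
    return local_loss + glob_loss
```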
If the number of nodes removed from the historical graph sample is one, the processing of determining the global loss information based on the sample node hidden vectors corresponding to each of two different nodes may be implemented through the following steps C2 to C8.
In step C2, another node of the nodes except the removed node in the history graph sample is removed to obtain a second sub-graph sample.
In implementation, the previously removed node in the historical graph sample may first be restored to obtain the complete historical graph sample, and then another node, other than the previously removed node, may be removed from the historical graph sample; at this point, only one node in the complete historical graph sample is removed in total.
In step C4, the relationship information of the edges between different nodes in the second sub-graph sample and the attribute information of each node are input into a pre-constructed encoder to obtain a second sample node hidden vector, where the second sample node hidden vector includes a hidden vector corresponding to a reconstructed node obtained after the removed node is subjected to node reconstruction processing.
In step C6, the second sample node hidden vector is input into the decoder corresponding to the encoder, so as to obtain a characterization vector of a node corresponding to the second sample node hidden vector.
In step C8, global loss information is determined based on a first sample node hidden vector corresponding to one node removed from the history map sample and a second sample node hidden vector corresponding to another node removed from the history map sample.
In implementation, the distance between the first sample node hidden vector corresponding to the one node removed from the historical graph sample and the second sample node hidden vector corresponding to the other node removed from the historical graph sample may be calculated, and the calculated distance may be used as the global loss information.
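A sketch of steps C2 to C8 under the earlier PyTorch assumptions: each pass masks a single, different node, the two sub-graph samples are encoded separately, and the distance between the two removed nodes' hidden vectors serves as the global loss. All names are illustrative.

```python
import torch

def two_pass_global_loss(encoder, adj, x, idx_a: int, idx_b: int) -> torch.Tensor:
    x_a = x.clone()
    x_a[idx_a] = 0.0         # first sub-graph sample: node idx_a removed
    x_b = x.clone()
    x_b[idx_b] = 0.0         # second sub-graph sample: node idx_b removed
    h_a = encoder(adj, x_a)  # first sample node hidden vectors
    h_b = encoder(adj, x_b)  # second sample node hidden vectors
    return torch.dist(h_a[idx_a], h_b[idx_b])
```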
In step S212, a target graph to be processed is acquired, the target graph including a plurality of different nodes and attribute information of each node.
In step S214, the relationship information of the edges between different nodes in the target graph and the attribute information of each node are input into a pre-trained encoder, so as to obtain a hidden vector corresponding to each node in the target graph.
In step S216, based on the hidden vector corresponding to each node in the target graph, the nodes included in the target graph are clustered by using a k-means algorithm, so as to obtain one or more nodes of different clustering categories.
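Step S216 names the k-means algorithm explicitly; a minimal sketch using scikit-learn follows, where the number of clusters and the stand-in input are assumptions.

```python
# Cluster the encoder's hidden vectors with k-means.
import numpy as np
from sklearn.cluster import KMeans

hidden_vectors = np.random.rand(100, 32)   # stand-in for the encoder output
labels = KMeans(n_clusters=5, n_init=10).fit_predict(hidden_vectors)
# labels[i] is the cluster category of node i in the target graph
```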
The specific processing of step S212 to step S216 may refer to the relevant contents in the above embodiments, and is not described herein again.
In step S218, based on the nodes of one or more different cluster categories, the nodes of the cluster category where the preset risk exists are determined.
The preset risk may include multiple types, such as a fraud risk or a business security risk, and may be specifically set according to an actual situation.
In implementation, the nodes in the target graph are divided into one or more different cluster categories in the above manner. If one or more nodes in a certain cluster category carry a preset risk, it may be determined that the nodes in that cluster category are nodes of a cluster category carrying the preset risk. For example, if account 1 (i.e., one node) in cluster category A carries a fraud risk, it may be determined that the accounts in cluster category A are group accounts carrying the fraud risk.
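A small sketch of this propagation rule; where the initial risk labels come from is outside the text and assumed here.

```python
# Flag every node in a cluster once any member carries the preset risk.
def flag_risky_nodes(labels: list[int], known_risky: set[int]) -> set[int]:
    risky_categories = {labels[i] for i in known_risky}
    return {i for i, c in enumerate(labels) if c in risky_categories}

# Usage: if node 7 (e.g. account 1) is known fraudulent, all nodes sharing
# its cluster category are flagged as a potential fraud group:
# flagged = flag_risky_nodes(labels, {7})
```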
The embodiment of the specification provides a data processing method: a target graph to be processed, containing a plurality of different nodes and attribute information of each node, is acquired; the relationship information of edges between different nodes in the target graph and the attribute information of each node are input into a pre-trained encoder to obtain a hidden vector corresponding to each node in the target graph, where the encoder is obtained by removing one or more nodes from a historical graph sample to obtain a sub-graph sample, and jointly training the encoder and a decoder corresponding to the encoder using the sub-graph sample, based on local reconstruction loss information corresponding to the sub-graph sample and/or global loss information corresponding to the sub-graph sample; finally, the nodes contained in the target graph are clustered based on the hidden vector corresponding to each node in the target graph to obtain nodes of one or more different cluster categories. In this way, the information of a node member in a community is removed, the information of that node is then reconstructed using the information of the node and its neighbor nodes, and the local reconstruction loss information and/or the global loss information are determined accordingly; the encoder obtained through this training can therefore, together with a clustering algorithm, perform effective cluster classification on the nodes and quickly determine whether nodes carry risks, thereby providing a better method for unsupervised community discovery or risk discovery in a business.
Example three
Based on the same idea as the data processing method provided in the embodiments of the present specification, an embodiment of the present specification further provides a data processing apparatus, as shown in FIG. 4.
The data processing apparatus includes: a graph obtaining module 401, a hidden vector determining module 402, and a clustering module 403, where:
the graph obtaining module 401 obtains a target graph to be processed, where the target graph includes a plurality of different nodes and attribute information of each node;
a hidden vector determining module 402, configured to input relationship information of edges between different nodes in the target graph and attribute information of each node into a pre-trained encoder to obtain a hidden vector corresponding to each node in the target graph, where the encoder is obtained by removing one or more nodes from a historical graph sample to obtain a sub-graph sample, and jointly training the encoder and a decoder corresponding to the encoder using the sub-graph sample, based on local reconstruction loss information corresponding to the sub-graph sample and/or global loss information corresponding to the sub-graph sample;
the clustering module 403 performs clustering processing on the nodes included in the target graph based on the hidden vector corresponding to each node in the target graph to obtain one or more nodes of different clustering categories.
In this embodiment, the clustering module 403 performs clustering processing on the nodes included in the target graph by using a k-means algorithm based on the hidden vector corresponding to each node in the target graph, so as to obtain one or more nodes of different clustering categories.
In an embodiment of this specification, the apparatus further includes:
and the risk determination module is used for determining the nodes of the cluster categories with preset risks based on the nodes of one or more different cluster categories.
In an embodiment of this specification, the apparatus further includes:
the sample acquisition module is used for acquiring the historical graph sample, wherein the historical graph sample comprises a plurality of different nodes and attribute information of each node;
the node removing module is used for removing one or more different nodes in the historical graph sample to obtain a first sub-graph sample;
the sample hidden vector determining module is used for inputting the relationship information of edges between different nodes in the first sub-graph sample and the attribute information of each node into a pre-constructed encoder to obtain a first sample node hidden vector, wherein the first sample node hidden vector comprises a hidden vector corresponding to a reconstruction node obtained after the removed node is subjected to node reconstruction processing;
the sample representation determining module is used for inputting the first sample node hidden vector into a decoder corresponding to the encoder to obtain a representation vector of a node corresponding to the first sample node hidden vector;
and the training module is used for carrying out joint training on the encoder and the decoder according to a gradient descent algorithm based on the historical graph sample, the characterization vector of the node corresponding to the first sample node hidden vector and the first sample node hidden vector to obtain the trained encoder.
In an embodiment of this specification, the training module includes:
a first local loss determination unit configured to determine the local reconstruction loss information based on the characterization vector of a removed node and the characterization vector of the reconstructed node obtained after the removed node undergoes node reconstruction processing;
and a first training unit configured to jointly train the encoder and the decoder according to a gradient descent algorithm based on the historical graph sample and the local reconstruction loss information to obtain the trained encoder.
In an embodiment of this specification, the training module includes:
a second local loss determination unit configured to determine the local reconstruction loss information based on the characterization vector of the removed node and the characterization vector of the reconstructed node obtained after the removed node undergoes node reconstruction processing;
a global loss determination unit configured to determine the global loss information based on the sample node hidden vector corresponding to each of two different nodes, where the two different nodes are reconstructed nodes obtained after removed nodes undergo node reconstruction processing;
and a second training unit configured to jointly train the encoder and the decoder according to a gradient descent algorithm based on the historical graph sample and the global loss information to obtain the trained encoder.
In an embodiment of this specification, the second training unit determines target loss information based on the local reconstruction loss information and the global loss information, and jointly trains the encoder and the decoder according to a gradient descent algorithm based on the historical graph sample and the target loss information to obtain the trained encoder.
In this embodiment of the specification, the number of nodes removed from the history graph sample is one, and the global loss determining unit removes another node of the nodes except the removed node in the history graph sample to obtain a second sub-graph sample; inputting the relationship information of edges between different nodes in the second sub-graph sample and the attribute information of each node into a pre-constructed encoder to obtain a second sample node hidden vector, wherein the second sample node hidden vector comprises a hidden vector corresponding to a reconstruction node obtained after the removed node is subjected to node reconstruction processing; inputting the second sample node hidden vector into a decoder corresponding to the encoder to obtain a characterization vector of a node corresponding to the second sample node hidden vector; determining the global loss information based on a first sample node hidden vector corresponding to one node removed from the historical map sample and a second sample node hidden vector corresponding to another node removed from the historical map sample.
The embodiment of the present specification provides a data processing apparatus that acquires a target graph to be processed, containing a plurality of different nodes and attribute information of each node, and inputs the relationship information of edges between different nodes in the target graph and the attribute information of each node into a pre-trained encoder to obtain a hidden vector corresponding to each node in the target graph, where the encoder is obtained by removing one or more nodes from a historical graph sample to obtain a sub-graph sample, and jointly training the encoder and a decoder corresponding to the encoder using the sub-graph sample, based on local reconstruction loss information corresponding to the sub-graph sample and/or global loss information corresponding to the sub-graph sample; finally, based on the hidden vector corresponding to each node in the target graph, the nodes contained in the target graph are clustered to obtain nodes of one or more different cluster categories. In this way, the information of a node member in a community is removed, the information of that node is then reconstructed using the information of the node and its neighbor nodes, and the local reconstruction loss information and/or the global loss information are determined accordingly; the encoder obtained through this training can therefore, together with a clustering algorithm, perform effective cluster classification on the nodes and quickly determine whether nodes carry risks, thereby providing a better method for unsupervised community discovery or risk discovery in a business.
Example four
Based on the same idea, the data processing apparatus provided in the embodiment of the present specification further provides a data processing device, as shown in fig. 5.
The data processing device may be the terminal device or the server provided in the foregoing embodiments.
Data processing devices may differ greatly depending on configuration or performance, and may include one or more processors 501 and a memory 502, where the memory 502 may store one or more application programs or data. The memory 502 may provide transient or persistent storage. The application programs stored in the memory 502 may include one or more modules (not shown), and each module may include a series of computer-executable instructions for the data processing device. Further, the processor 501 may be arranged to communicate with the memory 502 and to execute, on the data processing device, the series of computer-executable instructions in the memory 502. The data processing device may also include one or more power supplies 503, one or more wired or wireless network interfaces 504, one or more input/output interfaces 505, and one or more keyboards 506.
In particular, in this embodiment, the data processing apparatus includes a memory, and one or more programs, wherein the one or more programs are stored in the memory, and the one or more programs may include one or more modules, and each module may include a series of computer-executable instructions for the data processing apparatus, and the one or more programs configured to be executed by the one or more processors include computer-executable instructions for:
acquiring a target graph to be processed, wherein the target graph comprises a plurality of different nodes and attribute information of each node;
inputting relationship information of edges between different nodes in the target graph and attribute information of each node into a pre-trained encoder to obtain a hidden vector corresponding to each node in the target graph, where the encoder is obtained by removing one or more nodes from a historical graph sample to obtain a sub-graph sample, and jointly training the encoder and a decoder corresponding to the encoder using the sub-graph sample, based on local reconstruction loss information corresponding to the sub-graph sample and/or global loss information corresponding to the sub-graph sample;
and clustering the nodes contained in the target graph based on the hidden vector corresponding to each node in the target graph to obtain one or more nodes of different clustering categories.
In this embodiment of the present specification, the clustering, based on the hidden vector corresponding to each node in the target graph, the nodes included in the target graph to obtain one or more nodes of different clustering categories includes:
and based on the hidden vector corresponding to each node in the target graph, clustering the nodes contained in the target graph by using a k-means algorithm to obtain one or more nodes of different clustering categories.
In the embodiment of this specification, the method further includes:
and determining the nodes of the cluster categories with preset risks based on the nodes of one or more different cluster categories.
In the embodiment of this specification, the method further includes:
acquiring a historical graph sample, wherein the historical graph sample comprises a plurality of different nodes and attribute information of each node;
removing one or more different nodes in the historical graph sample to obtain a first sub-graph sample;
inputting the relationship information of edges between different nodes in the first sub-graph sample and the attribute information of each node into a pre-constructed encoder to obtain a first sample node hidden vector, wherein the first sample node hidden vector comprises a hidden vector corresponding to a reconstruction node obtained after the removed node is subjected to node reconstruction processing;
inputting the first sample node hidden vector into a decoder corresponding to the encoder to obtain a representation vector of a node corresponding to the first sample node hidden vector;
and performing joint training on the encoder and the decoder according to a gradient descent algorithm based on the historical graph sample, the characterization vector of the node corresponding to the first sample node hidden vector and the first sample node hidden vector to obtain a trained encoder.
In an embodiment of this specification, the jointly training the encoder and the decoder according to a gradient descent algorithm based on the historical map sample, the characterization vector of the node corresponding to the first sample node hidden vector, and the first sample node hidden vector to obtain a trained encoder includes:
determining the local reconstruction loss information based on the characterization vector of the removed node and the characterization vector of the reconstruction node obtained by the node reconstruction processing of the removed node;
and performing joint training on the encoder and the decoder according to a gradient descent algorithm based on the historical graph sample and the local reconstruction loss information to obtain a trained encoder.
In an embodiment of this specification, the jointly training the encoder and the decoder according to a gradient descent algorithm based on the historical map sample, the characterization vector of the node corresponding to the first sample node hidden vector, and the first sample node hidden vector to obtain a trained encoder includes:
determining the local reconstruction loss information based on the characterization vector of the removed node and the characterization vector of the reconstruction node obtained by the node reconstruction processing of the removed node;
determining the global loss information based on a sample node hidden vector corresponding to each of two different nodes, wherein the two different nodes are reconstructed nodes obtained after node reconstruction processing is carried out on the removed nodes;
and performing joint training on the encoder and the decoder according to a gradient descent algorithm based on the historical graph sample and the global loss information to obtain a trained encoder.
In this embodiment of the present specification, the jointly training the encoder and the decoder according to a gradient descent algorithm based on the historical map samples and the global loss information to obtain a trained encoder includes:
determining target loss information based on the local reconstruction loss information and the global loss information;
and performing joint training on the encoder and the decoder according to a gradient descent algorithm based on the historical graph sample and the target loss information to obtain a trained encoder.
In this embodiment of the present specification, the number of nodes removed from the history map sample is one, and the determining the global loss information based on the sample node hidden vector corresponding to each node in two different nodes includes:
removing another node in the nodes except the removed node in the historical graph sample to obtain a second sub-graph sample;
inputting the relationship information of edges between different nodes in the second sub-graph sample and the attribute information of each node into a pre-constructed encoder to obtain a second sample node hidden vector, wherein the second sample node hidden vector comprises a hidden vector corresponding to a reconstruction node obtained after the removed node is subjected to node reconstruction processing;
inputting the second sample node hidden vector into a decoder corresponding to the encoder to obtain a characterization vector of a node corresponding to the second sample node hidden vector;
determining the global loss information based on a first sample node hidden vector corresponding to one node removed from the historical map sample and a second sample node hidden vector corresponding to another node removed from the historical map sample.
The embodiment of the specification provides a data processing device that acquires a target graph to be processed, containing a plurality of different nodes and attribute information of each node, and inputs the relationship information of edges between different nodes in the target graph and the attribute information of each node into a pre-trained encoder to obtain a hidden vector corresponding to each node in the target graph, where the encoder is obtained by removing one or more nodes from a historical graph sample to obtain a sub-graph sample, and jointly training the encoder and a decoder corresponding to the encoder using the sub-graph sample, based on local reconstruction loss information corresponding to the sub-graph sample and/or global loss information corresponding to the sub-graph sample; finally, based on the hidden vector corresponding to each node in the target graph, the nodes contained in the target graph are clustered to obtain nodes of one or more different cluster categories. In this way, the information of a node member in a community is removed, the information of that node is then reconstructed using the information of the node and its neighbor nodes, and the local reconstruction loss information and/or the global loss information are determined accordingly; the encoder obtained through this training can therefore, together with a clustering algorithm, perform effective cluster classification on the nodes and quickly determine whether nodes carry risks, thereby providing a better method for unsupervised community discovery or risk discovery in a business.
EXAMPLE five
Further, based on the methods shown in fig. 1 to fig. 3, one or more embodiments of the present specification further provide a storage medium for storing computer-executable instruction information. In a specific embodiment, the storage medium may be a USB flash disk, an optical disc, a hard disk, or the like, and when the computer-executable instruction information stored therein is executed by a processor, the following flow can be implemented:
acquiring a target graph to be processed, wherein the target graph comprises a plurality of different nodes and attribute information of each node;
inputting relationship information of edges between different nodes in the target graph and attribute information of each node into a pre-trained encoder to obtain a hidden vector corresponding to each node in the target graph, wherein the encoder is an encoder obtained by removing one or more nodes in a historical graph sample to obtain a sub-graph sample and performing joint training on the encoder and a decoder corresponding to the encoder by using the sub-graph sample based on local reconstruction loss information corresponding to the sub-graph sample and/or global loss information corresponding to the sub-graph sample;
and clustering the nodes contained in the target graph based on the hidden vector corresponding to each node in the target graph to obtain one or more nodes of different clustering categories.
In this embodiment of the present specification, the clustering, based on the hidden vector corresponding to each node in the target graph, the nodes included in the target graph to obtain one or more nodes of different clustering categories includes:
and based on the hidden vector corresponding to each node in the target graph, clustering the nodes contained in the target graph by using a k-means algorithm to obtain one or more nodes of different clustering categories.
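As a hedged sketch of this clustering step (the embodiment names only the k-means algorithm; the scikit-learn API, the cluster count, and the variable names below are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_nodes(hidden_vectors: np.ndarray, num_clusters: int) -> np.ndarray:
    # hidden_vectors: [num_nodes, hidden_dim] matrix of per-node hidden
    # vectors produced by the trained encoder for the target graph.
    km = KMeans(n_clusters=num_clusters, n_init=10, random_state=0)
    return km.fit_predict(hidden_vectors)  # one cluster label per node

# Example: labels = cluster_nodes(node_hidden_vectors, num_clusters=5)
```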
In the embodiment of this specification, the method further includes:
and determining the nodes of the cluster categories with preset risks based on the nodes of one or more different cluster categories.
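The criterion for deciding that a cluster category carries the preset risk is not fixed by the embodiment; one plausible rule, shown as an assumption-laden sketch, flags any cluster whose share of members already known to be risky exceeds a threshold:

```python
import numpy as np

def clusters_with_preset_risk(labels: np.ndarray, risk_flags: np.ndarray,
                              threshold: float = 0.5) -> set:
    # labels: cluster label per node (e.g. from k-means above);
    # risk_flags: 1 if a node is already known to carry the preset risk,
    # else 0. The voting rule and the 0.5 threshold are assumptions.
    flagged = set()
    for c in np.unique(labels):
        members = labels == c
        if risk_flags[members].mean() > threshold:
            flagged.add(int(c))
    return flagged
```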
In the embodiment of this specification, the method further includes:
acquiring a historical graph sample, wherein the historical graph sample comprises a plurality of different nodes and attribute information of each node;
removing one or more different nodes in the historical graph sample to obtain a first sub-graph sample;
inputting the relationship information of edges between different nodes in the first sub-graph sample and the attribute information of each node into a pre-constructed encoder to obtain a first sample node hidden vector, wherein the first sample node hidden vector comprises a hidden vector corresponding to a reconstruction node obtained after the removed node is subjected to node reconstruction processing;
inputting the first sample node hidden vector into a decoder corresponding to the encoder to obtain a characterization vector of a node corresponding to the first sample node hidden vector;
and performing joint training on the encoder and the decoder according to a gradient descent algorithm based on the historical graph sample, the characterization vector of the node corresponding to the first sample node hidden vector and the first sample node hidden vector to obtain a trained encoder.
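A minimal sketch of this training flow, assuming that "removing" a node means masking its attribute information with a learnable token while keeping its edges, so that its neighbors drive the reconstruction. The two-layer dense-adjacency encoder, the linear decoder, the MSE form of the loss, and all names below are illustrative assumptions rather than the architecture of this specification:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphEncoder(nn.Module):
    def __init__(self, in_dim: int, hid_dim: int):
        super().__init__()
        self.lin1 = nn.Linear(in_dim, hid_dim)
        self.lin2 = nn.Linear(hid_dim, hid_dim)
        self.mask_token = nn.Parameter(torch.zeros(in_dim))  # stands in for the removed node

    def forward(self, adj: torch.Tensor, x: torch.Tensor, removed: int) -> torch.Tensor:
        x = x.clone()
        x[removed] = self.mask_token          # "remove" the node's attributes
        h = torch.relu(adj @ self.lin1(x))    # neighbors propagate information in
        return adj @ self.lin2(h)             # sample node hidden vectors, [N, hid_dim]

class GraphDecoder(nn.Module):
    def __init__(self, hid_dim: int, out_dim: int):
        super().__init__()
        # Set out_dim = in_dim so the decoder reconstructs node attributes.
        self.lin = nn.Linear(hid_dim, out_dim)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.lin(h)                    # characterization vectors

def joint_train_step(encoder, decoder, optimizer, adj, x, removed):
    # One gradient-descent step on a first sub-graph sample: encode with the
    # node removed, decode, and compare the reconstruction node with the
    # removed node's original attributes (MSE is an assumed loss form).
    h = encoder(adj, x, removed)
    x_hat = decoder(h)
    loss = F.mse_loss(x_hat[removed], x[removed])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

A single optimizer over both modules, for example torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters())), would realize the joint gradient-descent training named above.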
In an embodiment of this specification, the jointly training the encoder and the decoder according to a gradient descent algorithm based on the historical graph sample, the characterization vector of the node corresponding to the first sample node hidden vector, and the first sample node hidden vector to obtain a trained encoder includes:
determining the local reconstruction loss information based on the characterization vector of the removed node and the characterization vector of the reconstruction node obtained by the node reconstruction processing of the removed node;
and performing joint training on the encoder and the decoder according to a gradient descent algorithm based on the historical graph sample and the local reconstruction loss information to obtain a trained encoder.
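The distance used for the local reconstruction loss information is likewise not pinned down here; besides the plain MSE assumed in the sketch above, a scaled cosine error between the removed node's characterization vector and that of its reconstruction node is another plausible choice (the form and the exponent gamma are assumptions):

```python
import torch
import torch.nn.functional as F

def local_reconstruction_loss(x_removed: torch.Tensor,
                              x_reconstructed: torch.Tensor,
                              gamma: float = 2.0) -> torch.Tensor:
    # Scaled cosine error; gamma > 1 down-weights already-easy reconstructions.
    cos = F.cosine_similarity(x_removed, x_reconstructed, dim=-1)
    return ((1.0 - cos).clamp(min=0) ** gamma).mean()
```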
In an embodiment of this specification, the jointly training the encoder and the decoder according to a gradient descent algorithm based on the historical graph sample, the characterization vector of the node corresponding to the first sample node hidden vector, and the first sample node hidden vector to obtain a trained encoder includes:
determining the local reconstruction loss information based on the characterization vector of the removed node and the characterization vector of the reconstruction node obtained by the node reconstruction processing of the removed node;
determining the global loss information based on the sample node hidden vector corresponding to each of two different nodes, wherein the two different nodes are reconstruction nodes obtained after the removed nodes are subjected to node reconstruction processing;
and performing joint training on the encoder and the decoder according to a gradient descent algorithm based on the historical graph sample and the global loss information to obtain a trained encoder.
In this embodiment of the present specification, the jointly training the encoder and the decoder according to a gradient descent algorithm based on the historical graph sample and the global loss information to obtain a trained encoder includes:
determining target loss information based on the local reconstruction loss information and the global loss information;
and performing joint training on the encoder and the decoder according to a gradient descent algorithm based on the historical graph sample and the target loss information to obtain a trained encoder.
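How the target loss information combines the two terms is left open by the embodiment; a weighted sum is the simplest hedged reading, with the weight lam a free assumption:

```python
import torch

def target_loss(local_loss: torch.Tensor, global_loss: torch.Tensor,
                lam: float = 1.0) -> torch.Tensor:
    # Joint objective for the encoder/decoder gradient-descent training.
    return local_loss + lam * global_loss
```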
In this embodiment of the present specification, the number of nodes removed from the historical graph sample is one, and the determining the global loss information based on the sample node hidden vector corresponding to each of the two different nodes includes:
removing another node in the nodes except the removed node in the historical graph sample to obtain a second sub-graph sample;
inputting the relationship information of edges between different nodes in the second sub-graph sample and the attribute information of each node into a pre-constructed encoder to obtain a second sample node hidden vector, wherein the second sample node hidden vector comprises a hidden vector corresponding to a reconstruction node obtained after the removed node is subjected to node reconstruction processing;
inputting the second sample node hidden vector into a decoder corresponding to the encoder to obtain a characterization vector of a node corresponding to the second sample node hidden vector;
determining the global loss information based on a first sample node hidden vector corresponding to one node removed from the historical graph sample and a second sample node hidden vector corresponding to another node removed from the historical graph sample.
The embodiment of the present specification provides a storage medium. A target graph to be processed, including a plurality of different nodes and attribute information of each node, is acquired, and the relationship information of edges between different nodes in the target graph and the attribute information of each node are input into a pre-trained encoder to obtain a hidden vector corresponding to each node in the target graph, the encoder being obtained by removing one or more nodes in a historical graph sample to obtain a sub-graph sample and performing joint training on the encoder and a decoder corresponding to the encoder by using the sub-graph sample based on the local reconstruction loss information corresponding to the sub-graph sample and/or the global loss information corresponding to the sub-graph sample; finally, clustering processing is performed on the nodes included in the target graph based on the hidden vector corresponding to each node in the target graph to obtain nodes of one or more different clustering categories. In this way, the information of one node member in a community is removed and then reconstructed from the information of its neighboring nodes, the local reconstruction loss information and/or the global loss information are determined accordingly, and the encoder obtained after training can accurately determine the hidden vector of each node, so that nodes of different clustering categories can be obtained quickly and accurately; this improves the efficiency and accuracy of community discovery and makes it possible to determine whether nodes of a clustering category with a preset risk exist, thereby effectively discovering risks in the service.
The foregoing description of specific embodiments has been presented for purposes of illustration and description. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
In the 1990s, an improvement in a technology could be clearly distinguished as an improvement in hardware (for example, an improvement in a circuit structure such as a diode, a transistor, or a switch) or an improvement in software (an improvement in a method flow). However, as technology develops, improvements to many of today's method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain a corresponding hardware circuit structure by programming an improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement of a method flow cannot be implemented with hardware entity modules. For example, a programmable logic device (PLD), such as a field programmable gate array (FPGA), is an integrated circuit whose logic function is determined by a user programming the device. A designer "integrates" a digital system onto a single PLD by programming it, without asking a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually fabricating integrated circuit chips, this programming is now mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development, and the original code to be compiled must be written in a particular programming language called a hardware description language (HDL). There is not only one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are the most commonly used at present. It should also be clear to those skilled in the art that a hardware circuit implementing a logical method flow can easily be obtained simply by briefly programming the method flow into an integrated circuit using one of the above hardware description languages.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of such controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320; a memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art also know that, in addition to implementing the controller purely as computer-readable program code, the method steps can be logically programmed so that the controller realizes the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the means included therein for performing the various functions may also be regarded as structures within the hardware component. Or, the means for performing the various functions may even be regarded both as software modules implementing the method and as structures within the hardware component.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functionality of the various elements may be implemented in the same one or more software and/or hardware implementations in implementing one or more embodiments of the present description.
As will be appreciated by one skilled in the art, embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, one or more embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of the present description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present specification are described with reference to flowchart illustrations and/or block diagrams of methods, apparatuses (systems), and computer program products according to the embodiments of the specification. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing apparatus to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing apparatus create means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus, so that a series of operational steps are performed on the computer or other programmable apparatus to produce computer-implemented processing, and the instructions executed on the computer or other programmable apparatus thus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape/magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprise", "include", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a/an ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
One or more embodiments of the present description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. One or more embodiments of the specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only an example of the present specification, and is not intended to limit the present application. Various modifications and alterations to this description will become apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present specification should be included in the scope of the claims of the present specification.

Claims (11)

1. A method of data processing, the method comprising:
acquiring a target graph to be processed, wherein the target graph comprises a plurality of different nodes and attribute information of each node;
inputting relationship information of edges between different nodes in the target graph and attribute information of each node into a pre-trained encoder to obtain a hidden vector corresponding to each node in the target graph, wherein the encoder is an encoder obtained by removing one or more nodes in a historical graph sample to obtain a sub-graph sample and performing joint training on the encoder and a decoder corresponding to the encoder by using the sub-graph sample based on local reconstruction loss information corresponding to the sub-graph sample and/or global loss information corresponding to the sub-graph sample;
and clustering the nodes contained in the target graph based on the hidden vector corresponding to each node in the target graph to obtain one or more nodes of different clustering categories.
2. The method according to claim 1, wherein the clustering the nodes included in the target graph based on the hidden vector corresponding to each node in the target graph to obtain nodes of one or more different clustering categories includes:
and based on the hidden vector corresponding to each node in the target graph, clustering the nodes contained in the target graph by using a k-means algorithm to obtain one or more nodes of different clustering categories.
3. The method of claim 1, further comprising:
and determining the nodes of the cluster categories with preset risks based on the nodes of one or more different cluster categories.
4. The method of claim 1, further comprising:
acquiring a historical graph sample, wherein the historical graph sample comprises a plurality of different nodes and attribute information of each node;
removing one or more different nodes in the historical graph sample to obtain a first sub-graph sample;
inputting the relationship information of edges between different nodes in the first sub-graph sample and the attribute information of each node into a pre-constructed encoder to obtain a first sample node hidden vector, wherein the first sample node hidden vector comprises a hidden vector corresponding to a reconstruction node obtained after the removed node is subjected to node reconstruction processing;
inputting the first sample node hidden vector into a decoder corresponding to the encoder to obtain a representation vector of a node corresponding to the first sample node hidden vector;
and performing joint training on the encoder and the decoder according to a gradient descent algorithm based on the historical graph sample, the characterization vector of the node corresponding to the first sample node hidden vector and the first sample node hidden vector to obtain a trained encoder.
5. The method of claim 4, wherein jointly training the encoder and the decoder according to a gradient descent algorithm based on the historical graph sample, the characterization vector of the node corresponding to the first sample node hidden vector, and the first sample node hidden vector to obtain a trained encoder comprises:
determining the local reconstruction loss information based on the characterization vector of the removed node and the characterization vector of the reconstruction node obtained by the node reconstruction processing of the removed node;
and performing joint training on the encoder and the decoder according to a gradient descent algorithm based on the historical graph sample and the local reconstruction loss information to obtain a trained encoder.
6. The method of claim 4, wherein jointly training the encoder and the decoder according to a gradient descent algorithm based on the historical graph sample, the characterization vector of the node corresponding to the first sample node hidden vector, and the first sample node hidden vector to obtain a trained encoder comprises:
determining the local reconstruction loss information based on the characterization vector of the removed node and the characterization vector of the reconstruction node obtained by the node reconstruction processing of the removed node;
determining the global loss information based on the sample node hidden vector corresponding to each of two different nodes, wherein the two different nodes are reconstruction nodes obtained after the removed nodes are subjected to node reconstruction processing;
and performing joint training on the encoder and the decoder according to a gradient descent algorithm based on the historical graph sample and the global loss information to obtain a trained encoder.
7. The method of claim 6, wherein jointly training the encoder and the decoder according to a gradient descent algorithm based on the historical graph sample and the global loss information to obtain a trained encoder comprises:
determining target loss information based on the local reconstruction loss information and the global loss information;
and performing joint training on the encoder and the decoder according to a gradient descent algorithm based on the historical graph sample and the target loss information to obtain a trained encoder.
8. The method of claim 6 or 7, wherein the number of nodes removed from the historical graph sample is one, and the determining the global loss information based on the sample node hidden vector corresponding to each of the two different nodes comprises:
removing another node in the nodes except the removed node in the historical graph sample to obtain a second sub-graph sample;
inputting the relationship information of edges between different nodes in the second sub-graph sample and the attribute information of each node into a pre-constructed encoder to obtain a second sample node hidden vector, wherein the second sample node hidden vector comprises a hidden vector corresponding to a reconstruction node obtained after the removed node is subjected to node reconstruction processing;
inputting the second sample node hidden vector into a decoder corresponding to the encoder to obtain a characterization vector of a node corresponding to the second sample node hidden vector;
determining the global loss information based on a first sample node hidden vector corresponding to one node removed from the historical graph sample and a second sample node hidden vector corresponding to another node removed from the historical graph sample.
9. A data processing apparatus, the apparatus comprising:
the map acquisition module is used for acquiring a target map to be processed, wherein the target map comprises a plurality of different nodes and attribute information of each node;
a hidden vector determining module, configured to input relationship information of edges between different nodes in the target graph and attribute information of each node into a pre-trained encoder to obtain a hidden vector corresponding to each node in the target graph, wherein the encoder is an encoder obtained by removing one or more nodes in a historical graph sample to obtain a sub-graph sample and performing joint training on the encoder and a decoder corresponding to the encoder by using the sub-graph sample based on local reconstruction loss information corresponding to the sub-graph sample and/or global loss information corresponding to the sub-graph sample;
and the clustering module is used for clustering the nodes contained in the target graph based on the hidden vector corresponding to each node in the target graph to obtain one or more nodes of different clustering categories.
10. A data processing apparatus, the data processing apparatus comprising:
a processor; and
a memory arranged to store computer executable instructions that, when executed, cause the processor to:
acquiring a target graph to be processed, wherein the target graph comprises a plurality of different nodes and attribute information of each node;
inputting relationship information of edges between different nodes in the target graph and attribute information of each node into a pre-trained encoder to obtain a hidden vector corresponding to each node in the target graph, wherein the encoder is an encoder obtained by removing one or more nodes in a historical graph sample to obtain a sub-graph sample and performing joint training on the encoder and a decoder corresponding to the encoder by using the sub-graph sample based on local reconstruction loss information corresponding to the sub-graph sample and/or global loss information corresponding to the sub-graph sample;
and clustering the nodes contained in the target graph based on the hidden vector corresponding to each node in the target graph to obtain one or more nodes of different clustering categories.
11. A storage medium for storing computer-executable instructions, which when executed by a processor implement the following:
acquiring a target graph to be processed, wherein the target graph comprises a plurality of different nodes and attribute information of each node;
inputting relationship information of edges between different nodes in the target graph and attribute information of each node into a pre-trained encoder to obtain a hidden vector corresponding to each node in the target graph, wherein the encoder is an encoder obtained by removing one or more nodes in a historical graph sample to obtain a sub-graph sample and performing joint training on the encoder and a decoder corresponding to the encoder by using the sub-graph sample based on the determined local reconstruction loss information corresponding to the sub-graph sample and/or global loss information corresponding to the sub-graph sample;
and clustering the nodes contained in the target graph based on the hidden vector corresponding to each node in the target graph to obtain one or more nodes of different clustering categories.
CN202211591784.0A 2022-12-12 2022-12-12 Data processing method, device and equipment Pending CN115859110A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211591784.0A CN115859110A (en) 2022-12-12 2022-12-12 Data processing method, device and equipment


Publications (1)

Publication Number Publication Date
CN115859110A true CN115859110A (en) 2023-03-28

Family

ID=85672162

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211591784.0A Pending CN115859110A (en) 2022-12-12 2022-12-12 Data processing method, device and equipment

Country Status (1)

Country Link
CN (1) CN115859110A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination