WO2023051080A1 - Graph data processing method and apparatus, computer device, storage medium, and computer program product - Google Patents

Graph data processing method and apparatus, computer device, storage medium, and computer program product

Info

Publication number
WO2023051080A1
WO2023051080A1 (PCT/CN2022/113580, CN2022113580W)
Authority
WO
WIPO (PCT)
Prior art keywords
nodes
features
node
embedding
neighborhood
Prior art date
Application number
PCT/CN2022/113580
Other languages
English (en)
French (fr)
Inventor
王志涛
洪立涛
黄薇屹
苏函晶
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 filed Critical 腾讯科技(深圳)有限公司
Priority to EP22874475.1A (EP4318375A1)
Publication of WO2023051080A1
Priority to US18/333,530 (US20230334096A1)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/901 - Indexing; Data structures therefor; Storage structures
    • G06F16/9024 - Graphs; Linked lists
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 - Computing arrangements using knowledge-based models
    • G06N5/02 - Knowledge representation; Symbolic representation
    • G06N5/022 - Knowledge engineering; Knowledge acquisition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 - General purpose image data processing
    • G06T1/20 - Processor architectures; Processor configuration, e.g. pipelining
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 - General purpose image data processing
    • G06T1/60 - Memory management
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/12 - Fingerprints or palmprints
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the field of computer technology, and relates to but not limited to a graph data processing method, device, computer equipment, storage medium and computer program product.
  • Graph embedding technology has become an effective technical means for large-scale graph data mining.
  • the main idea of graph embedding is to map the high-dimensional sparse original graph data into a low-dimensional dense vector, that is, the embedding vector of each node in the graph, so as to preserve the topology information of the graph as much as possible in the embedding space. How to efficiently perform graph embedding on graph data is a research direction.
  • In the related art, graph embedding technologies include DeepWalk (deep walk), Node2Vec (node to vector), LINE (Large-scale Information Network Embedding), and the like, which determine an encoding vector for each node based on random-walk algorithms over the nodes in the graph data and the business relationships between the nodes.
  • In a directed graph, a pair of nodes represents the initiator and the receiver of a business relationship, and the directed edge between the pair of nodes points from the initiator to the receiver. However, the above three algorithms ultimately obtain only a single encoding vector for each node in the graph, which makes it difficult to fully express the complete data relationships of the nodes in the graph, so information is lost in the process of processing the graph data.
  • The embodiments of the present application provide a graph data processing method, apparatus, computer device, storage medium, and computer program product, which enable nodes to fully express the information in the graph data and ensure information integrity during the conversion process from graph data to the target graph.
  • An embodiment of the present application provides a graph data processing method, which is applied to a computer device, and the method includes:
  • a target graph is obtained, where the target graph includes M nodes representing M objects, M is a positive integer greater than 1, and a directed edge in the target graph is used to represent that there is a business relationship between the objects corresponding to the connected nodes; the node from which the directed edge points out is the initiator of the business relationship, and the node to which the directed edge points is the receiver of the business relationship;
  • multiple rounds of iteration are performed based on the M nodes and the directed edges between the M nodes to obtain the initiating embedding features and receiving embedding features corresponding to the M nodes, where the initiating embedding feature is used to represent the corresponding node as the initiator and the receiving embedding feature is used to represent the corresponding node as the receiver;
  • the steps of the z-th iteration include:
  • obtain N nodes from the M nodes, N being a positive integer less than M and z being a positive integer;
  • based on the directed edges associated with the N nodes, determine a plurality of first neighborhood nodes associated with the N nodes and a plurality of second neighborhood nodes associated with the N nodes, where the directed edges pointed out by the first neighborhood nodes point to the N nodes, and the directed edges pointed out by the N nodes point to the plurality of second neighborhood nodes;
  • determine fourth embedding features of the N nodes based on the first embedding features of the N nodes, the second embedding features of the plurality of first neighborhood nodes, and the third embedding features of the plurality of second neighborhood nodes, where the first, second, and third embedding features are obtained by performing the (z-1)-th round of iteration, and the fourth embedding features of the N nodes represent the initiating embedding features and receiving embedding features of the N nodes updated after the z-th round of iteration.
  • An embodiment of the present application provides a graph data processing device, and the device includes:
  • a graph acquisition module, configured to obtain a target graph based on object data, where the target graph includes M nodes representing objects, M is a positive integer, and a directed edge in the target graph is used to represent that there is a business relationship between the objects corresponding to the connected nodes; the node from which the directed edge points out is the initiator of the business relationship, and the node to which the directed edge points is the receiver of the business relationship;
  • an iteration module, configured to perform multiple rounds of iteration based on the M nodes and the directed edges between the M nodes to obtain the initiating embedding features and receiving embedding features corresponding to the M nodes, where the initiating embedding feature is used to represent the corresponding node as the initiator and the receiving embedding feature is used to represent the corresponding node as the receiver;
  • the iteration module is also configured as:
  • obtain N nodes from the M nodes, where N is a positive integer smaller than M;
  • based on the directed edges associated with the N nodes, determine a plurality of first neighborhood nodes associated with the N nodes and a plurality of second neighborhood nodes associated with the N nodes, where the directed edges pointed out by the first neighborhood nodes point to the N nodes, and the directed edges pointed out by the N nodes point to the plurality of second neighborhood nodes;
  • determine fourth embedding features of the N nodes based on the first embedding features of the N nodes, the second embedding features of the plurality of first neighborhood nodes, and the third embedding features of the plurality of second neighborhood nodes, where the first, second, and third embedding features are obtained by performing the (z-1)-th round of iteration, and the fourth embedding features of the N nodes represent the initiating embedding features and receiving embedding features of the N nodes updated after the z-th round of iteration.
  • An embodiment of the present application provides a computer device. The computer device includes a processor and a memory, the memory is used to store at least one piece of computer program, and the at least one piece of computer program is loaded and executed by the processor to implement the operations performed by the graph data processing method in the embodiments of the present application.
  • An embodiment of the present application provides a computer-readable storage medium. At least one piece of computer program is stored in the computer-readable storage medium, and the at least one piece of computer program is loaded and executed by a processor to implement the operations performed by the graph data processing method in the embodiments of the present application.
  • An embodiment of the present application provides a computer program product. The computer program product includes computer program code, and the computer program code is stored in a computer-readable storage medium. A processor of a computer device reads the computer program code from the computer-readable storage medium and executes the computer program code, so that the computer device executes the graph data processing method provided in the various implementation manners of the foregoing aspects.
  • In the embodiments of the present application, N nodes are obtained from the target graph in each round of iteration, where the N nodes are some of the nodes in the target graph. The fourth embedding features of the N nodes are then determined based on the first embedding features of the N nodes, the second embedding features of the plurality of first neighborhood nodes associated with the N nodes, and the third embedding features of the plurality of second neighborhood nodes, thereby completing one round of iteration and updating the embedding features of the N nodes. Since not all M nodes in the target graph are processed in each round, the amount of calculation can be greatly reduced.
  • After the multiple rounds of iteration, the initiating embedding features of the M nodes as initiators and the receiving embedding features of the M nodes as receivers are obtained, so that the embedding feature of each node preserves the information of the first neighborhood and the second neighborhood as much as possible, enabling the node to fully express the information in the graph data and ensuring the integrity of information during the conversion process from graph data to target graph.
  • FIG. 1 is a schematic diagram of an implementation environment of a graph data processing method provided according to an embodiment of the present application
  • FIG. 2 is a flowchart of a method for processing graph data provided according to an embodiment of the present application
  • Fig. 3 is a flow chart of another graph data processing method provided according to an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a directed graph provided according to an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a directed graph neighborhood definition provided according to an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a distributed computing architecture provided according to an embodiment of the present application.
  • FIG. 7 is a block diagram of a graph data processing device provided according to an embodiment of the present application.
  • Fig. 8 is a block diagram of another graph data processing device provided according to an embodiment of the present application.
  • FIG. 9 is a structural block diagram of a terminal provided according to an embodiment of the present application.
  • Fig. 10 is a schematic structural diagram of a server provided according to an embodiment of the present application.
  • The terms "first" and "second" in the embodiments of the present application are used to distinguish identical or similar items having substantially the same function. It should be understood that there is no logical or temporal dependency among "first", "second", and "nth", and that neither the quantity nor the execution order is limited.
  • the term "at least one” means one or more, and the meaning of “multiple” means two or more.
  • Graph: in mathematics, a graph is a structure describing a set of objects, some pairs of which are "related" in a certain sense. These objects correspond to mathematical abstractions called nodes (also called vertices or points), and each related pair of nodes is connected by an edge (also called a link or line).
  • the objects are user accounts, items, virtual resources, positions, and virtual live broadcast rooms, etc., which are not limited in the embodiments of the present application. If the object is a user account, different nodes in the graph represent different user accounts, and edges in the graph represent business relationships between user accounts, such as transferring money and applying for friends.
  • Directed graph: a graph in which all edges are directed edges.
  • A node in a directed graph has two different identities or roles. For example, if the directed edges pointed out by node 1 and node 2 point to node 3, then the identity of node 3 relative to node 1 and node 2 is the receiver (also called responder or target role); if the directed edges pointed out by node 3 point to node 4 and node 5, the identity of node 3 relative to node 4 and node 5 is the initiator (also called sender or source role).
  • Asymmetry of directed relationships: the edge relationship between a pair of nodes in a directed graph is asymmetric. If there is a directed edge from node 1 to node 3 in the directed graph but no directed edge from node 3 to node 1, the relationship between node 1 and node 3 is asymmetric.
  • Graph embedding: also known as graph representation learning, is the process of mapping graph data (usually a high-dimensional dense matrix) into low-dimensional dense vectors, which addresses the problem that graph data is difficult to input efficiently into machine learning algorithms.
  • Edge table: also known as an edge set table, one of the storage structures of a graph. The edge table is composed of header nodes and table nodes; each vertex in the graph corresponds to a header node stored in an array.
  • Out-degree The number of edges pointed out by a node is called the out-degree of the node.
  • In-degree The number of edges pointing to a node is called the in-degree of the node.
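  • As an illustrative sketch (not part of the patent text), the following Python snippet stores a small directed graph as an edge table keyed by header node and derives each node's out-degree and in-degree; the node labels and edge set are hypothetical.

```python
from collections import defaultdict

# Edge table sketch: each "header node" (a graph vertex kept in an array/dict)
# points to the list of nodes its directed edges point to; labels are hypothetical.
edge_table = defaultdict(list)
for src, dst in [("A", "C"), ("B", "C"), ("C", "E"), ("C", "F")]:
    edge_table[src].append(dst)

out_degree = {v: len(dsts) for v, dsts in edge_table.items()}
in_degree = defaultdict(int)
for dsts in edge_table.values():
    for dst in dsts:
        in_degree[dst] += 1

print("out-degree:", out_degree)       # e.g. C points out two edges
print("in-degree:", dict(in_degree))   # e.g. two edges point to C
```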
  • The graph data processing method provided in the embodiments of the present application can be applied to various scenarios, including but not limited to cloud technology, artificial intelligence, intelligent transportation, and assisted driving.
  • The following describes the implementation environment of the embodiments of the present application.
  • FIG. 1 is a schematic diagram of the implementation environment of a graph data processing method provided in the embodiment of the present application. Referring to FIG. 1 , the implementation environment includes a terminal 101 and a server 102 .
  • the terminal 101 and the server 102 can be connected directly or indirectly through wired or wireless communication, which is not limited in this embodiment of the present application.
  • the terminal 101 may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, a smart home appliance, a vehicle terminal, etc., but is not limited thereto.
  • Application programs are installed and run on the terminal 101 .
  • the application program may be a social program, a shopping program, a multimedia program, a game program, etc., which is not limited in this embodiment of the present application.
  • the terminal 101 may be a terminal used by a user, the terminal 101 runs a social program, and the social program is registered with the user account of the user.
  • the number of the aforementioned terminals 101 may be more or less. For example, there is only one terminal 101, or there are tens or hundreds of terminals 101, or more. The embodiment of the present application does not limit the number of terminals 101 and device types.
  • The server 102 can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN (Content Delivery Network), as well as big data and artificial intelligence platforms.
  • the server 102 is configured to provide background services for the above-mentioned application programs run by the terminal 101 .
  • In some embodiments, the server 102 undertakes the main computing work and the terminal 101 undertakes the secondary computing work; or the server 102 undertakes the secondary computing work and the terminal 101 undertakes the main computing work; or the server 102 and the terminal 101 use a distributed computing architecture for collaborative computing.
  • the application program installed on the terminal 101 is a shopping program as an example for description.
  • the shopping program installed on the terminal 101 is registered with a user account, and the server 102 uses the user account data and commodity data as object data, and builds an object graph based on the object data.
  • Nodes in the object graph represent user accounts or commodities, and a directed edge in the object graph represents a business relationship between a user account and a commodity.
  • For example, when a customer purchases a commodity, the directed edge between node 1 representing the customer's user account and node 2 representing the commodity represents the business relationship of purchasing; the directed edge points from node 1 to node 2, node 1 is the initiator of the purchase, and node 2 is the receiver of the purchase.
  • When a merchant uses a user account to list a commodity, the directed edge between node 3 representing the merchant's user account and node 2 representing the commodity represents the business relationship of listing; the directed edge points from node 3 to node 2, node 3 is the initiator of the listing, and node 2 is the receiver of the listing.
  • The server 102 can adopt the graph data processing method provided by the embodiments of the present application to obtain the embedding features of each node in the target graph, and then apply the embedding features to a commodity recommendation service, a social recommendation service, an advertisement promotion service, and the like, which is not limited in the embodiments of the present application. For example, in a shopping program, the pointing relationship of a directed edge can be determined by performing an inner product operation on the embedding features of the two nodes on the directed edge; based on the pointing relationships of multiple directed edges, the historical shopping information of a user account can be determined, and commodities can then be recommended based on the historical shopping information, which improves recommendation accuracy. A minimal sketch of this edge-direction scoring is shown after this paragraph.
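  • The following sketch illustrates the inner-product scoring described above, assuming the initiating (source) and receiving (target) embedding features have already been learned; the function name edge_direction_score and the embedding values are illustrative assumptions, not part of the patent.

```python
import numpy as np

def edge_direction_score(s_x: np.ndarray, t_y: np.ndarray) -> float:
    """Score how plausible it is that a directed edge points from node x to node y.

    s_x: initiating (source-role) embedding of node x
    t_y: receiving (target-role) embedding of node y
    A higher sigmoid(s_x . t_y) suggests x -> y is the more plausible direction.
    """
    return float(1.0 / (1.0 + np.exp(-np.dot(s_x, t_y))))

# Hypothetical learned embeddings for a customer node (1) and a commodity node (2).
s_1, t_1 = np.array([0.8, -0.1, 0.3]), np.array([0.1, 0.0, -0.2])
s_2, t_2 = np.array([-0.2, 0.1, 0.0]), np.array([0.7, -0.3, 0.4])

print("score 1 -> 2:", edge_direction_score(s_1, t_2))
print("score 2 -> 1:", edge_direction_score(s_2, t_1))
```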
  • a computer device is used as a terminal, and the terminal is installed with a multimedia program as an example.
  • the terminal acquires the history record data of the user listening to the song, and the history record data includes objects such as singer name, song name and song type.
  • the terminal uses the historical record data as the object data to build a target graph.
  • the nodes in the target graph represent singer names, song titles, and song types, etc.
  • The directed edges in the target graph represent business relationships between objects. For example, if a singer sings a certain song, the directed edge between node A representing the singer and node B representing the song points from node A to node B; if a certain song belongs to a certain genre, the directed edge between node B representing the song and node C representing the genre points from node B to node C.
  • the terminal 101 can use the graph data processing method provided by the embodiment of the present application to obtain the embedded features of each node in the target graph. Then, the embedded feature is applied to item recommendation service, social recommendation service, advertisement promotion service, etc., which is not limited in this embodiment of the present application.
  • a computer device is used as a terminal, and the terminal is installed with a social network APP as an example for illustration.
  • After being authorized by the user, the terminal acquires the user's interaction history data, and the interaction history data includes the identifiers of other users with whom the user has interacted, such as by forwarding, commenting, and liking.
  • the terminal uses the interaction history data as the object data to build a target graph.
  • The nodes of the target graph represent different users of the social network, and the directed edges represent social relationships. For example, if user 1 likes user 2, the directed edge between user 1 and user 2 points from node A representing user 1 to node B representing user 2.
  • The terminal 101 can use the graph data processing method provided by the embodiments of this application to determine the embedding features of each node in the target graph, and then determine the similarity between different users based on the embedding features of the different nodes, so that the user's potential friends can be predicted and recommended based on the similarity between different users, improving recommendation accuracy. A small similarity sketch follows.
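  • As an illustrative assumption (the patent does not fix a similarity measure), one way to compare users is cosine similarity over the concatenation of each node's initiating and receiving embedding features:

```python
import numpy as np

def node_vector(s: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Concatenate the initiating (source) and receiving (target) embeddings of a node."""
    return np.concatenate([s, t])

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Hypothetical embeddings for two users.
user1 = node_vector(np.array([0.5, 0.1]), np.array([-0.2, 0.4]))
user2 = node_vector(np.array([0.4, 0.0]), np.array([-0.1, 0.5]))
print("similarity:", cosine_similarity(user1, user2))
```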
  • Fig. 2 is a flowchart of a graph data processing method provided according to an embodiment of the present application. As shown in Fig. 2 , in the embodiment of the present application, execution by a server is taken as an example for description.
  • the graph data processing method includes the following steps:
  • The target graph includes M nodes representing M objects, M is a positive integer greater than 1, and a directed edge in the target graph is used to represent that there is a business relationship between the objects corresponding to the connected nodes; the node from which the directed edge points out is the initiator of the business relationship, and the node to which the directed edge points is the receiver of the business relationship.
  • the objects in the object data include user accounts, items, multimedia resources, virtual resources, virtual services, and virtual live broadcast rooms, etc.
  • the object data includes relationship information between objects
  • The relationship information includes the business relationships and associations among the objects.
  • the server can build different target graphs, such as social relationship graphs, shopping graphs, and virtual live broadcast graphs.
  • the object data is the data collected and used after full authorization.
  • A directed edge in the target graph can represent a business relationship between objects. If the objects corresponding to the two nodes connected by a directed edge are two user accounts, the business relationship represented by the directed edge can be a transfer relationship, a follow relationship, a like relationship, and so on. For any directed edge, the node from which the directed edge points out is the initiator of the business relationship represented by the directed edge, and the node to which the directed edge points is the receiver of the business relationship represented by the directed edge.
  • each object corresponds to a node.
  • the receiver embedding feature is used to represent the corresponding node as the feature of the receiver.
  • the server can determine the initiation embedding feature of each node as an initiator and the receiving embedding feature of each node as a receiver among the M nodes.
  • For any node, the server can obtain the first neighborhood nodes for which the node is the receiver and the second neighborhood nodes for which the node is the initiator, and then use gradient descent to gradually determine the initiating embedding feature and receiving embedding feature of the node.
  • the steps of the z-th iteration include: the server obtains N nodes from the above M nodes, where N is a positive integer. Then the server determines multiple first neighborhood nodes associated with the N nodes and multiple second neighborhood nodes associated with the N nodes based on the directed edges associated with the N nodes. Wherein, the directed edge pointed out by the first neighboring node points to the N nodes, and the directed edges pointed out by the N nodes point to the plurality of second neighboring nodes.
  • the server determines the fourth embedded features of the N nodes based on the first embedded features of the above-mentioned N nodes, the second embedded features of a plurality of first neighborhood nodes, and the third embedded features of a plurality of second neighborhood nodes, Complete this iteration.
  • the fourth embedding features of the N nodes represent the initiating embedding features and receiving embedding features of the N nodes after the z-th round of iterative update.
  • After the server performs the (z-1)-th round of iteration, the first embedding features of the N nodes represent the initiating embedding features and receiving embedding features of the N nodes obtained after the (z-1)-th round of iterative update.
  • The embodiments of the present application provide a graph data processing method in which N nodes are obtained from the target graph in each round of iteration, and the fourth embedding features of the N nodes are then determined based on the first embedding features of the N nodes, the second embedding features of the plurality of first neighborhood nodes associated with the N nodes, and the third embedding features of the plurality of second neighborhood nodes, thereby completing one round of iteration and updating the embedding features of the N nodes. Since not all M nodes are processed in each round, the amount of calculation can be greatly reduced.
  • After the iterations, the initiating embedding feature of each of the M nodes as an initiator and the receiving embedding feature of each node as a receiver can be obtained, so that the embedding features of each node preserve the information of the first neighborhood and the second neighborhood as much as possible, enabling nodes to fully express the information in the graph data and eliminating information loss during the data conversion process.
  • In addition, by performing an inner product operation on the embedding features of the two nodes on a directed edge, the pointing relationship of the directed edge can be determined, so that the method can express not only the dual identity of a node but also the pointing relationship of a directed edge.
  • FIG. 2 exemplarily shows the implementation flow of the graph data processing method provided by the embodiment of the present application.
  • the graph data processing method is introduced below based on an application scenario.
  • Fig. 3 is a flowchart of another graph data processing method provided according to an embodiment of the present application. As shown in Fig. 3 , in the embodiment of the present application, it is performed by a server and applied to a social scene as an example for illustration.
  • the graph data processing method includes the following steps:
  • The target graph includes M nodes representing M objects, M is a positive integer greater than 1, and a directed edge in the target graph is used to represent that there is a business relationship between the objects corresponding to the connected nodes; the node from which the directed edge points out is the initiator of the business relationship, and the node to which the directed edge points is the receiver of the business relationship.
  • the object data includes object information and relationship information
  • the object information is used to represent the attribute of the object
  • the relationship information is used to represent the business relationship between the objects.
  • the object data is user account data
  • the user account data includes user account information and account relationship information.
  • the user account information includes attributes of the user account such as a user account identifier and a user account level.
  • the account relationship information includes business relationships between the user account and other user accounts, etc., and this embodiment of the present application does not limit the content included in the object data.
  • the server can construct a target graph based on the above object data, and the target graph includes M nodes and directed edges between the M nodes.
  • the M nodes represent M different user accounts included in the object data, and M is a positive integer greater than 1.
  • a directed edge between any two nodes is used to indicate that there is a business relationship between user accounts corresponding to the two nodes, such as a friend relationship, a group relationship, a collection relationship, and a transfer relationship.
  • the node pointing out the directed edge is the initiator of the business relationship represented by the directed edge
  • the node pointed to by the directed edge is the receiver of the business relationship represented by the directed edge.
  • FIG. 4 is a schematic diagram of a directed graph provided according to an embodiment of the present application.
  • the directed graph exemplarily shows six nodes representing user accounts, A, B, C, D, E, and F.
  • node C points out two directed edges, and there are two directed edges pointing to node C.
  • Relative to node A and node B, node C acts as the receiver of the business relationship, which can also be called the target role; relative to node E and node F, node C acts as the initiator of the business relationship, which can also be called the source role.
  • For example, the user accounts represented by node A and node B each initiate a transfer service to the user account represented by node C; node A and node B are the initiators of the transfer service, and node C is the receiver of the transfer service.
  • For example, the user account represented by node C creates a group that includes the user account represented by node E; node C is the initiator of the group service, and node E is the receiver of the group service.
  • For example, the user account represented by node C adds the user account represented by node F to its favorites (collection); node C is the initiator of the collection service, and node F is the receiver of the collection service.
  • the server can also send the above object data to a graph creation server, and the graph creation server creates a target graph based on the above object data.
  • the server can also acquire the target graph created based on the above object data from the local database.
  • the embodiment of the present application does not limit the source of the target graph.
  • In some embodiments, the server can perform multiple rounds of iteration based on the M nodes and the directed edges between the M nodes to determine the feature of each node as the initiator of a business relationship and the feature of each node as the receiver of a business relationship.
  • the server can process the M nodes separately, and can also use a distributed processing method to process part of the M nodes based on multiple executors, so as to improve processing efficiency.
  • the server can divide the target graph into at least one graph partition, each graph partition including a plurality of nodes and directed edges between the plurality of nodes. For any graph partition, the server stores the nodes in the graph partition in the parameter server, and stores the edge table corresponding to the graph partition in the executor corresponding to the graph partition.
  • the server can realize distributed computing based on multiple executors, and significantly improve the efficiency of computing graph embedding.
  • Multiple executors are set in the server, and the multiple executors can execute graph data processing tasks synchronously.
  • the executor can determine the embedded features of each node in the graph partition by means of small-batch stochastic gradient descent, and the executor can obtain N nodes from the corresponding graph partition in each round of iteration, N is a positive integer.
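  • A simplified, single-process sketch of the partition and parameter-server arrangement described above is shown below; the round-robin partitioning rule, batch size, and variable names are assumptions made for illustration only.

```python
import numpy as np
from collections import defaultdict

# Hypothetical setup: a parameter "server" holds node embeddings, and each
# executor holds the edge table of one graph partition.
rng = np.random.default_rng(0)
edges = [(0, 2), (1, 2), (2, 3), (2, 4), (4, 5), (5, 0)]
num_executors, d = 2, 4

# Parameter server: initiating (s) and receiving (t) embeddings per node.
params = {v: {"s": rng.normal(scale=0.1, size=d), "t": rng.normal(scale=0.1, size=d)}
          for v in {u for e in edges for u in e}}

# Split the edge table across executors (a simple round-robin partition).
partitions = defaultdict(list)
for i, e in enumerate(edges):
    partitions[i % num_executors].append(e)

def executor_round(partition, batch_size=2):
    """One mini-batch round on an executor: sample edges, then pull the
    embeddings of the involved nodes from the parameter server."""
    idx = rng.choice(len(partition), size=min(batch_size, len(partition)), replace=False)
    batch = [partition[i] for i in idx]
    nodes = sorted({u for e in batch for u in e})
    pulled = {v: params[v] for v in nodes}
    return batch, pulled

for ex_id, part in partitions.items():
    batch, pulled = executor_round(part)
    print("executor", ex_id, "sampled edges", batch, "pulled nodes", list(pulled))
```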
  • Steps 302 to 307 exemplarily show an implementation of the z-th round of iteration.
  • In step 302, the server obtains N nodes from the M nodes, where N is a positive integer smaller than M.
  • the server can sample the above M nodes to obtain N nodes, and the N nodes are the nodes to be processed in the z-th iteration.
  • each executor obtains N nodes for processing.
  • the server adopts the method of Alias Table (alias table) to sample multiple sampling edges from the target graph.
  • The multiple sampling edges can be randomly sampled from the target graph, or can be sampled according to certain rules, for example, one sampling edge is selected at every preset interval.
  • The nodes connected by the plurality of sampling edges are the above N nodes.
  • the server can determine N node identifiers based on the plurality of sampled edges, and then obtain N nodes corresponding to the N node identifiers from the parameter server.
  • the server can also determine sampling edges in other ways, or directly sample nodes, which is not limited in this embodiment of the present application.
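  • The following is a minimal sketch of alias-table edge sampling as mentioned above; the uniform edge weights and node labels are assumptions, and real implementations may weight edges differently.

```python
import random

def build_alias_table(weights):
    """Build an alias table for O(1) sampling proportional to the given edge weights."""
    n = len(weights)
    total = float(sum(weights))
    prob = [w * n / total for w in weights]
    alias = [0] * n
    small = [i for i, p in enumerate(prob) if p < 1.0]
    large = [i for i, p in enumerate(prob) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        alias[s] = l
        prob[l] -= (1.0 - prob[s])
        (small if prob[l] < 1.0 else large).append(l)
    for i in large + small:   # leftover entries sample themselves
        prob[i] = 1.0
    return prob, alias

def alias_sample(prob, alias, rng=random):
    """Draw one edge index in O(1) time."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]

# Hypothetical edge list with uniform weights; every sampled edge contributes its endpoints.
edges = [("A", "C"), ("B", "C"), ("C", "E"), ("C", "F"), ("E", "D")]
prob, alias = build_alias_table([1.0] * len(edges))
sampled_edges = {edges[alias_sample(prob, alias)] for _ in range(3)}
nodes_to_process = sorted({v for e in sampled_edges for v in e})
print(sampled_edges, nodes_to_process)
```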
  • The server can model the neighborhood of a node and divide the neighborhood of the node into a first neighborhood and a second neighborhood according to the dual identity of the node; it then acquires at least one first neighborhood node associated with the node from the first neighborhood, and acquires at least one second neighborhood node associated with the node from the second neighborhood.
  • The directed edge between a first neighborhood node in the first neighborhood and this node points to this node; the directed edge between a second neighborhood node in the second neighborhood and this node is pointed out by this node.
  • the neighborhood of nodes is divided into 1-order to R-order (also called 1-hop neighbor to R-hop neighbor), and R is a positive integer.
  • the first neighborhood may also be called a source neighborhood
  • the second neighborhood may also be called a target neighborhood.
  • For example, if the business relationship is a transfer, the user account represented by each of the at least one first neighborhood node associated with the node transfers money to the user account represented by the node, and the user accounts represented by the second neighborhood nodes all receive transfers from the user account represented by the node.
  • N_s(v_n) represents the first neighborhood of node v_n within order R; N_s^1(v_n) represents the first-order first neighborhood of node v_n; N_s^2(v_n) represents the second-order first neighborhood of node v_n; N_s^R(v_n) represents the R-order first neighborhood of node v_n, where R is a positive integer.
  • N_t(v_n) represents the second neighborhood of node v_n within order R; N_t^1(v_n) represents the first-order second neighborhood of node v_n; N_t^2(v_n) represents the second-order second neighborhood of node v_n; N_t^R(v_n) represents the R-order second neighborhood of node v_n, where R is a positive integer.
  • FIG. 5 is a schematic diagram of a directed graph neighborhood definition provided according to an embodiment of the present application.
  • the first-order first neighborhood of node A includes node B, node C and node D
  • the second-order first neighborhood of node A includes node H and node I
  • the first-order second neighborhood of node A Including node E, node F and node G
  • the second-order second neighborhood of node A includes node J and node K.
  • The directed edge between any node in the first-order first neighborhood and node A points to node A, and the directed edge connecting any node in the second-order first neighborhood points to a node in the first-order first neighborhood; the directed edge between node A and any node in the first-order second neighborhood is pointed out by node A, and the directed edge connecting any node in the second-order second neighborhood is pointed out by a node in the first-order second neighborhood.
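  • A small sketch of how the first (in-neighbor) and second (out-neighbor) neighborhoods up to order R could be collected from an edge list; the edge set loosely follows the Figure 5 description and the function names are assumptions.

```python
from collections import defaultdict

def build_neighborhoods(edges):
    """Return, for each node, its first-order first neighborhood (in-neighbors,
    whose edges point to the node) and first-order second neighborhood
    (out-neighbors, to whom the node's edges point)."""
    in_nbrs, out_nbrs = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        out_nbrs[src].add(dst)
        in_nbrs[dst].add(src)
    return in_nbrs, out_nbrs

def r_order_neighborhood(start, nbrs, R):
    """Collect the r-order neighborhoods up to R hops by repeatedly following `nbrs`
    (use in-neighbors for the first neighborhood, out-neighbors for the second)."""
    frontier, seen = {start}, {start}
    result = []
    for _ in range(R):
        frontier = {n for f in frontier for n in nbrs[f]} - seen
        result.append(frontier)
        seen |= frontier
    return result  # result[r-1] is the r-order neighborhood

# Hypothetical edges loosely following the Figure 5 layout.
edges = [("B", "A"), ("C", "A"), ("D", "A"), ("H", "B"), ("I", "C"),
         ("A", "E"), ("A", "F"), ("A", "G"), ("E", "J"), ("F", "K")]
in_nbrs, out_nbrs = build_neighborhoods(edges)
print("first neighborhoods of A:", r_order_neighborhood("A", in_nbrs, R=2))
print("second neighborhoods of A:", r_order_neighborhood("A", out_nbrs, R=2))
```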
  • the embodiment of the present application provides two implementation methods.
  • In the first implementation, the embedding features of a node are determined based on the probability information of the node determining the first neighborhood nodes and the second neighborhood nodes, as shown in steps 304 to 307.
  • In the second implementation, the embedding features of a node are determined by performing negative sampling on the first neighborhood and the second neighborhood of the node, as shown in steps 308 to 310.
  • Based on the first embedding features of the N nodes and the second embedding features of the plurality of first neighborhood nodes, first probability information is determined; the first probability information is used to indicate the probability of determining the plurality of first neighborhood nodes based on the N nodes.
  • For a node v_n, in the embedding space, node v_n corresponds to an initiating embedding feature s_n (also called the source role representation vector) and a receiving embedding feature t_n (also called the target role representation vector); the ultimate goal is to determine the initiating embedding matrix S′ composed of the initiating embedding features of the M nodes in the target graph and the receiving embedding matrix T′ composed of the receiving embedding features of the M nodes.
  • the elements in the initiating embedding matrix S' represent the initiating embedding characteristics of each node in the target graph
  • the elements in the receiving embedding matrix T' represent the receiving embedding characteristics of each node in the target graph.
  • The first embedding features refer to the embedding features of the N nodes. For any node among the N nodes, the first embedding feature of the node includes the initiating embedding feature of the node and the receiving embedding feature of the node.
  • each node in the N nodes corresponds to at least one first neighbor node.
  • The second embedding features refer to the embedding features of the plurality of first neighborhood nodes. For any first neighborhood node, the second embedding feature of the first neighborhood node includes the initiating embedding feature of the first neighborhood node and the receiving embedding feature of the first neighborhood node.
  • For a node v_n, when the first neighborhood node v_i of the node is determined based on node v_n, node v_n is the receiver of the business relationship and the first neighborhood node v_i is the initiator of the business relationship. The probability of node v_n determining the first neighborhood node v_i in the entire target graph is defined based on the softmax (normalization) function, and the first probability information is obtained.
  • The step in which the server determines the first probability information based on the first embedding features of the N nodes and the second embedding features of the plurality of first neighborhood nodes includes: the server obtains N first intermediate embedded features from the first embedding features of the N nodes; the N first intermediate embedded features are used to represent the N nodes as receivers, that is, the receiving embedding features of the N nodes.
  • the server obtains a plurality of second intermediate embedded features from the second embedded features of the plurality of first neighborhood nodes, and the plurality of second intermediate embedded features are used to indicate that the plurality of first neighborhood nodes are the initiator Features, that is, the initial embedding features of the plurality of first neighborhood nodes.
  • the server obtains M seventh intermediate embedded features of the M nodes, and the M seventh intermediate embedded features are used to indicate that the M nodes are the initiators, that is, the originating embedded features of the node.
  • the first probability information is determined based on the N first intermediate embedded features, the plurality of second intermediate embedded features and the M seventh intermediate embedded features.
  • The calculation of the first probability information is shown in formula (3):
  • p(v_i | v_n) = exp(t_n^T s_i) / Σ_{m=1}^{M} exp(t_n^T s_m)    (3)
  • where p(v_i | v_n) represents the following probability in the first probability information: the probability of determining the first neighborhood node v_i in the entire target graph based on node v_n; exp(·) represents the exponential function with base e; t_n represents the first intermediate embedded feature of node v_n; T represents the matrix transpose; s_i represents the second intermediate embedded feature of the first neighborhood node v_i; M represents the total number of nodes in the target graph, and M is a positive integer; s_m represents the seventh intermediate embedded feature of node v_m, where node v_m is the m-th node in the target graph.
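  • A minimal sketch of evaluating formula (3) with NumPy, assuming the receiving embedding t_n and the matrix S of initiating embeddings are already available; the shapes and values are hypothetical.

```python
import numpy as np

def first_probability(t_n: np.ndarray, S: np.ndarray, i: int) -> float:
    """p(v_i | v_n) = exp(t_n . s_i) / sum_m exp(t_n . s_m), as in formula (3).

    t_n: receiving embedding of node v_n (first intermediate embedded feature)
    S:   matrix whose m-th row is the initiating embedding s_m of node v_m
    i:   index of the first neighborhood node v_i
    """
    logits = S @ t_n
    logits -= logits.max()              # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(probs[i])

# Hypothetical embeddings: M = 4 nodes, dimension 3.
rng = np.random.default_rng(0)
S = rng.normal(size=(4, 3))   # initiating embeddings s_m
t_n = rng.normal(size=3)      # receiving embedding of v_n
print(first_probability(t_n, S, i=2))
```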
  • In contrast to step 304, for a node v_n, when the second neighborhood node v_j of the node is determined based on node v_n, node v_n is the initiator of the business relationship and the second neighborhood node v_j is the receiver of the business relationship. The probability of node v_n determining the second neighborhood node v_j in the entire target graph is then defined based on the softmax function, and the second probability information is obtained.
  • each node among the N nodes corresponds to at least one second neighbor node.
  • the third embedding feature refers to the embedding feature of the multiple second neighborhood nodes, and for any second neighborhood node, the third embedding feature of the second neighborhood node includes the initiation embedding feature of the second neighborhood node and the Received embedded features of the second neighborhood node.
  • The step in which the server determines the second probability information based on the first embedding features of the N nodes and the third embedding features of the plurality of second neighborhood nodes includes: the server obtains N fourth intermediate embedded features from the first embedding features of the N nodes, where the N fourth intermediate embedded features are used to represent the N nodes as initiators, that is, the initiating embedding features of the N nodes.
  • the server obtains a plurality of fifth intermediate embedded features from the third embedded features of the plurality of second neighborhood nodes, and the plurality of fifth intermediate embedded features are used to indicate that the plurality of second neighborhood nodes are the recipients feature, that is, the received embedding feature of the plurality of second neighborhood nodes.
  • the server obtains M eighth intermediate embedded features of the M nodes, and the M eighth intermediate embedded features are used to indicate that the M nodes are the receiver features, that is, the receiving embedded features of the M nodes.
  • the second probability information is determined based on the N fourth intermediate embedded features, the plurality of fifth intermediate embedded features and the M eighth intermediate embedded features.
  • The calculation of the second probability information is shown in formula (4):
  • p(v_j | v_n) = exp(s_n^T t_j) / Σ_{m=1}^{M} exp(s_n^T t_m)    (4)
  • where p(v_j | v_n) represents the following probability in the second probability information: the probability of determining the second neighborhood node v_j in the entire target graph based on node v_n; exp(·) represents the exponential function with base e; s_n represents the fourth intermediate embedded feature of node v_n; T represents the matrix transpose; t_j represents the fifth intermediate embedded feature of the second neighborhood node v_j; M represents the total number of nodes in the target graph, and M is a positive integer; t_m represents the eighth intermediate embedded feature of node v_m, where node v_m is the m-th node in the target graph.
  • In some embodiments, the idea of maximum likelihood estimation is used to construct the objective function. For a given node v_n, the goal of the multiple rounds of iteration is to maximize the likelihood function of determining the neighborhood nodes of node v_n based on node v_n, where the neighborhood nodes include the first neighborhood nodes and the second neighborhood nodes.
  • the likelihood function is shown in formula (5):
  • L = Π_{v_n} ( Π_{v_i ∈ N_s(v_n)} p(v_i | v_n) · Π_{v_j ∈ N_t(v_n)} p(v_j | v_n) )    (5)
  • where N_s(v_n) represents the first neighborhood of node v_n within order R; N_t(v_n) represents the second neighborhood of node v_n within order R; Π(·) represents the product over the corresponding set; p(v_i | v_n) represents the following probability in the first probability information: the probability of determining the first neighborhood node v_i in the entire target graph based on node v_n; p(v_j | v_n) represents the following probability in the second probability information: the probability of determining the second neighborhood node v_j in the entire target graph based on node v_n.
  • In some embodiments, the server can convert the maximization problem in the above formula (5) into a minimization problem based on the negative log-likelihood (Negative Log Likelihood), thereby obtaining a second objective function, as shown in formula (6):
  • L_likelihood = - Σ_{v_n} [ Σ_{v_i ∈ N_s(v_n)} log( exp(t_n^T s_i) / Σ_{m=1}^{M} exp(t_n^T s_m) ) + Σ_{v_j ∈ N_t(v_n)} log( exp(s_n^T t_j) / Σ_{m=1}^{M} exp(s_n^T t_m) ) ]    (6)
  • where L_likelihood represents the negative log-likelihood function; N_s(v_n) represents the first neighborhood of node v_n within order R; N_t(v_n) represents the second neighborhood of node v_n within order R; v_i represents the first neighborhood node; v_j represents the second neighborhood node; log(·) represents the logarithmic function; exp(·) represents the exponential function with base e; t_n represents the first intermediate embedded feature of node v_n; T represents the matrix transpose; s_i represents the second intermediate embedded feature of the first neighborhood node v_i; M represents the total number of nodes in the target graph, and M is a positive integer; s_m represents the seventh intermediate embedded feature of node v_m, where node v_m is the m-th node in the target graph; s_n represents the fourth intermediate embedded feature of node v_n; t_j represents the fifth intermediate embedded feature of the second neighborhood node v_j; t_m represents the eighth intermediate embedded feature of node v_m.
  • the server can iterate based on the second objective function to obtain the fourth embedded features of the N nodes.
  • The fourth embedding features represent the iteratively updated initiating embedding features and receiving embedding features of the N nodes. Since the optimization goal of the second objective function is minimization, the server performs multiple rounds of iteration based on the stochastic gradient descent method to obtain the initiating embedding features and receiving embedding features of the N nodes.
  • the server can implement the optimization by using negative sampling contrastive learning.
  • the server can convert the problem of optimizing the second objective function into a problem of optimizing the first objective function. For the details of the first objective function, see steps 308 to 310 .
  • In some embodiments, the first neighborhood and the second neighborhood of node v_n have different negative-sample distributions; the first neighborhood includes the first neighborhood node v_i of node v_n, and the second neighborhood includes the second neighborhood node v_j of node v_n.
  • When negative sampling is performed for the first neighborhood of node v_n, attention is paid to the directed edges pointing to node v_n, so the node in-degree distribution information is obtained, which represents the distribution of the number of directed edges pointing to the current node v_n.
  • Based on the first embedding features of the N nodes, the second embedding features of the plurality of first neighborhood nodes, the node in-degree distribution information, and the fifth embedding features of a plurality of first nodes, a first function is determined. The node in-degree distribution information is used to represent the distribution of the node in-degrees of the N nodes, and the plurality of first nodes are obtained by negatively sampling the first neighborhood of the N nodes, where the first neighborhood includes the plurality of first neighborhood nodes.
  • The server can obtain, from the parameter server, the first embedding features of the N nodes after the (z-1)-th round of iteration, the second embedding features of the plurality of first neighborhood nodes, and the fifth embedding features of the plurality of first nodes, and then determine the first function based on the above acquired features.
  • For a node v_n in the graph, in the embedding space, node v_n corresponds to an initiating embedding feature s_n and a receiving embedding feature t_n; the ultimate goal is to determine the initiating embedding matrix S′ composed of the initiating embedding features of the M nodes in the target graph and the receiving embedding matrix T′ composed of the receiving embedding features of the M nodes.
  • the elements in the initiating embedding matrix S' represent the initiating embedding characteristics of each node in the target graph
  • the elements in the receiving embedding matrix T' represent the receiving embedding characteristics of each node in the target graph.
  • The fifth embedding features refer to the embedding features of the plurality of first nodes. For any first node, the fifth embedding feature of the first node includes the initiating embedding feature of the first node and the receiving embedding feature of the first node.
  • As described above, for a node v_n, when the first neighborhood node v_i is determined based on node v_n, node v_n is the receiver of the business relationship and the first neighborhood node v_i is the initiator of the business relationship. For example, if the business relationship is a transfer, the user account represented by the first neighborhood node v_i transfers money to the user account represented by node v_n. Based on the node in-degree distribution information obtained for negative sampling, the first function can be determined.
  • The steps in which the server determines the first function include: the server obtains N first intermediate embedded features from the first embedding features of the N nodes; the N first intermediate embedded features are used to represent the N nodes as receivers, that is, the receiving embedding features of the N nodes.
  • the server obtains a plurality of second intermediate embedded features from the second embedded features of the plurality of first neighborhood nodes, and the plurality of second intermediate embedded features are used to indicate that the plurality of first neighborhood nodes are the initiator Features, that is, the initial embedding features of the plurality of first neighborhood nodes.
  • the server determines first expectation information based on the node in-degree distribution information, where the first expectation information is used to indicate that the plurality of first nodes satisfy the mathematical expectation of the node in-degree distribution information.
  • the server obtains a plurality of third intermediate embedded features from the fifth embedded features of the plurality of first nodes, and the plurality of third intermediate embedded features are used to indicate that the plurality of first nodes are the characteristics of the initiator, that is, the Initiating embedded features of multiple first nodes.
  • the server determines the first function based on the above N first intermediate embedded features, multiple second intermediate embedded features, first expected information, and multiple third intermediate embedded features.
  • In this way, the computational complexity can be significantly reduced, thereby improving calculation efficiency and saving computing resources.
  • L_1 = - Σ_{v_n} Σ_{v_i ∈ N_s(v_n)} [ log σ(t_n^T s_i) + Σ_{k=1}^{K} E_{v_k ~ P_in(v)} log σ(-t_n^T s_k) ]
  • where L_1 represents the first function; N_s(v_n) represents the first neighborhood of node v_n within order R; v_i represents the first neighborhood node; log(·) represents the logarithmic function; σ(·) represents the sigmoid function; t_n represents the first intermediate embedded feature of node v_n; T represents the matrix transpose; s_i represents the second intermediate embedded feature of the first neighborhood node v_i; K represents the number of first nodes obtained by negatively sampling the first neighborhood, and K is a positive integer; E_{v_k ~ P_in(v)} represents the mathematical expectation over the first nodes v_k drawn according to the node in-degree distribution, that is, the first expectation information; v_k represents the k-th first node; s_k represents the third intermediate embedded feature of the first node v_k.
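  • A sketch of the per-pair term of the first function with K negative samples is shown below; sampling proportionally to raw in-degree is an assumption, since the exact form of the in-degree distribution is not spelled out here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def l1_term(t_n, s_i, S_neg):
    """Negative-sampling term for one pair (v_n, v_i) of the first function.

    t_n:   receiving embedding of node v_n
    s_i:   initiating embedding of the first neighborhood node v_i
    S_neg: K x d matrix of initiating embeddings s_k of the K negative-sampled
           first nodes v_k, drawn from the node in-degree distribution.
    Returns -[log sigma(t_n.s_i) + sum_k log sigma(-t_n.s_k)], to be minimized.
    """
    pos = np.log(sigmoid(t_n @ s_i))
    neg = np.log(sigmoid(-(S_neg @ t_n))).sum()
    return -(pos + neg)

def sample_negatives(in_degrees, K, rng):
    """Draw K node indices proportionally to their in-degree (assumed form of P_in)."""
    p = np.asarray(in_degrees, dtype=float)
    p = p / p.sum()
    return rng.choice(len(p), size=K, p=p)

rng = np.random.default_rng(0)
d, M, K = 3, 6, 2
S = rng.normal(scale=0.1, size=(M, d))        # initiating embeddings
t_n = rng.normal(scale=0.1, size=d)           # receiving embedding of v_n
neg_idx = sample_negatives([1, 2, 0, 3, 1, 1], K, rng)
print(l1_term(t_n, S[0], S[neg_idx]))
```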
  • Based on the first embedding features of the N nodes, the third embedding features of the plurality of second neighborhood nodes, node out-degree distribution information, and the sixth embedding features of a plurality of second nodes, a second function is determined. The node out-degree distribution information is used to represent the distribution of the node out-degrees of the N nodes, and the plurality of second nodes are obtained by negatively sampling the second neighborhood of the N nodes, where the second neighborhood includes the plurality of second neighborhood nodes.
  • In contrast to step 308, for a node v_n, when the second neighborhood node v_j of the node is determined based on node v_n, node v_n is the initiator of the business relationship and the second neighborhood node v_j is the receiver of the business relationship; if the business relationship is a transfer, the user account represented by the second neighborhood node v_j has received a transfer from the user account represented by node v_n. Based on the node out-degree distribution information obtained for negative sampling, the second function can be determined.
  • The sixth embedding features refer to the embedding features of the plurality of second nodes. For any second node, the sixth embedding feature of the second node includes the initiating embedding feature of the second node and the receiving embedding feature of the second node.
  • The steps in which the server determines the second function based on the first embedding features of the N nodes, the third embedding features of the plurality of second neighborhood nodes, the node out-degree distribution information, and the sixth embedding features of the plurality of second nodes include: the server obtains N fourth intermediate embedded features from the first embedding features of the N nodes; the N fourth intermediate embedded features are used to represent the N nodes as initiators, that is, the initiating embedding features of the N nodes.
  • The server obtains a plurality of fifth intermediate embedded features from the third embedding features of the plurality of second neighborhood nodes; the plurality of fifth intermediate embedded features are used to represent the plurality of second neighborhood nodes as receivers, that is, the receiving embedding features of the plurality of second neighborhood nodes.
  • the server determines second expectation information based on the node out-degree distribution information, where the second expectation information is used to indicate that the plurality of second nodes satisfy the mathematical expectation of the node out-degree distribution information.
  • The server obtains a plurality of sixth intermediate embedded features from the embedding features of the plurality of second nodes; the plurality of sixth intermediate embedded features are used to represent the plurality of second nodes as receivers, that is, the receiving embedding features of the plurality of second nodes.
  • the server determines the second function based on the above N fourth intermediate embedded features, multiple fifth intermediate embedded features, second expected information, and multiple sixth intermediate embedded features.
  • L 2 represents the second function
  • N t (v n ) represents the second neighborhood in the R order of node v n
  • v j represents the second neighborhood node
  • log( ⁇ ) represents the logarithmic function
  • ⁇ ( ⁇ ) represents the sigmoid function
  • s n represents the fourth intermediate embedded feature of node v n
  • T represents the matrix transpose
  • t j represents the fifth intermediate embedded feature of the second neighborhood node v j
  • L represents the number of second nodes obtained by negative sampling of the second neighborhood, and L is a positive integer
  • v l represents the lth second node
  • t l indicates the sixth intermediate embedded feature of the second node v l .
  • the server can construct the first objective function based on the first function and the second function, and the first objective function is shown in formula (11):
  • L neg represents the first objective function
  • L 1 represents the first function
  • L 2 represents the second function
  • the server uses a mini-batch stochastic gradient descent method to optimize the first objective function over multiple rounds of iteration, obtaining the fourth embedded features of the above N nodes.
  • the desired effect can be achieved on highly sparse graph data while preserving the first-order neighborhood.
  • when the first-order first neighborhood and the first-order second neighborhood of each node are preserved, summing over the neighborhoods of the nodes in the first objective function is equivalent to summing over all the edges in the graph, so the server can transform the first objective function into a third objective function.
  • the third objective function is a function expressed from the perspective of an edge.
  • for a directed edge (v x , v y ), the node v x is the initiator of the business relationship corresponding to the directed edge, and the node v y is the receiver of that business relationship; that is, node v x plays the source role and node v y plays the target role.
  • the algorithm corresponding to the third objective function can be called a bidirectional graph embedding algorithm (Bidirectional Graph Embedding, BGE).
  • L 3 represents the third objective function
  • (v x , v y ) represents the directed edge between node v x and node v y in the target graph, the directed edge is pointed out by node v x and points to node v y
  • E graph represents the set of directed edges in the target graph
  • log( ⁇ ) represents the logarithmic function
  • ⁇ ( ⁇ ) represents the sigmoid function
  • s x represents the initiation embedding feature of the node v x
  • T represents the matrix transpose
  • t y represents the receiving embedding feature of node v y
  • F represents the number of negative sampling nodes v f obtained by negatively sampling the first-order second neighborhood of node v x , and F is a positive integer
  • t f represents the receiving embedded feature of node v f
  • G represents the number of negative sampling nodes v g obtained by negatively sampling the first-order first neighborhood of node v y , and G is a positive integer
  • s g represents the initiating embedding feature of node v g .
  • after the server performs multiple rounds of iteration based on the implementation shown in steps 302 to 307 above, it executes step 311; alternatively, after the server performs multiple rounds of iteration based on the implementation shown in steps 302, 303 and 308 to 310 above, it executes step 311.
  • the initiating embedding feature is used to represent the features of the corresponding node as the initiator, and the receiving embedding feature is used to represent the features of the corresponding node as the receiver.
  • the server ends the iteration after iterating the target number of rounds, and obtains the initiating embedding feature of each node in the target graph as an initiator and the receiving embedding feature of each node as a receiver.
  • for any node, the server can concatenate the node's initiating embedding feature and receiving embedding feature with the account features of the user account represented by the node, and then determine, based on the concatenated features, whether the business between this node and other nodes is at risk. For example, when the user account corresponding to this node initiates a transfer request to the user account corresponding to another node, the server can concatenate the corresponding initiating embedding feature and receiving embedding feature onto the account features of the above two user accounts and input the concatenated features into an anti-fraud model to judge whether the transfer request is a fraudulent business.
  • by introducing the embedding features obtained with the graph data processing method provided in the embodiments of the present application into the anti-fraud model, the hit rate of the anti-fraud model on the number of fraudulent transactions obtains a relative improvement of 14%-21%, and the hit rate on the fraudulent amount obtains a relative improvement of 27%-47%.
  • the graph data processing method provided by the present application is 2-5 times more efficient in obtaining embedded features of nodes than the existing technical solutions, and the efficiency is significantly improved.
  • the distributed computing architecture includes multiple executors and multiple parameter servers.
  • the server performs graph partitioning on the target graph, stores the edge tables corresponding to each graph partition in the executor, and stores the nodes corresponding to each graph partition in the corresponding parameter server.
  • each node has two embedding features: one embedding feature representing the node in the source role, and one embedding feature representing the node in the target role.
  • the parameter server initializes the embedded features of each node, and each executor, within its corresponding graph partition, pulls in batches from the parameter server the first embedded features of the N nodes, the second embedded features of the plurality of first neighborhood nodes, and the third embedded features of the plurality of second neighborhood nodes, and determines N feature gradients of the N nodes. The executor then uploads the N feature gradients to the parameter server, and the parameter server determines the fourth embedded features of the N nodes based on the first embedded features of the N nodes and the N feature gradients, so as to update the embedded features of the nodes.
  • the iteration ends after the target number of rounds is reached.
  • the graph data processing method provided in the embodiment of the present application can also be combined with existing models to realize different services.
  • for example, when combined with a recommendation model and applied to a social scene, the initiating embedding features and receiving embedding features of the nodes obtained with the graph data processing method provided by the embodiment of the present application are introduced into a friend recommendation model, so that the recommendation model can recommend friends for users more accurately.
  • as another example, when applied to a shopping scene, the initiating embedding features and receiving embedding features of the nodes obtained with the graph data processing method provided by the embodiment of the present application are introduced into an item recommendation model, so that the recommendation model can recommend merchandise for users more accurately. No more examples are given here.
  • the embodiment of this application is described with one type of business relationship between nodes. There may be multiple types of business relationships between the objects in the object data; for example, there is a transfer relationship and a follow relationship between object A and object B, and a transfer relationship and a like relationship between object A and object C. Since the follow relationship and the like relationship belong to the same type of business relationship, they usually occur in the same business scenario, such as watching a live broadcast; the transfer relationship is a different type of business relationship from the above two relationships and usually occurs in a different business scenario, such as a social scenario. Therefore, the server can determine multiple target graphs from the graph data based on the relationship types between nodes.
  • Directed edges in the same target graph represent the same type of business relationship, and directed edges in different target graphs represent different types of business relation.
  • the server processes the multiple target graphs separately, and then concatenates or weightedly fuses the initiating embedding features and receiving embedding features obtained by the same node from different target graphs to obtain the node's final initiating embedding feature and receiving embedding feature.
  • the embodiment of the present application provides a novel graph data processing method: N nodes are obtained from the target graph in each round of iteration, and then the fourth embedding features of the N nodes are determined based on the first embedding features of the N nodes, the second embedding features of the plurality of first neighborhood nodes associated with the N nodes, and the third embedding features of the plurality of second neighborhood nodes, thereby completing one round of the iterative process and updating the embedding features of the N nodes. Since not all M nodes are processed, the amount of calculation can be greatly reduced.
  • after multiple rounds of iteration, the initiating embedding features of the M nodes as initiators and the receiving embedding features of the M nodes as receivers can be obtained, so that the embedding features of each node preserve as much information of the first neighborhood and the second neighborhood as possible; then, based on the inner product between the embedding features of the two nodes connected by a directed edge, the pointing relationship of the directed edge can be determined, so that both the dual identity of a node and the pointing relationship of a directed edge can be expressed.
  • Fig. 7 is a block diagram of a graph data processing device provided according to an embodiment of the present application.
  • the device is used to execute the steps in the above-mentioned graph data processing method.
  • the device includes: a graph acquisition module 71 and an iteration module 72, wherein the iteration module 72 includes an acquisition unit 721, a node determination unit 722, and a feature determination unit 723.
  • the graph acquisition module 71 is configured to acquire a target graph based on object data; the target graph includes M nodes representing M objects, M is a positive integer greater than 1, a directed edge in the target graph is used to represent that a business relationship exists between the objects corresponding to the connected nodes, the node from which the directed edge points out is the initiator of the business relationship, and the node to which the directed edge points is the receiver of the business relationship;
  • the iteration module 72 is configured to perform multiple rounds of iteration based on the M nodes and the directed edges between the M nodes to obtain the initiating embedding features and receiving embedding features corresponding to the M nodes; the initiating embedding feature is used to represent the features of the corresponding node as the initiator, and the receiving embedding feature is used to represent the features of the corresponding node as the receiver;
  • the iteration module 72 includes:
  • the obtaining unit 721 is configured to obtain N nodes from the M nodes, where N is a positive integer smaller than M;
  • the node determination unit 722 is configured to determine a plurality of first neighborhood nodes associated with the N nodes and a plurality of second neighborhood nodes associated with the N nodes based on directed edges associated with the N nodes, the first The directed edges pointed out by the neighboring nodes point to the N nodes, and the directed edges pointed out by the N nodes point to the plurality of second neighboring nodes;
  • the feature determining unit 723 is configured to determine the fourth embedded features of the N nodes based on the first embedded features of the N nodes, the second embedded features of the plurality of first neighborhood nodes, and the third embedded features of the plurality of second neighborhood nodes; the fourth embedding features of the N nodes represent the initiating embedding features and receiving embedding features of the N nodes after the z-th round of iterative update.
  • FIG. 8 is a block diagram of another graph data processing device provided according to an embodiment of the present application.
  • the feature determination unit 723 includes:
  • the first determining subunit 7231 is configured to determine a first function based on the first embedding features of the N nodes, the second embedding features of the plurality of first neighborhood nodes, node in-degree distribution information, and the fifth embedding features of a plurality of first nodes; the node in-degree distribution information is used to represent the distribution of the node in-degrees of the N nodes, the plurality of first nodes are obtained by negative sampling of the first neighborhood of the N nodes, the first neighborhood includes the plurality of first neighborhood nodes, and the first embedding features of the N nodes, the second embedding features of the plurality of first neighborhood nodes, and the third embedding features of the plurality of second neighborhood nodes are obtained by performing the (z-1)-th round of iteration;
  • the second determining subunit 7232 is configured to be based on the first embedding features of the N nodes, the third embedding features of the plurality of second neighborhood nodes, node out-degree distribution information, and the sixth embedding features of the plurality of second nodes , to determine the second function, the node out-degree distribution information is used to represent the node out-degree distribution of the N nodes, the plurality of second nodes are obtained based on the negative sampling of the second neighborhood of the N nodes, the second neighbor the domain includes the plurality of second neighborhood nodes;
  • the first optimization subunit 7233 is configured to optimize the first objective function based on the first objective function formed by the first function and the second function, to obtain fourth embedded features of the N nodes.
  • the first determination subunit 7231 is configured to obtain N first intermediate embedded features from the first embedded features of the N nodes, the N first intermediate embedded features being used to represent the features of the N nodes as the receiver; obtain a plurality of second intermediate embedded features from the second embedded features of the plurality of first neighborhood nodes, the plurality of second intermediate embedded features being used to represent the features of the plurality of first neighborhood nodes as the initiator; determine first expectation information based on the node in-degree distribution information, the first expectation information being used to represent the mathematical expectation of the plurality of first nodes satisfying the node in-degree distribution information; obtain a plurality of third intermediate embedded features from the fifth embedded features of the plurality of first nodes, the plurality of third intermediate embedded features being used to represent the features of the plurality of first nodes as the initiator; and determine the first function based on the N first intermediate embedded features, the plurality of second intermediate embedded features, the first expectation information, and the plurality of third intermediate embedded features.
  • the second determining subunit 7232 is configured to obtain N fourth intermediate embedded features from the first embedded features of the N nodes, and the N fourth intermediate embedded features are used to represent the N node as the feature of the initiator; from the third embedded features of the plurality of second neighborhood nodes, a plurality of fifth intermediate embedded features are obtained, and the plurality of fifth intermediate embedded features are used to represent the plurality of second Neighborhood nodes are used as the characteristics of the receiver; based on the node out-degree distribution information, second expectation information is determined, and the second expectation information is used to indicate that the plurality of second nodes meet the mathematical expectation of the node out-degree distribution information; from Among the embedded features of the plurality of second nodes, a plurality of sixth intermediate embedded features are obtained, and the plurality of sixth intermediate embedded features are used to represent the plurality of second nodes as the characteristics of the receiver; based on the N fourth The intermediate embedded features, the plurality of fifth intermediate embedded features, the second desired information, and the plurality of sixth intermediate embedded features determine the second function.
  • the feature determining unit 723 includes:
  • the third determination subunit 7234 is configured to determine first probability information based on the first embedded features of the N nodes and the second embedded features of the plurality of first neighboring nodes, the first probability information is used to indicate that based on the N nodes determine the probability of the plurality of first neighborhood nodes;
  • the fourth determining subunit 7235 is configured to determine second probability information based on the first embedded features of the N nodes and the third embedded features of the plurality of second neighboring nodes, the second probability information is used to indicate based on the N nodes determine the probability of the plurality of second neighborhood nodes;
  • the fifth determination subunit 7236 is configured to determine a second objective function based on the first probability information and the second probability information;
  • the second optimization subunit 7237 is configured to optimize the second objective function to obtain fourth embedded features of the N nodes.
  • the third determination subunit 7234 is configured to obtain N first intermediate embedded features from the first embedded features of the N nodes, and the N first intermediate embedded features are used to represent the N node as the feature of the receiver; from the second embedded features of the multiple first neighborhood nodes, a plurality of second intermediate embedded features are obtained, and the multiple second intermediate embedded features are used to represent the multiple first Neighborhood nodes are the characteristics of the initiator; M seventh intermediate embedded features of the M nodes are obtained, and the M seventh intermediate embedded features are used to indicate that the M nodes are the characteristics of the initiator; based on the N The first intermediate embedded features, the plurality of second intermediate embedded features and the M seventh intermediate embedded features determine the first probability information.
  • the fourth determining subunit 7235 is configured to obtain N fourth intermediate embedded features from the first embedded features of the N nodes, and the N fourth intermediate embedded features are used to represent the N node as the feature of the initiator; from the third embedded features of the plurality of second neighborhood nodes, a plurality of fifth intermediate embedded features are obtained, and the plurality of fifth intermediate embedded features are used to represent the plurality of second Neighborhood nodes are used as the characteristics of the receiver; M eighth intermediate embedded features of the M nodes are obtained, and the M eighth intermediate embedded features are used to represent the M nodes as the characteristics of the receiver; based on the N The fourth intermediate embedded features, the plurality of fifth intermediate embedded features and the M eighth intermediate embedded features determine the second probability information.
  • the device further includes:
  • a division module 73 configured to divide the target graph into at least one graph partition
  • the storage module 74 is configured to, for any graph partition, store the nodes in the graph partition in the parameter server;
  • the storage module 74 is also configured to store the edge table corresponding to the graph partition in the executor corresponding to the graph partition;
  • the sampling module 75 is configured to, for any executor, based on the edge table stored by the executor, obtain a plurality of sampling edges from the target graph by sampling, and the nodes connected by the plurality of sampling edges are the N nodes;
  • the acquiring unit 721 is configured to determine N node identifiers based on the plurality of sampled edges; and acquire the N nodes corresponding to the N node identifiers from the parameter server.
  • the feature determination unit 723 is configured to determine N feature gradients of the N nodes based on the first embedded features of the N nodes, the second embedded features of the plurality of first neighborhood nodes, and the third embedded features of the plurality of second neighborhood nodes; and to upload the N feature gradients to the parameter server, the parameter server determining the fourth embedded features of the N nodes based on the first embedded features of the N nodes and the N feature gradients.
  • with the graph data processing device provided in the embodiment of the present application, N nodes are obtained from the target graph in each round of iteration, and the fourth embedding features of the N nodes are then determined based on the first embedding features of the N nodes, the second embedding features of the plurality of first neighborhood nodes associated with the N nodes, and the third embedding features of the plurality of second neighborhood nodes, thereby completing one round of the iterative process and updating the embedding features of the N nodes; since not all M nodes are processed, the amount of calculation can be greatly reduced.
  • after multiple rounds of iteration, the initiating embedding features of the M nodes as initiators and the receiving embedding features of the M nodes as receivers can be obtained, so that the embedding features of each node preserve as much information of the first neighborhood and the second neighborhood as possible; then, based on the inner product between the embedding features of the two nodes connected by a directed edge, the pointing relationship of the directed edge can be determined, so that both the dual identity of a node and the pointing relationship of a directed edge can be expressed.
  • when the graph data processing device provided in the above embodiment processes graph data, the division into the above functional modules is used only as an example for illustration. In practical applications, the above functions can be allocated to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above.
  • the graph data processing device and the graph data processing method embodiments provided in the above embodiments belong to the same concept, and the implementation process thereof is shown in the method embodiments.
  • the computer device can be configured as a terminal or a server.
  • the terminal can be used as the execution subject to implement the technical solution provided by the embodiment of the application.
  • the technical solution provided by the embodiment of the present application can be implemented by the server as the execution body, or the technical solution provided by the present application can be implemented through the interaction between the terminal and the server, which is not limited in the embodiment of the present application.
  • FIG. 9 is a structural block diagram of a terminal 900 provided according to an embodiment of the present application.
  • the terminal 900 can be a portable mobile terminal, such as a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop or a desktop computer.
  • the terminal 900 may also be called user equipment, portable terminal, laptop terminal, desktop terminal and other names.
  • the terminal 900 includes: a processor 901 and a memory 902 .
  • the processor 901 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like.
  • the processor 901 can be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array) and PLA (Programmable Logic Array).
  • the processor 901 may also include a main processor and a coprocessor; the main processor is a processor for processing data in the awake state, also called a CPU (Central Processing Unit), and the coprocessor is a low-power processor for processing data in the standby state.
  • the processor 901 may be integrated with a GPU (Graphics Processing Unit, image processor), and the GPU is used to render and draw the content to be displayed on the display screen.
  • the processor 901 may also include an AI (Artificial Intelligence) processor, where the AI processor is configured to process computing operations related to machine learning.
  • Memory 902 may include one or more computer-readable storage media, which may be non-transitory.
  • the memory 902 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices and flash memory storage devices.
  • the non-transitory computer-readable storage medium in the memory 902 is used to store at least one computer program, and the at least one computer program is to be executed by the processor 901 to implement the graph data processing methods provided by the method embodiments in this application.
  • the terminal 900 may further include: a peripheral device interface 903 and at least one peripheral device.
  • the processor 901, the memory 902, and the peripheral device interface 903 may be connected through buses or signal lines.
  • Each peripheral device can be connected to the peripheral device interface 903 through a bus, a signal line or a circuit board.
  • the peripheral equipment includes: at least one of a radio frequency circuit 904 , a display screen 905 , a camera assembly 906 , an audio circuit 907 , a positioning assembly 908 and a power supply 909 .
  • the terminal 900 further includes one or more sensors 910 .
  • the one or more sensors 910 include, but are not limited to: an acceleration sensor 911 , a gyro sensor 912 , a pressure sensor 913 , a fingerprint sensor 914 , an optical sensor 915 and a proximity sensor 916 .
  • FIG. 9 does not constitute a limitation on the terminal 900, and may include more or less components than shown in the figure, or combine certain components, or adopt different component arrangements.
  • FIG. 10 is a schematic structural diagram of a server provided according to an embodiment of the present application.
  • the server 1000 may vary greatly due to different configurations or performance, and may include one or more processors (Central Processing Units, CPU) 1001 and one or more memories 1002, where at least one computer program is stored in the memory 1002, and the at least one computer program is loaded and executed by the processor 1001 to implement the graph data processing methods provided by the above method embodiments.
  • the server may also have components such as a wired or wireless network interface, a keyboard, and an input and output interface for input and output, and the server may also include other components for realizing device functions.
  • the embodiment of the present application also provides a computer-readable storage medium, at least one piece of computer program is stored in the computer-readable storage medium, and the at least one piece of computer program is loaded and executed by the processor of the computer device to realize the diagrams of the above-mentioned embodiments An operation performed by a computer device in a data processing method.
  • the computer-readable storage medium can be a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a compact disc read-only memory (Compact Disc Read-Only Memory, CD-ROM) ), tapes, floppy disks, and optical data storage devices, etc.
  • the computer programs involved in the embodiments of the present application can be deployed and executed on one computer device, or executed on multiple computer devices at one location, or distributed in multiple locations and communicated Executed on multiple computer devices interconnected by the network, multiple computer devices distributed in multiple locations and interconnected through a communication network can form a blockchain system.
  • the embodiment of the present application also provides a computer program product, where the computer program product includes computer program code, and the computer program code is stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer program code from the computer-readable storage medium, and the processor executes the computer program code, so that the computer device executes the graph data processing method provided in the above various implementation manners.
  • the program can be stored in a computer-readable storage medium.
  • the above-mentioned storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

本申请实施例提供了一种图数据处理方法、装置、计算机设备、存储介质及计算机程序产品,属于计算机技术领域,能够应用于云技术、人工智能以及智慧交通等场景。方法包括:基于对象数据,获取目标图,基于所述M个节点和所述M个节点之间的有向边进行多轮迭代,得到所述M个节点对应的发起嵌入特征和接收嵌入特征,其中,第z轮迭代的步骤包括:从M个节点中获取N个节点,基于N个节点关联的有向边,确定N个节点关联的多个第一邻域节点以及N个节点关联的多个第二邻域节点,基于N个节点的第一嵌入特征、多个第一邻域节点的第二嵌入特征以及多个第二邻域节点的第三嵌入特征,确定N个节点的第四嵌入特征。

Description

图数据处理方法、装置、计算机设备、存储介质及计算机程序产品
相关申请的交叉引用
本申请基于申请号为202111151811.8、申请日为2021年09月29日的中国专利申请提出,并要求以上中国专利申请的优先权,以上中国专利申请的全部内容在此引入本申请作为参考。
技术领域
本申请涉及计算机技术领域,涉及但不限于一种图数据处理方法、装置、计算机设备、存储介质及计算机程序产品。
背景技术
图嵌入技术已成为大规模图数据挖掘的一种有效技术手段。图嵌入的主要思想是将高维稀疏的原始图数据,映射成低维度稠密的向量,即图中每个节点的嵌入向量,从而在嵌入空间中尽可能地保留图的拓扑结构信息。如何高效的对图数据进行图嵌入是一个研究方向。
目前,常用的图嵌入技术有DeepWalk(深度游走)、Node2Vec(节点到向量)、LINE(Large-scale Information Network Embedding,大规模信息网络嵌入)等,基于随机游走算法来确定图数据中的节点和节点之间的业务关系,确定每个节点的编码向量。
然而,对于有向图来说,有向图中的一对节点分别表示业务关系的发起者和接收者,该对节点之间的有向边从发起者指向接收者,但是上述三种算法最终得到的都是图中每个节点对应的一个编码向量,难以完整表达节点在图中的完整数据关系,因此在处理图数据过程中存在信息损失的问题。
发明内容
本申请实施例提供了一种图数据处理方法、装置、计算机设备、存储介质及计算机程序产品,能够使得节点充分表达图数据中的信息,保证图数据到目标图的转换过程中的信息完整性。
本申请实施例提供一种图数据处理方法,应用于计算机设备,所述方法包括:
基于对象数据,获取目标图,所述目标图包括用于表示M个对象的M个节点,M为大于1的正整数,所述目标图中的有向边用于表示所连接的节点对应的对象之间存在业务关系,且所述有向边的指出节点为所述业务关系的发起者,所述有向边的指向节点为所述业务关系的接收者;
基于所述M个节点和所述M个节点之间的有向边进行多轮迭代,得到所述M个节点对应的发起嵌入特征和接收嵌入特征,所述发起嵌入特征用于表示对应的节点作为发起者的特征,所述接收嵌入特征用于表示对应的节点作为接收者的特征;
其中,第z轮迭代的步骤包括:
从所述M个节点中获取N个节点,N为小于M的正整数,z为正整数;
基于所述N个节点关联的有向边,确定所述N个节点关联的多个第一邻域节点以及所述N个节点关联的多个第二邻域节点,所述第一邻域节点指出的有向边指向所述N个节点,所述N个节点指出的有向边指向所述多个第二邻域节点;
基于所述N个节点的第一嵌入特征、所述多个第一邻域节点的第二嵌入特征以及所述多个第二邻域节点的第三嵌入特征,确定所述N个节点的第四嵌入特征,所述N个节点的第一嵌入特征、所述多个第一邻域节点的第二嵌入特征以及所述多个第二邻域节点的第三嵌入特征为执行第(z-1)轮迭代得到的,所述N个节点的第四嵌入特征表示所述N个节点经过第z轮迭代更新后的发起嵌入特征和接收嵌入特征。
本申请实施例提供一种图数据处理装置,所述装置包括:
图获取模块,配置为基于对象数据,获取目标图,所述目标图包括表示对象的M个节点,M为正整数,所述目标图中的有向边用于表示所连接的节点对应的对象之间存在业务关系,且所述有向边的指出节点为所述业务关系的发起者,所述有向边的指向节点为所述业务关系的接收者;
迭代模块,配置为基于所述M个节点和所述M个节点之间的有向边进行多轮迭代,得到所述M个节点对应的发起嵌入特征和接收嵌入特征,所述发起嵌入特征用于表示对应的节点作为发起者的特征,所述接收嵌入特征用于表示对应的节点作为接收者的特征;
其中,所述迭代模块,还配置为:
从所述M个节点中获取N个节点,N为小于M的正整数;
基于所述N个节点关联的有向边,确定所述N个节点关联的多个第一邻域节点以及所述N个节点关联的多个第二邻域节点,所述第一邻域节点指出的有向边指向所述N个节点,所述N个节点指出的有向边指向所述多个第二邻域节点;
基于所述N个节点的第一嵌入特征、所述多个第一邻域节点的第二嵌入特征以及所述多个第二邻域节点的第三嵌入特征,确定所述N个节点的第四嵌入特征,所述N个节点的第一嵌入特征、所述多个第一邻域节点的第二嵌入特征以及所述多个第二邻域节点的第三嵌入特征为执行第(z-1)轮迭代得到的,所述N个节点的第四嵌入特征表示所述N个节点经过第z轮迭代更新后的发起嵌入特征和接收嵌入特征。
本申请实施例提供一种计算机设备,所述计算机设备包括处理器和存储器,所述存储器用于存储至少一段计算机程序,所述至少一段计算机程序由所述处理器加载并执行以实现本申请实施例中的图数据处理方法所执行的操作。
本申请实施例提供一种计算机可读存储介质,所述计算机可读存储介质中存储有至少一段计算机程序,所述至少一段计算机程序由处理器加载并执行以实现如本申请实施例中图数据处理方法所执行的操作。
本申请实施例提供一种计算机程序产品,该计算机程序产品包括计算机程序代码,该计算机程序代码存储在计算机可读存储介质中,计算机设备的处理器从该计算机可读存储介质读取该计算机程序代码,该处理器执行该计算机程序代码,使得该计算机设备执行上述各个方面的各种实现方式中提供的图数据处理方法。
本申请实施例提供的技术方案带来的有益效果是:
本申请实施例提供了一种图数据处理的技术方案,通过每轮迭代时从目标图中获取N个节点,其中,N个节点为目标图中的部分节点,然后基于该N个节点的第一嵌入特征,该N个节点关联的多个第一邻域节点的第二嵌入特征和多个第二邻域节点的第三嵌入特征,来确定该N个节点的第四嵌入特征,从而完成一轮迭代过程,实现对N个节点的嵌入特征的更新,由于并非是对目标图中全部的M个节点进行处理,因此能够大幅度的减少运算量,在多轮迭代之后,即可得到该M个节点作为发起者的发起嵌入特 征和作为接收者的接收嵌入特征,使得各节点的嵌入特征能够最大化的保存第一邻域和第二邻域的信息,能够使得节点充分表达图数据中的信息,保证图数据到目标图的转换过程中的信息完整性。
附图说明
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1是根据本申请实施例提供的一种图数据处理方法的实施环境示意图;
图2是根据本申请实施例提供的一种图数据处理方法的流程图;
图3是根据本申请实施例提供的另一种图数据处理方法的流程图;
图4是根据本申请实施例提供的一种有向图的示意图;
图5是根据本申请实施例提供的一种有向图邻域定义的示意图;
图6是根据本申请实施例提供的一种分布式计算架构的示意图;
图7是根据本申请实施例提供的一种图数据处理装置的框图;
图8是根据本申请实施例提供的另一种图数据处理装置的框图;
图9是根据本申请实施例提供的一种终端的结构框图;
图10是根据本申请实施例提供的一种服务器的结构示意图。
具体实施方式
为使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请实施方式进行描述。
本申请实施例中术语“第一”“第二”等字样用于对作用和功能基本相同的相同项或相似项进行区分,应理解,“第一”、“第二”、“第n”之间不具有逻辑或时序上的依赖关系,也不对数量和执行顺序进行限定。
本申请实施例中术语“至少一个”是指一个或多个,“多个”的含义是指两个或两个以上。
以下,对本申请实施例涉及的技术术语进行解释。
1)图(Graph),在数学中图是描述于一组对象的结构,其中某些对象对在某种意义上是“相关的”。这些对象对应于称为节点的数学抽象(也称为顶点或点),并且每个相关的节点对通过边(也称为链接或线)来连接。在本申请实施例中,该对象为用户账号、物品、虚拟资源、职位以及虚拟直播间等,本申请实施例对此不进行限制。如对象为用户账号,图中的不同节点表示不同的用户账号,图中的边表示用户账号之间的业务关系,如转账、申请好友等。
2)有向图(Oriented Graph)是全部由有向边构成的图。有向图中的一个节点具有两种不同的身份或角色。例如,从节点1和节点2指出的有向边指向节点3,则节点3相对于节点1和节点2的身份为接收者(也可以称为响应者或目标角色);从节点3指出的有向边指向节点4和节点五,则节点3相对于节点4和节点5的身份为发起者(也可以称为发送者或源角色)。
3)有向关系不对称:有向图中一对节点的边关系是不对称的,如有向图中存在由节点1指向节点3的有向边,而不存在由节点3指向节点1的有向边,则节点1和节点 3的关系不对称。
4)图嵌入(Graph Embedding,也称图表示学习),是一种将图数据(通常为高维稠密的矩阵)映射为低微稠密向量的过程,能够很好地解决图数据难以高效输入机器学习算法的问题。
5)边表,也称边集表,图的储存结构之一。边表由表头结点和表结点两部分组成,图中每个顶点均对应一个存储在数组中的表头结点。
6)出度:由某个节点指出的边的个数称为该节点的出度。
7)入度:指向某个节点的边的个数称为该节点的入度。
本申请实施例提供的图数据处理方法,能够应用于各种场景,包括但不限于云技术、人工智能、智慧交通以及辅助驾驶等。以下介绍本申请实施例的实施环境。
本申请实施例提供的图数据处理方法,能够由计算机设备执行,该计算机设备为终端或服务器。下面首先以计算机设备为服务器为例,介绍本申请实施例提供的图数据处理方法的实施环境,图1是根据本申请实施例提供的一种图数据处理方法的实施环境示意图。参见图1,该实施环境包括终端101和服务器102。
终端101和服务器102能够通过有线或无线通信方式进行直接或间接地连接,在本申请实施例中不做限制。
在一些实施例中,终端101可以是智能手机、平板电脑、笔记本电脑、台式计算机、智能音箱、智能手表、智能家电以及车载终端等,但并不局限于此。终端101中安装和运行有应用程序。该应用程序可以是社交程序、购物程序、多媒体程序以及游戏程序等,本申请实施例对此不进行限制。在一些实施例中,终端101可以是用户使用的终端,该终端101运行有社交程序,该社交程序登录有该用户的用户账号。在一些实施例中,上述终端101的数量可以更多或更少。比如上述终端101仅为一个,或者上述终端101为几十个或几百个,或者更多数量。本申请实施例对终端101的数量和设备类型不加以限定。
在一些实施例中,服务器102可以是独立的物理服务器,也可以是多个物理服务器构成的服务器集群或者分布式系统,还能够是提供云服务、云数据库、云计算、云函数、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、CDN(Content Delivery Network,内容分发网络)以及大数据和人工智能平台等基础云计算服务的云服务器。服务器102用于为终端101运行的上述应用程序提供后台服务。在一些实施例中,服务器102承担主要计算工作,终端101承担次要计算工作;或者,服务器102承担次要计算工作,终端101承担主要计算工作;或者,服务器102和终端101二者之间采用分布式计算架构进行协同计算。
在该实施环境中,以终端101安装的应用程序为购物程序为例进行说明。
例如,终端101安装的购物程序登录有用户账号,服务器102将用户账号数据和商品数据作为对象数据,基于该对象数据构建目标图,该目标图中的节点表示用户账号或商品,该目标图中的有向边表示用户账号与物品之间的业务关系,如顾客使用用户账号购买某个物品,表示顾客的用户账号的节点1与表示物品的节点2之间的有向边表示购买这一业务关系,该有向边由节点1指向节点2,此时节点1为购买的发起者,节点2为购买的接收者;同理,商家使用用户账号上架某个物品,表示商家的用户账号的节点3与表示物品的节点2之间的有向边表示上架这一业务关系,该有向边由节点3指向节点2,此时节点3为上架的发起者,节点2为上架的接收者。服务器102能够采用本申请实施例提供的图数据处理方法,获取该目标图中每个节点的嵌入特征。然后将该嵌入特征应用于物品推荐业务、社交推荐业务以及广告宣传业务等,本申请实施例对此不进行限制。例如,在购物程序中,通过将有向边上的两个节点的嵌入特征进行内积运算, 可以确定该有向边的指向关系,从而基于多个有向边的指向关系,可以确定出用户账号的历史购物信息,从而基于历史购物信息进行物品推荐,能够提高推荐精度。
在另一实施例中,以计算机设备为终端,终端安装有多媒体程序为例。
例如,终端在用户授权之后,获取用户收听歌曲的历史记录数据,该历史记录数据包括歌手名称、歌曲名称以及歌曲类型等对象。终端以该历史记录数据为对象数据,构建目标图,该目标图中的节点表示歌手名称、歌曲名称以及歌曲类型等,该目标图中的有向边表示对象之间的业务关系,如某歌手演唱某歌曲,某歌曲属于某类型,则表示该歌手的节点A与该歌曲的节点B之间的有向边由节点A指向节点B,表示该歌曲的节点B与表示该类型的节点C之间的有向边由节点B指向节点C。终端101能够采用本申请实施例提供的图数据处理方法,获取该目标图中每个节点的嵌入特征。然后将该嵌入特征应用于物品推荐业务、社交推荐业务以及广告宣传业务等,本申请实施例对此不进行限制。
在另一实施例中,以计算机设备为终端,终端安装有社交网络APP为例进行说明。
例如,终端在得到用户授权之后,获取用户的互动历史数据,该互动历史数据包括转发、评论、点赞的其他用户标识。终端以该互动历史数据为对象数据,构建目标图,该目标图的节点表示社交网络的不同用户、有向边表示社交关系,例如,用户1点赞了用户2,那么用户1于用户2之间的有向边由表示用户1的节点A指向表示用户2的节点B,终端101能够采用本申请实施例提供的图数据处理方法,确定目标图中每个节点的嵌入特征,然后利用不同节点的嵌入特征,可以确定不同用户之间的相似度,从而能够基于不同用户之间的相似度预测用户的潜在好友并进行推荐,能够提高推荐准确度。
图2是根据本申请实施例提供的一种图数据处理方法的流程图,如图2所示,在本申请实施例中以由服务器执行为例进行说明。该图数据处理方法包括以下步骤:
201、基于对象数据,获取目标图,该目标图包括用于表示M个对象的M个节点,M为大于1的正整数,该目标图中的有向边用于表示所连接的节点对应的对象之间存在业务关系,且该有向边的指出节点为该业务关系的发起者,该有向边的指向节点为该业务关系的接收者。
在本申请实施例中,该对象数据中的对象包括用户账号、物品、多媒体资源、虚拟资源、虚拟服务以及虚拟直播间等,该对象数据中包括对象之间的关系信息,该关系信息包括对象之间的业务关系以及关联关系等。服务器基于不同的对象数据,能够构建不同的目标图,如社交关系图、购物图以及虚拟直播图等。其中,该对象数据为经过充分授权后采集和使用的数据。目标图中的有向边能够表示对象之间的业务关系,如有向边连接的两个节点对应的对象为两个用户账号,则该有向边表示的业务关系可以为转账关系、关注关系以及点赞关系等。对于任一有向边,指出该有向边的节点为该有向边所表示的业务关系的发起者;该有向边指向的节点为该有向边所表示的业务关系的接收者。
在本申请实施例中,每个对象对应一个节点。
202、基于该M个节点和该M个节点之间的有向边进行多轮迭代,得到该M个节点对应的发起嵌入特征和接收嵌入特征,该发起嵌入特征用于表示对应的节点作为发起者的特征,该接收嵌入特征用于表示对应的节点作为接收者的特征。
在本申请实施例中,服务器能够基于多轮迭代,确定M个节点中每个节点作为发起者的发起嵌入特征以及每个节点作为接收者的接收嵌入特征。在每轮迭代过程中,对于任一节点,服务器能够基于节点之间的有向边,获取以该节点为接收者的第一邻域节点,以及以该节点为发起者的第二邻域节点,进而采用梯度下降的方式,逐步确定该节点的发起嵌入特征和接收嵌入特征。
下面以第z轮迭代为例,z为正整数。第z轮迭代的步骤包括:服务器从上述M个 节点中获取N个节点,N为正整数。然后服务器基于该N个节点关联的有向边,确定该N个节点关联的多个第一邻域节点以及该N个节点关联的多个第二邻域节点。其中,该第一邻域节点指出的有向边指向该N个节点,该N个节点指出的有向边指向该多个第二邻域节点。然后服务器基于上述N个节点的第一嵌入特征、多个第一邻域节点的第二嵌入特征以及多个第二邻域节点的第三嵌入特征,确定该N个节点的第四嵌入特征,完成本轮迭代。该N个节点的第四嵌入特征表示所述N个节点经过第z轮迭代更新后的发起嵌入特征和接收嵌入特征。
需要说明的是,上述N个节点的第一嵌入特征、多个第一邻域节点的第二嵌入特征以及多个第二邻域节点的第三嵌入特征,由服务器执行第z-1轮迭代后得到,其中,N个节点的第一嵌入特征表示N个节点经过第(z-1)轮迭代更新后得到的发起嵌入特征和接收嵌入特征。
本申请实施例提供了一种图数据处理方法,通过每轮迭代时从目标图中获取N个节点,然后基于该N个节点的第一嵌入特征,该N个节点关联的多个第一邻域节点的第二嵌入特征和多个第二邻域节点的第三嵌入特征,来确定该N个节点的第四嵌入特征,从而完成一轮迭代过程,实现对N个节点的嵌入特征的更新,由于并非是对M个节点进行处理,因此能够大幅度的减少运算量,在多轮迭代之后,即可得到该M个节点作为发起者的发起嵌入特征和作为接收者的接收嵌入特征,使得各节点的嵌入特征能够最大化的保存第一邻域和第二邻域的信息,能够使得节点完整表达图数据中的信息,消除了数据转换过程中的信息损失的情况。并且基于有向边连接的两个节点的嵌入特征之间的内积即可确定该有向边的指向关系,从而既能表示节点的双重身份,又能表示有向边的指向关系。
图2示例性的示出了本申请实施例提供的图数据处理方法的实现流程,下面基于一种应用场景,对该图数据处理方法进行介绍。图3是根据本申请实施例提供的另一种图数据处理方法的流程图,如图3所示,在本申请实施例中以由服务器执行,应用于社交场景为例进行说明。该图数据处理方法包括以下步骤:
301、基于对象数据,获取目标图,该目标图包括用于表示M个对象的M个节点,M为大于1的正整数,该目标图中的有向边用于表示所连接的节点对应的对象之间存在业务关系,且该有向边的指出节点为该业务关系的发起者,该有向边的指向节点为该业务关系的接收者。
在本申请实施例中,该对象数据包括对象信息及关系信息,该对象信息用于表示对象的属性,该关系信息用于表示对象之间的业务关系。在该对象数据为用户账号数据时,该用户账号数据包括用户账号信息和账号关系信息。该用户账号信息包括用户账号标识、用户账号等级等用户账号的属性。该账号关系信息包括该用户账号与其他用户账号之间的业务关系等,本申请实施例对该对象数据所包括的内容不进行限制。
在本申请实施例中,服务器能够基于上述对象数据构建目标图,该目标图中包括M个节点和该M个节点之间的有向边。其中,该M个节点表示对象数据中包括的M个不同的用户账号,M为大于1的正整数。该M个节点中,任意两个节点之间的有向边用于表示该两个节点对应的用户账号之间存在业务关系,如好友关系、群组关系、收藏关系以及转账关系等。对于任一有向边,该指出该有向边的节点为该有向边所表示的业务关系的发起者,该有向边指向的节点为该有向边所表示的业务关系的接收者。
例如,图4是根据本申请实施例提供的一种有向图的示意图。参见图4所示,该有向图示例性的示出了A、B、C、D、E、F这个6个表示用户账号的节点。以节点C为例,节点C指出两条有向边,且有两条有向边指向节点C。节点C相对于节点A和节点B来说,扮演着业务关系的接收者,也可以称为目标角色;节点C相对于节点E和 节点F来说,扮演着业务关系的发起者,也可以称为源角色。如节点A和节点B表示的用户账号分别向节点C表示的用户账号发起转账业务,节点A和节点B为该转账业务的发起者,节点C为该转账业务的接收者;如节点C表示的用户账号创建的群组中包括节点E表示的用户账号,则节点C为该群组业务的发起者,节点E为该群组业务的接收者;如节点C表示的用户账号收藏有节点F表示的用户账号发布的内容,则节点C为该收藏业务的发起者,节点F为该收藏业务的接收者。
在一些实施例中,服务器还能够将上述对象数据发送至图创建服务器,由该图创建服务器基于上述对象数据,创建目标图。当然,服务器还能够从本地数据库中获取基于已基于上述对象数据创建完毕的目标图。本申请实施例对目标图的来源不进行限制。
在本申请实施例中,服务器能够基于该M个节点和该M个节点之间的有向边进行多轮迭代,以确定每个节点作为业务关系的发起者时的特征,以及每个节点作为业务关系的接收者时的特征。每轮迭代时,服务器能够对该M个节点分别进行处理,也能够采用分布式处理的方式,基于多个执行器分别处理M个节点中的部分节点,以提高处理效率。
在一些实施例中,服务器能够将目标图划分为至少一个图分区,每个图分区包括多个节点和该多个节点之间的有向边。对于任一图分区,服务器将该图分区中的节点存储至参数服务器中,将该图分区对应的边表,存储至该图分区对应的执行器中。通过将目标图划分为图分区,将节点分布式存储在参数服务器中,使得服务器能够基于多个执行器,实现分布式计算,显著的提高计算图嵌入的效率。
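A minimal Python sketch of this partitioning step follows, assuming a simple modulo-hash assignment of edges by their source node; the actual partitioning strategy is not specified above, so the assignment rule and the helper names are assumptions for illustration only.

    from collections import defaultdict

    def partition_graph(edges, num_partitions):
        # Each partition gets an edge table (stored on its executor) and the set
        # of nodes whose embeddings are stored on its parameter server.
        edge_tables = defaultdict(list)
        node_tables = defaultdict(set)
        for src, dst in edges:
            pid = hash(src) % num_partitions      # assumed assignment rule
            edge_tables[pid].append((src, dst))
            node_tables[pid].add(src)
            node_tables[hash(dst) % num_partitions].add(dst)
        return dict(edge_tables), dict(node_tables)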
服务器中设置有多个执行器,该多个执行器能够同步执行图数据处理任务。以任一执行器为例,该执行器能够采用小批量随机梯度下降的方式确定该图分区中各个节点的嵌入特征,每轮迭代该执行器能够从对应的图分区中获取N个节点,N为正整数。
下面,以任一执行器的第z轮迭代为例进行介绍,z是正整数,步骤302-步骤307示例性的示出了第z轮迭代的一种实现方式,步骤302、步骤303以及步骤308-步骤310示例性示出了第z轮迭代的另一种实现方式。
302、从该M个节点中获取N个节点,N为小于M的正整数。
在本申请实施例中,服务器能够对上述M个节点进行采样,得到N个节点,该N个节点为第z轮迭代将处理的节点。
在一些实施例中,每个执行器均获取N个节点进行处理。对于任一执行器,服务器基于执行器存储的边表,采用Alias Table(别名表)的方法,从该目标图中采样得到多个采样边,其中,可以是从目标图中随机采样得到多个采样边,还可以是按照一定规则进行采样,例如,可以每间隔预设个数采样一个节点。该多个采样边连接的节点为上述N个节点。服务器能够基于该多个采样边,确定N个节点标识,然后从该参数服务器中获取该N个节点标识对应的N个节点。当然,服务器还能够采用其他方式确定采样边,或者直接采样节点,本申请实施例对此不进行限制。
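As a non-limiting illustration of the edge sampling described above, the following Python sketch builds a Walker alias table over an executor's edge table and then draws a mini-batch of edges whose endpoints form the N nodes of one iteration. Uniform edge weights, the batch size and the function names are assumptions made only for illustration.

    import random

    def build_alias_table(weights):
        # Walker's alias method: O(n) construction, O(1) sampling per draw.
        n = len(weights)
        total = float(sum(weights))
        prob = [w * n / total for w in weights]
        alias = [0] * n
        small = [i for i, p in enumerate(prob) if p < 1.0]
        large = [i for i, p in enumerate(prob) if p >= 1.0]
        while small and large:
            s, l = small.pop(), large.pop()
            alias[s] = l
            prob[l] -= (1.0 - prob[s])
            (small if prob[l] < 1.0 else large).append(l)
        return prob, alias

    def alias_draw(prob, alias):
        i = random.randrange(len(prob))
        return i if random.random() < prob[i] else alias[i]

    def sample_batch(edge_table, prob, alias, batch_size):
        # The endpoints of the sampled edges are the N nodes of this iteration.
        edges = [edge_table[alias_draw(prob, alias)] for _ in range(batch_size)]
        nodes = {v for e in edges for v in e}
        return edges, nodes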
303、基于该N个节点关联的有向边,确定该N个节点关联的多个第一邻域节点以及该N个节点关联的多个第二邻域节点,该第一邻域节点指出的有向边指向该N个节点,该N个节点指出的有向边指向该多个第二邻域节点。
在本申请实施例中,对于任一节点,服务器能够对该节点的邻域进行建模,根据节点的双重身份,将节点的邻域分为第一邻域和第二邻域,然后从第一邻域获取该节点关联的至少一个第一邻域节点,从第二邻域获取该节点关联的至少一个第二邻域节点。其中,第一邻域中的第一邻域节点与该节点之间的有向边指向该节点;第二邻域中的第二邻域节点与该节点之间的有向边由该节点指出。需要说明的是,根据节点与节点之间的距离,将节点的邻域划分为1阶至R阶(也可以称为1跳邻居至R跳邻居),R为正整 数。其中,第一邻域也可以称为源邻域,第二邻域也可以称为目标邻域。
例如,以业务关系为转账关系为例,对于任一节点,该节点关联的至少一个第一邻域节点所表示的用户账号,均转账给该节点所表示的用户账号;该节点关联的至少一个第二邻域节点所表示的用户账号,均接收到该节点所表示的用户账号的转账。
对于任一节点v n,该节点的R阶内第一邻域和第二邻域的表示方式,参见公式(1)和公式(2)所示:
N_s(v_n) = N_s^(1)(v_n) ∪ N_s^(2)(v_n) ∪ … ∪ N_s^(R)(v_n)   (1)
其中，N_s(v_n)表示节点v_n的R阶内第一邻域；N_s^(1)(v_n)表示节点v_n的一阶第一邻域；N_s^(2)(v_n)表示节点v_n的二阶第一邻域；N_s^(R)(v_n)表示节点v_n的R阶第一邻域，R为正整数。
N_t(v_n) = N_t^(1)(v_n) ∪ N_t^(2)(v_n) ∪ … ∪ N_t^(R)(v_n)   (2)
其中，N_t(v_n)表示节点v_n的R阶内第二邻域；N_t^(1)(v_n)表示节点v_n的一阶第二邻域；N_t^(2)(v_n)表示节点v_n的二阶第二邻域；N_t^(R)(v_n)表示节点v_n的R阶第二邻域，R为正整数。
例如,图5是根据本申请实施例提供的一种有向图邻域定义的示意图。参见图5所示,节点A的一阶第一邻域包括节点B、节点C和节点D,节点A的二阶第一邻域包括节点H和节点I;节点A的一阶第二邻域包括节点E、节点F和节点G,节点A的二阶第二邻域包括节点J和节点K。其中,一阶第一邻域中的任一节点与节点A之间的有向边指向节点A,二阶第一邻域中任一节点连接的有向边指向一阶第一邻域中的节点;一阶第二邻域中的任一节点与节点A之间的有向边由节点A指出,二阶第二邻域中任一节点连接的有向边由一阶第二邻域中的节点指出。
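The neighborhood definition above (and the example of Fig. 5) can be illustrated with the following Python sketch; it assumes the graph is given as a list of (source, target) directed edges, and the helper names are illustrative only.

    from collections import defaultdict

    def one_hop_neighborhoods(edges):
        # First neighborhood N_s^(1)(v): nodes whose out-edges point to v.
        # Second neighborhood N_t^(1)(v): nodes that v's own out-edges point to.
        first_nbrs = defaultdict(set)
        second_nbrs = defaultdict(set)
        for src, dst in edges:
            second_nbrs[src].add(dst)
            first_nbrs[dst].add(src)
        return first_nbrs, second_nbrs

    def r_order_neighborhood(one_hop, node, r):
        # Union of the 1st- to r-th-order neighborhoods of `node`.
        frontier, result = {node}, set()
        for _ in range(r):
            frontier = {n for f in frontier for n in one_hop.get(f, ())} - result - {node}
            result |= frontier
        return result

    # Example matching Fig. 5: B, C, D point to A, and A points to E, F, G.
    first, second = one_hop_neighborhoods(
        [("B", "A"), ("C", "A"), ("D", "A"), ("A", "E"), ("A", "F"), ("A", "G")])
    assert first["A"] == {"B", "C", "D"} and second["A"] == {"E", "F", "G"}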
需要说明的是,本申请实施例提供了两种实现方式,第一种实现方式基于节点确定第一邻域节点和第二邻域节点的概率信息,来确定该节点的嵌入特征,参见步骤304-步骤307所示;第二种实现方式,通过对节点的第一邻域和第二邻域进行负采样,来确定该节点的嵌入特征,参见步骤308-步骤310所示。
304、基于该N个节点的第一嵌入特征和该多个第一邻域节点的第二嵌入特征,确定第一概率信息,该第一概率信息用于指示基于该N个节点确定该多个第一邻域节点的概率。
在本申请实施例中,对于图中的节点v n,在embedding(嵌入)空间中该节点v n对应有一个发起嵌入特征s n(也可以称为源角色表征向量)和一个接收嵌入特征t n(也可以称为目标角色表征向量),则最终目标为确定目标图中M个节点的发起嵌入特征构成的发起嵌入矩阵S′和M个节点的接收嵌入特征构成的接收嵌入矩阵T′。其中,发起嵌入矩阵S′中的元素表示目标图中各个节点的发起嵌入特征,接收嵌入矩阵T′中的元素表示目标图中各个节点的接收嵌入特征。
在本申请实施例中,第一嵌入特征是指该N个节点的嵌入特征,对于N个节点中的任一节点,该节点的第一嵌入特征包括该节点的发起嵌入特征和该节点的接收嵌入特征。其中,N个节点中的每个节点对应至少一个第一邻域节点。第二嵌入特征是指该多个第一邻域节点的嵌入特征,对于任一第一邻域节点,该第一邻域节点的第二嵌入特征 包括该第一邻域节点的发起嵌入特征和该第一邻域节点的接收嵌入特征。
在一些实施例中,对于节点v n,基于该节点v n确定该节点的第一邻域节点v i时,该节点v n为业务关系的接收者,该第一邻域节点v i为业务关系的发起者,则基于softmax(归一化)函数去定义节点v n在整个目标图中确定第一邻域节点v i的概率,得到第一概率信息。服务器基于N个节点的第一嵌入特征和多个第一邻域节点的第二嵌入特征,确定第一概率信息的步骤包括:服务器从该N个节点的第一嵌入特征中,获取N个第一中间嵌入特征,该N个第一中间嵌入特征用于表示该N个节点作为该接收者的特征,也即该N个节点的接收嵌入特征。服务器从该多个第一邻域节点的第二嵌入特征中,获取多个第二中间嵌入特征,该多个第二中间嵌入特征用于表示该多个第一邻域节点作为该发起者的特征,也即该多个第一邻域节点的发起嵌入特征。服务器获取该M个节点的M个第七中间嵌入特征,该M个第七中间嵌入特征用于表示该M个节点作为该发起者的特征,也即该个节点的发起嵌入特征。最后基于该N个第一中间嵌入特征、该多个第二中间嵌入特征以及该M个第七中间嵌入特征,确定该第一概率信息。相应的,该第一概率信息的计算方式,参见公式(3)所示:
p_s(v_i|v_n) = exp(t_n^T s_i) / Σ_{m=1}^{M} exp(t_n^T s_m)   (3)
其中,p s(v i|v n)表示第一概率信息中的如下概率:基于节点v n在整个目标图中确定第一邻域节点v i的概率;exp(·)表示以e为底的指数函数;t n表示节点v n的第一中间嵌入特征;T表示矩阵转置;s i表示第一邻域节点v i的第二中间嵌入特征;M表示目标图中节点的总数,M为正整数;s m表示节点v m的第七中间嵌入特征,节点v m表示目标图中的第m个节点。
305、基于该N个节点的第一嵌入特征和该多个第二邻域节点的第三嵌入特征,确定第二概率信息,该第二概率信息用于指示基于该N个节点确定该多个第二邻域节点的概率。
在本申请实施例中,与步骤304同理,对于节点v n,基于该节点v n确定该节点的第二邻域节点v j时,该节点v n为业务关系的发起者,该第二邻域节点v j为业务关系的接收者,则基于softmax函数去定义节点v n在整个目标图中确定第二邻域节点v j的概率,得到第二概率信息。
在本申请实施例中,N个节点中的每个节点对应至少一个第二邻域节点。第三嵌入特征是指该多个第二邻域节点的嵌入特征,对于任意第二邻域节点,该第二邻域节点的第三嵌入特征包括该第二邻域节点的发起嵌入特征和该第二邻域节点的接收嵌入特征。
服务器基于该N个节点的第一嵌入特征和该多个第二邻域节点的第三嵌入特征,确定第二概率信息的步骤包括:服务器从该N个节点的第一嵌入特征中,获取N个第四中间嵌入特征,该N个第四中间嵌入特征用于表示该N个节点作为该发起者的特征,也即该N个节点的发起嵌入特征。服务器从该多个第二邻域节点的第三嵌入特征中,获取多个第五中间嵌入特征,该多个第五中间嵌入特征用于表示该多个第二邻域节点作为该接收者的特征,也即该多个第二邻域节点的接收嵌入特征。服务器获取该M个节点 的M个第八中间嵌入特征,该M个第八中间嵌入特征用于表示该M个节点作为该接收者的特征,也即该M个节点的接收嵌入特征。最后基于该N个第四中间嵌入特征、该多个第五中间嵌入特征以及该M个第八中间嵌入特征,确定该第二概率信息。相应的,该第二概率信息的计算方式,参见公式(4)所示:
p_t(v_j|v_n) = exp(s_n^T t_j) / Σ_{m=1}^{M} exp(s_n^T t_m)   (4)
其中,p t(v j|v n)表示第二概率信息中的如下概率:基于节点v n在整个目标图中确定第二邻域节点v j的概率;exp(·)表示以e为底的指数函数;s n表示节点v n的第四中间嵌入特征;T表示矩阵转置;t j表示第二邻域节点v j的第五中间嵌入特征;M表示目标图中节点的总数,M为正整数;t m表示节点v m的第八中间嵌入特征,节点v m表示目标图中的第m个节点。
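A minimal numpy sketch of the softmax probabilities in formulas (3) and (4), assuming the initiating and receiving embeddings of all M nodes are stored as row matrices S_all and T_all (the names are illustrative):

    import numpy as np

    def softmax_over_graph(query, candidates):
        # Softmax over inner products between one embedding and all M
        # opposite-role embeddings (rows of `candidates`).
        logits = candidates @ query
        logits -= logits.max()          # numerical stability
        p = np.exp(logits)
        return p / p.sum()

    # p_s(v_i | v_n), formula (3): receiving embedding t_n against all initiating embeddings.
    #   p_s = softmax_over_graph(T_all[n], S_all); p_s[i] is the probability of node v_i.
    # p_t(v_j | v_n), formula (4): initiating embedding s_n against all receiving embeddings.
    #   p_t = softmax_over_graph(S_all[n], T_all); p_t[j] is the probability of node v_j.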
306、基于该第一概率信息和该第二概率信息,确定第二目标函数。
在本申请实施例中,为了最大化的保存节点的第一邻域和第二邻域中的信息,采用最大似然估计的思想构建目标函数,则对于给定的节点v n,多轮迭代的目标是最大化基于该节点v n确定该节点v n的邻居节点的似然函数,该邻居节点包括第一邻域节点和第二邻域节点。假设节点v n确定邻居节点的概率相互独立,则该似然函数参见公式(5)所示:
L(S′,T′) = ∏_{v_n} ∏_{v_i∈N_s(v_n)} p_s(v_i|v_n) ∏_{v_j∈N_t(v_n)} p_t(v_j|v_n)   (5)
其中，L(S′,T′)表示最大似然函数；S′表示发起嵌入矩阵；T′表示接收嵌入矩阵；N_s(v_n)表示节点v_n的R阶内第一邻域；N_t(v_n)表示节点v_n的R阶内第二邻域；∏(·)表示连乘函数；p_s(v_i|v_n)表示第一概率信息中的如下概率：基于节点v_n在整个目标图中确定第一邻域节点v_i的概率；p_t(v_j|v_n)表示第二概率信息中的如下概率：基于节点v_n在整个目标图中确定第二邻域节点v_j的概率。
在一些实施例中,服务器能够基于负对数似然函数(Negative Log Likelihood)将上述公式(5)中的最大化问题转换为最小化问题,从而得到第二目标函数,该第二目标函数参见公式(6)所示:
L_likelihood = -Σ_{v_n} [ Σ_{v_i∈N_s(v_n)} log( exp(t_n^T s_i) / Σ_{m=1}^{M} exp(t_n^T s_m) ) + Σ_{v_j∈N_t(v_n)} log( exp(s_n^T t_j) / Σ_{m=1}^{M} exp(s_n^T t_m) ) ]   (6)
其中,L likelihood表示负对数似然函数;N s(v n)表示节点v n的R阶内第一邻域;N t(v n)表示节点v n的R阶内第二邻域;v i表示第一邻域节点;v j表示第二邻域节点; log(·)表示对数函数;exp(·)表示以e为底的指数函数;t n表示节点v n的第一中间嵌入特征;T表示矩阵转置;s i表示第一邻域节点v i的第二中间嵌入特征;M表示目标图中节点的总数,M为正整数;s m表示节点v m的第七中间嵌入特征,节点v m表示目标图中的第m个节点;s n表示节点v n的第四中间嵌入特征;t j表示第二邻域节点v j的第五中间嵌入特征;t m表示节点v m的第八中间嵌入特征。
307、对该第二目标函数进行优化,得到该N个节点的第四嵌入特征。
在本申请实施例中,服务器能够基于第二目标函数进行迭代,以获取N个节点的第四嵌入特征。该第四嵌入特征表示该N个节点经过迭代更新后的发起嵌入特征和接收嵌入特征。由于该第二目标函数的优化目标为最小化,因此服务器基于随机梯度下降方法进行多轮迭代,从而得到该N个节点的发起嵌入特征和接收嵌入特征。
在一些实施例中,由于直接优化上述第二目标函数的复杂度较高,因此服务器能够采用负采样对比学习的方式来实现优化。相应的,服务器通过执行步骤308-步骤310,取代执行步骤304-步骤307,能够将优化第二目标函数的问题转换为优化第一目标函数的问题,第一目标函数详见步骤308-步骤310。
需要说明的是,对于图中的节点v n,该节点v n的第一邻域和第二邻域具有不同的负样本分布,该第一邻域包括节点v n的第一邻域节点v i,该第二邻域包括节点v n的第二邻域节点v j。当对节点v n的第一邻域负采样时,负采样关注的是有向边指向节点v n的第一邻域节点v i,则得到节点入度分布信息,即表示指向当前节点v n的有向边的数量的分布。当对节点v n的第二邻域负采样时,负采样关注的是节点v n指出的有向边所指向的第二邻域节点v j,则得到节点出度分布信息,即表示当前节点v n指出的有向边的数量的分布。节点入度分布信息和节点出度分布信息的计算方式参见公式(7)和公式(8)所示:
P_in(v_n) = d_in(v_n) / Σ_{m=1}^{M} d_in(v_m)   (7)
其中，P_in(v_n)表示节点入度分布信息中，指向节点v_n的有向边的数量的分布；d_in(v_n)表示指向节点v_n的有向边的数量；M表示目标图中节点的总数，M为正整数；d_in(v_m)表示指向节点v_m的有向边的数量，节点v_m表示目标图中的第m个节点。
P_out(v_n) = d_out(v_n) / Σ_{m=1}^{M} d_out(v_m)   (8)
其中，P_out(v_n)表示节点出度分布信息中，节点v_n指出的有向边的数量的分布；d_out(v_n)表示节点v_n指出的有向边的数量；M表示目标图中节点的总数，M为正整数；d_out(v_m)表示节点v_m指出的有向边的数量，节点v_m表示目标图中的第m个节点。
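A possible way to build the in-degree and out-degree negative-sampling distributions of formulas (7) and (8), assuming integer node ids and probabilities simply proportional to the degree counts:

    import numpy as np

    def degree_distributions(edges, num_nodes):
        in_deg = np.zeros(num_nodes)
        out_deg = np.zeros(num_nodes)
        for src, dst in edges:
            out_deg[src] += 1.0      # edge pointed out by src
            in_deg[dst] += 1.0       # edge pointing to dst
        return in_deg / in_deg.sum(), out_deg / out_deg.sum()

    def draw_negatives(p, k, rng):
        # Draw k negative nodes following the given degree distribution.
        return rng.choice(len(p), size=k, p=p)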
308、基于该N个节点的第一嵌入特征、该多个第一邻域节点的第二嵌入特征、节点入度分布信息以及多个第一节点的第五嵌入特征,确定第一函数,该节点入度分布信息用于表示该N个节点的节点入度的分布,该多个第一节点基于该N个节点的第一邻域负采样得到,该第一邻域包括该多个第一邻域节点。
在本申请实施例中,在得到节点入度分布信息之后,服务器能够从参数服务器获取该N个节点在第z-1轮迭代后的第一嵌入特征、该多个第一邻域节点的第二嵌入特征以及该多个第一节点的第五嵌入特征,然后基于获取到的上述特征确定第一函数。对于图中的节点v n,在embedding空间中该节点v n对应有一个发起嵌入特征s n和一个接收嵌入特征t n,则最终目标为确定目标图中M个节点的发起嵌入特征构成的发起嵌入矩阵S′和M个节点的接收嵌入特征构成的接收嵌入矩阵T′。其中,发起嵌入矩阵S′中的元素表示目标图中各个节点的发起嵌入特征,接收嵌入矩阵T′中的元素表示目标图中各个节点的接收嵌入特征。
在本申请实施例中,该第五嵌入特征是指该多个第一节点的嵌入特征,对于任一第一节点,该第一节点的第五嵌入特征包括该第一节点的发起嵌入特征和该第一节点的接收嵌入特征。
对于节点v n,基于该节点v n确定该节点的第一邻域节点v i时,该节点v n为业务关系的接收者,该第一邻域节点v i为业务关系的发起者,如业务关系为转账,第一邻域节点v i表示的用户账号均转账给节点v n表示的用户账号。基于负采样得到的节点入度分布信息,能够确定第一函数。
在一些实施例中,服务器基于该N个节点的第一嵌入特征、该多个第一邻域节点的第二嵌入特征、节点入度分布信息以及多个第一节点的第五嵌入特征,确定第一函数的步骤包括:服务器从该N个节点的第一嵌入特征中,获取N个第一中间嵌入特征,该N个第一中间嵌入特征用于表示该N个节点作为该接收者的特征,也即该N个节点的接收嵌入特征。服务器从该多个第一邻域节点的第二嵌入特征中,获取多个第二中间嵌入特征,该多个第二中间嵌入特征用于表示该多个第一邻域节点作为该发起者的特征,也即该多个第一邻域节点的发起嵌入特征。服务器基于该节点入度分布信息,确定第一期望信息,该第一期望信息用于表示该多个第一节点满足该节点入度分布信息的数学期望。服务器从该多个第一节点的第五嵌入特征中,获取多个第三中间嵌入特征,该多第三中间嵌入特征用于表示该多个第一节点作为该发起者的特征,也即该多个第一节点的发起嵌入特征。最后服务器基于上述N个第一中间嵌入特征、多个第二中间嵌入特征、第一期望信息以及多个第三中间嵌入特征,确定第一函数。通过基于负采样得到的节点入度分布信息来确定第一函数,由于不需要基于目标图中所有节点的嵌入特征进行计算,能够显著的减少计算的复杂度,从而提高计算效率,节约计算资源。
其中,第一函数参见公式(9)所示:
L_1 = -Σ_{v_n} Σ_{v_i∈N_s(v_n)} [ log σ(t_n^T s_i) + Σ_{k=1}^{K} E_{v_k~P_in} log σ(-t_n^T s_k) ]   (9)
其中，L_1表示第一函数，N_s(v_n)表示节点v_n的R阶内第一邻域；v_i表示第一邻域节点；log(·)表示对数函数；σ(·)表示sigmoid函数，t_n表示节点v_n的第一中间嵌入特征；T表示矩阵转置；s_i表示第一邻域节点v_i的第二中间嵌入特征；K表示对第一邻域进行负采样得到的第一节点的数量，K为正整数；E_{v_k~P_in}表示第一期望信息；v_k表示第k个第一节点；P_in表示节点入度分布信息中，指向节点v_n的有向边的数量的分布；s_k表示第一节点v_k的第三中间嵌入特征。
309、基于该N个节点的第一嵌入特征、该多个第二邻域节点的第三嵌入特征、节点出度分布信息以及多个第二节点的第六嵌入特征,确定第二函数,该节点出度分布信息用于表示该N个节点的节点出度的分布,该多个第二节点基于该N个节点的第二邻域负采样得到,该第二邻域包括该多个第二邻域节点。
在本申请实施例中,与步骤308同理,对于节点v n,基于该节点v n确定该节点的第二邻域节点v j时,该节点v n为业务关系的发起者,该第二邻域节点v j为业务关系的接收者,如业务关系为转账,第二邻域节点v j表示的用户账号均接收到节点v n表示的用户账号的转账。基于负采样得到的节点出度分布信息,能够确定第二函数。
在本申请实施例中,该第六嵌入特征是指该多个第二节点的嵌入特征,对于任一第二节点,该第二节点的第六嵌入特征包括该第二节点的发起嵌入特征和该第二节点的接收嵌入特征。
在一些实施例中,服务器基于N个节点的第一嵌入特征、多个第二邻域节点的第三嵌入特征、节点出度分布信息以及多个第二节点的第六嵌入特征,确定第二函数的步骤包括:服务器从N个节点的第一嵌入特征中,获取N个第四中间嵌入特征,该N个第四中间嵌入特征用于表示该N个节点作为该发起者的特征,也即该N个节点的发起嵌入特征。服务器从多个第二邻域节点的第三嵌入特征中,获取多个第五中间嵌入特征,该多个第五中间嵌入特征用于表示该多个第二邻域节点作为该接收者的特征,也即该多个邻域节点的接收嵌入特征。服务器基于节点出度分布信息,确定第二期望信息,该第二期望信息用于表示该多个第二节点满足该节点出度分布信息的数学期望。服务器从多个第二节点的嵌入特征中,获取多个第六中间嵌入特征,该多个第六中间嵌入特征用于表示该多个第二节点作为该接收者的特征,也即该多个第二节点的接收嵌入特征。最后服务器基于上述N个第四中间嵌入特征、多个第五中间嵌入特征、第二期望信息以及多个第六中间嵌入特征,确定该第二函数。通过基于负采样得到的节点出度分布信息来确定第二函数,由于不需要基于目标图中所有节点的嵌入特征进行计算,能够显著的减少计算的复杂度,从而提高计算效率,节约计算资源。
其中,第二函数参见公式(10)所示:
L_2 = -Σ_{v_n} Σ_{v_j∈N_t(v_n)} [ log σ(s_n^T t_j) + Σ_{l=1}^{L} E_{v_l~P_out} log σ(-s_n^T t_l) ]   (10)
其中，L_2表示第二函数，N_t(v_n)表示节点v_n的R阶内第二邻域；v_j表示第二邻域节点；log(·)表示对数函数；σ(·)表示sigmoid函数，s_n表示节点v_n的第四中间嵌入特征；T表示矩阵转置；t_j表示第二邻域节点v_j的第五中间嵌入特征；L表示对第二邻域进行负采样得到的第二节点的数量，L为正整数；E_{v_l~P_out}表示第二期望信息；v_l表示第l个第二节点；P_out表示节点出度分布信息中，节点v_n指出的有向边的数量的分布；t_l表示第二节点v_l的第六中间嵌入特征。
310、对基于该第一函数和该第二函数构成的第一目标函数进行优化,得到该N个节点的第四嵌入特征。
在本申请实施例中,服务器能够基于第一函数和第二函数构建第一目标函数,该第一目标函数参见公式(11)所示:
L_neg = L_1 + L_2   (11)
其中,L neg表示第一目标函数;L 1表示第一函数;L 2表示第二函数。
服务器采用小批量随机梯度下降方法,基于多轮迭代来优化该第一目标函数,得到上述N个节点的第四嵌入特征。
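The following sketch shows how the negative-sampling objective L_neg = L_1 + L_2 could be evaluated for one mini-batch, assuming S and T are M x d numpy arrays holding the initiating and receiving embeddings and that the neighborhoods and degree distributions have been precomputed. It is a simplified illustration rather than the exact distributed implementation; in practice the gradients of this loss are what the executors upload to the parameter server.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def neg_sampling_loss(S, T, batch, p_in, p_out, K, L, rng):
        # batch: list of (v_n, first_nbrs, second_nbrs) with integer node ids.
        loss = 0.0
        for v_n, first_nbrs, second_nbrs in batch:
            for v_i in first_nbrs:                      # L_1: v_n acts as receiver
                loss -= np.log(sigmoid(T[v_n] @ S[v_i]))
                for v_k in rng.choice(len(S), size=K, p=p_in):
                    loss -= np.log(sigmoid(-T[v_n] @ S[v_k]))
            for v_j in second_nbrs:                     # L_2: v_n acts as initiator
                loss -= np.log(sigmoid(S[v_n] @ T[v_j]))
                for v_l in rng.choice(len(S), size=L, p=p_out):
                    loss -= np.log(sigmoid(-S[v_n] @ T[v_l]))
        return loss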
在一些实施例中,在保留一阶邻域的情况下,能够在高稀疏的图数据上达到期望的效果。并且,在保留节点的一阶第一邻域和一阶第二邻域的情况下,第一目标函数中对节点的邻域求和等价于对图中所有边求和,因此服务器能够将第一目标函数转化为第三目标函数。
其中,第三目标函数是以边的视角进行表示的函数,对于一条有向边(v x,v y),节点v x为该有向边对应的业务关系的发起者,节点v y为该有向边对应的业务关系的接收者,也即节点v x扮演源角色,节点v y扮演目标角色。在正向观察该有向边(v x,v y)时,关注基于节点v x确定节点v y的概率,并对节点v x的一阶第二邻域进行负采样得到负采样节点v f;在反向观察该有向边(v x,v y)时,关注基于节点v y确定节点v x的概率,并对节点v y的一阶第一邻域进行负采样得到负采样节点v y,也即同时考虑了正反两个方向的信息,则该第三目标函数对应的算法可以称为双向图嵌入算法(Bidirectional Graph Embedding,BGE)。
其中,第三目标函数如公式(12)所示:
L_3 = -Σ_{(v_x,v_y)∈E_graph} [ log σ(s_x^T t_y) + Σ_{f=1}^{F} E_{v_f~P_out(v_x)} log σ(-s_x^T t_f) + Σ_{g=1}^{G} E_{v_g~P_in(v_y)} log σ(-s_g^T t_y) ]   (12)
其中，L_3表示第三目标函数；(v_x,v_y)表示目标图中节点v_x和节点v_y之间的有向边，该有向边由节点v_x指出，指向节点v_y；E_graph表示目标图中有向边的集合；log(·)表示对数函数；σ(·)表示sigmoid函数；s_x表示节点v_x的发起嵌入特征；T表示矩阵转置；t_y表示节点v_y的接收嵌入特征；F表示对节点v_x的一阶第二邻域进行负采样得到负采样节点v_f的数量，F为正整数；P_out(v_x)表示节点出度分布信息中，节点v_x指出的有向边的数量的分布；t_f表示节点v_f的接收嵌入特征；G表示对节点v_y的一阶第一邻域进行负采样得到负采样节点v_g的数量，G为正整数；P_in(v_y)表示节点入度分布信息中，指向节点v_y的有向边的数量的分布；s_g表示节点v_g的发起嵌入特征。
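A per-edge sketch of the edge-view (BGE) objective in formula (12), under the same assumptions as the previous sketch (S and T as embedding matrices, P_in/P_out as the degree distributions):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def bge_edge_loss(S, T, edge, p_in, p_out, F, G, rng):
        v_x, v_y = edge                                   # directed edge v_x -> v_y
        loss = -np.log(sigmoid(S[v_x] @ T[v_y]))          # positive term
        for v_f in rng.choice(len(S), size=F, p=p_out):   # negatives against the source side
            loss -= np.log(sigmoid(-S[v_x] @ T[v_f]))
        for v_g in rng.choice(len(S), size=G, p=p_in):    # negatives against the target side
            loss -= np.log(sigmoid(-S[v_g] @ T[v_y]))
        return loss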
需要说明的是,服务器基于上述步骤302-步骤307示出的实现方式进行多轮迭代后,执行步骤311,或者服务器基于上述步骤302、步骤303以及步骤308-步骤310示出的实现方式进行多轮迭代后,执行步骤311。
311、响应于多轮迭代结束,得到该M个节点对应的发起嵌入特征和接收嵌入特征,该发起嵌入特征用于表示对应的节点作为发起者的特征,该接收嵌入特征用于表示对应的节点作为接收者的特征。
在本申请实施例中,服务器在迭代目标轮数后结束迭代,得到目标图中每个节点作为发起者的发起嵌入特征以及每个节点作为接收者的接收嵌入特征。
在一些实施例中,对于任一节点,服务器能够将该节点的发起嵌入特征和接收嵌入特征与该节点表示的用户账号的账号特征进行拼接,然后基于拼接得到的拼接特征来确定该节点与其他节点之间的业务是否存在风险。如该节点对应的用户账号向另一节点对应的用户账号发起转账请求时,服务器能够在上述两个用户账号的账号特征的基础上,拼接对应的发起嵌入特征和接收嵌入特征,将拼接得到的特征输入反欺诈模型,从而判断该转账请求是否是欺诈业务。通过将基于本申请实施例提供的图数据处理方法得到的嵌入特征引入至反欺诈模型,使得该反欺诈模型的欺诈笔数命中率由14%-21%的相对提升,欺诈金额的命中率由27%-47%的相对提升。并且,本申请提供的图数据处理方法在获取节点的嵌入特征的效率上,是现有技术方案的2-5倍,效率显著提高。
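The feature concatenation used before the anti-fraud model might look as follows; which account features are concatenated and in which order are assumptions made only for illustration:

    import numpy as np

    def transfer_request_features(S, T, account_features, src, dst):
        # src initiates the transfer (source role), dst receives it (target role).
        return np.concatenate([
            account_features[src],
            account_features[dst],
            S[src],          # initiating embedding of the initiator
            T[dst],          # receiving embedding of the receiver
        ])

    # The concatenated vector is then fed to a downstream anti-fraud classifier (not shown).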
需要说明的是,为了使服务器分布式执行本申请实施例提供的图数据处理方法的流程更容易理解,参见图6所示,图6是根据本申请实施例提供的一种分布式计算架构的示意图。在图6中,该分布式计算架构包括多个执行器和多个参数服务器。服务器对目标图进行图分区,将各图分区对应的边表存储在执行器中,各图分区对应的节点存储在对应的参数服务器中,每个节点有两个嵌入特征,一个表示节点作为源角色的嵌入特征,一个表示节点作为目标角色的嵌入特征。参数服务器对各节点的嵌入特征做初始化,各执行器分别在对应的图分区内按照batch(批次)从参数服务器拉取N个节点的第一嵌入特征、多个第一邻域节点的第二嵌入特征以及多个第二邻域节点的第三嵌入特征,确定该N个节点的N个特征梯度。然后将该N个特征梯度上传至参数服务器,由参数服务器基于该N个节点的第一嵌入特征和N个特征梯度,确定N个节点的第四嵌入特征,以实现节点的嵌入特征的更新。最后迭代至目标轮数后结束。
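The executor / parameter-server interaction can be sketched as below; the class, the pull/push method names and the plain SGD update are illustrative assumptions rather than the actual distributed implementation:

    import numpy as np

    class ParameterServer:
        # Toy, in-memory stand-in: for every node it holds one source-role (S)
        # and one target-role (T) embedding.
        def __init__(self, node_ids, dim, seed=0):
            rng = np.random.default_rng(seed)
            self.S = {v: rng.normal(scale=0.1, size=dim) for v in node_ids}
            self.T = {v: rng.normal(scale=0.1, size=dim) for v in node_ids}

        def pull(self, node_ids):
            # Executors pull the current embeddings of a batch of nodes.
            return {v: (self.S[v], self.T[v]) for v in node_ids}

        def push(self, gradients, lr=0.025):
            # gradients: {node_id: (grad_S, grad_T)} computed by an executor.
            for v, (g_s, g_t) in gradients.items():
                self.S[v] -= lr * g_s
                self.T[v] -= lr * g_t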
需要说明的是,本申请实施例提供的图数据处理方法还能与已有的模型结合,以实现不同的业务。以和推荐模型结合,如应用于社交场景时,将基于本申请实施例提供的图数据处理方法得到的节点的发起嵌入特征和接收嵌入特征,引入好友推荐模型中,使得该推荐模型能够更为精准的为用户推荐好友等。再如应用于购物场景时,将基于本申请实施例提供的图数据处理方法得到的节点的发起嵌入特征和接收嵌入特征,引入物品推荐模型中,使得该推荐模型能够更为精准的为用户推荐商品等。在此不再一一举例说明。
需要说明的是,本申请实施例以节点之间具有一种类型的业务关系进行说明,由于对象数据中对象之间可能存在多种类型的业务关系,如对象甲和对象乙之间存在转账关系和关注关系,对象甲和对象丙之间存在转账关系和点赞关系,由于关注关系和点赞关系属于同一种类型的业务关系,通常发生在相同的业务场景,如观看直播场景;而转账关系与上述两种关系属于不同类型的业务关系,通常发生在不同的业务场景,如转账关系发生在社交场景。因此,服务器能够根据节点之间的关系类型,基于图数据确定多个目标图,同一目标图中的有向边表示同一种类型的业务关系,不同目标图中的有向边表示不同类型的业务关系。服务器基于本申请实施例提供的图数据处理方法,分别对该多个目标图进行处理,然后将同一节点基于不同目标图得到的发起嵌入特征和接收嵌入特 征进行拼接或者加权融合,得到该节点的最终的发起嵌入特征和接收嵌入特征。通过按照不同类型的业务关系进行处理,使得节点在具有两种以上的身份时,能够确定多个身份的嵌入特征,从而便于针对不同业务场景使用对应身份的嵌入特征。
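A sketch of concatenating or weight-fusing the embeddings that one node obtains from the different relation-type target graphs; the fusion modes and names are illustrative:

    import numpy as np

    def fuse_relation_embeddings(per_relation, mode="concat", weights=None):
        # per_relation: list of (S_r, T_r) dicts, one pair per relation-type graph.
        nodes = set.intersection(*(set(S_r) for S_r, _ in per_relation))
        if weights is None:
            weights = [1.0 / len(per_relation)] * len(per_relation)
        fused_S, fused_T = {}, {}
        for v in nodes:
            s_parts = [S_r[v] for S_r, _ in per_relation]
            t_parts = [T_r[v] for _, T_r in per_relation]
            if mode == "concat":
                fused_S[v], fused_T[v] = np.concatenate(s_parts), np.concatenate(t_parts)
            else:  # weighted fusion
                fused_S[v] = sum(w * s for w, s in zip(weights, s_parts))
                fused_T[v] = sum(w * t for w, t in zip(weights, t_parts))
        return fused_S, fused_T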
本申请实施例提供了一种新颖的图数据处理方法,通过每轮迭代时从目标图中获取N个节点,然后基于该N个节点的第一嵌入特征,该N个节点关联的多个第一邻域节点的第二嵌入特征和多个第二邻域节点的第三嵌入特征,来确定该N个节点的第四嵌入特征,从而完成一轮迭代过程,实现对N个节点的嵌入特征的更新,由于并非是对M个节点进行处理,因此能够大幅度的减少运算量,在多轮迭代之后,即可得到该M个节点作为发起者的发起嵌入特征和作为接收者的接收嵌入特征,使得各节点的嵌入特征能够最大化的保存第一邻域和第二邻域的信息,进而基于有向边连接的两个节点的嵌入特征之间的内积即可确定该有向边的指向关系,从而既能表示节点的双重身份,又能表示有向边的指向关系。
图7是根据本申请实施例提供的一种图数据处理装置的框图。该装置用于执行上述图数据处理方法中的步骤,参见图7,装置包括:图获取模块71和迭代模块72,其中迭代模块72包括获取单元721、节点确定单元722以及特征确定单元723。
图获取模块71,配置为基于对象数据,获取目标图,该目标图包括用于表示M个对象的M个节点,M为大于1的正整数,该目标图中的有向边用于表示所连接的节点对应的对象之间存在业务关系,且该有向边的指出节点为该业务关系的发起者,该有向边的指向节点为该业务关系的接收者;
迭代模块72,配置为基于该M个节点和该M个节点之间的有向边进行多轮迭代,得到该M个节点对应的发起嵌入特征和接收嵌入特征,该发起嵌入特征用于表示对应的节点作为发起者的特征,该接收嵌入特征用于表示对应的节点作为接收者的特征;
其中,该迭代模块72,包括:
获取单元721,配置为从该M个节点中获取N个节点,N为小于M的正整数;
节点确定单元722,配置为基于该N个节点关联的有向边,确定该N个节点关联的多个第一邻域节点以及该N个节点关联的多个第二邻域节点,该第一邻域节点指出的有向边指向该N个节点,该N个节点指出的有向边指向该多个第二邻域节点;
特征确定单元723,配置为基于该N个节点的第一嵌入特征、该多个第一邻域节点的第二嵌入特征以及该多个第二邻域节点的第三嵌入特征,确定该N个节点的第四嵌入特征,所述N个节点的第四嵌入特征表示所述N个节点经过第z轮迭代更新后的发起嵌入特征和接收嵌入特征。
在一些实施例中,图8是根据本申请实施例提供的另一种图数据处理装置的框图,参见图8所示,该特征确定单元723,包括:
第一确定子单元7231,配置为基于该N个节点的第一嵌入特征、该多个第一邻域节点的第二嵌入特征、节点入度分布信息以及多个第一节点的第五嵌入特征,确定第一函数,该节点入度分布信息用于表示该N个节点的节点入度的分布,该多个第一节点基于该N个节点的第一邻域负采样得到,该第一邻域包括该多个第一邻域节点,所述N个节点的第一嵌入特征、所述多个第一邻域节点的第二嵌入特征以及所述多个第二邻域节点的第三嵌入特征为执行第(z-1)轮迭代得到的;
第二确定子单元7232,配置为基于该N个节点的第一嵌入特征、该多个第二邻域节点的第三嵌入特征、节点出度分布信息以及多个第二节点的第六嵌入特征,确定第二函数,该节点出度分布信息用于表示该N个节点的节点出度的分布,该多个第二节点基于该N个节点的第二邻域负采样得到,该第二邻域包括该多个第二邻域节点;
第一优化子单元7233,配置为基于该第一函数和该第二函数构成的第一目标函数, 对所述第一目标函数进行优化,得到该N个节点的第四嵌入特征。
在一些实施例中,参见图8所示,该第一确定子单元7231,配置为从该N个节点的第一嵌入特征中,获取N个第一中间嵌入特征,该N个第一中间嵌入特征用于表示该N个节点作为该接收者的特征;从该多个第一邻域节点的第二嵌入特征中,获取多个第二中间嵌入特征,该多个第二中间嵌入特征用于表示该多个第一邻域节点作为该发起者的特征;基于该节点入度分布信息,确定第一期望信息,该第一期望信息用于表示该多个第一节点满足该节点入度分布信息的数学期望;从该多个第一节点的第五嵌入特征中,获取多个第三中间嵌入特征,该多第三中间嵌入特征用于表示该多个第一节点作为该发起者的特征;基于该N个第一中间嵌入特征、该多个第二中间嵌入特征、该第一期望信息以及该多个第三中间嵌入特征,确定该第一函数。
在一些实施例中,该第二确定子单元7232,配置为从该N个节点的第一嵌入特征中,获取N个第四中间嵌入特征,该N个第四中间嵌入特征用于表示该N个节点作为该发起者的特征;从该多个第二邻域节点的第三嵌入特征中,获取多个第五中间嵌入特征,该多个第五中间嵌入特征用于表示该多个第二邻域节点作为该接收者的特征;基于该节点出度分布信息,确定第二期望信息,该第二期望信息用于表示该多个第二节点满足该节点出度分布信息的数学期望;从该多个第二节点的嵌入特征中,获取多个第六中间嵌入特征,该多个第六中间嵌入特征用于表示该多个第二节点作为该接收者的特征;基于该N个第四中间嵌入特征、该多个第五中间嵌入特征、该第二期望信息以及该多个第六中间嵌入特征,确定该第二函数。
在一些实施例中,该特征确定单元723,包括:
第三确定子单元7234,配置为基于该N个节点的第一嵌入特征和该多个第一邻域节点的第二嵌入特征,确定第一概率信息,该第一概率信息用于指示基于该N个节点确定该多个第一邻域节点的概率;
第四确定子单元7235,配置为基于该N个节点的第一嵌入特征和该多个第二邻域节点的第三嵌入特征,确定第二概率信息,该第二概率信息用于指示基于该N个节点确定该多个第二邻域节点的概率;
第五确定子单元7236,配置为基于该第一概率信息和该第二概率信息,确定第二目标函数;
第二优化子单元7237,配置为对该第二目标函数进行优化,得到该N个节点的第四嵌入特征。
在一些实施例中,该第三确定子单元7234,配置为从该N个节点的第一嵌入特征中,获取N个第一中间嵌入特征,该N个第一中间嵌入特征用于表示该N个节点作为该接收者的特征;从该多个第一邻域节点的第二嵌入特征中,获取多个第二中间嵌入特征,该多个第二中间嵌入特征用于表示该多个第一邻域节点作为该发起者的特征;获取该M个节点的M个第七中间嵌入特征,该M个第七中间嵌入特征用于表示该M个节点作为该发起者的特征;基于该N个第一中间嵌入特征、该多个第二中间嵌入特征以及该M个第七中间嵌入特征,确定该第一概率信息。
在一些实施例中,该第四确定子单元7235,配置为从该N个节点的第一嵌入特征中,获取N个第四中间嵌入特征,该N个第四中间嵌入特征用于表示该N个节点作为该发起者的特征;从该多个第二邻域节点的第三嵌入特征中,获取多个第五中间嵌入特征,该多个第五中间嵌入特征用于表示该多个第二邻域节点作为该接收者的特征;获取该M个节点的M个第八中间嵌入特征,该M个第八中间嵌入特征用于表示该M个节点作为该接收者的特征;基于该N个第四中间嵌入特征、该多个第五中间嵌入特征以及该M个第八中间嵌入特征,确定该第二概率信息。
在一些实施例中,参见图8所示,该装置还包括:
划分模块73,配置为将该目标图被划分为至少一个图分区;
存储模块74,配置为对于任一图分区,将该图分区中的节点存储至参数服务器中;
该存储模块74,还配置为将该图分区对应的边表,存储至该图分区对应的执行器中;
采样模块75,配置为对于任一执行器,基于该执行器存储的边表,从该目标图中采样得到多个采样边,该多个采样边连接的节点为该N个节点;
该获取单元721,配置为基于该多个采样边,确定N个节点标识;从该参数服务器获取该N个节点标识对应的该N个节点。
在一些实施例中,该特征确定单元723,配置为基于该N个节点的第一嵌入特征、该多个第一邻域节点的第二嵌入特征以及该多个第二邻域节点的第三嵌入特征,确定该N个节点的N个特征梯度;将该N个特征梯度上传至该参数服务器,由该参数服务器基于该N个节点的第一嵌入特征和该N个特征梯度,确定该N个节点的第四嵌入特征。
本申请实施例提供的图数据处理装置,通过每轮迭代时从目标图中获取N个节点,然后基于该N个节点的第一嵌入特征,该N个节点关联的多个第一邻域节点的第二嵌入特征和多个第二邻域节点的第三嵌入特征,来确定该N个节点的第四嵌入特征,从而完成一轮迭代过程,实现对N个节点的嵌入特征的更新,由于并非是对M个节点进行处理,因此能够大幅度的减少运算量,在多轮迭代之后,即可得到该M个节点作为发起者的发起嵌入特征和作为接收者的接收嵌入特征,使得各节点的嵌入特征能够最大化的保存第一邻域和第二邻域的信息,进而基于有向边连接的两个节点的嵌入特征之间的内积即可确定该有向边的指向关系,从而既能表示节点的双重身份,又能表示有向边的指向关系。
需要说明的是:上述实施例提供的图数据处理装置在处理图数据时,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将装置的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的图数据处理装置与图数据处理方法实施例属于同一构思,其实现过程见方法实施例。
在本申请实施例中,计算机设备能够被配置为终端或者服务器,当计算机设备被配置为终端时,可以由终端作为执行主体来实施本申请实施例提供的技术方案,当计算机设备被配置为服务器时,可以由服务器作为执行主体来实施本申请实施例提供的技术方案,也可以通过终端和服务器之间的交互来实施本申请提供的技术方案,本申请实施例对此不作限定。
FIG. 9 is a structural block diagram of a terminal 900 according to an embodiment of this application. The terminal 900 may be a portable mobile terminal, for example, a smartphone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The terminal 900 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, or desktop terminal.
Generally, the terminal 900 includes a processor 901 and a memory 902.
The processor 901 may include one or more processing cores, for example, a 4-core processor or an 8-core processor. The processor 901 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), or PLA (Programmable Logic Array). The processor 901 may also include a main processor and a coprocessor. The main processor is a processor configured to process data in an awake state, also referred to as a CPU (Central Processing Unit); the coprocessor is a low-power processor configured to process data in a standby state. In some embodiments, the processor 901 may be integrated with a GPU (Graphics Processing Unit), and the GPU is responsible for rendering and drawing the content to be displayed on a display screen. In some embodiments, the processor 901 may further include an AI (Artificial Intelligence) processor configured to handle computing operations related to machine learning.
The memory 902 may include one or more computer-readable storage media, which may be non-transitory. The memory 902 may further include a high-speed random access memory and a non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In some embodiments, the non-transitory computer-readable storage medium in the memory 902 is configured to store at least one computer program, and the at least one computer program is executed by the processor 901 to implement the graph data processing method provided in the method embodiments of this application.
In some embodiments, the terminal 900 may further include a peripheral device interface 903 and at least one peripheral device. The processor 901, the memory 902, and the peripheral device interface 903 may be connected through a bus or a signal line. Each peripheral device may be connected to the peripheral device interface 903 through a bus, a signal line, or a circuit board. The peripheral devices include at least one of a radio frequency circuit 904, a display screen 905, a camera component 906, an audio circuit 907, a positioning component 908, and a power supply 909.
In some embodiments, the terminal 900 further includes one or more sensors 910. The one or more sensors 910 include, but are not limited to, an acceleration sensor 911, a gyroscope sensor 912, a pressure sensor 913, a fingerprint sensor 914, an optical sensor 915, and a proximity sensor 916.
A person skilled in the art may understand that the structure shown in FIG. 9 does not constitute a limitation on the terminal 900, and the terminal 900 may include more or fewer components than shown in the figure, combine some components, or adopt a different component arrangement.
FIG. 10 is a schematic structural diagram of a server according to an embodiment of this application. The server 1000 may vary greatly in configuration or performance, and may include one or more processors (Central Processing Units, CPUs) 1001 and one or more memories 1002. The memory 1002 stores at least one computer program, and the at least one computer program is loaded and executed by the processor 1001 to implement the graph data processing method provided in the foregoing method embodiments. Certainly, the server may further include components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output, and may further include other components for implementing device functions.
An embodiment of this application further provides a computer-readable storage medium, storing at least one computer program. The at least one computer program is loaded and executed by a processor of a computer device to implement the operations performed by the computer device in the graph data processing method of the foregoing embodiments. For example, the computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a compact disc read-only memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, or the like.
In some embodiments, the computer program involved in the embodiments of this application may be deployed and executed on one computer device, or executed on a plurality of computer devices located at one location, or executed on a plurality of computer devices distributed at a plurality of locations and interconnected through a communication network. The plurality of computer devices distributed at a plurality of locations and interconnected through a communication network may form a blockchain system.
An embodiment of this application further provides a computer program product, including computer program code stored in a computer-readable storage medium. A processor of a computer device reads the computer program code from the computer-readable storage medium and executes the computer program code, so that the computer device performs the graph data processing method provided in the foregoing implementations.
A person of ordinary skill in the art may understand that all or some of the steps of the foregoing embodiments may be implemented by hardware, or by a program instructing relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing descriptions are merely possible embodiments of this application, and are not intended to limit this application. Any modification, equivalent replacement, improvement, or the like made within the spirit and principle of this application shall fall within the protection scope of this application.

Claims (13)

  1. A graph data processing method, applied to a computer device, the method comprising:
    obtaining a target graph based on object data, the target graph comprising M nodes representing M objects, M being a positive integer greater than 1, a directed edge in the target graph being used for representing that a business relationship exists between objects corresponding to the connected nodes, the node from which the directed edge points being an initiator of the business relationship, and the node to which the directed edge points being a receiver of the business relationship;
    performing multiple rounds of iteration based on the M nodes and the directed edges between the M nodes, to obtain initiation embedding features and reception embedding features corresponding to the M nodes, the initiation embedding features being used for representing features of nodes as initiators, and the reception embedding features being used for representing features of nodes as receivers;
    wherein a z-th round of iteration comprises:
    obtaining N nodes from the M nodes, N being a positive integer less than M, and z being a positive integer;
    determining, based on directed edges associated with the N nodes, a plurality of first neighborhood nodes associated with the N nodes and a plurality of second neighborhood nodes associated with the N nodes, directed edges from the first neighborhood nodes pointing to the N nodes, and directed edges from the N nodes pointing to the plurality of second neighborhood nodes; and
    determining fourth embedding features of the N nodes based on first embedding features of the N nodes, second embedding features of the plurality of first neighborhood nodes, and third embedding features of the plurality of second neighborhood nodes, the fourth embedding features of the N nodes representing the initiation embedding features and the reception embedding features of the N nodes updated in the z-th round of iteration.
  2. The method according to claim 1, wherein the determining fourth embedding features of the N nodes based on first embedding features of the N nodes, second embedding features of the plurality of first neighborhood nodes, and third embedding features of the plurality of second neighborhood nodes comprises:
    determining a first function based on the first embedding features of the N nodes, the second embedding features of the plurality of first neighborhood nodes, node in-degree distribution information, and fifth embedding features of a plurality of first nodes, the node in-degree distribution information being used for representing a distribution of node in-degrees of the N nodes, the plurality of first nodes being obtained by negative sampling based on a first neighborhood of the N nodes, the first neighborhood comprising the plurality of first neighborhood nodes, and the first embedding features of the N nodes, the second embedding features of the plurality of first neighborhood nodes, and the third embedding features of the plurality of second neighborhood nodes being obtained by performing a (z-1)-th round of iteration;
    determining a second function based on the first embedding features of the N nodes, the third embedding features of the plurality of second neighborhood nodes, node out-degree distribution information, and sixth embedding features of a plurality of second nodes, the node out-degree distribution information being used for representing a distribution of node out-degrees of the N nodes, the plurality of second nodes being obtained by negative sampling based on a second neighborhood of the N nodes, the second neighborhood comprising the plurality of second neighborhood nodes; and
    optimizing a first objective function formed by the first function and the second function, to obtain the fourth embedding features of the N nodes.
  3. The method according to claim 2, wherein the determining a first function based on the first embedding features of the N nodes, the second embedding features of the plurality of first neighborhood nodes, node in-degree distribution information, and fifth embedding features of a plurality of first nodes comprises:
    obtaining N first intermediate embedding features from the first embedding features of the N nodes, the N first intermediate embedding features being used for representing features of the N nodes as the receiver;
    obtaining a plurality of second intermediate embedding features from the second embedding features of the plurality of first neighborhood nodes, the plurality of second intermediate embedding features being used for representing features of the plurality of first neighborhood nodes as the initiator;
    determining first expectation information based on the node in-degree distribution information, the first expectation information being used for representing a mathematical expectation that the plurality of first nodes satisfy the node in-degree distribution information;
    obtaining a plurality of third intermediate embedding features from the fifth embedding features of the plurality of first nodes, the plurality of third intermediate embedding features being used for representing features of the plurality of first nodes as the initiator; and
    determining the first function based on the N first intermediate embedding features, the plurality of second intermediate embedding features, the first expectation information, and the plurality of third intermediate embedding features.
  4. The method according to claim 2, wherein the determining a second function based on the first embedding features of the N nodes, the third embedding features of the plurality of second neighborhood nodes, node out-degree distribution information, and sixth embedding features of a plurality of second nodes comprises:
    obtaining N fourth intermediate embedding features from the first embedding features of the N nodes, the N fourth intermediate embedding features being used for representing features of the N nodes as the initiator;
    obtaining a plurality of fifth intermediate embedding features from the third embedding features of the plurality of second neighborhood nodes, the plurality of fifth intermediate embedding features being used for representing features of the plurality of second neighborhood nodes as the receiver;
    determining second expectation information based on the node out-degree distribution information, the second expectation information being used for representing a mathematical expectation that the plurality of second nodes satisfy the node out-degree distribution information;
    obtaining a plurality of sixth intermediate embedding features from the embedding features of the plurality of second nodes, the plurality of sixth intermediate embedding features being used for representing features of the plurality of second nodes as the receiver; and
    determining the second function based on the N fourth intermediate embedding features, the plurality of fifth intermediate embedding features, the second expectation information, and the plurality of sixth intermediate embedding features.
  5. The method according to claim 1, wherein the determining fourth embedding features of the N nodes based on first embedding features of the N nodes, second embedding features of the plurality of first neighborhood nodes, and third embedding features of the plurality of second neighborhood nodes comprises:
    determining first probability information based on the first embedding features of the N nodes and the second embedding features of the plurality of first neighborhood nodes, the first probability information being used for indicating probabilities of determining the plurality of first neighborhood nodes based on the N nodes;
    determining second probability information based on the first embedding features of the N nodes and the third embedding features of the plurality of second neighborhood nodes, the second probability information being used for indicating probabilities of determining the plurality of second neighborhood nodes based on the N nodes;
    determining a second objective function based on the first probability information and the second probability information; and
    optimizing the second objective function to obtain the fourth embedding features of the N nodes.
  6. The method according to claim 5, wherein the determining first probability information based on the first embedding features of the N nodes and the second embedding features of the plurality of first neighborhood nodes comprises:
    obtaining N first intermediate embedding features from the first embedding features of the N nodes, the N first intermediate embedding features being used for representing features of the N nodes as the receiver;
    obtaining a plurality of second intermediate embedding features from the second embedding features of the plurality of first neighborhood nodes, the plurality of second intermediate embedding features being used for representing features of the plurality of first neighborhood nodes as the initiator;
    obtaining M seventh intermediate embedding features of the M nodes, the M seventh intermediate embedding features being used for representing features of the M nodes as the initiator; and
    determining the first probability information based on the N first intermediate embedding features, the plurality of second intermediate embedding features, and the M seventh intermediate embedding features.
  7. The method according to claim 5, wherein the determining second probability information based on the first embedding features of the N nodes and the third embedding features of the plurality of second neighborhood nodes comprises:
    obtaining N fourth intermediate embedding features from the first embedding features of the N nodes, the N fourth intermediate embedding features being used for representing features of the N nodes as the initiator;
    obtaining a plurality of fifth intermediate embedding features from the third embedding features of the plurality of second neighborhood nodes, the plurality of fifth intermediate embedding features being used for representing features of the plurality of second neighborhood nodes as the receiver;
    obtaining M eighth intermediate embedding features of the M nodes, the M eighth intermediate embedding features being used for representing features of the M nodes as the receiver; and
    determining the second probability information based on the N fourth intermediate embedding features, the plurality of fifth intermediate embedding features, and the M eighth intermediate embedding features.
  8. The method according to any one of claims 1 to 7, wherein the method further comprises:
    partitioning the target graph into at least one graph partition;
    for any graph partition, storing the nodes in the graph partition into a parameter server;
    storing an edge table corresponding to the graph partition into an executor corresponding to the graph partition; and
    for any executor, sampling a plurality of sampled edges from the target graph based on the edge table stored in the executor, the nodes connected by the plurality of sampled edges being the N nodes;
    wherein the obtaining N nodes from the M nodes comprises:
    determining N node identifiers based on the plurality of sampled edges; and
    obtaining, from the parameter server, the N nodes corresponding to the N node identifiers.
  9. The method according to claim 8, wherein the determining fourth embedding features of the N nodes based on first embedding features of the N nodes, second embedding features of the plurality of first neighborhood nodes, and third embedding features of the plurality of second neighborhood nodes comprises:
    determining N feature gradients of the N nodes based on the first embedding features of the N nodes, the second embedding features of the plurality of first neighborhood nodes, and the third embedding features of the plurality of second neighborhood nodes; and
    uploading the N feature gradients to the parameter server, so that the parameter server determines the fourth embedding features of the N nodes based on the first embedding features of the N nodes and the N feature gradients.
  10. A graph data processing apparatus, the apparatus comprising:
    a graph obtaining module, configured to obtain a target graph based on object data, the target graph comprising M nodes representing M objects, M being a positive integer greater than 1, a directed edge in the target graph being used for representing that a business relationship exists between objects corresponding to the connected nodes, the node from which the directed edge points being an initiator of the business relationship, and the node to which the directed edge points being a receiver of the business relationship;
    an iteration module, configured to perform multiple rounds of iteration based on the M nodes and the directed edges between the M nodes, to obtain initiation embedding features and reception embedding features corresponding to the M nodes, the initiation embedding features being used for representing features of the corresponding nodes as initiators, and the reception embedding features being used for representing features of the corresponding nodes as receivers;
    wherein the iteration module comprises:
    an obtaining unit, configured to obtain N nodes from the M nodes, N being a positive integer less than M;
    a node determining unit, configured to determine, based on directed edges associated with the N nodes, a plurality of first neighborhood nodes associated with the N nodes and a plurality of second neighborhood nodes associated with the N nodes, directed edges from the first neighborhood nodes pointing to the N nodes, and directed edges from the N nodes pointing to the plurality of second neighborhood nodes; and
    a feature determining unit, configured to determine fourth embedding features of the N nodes based on first embedding features of the N nodes, second embedding features of the plurality of first neighborhood nodes, and third embedding features of the plurality of second neighborhood nodes, the fourth embedding features of the N nodes representing the initiation embedding features and the reception embedding features of the N nodes updated in a z-th round of iteration.
  11. A computer device, comprising a processor and a memory, the memory being configured to store at least one computer program, and the at least one computer program being loaded and executed by the processor to perform the graph data processing method according to any one of claims 1 to 9.
  12. A computer-readable storage medium, configured to store at least one computer program, the at least one computer program being used for performing the graph data processing method according to any one of claims 1 to 9.
  13. A computer program product, comprising computer program code stored in a computer-readable storage medium, wherein a processor of a computer device reads the computer program code from the computer-readable storage medium and executes the computer program code, so that the computer device performs the graph data processing method according to any one of claims 1 to 9.
PCT/CN2022/113580 2021-09-29 2022-08-19 Graph data processing method and apparatus, computer device, storage medium and computer program product WO2023051080A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP22874475.1A EP4318375A1 (en) 2021-09-29 2022-08-19 Graph data processing method and apparatus, computer device, storage medium and computer program product
US18/333,530 US20230334096A1 (en) 2021-09-29 2023-06-12 Graph data processing method and apparatus, computer device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111151811.8 2021-09-29
CN202111151811.8A CN115909419A (zh) 2021-09-29 2021-09-29 图数据处理方法、装置、计算机设备及存储介质

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/333,530 Continuation US20230334096A1 (en) 2021-09-29 2023-06-12 Graph data processing method and apparatus, computer device, and storage medium

Publications (1)

Publication Number Publication Date
WO2023051080A1 (zh)

Family

ID=85742981

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/113580 WO2023051080A1 (zh) 2021-09-29 2022-08-19 图数据处理方法、装置、计算机设备、存储介质及计算机程序产品

Country Status (4)

Country Link
US (1) US20230334096A1 (zh)
EP (1) EP4318375A1 (zh)
CN (1) CN115909419A (zh)
WO (1) WO2023051080A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111507543A (zh) * 2020-05-28 2020-08-07 支付宝(杭州)信息技术有限公司 Model training method and apparatus for predicting business relationship between entities
US20200342523A1 (en) * 2019-04-23 2020-10-29 Target Brands, Inc. Link prediction using hebbian graph embeddings
CN112085172A (zh) * 2020-09-16 2020-12-15 支付宝(杭州)信息技术有限公司 Graph neural network training method and apparatus
CN112541575A (zh) * 2020-12-06 2021-03-23 支付宝(杭州)信息技术有限公司 Graph neural network training method and apparatus

Also Published As

Publication number Publication date
US20230334096A1 (en) 2023-10-19
EP4318375A1 (en) 2024-02-07
CN115909419A (zh) 2023-04-04

Legal Events

Date Code Title Description
WWE  Wipo information: entry into national phase
     Ref document number: 2022874475
     Country of ref document: EP
ENP  Entry into the national phase
     Ref document number: 2022874475
     Country of ref document: EP
     Effective date: 20231030
NENP Non-entry into the national phase
     Ref country code: DE