Disclosure of Invention
Embodiments of the present disclosure aim to provide a more efficient solution for acquiring a node sequence of a relational network based on a distributed system, so as to address deficiencies in the prior art.
To achieve the above object, an aspect of the present specification provides a method of acquiring a node sequence of a relational network based on a distributed system including a plurality of servers and a plurality of working machines, the relational network including a plurality of nodes connected to each other, wherein each node has a node identification, one of at least one type, a plurality of out-degree values respectively corresponding to the respective types, and an adjacent node, the method being performed in a first working machine among the plurality of working machines, including:
acquiring a plurality of out-degree values of each of at least one first node in the plurality of nodes, and transmitting the acquired data to at least one first designated server in the plurality of servers;
receiving an accumulation matrix from the plurality of servers, the accumulation matrix showing, for each of the plurality of nodes and for each of the at least one type, the sum of the out-degree values of that type of the respective nodes whose node identifications precede that of the node;
acquiring node identifiers and types of respective adjacent nodes of the at least one first node;
calculating an arrangement position of node identifiers of respective adjacent nodes of the at least one first node in an edge vector based on node identifiers and types of respective adjacent nodes of the at least one first node and the accumulation matrix to acquire partial elements of the edge vector, and sending the partial elements of the edge vector to at least one second designated server in the plurality of servers, wherein the edge vector comprises at least one part corresponding to the at least one type respectively, and node identifiers of respective adjacent nodes of the respective corresponding types of the plurality of nodes are sequentially arranged in each part;
receiving the edge vector from the plurality of servers; and
sequentially and randomly acquiring, according to a predetermined path and based on the respective types of the plurality of nodes, the accumulation matrix and the edge vector, a plurality of node identifiers as the node sequence, wherein the predetermined path defines the types of the nodes to which the plurality of node identifiers respectively correspond.
In one embodiment, sequentially and randomly acquiring the plurality of node identifiers as the node sequence includes randomly acquiring, according to the predetermined path, one node identifier from the node identifiers of the nodes of a first predetermined type among the plurality of nodes as a first node identifier of the node sequence.
In one embodiment, sequentially and randomly acquiring the plurality of node identifiers as the node sequence comprises:
acquiring, according to the predetermined path and based on the accumulation matrix, an out-degree value of a second predetermined type of the node corresponding to the first node identifier;
randomly acquiring a first integer based on the out-degree value; and
calculating an arrangement position in the edge vector based on the first integer, the second predetermined type, the first node identifier, the accumulation matrix and the edge vector, so as to obtain the node identifier at the arrangement position as a second node identifier.
In one embodiment, the relationship network is a bipartite graph network, and the at least one type includes a user type and a commodity type.
In one embodiment, the element in the ith row and jth column of the accumulation matrix is the sum of the type-i out-degree values of the nodes whose node identifications are 0 through j-1, where i and j are both counted from 0.
In one embodiment, the node identifications of the at least one adjacent node of type i of the node whose node identification is j occupy at least one position in the edge vector starting from a first position, and are arranged in the at least one position in order from small to large, wherein the column number of the first position is equal to the starting column number of the portion of the edge vector corresponding to type i plus the value of the element in the ith row and jth column of the accumulation matrix, and wherein column numbers in the edge vector are counted from 0.
In one embodiment, sequentially randomly acquiring the plurality of node identities, respectively, according to the predetermined path ends in any one of the following cases:
the number of the plurality of node identifiers reaches a predetermined number;
the next node identity cannot be found.
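The random acquisition described in the embodiments above can be sketched as follows. This is a minimal single-process illustration, not the distributed implementation: the 4-node bipartite graph, the function names build_index and random_walk, and the use of the difference idx[t][j+1] - idx[t][j] to recover an out-degree value from the accumulation matrix are assumptions made for the example. The walk ends either when the path length is reached or when the current node has no adjacent node of the required type.

```python
import random

# Hypothetical 4-node bipartite graph (NOT the network of fig. 3).
NODE_TYPE = [0, 1, 0, 1]                         # type of node j
NEIGHBORS = {0: [1, 3], 1: [0, 2], 2: [1], 3: [0]}
NUM_TYPES = 2


def build_index(node_type, neighbors, num_types):
    """Build the accumulation matrix, the part starts s(i), and the edge vector."""
    n = len(node_type)
    # idx[i][j] = sum of the type-i out-degree values of nodes 0 .. j-1
    idx = [[0] * (n + 1) for _ in range(num_types)]
    for j in range(n):
        for i in range(num_types):
            deg = sum(1 for v in neighbors[j] if node_type[v] == i)
            idx[i][j + 1] = idx[i][j] + deg
    # start[i] = first column of the type-i portion of the edge vector
    start = [0] * num_types
    for i in range(1, num_types):
        start[i] = start[i - 1] + idx[i - 1][n]
    edge = [None] * (start[-1] + idx[-1][n])
    for j in range(n):
        for i in range(num_types):
            ids = sorted(v for v in neighbors[j] if node_type[v] == i)
            for m, v in enumerate(ids):
                edge[start[i] + idx[i][j] + m] = v
    return idx, start, edge


def random_walk(path, node_type, idx, start, edge, rng):
    """Walk along `path` (a list of node types); stop early at a dead end."""
    candidates = [j for j, t in enumerate(node_type) if t == path[0]]
    if not candidates:
        return []
    walk = [rng.choice(candidates)]              # first node identifier
    for t in path[1:]:
        j = walk[-1]
        deg = idx[t][j + 1] - idx[t][j]          # type-t out-degree of node j
        if deg == 0:                             # next node cannot be found
            break
        m = rng.randrange(deg)                   # the "first integer"
        walk.append(edge[start[t] + idx[t][j] + m])
    return walk


idx, start, edge = build_index(NODE_TYPE, NEIGHBORS, NUM_TYPES)
print(random_walk([0, 1, 0, 1], NODE_TYPE, idx, start, edge, random.Random(0)))
```

Each step only reads two entries of the accumulation matrix and one entry of the edge vector, so no per-node adjacency list needs to be kept in memory during the walk.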
In one embodiment, the step of sequentially randomly acquiring the plurality of node identifications, respectively, according to the predetermined path, is looped a plurality of times, the method further comprising writing a predetermined number of rows of node sequences acquired through the predetermined number of loops into the database.
In one embodiment, the step of sequentially and randomly acquiring the plurality of node identifiers according to the predetermined path stops looping when the sum of the numbers of rows of node sequences respectively acquired by the plurality of working machines reaches a predetermined value.
In one embodiment, sequentially randomly acquiring a plurality of node identities as the node sequence according to a predetermined path based on the respective type of the plurality of nodes, the accumulation matrix and the edge vector, respectively, includes receiving the accumulation matrix and the edge vector from the plurality of servers in the event that the first working machine loses memory data.
Another aspect of the present specification provides a method of acquiring a node sequence of a relational network based on a distributed system including a plurality of servers and a plurality of working machines, the relational network including a plurality of nodes connected to each other, wherein each node has a node identification, one of at least one type, a plurality of out-degree values respectively corresponding to the respective types, and an adjacent node, the method being performed in a first server of the plurality of servers, comprising:
receiving a plurality of out-degree values from at least one of the working machines, the plurality of out-degree values respectively corresponding to a designated plurality of second nodes among the plurality of nodes;
calculating at least part of the elements of an accumulation matrix by receiving corresponding out-degree value data from other servers of the plurality of servers, and transmitting the at least part of the elements of the accumulation matrix to each of the working machines, wherein the accumulation matrix shows, for each of the plurality of nodes and for each of the at least one type, the sum of the out-degree values of that type of the respective nodes whose node identifications precede that of the node;
receiving partial elements of an edge vector from at least one of the working machines, wherein the edge vector includes at least one portion respectively corresponding to the at least one type, and node identifications of adjacent nodes of the respective corresponding type of the plurality of nodes are sequentially arranged in each portion; and
sending the partial elements of the edge vector to each of the working machines.
In one embodiment, the method further comprises, after computing at least a portion of the elements of the accumulation matrix, transmitting the at least a portion of the elements of the accumulation matrix to at least one other server of the plurality of servers to backup the at least a portion of the elements of the accumulation matrix.
In one embodiment, the method further comprises, after obtaining the partial elements of the edge vector, sending the partial elements of the edge vector to at least one other server of the plurality of servers to backup the partial elements of the edge vector.
In one embodiment, the method further comprises, prior to sending the partial elements of the edge vectors to each of the work machines, receiving the partial elements of the corresponding edge vectors from all other servers of the plurality of servers to obtain the edge vectors, wherein sending the partial elements of the edge vectors to each of the work machines comprises sending the edge vectors to each of the work machines.
In one embodiment, the method further comprises, after transmitting the partial elements of the edge vector to each of the work machines, transmitting at least a portion of the elements of the accumulation matrix and the partial elements of the edge vector to the respective work machine in response to a request from the work machine.
Another aspect of the present disclosure provides an apparatus for acquiring a node sequence of a relational network based on a distributed system, the distributed system including a plurality of servers and a plurality of working machines, the relational network including a plurality of nodes connected to each other, wherein each node has a node identifier, one of at least one type, a plurality of out-degree values respectively corresponding to the respective types, and an adjacent node, the apparatus being implemented in a first working machine of the plurality of working machines, including:
a first obtaining unit configured to obtain a plurality of out-degree values of each of at least one first node in the plurality of nodes, and send the obtained data to at least one first designated server in the plurality of servers;
a first receiving unit configured to receive an accumulation matrix from the plurality of servers, the accumulation matrix showing, for each of the plurality of nodes and for each of the at least one type, the sum of the out-degree values of that type of the respective nodes whose node identifications precede that of the node;
a second obtaining unit configured to obtain node identifiers and types of respective adjacent nodes of the at least one first node;
a calculation unit configured to calculate, based on the node identifiers and types of the respective adjacent nodes of the at least one first node and the accumulation matrix, arrangement positions in an edge vector of the node identifiers of the respective adjacent nodes of the at least one first node, so as to obtain partial elements of the edge vector, and to send the partial elements of the edge vector to at least one second designated server of the plurality of servers, wherein the edge vector includes at least one portion respectively corresponding to the at least one type, and node identifiers of adjacent nodes of the respective corresponding type of the plurality of nodes are sequentially arranged in each portion;
a second receiving unit configured to receive the edge vector from the plurality of servers; and
a third obtaining unit configured to sequentially and randomly obtain, according to a predetermined path and based on the respective types of the plurality of nodes, the accumulation matrix and the edge vector, a plurality of node identifiers as the node sequence, wherein the predetermined path defines the types of the nodes to which the plurality of node identifiers respectively correspond.
In one embodiment, the third obtaining unit is further configured to randomly obtain, as the first node identifier of the node sequence, one node identifier from node identifiers of each of a plurality of nodes of the first predetermined type among the plurality of nodes according to a predetermined path.
In one embodiment, the third acquisition unit includes:
a first acquisition subunit configured to acquire, according to the predetermined path and based on the accumulation matrix, an out-degree value of a second predetermined type of the node corresponding to the first node identifier;
a second acquisition subunit configured to randomly acquire a first integer based on the out-degree value; and
a calculating subunit configured to calculate an arrangement position in the edge vector based on the first integer, the second predetermined type, the first node identifier, the accumulation matrix and the edge vector, so as to acquire the node identifier at the arrangement position as a second node identifier.
In one embodiment, the third acquisition unit ends implementation in any of the following cases:
the number of the plurality of node identifiers reaches a predetermined number;
the next node identity cannot be found.
In an embodiment, the third acquisition unit is implemented in a number of loops, and the apparatus further comprises a writing unit configured to write a sequence of nodes of a predetermined number of rows acquired by a predetermined number of loops into the database.
In one embodiment, the third acquisition unit stops looping when the sum of the numbers of rows of node sequences respectively acquired by the plurality of working machines reaches a predetermined value.
In one embodiment, the third obtaining unit includes a receiving subunit configured to receive the accumulation matrix and the edge vector from the plurality of servers in a case where the first working machine loses memory data.
Another aspect of the present specification provides an apparatus for acquiring a node sequence of a relational network based on a distributed system including a plurality of servers and a plurality of working machines, the relational network including a plurality of nodes connected to each other, wherein each node has a node identification, one of at least one type, a plurality of out-degree values respectively corresponding to the respective types, and neighboring nodes, the apparatus being implemented in a first server of the plurality of servers, comprising:
a first receiving unit configured to receive, from at least one of the working machines, a plurality of out-degree values respectively corresponding to a designated plurality of second nodes among the plurality of nodes;
a computing unit configured to compute at least part of the elements of an accumulation matrix by receiving corresponding out-degree value data from other servers of the plurality of servers, and to send the at least part of the elements of the accumulation matrix to each of the working machines, wherein the accumulation matrix shows, for each of the plurality of nodes and for each of the at least one type, the sum of the out-degree values of that type of the respective nodes whose node identifications precede that of the node;
a second receiving unit configured to receive partial elements of an edge vector from at least one of the working machines, wherein the edge vector includes at least one portion respectively corresponding to the at least one type, and node identifications of adjacent nodes of the respective corresponding type of the plurality of nodes are sequentially arranged in each portion; and
a first transmitting unit configured to transmit the partial elements of the edge vector to each of the working machines.
In an embodiment, the apparatus further comprises a second sending unit configured to send at least part of the elements of the accumulation matrix to at least one other server of the plurality of servers after calculating the at least part of the elements of the accumulation matrix, so as to backup the at least part of the elements of the accumulation matrix.
In one embodiment, the apparatus further includes a third sending unit configured to send the partial element of the edge vector to at least one other server of the plurality of servers to backup the partial element of the edge vector after the partial element of the edge vector is acquired.
In an embodiment, the apparatus further comprises a third receiving unit configured to receive the partial elements of the respective edge vectors from all other servers of the plurality of servers to obtain the edge vectors before transmitting the partial elements of the edge vectors to each of the working machines, wherein the first transmitting unit is further configured to transmit the edge vectors to each of the working machines.
In an embodiment, the apparatus further comprises a fourth transmitting unit configured to transmit at least part of the elements of the accumulation matrix and part of the elements of the edge vector to the respective working machine in response to a request of the working machine after transmitting part of the elements of the edge vector to each of the working machines.
Another aspect of the present disclosure provides a computing device comprising a memory and a processor, wherein the memory has executable code stored therein, and wherein the processor, when executing the executable code, implements any of the methods described above.
With the solution of acquiring the node sequence of the relational network based on the distributed system according to the embodiments of the present specification, computing power increases nearly in proportion to the number of machines used for parallel computing, and, with the distributed architecture of the parameter server, graphs with a very large number of nodes can be processed. In addition, the solution optimizes the graph storage scheme, thereby reducing memory usage.
Detailed Description
Embodiments of the present specification will be described below with reference to the accompanying drawings.
Fig. 1 shows a schematic diagram of a distributed system 100 for acquiring a node sequence of a relational network according to an embodiment of the present specification. As shown in fig. 1, the system 100 includes a database 11, a plurality of working machines 12 (three are schematically shown in the figure), and a plurality of servers 13 (three are schematically shown in the figure). When the method according to the embodiment of the present specification is carried out by the system 100, first, the working machines 12 each read graph data from the database 11 in parallel, the graph data including the out-degree values, corresponding to the different types, of a respective subset of the nodes in the relational network, and transmit the graph data to at least one designated server 13. Then, the plurality of servers 13 acquire an accumulation matrix based on the out-degree values they have received, and transmit the accumulation matrix to each working machine 12. Then, each working machine 12 again reads from the database 11, in parallel, the node identifiers and types of the adjacent nodes of its corresponding nodes, calculates, based on that data and the accumulation matrix, the arrangement positions in an edge vector of the node identifiers of those adjacent nodes so as to acquire partial elements of the edge vector, and transmits the partial elements to the designated server 13. After acquiring their respective partial elements of the edge vector, the plurality of servers 13 each transmit those partial elements to each working machine 12, so that each working machine 12 acquires all the elements of the edge vector. After acquiring the edge vector, each working machine 12 randomly acquires node sequences of the relational network according to a predetermined path, based on the accumulation matrix and the edge vector.
It is to be understood that the system 100 shown in fig. 1 and the above description of it are illustrative only, and the distributed system according to embodiments of the present specification is not limited thereto. For example, the system does not necessarily include a database, and the working machines may instead acquire data from an external database via a network or the like. As another example, the plurality of servers 13, upon receiving their respective partial elements of the edge vector, may send them to a designated server, which joins the partial elements together to obtain the edge vector and sends the edge vector to each working machine 12.
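The joining of partial elements by a designated server, mentioned in the last example, might look like the following sketch. It assumes that each partial element set is represented as a mapping from column number to node identification; that representation, the function name merge_partial_elements, and the sample values are hypothetical and are not taken from the figures.

```python
def merge_partial_elements(length, parts):
    """Join partial elements (column -> node id) into one edge vector."""
    edge = [None] * length
    for part in parts:
        for column, node_id in part.items():
            if edge[column] is not None:
                raise ValueError(f"column {column} received twice")
            edge[column] = node_id
    missing = [c for c, v in enumerate(edge) if v is None]
    if missing:
        raise ValueError(f"columns {missing} were never received")
    return edge


# Hypothetical partial elements held by two servers for a 4-column edge vector.
parts = [{0: 5, 1: 0}, {2: 2, 3: 1}]
edge_vector = merge_partial_elements(4, parts)
```

Checking for duplicate and missing columns is a design choice of this sketch; it makes an incomplete or inconsistent merge fail loudly instead of producing a vector with gaps.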
Fig. 2 shows a method for acquiring a node sequence of a relational network based on a distributed system according to an embodiment of the present specification, the distributed system including a plurality of servers and a plurality of working machines, the relational network including a plurality of nodes connected to each other, wherein each node has a node identification, one of at least one type, a plurality of out-degree values respectively corresponding to the respective types, and an adjacent node, the method being performed in a first working machine of the plurality of working machines, including:
in step S202, a plurality of out-degree values of each of at least one first node in the plurality of nodes are acquired, and the acquired data is sent to at least one first designated server in the plurality of servers;
in step S204, an accumulation matrix is received from the plurality of servers, the accumulation matrix showing, for each of the plurality of nodes and for each of the at least one type, the sum of the out-degree values of that type of the respective nodes whose node identifications precede that of the node;
in step S206, the node identifier and the type of each neighboring node of each of the at least one first node are obtained;
in step S208, based on the node identifiers and types of the respective neighboring nodes of the at least one first node and the accumulation matrix, an arrangement position of the node identifiers of the respective neighboring nodes of the at least one first node in an edge vector is calculated to obtain a partial element of the edge vector, and the partial element of the edge vector is sent to at least one second designated server in the plurality of servers, where the edge vector includes at least one portion corresponding to the at least one type, and each portion has node identifiers of neighboring nodes of respective corresponding types of the plurality of nodes sequentially arranged therein;
in step S210, the edge vector is received from the plurality of servers; and
in step S212, a plurality of node identifiers are sequentially and randomly acquired as the node sequence according to a predetermined path, based on the respective types of the plurality of nodes, the accumulation matrix and the edge vector, wherein the predetermined path defines the type of each node to which the plurality of node identifiers respectively correspond.
The distributed system is a parameter server system, which mainly comprises two parts: a server cluster (a plurality of server nodes) and a working machine cluster (a plurality of working machine nodes). The server cluster mainly implements model parallelism, that is, the full set of model parameters, such as the full accumulation matrix and edge vector, is maintained in the memory of the server cluster. The working machine cluster reads different graph data from the database and performs computation in parallel. For example, the method shown in fig. 2 is performed in parallel in each working machine.
The relational network may be any of various relational networks including a plurality of nodes, and the nodes may be of a plurality of types, such as two types, three types, and so on. For example, the relational network may be a bipartite graph network that includes two types of nodes, namely user nodes and commodity nodes. Fig. 3 schematically shows a relational network and the corresponding graph data. The left side of fig. 3 shows a relational network comprising six nodes whose node identifications (ids) are 0, 1, …, 5. The number outside each circle represents the type of the node; as can be seen from the figure, each of the six nodes is of type 0 or type 1. The table (database) on the right side of fig. 3 shows the graph data corresponding to the relational network on the left. As shown in the figure, the first column from the left of the table is the node identification, the second column is the type-0 out-degree value of each node, the third column is the type-1 out-degree value of each node, the fourth column lists the adjacent nodes of each node, and the fifth column is the type of each node. The out-degree value of a given type of a given node is the number of adjacent nodes of that node that are of that type. For example, the adjacent nodes of node 0 are node 1 and node 5, where node 5 is of type 0 and node 1 is of type 1; that is, node 0 has a type-0 out-degree value of 1 and a type-1 out-degree value of 1.
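The definition of the out-degree value can be checked with a short sketch. Only the facts stated in this description for nodes 0 and 1 of fig. 3 are used (the adjacent nodes of node 1, namely nodes 0 and 2, both of type 0, are taken from the worked example given later for working machine 0); the dictionary layout and the function name out_degrees are assumptions made for illustration.

```python
from collections import Counter

# Node types and adjacent nodes as stated in the description for fig. 3
# (only the entries the text gives explicitly are listed here).
NODE_TYPE = {0: 0, 1: 1, 2: 0, 5: 0}
NEIGHBORS = {0: [1, 5], 1: [0, 2]}


def out_degrees(node, num_types=2):
    """Return [type-0 out-degree, type-1 out-degree, ...] of `node`."""
    counts = Counter(NODE_TYPE[v] for v in NEIGHBORS[node])
    return [counts.get(t, 0) for t in range(num_types)]


print(out_degrees(0))  # [1, 1]
print(out_degrees(1))  # [2, 0]
```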
First, in step S202, a plurality of out-degree values of each of at least one first node among the plurality of nodes are acquired, and the acquired data is transmitted to at least one first designated server of the plurality of servers.
Fig. 4 schematically shows the process in step S202. Fig. 4 schematically shows working machines 0, 1, and 2 and servers 0, 1, and 2; it is to be understood that the numbers of working machines and servers are merely illustrative, not restrictive. The first working machine may be any one of working machines 0, 1, and 2, and working machine 0 is described below as an example of the first working machine. In fig. 4, working machine 0 acquires, for example, the type-0 and type-1 out-degree values of each of node 0 and node 1 from the database shown in fig. 3. In working machine 0, the out-degree value information is stored in a degree matrix, and the degree matrix in each working machine is shown in the lower part of fig. 4. As shown in fig. 4, in working machine 0, row 0 of the degree matrix corresponds to type 0 and row 1 corresponds to type 1; column 0 does not correspond to any node, column 1 corresponds to node 0, and column 2 corresponds to node 1, that is, column j of the degree matrix corresponds to node j-1. Here, both the row number i and the column number j are counted from 0. The element in row i and column j of the degree matrix represents the type-i out-degree value of node j-1; for example, the element in row 0 and column 1 represents the type-0 out-degree value of node 0, namely 1, and the element in row 0 and column 2 represents the type-0 out-degree value of node 1, namely 2. The elements of column 0 are set to 0 for both type 0 and type 1. Each working machine acquires part of the data in the database and thereby acquires partial elements of the degree matrix, and the partial elements acquired by all the working machines together form the complete degree matrix. The upper left matrix in fig. 4 shows the complete degree matrix.
After acquiring its part of the degree matrix, working machine 0 transmits the data to the designated servers (here, server 0 and server 1). That is, in the distributed system, server 0 is designated to receive the data of columns 0 and 1 of the degree matrix, and server 1 is designated to receive the data of columns 2 and 3; therefore, according to this designation, working machine 0 transmits the data of columns 0 and 1 of its degree matrix to server 0 and the data of column 2 to server 1. It is to be understood that this designation of servers is illustrative and not limiting; for example, columns 5 and 6 of the degree matrix of working machine 2 are both transmitted to a single designated server, namely server 2.
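The column layout of the degree matrix and the routing of columns to designated servers can be sketched as below for working machine 0. The out-degree values and the column-to-server assignments are those stated in the description; the dictionary-based message format and the function names are assumptions, and the assignment table deliberately lists only the columns whose designated server the text names.

```python
# Type-0 and type-1 out-degree values of nodes 0 and 1, as stated in the text.
OUT_DEGREE = {0: [1, 1], 1: [2, 0]}

# Column -> designated server, limited to the assignments stated in the text.
COLUMN_SERVER = {0: 0, 1: 0, 2: 1, 3: 1, 5: 2, 6: 2}


def partial_degree_columns(out_degree, num_types=2):
    """Column j holds the out-degree values of node j-1; column 0 is all zeros."""
    columns = {0: [0] * num_types}
    for node, degrees in out_degree.items():
        columns[node + 1] = list(degrees)
    return columns


def group_by_server(columns, column_server):
    """Group the columns held by one working machine by their designated server."""
    messages = {}
    for column, values in sorted(columns.items()):
        messages.setdefault(column_server[column], {})[column] = values
    return messages


columns = partial_degree_columns(OUT_DEGREE)
messages = group_by_server(columns, COLUMN_SERVER)
# Server 0 receives columns 0 and 1; server 1 receives column 2.
print(messages)
```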
In step S204, an accumulation matrix is received from the plurality of servers, the accumulation matrix showing, for each of the plurality of nodes and for each of the at least one type, the sum of the out-degree values of that type of the respective nodes whose node identifications precede that of the node.
After receiving the partial elements of the degree matrix transmitted from each working machine, the plurality of servers acquire the complete degree matrix and acquire an accumulation matrix based on it; the specific process of acquiring the accumulation matrix is described in detail below. The upper right table in fig. 4 schematically shows the accumulation matrix. As shown in the figure, the element in row i and column j of the accumulation matrix is equal to the sum of the elements in columns 0 through j of row i of the degree matrix. For example, the value in row 0 and column 2 of the accumulation matrix (i=0, j=2, with i and j both counted from 0) is equal to the sum of the elements in columns 0, 1, and 2 of row 0 of the degree matrix, that is, 0+1+2=3. From the accumulation matrix, the position in the edge vector (described below) of the adjacent nodes of each type of each node can be determined. For example, the position in the edge vector of the type-0 adjacent nodes of node 2 corresponds to the value in row 0 and column 2 of the accumulation matrix, namely 3. The accumulation matrix may therefore also be referred to as an index matrix, since it can be used to index the positions of the adjacent nodes of each node in the edge vector. The last column of the accumulation matrix shows the numbers of edges of the different types in the relational network; for example, the last column shows 9 and 5, where 9 is the number of type-0 edges and 5 is the number of type-1 edges in the relational network. Here, a given node and any one of its adjacent nodes form a directed edge, and the type of the directed edge is the type of the adjacent node.
After acquiring the complete accumulation matrix, the plurality of servers transmit it to each working machine, such as working machine 0, working machine 1, and working machine 2, so that working machine 0 receives the accumulation matrix from the plurality of servers.
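The relation between the degree matrix and the accumulation matrix is a row-wise running sum, which can be sketched as follows. Only columns 0 to 2 of the degree matrix (the part held by working machine 0, as stated in the description) are used, since the remaining columns appear only in fig. 4; the function names are assumptions. The helper out_degree is an observation that follows from the definition rather than a step stated in the text: adjacent columns of the accumulation matrix differ by exactly one out-degree value.

```python
from itertools import accumulate

# Columns 0..2 of the degree matrix (row 0 = type 0, row 1 = type 1).
DEGREE_ROWS = [[0, 1, 2], [0, 1, 0]]


def accumulation_rows(degree_rows):
    """idx[i][j] = sum of columns 0..j of row i of the degree matrix."""
    return [list(accumulate(row)) for row in degree_rows]


def out_degree(idx, i, j):
    """Type-i out-degree value of node j, recovered from the accumulation matrix."""
    return idx[i][j + 1] - idx[i][j]


idx = accumulation_rows(DEGREE_ROWS)
print(idx)                    # [[0, 1, 3], [0, 1, 1]]
print(out_degree(idx, 0, 1))  # 2
```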
In step S206, node identifiers and types of respective neighboring nodes of the at least one first node are obtained.
For example, working machine 0 acquires from the database shown in fig. 3 that the adjacent nodes of node 0 are node 1 and node 5, and that node 1 is of type 1 and node 5 is of type 0. Similarly, working machine 0 may acquire the node identifications and types of the adjacent nodes of node 1 from the database.
In step S208, based on the node identifiers and types of the respective neighboring nodes of the at least one first node and the accumulation matrix, an arrangement position of the node identifiers of the respective neighboring nodes of the at least one first node in an edge vector is calculated to obtain a partial element of the edge vector, and the partial element of the edge vector is sent to at least one second designated server in the plurality of servers, where the edge vector includes at least one portion corresponding to the at least one type, and node identifiers of the respective neighboring nodes of the respective corresponding types of the plurality of nodes are sequentially arranged in each portion.
Fig. 5 schematically shows the process in step S208. The edge vector includes information of all edges of the relational network. As described above, the last column of the accumulation matrix in fig. 4 shows the number of edges of each type in the relationship network, i.e., the number of edges of type 0 is 9 and the number of edges of type 1 is 5. That is, 14 elements should be included in the edge vector. Fig. 5 shows the edge vector, the number of columns of the edge vector is counted from 0, and the edge vector comprises 14 elements, namely 0, 1, … and 13, wherein the elements from 0 th column to 8 th column are node identifications of adjacent nodes of respective types 0 to 5 of the sequentially arranged nodes, and the elements from 9 th column to 13 th column are node identifications of adjacent nodes of respective types 1 of the sequentially arranged nodes 0 to 5. That is, in the edge vector, the edge information is divided into two parts in type 0 and type 1, and in each part, the edge information is represented by the node identification of the neighboring node of the corresponding type of each node.
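The layout of the edge vector follows directly from the per-type edge counts in the last column of the accumulation matrix. The sketch below derives the starting column of each part and the total length from the counts 9 and 5 given above; the function name part_starts is an assumption.

```python
from itertools import accumulate

# Last column of the accumulation matrix: number of type-0 and type-1 edges.
EDGE_COUNTS = [9, 5]


def part_starts(edge_counts):
    """Starting column of each type's part of the edge vector."""
    return [0] + list(accumulate(edge_counts))[:-1]


starts = part_starts(EDGE_COUNTS)
length = sum(EDGE_COUNTS)
print(starts, length)  # [0, 9] 14
```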
Thus, based on this structure of the edge vector, the positions of the neighboring nodes of a given node in the edge vector can be easily found. As described above, the accumulation matrix gives the positions (columns) in the edge vector of the neighboring nodes of each type of each node. For example, where node j has a plurality of neighboring nodes of type i, the column number j' in the edge vector of the node identification of each of these neighboring nodes is given by equation (1):
j' = s(i) + idx[i][j] + m (1)
where s(i) represents the starting column number, in the edge vector, of the node identifications of the neighboring nodes of type i. For example, in the edge vector shown in fig. 5, s(0) represents the starting position of the node identifications of the nodes of type 0, i.e., 0, and s(1) represents the starting position of the node identifications of the nodes of type 1, i.e., 9. idx[i][j] represents the element in the ith row and jth column of the accumulation matrix, where i and j are counted from 0. m is the sequence number of the neighboring node among the plurality of type-i neighboring nodes of node j when these are ordered from small to large, where m is counted from 0.
Taking working machine 0 as an example, it needs to find the positions (columns) in the edge vector of the neighboring nodes 1 (type 1) and 5 (type 0) of node 0, and of the neighboring nodes 0 (type 0) and 2 (type 0) of node 1. For neighboring node 1 of node 0, which is of type 1, i=1 and j=0, so according to equation (1),
j' = s(1) + idx[1][0] + 0 = 9 + 0 + 0 = 9,
so that the element in column 9 of the edge vector is the node identification of node 1, i.e., 1.
For neighboring node 5 of node 0, which is of type 0, i=0 and j=0, so according to equation (1),
j' = s(0) + idx[0][0] + 0 = 0 + 0 + 0 = 0,
so that the element in column 0 of the edge vector is the node identification of node 5, i.e., 5.
For neighboring node 0 of node 1, which is of type 0, i=0 and j=1, so according to equation (1),
j' = s(0) + idx[0][1] + 0 = 0 + 1 + 0 = 1,
so that the element in column 1 of the edge vector is the node identification of node 0, i.e., 0.
For neighboring node 2 of node 1, which is also of type 0, the starting position of the type-0 neighboring nodes of node 1 is the position of neighboring node 0 (since 0<2), i.e., column 1, and neighboring node 2 occupies the next position after the starting position, i.e., column 2, so the element in column 2 of the edge vector is the node identification of node 2, i.e., 2.
That is, j' = s(0) + idx[0][1] + 1 = 0 + 1 + 1 = 2.
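The four calculations above can be reproduced with a short Python sketch of equation (1). The function name edge_position and the dictionary layout are illustrative assumptions and are not part of the described method; only the accumulation-matrix entries and starting columns quoted in the text are used.

```python
# Sketch of equation (1): j' = s(i) + idx[i][j] + m.
# The function name and data layout are illustrative assumptions; only
# values quoted in the worked example above are used.

# Starting column s(i) of each type's part of the edge vector (fig. 5).
s = {0: 0, 1: 9}

# Accumulation-matrix entries idx[i][j] quoted in the text, keyed by (i, j).
idx = {(0, 0): 0, (0, 1): 1, (1, 0): 0}


def edge_position(i, j, m):
    """Return the column of the m-th (0-based) type-i neighbor of node j."""
    return s[i] + idx[(i, j)] + m


# (type i, node j, rank m) -> node identification of that neighboring node
lookups = {(1, 0, 0): 1, (0, 0, 0): 5, (0, 1, 0): 0, (0, 1, 1): 2}
partial_elements = {edge_position(i, j, m): nid for (i, j, m), nid in lookups.items()}
print(partial_elements)  # {9: 1, 0: 5, 1: 0, 2: 2}
```

The resulting mapping from column to node identification corresponds to the partial elements of the edge vector held by working machine 0.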
Each working machine, after calculating the arrangement positions in the edge vector of the node identifications of the respective neighboring nodes of the at least one first node to obtain partial elements of the edge vector, sends those partial elements to at least one second designated server in the plurality of servers. For example, as shown in fig. 5, the system designates server 0 to receive the elements of columns 0 to 4 of the edge vector and server 2 to receive the elements of columns 9 to 13, so that working machine 0 transmits the elements of columns 0 to 2 that it has acquired to server 0 and the element of column 9 to server 2.
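As a minimal sketch of this routing, the following Python groups the partial elements of working machine 0 by the server designated for each column. The ranges for server 0 and server 2 are those stated above; the range for server 1 (columns 5 to 8) and the names SERVER_RANGES and route_elements are assumptions made only for illustration.

```python
# Column ranges designated to each server. Servers 0 and 2 follow the text;
# the range for server 1 is an assumption made to complete the example.
SERVER_RANGES = {0: range(0, 5), 1: range(5, 9), 2: range(9, 14)}


def route_elements(partial):
    """Group {column: node_id} pairs by the server designated for each column."""
    batches = {}
    for column, node_id in partial.items():
        for server, columns in SERVER_RANGES.items():
            if column in columns:
                batches.setdefault(server, {})[column] = node_id
                break
    return batches


# Partial elements obtained by working machine 0 in the worked example.
machine_0 = {9: 1, 0: 5, 1: 0, 2: 2}
batches = route_elements(machine_0)
print(batches)  # {2: {9: 1}, 0: {0: 5, 1: 0, 2: 2}}
```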
In step S210, the edge vector is received from the plurality of servers. The plurality of servers may acquire the complete edge vector after receiving the partial elements of the edge vector from the respective work machines, respectively, and transmit the complete edge vector to the respective work machines, that is, each work machine acquires the complete edge vector from the plurality of servers.
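The assembly of the complete edge vector from partial elements might look like the following Python sketch. The fragment for working machine 0 is taken from the worked example; the fragments for the other two working machines are hypothetical values chosen only to fill the remaining columns, and the function name assemble_edge_vector is likewise an assumption.

```python
def assemble_edge_vector(partials, length):
    """Merge {column: node_id} fragments from several working machines."""
    vector = [None] * length
    for fragment in partials:
        for column, node_id in fragment.items():
            vector[column] = node_id
    missing = [c for c, v in enumerate(vector) if v is None]
    if missing:
        raise ValueError(f"edge vector columns not yet received: {missing}")
    return vector


# The fragment of working machine 0 is taken from the text; the other two
# fragments are hypothetical and only illustrate the merge.
from_machine_0 = {0: 5, 1: 0, 2: 2, 9: 1}
from_machine_1 = {3: 3, 4: 2, 10: 1, 11: 4, 12: 4}
from_machine_2 = {5: 2, 6: 3, 7: 5, 8: 0, 13: 4}
edge_vector = assemble_edge_vector([from_machine_0, from_machine_1, from_machine_2], 14)
print(edge_vector)  # [5, 0, 2, 3, 2, 2, 3, 5, 0, 1, 1, 4, 4, 4]
```

In the described system the columns are held by different servers, so this merge would in practice take place either on a designated server or on each working machine as it receives the parts.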
In step S212, a plurality of node identifiers are sequentially and randomly acquired as the node sequence according to a predetermined path, based on the respective types of the plurality of nodes, the accumulation matrix and the edge vector, wherein the predetermined path defines the type of each node to which the plurality of node identifiers respectively correspond.
Fig. 6 schematically shows the process in step S212. The predetermined path, i.e., the meta-path, defines the type of each node in the node sequence. For example, in the example shown in fig. 6, the meta-path is 0,1,0, i.e., the node sequence starts with a node of type 0 and alternates between type 0 and type 1. The lower part of fig. 6 schematically shows a plurality of rows of node sequences, and the node sequence in the fourth row (2, 1, 0, 1 … 5, 4) is described below as an example node sequence. In the example node sequence, node 2, corresponding to the 1st node identification 2, is a node of type 0; node 1, corresponding to the 2nd node identification 1, is a node of type 1; node 0, corresponding to the 3rd node identification 0, is a node of type 0; and the subsequent node identifications alternate according to the same rule.
When the first node identification of the node sequence is acquired, all the nodes of the type 0 in the relation network are acquired, and one node identification is randomly acquired from the node identifications of all the nodes of the type 0 and is used as the first node identification of the node sequence. That is, referring to fig. 3, one node identity 2 is randomly acquired from node identities (i.e., 0, 2, 3, 5) corresponding to nodes 0, 2, 3, 5 as a first node identity of the node sequence.
When the second node identification of the node sequence is acquired, first, according to the predetermined path, the out-degree value of a second predetermined type of the node corresponding to the first node identification is acquired based on the accumulation matrix. The out-degree value is obtained by the following formula (2):
dgr[i][j]=idx[i][j+1]-idx[i][j] (2)
where dgr[i][j] represents the out-degree value of type i of node j, and idx[i][j] is as defined for equation (1). For example, to obtain the second node identification of the example node sequence, the out-degree value of type 1 of node 2 is needed according to the predetermined path; based on the accumulation matrix and formula (2), it is dgr[1][2] = idx[1][3] - idx[1][2] = 3 - 1 = 2.
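Formula (2) amounts to a difference of two adjacent columns of the accumulation matrix, as the following Python sketch shows. Only the two entries quoted above are included; the dictionary layout and the function name out_degree are illustrative assumptions.

```python
# Sketch of formula (2): dgr[i][j] = idx[i][j+1] - idx[i][j].
# Only the two accumulation-matrix entries quoted above are included;
# the dictionary layout and function name are illustrative assumptions.
idx = {(1, 2): 1, (1, 3): 3}


def out_degree(i, j):
    """Out-degree value of type i for node j, from adjacent idx columns."""
    return idx[(i, j + 1)] - idx[(i, j)]


print(out_degree(1, 2))  # 2
```

Because the out-degree values can be recovered in this way, a working machine that holds the accumulation matrix does not need a separate copy of the degree matrix.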
Then, based on the out-degree value, a first integer is randomly acquired. If the out-degree value of type i of the node corresponding to the first node identification is denoted as D0, an integer in [0, D0-1] can be randomly obtained as the first integer, denoted as k. For example, for the example node sequence described above, D0=2, so an integer may be randomly drawn from {0, 1} as k, e.g., k=0.
Finally, the arrangement position in the edge vector is calculated based on the first integer, the second predetermined type, the first node identification and the accumulation matrix, and the node identification at that arrangement position in the edge vector is acquired as the second node identification. Specifically, the second predetermined type is taken as i, the first node identification as j, and k as m, and these are substituted into equation (1) to calculate j'; the node identification in column j' of the edge vector is then the second node identification of the node sequence. For example, for the example node sequence, i=1, j=2 and m=0, so equation (1) gives j' = 9 + 1 + 0 = 10; the element in column 10 of the edge vector is 1, i.e., the second node identification of the example node sequence is 1.
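One step of this random selection can be sketched in Python as follows. The complete accumulation matrix and edge vector are not reproduced in this passage, so the arrays below are hypothetical: they agree with the entries quoted in the text (idx[0][1]=1, idx[1][2]=1, idx[1][3]=3, the totals 9 and 5, and columns 0, 1, 2, 9 and 10 of the edge vector), and every other entry is assumed. The function names are also assumptions.

```python
import random

# Hypothetical full data, chosen only to agree with the entries quoted in
# the text; all other entries are assumed for illustration.
S = [0, 9]                                   # start column of each type's part
IDX = [[0, 1, 3, 4, 5, 8, 9],                # accumulation matrix, type 0
       [0, 1, 1, 3, 4, 4, 5]]                # accumulation matrix, type 1
EDGE = [5, 0, 2, 3, 2, 2, 3, 5, 0, 1, 1, 4, 4, 4]


def neighbor_at(j, i, k):
    """Node identification of the k-th type-i neighbor of node j (equation (1))."""
    return EDGE[S[i] + IDX[i][j] + k]


def sample_next(j, i, rng):
    """Pick a random type-i neighbor of node j, or None if it has none."""
    d0 = IDX[i][j + 1] - IDX[i][j]           # formula (2)
    if d0 == 0:
        return None
    k = rng.randrange(d0)                    # integer in [0, d0 - 1]
    return neighbor_at(j, i, k)


print(neighbor_at(2, 1, 0))  # 1, as in the example node sequence
```

Passing the random generator as a parameter is a design choice of this sketch that makes a walk reproducible from a seed; the text does not prescribe how the random integer is drawn.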
The third node identification of the node sequence may be obtained in the same way as the second node identification, and this will not be described in detail herein.
In one embodiment, sequentially and randomly acquiring the plurality of node identifications according to the predetermined path ends in either of the following cases: the number of acquired node identifications reaches a predetermined number; or the next node identification cannot be found. Referring to the lower part of fig. 6, the number of node identifications (the number of walk steps) is the preset predetermined number for one row of the node sequence. When the number of generated node identifications equals this number, acquisition of the next node identification stops; the step also stops when the next node identification cannot be found by the method described above.
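Putting the pieces together, one row of the node sequence with both stopping conditions might be generated as in the following Python sketch. The data are hypothetical apart from the entries quoted in the text, the meta-path 0,1,0 is represented as a repeating cycle of types (0, 1), and the names walk, type_cycle and walk_length are assumptions of the sketch.

```python
import random

# Hypothetical data, consistent with the values quoted in the text but
# otherwise assumed for illustration.
TYPES = [0, 1, 0, 0, 1, 0]                   # nodes 0, 2, 3, 5 are of type 0
S = [0, 9]                                   # start column of each type's part
IDX = [[0, 1, 3, 4, 5, 8, 9],                # accumulation matrix, type 0
       [0, 1, 1, 3, 4, 4, 5]]                # accumulation matrix, type 1
EDGE = [5, 0, 2, 3, 2, 2, 3, 5, 0, 1, 1, 4, 4, 4]


def walk(type_cycle, walk_length, rng):
    """Generate one row of the node sequence along a repeating type cycle."""
    starts = [n for n, t in enumerate(TYPES) if t == type_cycle[0]]
    seq = [rng.choice(starts)]               # first node: random node of the first type
    while len(seq) < walk_length:            # stop 1: predetermined number reached
        i = type_cycle[len(seq) % len(type_cycle)]
        j = seq[-1]
        d0 = IDX[i][j + 1] - IDX[i][j]       # formula (2)
        if d0 == 0:                          # stop 2: no next node can be found
            break
        seq.append(EDGE[S[i] + IDX[i][j] + rng.randrange(d0)])  # equation (1)
    return seq


row = walk(type_cycle=[0, 1], walk_length=6, rng=random.Random(0))
print(row)
```

Calling walk repeatedly yields the rows that, in the embodiments described next, are written to the database in batches.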
In one embodiment, the step of sequentially randomly acquiring the plurality of node identifications, respectively, according to the predetermined path, is looped a plurality of times, the method further comprising writing a predetermined number of rows of node sequences acquired through the predetermined number of loops into the database. For example, referring to fig. 6, in the lower part of fig. 6, the number of rows is a preset predetermined number of rows written into the database, and after the step is cycled a predetermined number of times (equal to the predetermined number of rows) to obtain a node sequence of the predetermined number of rows, the node sequence of the predetermined number of rows may be written into the database.
In one embodiment, the loop of the step of sequentially and randomly acquiring the plurality of node identifications according to the predetermined path stops when the sum of the numbers of rows of node sequences respectively acquired by the plurality of working machines reaches a predetermined value.
Fig. 7 illustrates a method for acquiring a node sequence of a relational network based on a distributed system including a plurality of servers and a plurality of working machines, the relational network including a plurality of nodes connected to each other, wherein each node has a node identification, one type of at least one type, a plurality of out-degree values respectively corresponding to the respective types, and neighboring nodes, according to an embodiment of the present specification, the method being performed in a first server of the plurality of servers, including:
in step S702, receiving, from at least one of the working machines, a plurality of out-degree values respectively corresponding to a designated plurality of second nodes among the plurality of nodes;
in step S704, calculating at least part of the elements of an accumulation matrix by receiving corresponding out-degree value data from other servers of the plurality of servers, and transmitting the at least part of the elements of the accumulation matrix to each of the working machines, wherein the accumulation matrix shows, for each of the plurality of nodes and for each of the at least one type, the sum of the out-degree values of that type of the respective nodes whose node identifications precede that of the node;
in step S706, receiving partial elements of an edge vector from at least one of the working machines, wherein the edge vector includes at least one portion respectively corresponding to the at least one type, and in each portion the node identifications of the neighboring nodes of the corresponding type of each of the plurality of nodes are sequentially arranged; and
in step S708, transmitting partial elements of the edge vector to each of the working machines.
The method of fig. 7 is the operation performed by the server that corresponds to the operations performed by the working machine in the distributed system as shown in figs. 4-6, and thus may likewise be described with reference to figs. 4-6.
First, in step S702, a plurality of out-degree values respectively corresponding to a designated plurality of second nodes among the plurality of nodes are received from at least one of the working machines. This step may be described with reference to fig. 4. Taking server 1 as an example, server 1 receives from working machine 0 and working machine 1, respectively, the columns of the degree matrix for node 1 (i.e., column 2) and node 2 (i.e., column 3), whose elements are the type-0 and type-1 out-degree values of node 1 and node 2, respectively. That is, in the figure, the distributed system designates server 1 to receive the out-degree values of nodes 1 and 2, i.e., nodes 1 and 2 are the plurality of second nodes.
In step S704, at least part of the elements of an accumulation matrix are calculated by receiving corresponding out-degree value data from other servers of the plurality of servers, and the at least part of the elements of the accumulation matrix are transmitted to each of the working machines, wherein the accumulation matrix shows, for each of the plurality of nodes and for each of the at least one type, the sum of the out-degree values of that type of the respective nodes whose node identifications precede that of the node.
As described above, the element value in the ith row and jth column of the accumulation matrix is equal to the sum of the first j element values of the ith row of the degree matrix. In one embodiment, the server calculates the elements of its designated columns of the accumulation matrix simply by obtaining the required degree matrix element values from the other corresponding servers. For example, server 1, which is assigned columns 2 and 3, obtains the values of the respective types in the 2nd column of the accumulation matrix by receiving the element values of the 1st column of the degree matrix from server 0 and accumulating, for each type, the element values of the 1st and 2nd columns of the degree matrix.
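The relationship between the degree matrix and the accumulation matrix is a per-row prefix sum, which the following Python sketch computes on a single machine for clarity. The degree matrix is hypothetical: only the entries for nodes 0 and 1 and the type-1 entry for node 2 follow from the text, and the remaining entries are assumed so that the row totals are 9 and 5. Columns here are counted from 0, and the function name is an assumption.

```python
from itertools import accumulate

# Hypothetical degree matrix: row i holds the type-i out-degree values of
# nodes 0..5. Only the entries for nodes 0 and 1 and the type-1 entry of
# node 2 follow from the text; the rest are assumed for illustration.
DEGREE = [[1, 2, 1, 1, 3, 1],
          [1, 0, 2, 1, 0, 1]]


def accumulation_matrix(degree):
    """Per-row prefix sums: column j holds the sum of the first j degree values."""
    return [[0] + list(accumulate(row)) for row in degree]


IDX = accumulation_matrix(DEGREE)
print(IDX)  # [[0, 1, 3, 4, 5, 8, 9], [0, 1, 1, 3, 4, 4, 5]]
```

The last column of each row is the total number of edges of that type. In the distributed setting, a server responsible for a range of columns only needs the degree values of the preceding columns, or equivalently their running total, from the other servers.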
In one embodiment, the system designates one server (e.g., a first server) of the plurality of servers to obtain the complete degree matrix by obtaining from all other servers their respective degree matrix element values, and to calculate the accumulation matrix based on the complete degree matrix.
In one embodiment, after computing at least a portion of the elements of the accumulation matrix, the at least a portion of the elements of the accumulation matrix are transmitted to at least one other server of the plurality of servers to backup the at least a portion of the elements of the accumulation matrix.
In one embodiment, each server obtains only a portion of the elements of the accumulation matrix and each server sends its respective portion of the elements to each work machine, thereby allowing each work machine to obtain the complete accumulation matrix. In one embodiment, a designated server (e.g., a first server) obtains a complete accumulation matrix and sends the accumulation matrix to each work machine such that each work machine obtains a complete accumulation matrix.
In step S706, a partial element of an edge vector is received from at least one of the working machines, wherein the edge vector includes at least one portion corresponding to the at least one type, respectively, each of the portions having node identifications of respective types of neighboring nodes of the plurality of nodes sequentially arranged therein. This step may be described with reference to fig. 4, where, for example, the system specifies that server 0 receives column 0-4 elements of the edge vector, and thus server 0 receives column 0-2 elements from work machine 0 and column 3-4 elements from work machine 1 in the upper portion of fig. 4. For a specific description of the edge vector, reference is made to the above specific description in step S208, and no further description is given here.
In one embodiment, after the partial elements of the edge vector are obtained, the partial elements of the edge vector are sent to at least one other server of the plurality of servers to backup the partial elements of the edge vector.
In step S708, a part of the elements of the edge vector is transmitted to each of the working machines.
In one embodiment, each of the plurality of servers sends a portion of the elements of its respective edge vector to each work machine, such that each work machine obtains the complete edge vector.
In one embodiment, before the partial elements of the edge vector are sent to each working machine, one server (e.g., a first server) of the plurality of servers is designated to receive the partial elements of the edge vector held by all other servers of the plurality of servers so as to obtain the complete edge vector, and sends the complete edge vector to each of the working machines. In one embodiment, the first server, after obtaining the complete edge vector, sends the edge vector to at least one other server to back up the edge vector.
In one embodiment, after the partial elements of the edge vector have been transmitted to each of the working machines, at least part of the elements of the accumulation matrix and the partial elements of the edge vector are transmitted again to a working machine in response to a request from that working machine. This embodiment is applicable to certain specific scenarios, for example, when working machine 1 crashes unexpectedly and restarts. In that case all data in its memory is lost, and it can re-acquire the accumulation matrix and the edge vector from the plurality of servers by sending data requests to the servers, so that generation of the node sequence can continue.
Fig. 8 illustrates an apparatus 800 for obtaining a sequence of nodes of a relational network based on a distributed system comprising a plurality of servers and a plurality of work machines, the relational network comprising a plurality of nodes connected to each other, wherein each node has a node identification, one of at least one type, a plurality of out-degree values corresponding to respective types, and neighboring nodes, the apparatus being implemented in a first work machine of the plurality of work machines, comprising:
a first obtaining unit 81 configured to obtain a plurality of out-degree values that each of a plurality of first nodes has, and send the obtained data to at least one first designated server of the plurality of servers;
a first receiving unit 82 configured to receive an accumulation matrix from the plurality of servers, the accumulation matrix showing, for each of the plurality of nodes and for each of the at least one type, the sum of the out-degree values of that type of the respective nodes whose node identifications precede that of the node;
a second obtaining unit 83 configured to obtain node identifiers and types of respective neighboring nodes of the at least one first node;
a calculating unit 84 configured to calculate, based on node identifiers and types of respective adjacent nodes of the at least one first node and the accumulation matrix, arrangement positions of the node identifiers of respective adjacent nodes of the at least one first node in an edge vector, to obtain partial elements of the edge vector, and to transmit the partial elements of the edge vector to at least one second designated server of the plurality of servers, wherein the edge vector includes at least one portion corresponding to the at least one type, respectively, in each of which node identifiers of respective adjacent nodes of the respective corresponding types of the plurality of nodes are sequentially arranged;
a second receiving unit 85 configured to receive the edge vectors from the plurality of servers; and
a third obtaining unit 86, configured to sequentially and randomly obtain a plurality of node identifiers as the node sequence according to a predetermined path, based on respective types of the plurality of nodes, the accumulation matrix and the edge vector, respectively, where the predetermined path defines types of respective nodes to which the plurality of node identifiers respectively correspond.
In one embodiment, the third obtaining unit 86 is further configured to randomly obtain, as the first node identifier of the node sequence, a node identifier from node identifiers of each of the plurality of nodes of the first predetermined type of the plurality of nodes according to a predetermined path.
In one embodiment, the third obtaining unit 86 includes:
a first obtaining subunit 861 configured to obtain, according to a predetermined path, the out-degree value of a second predetermined type of the node corresponding to the first node identifier based on the accumulation matrix;
a second obtaining subunit 862 configured to randomly obtain the first integer based on the out-degree value; and
a calculating subunit 863 configured to calculate, based on the first integer, the second predetermined type, the first node identifier, the accumulation matrix, and the edge vector, an arrangement position in the edge vector, thereby obtaining a node identifier corresponding to the arrangement position as a second node identifier.
In one embodiment, the third acquisition unit 86 ends implementation in any of the following cases:
the number of the plurality of node identifiers reaches a predetermined number;
the next node identity cannot be found.
In an embodiment, the third acquisition unit is implemented in a number of loops, and the apparatus further comprises a writing unit 87 configured to write the sequence of nodes of a predetermined number of rows acquired by the predetermined number of loops to the database.
In one embodiment, the third acquisition unit stops its loop when the sum of the numbers of rows of node sequences respectively acquired by the plurality of working machines reaches a predetermined value.
In one embodiment, the third obtaining unit includes a receiving subunit 864 configured to receive the accumulation matrix and the edge vector from the plurality of servers in a case where the first working machine loses memory data.
Fig. 9 illustrates an apparatus 900 for acquiring a node sequence of a relational network based on a distributed system including a plurality of servers and a plurality of working machines, the relational network including a plurality of nodes connected to each other, wherein each node has a node identification, one of at least one type, a plurality of out-degree values respectively corresponding to the respective types, and neighboring nodes, according to an embodiment of the present specification, the apparatus being implemented in a first server of the plurality of servers, comprising:
a first receiving unit 91 configured to receive, from at least one of the working machines, a plurality of out-degree values respectively corresponding to a designated plurality of second nodes among the plurality of nodes;
a calculating unit 92 configured to calculate at least part of the elements of an accumulation matrix by receiving corresponding out-degree value data from other servers of the plurality of servers, and to transmit the at least part of the elements of the accumulation matrix to each of the working machines, wherein the accumulation matrix shows, for each of the plurality of nodes and for each of the at least one type, the sum of the out-degree values of that type of the respective nodes whose node identifications precede that of the node;
a second receiving unit 93 configured to receive partial elements of an edge vector from at least one of the working machines, wherein the edge vector includes at least one portion respectively corresponding to the at least one type, and node identifications of respective corresponding types of neighboring nodes of the plurality of nodes are sequentially arranged in each of the portions; and
a first transmitting unit 94 configured to transmit a part of the elements of the edge vector to each of the working machines.
In an embodiment, the apparatus further comprises a second sending unit 95 configured to send at least part of the elements of the accumulation matrix to at least one other server of the plurality of servers for backing up at least part of the elements of the accumulation matrix after calculating the at least part of the elements of the accumulation matrix.
In one embodiment, the apparatus further includes a third sending unit 96 configured to send the partial element of the edge vector to at least one other server of the plurality of servers to backup the partial element of the edge vector after the partial element of the edge vector is acquired.
In an embodiment, the apparatus further comprises, before sending the partial elements of the edge vectors to each of the working machines, a third receiving unit 97 configured to receive the partial elements of the corresponding edge vectors from all other servers of the plurality of servers to obtain the edge vectors, wherein the first sending unit is further configured to send the edge vectors to each of the working machines.
In an embodiment, the apparatus further comprises a fourth sending unit 98 configured to send at least part of the elements of the accumulation matrix and part of the elements of the edge vector to the respective working machine in response to a request of the working machine after sending part of the elements of the edge vector to each of the working machines.
Another aspect of the present disclosure provides a computing device comprising a memory and a processor, wherein the memory has executable code stored therein, and wherein the processor, when executing the executable code, implements any of the methods described above.
With the solution for acquiring a node sequence of a relational network based on a distributed system according to the embodiments of the present specification, computing power increases nearly in proportion to the number of machines used for parallel computing, and, with the distributed architecture of the parameter server, graphs with a very large number of nodes can be processed. In addition, the scheme optimizes the graph storage mode and reduces memory usage as far as possible.
In this specification, the embodiments are described in a progressive manner; identical and similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, since the system embodiments are substantially similar to the method embodiments, they are described relatively briefly, and for relevant details reference may be made to the corresponding parts of the description of the method embodiments.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Those of ordinary skill would further appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Those of ordinary skill in the art may implement the described functionality using different approaches for each particular application, but such implementation is not to be considered as beyond the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The foregoing description of the embodiments has been provided for the purpose of illustrating the general principles of the invention and is not intended to limit the scope of the invention to the particular embodiments; any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the invention are intended to be included within the scope of the invention.