CN110019253B

CN110019253B - Distributed graph data sequence sampling method and device

Info

Publication number: CN110019253B
Application number: CN201910313368.6A
Authority: CN
Inventors: 张熙; 雷鸣涛; 杨金翠; 方滨兴
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2019-04-18
Filing date: 2019-04-18
Publication date: 2021-10-12
Anticipated expiration: 2039-04-18
Also published as: CN110019253A

Abstract

The embodiment of the invention provides a distributed graph data sequence sampling method and device, applied to distributed computing nodes comprising two or more computing nodes. The method comprises: acquiring preset graph data, a number of sampling times and a sampling path length; dividing the sampling times equally to obtain the sampling times of each computing node as its sampling distribution times; determining, according to the sampling path length, a target path whose path length equals the sampling path length from the path set processed by each computing node, wherein the target path is formed by edges of the graph data whose number equals the sampling path length, and each edge of the graph data comprises at least one element; and, for the target path of each computing node, extracting one element from the at least one element included in each edge forming the target path based on a predetermined weight, to obtain a sampling element sequence.

Description

Distributed graph data sequence sampling method and device

Technical Field

The invention relates to the technical field of data analysis, in particular to a distributed graph data sequence sampling method and device.

Background

With the development of communication technology, a large amount of data is generated, and how to obtain valuable information from it has become a concern. Graph data is one kind of such data. In the related art, a single computing node is used to sample graph data sequences. Although a single computing node can perform the sampling, the whole sampling process is completed by that one node, so the sampling efficiency is low.

Disclosure of Invention

The embodiment of the invention aims to provide a distributed graph data sequence sampling method and a distributed graph data sequence sampling device, which are used for solving the technical problem that the whole sampling process is completed through a single computing node and the sampling efficiency is low in the prior art. The specific technical scheme is as follows:

In a first aspect, the present invention provides a distributed graph data sequence sampling method, which is applied to distributed computing nodes, where the distributed computing nodes include two or more computing nodes, the method comprising:

acquiring preset graph data, sampling times and sampling path length;

dividing the sampling times equally to obtain the sampling times of each computing node as its sampling distribution times;

determining, according to the sampling path length, a target path whose path length equals the sampling path length from the path set processed by each computing node, wherein the target path is formed by edges of the graph data whose number equals the sampling path length, and each edge of the graph data comprises at least one element;

for the target path of each computing node, extracting one element from the at least one element included in each edge forming the target path based on a predetermined weight, to obtain a sampling element sequence, wherein the weight indicates the proportion of the sampling element sequence in a full-scale sampling element sequence set, and the full-scale sampling element sequences in that set are obtained by the distributed computing nodes each sampling according to its sampling distribution times.

Further, before the target path whose path length equals the sampling path length is determined, according to the sampling path length, from the path set processed by each computing node, the method further includes:

according to the number of the distributed computing nodes, partitioning the edge sets of the graph data to obtain a plurality of partitioned edge sets;

assigning a set of block edges to a compute node;

determining each path formed by each block edge set;

each path formed by each block edge set forms a path set processed by each computing node;

the determining, according to the sampling path length, of a target path whose path length equals the sampling path length from the path set processed by each computing node comprises:

based on the path set processed by each computing node, taking the starting edge in the block edge set allocated to the computing node as the starting point, and extending the search along the direction in which remaining edges exist in the block edge set, until a number of other edges equal to the sampling path length minus one, besides the starting edge, has been found;

and forming the target path from all the edges found by extension along the starting edge in the block edge set, the number of which equals the sampling path length.

Further, the weight is determined by the following steps:

for each computing node, determining as the weight the product of the total number of full-scale sampling element sequences on all edges forming the target path of the computing node and the total number of all target paths included in the computing node.

Further, the extracting, for the target path of each computation node, one element from at least one element included in each edge forming the target path based on a predetermined weight includes:

for each computing node, determining the reciprocal of the total number of all target paths included in the computing node as the occurrence probability of a target path among all target paths of that computing node;

for the target path of each computing node, determining the sampling probability of the elements of the edge at the k-th position in the target path based on the occurrence probability and the reciprocal of the total number of elements of the edge at the k-th position on the target path, wherein k ranges over each non-negative integer in {k | 0 ≤ k ≤ L}, and L is the sampling path length;

extracting an element from at least one element included in the edge at the k-th position according to the element sampling probability.
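As an illustration only, the two-stage draw described above may be sketched as follows. The sketch assumes that a target path is chosen with probability equal to the reciprocal of the number of target paths, and that one element is then chosen from each edge with probability equal to the reciprocal of the number of elements on that edge. The names `sample_element_sequence`, `target_paths` and `edge_elements`, and the sample data, are hypothetical and do not come from the embodiment.

```python
import random


def sample_element_sequence(target_paths, edge_elements, rng=random):
    """Draw one element sequence: pick a target path, then one element per edge."""
    if not target_paths:
        raise ValueError("the computing node has no target path")
    # Assumed occurrence probability of a target path: 1 / (number of target paths).
    path = rng.choice(target_paths)
    # Assumed element probability on each edge: 1 / (number of elements on that edge).
    return [rng.choice(edge_elements[edge]) for edge in path]


# Hypothetical data: two target paths, each made of two edges.
target_paths = [("e5", "e6"), ("e3", "e4")]
edge_elements = {"e3": ["a"], "e4": ["b", "c"], "e5": ["d", "e"], "e6": ["f"]}
sequence = sample_element_sequence(target_paths, edge_elements, random.Random(0))
```

The returned sequence has one element per edge of the chosen target path, so its length equals the sampling path length.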

In a second aspect, the present invention provides a distributed graph data sequence sampling apparatus, which is applied to a distributed computing node, where the distributed computing node includes: two or more computing nodes, the apparatus comprising:

the first acquisition module is used for acquiring preset graph data, sampling times and sampling path length;

the first processing module is used for equally dividing the sampling times to obtain the respective sampling times of each computing node as sampling distribution times;

a second processing module, configured to determine, from a path set processed by each compute node, a target path having a path length that is the same as the sampling path length according to the sampling path length, where the target path is formed by edges of the graph data having the same number of edges as the sampling path length, and each edge of the graph data includes at least one element;

and a third processing module, configured to extract, for a target path of each computation node, one element from at least one element included in each edge forming the target path, respectively, based on a predetermined weight, to obtain a sample element sequence, where the weight is used to indicate a proportion of the sample element sequence in a full-scale sample element sequence set, and the full-scale sample element sequence in the full-scale sample element sequence set is obtained by sampling the distributed computation nodes individually according to the sampling allocation times.

Further, the apparatus further comprises:

a partitioning module, configured to partition the edge set of the graph data according to the number of the distributed computing nodes to obtain multiple block edge sets, before the target path whose path length equals the sampling path length is determined, according to the sampling path length, from the path set processed by each computing node;

an allocation module for allocating a block edge set to a compute node;

the fourth processing module is used for determining each path formed by the edges in each block edge set;

the composition module is used for forming each path formed by each block edge set to form a path set processed by each computing node;

the second processing module is configured to:

Further, the apparatus further comprises: a fifth processing module to:

Further, the third processing module is configured to:

The embodiment of the invention provides a distributed graph data sequence sampling method and a distributed graph data sequence sampling device, wherein preset graph data, sampling times and sampling path length are obtained; the sampling times are equally divided to obtain the respective sampling times of each computing node as sampling distribution times; determining a target path with the path length being the same as the sampling path length according to the sampling path length from the path set processed by each computing node, wherein the target path is formed by edges of graph data with the edge number being the same as the sampling path length, and each edge of the graph data comprises at least one element; for the target path of each computing node, respectively extracting an element from at least one element included in each edge forming the target path based on a predetermined weight to obtain a sampling element sequence, wherein the weight is used for indicating the proportion of the sampling element sequence in a full-scale sampling element sequence set, and the full-scale sampling element sequence in the full-scale sampling element sequence set is obtained by distributing the sampling times of the respective distributed computing nodes.

Therefore, dividing the sampling times equally among the computing nodes distributes the task of sampling element sequences uniformly across the distributed computing nodes, so that each computing node shares the required sampling work. In addition, a target path whose length equals the sampling path length is determined, according to the sampling path length, from the path set processed by each computing node; then, for the target path of each computing node, one element is extracted from the at least one element included in each edge of the target path based on the predetermined weight to obtain a sampling element sequence. Each computing node thus samples its own sampling element sequences, and the sampling of the graph data sequence is completed. Compared with the prior art, in which a single computing node completes the whole sampling process, the method both reduces the amount of data processed by each computing node and improves the efficiency of obtaining the required sampled graph data.

Of course, not all of the advantages described above need to be achieved at the same time in the practice of any one product or method of the invention.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic flowchart of a distributed graph data sequence sampling method according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of a distributed computing node according to an embodiment of the present invention;

FIG. 3 is a diagram illustrating application of graph data to a social network, according to an embodiment of the present invention;

FIG. 4 is a first diagram of a block edge set according to an embodiment of the present invention;

FIG. 5 is a second diagram of a block edge set according to an embodiment of the present invention;

FIG. 6 is a diagram illustrating an exemplary application scenario according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of a directed path according to an embodiment of the present invention;

FIG. 8 is a schematic structural diagram of a distributed graph data sequence sampling apparatus according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Aiming at the problem that the whole sampling process is completed through a single computing node in the prior art and the sampling efficiency is low, the embodiment of the invention provides a distributed graph data sequence sampling method and a distributed graph data sequence sampling device, which are used for acquiring preset graph data, sampling times and sampling path length; the sampling times are equally divided to obtain the respective sampling times of each computing node as sampling distribution times; determining a target path with the path length being the same as the sampling path length according to the sampling path length from the path set processed by each computing node, wherein the target path is formed by edges of graph data with the edge number being the same as the sampling path length, and each edge of the graph data comprises at least one element; and for the target path of each computing node, respectively extracting an element from at least one element included in each edge forming the target path based on a predetermined weight to obtain a sampling element sequence, wherein the weight is used for indicating the proportion of the sampling element sequence in a full-scale sampling element sequence set, and the full-scale sampling element sequence in the full-scale sampling element sequence set is obtained by sampling the distributed computing nodes respectively according to the sampling distribution times.

First, a distributed graph data sequence sampling method provided by an embodiment of the present invention is described below.

The distributed graph data sequence sampling method provided by the embodiment of the invention is applied to sequence sampling of graph data in which the edges form a directed graph data structure, such as social network friend relationship sequence sampling, fault log sequence sampling and reference relationship sequence sampling. Once the method is implemented, analyses such as finding the most frequently occurring fault log sequence, or mining reference relationships in an online public opinion system, can be carried out.

As shown in FIG. 1 and FIG. 2, a distributed graph data sequence sampling method provided in an embodiment of the present invention is applied to distributed computing nodes, where the distributed computing nodes include two or more computing nodes 20; only one computing node is labeled in FIG. 2. A computing node in the embodiment of the present invention may be a terminal device, such as a desktop computer, or may be a server, and is not limited herein. The above method may comprise the following steps:

Step 110, obtaining preset graph data, sampling times and sampling path length.

The preset graph data is the basis for implementing the embodiment of the present invention, and for convenience of the following description, the preset graph data is referred to as the graph data for short.

The graph data may be, but is not limited to, a directed graph data structure, and may include, but is not limited to, vertices, edges and elements. The connection between two vertices is called an edge, a vertex may have zero or more neighboring elements, and elements are carried on edges, that is, each edge of the graph data includes at least one element. For example, as shown in FIG. 3, the directed graph data structure may be a social network. The vertices may be the individual users, such as user 1, user 2, user 3, user 4, user 5, user 6 and user 7. The edges form directed paths, which arise from the social network friendships among the users and represent the direction in which data is forwarded between users. Specifically, the directed paths may include: a directed path of path length 3 formed by user 1, user 3, user 6 and user 7; a directed path of path length 2 formed by user 1, user 2 and user 5; a directed path of path length 2 formed by user 1, user 2 and user 4; a directed path of path length 1 formed by user 2 and user 5; and the like. The elements on the edges may be the message elements propagated by each user: if a user propagates multiple message elements, they may be propagated to multiple users because of the social network friendships between users. User i propagates the message elements, where i takes each positive integer from 1 to 7 in turn.
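As an illustration only, graph data of this kind may be held in a structure such as the following. The class name `GraphData`, its methods, the edge identifiers and the message strings are hypothetical; only the user-to-user connections are taken from the directed paths listed for FIG. 3.

```python
from dataclasses import dataclass, field


@dataclass
class GraphData:
    """Directed graph in which every edge carries at least one element."""
    edges: dict = field(default_factory=dict)     # edge id -> (source vertex, target vertex)
    elements: dict = field(default_factory=dict)  # edge id -> list of elements on that edge

    def add_edge(self, edge_id, source, target, elements):
        if not elements:
            raise ValueError("each edge must include at least one element")
        self.edges[edge_id] = (source, target)
        self.elements[edge_id] = list(elements)

    def successors(self, edge_id):
        """Edges that can directly follow edge_id on a directed path."""
        _, head = self.edges[edge_id]
        return [e for e, (tail, _) in self.edges.items() if tail == head]


# Connections taken from the directed paths of the FIG. 3 example;
# the message elements are made up for illustration.
graph = GraphData()
graph.add_edge("u1-u3", 1, 3, ["msg_a"])
graph.add_edge("u3-u6", 3, 6, ["msg_b", "msg_c"])
graph.add_edge("u6-u7", 6, 7, ["msg_d"])
graph.add_edge("u1-u2", 1, 2, ["msg_e"])
graph.add_edge("u2-u5", 2, 5, ["msg_f"])
graph.add_edge("u2-u4", 2, 4, ["msg_g", "msg_h"])
```

Following `successors` from `"u1-u3"` twice reaches `"u6-u7"`, which corresponds to the directed path of length 3 through users 1, 3, 6 and 7.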

In order to limit the number of times that a sequence formed by specified elements in the graph data is sampled, the sampling times may be obtained in step 110. The sampling times can be set according to user requirements. The sequence formed by specified elements refers to the sampling element sequence obtained, in the embodiment of the present invention, by extracting one element from the at least one element included in each edge of the target path of each computing node, based on a predetermined weight. In this way, the required number of sampling element sequences can be sampled according to the sampling times.

Since the sampling element sequence is not randomly obtained, the sampling path length can be obtained in step 110, where the sampling path length is used to determine the length of the path to be sampled of the sampling element sequence, so that the paths to be sampled are equal in length, and the obtained sampling element sequences are also equal in length. In the embodiment of the present invention, a path to be sampled for determining the sequence of sampling elements is referred to as a target path.

Step 120, dividing the sampling times equally to obtain the sampling times of each computing node as its sampling distribution times.

In step 120, the sampling times may be divided equally as follows: first, the number m of computing nodes and the sampling times X are obtained; second, the sampling times X are divided by the number m of computing nodes to obtain the sampling times X/m of each computing node; third, the sampling times X/m of each computing node are taken as its sampling distribution times. In this way, every computing node takes on the same share of the required sampling times, so the workload is distributed fairly among the computing nodes.
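As an illustration only, the equal division of step 120 may be sketched as follows. The function name `allocate_sampling_times` is hypothetical, and the handling of a remainder when X is not divisible by m is an assumption of this sketch; the embodiment only describes the share X/m.

```python
def allocate_sampling_times(total_samples, num_nodes):
    """Split total_samples (X) across num_nodes (m) computing nodes as evenly as possible."""
    if num_nodes < 2:
        raise ValueError("the distributed computing nodes comprise two or more nodes")
    base, remainder = divmod(total_samples, num_nodes)
    # Assumption: when X is not divisible by m, the first `remainder` nodes take one extra sample.
    return [base + (1 if i < remainder else 0) for i in range(num_nodes)]


shares = allocate_sampling_times(100, 4)
```

When X is divisible by m, every entry equals X/m, as described in step 120.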

Step 130, determining a target path with the path length being the same as the sampling path length according to the sampling path length from the path set processed by each computing node, where the target path is formed by edges of the graph data with the edge number being the same as the sampling path length, and each edge of the graph data includes at least one element.

The set of paths that each compute node processes may refer to the set of paths that each compute node is assigned to be able to process in the graph data. Before this step 130, the method may obtain the path set by adopting the following steps:

Step 1, partitioning the edge set of the graph data according to the number of the distributed computing nodes to obtain a plurality of block edge sets;

Step 2, assigning one block edge set to one computing node;

Step 3, determining each path formed by each block edge set, where a path may be a directed path or an undirected path;

Step 4, forming the path set processed by each computing node from the paths formed by each block edge set.
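As an illustration only, steps 1 and 2 may be sketched as follows. The contiguous split is just one possible partitioning, chosen for this sketch; the function names `partition_edges` and `assign_blocks` and the integer node identifiers are hypothetical.

```python
def partition_edges(edge_ids, num_nodes):
    """Step 1: split the edge set into num_nodes contiguous block edge sets."""
    edge_ids = list(edge_ids)
    size, extra = divmod(len(edge_ids), num_nodes)
    blocks, start = [], 0
    for i in range(num_nodes):
        end = start + size + (1 if i < extra else 0)
        blocks.append(edge_ids[start:end])
        start = end
    return blocks


def assign_blocks(blocks):
    """Step 2: assign block i to computing node i (node ids are illustrative)."""
    return {node_id: block for node_id, block in enumerate(blocks)}


# The edge order is chosen so that the result matches the example block edge sets
# {e1, e5, e6} and {e2, e3, e4} used in the description.
blocks = partition_edges(["e1", "e5", "e6", "e2", "e3", "e4"], 2)
assignment = assign_blocks(blocks)
```

Each edge lands in exactly one block, so the blocks can be processed independently by their computing nodes.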

It should be noted that, for the graph data, the set composed of all edges existing in the graph data is referred to as the original edge set. At least one edge exists in the original edge set, and at least one directed path can be composed of at least one edge. Specifically, not every pair of edges in the original edge set is connected by a directed path: as shown in FIG. 6, there is a directed path between edge e₁ and edge e₃, whereas there is no directed path between edge e₂ and edge e₃.

Generally, the target path of each computing node can be determined directly from the path set processed by that computing node. To avoid omissions and errors, however, the target path whose path length equals the sampling path length may be determined by the following steps:

Step 5, based on the path set processed by each computing node, taking the starting edge in the block edge set allocated to the computing node as the starting point, and extending the search along the direction in which remaining edges exist in the block edge set, until a number of other edges equal to the sampling path length minus one, besides the starting edge, has been found. The starting edge may be the edge at which the leftmost vertex of the block edge set is located, or the edge at which the earliest starting vertex of the directed graph data is located. There may be one or more starting edges, depending on the actual situation.

To avoid omissions when extending the search for these other edges, step 5 may proceed according to the result of comparing the sampling path length with the maximum path length in the path set processed by each computing node. Based on that path set, and taking the starting edge in the block edge set allocated to the computing node as the starting point, the search is extended along the direction in which remaining edges exist, until a number of other edges equal to the sampling path length minus one, besides the starting edge, has been found. As before, the starting edge may be the edge at which the leftmost vertex of the block edge set is located.

Step 5 may be implemented in several ways, as follows:

In one implementation, when the comparison shows that the sampling path length is smaller than the maximum path length in the path set processed by each computing node, then, based on the path set processed by each computing node and taking the starting edge in the block edge set allocated to the computing node as the starting point, it is determined whether an edge exists in the block edge set in the direction of the remaining edges from the starting point.

If it is determined that such an edge exists in the block edge set, the search is extended to that edge, which is an edge other than the starting edge, and it is taken as the current other edge; if it is determined that no such edge exists in the block edge set, the extended search in that direction stops.

If the total number of other edges found by extension has not reached the sampling path length minus one, the current other edge is updated as the starting point, and the process returns to the step of determining whether an edge exists in the block edge set in the direction of the remaining edges from the starting point, until a number of other edges equal to the sampling path length minus one, besides the starting edge, has been found. In this way all other edges can be found from the block edge set, and omissions are avoided. Taking a sampling path length of 2 and two computing nodes as an example, as shown to the left of the arrow in FIG. 4 and FIG. 5, the first block edge set is E₁ = {e₁, e₅, e₆} and the second block edge set is E₂ = {e₂, e₃, e₄}, where E_i denotes the block edge set stored by the i-th of the m computing nodes, m denotes the total number of computing nodes, and 1 ≤ i ≤ m. The "first" in the first block edge set and the "second" in the second block edge set are used only to distinguish the two block edge sets and do not imply an order.
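As an illustration only, the extension search within one block edge set may be sketched as follows. The function name `find_target_paths` is hypothetical, and the sketch tries every edge of the block as a starting edge rather than only the edge at the leftmost vertex. The mapping of e₁ to e₆ onto vertex pairs is an assumption that is consistent with the examples in this description but is not stated explicitly in it.

```python
def find_target_paths(block_edges, path_length):
    """Return every directed path of exactly path_length edges inside one block.

    block_edges maps an edge id to its (tail vertex, head vertex) pair.
    """
    if path_length < 1:
        raise ValueError("path_length must be at least 1")

    def extend(path):
        # Stop once the starting edge plus (path_length - 1) other edges are found.
        if len(path) == path_length:
            return [tuple(path)]
        head = block_edges[path[-1]][1]
        found = []
        for edge_id, (tail, _) in block_edges.items():
            # Extend only along the direction in which a further edge exists.
            if tail == head and edge_id not in path:
                found.extend(extend(path + [edge_id]))
        return found

    paths = []
    for start_edge in block_edges:
        paths.extend(extend([start_edge]))
    return paths


# Assumed vertex pairs for the example block edge sets E1 and E2.
E1 = {"e1": ("v1", "v2"), "e5": ("v1", "v3"), "e6": ("v3", "v6")}
E2 = {"e2": ("v2", "v5"), "e3": ("v2", "v4"), "e4": ("v4", "v7")}
paths_E1 = find_target_paths(E1, 2)
paths_E2 = find_target_paths(E2, 2)
```

Under the assumed mapping, each block yields a single path of two edges, corresponding to the vertex sequences (v1, v3, v6) and (v2, v4, v7).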

Because the number of edges in a block edge set varies, the above implementation may be used when the sampling path length is smaller than the maximum path length in the path set processed by each computing node. If the sampling path length is larger than that maximum path length, however, edges may be truncated or omitted because the edge set of the graph data has been partitioned into multiple block edge sets. To avoid this, in the embodiment of the present invention, step 5 may use another implementation to extend the search for the other edges besides the starting edge, the number of which equals the sampling path length minus one:

When the comparison shows that the sampling path length is greater than the maximum path length in the path set processed by each computing node, then, based on the path set processed by each computing node and taking the starting edge in the block edge set allocated to the computing node as the starting point, the edge that is the same as the starting edge is looked up in the edge set of the graph data, and it is determined whether an edge exists in the edge set of the graph data in the direction of the remaining edges from that edge.

If it is determined that such an edge exists in the edge set of the graph data, the search is extended to that edge, which is an edge other than the starting edge, and it is taken as the current other edge; if it is determined that no such edge exists in the edge set of the graph data, the extended search in that direction stops.

If the total number of other edges found by extension is judged to be less than the sampling path length minus one, the current other edge is updated as the starting point, and the process returns to the step of determining whether an edge exists in the edge set of the graph data in the direction of the remaining edges, until a number of other edges equal to the sampling path length minus one, besides the starting edge, has been found. In this way all other edges can be found from the edge set of the graph data, and omissions are avoided. Taking a sampling path length of 2 and two computing nodes as an example, as shown to the right of the arrow in FIG. 4 and FIG. 5, a block edge set after extension may be called an expanded block edge set. (The expressions giving the first and second expanded block edge sets are missing from the source text.)

The expanded block edge set of the i-th computing node is the block edge set obtained by extending the edge set E_i stored by that computing node among the m computing nodes, where m denotes the total number of computing nodes and 1 ≤ i ≤ m. E_i is contained in its expanded block edge set, and the expanded block edge set is the set that contains E_i and also contains the edges that form the target paths. The "first" in the first expanded block edge set and the "second" in the second expanded block edge set are used only to distinguish the two expanded block edge sets and do not imply an order.
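As an illustration only, building an expanded block edge set by searching the edge set of the whole graph data may be sketched as follows. The function name `expand_block` is hypothetical, every edge of the block is treated as a possible starting edge, and the mapping of e₁ to e₆ onto vertex pairs is an assumption that is consistent with the examples in this description but is not stated explicitly in it.

```python
def expand_block(block_edge_ids, all_edges, path_length):
    """Extend a block with edges of the full edge set reachable in path_length - 1 steps."""
    expanded = list(block_edge_ids)
    frontier = list(block_edge_ids)
    for _ in range(path_length - 1):
        next_frontier = []
        for edge_id in frontier:
            head = all_edges[edge_id][1]
            for other, (tail, _) in all_edges.items():
                if tail != head:
                    continue
                # Keep edges from other blocks that are needed to complete a path.
                if other not in expanded:
                    expanded.append(other)
                if other not in next_frontier:
                    next_frontier.append(other)
        frontier = next_frontier
    return expanded


# Assumed vertex pairs for the edges of the example graph.
all_edges = {
    "e1": ("v1", "v2"), "e2": ("v2", "v5"), "e3": ("v2", "v4"),
    "e4": ("v4", "v7"), "e5": ("v1", "v3"), "e6": ("v3", "v6"),
}
expanded_E1 = expand_block(["e1", "e5", "e6"], all_edges, 2)
expanded_E2 = expand_block(["e2", "e3", "e4"], all_edges, 2)
```

Under the assumed mapping, the first block gains the edges that follow e1 in the full graph, while the second block is unchanged because its paths of two edges are already complete.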

The method of the embodiment of the invention further comprises determining, from the expanded block edge set, the edge at each position in the target path, specifically as follows: the sampling probability that each edge forming the target path is at the k-th position is determined among (n-k) edges, where n is the total number of edges in the expanded block edge set, the (n-k) edges are the edges in the original edge set excluding those already determined at the first k positions, and k ranges over each non-negative integer in {k | 0 ≤ k ≤ L}.

In the embodiment of the present invention, the sampling probability that each edge forming the target path is at the k-th position among the (n-k) edges may be determined by a formula (the formula itself and its edge-set symbols are missing from the source text), in which: E_(n-k) is the set of the (n-k) edges and does not include the edges at the determined first k positions; D_L(e_q) is the number of edges at distance L from edge e_q in the expanded block edge set, where e_q ranges over each edge of the expanded block edge set; D_(L-k+1)(e_(k-1)) is the number of edges at distance (L-k+1) from the edge at the (k-1)-th position; and D_(L-k)(e_j) is the number of edges at distance (L-k) from edge e_j, where e_j ranges over each edge of E_(n-k).

It should be noted that the number of edges at distance L from edge e_q specifically means the number of edges at distance L from e_q when e_q is taken as the starting edge. Taking FIG. 6 as an example, the edges at distance 1 from edge e₃ are e₁, e₂ and e₄. In the embodiment of the present invention, however, the edges at distance 1 from edge e₃ do not include e₁ and e₂ and include only e₄, so the number of edges at distance 1 from edge e₃ is 1. This simplifies the calculation of the sampling probability.
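As an illustration only, the forward-only count described above may be sketched as follows. The function name `count_edges_at_distance` is hypothetical. The sketch assumes that the distance is measured in forward steps along the edge direction and that distinct edges are counted. The mapping of e₁ to e₆ onto vertex pairs is also an assumption, consistent with the FIG. 6 example but not stated explicitly in this description.

```python
def count_edges_at_distance(edges, start_edge, distance):
    """Count the edges reached from start_edge by exactly `distance` forward steps."""
    frontier = {start_edge}
    for _ in range(distance):
        heads = {edges[e][1] for e in frontier}
        # Only edges leaving a head vertex count; edges entering start_edge are ignored.
        frontier = {e for e, (tail, _) in edges.items() if tail in heads}
    return len(frontier)


# Assumed vertex pairs for the edges of the example graph.
edges = {
    "e1": ("v1", "v2"), "e2": ("v2", "v5"), "e3": ("v2", "v4"),
    "e4": ("v4", "v7"), "e5": ("v1", "v3"), "e6": ("v3", "v6"),
}
d1_e3 = count_edges_at_distance(edges, "e3", 1)
```

Under the assumed mapping, only e4 lies one forward step from e3, so the count is 1, in agreement with the example above.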

Step 6, expanding along the starting edge within the block edge set to find all edges whose number equals the sampling path length, so as to form the target path. Because the other edges are found by expanding along the starting edge and the total number of edges found equals the sampling path length, the path length of the resulting target path equals the sampling path length. In this way, all edges whose number equals the sampling path length can be found by expansion, and the target path is determined.

Because the number of computing nodes is limited, it cannot be guaranteed that each edge has its own computing node. The number of edges may be equal to, greater than, or less than the number of computing nodes, depending on the actual situation; the case in which there are more edges than computing nodes is taken as an example here, without limitation. To aid understanding of steps 5 and 6 above, an exemplary description follows:

Step 1, as shown by the edge set of the graph data in FIG. 6, partitioning the edge set of the graph data, that is, dividing it into different block edge sets. Referring to FIGS. 4 and 5, for example, with 2 computing nodes the edge set of the graph data is partitioned into two block edge sets. The specific partitioning manner is not limited herein, provided that the same edge is not placed in multiple blocks and a vertex is not placed in multiple blocks, so that the integrity of each block edge set is ensured. This allows the different block edge sets to be processed separately.

Step 2, after the two block edge sets are obtained, assigning one block edge set to each computing node, so that the computing node can process its assigned block edge set.

Step 3, determining each path formed by the edges in each block edge set. For example, the first block edge set is {e1, e5, e6} and the second block edge set is {e2, e3, e4}. The paths of path length 1 in the first block edge set are (v1, v2), (v1, v3) and (v3, v6); the paths of path length 1 in the second block edge set are (v2, v4), (v4, v7) and (v2, v5). The path of path length 2 in the first block edge set is (v1, v3, v6); the path of path length 2 in the second block edge set is (v2, v4, v7).

Step 4, forming the paths formed by each block edge set into the path set processed by the corresponding computing node. For example, the paths formed by the first block edge set, namely (v1, v2), (v1, v3), (v3, v6) and (v1, v3, v6), constitute the path set processed by the first computing node; the paths formed by the second block edge set, namely (v2, v4), (v4, v7), (v2, v5) and (v2, v4, v7), constitute the path set processed by the second computing node. The "first" in the first computing node and the "second" in the second computing node serve only to distinguish the two computing nodes and do not imply any order.

Step 5, based on the path set processed by each computing node, taking a starting edge in the block edge set assigned to the computing node and, along the direction in which further edges follow the starting edge in the block edge set, expanding to find the edges other than the starting edge, whose number is the sampling path length minus one.

Step 6, expanding along the starting edge within the block edge set to find all edges whose number equals the sampling path length, so as to form the target path.
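Steps 1 to 4 above can be sketched in a few lines. This is a minimal sketch under stated assumptions: the edge endpoints in `EDGES` are inferred from the path listings for FIG. 6, the node labels in `BLOCKS` and the helpers `check_partition` and `path_set` are hypothetical names, and the check covers only the requirement that no edge appears in more than one block.

```python
# Edge endpoints (tail, head) are inferred from the FIG. 6 description: an assumption.
EDGES = {
    "e1": ("v1", "v2"),
    "e2": ("v2", "v5"),
    "e3": ("v2", "v4"),
    "e4": ("v4", "v7"),
    "e5": ("v1", "v3"),
    "e6": ("v3", "v6"),
}

# Steps 1 and 2: one block edge set per computing node (labels are hypothetical).
BLOCKS = {
    "node1": ["e1", "e5", "e6"],
    "node2": ["e2", "e3", "e4"],
}


def check_partition(edges, blocks):
    """True when every edge belongs to exactly one block edge set."""
    assigned = [e for block in blocks.values() for e in block]
    return len(assigned) == len(set(assigned)) and set(assigned) == set(edges)


def path_set(edges, block, max_length):
    """Steps 3 and 4: vertex tuples of all directed paths of 1..max_length edges in a block."""
    current = [edges[e] for e in block]  # paths of path length 1
    found = list(current)
    for _ in range(max_length - 1):
        extended = []
        for path in current:
            for e in block:
                tail, head = edges[e]
                if tail == path[-1]:  # edge continues the path
                    extended.append(path + (head,))
        found.extend(extended)
        current = extended
    return found


for node, block in BLOCKS.items():
    print(node, path_set(EDGES, block, 2))
```

Each computing node receives only the paths that can be built from its own block, which is why the path (v1, v2, v4) does not appear in either path set here.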

Based on the path set processed by the first computing node and referring to FIG. 4, a specific application example of step 5 is as follows. Assume the sampling path length is 2. The path set processed by the first computing node comprises (v1, v2), (v1, v3), (v3, v6) and (v1, v3, v6). Taking e1 and e5 in the first block edge set assigned to the computing node as starting edges, the search expands along the direction in which further edges exist in the first block edge set, that is, toward the remaining edge e6 (to the right in the figure), to find the edges other than the starting edges e1 and e5. The number of such edges is the sampling path length 2 minus one, i.e. one edge. Because no further edge follows starting edge e1, only the other edge e6, which follows starting edge e5, is determined.

In the corresponding application example of step 6, the first block edge set is expanded along starting edge e5 to find all edges whose number equals the sampling path length 2, namely the starting edge e5 and the other edge e6, which form the target path.

Based on the path set processed by the second computing node and referring to FIG. 5, a specific application example of step 5 is as follows. Assume the sampling path length is 2. The path set processed by the second computing node comprises (v2, v4), (v4, v7), (v2, v5) and (v2, v4, v7). Taking e2 and e3 in the second block edge set assigned to the computing node as starting edges, the search expands along the direction in which further edges exist in the second block edge set, that is, toward the remaining edge e4 (to the right in the figure), to find the edges other than the starting edges e2 and e3. The number of such edges is the sampling path length 2 minus one, i.e. one edge. Because no further edge follows starting edge e2, only the other edge e4, which follows starting edge e3, is determined.

In the corresponding application example of step 6, the second block edge set is expanded along starting edge e3 to find all edges whose number equals the sampling path length 2, namely the starting edge e3 and the other edge e4, which form the target path.
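The expansion search of steps 5 and 6 can be sketched as a small recursive search. This is an illustrative sketch only: the edge endpoints in `EDGES` are inferred from the FIG. 4 to FIG. 6 descriptions, and the function names `expand` and `target_paths` are hypothetical.

```python
# Edge endpoints (tail, head) are inferred from the figure descriptions: an assumption.
EDGES = {
    "e1": ("v1", "v2"),
    "e2": ("v2", "v5"),
    "e3": ("v2", "v4"),
    "e4": ("v4", "v7"),
    "e5": ("v1", "v3"),
    "e6": ("v3", "v6"),
}


def expand(edges, block, path, remaining):
    """Step 5: extend `path` (a list of edge names) by `remaining` further edges of `block`."""
    if remaining == 0:
        return [path]
    head = edges[path[-1]][1]
    results = []
    for e in block:
        if edges[e][0] == head and e not in path:
            results.extend(expand(edges, block, path + [e], remaining - 1))
    return results


def target_paths(edges, block, sampling_path_length):
    """Step 6: every edge sequence in `block` with exactly `sampling_path_length` edges."""
    found = []
    for start in block:  # each edge of the block is tried as the starting edge
        found.extend(expand(edges, block, [start], sampling_path_length - 1))
    return found


print(target_paths(EDGES, ["e1", "e5", "e6"], 2))  # [['e5', 'e6']]
print(target_paths(EDGES, ["e2", "e3", "e4"], 2))  # [['e3', 'e4']]
```

Starting edges with no following edge in the block, such as e1 in the first block and e2 in the second, simply contribute no target path.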

Step 140, for the target path of each computing node, extracting, based on a predetermined weight, one element from the at least one element included in each edge forming the target path, to obtain a sampling element sequence. The weight indicates the proportion of the sampling element sequence in the set of full-scale sampling element sequences, and the full-scale sampling element sequences in that set are obtained by each of the distributed computing nodes sampling according to its sampling allocation times.

When the embodiment of the present invention is applied to a scenario in which the original vertex set contains at least two vertices and at least one directed path is formed by at least one edge, a full-scale element sequence may consist of Q elements that come from Q edges of the original edge set, where the Q edges form a directed path of length Q. That is, the full-scale sampling element sequences are all the possible element sequences formed by the elements included in the edges that form each path, and the position of each element in the element sequence is the same as the position in the path of the edge from which the element comes. Taking FIG. 6 as an example, edges e₁ and e₂ form a path of path length 2. Assume that edge e₁ includes the elements Y1 (1, (a, b, c)), Y2 (2, (a, d)) and Y3 (3, (a, c)), and that edge e₂ includes the elements Y11 (11, (b, c)) and Y12 (12, (a, c)). One of the full-scale element sequences is then (Y1, Y11), where Y1 comes from edge e₁ and Y11 comes from edge e₂, and the order of the elements in the full-scale element sequence is the same as the order in which the edges point along the path. Assuming the sampling path length is 2, the weight of the full-scale element sequence (Y1, Y11) is 6 = 3 × 2, where 3 is the number of elements included in edge e₁ and 2 is the number of elements included in edge e₂.
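The count 6 = 3 × 2 can be checked by enumerating the full-scale element sequences for the path formed by e₁ and e₂. In this minimal sketch the dictionary `ELEMENTS` holds only the element labels from the example, and the helper names `full_sequences` and `sequence_count` are hypothetical.

```python
from itertools import product
from math import prod

# Element labels taken from the example; the element payloads are omitted.
ELEMENTS = {
    "e1": ["Y1", "Y2", "Y3"],
    "e2": ["Y11", "Y12"],
}


def full_sequences(path, elements):
    """All element sequences that take one element from each edge, in path order."""
    return list(product(*(elements[edge] for edge in path)))


def sequence_count(path, elements):
    """Number of full-scale element sequences: product of the per-edge element counts."""
    return prod(len(elements[edge]) for edge in path)


sequences = full_sequences(["e1", "e2"], ELEMENTS)
print(len(sequences), sequence_count(["e1", "e2"], ELEMENTS))  # 6 6
print(("Y1", "Y11") in sequences)                              # True
```

Keeping the path order in `product` preserves the rule that each element sits at the same position as the edge it comes from.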

To obtain the predetermined weight, the method of the embodiment of the present invention further includes determining the weight by the following step:

for each computing node, determining as the weight the product of the total number of full-scale sampling element sequences on all edges forming the target path of the computing node and the total number of target paths included in the computing node.

The full-scale sampling element sequences may refer to all the element sequences that are formed, for the target paths of all computing nodes, by extracting one element from the at least one element included in each edge forming each target path, and whose length is the same as that of the sampling element sequence. The sampling element sequence may be the element sequence formed, for each computing node, by extracting one element, based on a predetermined weight, from the at least one element included in each edge forming the target path. For each computing node, the number of sampling element sequences obtained from its target paths equals the sampling allocation times. Across the target paths of all computing nodes, the number of sampling element sequences obtained equals the sampling times.

In order to obtain the sample element sequence, the step 140 may adopt the following implementation steps from step 141 to step 143, and respectively extract one element from at least one element included in each edge forming the target path:

Step 141, determining the reciprocal of the total number of target paths included in each computing node as the occurrence probability of a target path among all target paths of that computing node;

Step 142, for the target path of each computing node, determining the element sampling probability of the edge at the kth position in the target path based on the occurrence probability and the reciprocal of the total number of elements on the edge at the kth position of the target path. Specifically, the product of the occurrence probability and the reciprocal of the total number of elements on the edge at the kth position is determined as the element sampling probability of the edge at the kth position in the target path;

Step 143, extracting one element from the at least one element included in the edge at the kth position according to the element sampling probability;

where the element sampling probability of the edge at the kth position is 1/(N_k × X), the weight is N_k × X, 1/X is the occurrence probability, 1/N_k is the reciprocal of the total number of elements on the edge at the kth position, N_k is the total number of elements on the edge at the kth position, k takes each non-negative integer value in {k | 0 ≤ k ≤ L}, X is the total number of target paths included in the computing node, and L is the sampling path length.
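Steps 141 to 143 can be sketched as below. This is a hedged illustration, not the patented implementation: the function names are hypothetical, the uniform draw inside each edge is an assumption that corresponds to the 1/N_k factor, and the sequence weight follows the rule stated earlier (number of full-scale sampling element sequences on the path multiplied by X) rather than the per-position expression N_k × X.

```python
import random
from fractions import Fraction
from math import prod

# Element labels from the FIG. 6 example; the element payloads are omitted.
ELEMENTS = {
    "e1": ["Y1", "Y2", "Y3"],
    "e2": ["Y11", "Y12"],
}


def occurrence_probability(total_paths):
    """Step 141: 1/X, the reciprocal of the number of target paths of the node."""
    return Fraction(1, total_paths)


def element_sampling_probability(n_k, total_paths):
    """Step 142: (1/X) * (1/N_k) for the edge at position k."""
    return occurrence_probability(total_paths) * Fraction(1, n_k)


def sample_sequence(path, elements, total_paths, rng):
    """Step 143: draw one element per edge (uniformly, an assumption) and attach a weight."""
    sequence = tuple(rng.choice(elements[edge]) for edge in path)
    weight = prod(len(elements[edge]) for edge in path) * total_paths
    return sequence, weight


rng = random.Random(0)  # seeded only to make the sketch repeatable
sequence, weight = sample_sequence(["e1", "e2"], ELEMENTS, total_paths=3, rng=rng)
print(sequence, weight)
print(element_sampling_probability(3, 3))  # 1/9
```

With X = 3 target paths and element counts 3 and 2, the weight evaluates to 3 × 2 × 3 = 18.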

Specific examples of the applications of the steps 141 to 143 are as follows:

Assume the sampling path length L is 2 and, taking FIG. 6 as an example, the edges are e₁, e₂, e₃, e₄, e₅ and e₆. The present application determines, from edges e₁ to e₆, the elements on the 2 edges of one target path in the following manner:

First, the edge at the 0th position in the path with L1 = 1 is determined. Specifically, to ensure the correlation between the elements in the sampled sampling element sequence, the probability of each of the edges e₁, e₂, e₃, e₄, e₅ and e₆ being the edge at the 0th position in the path with L1 = 1 is determined, and then, based on these probabilities, one of the edges e₁, e₂, e₃, e₄, e₅ and e₆ is selected as the edge at the 0th position. In the embodiment of the invention, the following formula is adopted:

for e_j ∈ {e₁, e₂, e₃, e₄, e₅, e₆}, to determine the probability of the edge at the 0th position in the path with L1 = 1, where Pr(e_j) is the probability that e_j is the edge at the 0th position in the path with L1 = 1, and D_L1(e_j) is the number of edges at distance 1 from edge e_j, specifically the number of edges at distance 1 from e_j when e_j is taken as the starting edge.

The edges e₁, e₂, e₃, e₄, e₅ and e₆ may each be regarded as the node at the 0th position in the path with L1 = 1; that is, the probability of each of the edges e₁, e₂, e₃, e₄, e₅ and e₆ being the node at the 0th position in the path with L1 = 1 is the sampling probability of the edge at the 0th position in the path with L1 = 1.

Next, in preparation for determining the edges of the path with L1 = 1, take FIG. 6 as an example. Following the direction of the directed graph, 2 edges lie at distance 1 from edge e₁, namely edges e₂ and e₃, so D_L1(e₁) = 2; 0 edges lie at distance 1 from edge e₂, so D_L1(e₂) = 0; in the same way, D_L1(e₃) = 1, D_L1(e₄) = 0, D_L1(e₅) = 1 and D_L1(e₆) = 0.

Therefore, Pr(e₂) = 0, Pr(e₄) = 0 and Pr(e₆) = 0, because no edge lies at distance 1 from e₂, e₄ or e₆.
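A minimal sketch of the position-0 probabilities follows, assuming that Pr(e_j) equals D_L1(e_j) divided by the sum of D_L1 over all candidate edges. This normalisation is an assumption chosen to agree with the zero values above; it is not quoted from the text. The edge endpoints in `EDGES` are likewise inferred from the FIG. 6 description, and the function names are hypothetical.

```python
from fractions import Fraction

# Edge endpoints (tail, head) are inferred from the FIG. 6 description: an assumption.
EDGES = {
    "e1": ("v1", "v2"),
    "e2": ("v2", "v5"),
    "e3": ("v2", "v4"),
    "e4": ("v4", "v7"),
    "e5": ("v1", "v3"),
    "e6": ("v3", "v6"),
}


def count_at_distance(edges, start, distance):
    """Directed continuations of `distance` edges that begin after `start`."""
    if distance == 0:
        return 1
    head = edges[start][1]
    return sum(count_at_distance(edges, nxt, distance - 1)
               for nxt, (tail, _) in edges.items() if tail == head)


def first_edge_probabilities(edges, distance):
    """Assumed rule: Pr(e_j) = D(e_j) / sum of D over all edges."""
    counts = {e: count_at_distance(edges, e, distance) for e in edges}
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no path of the requested length exists")
    return {e: Fraction(c, total) for e, c in counts.items()}


probabilities = first_edge_probabilities(EDGES, 1)
for edge, p in probabilities.items():
    print(edge, p)
```

Under this assumption the counts 2, 0, 1, 0, 1, 0 give probabilities 1/2, 0, 1/4, 0, 1/4, 0 for e₁ to e₆, so only edges that can start a two-edge path are ever chosen for the 0th position.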

Assume that, according to the sampling probabilities of the edge at the 0th position in the path with L1 = 1, the edge at the 0th position is determined to be edge e₁. The edge at the 1st position in the path with L1 = 1 is then determined from edges e₂, e₃, e₄, e₅ and e₆.

In the embodiment of the present invention, the sampling probability of each of the edges e₂, e₃, e₄, e₅ and e₆ as the edge at the 1st position in the path with L1 = 1 may be determined first, and then, according to these sampling probabilities, one edge is selected from e₂, e₃, e₄, e₅ and e₆ as the node at the 1st position in the path with L1 = 1. It should be understood that the sampling probability of the edge at the 1st position in the path with L1 = 1 is the probability of each of the edges e₂, e₃, e₄, e₅ and e₆ being the node at the 1st position in the path with L1 = 1.

In the embodiment of the invention, the following formula is adopted:

for e_j ∈ {e₂, e₃, e₄, e₅, e₆}, to determine the sampling probability of the edge at the 1st position in the path with L1 = 1; specifically,

Pr(e₄) = 0, Pr(e₅) = 0 and Pr(e₆) = 0.

Then, one edge is selected from edges e₂ and e₃. Assuming the selected edge is edge e₂, in the path with L1 = 1 shown in FIG. 7 the edge at the 0th position is edge e₁ and the edge at the 1st position is edge e₂.
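The choice of the edge at the 1st position can be sketched in the same spirit. The rule used here is an assumption consistent with the symbols defined earlier: an edge e_j that directly follows the previously chosen edge gets probability D_(L-k)(e_j) / D_(L-k+1)(e_(k-1)), and every other edge gets 0. The edge endpoints and the function names are hypothetical, as before.

```python
import random
from fractions import Fraction

# Edge endpoints (tail, head) are inferred from the FIG. 6 description: an assumption.
EDGES = {
    "e1": ("v1", "v2"),
    "e2": ("v2", "v5"),
    "e3": ("v2", "v4"),
    "e4": ("v4", "v7"),
    "e5": ("v1", "v3"),
    "e6": ("v3", "v6"),
}


def count_at_distance(edges, start, distance):
    """Directed continuations of `distance` edges that begin after `start`."""
    if distance == 0:
        return 1
    head = edges[start][1]
    return sum(count_at_distance(edges, nxt, distance - 1)
               for nxt, (tail, _) in edges.items() if tail == head)


def next_edge_probabilities(edges, previous, remaining):
    """Assumed rule: D_remaining(e_j) / D_(remaining+1)(previous) for direct successors, else 0."""
    denominator = count_at_distance(edges, previous, remaining + 1)
    head = edges[previous][1]
    probabilities = {}
    for edge, (tail, _) in edges.items():
        if edge == previous:
            continue
        numerator = count_at_distance(edges, edge, remaining) if tail == head else 0
        probabilities[edge] = Fraction(numerator, denominator)
    return probabilities


def choose_edge(probabilities, rng):
    """Draw one edge according to the given probabilities."""
    names = list(probabilities)
    return rng.choices(names, weights=[float(probabilities[n]) for n in names])[0]


position_1 = next_edge_probabilities(EDGES, "e1", remaining=0)
print(position_1)
print(choose_edge(position_1, random.Random(0)))
```

With e₁ at the 0th position, this assumed rule gives e₂ and e₃ a probability of 1/2 each and e₄, e₅ and e₆ a probability of 0, which agrees with the zero values stated in the text.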

Finally, one element is extracted from the at least one element included in edge e₁ and one element from the at least one element included in edge e₂, to obtain the sampling element sequence:

Extracting an element from edge e₁ is taken as an example; extracting an element from the elements included in edge e₂ is done in a similar manner and is not described in detail herein.

Because edge e₁ includes 3 elements, the element sampling probability of edge e₁ is 1/3; and because edge e₂ includes 2 elements, the element sampling probability of edge e₂ is 1/2.

According to the element sampling probability 1/3, one element is extracted from the elements Y1, Y2 and Y3 included in edge e₁. Assume that, by the above method, the elements extracted from edge e₁ and edge e₂ are Y1 and Y11, respectively. Since the edge at the 0th position in the target path is e₁ and the edge at the 1st position is e₂, the sampling element sequence is (Y1, Y11). The weight determined for (Y1, Y11) is 18, the product of the total number 6 of full-scale sampling element sequences on all edges of the target path and the total number 3 of target paths included in the computing node.

The following proceeds to describe the distributed graph data sequence sampling apparatus provided in the embodiment of the present invention.

As shown in fig. 8, an embodiment of the present invention further provides a distributed graph data sequence sampling apparatus, which is applied to distributed computing nodes, where the distributed computing nodes include: two or more computing nodes, the apparatus comprising:

a first obtaining module 31, configured to obtain preset graph data, sampling times, and a sampling path length;

the first processing module 32 is configured to equally divide the sampling times to obtain respective sampling times of each computing node, which are used as sampling distribution times;

a second processing module 33, configured to determine, from the path set processed by each compute node, a target path having a path length that is the same as the sampling path length according to the sampling path length, where the target path is formed by edges of the graph data having the same number of edges as the sampling path length, and each edge of the graph data includes at least one element;

a third processing module 34, configured to, for the target path of each computing node, extract, based on a predetermined weight, one element from the at least one element included in each edge forming the target path, to obtain a sampling element sequence, where the weight indicates the proportion of the sampling element sequence in the set of full-scale sampling element sequences, and the full-scale sampling element sequences in that set are obtained by each of the distributed computing nodes sampling according to its sampling allocation times.

In the embodiment of the present invention, the sampling times are divided equally so that each computing node receives the same number of samplings. The task of sampling element sequences is thereby distributed evenly over the computing nodes of the distributed computing nodes, and each computing node handles only its own share of the sampling times. From the path set processed by each computing node, a target path whose path length equals the sampling path length is determined according to the sampling path length; then, for the target path of each computing node, one element is extracted, based on a predetermined weight, from the at least one element included in each edge of the target path to obtain a sampling element sequence. Each computing node thus samples its own sampling element sequences, which completes the sampling of the graph data sequence. Compared with the prior art, in which a single computing node completes the whole sampling process, this both reduces the amount of data processed by each computing node and improves the efficiency of acquiring the required sampled graph data.
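The equal division performed by the first processing module can be sketched as follows. The function name `allocate_sampling_times` is hypothetical, and the handling of a total that is not divisible by the number of computing nodes (the first nodes take one extra sampling each) is an assumption, since the text only describes equal division.

```python
def allocate_sampling_times(sampling_times, num_nodes):
    """Split the total sampling times into per-node sampling allocation times."""
    if num_nodes < 2:
        raise ValueError("the distributed computing nodes comprise two or more nodes")
    if sampling_times < 0:
        raise ValueError("sampling times must be non-negative")
    base, remainder = divmod(sampling_times, num_nodes)
    # Assumption: any remainder is spread over the first `remainder` nodes.
    return [base + (1 if index < remainder else 0) for index in range(num_nodes)]


print(allocate_sampling_times(10, 2))  # [5, 5]
print(allocate_sampling_times(10, 3))  # [4, 3, 3]
```

The allocations always sum to the original sampling times, so the union of the per-node sampling element sequences has the requested size.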

In one possible implementation, the apparatus further includes:

an allocation module for allocating a block edge set to a compute node;

the second processing module is configured to:

In one possible implementation, the apparatus further includes: a fifth processing module to:

In a possible implementation manner, the third processing module is configured to:

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.

The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims

1. A distributed graph data sequence sampling method is applied to distributed computing nodes, and the distributed computing nodes comprise: two or more computing nodes, the method comprising:

acquiring preset graph data, sampling times and sampling path length;

for a target path of each computation node, extracting an element from at least one element included in each edge forming the target path respectively based on a predetermined weight to obtain a sampling element sequence, wherein the weight is used for indicating the proportion of the sampling element sequence in a full-scale sampling element sequence set, and the full-scale sampling element sequence in the full-scale sampling element sequence set is obtained by sampling the distributed computation nodes respectively according to the sampling distribution times;

before determining, in the path set processed from each compute node, a target path having a path length equal to the sample path length according to the sample path length, the method further includes:

assigning a set of block edges to a compute node;

determining each path formed by each block edge set;

and forming a path set processed by each computing node by the paths formed by each block edge set.

2. The method of claim 1, wherein determining, from the path set processed by each computing node and according to the sampling path length, a target path whose path length equals the sampling path length comprises:

3. The method of claim 1, wherein the weights are determined by:

4. The method according to claim 1 or 3, wherein said extracting, for the target path of each computing node, one element from at least one element included in each edge forming the target path, respectively, based on a predetermined weight, comprises:

5. A distributed graph data sequence sampling apparatus, applied to distributed computing nodes, the distributed computing nodes comprising: two or more computing nodes, the apparatus comprising:

a third processing module, configured to extract, for a target path of each computation node, one element from at least one element included in each edge forming the target path, respectively, based on a predetermined weight, to obtain a sample element sequence, where the weight is used to indicate a proportion of the sample element sequence in a full-scale sample element sequence set, and the full-scale sample element sequence in the full-scale sample element sequence set is obtained by sampling the distributed computation nodes individually according to the sampling allocation times;

the device further comprises:

an allocation module for allocating a block edge set to a compute node;

and the composition module is used for forming each path formed by each block edge set into a path set processed by each computing node.

6. The apparatus of claim 5,

the second processing module is configured to: based on the path set processed by each computing node, take a starting edge in the block edge set assigned to the computing node and, along the direction in which further edges follow the starting edge in the block edge set, expand to find the edges other than the starting edge, whose number is the sampling path length minus one;

7. The apparatus of claim 5, wherein the apparatus further comprises: a fifth processing module to:

8. The apparatus of claim 5 or 7, wherein the third processing module is to: