CN116302527A

CN116302527A - Social network data analysis method, system and electronic equipment

Info

Publication number: CN116302527A
Application number: CN202310239697.7A
Authority: CN
Inventors: 李真理; 王芳; 冯丹; 方鹏; 施展
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2023-03-13
Filing date: 2023-03-13
Publication date: 2023-06-23

Abstract

The invention discloses a social network data analysis method, system and electronic equipment. The method comprises the following steps: starting from the vertex with the greatest degree in the graph, the next vertex is accessed from among its neighbors according to a maximum-degree-first rule, and the same rule is applied repeatedly until all vertices have been traversed; the vertices are ordered into a vertex stream according to the access sequence. The vertices are treated as a one-dimensional arrangement, the vertex stream is divided equally into batches according to the parallel granularity, and the batches are distributed evenly among the threads. Each thread processes its batches of the vertex stream in vertex order, calculates for each vertex a score against each partition according to a multi-order attribute-aware strategy, and then assigns the vertex to the partition with the highest score. The invention can partition the graph efficiently, support large-scale graph data, and ensure the effectiveness of the partitioning results.

Description

Social network data analysis method, system and electronic equipment

Technical Field

The invention belongs to the field of social network data analysis, and in particular relates to a social network data analysis method, a social network data analysis system and electronic equipment.

Background

In today's internet age, graph data is a ubiquitous data structure at internet companies. Examples include Weibo, Tencent and Alibaba in China, and Twitter and Google abroad, where large numbers of users form large-scale social networks. These companies use graph structures to represent relationships between users: interactions such as likes, comments and follows are edges between user vertices. Efficient graph data analysis lets people understand the data more deeply, and many effective methods can extract useful information from graph data, helping companies improve their products and services.

With the rapid growth of the internet, the rapidly growing scale of enterprise graph data makes social network graph processing more difficult. For example, Facebook's graph data contains more than two billion user vertices and more than one trillion edges representing relationships such as follows, fans and likes; Alibaba's graph network likewise contains billions of users and billions of commodity vertices. In general, a method for extracting graph information that runs on a single machine with a multi-core CPU can handle graphs with at most millions of vertices, which clearly cannot meet current requirements. The mainstream approach is therefore to process large-scale graph data in a distributed environment: the graph data is divided into several parts and multiple vertices are operated on in parallel. On the one hand, this improves overall computing performance; on the other hand, graph processing at the scale of billions of vertices can be supported as long as there are enough compute nodes. However, partitioning graph data is not as simple as it might seem. Unlike linear data structures, graph data carries rich structural information. In most graph processing methods, extracting information about a vertex requires referencing information about its neighbors, or even the neighbors of its neighbors. This shows that a graph contains a large number of data dependencies, and that the graph must be partitioned in a way that breaks as few of these dependencies as possible.

Besides data dependencies, the performance of the distributed system must also be considered. More specifically, when a social network graph is divided into several subgraphs for processing in a distributed system, an edge whose two vertices are stored on different compute nodes is called a cut edge. Operating on a cut edge requires the information of both vertices at the same time, which raises a data synchronization problem: to ensure that the vertex information obtained at both ends is up to date, communication overhead is incurred. Thus, the more cut edges there are, the greater the communication overhead in the distributed system, and communication is generally the dominant overhead of a distributed system.
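The notion of a cut edge can be made concrete with a small sketch. The function name `count_cut_edges`, the toy graph and the placement of vertices on two compute nodes "A" and "B" are illustrative assumptions, not part of the original text.

```python
def count_cut_edges(edges, node_of):
    """Count edges whose two endpoints are stored on different compute nodes."""
    return sum(1 for u, v in edges if node_of[u] != node_of[v])


# Toy graph (assumption): a path 0-1-2-3 plus the edge 0-2.
edges = [(0, 1), (1, 2), (2, 3), (0, 2)]

# Hypothetical placement of the four vertices on two compute nodes.
node_of = {0: "A", 1: "A", 2: "B", 3: "B"}

# Edges (1, 2) and (0, 2) cross the node boundary, so two edges are cut.
print(count_cut_edges(edges, node_of))
```

Placing every vertex on the same node would give zero cut edges but no parallelism, which is the tension between minimal cut and load balance that the method addresses.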

In summary, the communication overhead of the existing social network data analysis method is large, and the actual requirements cannot be met.

Disclosure of Invention

Aiming at the defects of the prior art, the invention aims to provide a social network data analysis method, a system and electronic equipment, and aims to solve the problems that the communication cost of the existing social network data analysis method is large and the actual requirements cannot be met.

In order to achieve the above object, the present invention provides a social network data analysis method, which is characterized by comprising the following steps:

Storing data of a social network in a form of a social network graph, wherein each vertex in the social network graph represents a user, and if social relationship exists between the users, edge connection exists between the two corresponding vertices;

determining the vertex with the largest degree in the social network graph, sequentially accessing the neighbor vertices of that vertex in descending order of degree according to the maximum-degree-first principle, and then accessing the neighbors of those neighbor vertices according to the same principle until each vertex in the social network graph has been accessed;

sequentially arranging all the accessed vertexes according to the access sequence, dividing all the accessed vertexes into a plurality of batches according to the arrangement sequence and uniformly distributing the batches to each thread according to the preset parallel granularity;

in each thread, dividing each vertex into corresponding partitions according to a multi-order attribute perception strategy; the multi-order attribute sensing strategy considers the number of neighbors of the vertexes in each partition and the number of common neighbors of the vertexes in each partition, and dynamically balances and constrains the load of each partition to ensure the load balance of a plurality of partitions; vertices in the social network having similar characteristics are partitioned into the same partition to reduce communication overhead when analyzing social network data in a distributed system.

In one possible implementation, vertices in the social networking graph are accessed according to the following steps:

(1.1) finding out the vertex with the largest degree from the social network diagram, accessing the vertex with the largest degree and sequencing neighbors thereof according to the descending degree;

(1.2) starting to visit the neighbor vertexes, wherein the visit order is ordered according to degrees, and the neighbors with higher degrees visit preferentially;

(1.3) marking the vertex which has been visited at each visit, checking whether the vertex has been visited at each time when a new vertex is to be visited, and skipping the vertex if the vertex has been visited;

(1.4) after all the n-order neighbors of the maximum degree vertex are accessed, executing the step (1.2) and the step (1.3) on each n-order neighbor according to the access sequence of the n-order neighbors, and performing the same access on the n+1-order neighbors of the maximum degree vertex until each vertex in the social network diagram is accessed; wherein n represents the number of times step (1.2) and step (1.3) are repeatedly performed.

In one possible implementation, all the visited vertices are divided into a plurality of batches according to the order of arrangement, specifically:

all the accessed vertices are stored in an array in the order of access, and the size b of a batch is determined according to the preset number of threads t and the number of vertices v in the vertex stream;

each thread determines its unique id number, id = 0, 1, 2, 3, ..., and then calculates, from b and id, the start index sid = b × id and the end index eid = b × (id + 1) of the vertices it is responsible for partitioning;

and stopping processing when the vertex id of the thread processing exceeds the endpoint index eid or exceeds the array range.

In one possible implementation manner, in each thread, each vertex is divided into corresponding partitions according to a multi-order attribute sensing policy, specifically:

(3.1) each thread traversing its own vertex stream, and for each vertex, traversing all partitions;

(3.2) performing calculation of the division score S (v, p) for each vertex v currently being divided and each partition p; wherein the division score S (v, p) is composed of three parts:

the first part is a penalty coefficient, which depends on a preset hyperparameter and on the number of vertices currently assigned to the partition; the penalty strength changes with the current partitioning state so as to control the load balance of each partition, and the value of the hyperparameter can be adjusted within a preset range to control whether load balancing or minimal cut is prioritized during partitioning;

the second part is a first-order attribute, which considers the number of neighbors of vertex v in partition p;

the third part is a second-order attribute, which considers the number of common neighbors of vertex v and the vertices in partition p;

(3.3) for the vertex currently being divided, comparing its division scores at each partition, selecting the partition with the highest division score, and dividing v into the corresponding partition.
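As a rough illustration of the two attributes, the sketch below computes them with Python sets. It follows the reading given later in step (3.2.3), where the second-order attribute sums the common neighbors of v with each vertex of the result set R; the function names, the adjacency dictionary `adj` and the example partition are assumptions made for illustration.

```python
def first_order(v, p_vertices, adj):
    """Result set R: the vertices of partition p that are neighbors of v."""
    return adj[v] & p_vertices


def second_order(v, r_set, adj):
    """Sum, over each r in R, of the number of neighbors that r shares with v."""
    return sum(len(adj[r] & adj[v]) for r in r_set)


# Toy undirected graph as an adjacency dictionary (assumption).
adj = {
    0: {1, 2, 3},
    1: {0, 2},
    2: {0, 1, 3},
    3: {0, 2},
    4: set(),
}

partition_p = {1, 2, 4}             # hypothetical vertices already in p
R = first_order(0, partition_p, adj)
print(len(R))                       # first-order attribute of vertex 0
print(second_order(0, R, adj))      # second-order attribute of vertex 0
```

Here vertex 0 has two neighbors in p (vertices 1 and 2); vertex 1 shares one neighbor with vertex 0 and vertex 2 shares two, so the second-order value is 3.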

In one possible implementation manner, the step (3.2) specifically includes the following steps:

(3.2.1) calculating the penalty coefficient based on the preset hyperparameter β, the number of vertices currently assigned to the partition, the number of partitions, and the number of vertices already partitioned, where P is the number of vertices in partition p, k is the number of partitions, and V is the number of vertices already partitioned;

(3.2.2) intersecting the neighbor set N(v) of the vertex v currently being partitioned with the vertex set Q(p) already assigned to p by using a fast intersection algorithm to obtain a result set R, wherein the cardinality of R is taken as the value of the first-order attribute;

(3.2.2.1) sorting the vertices of N(v) and Q(p) by vertex number in ascending order and comparing the cardinalities of the two sets, pointing the pointer s to the first element of the smaller set and the pointer l to the first element of the larger set;

(3.2.2.2) letting S be the element pointed to by s and L the element pointed to by l, comparing S and L: if S is equal to L, S is added to R and both s and l advance to the next element; if S is less than L, only s advances to the next element; if S is greater than L, an initial step r = 1 is set, l is moved to position l + r, and S and L are compared again; while S is still greater than L, r is doubled (r = r × 2) and l is moved again, until S is no longer greater than L; at this point, a binary search for an element S' equal to S is performed in the larger set over the range between the previous and current positions of l; if S' is found, it is added to R and l is pointed to the position after S'; if not, s advances to the next element and l is pointed to position l + r + 1;

(3.2.2.3) repeating the comparison of (3.2.2.2) until s or l reaches the end of its set, returning the set R, and proceeding to step (3.2.3);

(3.2.3) traversing the set R, counting for each vertex r in R the number of common neighbors of r and the vertex v currently being partitioned, and taking the sum of all these counts as the value of the second-order attribute;

(3.2.4) calculating a division score based on the penalty coefficient, the first-order attribute, and the second-order attribute.
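The fast intersection of step (3.2.2) resembles a galloping (exponential-step) search followed by a binary search. The following is a minimal sketch of that general technique, not a literal transcription of the claimed steps: the function name `fast_intersect` is hypothetical, inputs are assumed to be duplicate-free vertex numbers, and after a failed binary search the sketch resumes l at the first element larger than S, a conservative choice made here so that no candidate in the larger set is skipped.

```python
from bisect import bisect_left


def fast_intersect(a, b):
    """Intersect two duplicate-free collections of vertex numbers."""
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    small, large = sorted(small), sorted(large)
    n = len(large)
    result = []
    s = l = 0
    while s < len(small) and l < n:
        x = small[s]
        if x == large[l]:
            result.append(x)
            s += 1
            l += 1
        elif x < large[l]:
            s += 1                      # x cannot occur in the larger set
        else:
            # Gallop: probe l + 1, l + 2, l + 4, ... while the probe is below x.
            r = 1
            lo, hi = l + 1, l + r
            while hi < n and large[hi] < x:
                lo = hi + 1
                r *= 2
                hi = l + r
            # Binary search between the previous and the current probe.
            pos = bisect_left(large, x, lo, min(hi + 1, n))
            if pos < n and large[pos] == x:
                result.append(x)
                l = pos + 1
            else:
                l = pos                 # first element larger than x, or the end
            s += 1
    return result


neighbors = [3, 40]                              # smaller set, e.g. N(v)
members = [1, 3, 5, 7, 9, 11, 13, 15, 40]        # larger set, e.g. Q(p)
print(fast_intersect(neighbors, members))        # [3, 40]
```

The doubling step pays off when one set is much smaller than the other, since long runs of the larger set are skipped with a logarithmic number of comparisons.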

In a second aspect, the present invention provides a social network data analysis system, comprising:

the social network diagram acquisition module is used for storing the data of the social network in the form of a social network diagram, wherein each vertex in the social network diagram represents a user, and if a social relationship exists between the users, the corresponding two vertices are connected by edges;

the vertex access module is used for determining the vertex with the greatest degree in the social network graph, sequentially accessing the neighbor vertices of that vertex in descending order of degree according to the maximum-degree-first principle, and then accessing the neighbors of those neighbor vertices according to the same principle until each vertex in the social network graph has been accessed;

The thread allocation module is used for sequentially arranging all the accessed vertexes according to the access sequence, dividing all the accessed vertexes into a plurality of batches according to the arrangement sequence and uniformly allocating the batches to each thread according to the preset parallel granularity;

the vertex partition module is used for dividing each vertex into corresponding partitions according to a multi-order attribute perception strategy in each thread; the multi-order attribute sensing strategy considers the number of neighbors of the vertexes in each partition and the number of common neighbors of the vertexes in each partition, and dynamically balances and constrains the load of each partition to ensure the load balance of a plurality of partitions; vertices in the social network having similar characteristics are partitioned into the same partition to reduce communication overhead when analyzing social network data in a distributed system.

In one possible implementation, the vertex access module performs vertex access according to the following steps: (1.1) finding out the vertex with the largest degree from the social network diagram, accessing the vertex with the largest degree and sequencing neighbors thereof according to the descending degree; (1.2) starting to visit the neighbor vertexes, wherein the visit order is ordered according to degrees, and the neighbors with higher degrees visit preferentially; (1.3) marking the vertex which has been visited at each visit, checking whether the vertex has been visited at each time when a new vertex is to be visited, and skipping the vertex if the vertex has been visited; (1.4) after all the n-order neighbors of the maximum degree vertex are accessed, executing the step (1.2) and the step (1.3) on each n-order neighbor according to the access sequence of the n-order neighbors, and performing the same access on the n+1-order neighbors of the maximum degree vertex until each vertex in the social network diagram is accessed; wherein n represents the number of times step (1.2) and step (1.3) are repeatedly performed.

In one possible implementation, the thread allocation module divides all the accessed vertices into a plurality of batches according to the arrangement order, specifically: storing all accessed vertices in an array in the order of access, and determining the size b of a batch according to the preset number of threads t and the number of vertices v in the vertex stream;

each thread determines its unique id number, id = 0, 1, 2, 3, ..., and then calculates, from b and id, the start index sid = b × id and the end index eid = b × (id + 1) of the vertices it is responsible for partitioning; processing stops when the index of the vertex being processed by the thread exceeds the end index eid or exceeds the array range.

In one possible implementation, the vertex partition module divides each vertex into a corresponding partition according to the multi-order attribute-aware strategy in each thread, specifically: (3.1) each thread traverses its own vertex stream and, for each vertex, traverses all partitions; (3.2) the division score S(v, p) is calculated for each vertex v currently being partitioned and each partition p, wherein the division score S(v, p) is composed of three parts:

the first part is a penalty coefficient, which depends on a preset hyperparameter and on the number of vertices currently assigned to the partition; the penalty strength changes with the current partitioning state so as to control the load balance of each partition, and the value of the hyperparameter can be adjusted within a preset range to control whether load balancing or minimal cut is prioritized during partitioning;

the second part is a first-order attribute, which considers the number of neighbors of vertex v in partition p;

the third part is a second-order attribute, which considers the number of common neighbors of vertex v and the vertices in partition p; (3.3) for the vertex currently being partitioned, its division scores against all partitions are compared, the partition with the highest division score is selected, and v is assigned to that partition.

In a third aspect, the present application provides an electronic device, comprising: at least one memory for storing a program; at least one processor for executing a memory-stored program, which when executed is adapted to carry out the method described in the first aspect or any one of the possible implementations of the first aspect.

In a fourth aspect, the present application provides a computer readable storage medium storing a computer program which, when run on a processor, causes the processor to perform the method described in the first aspect or any one of the possible implementations of the first aspect.

In a fifth aspect, the present application provides a computer program product which, when run on a processor, causes the processor to perform the method described in the first aspect or any one of the possible implementations of the first aspect.

In general, the above technical solutions conceived by the present invention have the following beneficial effects compared with the prior art:

The social network data analysis method provided by the invention takes into account that real-world social network graph data is extremely large and that the traditional serial approach of partitioning vertices one by one is too inefficient. The partitioning tasks are parallelized through a reasonable division of parallel work, which greatly improves graph partitioning efficiency compared with traditional methods and supports the partitioning of larger graph data; in addition, the set intersection step is optimized with a fast intersection algorithm, which further improves efficiency. Moreover, during parallelization the vertex stream is generated with a maximum-degree-first breadth-first traversal, so that more of the correlation among vertices is preserved within each batch and the degradation of partition quality caused by parallelization is reduced. As a result, the communication overhead of processing the social network graph can be reduced while the original data processing effect is maintained.

The social network data analysis method provided by the invention decides which partition a vertex is assigned to by jointly calculating the multi-order attributes and the penalty parameter between the vertex and each partition. This overcomes the limitation of existing methods that consider only a single attribute. Compared with the traditional strategy of considering only the number of neighbors, the partitioned data is better suited to today's complex social network graph processing methods, communication overhead is effectively reduced, and actual requirements are met.

Drawings

FIG. 1 is a flowchart of a social network data analysis method provided by an embodiment of the present invention;

FIG. 2 is a flowchart of a parallel multi-order attribute-aware streaming graph partitioning method provided by an embodiment of the present invention;

FIG. 3 is an exemplary diagram of the parallel multi-order attribute-aware streaming graph partitioning method provided by an embodiment of the present invention;

FIG. 4 is an exemplary diagram of a fast intersection method provided by an embodiment of the present invention;

FIG. 5 is an exemplary graph of data for calculating a division score provided by an embodiment of the present invention;

FIG. 6 is a schematic diagram of a social network data analysis system according to an embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

The term "and/or" herein is an association relationship describing an associated object, and means that there may be three relationships, for example, a and/or B may mean: a exists alone, A and B exist together, and B exists alone. The symbol "/" herein indicates that the associated object is or is a relationship, e.g., A/B indicates A or B.

The terms "first" and "second" and the like in the description and in the claims are used for distinguishing between different objects and not for describing a particular sequential order of objects. For example, the first response message and the second response message, etc. are used to distinguish between different response messages, and are not used to describe a particular order of response messages.

In the embodiments of the present application, words such as "exemplary" or "such as" are used to mean serving as examples, illustrations, or descriptions. Any embodiment or design described herein as "exemplary" or "for example" should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.

In the description of the embodiments of the present application, unless otherwise specified, the meaning of "a plurality of" means two or more, for example, a plurality of processing units means two or more processing units and the like; the plurality of elements means two or more elements and the like.

FIG. 1 is a flowchart of a social network data analysis method provided by an embodiment of the present invention; as shown in fig. 1, the method comprises the following steps:

S101, storing data of a social network in the form of a social network graph, wherein each vertex in the social network graph represents a user, and if a social relationship exists between two users, the two corresponding vertices are connected by an edge;

S102, determining the vertex with the largest degree in the social network graph, sequentially accessing the neighbor vertices of that vertex in descending order of degree according to the maximum-degree-first principle, and then accessing the neighbors of those neighbor vertices according to the same principle until each vertex in the social network graph has been accessed;

S103, arranging all the accessed vertices in the order of access, dividing them into a plurality of batches according to the arrangement order and the preset parallel granularity, and distributing the batches evenly among the threads;

S104, in each thread, assigning each vertex to a corresponding partition according to a multi-order attribute-aware strategy; the multi-order attribute-aware strategy considers the number of neighbors of the vertex in each partition and the number of common neighbors of the vertex and the vertices in each partition, and dynamically constrains the load of each partition to ensure load balance across the partitions; vertices with similar characteristics in the social network are assigned to the same partition to reduce communication overhead when the social network data is analyzed in a distributed system.

It should be noted that social network data processing involves graph partitioning, and graph partitioning has two optimization goals: load balancing and minimal cut. Load balancing means that the computers in a distributed system carry similar workloads. A cut comprises cut edges and cut vertices, that is, edges or vertices that exist in two partitions at the same time; communication overhead is usually incurred when the system processes a cut, so minimizing the cut reduces the cost of communication between computers. However, single-mindedly reducing the cut tends to produce an unbalanced load and hurts overall performance. In the extreme case, the number of cut edges is minimized when all vertices are placed on the same compute node. A graph partitioning method therefore has to weigh load balancing against minimal cut, so that the distributed system maintains good load balance while keeping the cut as small as possible, thereby improving overall performance.

Furthermore, traditional algorithms typically process the entire graph, whose cost is hard to estimate for large-scale graphs and which may require a large amount of memory to load the whole graph. A better approach is streaming graph partitioning, as in the LDG and Fennel algorithms: instead of analyzing all vertices before partitioning, they convert the vertex set into a vertex stream and decide for each vertex individually which partition it is assigned to, considering both load balancing and cutting as few edges as possible when making the decision. However, on the one hand, graph sizes are growing at a remarkable rate while these algorithms are single-threaded, so that on very large graphs the time spent partitioning may even offset the performance gained from partitioning; on the other hand, graph processing methods are becoming more complex, and traditional partitioning algorithms usually consider only the number of neighbors of a vertex, which cannot satisfy increasingly complex graph processing algorithms.

Therefore, how to design a parallel graph partitioning method that considers multiple indicators is a problem that those skilled in the art need to solve. To better partition social network graph data, the embodiment of the invention describes in detail a parallel multi-order attribute-aware streaming graph partitioning method used in social network data processing. As shown in FIG. 2, the method comprises the following steps:

(1) Starting from the vertex with the greatest degree in the graph, accessing the next vertex from the neighbor according to the rule of priority of the greatest degree, and continuously accessing the next vertex by using the same rule until all the vertices are traversed, and sequencing the vertex streams according to the access sequence.

(2) The vertices are treated as a one-dimensional arrangement, the vertex stream is divided equally into batches according to the parallel granularity, and the batches are distributed evenly among the threads.

(3) And each thread processes the vertex streams in batches according to the vertex sequence, calculates the score for each divided area according to the multi-order attribute sensing strategy for each vertex, and then divides the vertex into the area with the highest score.

Specifically, the step (1) comprises:

(1.1) finding out the vertex with the largest degree from the graph, accessing the vertex and ordering the neighbors of the vertex in descending degree order.

(1.2) starting to access its neighbors, the order of access being ordered by degrees, higher degree neighbors being preferred.

(1.3) marking vertices that have been visited at each visit, and checking whether a visit has been made at each new vertex visit. If so, the point is skipped.

(1.4) after all the neighbors have been visited, starting to perform the same access process on the neighbors of the neighbors according to the previous round of access sequence until each vertex is visited.

It will be appreciated that vertices in the graph are visited starting with the vertex with the greatest degree, and that one vertex is marked as visited after it has been visited. And its neighbors are sorted in descending order of degree, and the neighbors that have not been accessed are added to the queue to be accessed. And accessing the vertexes according to the sequence of the queues to be accessed until all the vertexes are accessed, and recording the access sequence of the whole process.
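The traversal described above can be sketched as a breadth-first search whose queue is filled in descending order of degree. The function name `max_degree_first_order`, the tie-breaking by vertex number and the restart from the highest-degree unvisited vertex when the graph is disconnected are assumptions made for this sketch; the text does not specify them.

```python
from collections import deque


def max_degree_first_order(adj):
    """Return the vertex stream for an undirected graph given as
    a dictionary mapping each vertex to the set of its neighbors."""
    degree = {v: len(ns) for v, ns in adj.items()}

    def by_degree(v):
        # Higher degree first; ties broken by vertex number (assumption).
        return (-degree[v], v)

    visited = set()
    order = []
    # Restarting from the next highest-degree unvisited vertex handles
    # disconnected graphs (assumption).
    for start in sorted(adj, key=by_degree):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in sorted(adj[u], key=by_degree):
                if w not in visited:
                    visited.add(w)
                    queue.append(w)
    return order


adj = {
    0: {1, 2, 3},
    1: {0, 2},
    2: {0, 1, 3, 4},
    3: {0, 2},
    4: {2},
    5: {6},
    6: {5},
}
print(max_degree_first_order(adj))  # [2, 0, 1, 3, 4, 5, 6]
```

Vertex 2 has the largest degree and is visited first; its neighbors follow in the order 0, 1, 3, 4, and the separate component {5, 6} is appended at the end.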

Specifically, step (2) includes:

(2.1) The vertex sequence is stored in an array in the order obtained in step (1), and the batch size b is determined according to the preset number of threads t and the number of vertices v in the vertex stream.

(2.2) Each thread obtains its own unique sequential id number, id = 0, 1, 2, 3, ..., and from b and id determines the start index sid and the end index eid of the vertices it is responsible for.

(2.3) stopping processing when the id of the vertex of the thread processing exceeds eid or exceeds the array range. If the array has undivided vertices, a batch of vertices is repartitioned from the unprocessed vertex region for processing until all vertices are processed.
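The batch assignment in step (2) can be illustrated as follows. This is a sketch under stated assumptions: the formula for b is not reproduced in the text, so the default b = ceil(v / t) is assumed here, and each thread's range is treated as half-open, from sid = b × id up to but not including eid = b × (id + 1), clipped to the array length. The function name `batch_bounds` is hypothetical.

```python
import math


def batch_bounds(num_vertices, num_threads, thread_id, batch_size=None):
    """Return (sid, eid) for one thread; eid is exclusive and clipped to the array."""
    # Assumed default batch size: ceil(v / t).
    b = batch_size if batch_size is not None else math.ceil(num_vertices / num_threads)
    sid = b * thread_id
    eid = min(b * (thread_id + 1), num_vertices)
    return sid, eid


vertex_stream = list("abcdefghij")   # 10 vertices in access order (toy data)
for thread_id in range(3):
    sid, eid = batch_bounds(len(vertex_stream), 3, thread_id)
    print(thread_id, vertex_stream[sid:eid])
```

With 10 vertices and 3 threads the assumed batch size is 4, giving the ranges [0, 4), [4, 8) and [8, 10). If a smaller batch size is chosen, some vertices remain unassigned after the first round, which corresponds to the re-batching of unprocessed vertices mentioned in step (2.3).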

Still further, each thread traverses the vertex sequence it is responsible for and, for each vertex v, traverses all partitions. For each pair of v and p, a division score S(v, p) is calculated, so that for each vertex the scores against all partitions are computed and saved.

Wherein, step (3) specifically includes:

(3.1) Each thread traverses its own vertex stream V_t; for each vertex, all partitions are traversed in turn.

(3.2) the calculation of the division score S (v, p) is performed for each current division vertex v and each partition p. The score consists of three parts:

the first part is a penalty coefficient ω, which controls the load balance of each partition;

the second part is a first-order attribute, which considers the number of neighbors of v in p;

the third part is a second-order attribute, which considers the number of common neighbors of v and the vertices in p.

(3.2.1) The penalty coefficient ω is calculated from the hyperparameter β and the current state of the partitioning, including the number of vertices partitioned so far and the number of vertices assigned to p, so as to ensure load balancing. Adjusting β changes whether load balancing or minimal cut is prioritized during partitioning.

(3.2.2) The neighbor set N(v) of the vertex v currently being partitioned is intersected with the vertex set Q(p) already assigned to p using a fast intersection algorithm, and the cardinality of the resulting set R is taken as the value of the first-order attribute.

(3.2.2.1) The vertices of N(v) and Q(p) are sorted by vertex number in ascending order, and the cardinalities of the two sets are compared. Pointer s is pointed to the first element of the smaller set and pointer l to the first element of the larger set.

(3.2.2.2) The initial step size r is set to 1, and the elements pointed to by s and l are compared. If they are equal, the element is added to the result set R and both pointers advance to the next element. If the element at l is smaller than the element at s, l is advanced by r positions, r is doubled, and this step is performed again. If the element at l is larger than the element at s, a binary search for the element at s is performed between the current position of l and its previous position; if the element is found, it is added to the result set and l is pointed to the position after the match; if it is not found, s is advanced by one position. Finally, r is reset to 1.

(3.2.2.3) as the two pointers move, when s or l points to the end of the set, then the result set R is returned, and step (3.2.3) is entered.

(3.2.3) the set R is traversed, the number of common neighbors that each vertex in R shares with v is counted, and the result is taken as the value of the second-order attribute.

(3.2.4) the score formed as the product of the three parts is stored in the score statistics area.

(3.3) for the vertex currently being divided, its division scores in all partitions are compared and the partition with the highest score is selected. The data structure recording the partition is locked, v is added to the partition, and the lock is then released.
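Steps (3.2.2) and (3.2.3) above can be illustrated with a small sketch. It is written under stated assumptions rather than taken from the patent: the function name and the adjacency representation are hypothetical, Python's built-in set intersection stands in for the high-speed intersection algorithm, and the second-order attribute is assumed to be the sum, over every u in R, of the number of common neighbors cn(v, u) = |N(v) ∩ N(u)|. The sketch deliberately does not combine the attributes with the penalty coefficient, because that formula is not reproduced in the text.

```python
def first_and_second_order(v, adj, partition):
    """Return (first_order, second_order) of vertex v against one partition.

    adj maps each vertex to the set of its neighbors N(.);
    partition is the set Q(p) of vertices already assigned to p.
    """
    # First-order attribute: |R|, where R = N(v) ∩ Q(p).
    r = adj[v] & partition
    first_order = len(r)
    # Assumption: the second-order attribute sums, over each u in R,
    # the number of common neighbors cn(v, u) = |N(v) ∩ N(u)|.
    second_order = sum(len(adj[v] & adj[u]) for u in r)
    return first_order, second_order


# Toy graph with edges a-b, a-c, b-c, a-d.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
```

With this toy graph, vertex a has two neighbors in the partition {b, c}, and each of them shares exactly one common neighbor with a, so both attributes evaluate to 2.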

It should be noted that the quality of the partitioning strategy largely determines the performance of streaming graph partitioning. The main idea of the division is that the vertices are arranged in a certain order and then divided into different nodes one after another until all vertices have been divided. The present invention determines which node the current vertex is divided into by calculating a "division score": the division score between the vertex to be divided and each node is calculated, and the vertex is then divided into the node with the highest score. Fig. 3 is an example of vertex stream division. How to calculate the proximity score is the core of the whole method; the calculation must take the reduction of cut edges into account while keeping the node loads as balanced as possible. The division score of a vertex and a node represents how tightly connected the graph formed by the vertex and the existing vertex set in the node would be.

It can be appreciated that, in order to reduce the overhead of set intersection, the invention adopts a high-speed intersection method. The ordinary intersection algorithm first sorts the two sets in ascending order, then points one pointer at the smallest element of each set and compares the two elements pointed to; if they are equal, the element is added to the result set, otherwise the pointer at the smaller element advances by one position, and the comparison is repeated until one set has been traversed. However, this algorithm is only suitable when the two sets are of similar size; if the sizes differ greatly, a large number of useless comparisons occur. Observation shows that, because the vertex stream is ordered by gradually decreasing degree, the neighbor set of the vertex being divided shrinks from large to small. At the same time, as the division progresses, the number of vertices in a node gradually grows, so the size of that set changes from small to large. Most of the time there is therefore a large gap between the sizes of the two sets, and many comparisons are wasted. The present invention therefore adopts a fast intersection algorithm, as shown in Fig. 4: when a pointer is moved, the step length is not fixed at 1. While the element pointed to remains the smaller one, the offset of each move is doubled; with p the initial position of the pointer, the offsets relative to p are 2⁰, 2¹, 2², ..., 2ⁿ. When the element pointed to is no longer the smaller one, the interval [p + 2ⁿ⁻¹, p + 2ⁿ] is searched for the same element. In this way a large number of useless comparisons are avoided and the pointer at the smaller element moves quickly.
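The doubling-offset idea described above is commonly known as galloping or exponential search. The following sketch shows one standard way to write it; it is an illustration under assumptions, not the exact pointer bookkeeping of the invention. The function name `gallop_intersect` is hypothetical, both inputs are assumed to be ascending lists of distinct vertex numbers, and the standard-library `bisect_left` performs the binary search inside the bracketed window.

```python
from bisect import bisect_left


def gallop_intersect(a: list[int], b: list[int]) -> list[int]:
    """Intersect two ascending lists of distinct ints using galloping search."""
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    result = []
    pos = 0  # first index in `large` that can still match
    for x in small:
        if pos >= len(large):
            break
        # Gallop: probe offsets 0, 1, 2, 4, 8, ... from pos until an
        # element >= x is found or the end of `large` is passed.
        step = 1
        lo = hi = pos
        while hi < len(large) and large[hi] < x:
            lo = hi + 1          # everything up to hi is smaller than x
            hi = pos + step
            step *= 2
        hi = min(hi, len(large) - 1)
        # Binary search for x inside the bracketed window large[lo..hi].
        idx = bisect_left(large, x, lo, hi + 1)
        if idx < len(large) and large[idx] == x:
            result.append(x)
            pos = idx + 1
        else:
            pos = idx
    return result
```

When one list is much shorter than the other, each element of the short list costs a number of comparisons that grows with the logarithm of the distance skipped in the long list, rather than with the distance itself as in the one-step two-pointer scan.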

As shown in Fig. 5, when the proximity score between the vertex e and the node p is calculated, the first-order attribute is equal to 2, and cn(x, y) denotes the number of common neighbors of x and y.

In the conventional graph partitioning methods LDG and Fennel, the tightness between a vertex and a vertex set is computed simply from the neighbors of the vertex: the more neighbors of the vertex to be divided exist in the vertex set, the higher the tightness between the vertex and the set. However, modern graph processing algorithms often make heavy use of common neighbors, a second-order attribute, in order to cope with more complex scenarios. Here, a common neighbor is defined as follows: if two vertices are connected by an edge and another vertex has an edge to each of them, that vertex is a common neighbor of the two vertices. Therefore, to better accommodate complex graph processing algorithms, the present invention considers multi-order attributes in the computation of the proximity score.

Specifically, the penalty coefficient ω is calculated from the hyperparameter β, where |p| is the number of vertices in partition p, k is the number of partitions, and |V| is the number of vertices that have been divided so far.

It will be appreciated that once a node reaches a higher degree of tightness, subsequent vertices are more likely to obtain a higher proximity score with that node: higher tightness indicates that the node contains more high-degree vertices, and a vertex is clearly more likely to have neighbor and common-neighbor relationships with high-degree vertices. Without any restriction, all vertices would gradually be divided into the same node, and the distributed system would degenerate into a single machine. The invention therefore penalizes the score according to the load of the current node: the higher the load, the heavier the penalty.

The invention calculates the load penalty coefficient from the number of vertices divided so far rather than from the total number of vertices, so the strength of the penalty changes with the current state of the division; this is called dynamic division. If the penalty coefficient were calculated from the total number of vertices, then at the start of the division |p| is small and the coefficient approaches 1, so there is almost no penalty; once a node has been assigned a certain number of vertices, the coefficient approaches 0 and the penalty becomes so heavy that vertices are hardly ever divided into that node. The final effect would be equivalent to filling one node until it reaches the average load, then no longer dividing vertices into it and moving on to the next node, and so on. Calculating the coefficient from the vertices divided so far keeps the loads of the nodes equal while they grow uniformly.

The value of β determines how strictly load balance is enforced: the smaller β is, the smaller the load penalty becomes, and the more the partitioning effect may be weakened; the larger β is, the greater the load penalty becomes, and the less likely load imbalance is to occur.

In addition, each vertex is added to the vertex queue of the partition with the largest division score, and the insertion keeps the queue ordered by vertex number. The mutual exclusion lock of the partition is acquired before the insertion, and it is released after the insertion is finished.
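The locked, ordered insertion described above can be sketched as follows. This is a minimal illustration with hypothetical names (`Partition`, `assign`); it assumes the per-partition scores of a vertex have already been computed elsewhere, and it uses the standard-library `threading.Lock` and `bisect.insort` in place of whatever data structures the actual system uses.

```python
import threading
from bisect import insort


class Partition:
    """One partition: a vertex list kept sorted, guarded by its own mutex."""

    def __init__(self):
        self.vertices = []
        self.lock = threading.Lock()

    def add(self, v: int) -> None:
        # Acquire the partition's mutex, insert in sorted position, release.
        with self.lock:
            insort(self.vertices, v)


def assign(v, scores, partitions):
    """Add v to the partition with the highest score and return its index."""
    # Ties are resolved in favor of the lowest partition index (an assumption).
    best = max(range(len(partitions)), key=lambda i: scores[i])
    partitions[best].add(v)
    return best
```

Keeping each queue sorted at insertion time means that later intersections against Q(p) do not need a separate sorting pass, and holding one lock per partition lets threads that choose different partitions proceed without waiting for each other.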

With the parallel multi-order attribute-aware streaming graph partitioning method, large-scale graph data can be divided efficiently, and multi-order attribute awareness ensures that the division result remains sufficiently effective for high-order graph processing methods.

The graph partitioning of the present invention was run on several graph data sets of different scales. Compared with the conventional graph partitioning methods LDG and Fennel, the partitioning time is reduced several-fold: the Twitter data set, with tens of millions of vertices, is partitioned within hours, whereas LDG requires several days. In addition to this performance, the invention is sufficiently effective to provide up to a 2-fold performance improvement in the downstream task HuGE, an entropy-based high-order random walk algorithm. Therefore, the parallel multi-order attribute-aware streaming graph partitioning method provided by the invention ensures the effectiveness, efficiency and scalability of graph partitioning.

Fig. 6 is a schematic diagram of a social network data analysis system according to an embodiment of the present invention; as shown in Fig. 6, the system includes:

the social network graph obtaining module 610 is configured to store data of a social network in the form of a social network graph, where each vertex in the social network graph represents a user, and if a social relationship exists between two users, an edge connects the two corresponding vertices;

the vertex accessing module 620 is configured to determine the vertex with the greatest degree in the social network graph, visit the neighboring vertices of that vertex in descending order of degree according to the maximum-degree-first rule, and then visit the neighbors of those neighboring vertices according to the same rule, until every vertex in the social network graph has been visited;

the thread allocation module 630 is configured to sequentially arrange all the accessed vertices according to an access order, divide all the accessed vertices into a plurality of batches according to a preset parallel granularity, and uniformly allocate the batches to each thread;

the vertex partition module 640 is configured to partition each vertex into corresponding partitions according to a multi-order attribute awareness policy in each thread; the multi-order attribute sensing strategy considers the number of neighbors of the vertexes in each partition and the number of common neighbors of the vertexes in each partition, and dynamically balances and constrains the load of each partition to ensure the load balance of a plurality of partitions; vertices in the social network having similar characteristics are partitioned into the same partition to reduce communication overhead when analyzing social network data in a distributed system.
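The traversal performed by the vertex accessing module 620 can be sketched as a breadth-first search whose neighbors are expanded in descending order of degree. The sketch below rests on assumptions that the text does not settle: the function name `degree_first_order` is hypothetical, ties in degree are broken by vertex number, and for a disconnected graph the traversal restarts from the highest-degree vertex not yet visited.

```python
from collections import deque


def degree_first_order(adj: dict[int, set[int]]) -> list[int]:
    """Return all vertices in breadth-first order, highest degree first."""
    degree = {v: len(ns) for v, ns in adj.items()}

    def key(v):
        # Descending degree; ties broken by vertex number (an assumption).
        return (-degree[v], v)

    visited = set()
    order = []
    # Assumption: restart from the highest-degree unvisited vertex
    # whenever a connected component has been exhausted.
    for start in sorted(adj, key=key):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            for u in sorted(adj[v], key=key):
                if u not in visited:  # skip vertices already marked
                    visited.add(u)
                    queue.append(u)
    return order


graph = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 4}, 3: {0}, 4: {2}, 5: set()}
```

On this small graph the resulting vertex stream is 0, 2, 1, 3, 4, 5: the maximum-degree vertex comes first, its neighbors follow from highest to lowest degree, and the isolated vertex is visited last.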

It should be understood that, the foregoing apparatus is used to perform the method in the foregoing embodiment, and corresponding program modules in the apparatus implement principles and technical effects similar to those described in the foregoing method, and reference may be made to corresponding processes in the foregoing method for the working process of the apparatus, which are not repeated herein.

Based on the method in the above embodiment, an embodiment of the present application provides an electronic device. The apparatus may include: at least one memory for storing programs and at least one processor for executing the programs stored by the memory. Wherein the processor is adapted to perform the method described in the above embodiments when the program stored in the memory is executed.

Based on the method in the above embodiment, the present application provides a computer-readable storage medium storing a computer program, which when executed on a processor, causes the processor to perform the method in the above embodiment.

Based on the methods in the above embodiments, the present application provides a computer program product, which when run on a processor causes the processor to perform the methods in the above embodiments.

It is to be appreciated that the processor in embodiments of the present application may be a central processing unit (CPU), but may also be another general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. The general purpose processor may be a microprocessor or any conventional processor.

The method steps in the embodiments of the present application may be implemented by hardware, or may be implemented by a processor executing software instructions. The software instructions may consist of corresponding software modules, which may be stored in random access memory (RAM), flash memory, read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Alternatively, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC.

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted via a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)).

It will be appreciated that the various numerical numbers referred to in the embodiments of the present application are merely for ease of description and are not intended to limit the scope of the embodiments of the present application.

It will be readily appreciated by those skilled in the art that the foregoing description is merely a preferred embodiment of the invention and is not intended to limit the invention, but any modifications, equivalents, improvements or alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims

1. A method for analyzing social network data, comprising the steps of:

2. The method of claim 1, wherein vertices in the social network graph are accessed by:

3. The method according to claim 1, wherein all visited vertices are divided into a plurality of batches in a ranking order, in particular:

storing all visited vertices in an array in visit order, and determining the batch size b from the preset number of threads t and the number of vertices v in the vertex stream;

each thread determines a unique id number, id = 0, 1, 2, 3, ..., and then calculates, from b and id, the start index sid = b × id and the end index eid = (b+1) × id of the vertices it is responsible for dividing;

4. The method according to claim 1, wherein in each thread, each vertex is divided into corresponding partitions according to a multi-order attribute aware policy, in particular:

ω is a penalty coefficient; the penalty coefficient is related to a preset hyperparameter and to the number of vertices currently divided into the partitions, and the strength of the penalty changes with the current state of the division so as to control the load balance of each partition; the value of the hyperparameter can be adjusted within a preset range to control whether load balance or minimum cut is prioritized in the division;

5. The method according to claim 4, wherein the step (3.2) comprises the steps of:

where |p| is the number of vertices in partition p, k is the number of partitions, and |V| is the number of vertices already divided;

Is a value of (2);

is searched, using binary search, for an element s′ identical to s; if it is found, s′ is added to R and l is set to point to the position following s′; if not, s is set to point to the next element and l is set to point to position l + r + 1;

Is a value of (2);

6. A social networking data analysis system, comprising:

7. The system of claim 6, wherein the vertex access module performs vertex access according to the following steps: (1.1) finding the vertex with the largest degree in the social network graph, visiting that vertex, and sorting its neighbors in descending order of degree; (1.2) visiting the neighbor vertices in order of degree, with higher-degree neighbors visited first; (1.3) marking each vertex as visited when it is visited, checking whether a vertex has already been visited whenever a new vertex is about to be visited, and skipping the vertex if it has; (1.4) after all n-order neighbors of the maximum-degree vertex have been visited, executing step (1.2) and step (1.3) on each n-order neighbor in the order in which the n-order neighbors were visited, thereby visiting the (n+1)-order neighbors of the maximum-degree vertex in the same way, until every vertex in the social network graph has been visited; wherein n represents the number of times step (1.2) and step (1.3) have been repeated.

8. The system of claim 6, wherein the thread allocation module divides all the visited vertices into a plurality of batches in their arrangement order, specifically: storing all visited vertices in an array in visit order, and determining the batch size b from the preset number of threads t and the number of vertices v in the vertex stream;

each thread determines a unique id number, id = 0, 1, 2, 3, ..., and then calculates, from b and id, the start index sid = b × id and the end index eid = (b+1) × id of the vertices it is responsible for dividing; a thread stops processing when the index of the vertex it is processing exceeds the end index eid or falls outside the array range.

9. The system according to claim 6, wherein the vertex partition module divides each vertex into corresponding partitions according to a multi-order attribute aware policy in each thread, specifically: (3.1) each thread traversing its own vertex stream, and for each vertex, traversing all partitions; (3.2) performing calculation of the division score S (v, p) for each vertex v currently being divided and each partition p; wherein the division score S (v, p) is composed of three parts:

a first-order attribute, which considers the number of neighbors of vertex v in partition p;

10. An electronic device, comprising:

at least one memory for storing a program;

at least one processor for executing the memory-stored program, which processor is adapted to perform the method according to any of claims 1-5, when the memory-stored program is executed.