CN116595267A

CN116595267A - Unbalanced social network-oriented graph sampling method

Info

Publication number: CN116595267A
Application number: CN202310635601.9A
Authority: CN
Inventors: 周芳芳; 仇雨湦; 刘亦文; 张楚涵; 武宜韬; 赵颖
Original assignee: Central South University
Current assignee: Central South University
Priority date: 2023-05-31
Filing date: 2023-05-31
Publication date: 2023-08-15
Anticipated expiration: 2043-05-31
Also published as: CN116595267B

Abstract

The invention discloses a graph sampling method for unbalanced social networks, comprising the following steps: step S1, candidate seed node identification: identifying candidate seed nodes in the initial unbalanced social network graph; step S2, seed node screening: deleting the central nodes of communities from the candidate seed nodes and retaining the bridge nodes; step S3, seed node selection: selecting among the bridge nodes obtained in step S2 by means of an optimization function to obtain the starting nodes for graph sampling; and step S4, performing graph sampling from the starting nodes with a degree-guided random walk sampling method to obtain the sampled unbalanced social network graph. The method addresses the problems that, when existing graph sampling methods are applied to an unbalanced social network, community structures are easily lost, the association relations among community structures are destroyed, and the size distribution of community structures is distorted, so that analysts cannot quickly and accurately locate information in the unbalanced social network.

Description

Unbalanced social network-oriented graph sampling method

Technical Field

The invention belongs to the technical field of graph visualization, and particularly relates to a graph sampling method for an unbalanced social network.

Background

A graph is a common means of visualization in which nodes and edges describe objects and the association relations among them in a two-dimensional plane. A social network is a complex graph consisting of many members and the association relations among them. The social relations of members in a social network are often quite complex and exhibit local aggregation: some members are tightly connected and form a community structure, while connections between members of different community structures are sparse. In real-world social networks the community structures formed by closely connected members usually differ in size, which means that real-world social networks are unbalanced. An unbalanced social network graph is a social network graph that contains complex community structures of differing sizes, so a real-world social network can also be called an unbalanced social network.

Within a community structure of an unbalanced social network, the members and their association relations are generally similar, and the community structure can be represented by a small number of member nodes and association relations. Nodes and edges are therefore often redundant, which causes visual clutter.

Existing graph sampling methods can be classified into single-start graph sampling methods and multi-start graph sampling methods. The core idea of single-start graph sampling is to select one starting node in the graph and then visit other nodes by traversal, random walk, or a similar procedure; all visited nodes and the edges connecting them are retained, and all unvisited nodes are removed. From a statistical point of view this gives a good sampling result and preserves many statistical characteristics of the graph, such as the average node degree. Multi-start graph sampling builds on single-start sampling by selecting several starting points, sampling asynchronously from each of them, and finally merging the individual sampling results.

However, these methods have shortcomings. When facing graphs with complex relations, and in particular unbalanced social networks whose community structures differ greatly in size, they cannot balance sampling inside community structures against sampling between community structures. Single-start graph sampling is limited by its single starting point: it is difficult to reach all nodes of the graph, and too many member nodes from the same community tend to be retained, so the sampling result may lose some key community structures. Multi-start graph sampling is limited by disconnected results: the starting points are spread across the graph as widely as possible, which gives better access to the nodes of the graph and reduces the likelihood of losing key community structures, but the merged sampling results of the individual starting points are often disconnected, which destroys the association relations among the community structures of the unbalanced social network. As a consequence, the sampling result cannot faithfully reflect the original unbalanced social network, and the accuracy of subsequent visual analysis suffers.

Therefore, a graph sampling method for unbalanced social networks is needed that removes redundant member nodes and edges while retaining the community structures, the association relations among community structures, the size distribution of community structures, and the related statistical characteristics. Such a method reduces the visual clutter of the unbalanced social network and displays its key information and overall distribution intuitively and clearly, so that analysts can quickly and accurately find core member information, potential social circles, and the association relations among different social circles in a two-dimensional visualization.

Disclosure of Invention

The embodiment of the invention aims to provide a graph sampling method for unbalanced social networks, so as to solve the problems that, after existing graph sampling methods are applied to an unbalanced social network, community structures are easily lost, the association relations among community structures are destroyed, and the size distribution of community structures is distorted, so that analysts cannot quickly and accurately locate information in the unbalanced social network.

In order to solve the technical problems, the technical scheme adopted by the invention is that a graph sampling method facing an unbalanced social network comprises the following steps:

step S1, candidate seed node identification: identifying candidate seed nodes in the initial unbalanced social network graph;

step S2, seed node screening: deleting the central nodes of communities from the candidate seed nodes and retaining the bridge nodes;

step S3, seed node selection: selecting among the bridge nodes obtained in step S2 by means of an optimization function to obtain the starting nodes for graph sampling;

step S4, performing graph sampling from the starting nodes with a degree-guided random walk sampling method to obtain the sampled unbalanced social network graph.

Further, the step S1 specifically includes:

step S11, acquiring real-world unbalanced social network data, representing members by nodes and the association relations among members by edges, and visualizing the data in a two-dimensional plane to obtain the initial unbalanced social network graph;

step S12, calculating the betweenness centrality of all member nodes, taking the average betweenness centrality of all member nodes as the segmentation threshold, and retaining the member nodes whose betweenness centrality is higher than the segmentation threshold as candidate seed nodes.

Further, the betweenness centrality of a member node v in step S12 is calculated by the following formula:

$$\mathrm{Betweenness\_Centrality}(v) = \sum_{s \neq v \neq t} \frac{\mathrm{path}_{st}(v)}{\mathrm{path}_{st}}$$

wherein, for each member node v, any other node s is taken as a start point and any other node t as an end point; path_st denotes the number of paths between node s and node t in the graph, and path_st(v) denotes the number of those paths that pass through node v.

Further, the method for judging the central node in the community in the step S2 specifically includes:

(1) Removing a member node and a direct connection edge thereof in the unbalanced social network;

(2) Detecting whether the direct neighbors of the member node remain connected after the node is removed;

(3) If the direct neighbors of the member node are no longer connected to one another after the member node is removed, the member node is a bridge node; otherwise, the member node is a central node of a community structure.

Further, step S3 specifically includes: calculating, with a greedy strategy, the optimization function value of each bridge node obtained in step S2, and removing the node with the smallest optimization function value in each round, so that the sum of the optimization function values of the remaining nodes stays as large as possible throughout the removal process, until the number of remaining nodes equals the required number of graph sampling starting points;

the optimization function is a weighted average of three optimization indices, wherein w_i is a weight coefficient, and w_1, w_2, w_3 are the weight coefficients of the first optimization index Factor_bc(v), the second optimization index Factor_degree(v), and the third optimization index Factor_community(v), respectively;

wherein the first optimization index Factor_bc(v) is expressed by the formula:

$$\mathrm{Factor}_{bc}(v) = \frac{\mathrm{Betweenness\_Centrality}(v)}{\sum_{u} \mathrm{Betweenness\_Centrality}(u)}$$

wherein Betweenness_Centrality(v) denotes the betweenness centrality of member node v among the bridge nodes obtained in step S2, and the sum over Betweenness_Centrality(u) runs over all bridge nodes obtained in step S2; the value of Factor_bc(v) lies in the interval (0, 1), and the larger the value of the first optimization index, the stronger the ability of member node v to connect multiple community structures;

the second optimization index Factor_degree(v) is expressed by the formula:

$$\mathrm{Factor}_{degree}(v) = \frac{\mathrm{Degree}(v)}{\sum_{u} \mathrm{Degree}(u)}$$

wherein Degree(v) denotes the degree of member node v, and the sum over Degree(u) runs over all bridge nodes obtained in step S2; the value of the second optimization index lies in the interval (0, 1), and the smaller the value, the stronger the ability of the member node to connect small communities;

the third optimization index Factor_community(v) is expressed by the formula:

$$\mathrm{Factor}_{community}(v) = \frac{\mathrm{Seed\_Ratio}(v)}{\sum_{u} \mathrm{Seed\_Ratio}(u)}$$

wherein Seed_Ratio(v) denotes the ratio of the number of bridge nodes among the direct neighbor nodes of member node v to the number of its direct neighbor nodes, and the sum over Seed_Ratio(u) runs over all bridge nodes obtained in step S2; the value of the third optimization index lies in the interval (0, 1), and the smaller the value, the stronger the ability of member node v to act as a bridge-end node.

Further, the step S4 specifically includes:

step S41, calculating the selection interval of all starting points based on the degree of the nodes;

step S42, randomly selecting a sampling starting point from the sampling interval by using a random number;

step S43, randomly selecting any direct neighbor node of the sampling start point for reservation;

step S44, generating an induced subgraph to complete the sampling of the unbalanced social network graph.

Further, the selection interval Select_Interval(v_i) in step S41 is calculated as:

$$\mathrm{Select\_Interval}(v_i) = \left[\, \sum_{j=1}^{i-1} P(v_j),\ \sum_{j=1}^{i} P(v_j) \right), \quad i = 1, \dots, k$$

where k is the total number of starting nodes, v_i denotes the i-th starting node in the set of starting nodes, and P(v_i) denotes the probability that starting node v_i is selected, calculated as:

$$P(v_i) = \frac{\mathrm{Degree}(v_i)}{\sum_{u} \mathrm{Degree}(u)}$$

wherein Degree(v_i) denotes the degree of starting node v_i, and the sum over Degree(u) is the sum of the degrees of all starting nodes obtained in step S3.

Further, step S42 specifically includes: generating a random number in (0, 1), finding the starting node in the starting point set whose selection interval contains the random number, and selecting that starting node as the starting point of graph sampling.

Further, the step S43 specifically includes:

for the starting point selected in step S42, selecting one of its direct neighbor nodes uniformly at random and retaining it; replacing the original starting point in the starting point set with this direct neighbor node as the new starting point; performing step S41 again to recalculate the selection intervals of all nodes in the starting point set; and repeating steps S41 to S43 until the number of retained nodes reaches the number required for graph sampling.

Further, the step S44 specifically includes:

traversing the edges of the original unbalanced social network graph; if both end nodes of an edge were retained in step S43, the edge is retained, otherwise it is discarded; a clearer unbalanced social network graph is finally obtained.

The beneficial effects of the invention are as follows:

The method provided by the invention samples unbalanced social networks accurately and efficiently. It removes redundant member nodes and redundant association relations among members while retaining the information of key members and key association relations. After sampling, the community structures that are common in unbalanced social networks and differ in size are well preserved: no community structure is lost and none is added, the association relations among community structures are effectively retained, and the size distribution of community structures is well preserved. Sampling an unbalanced social network in this way yields a clear and intuitive sample graph, which helps researchers quickly and accurately observe and analyze the key members, potential social circles, and association relations among different social circles.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flowchart of a graph sampling method for unbalanced social networks according to an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

An unbalanced social network contains many members and complex association relations among them. The members of one social circle are often tightly connected and form a community structure, while the connections between community structures formed by members of different social circles are sparse. Graph sampling of the unbalanced social network removes redundant member nodes and edges and retains important information such as the community structures, the association relations among community structures, and the size distribution of community structures, which makes it possible to observe member information, potential social circles, and the association relations among different social circles more clearly and intuitively. Fig. 1 shows the graph sampling method for unbalanced social networks provided by the embodiment of the invention, which specifically includes the following steps:

Step1, candidate seed node identification:

Unbalanced social network data is acquired from a real-world social application; members are represented by nodes and the association relations among members by edges, and the data is visualized in a two-dimensional plane to obtain an initial visual overview of the unbalanced social network. The nodes of the unbalanced social network exhibit local aggregation: members of the same social circle are tightly connected and form a community structure, and the unbalanced social network generally contains several community structures of different sizes. This step identifies the member nodes that can serve as starting points of the sampling method.

Bridge nodes that connect several community structures are chosen as starting points. A member represented by a bridge node lies between community structures that represent different social circles, that is, the member belongs to several social circles of the unbalanced social network. Using bridge nodes as the starting points of graph sampling has two benefits. On the one hand, the nodes and community structures of the graph can be visited and retained more comprehensively. On the other hand, the association relations among community structures are well protected: because the bridge nodes themselves link the community structures, the association relations among the community structures of the sampled network are effectively maintained.

The betweenness centrality is calculated for all member nodes. For each member node v, any other node s is taken as a start point and any other node t as an end point; path_st denotes the number of paths between node s and node t in the graph, that is, the number of hop paths along which node s can reach node t via other nodes, and path_st(v) denotes the number of those paths that contain node v. The betweenness centrality can then be expressed by the following formula:

$$\mathrm{Betweenness\_Centrality}(v) = \sum_{s \neq v \neq t} \frac{\mathrm{path}_{st}(v)}{\mathrm{path}_{st}}$$

The larger the betweenness centrality of a member node, the more other member nodes it connects, which matches the characteristics of a bridge node. Starting graph sampling from nodes with large betweenness centrality allows the nodes of the unbalanced social network to be visited comprehensively, so the sampling result is more representative of the whole graph.

The unbalanced social network has good connectivity, so for any member node, all pairs of two other distinct nodes in the graph can be used to calculate its betweenness centrality. A segmentation threshold on betweenness centrality is defined to divide the member nodes: the average betweenness centrality of all nodes is taken as the threshold, and the member nodes whose betweenness centrality is higher than the threshold are retained as candidate seed nodes. Not all candidate seed nodes are bridge nodes, because a member node inside a social circle may also connect many other nodes; such a node is usually the central node of a community structure and, compared with a bridge node, can even have a negative effect by increasing the redundancy of the sampled graph. Step1 therefore works together with Step2, which screens the starting points and deletes the central nodes of community structures so that only bridge nodes remain.
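The candidate identification of Step1 can be sketched as follows. This is a minimal illustration rather than the patented implementation: it assumes an undirected, unweighted graph stored as a dictionary of neighbour sets, it counts shortest paths (the usual reading of betweenness centrality, using Brandes' accumulation scheme), and the names `betweenness_centrality`, `candidate_seeds` and `demo` are hypothetical.

```python
from collections import deque

def betweenness_centrality(adj):
    """Unnormalised shortest-path betweenness of every node.

    adj maps each node to the set of its direct neighbours (undirected).
    ASSUMPTION: "paths" are read as shortest paths.
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:
        sigma = {v: 0 for v in adj}    # number of shortest s-v paths
        dist = {v: -1 for v in adj}    # hop distance from s, -1 = unseen
        preds = {v: [] for v in adj}   # predecessors on shortest paths
        sigma[s], dist[s] = 1, 0
        order, queue = [], deque([s])
        while queue:                   # breadth-first search from s
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):      # accumulate from the farthest node back
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Every unordered pair {s, t} was visited twice in an undirected graph.
    return {v: c / 2.0 for v, c in bc.items()}

def candidate_seeds(adj):
    """Keep the nodes whose betweenness exceeds the mean over all nodes."""
    bc = betweenness_centrality(adj)
    threshold = sum(bc.values()) / len(bc)
    return {v for v, c in bc.items() if c > threshold}

# Hypothetical toy graph: two triangles joined through node "x".
demo = {
    "a1": {"a2", "a3", "x"}, "a2": {"a1", "a3"}, "a3": {"a1", "a2"},
    "x": {"a1", "b1"},
    "b1": {"b2", "b3", "x"}, "b2": {"b1", "b3"}, "b3": {"b1", "b2"},
}
seeds = candidate_seeds(demo)
```

On this toy graph the connector `x` and the two nodes it attaches to (`a1`, `b1`) exceed the mean and become candidates, while the remaining triangle nodes have zero betweenness and are dropped. Because the threshold is the mean, counting each pair once or twice does not change which nodes are selected.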

Step2, deleting the central nodes of communities:

Step1 yields the candidate starting points of the sampling method. Most of them are bridge nodes, which represent members who belong to several social circles at once and can link members of different social circles. This step keeps those nodes and deletes the central nodes of community structures from the starting points. A central node of a community structure is a member node that lies inside one social circle and has strong social ability. Within the community structure of a single social circle the member nodes are very closely connected, which appears as dense edges inside the community structure, so the remaining members stay closely connected even if the central node is deleted. A bridge node, by contrast, connects several different community structures whose members are only sparsely connected to one another, so deleting the bridge node may leave members of different social circles without any connection. Based on these characteristics, bridge nodes and central nodes of community structures among the starting points can be distinguished by the following steps:

(1) Removing a member node and its directly connected edges from the unbalanced social network;

(2) Detecting whether the direct neighbors of the member node remain connected after the node is removed, that is, whether the other members directly connected to this member can still reach one another without its help;

(3) If the direct neighbors of the member node are no longer connected after the node is removed, the node is a bridge node; otherwise, the node is a central node of a community structure.

The starting point set obtained in Step1 is checked node by node in this way and the central nodes of community structures are deleted. After this step, accurate bridge nodes are obtained. These are ideal starting points and help the sampled graph retain both the community structures of the original graph and the association relations among them.
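The removal test of Step2 can be sketched as a breadth-first search that ignores the removed node. This is an illustrative sketch under stated assumptions: the graph is a dictionary of neighbour sets, "remain connected" is read as "all direct neighbours can still reach one another", and the function names and the toy graph are hypothetical.

```python
from collections import deque

def neighbours_stay_connected(adj, v):
    """True if the direct neighbours of v can still reach one another
    once v and its incident edges are ignored."""
    targets = set(adj[v])
    if len(targets) < 2:
        # ASSUMPTION: a node with fewer than two neighbours cannot
        # separate anything, so it is not treated as a bridge node.
        return True
    start = next(iter(targets))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w != v and w not in seen:   # never walk through v
                seen.add(w)
                queue.append(w)
    return targets <= seen

def split_bridge_and_centre(adj, candidates):
    """Partition the candidates into (bridge nodes, community centres)."""
    bridges = {v for v in candidates if not neighbours_stay_connected(adj, v)}
    return bridges, set(candidates) - bridges

# Hypothetical toy graph: a 4-clique {c, p, q, r} with a tail q - x - y.
demo = {
    "c": {"p", "q", "r"}, "p": {"c", "q", "r"},
    "q": {"c", "p", "r", "x"}, "r": {"c", "p", "q"},
    "x": {"q", "y"}, "y": {"x"},
}
bridges, centres = split_bridge_and_centre(demo, {"c", "q", "x"})
```

Here `c` sits inside the clique, so its neighbours stay connected and it is classified as a centre, whereas removing `q` or `x` cuts the tail off and both are classified as bridge nodes.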

Step3, starting point selection based on multi-objective decision:

Step2 yields accurate bridge nodes, which serve as ideal starting points for the graph sampling method. The bridge nodes obtained in Step2 are further selected according to three sampling objectives: for a given number of starting points, the best combination of starting points is calculated by means of a multi-objective decision optimization function.

First, three indices are defined according to the objectives of graph sampling, as follows:

Step3.1 defines a first optimization index a: the ability of a node to connect multiple community structures;

The previous steps yield the bridge nodes of the unbalanced social network. These nodes usually connect several different community structures, which helps the sampling process reach more nodes of the graph and helps avoid losing community structures after sampling. The more community structures a node connects, the better the community structures are retained. A first optimization index a is defined to measure the ability to connect multiple community structures and can be expressed as:

$$\mathrm{Factor}_{bc}(v) = \frac{\mathrm{Betweenness\_Centrality}(v)}{\sum_{u} \mathrm{Betweenness\_Centrality}(u)}$$

wherein Betweenness_Centrality(v) denotes the betweenness centrality of a member node v among the bridge nodes obtained in Step2, and the sum over Betweenness_Centrality(u) runs over all bridge nodes obtained in Step2, so the value of the index lies in the interval (0, 1); the larger the index value, the stronger the ability of the member node to connect multiple community structures.

Step3.2 defines a second optimization index b: the ability of a node to connect small communities;

In an unbalanced social network the community structures usually differ in size. A large community structure has many nodes and a high average degree, so its nodes are easily visited and retained during sampling. A small community has few nodes and a low average degree, so its nodes are not easily visited and retained, and the small community may be lost after sampling. To ensure that no community structure is lost by sampling, the bridge nodes from Step2 that connect small community structures are the more important ones. An index b is defined to measure the ability to connect small community structures and can be expressed as:

$$\mathrm{Factor}_{degree}(v) = \frac{\mathrm{Degree}(v)}{\sum_{u} \mathrm{Degree}(u)}$$

wherein Degree(v) denotes the degree of member node v and the sum over Degree(u) runs over all bridge nodes obtained in Step2. The index lies in the interval (0, 1); the smaller the index value, the stronger the ability of the member node to connect small communities, and the nodes of small community structures can be reached through this node during sampling so that the small community structures are preserved.

Step3.3 defines a third optimization index c: the ability of a node to act as a bridge-end node;

The bridge nodes obtained in Step2 can be divided into two types. The first type lies inside one community structure and also connects several other community structures; such a member has one primary social circle and is connected to several secondary social circles, and the node is called a bridge-end node. The second type lies between several community structures and does not clearly belong to any of them; the social circles of such a member have no primary or secondary order, the member's social range is wide and scattered, and the member is difficult to assign to the community structure of any one social circle. Such a node is called a mid-bridge node. Among the bridge nodes obtained in Step2, bridge-end nodes are the more important ones, because a bridge-end node directly represents a particular community structure and also helps reach other community structures during sampling, whereas a mid-bridge node plays a similar connecting role but is not tied to a particular community structure. An index c is defined to measure the ability to act as a bridge-end node and can be expressed as:

$$\mathrm{Factor}_{community}(v) = \frac{\mathrm{Seed\_Ratio}(v)}{\sum_{u} \mathrm{Seed\_Ratio}(u)}$$

wherein Seed_Ratio(v) denotes the ratio of the number of bridge nodes among the direct neighbors of node v to the number of its direct neighbors, and the sum over Seed_Ratio(u) runs over all bridge nodes obtained in Step2. Seed_Ratio(v) can be expressed as:

$$\mathrm{Seed\_Ratio}(v) = \frac{|\mathrm{Neighbor}(v) \cap \mathrm{Seeds}|}{|\mathrm{Neighbor}(v)|}$$

wherein Neighbor(v) denotes the set of direct neighbors of node v and Seeds is the set of bridge nodes obtained in Step2. The index c lies in the interval (0, 1); the smaller the index, the more strongly the node is a bridge-end node, because most direct neighbors of the node are then not bridge nodes from Step2, which means the node lies inside or at the edge of a particular community structure.
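The three indices of Step3.1 to Step3.3 are each a per-node quantity divided by its sum over the bridge nodes, which the following sketch makes concrete. It is a minimal illustration, assuming the graph is a dictionary of neighbour sets and that betweenness values are already available; the helper names (`normalise`, `seed_ratio`, `optimisation_indices`) and the toy values are hypothetical.

```python
def normalise(values):
    """Divide every value by the sum over all bridge nodes."""
    total = sum(values.values())
    return {v: (x / total if total else 0.0) for v, x in values.items()}

def seed_ratio(adj, seeds, v):
    """Share of v's direct neighbours that are themselves bridge nodes.
    Bridge nodes have at least two neighbours, so the division is safe."""
    return len(adj[v] & seeds) / len(adj[v])

def optimisation_indices(adj, seeds, bc):
    """Return (Factor_bc, Factor_degree, Factor_community) as dicts."""
    factor_bc = normalise({v: bc[v] for v in seeds})
    factor_degree = normalise({v: len(adj[v]) for v in seeds})
    factor_community = normalise({v: seed_ratio(adj, seeds, v) for v in seeds})
    return factor_bc, factor_degree, factor_community

# Hypothetical toy graph: two triangles joined through node "x".
demo = {
    "a1": {"a2", "a3", "x"}, "a2": {"a1", "a3"}, "a3": {"a1", "a2"},
    "x": {"a1", "b1"},
    "b1": {"b2", "b3", "x"}, "b2": {"b1", "b3"}, "b3": {"b1", "b2"},
}
seeds = {"a1", "x", "b1"}             # bridge nodes assumed from Step2
bc = {"a1": 8.0, "x": 9.0, "b1": 8.0}  # betweenness assumed from Step1

f_bc, f_deg, f_com = optimisation_indices(demo, seeds, bc)
```

In this example `x` has only bridge nodes as neighbours, so its third index is the largest of the three nodes, which matches the description of a mid-bridge node; `a1` and `b1` have mostly ordinary neighbours and get a small third index, as expected for bridge-end nodes.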

Step3.4 defines an optimization function;

Based on the three indices defined in Step3.1, Step3.2 and Step3.3, the optimization function is a weighted average of the three indices, and the weight of each index can be adjusted within (0, 1) according to actual requirements. Given the required number of graph sampling starting points, a greedy strategy is applied to the bridge nodes obtained in Step2: the optimization function value of each node is calculated, and the node with the smallest value is removed in each round, so that the sum of the optimization function values of the remaining nodes stays as large as possible throughout the removal process, until the number of remaining nodes equals the required number of starting points.

In the optimization function, w_i is a weight coefficient, and w_1, w_2, w_3 are the weight coefficients of the three optimization indices; by default the three weights are equal, and they can be adjusted as required. After this step the required number of starting nodes is obtained, and the next step samples the graph starting from these member nodes.
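The greedy removal of Step3.4 can be sketched as below. The text does not reproduce the optimization function itself, so the scoring rule here is an assumption made only for illustration: the first index enters directly (larger is better), while the second and third enter as one minus their value (smaller is described as better). The names `score` and `greedy_select` and the index values are hypothetical.

```python
def score(v, f_bc, f_deg, f_com, w=(1 / 3, 1 / 3, 1 / 3)):
    """ASSUMED combination of the three indices, not the patent's formula:
    indices 2 and 3 are inverted because smaller values are preferred."""
    return (w[0] * f_bc[v]
            + w[1] * (1.0 - f_deg[v])
            + w[2] * (1.0 - f_com[v]))

def greedy_select(nodes, k, f_bc, f_deg, f_com, w=(1 / 3, 1 / 3, 1 / 3)):
    """Drop the lowest-scoring node each round until k nodes remain."""
    remaining = set(nodes)
    while len(remaining) > k:
        # sorted() only makes tie-breaking deterministic.
        worst = min(sorted(remaining),
                    key=lambda v: score(v, f_bc, f_deg, f_com, w))
        remaining.remove(worst)
    return remaining

# Hypothetical index values for three bridge nodes.
f_bc = {"a1": 0.32, "x": 0.36, "b1": 0.32}
f_deg = {"a1": 0.375, "x": 0.25, "b1": 0.375}
f_com = {"a1": 0.2, "x": 0.6, "b1": 0.2}

starts = greedy_select(f_bc, 2, f_bc, f_deg, f_com)
```

With equal weights the mid-bridge node `x` scores lowest because of its large third index and is removed first, leaving the two bridge-end nodes. Since the scores in this sketch do not change between rounds, the loop is equivalent to keeping the k highest-scoring nodes; a variant that recomputes the indices after each removal would need the loop form.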

Step4, degree-guided multi-start biased random walk sampling:

Step3 yields the required number of starting points, from which the graph is sampled. The main steps are as follows:

Step4.1, calculating the selection intervals of all starting points based on node degree;

For each starting point v_i in the starting point set obtained in Step3, the probability P(v_i) of being selected can be expressed as:

$$P(v_i) = \frac{\mathrm{Degree}(v_i)}{\sum_{u} \mathrm{Degree}(u)}$$

wherein Degree(v_i) denotes the degree of the starting node v_i and the sum over Degree(u) is the sum of the degrees of all starting points obtained in Step3. To make it convenient to select one starting point from the set, the selection interval Select_Interval(v_i) of each node is calculated as:

$$\mathrm{Select\_Interval}(v_i) = \left[\, \sum_{j=1}^{i-1} P(v_j),\ \sum_{j=1}^{i} P(v_j) \right), \quad i = 1, \dots, k$$

where k is the total number of starting points and v_i denotes the i-th starting point in the set. The selection intervals of all starting points are calculated and do not overlap one another, so that exactly one starting point is selected in each sampling round.

Step4.2, selecting a sampling starting point by using a random number;

A random number in (0, 1) is generated, the starting point whose selection interval contains the random number is looked up in the starting point set, and that node is selected as the starting point of graph sampling.
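Step4.1 and Step4.2 together amount to a degree-proportional draw over cumulative intervals, sketched below. The half-open interval convention, the function names `selection_intervals` and `pick_start`, and the example degrees are assumptions for illustration; note that Python's `random.random()` returns a value in [0, 1) rather than the open interval (0, 1).

```python
import random
from itertools import accumulate

def selection_intervals(starts, degree):
    """Return [(node, lower, upper), ...] with consecutive, non-overlapping
    intervals whose widths are degree(node) / total degree."""
    total = sum(degree[v] for v in starts)
    uppers = list(accumulate(degree[v] / total for v in starts))
    lowers = [0.0] + uppers[:-1]
    return list(zip(starts, lowers, uppers))

def pick_start(intervals, r):
    """Return the node whose interval [lower, upper) contains r."""
    for v, lower, upper in intervals:
        if lower <= r < upper:
            return v
    return intervals[-1][0]   # guards against rounding at the top end

# Hypothetical starting points with degrees 2, 3 and 5.
starts = ["a", "b", "c"]
degree = {"a": 2, "b": 3, "c": 5}
intervals = selection_intervals(starts, degree)

chosen = pick_start(intervals, random.random())
```

For these degrees the intervals have widths 0.2, 0.3 and 0.5, so a draw of 0.35 falls in the second interval and selects `b`, and the highest-degree start `c` is chosen about half of the time.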

Step4.3, randomly selecting a direct neighbor to retain;

For the starting point selected in Step4.2, one of its direct neighbor nodes is selected uniformly at random and retained, and this direct neighbor replaces the original starting point in the starting point set as a new starting point. Because the starting point set has changed, Step4.1 is performed again to recalculate the selection intervals of all nodes in the set. Step4.1 to Step4.3 are repeated until the number of retained nodes reaches the number required for graph sampling.

Step4.4, generating an induced subgraph;

After Step4.3, the required number of member nodes of the original unbalanced social network has been retained; this step additionally retains the edges among those nodes. The edges of the original unbalanced social network are traversed: if both end nodes of an edge were retained in Step4.3, the edge is retained; otherwise it is discarded.
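The loop of Step4.1 to Step4.3 and the induced subgraph of Step4.4 can be sketched together. This is an illustrative sketch, not the patented implementation: it assumes the starting points themselves count as retained, that every starting point has at least one neighbour, and it adds a `max_steps` guard that the text does not mention. `random.choices` with `weights` performs the same cumulative-interval draw as Step4.1 and Step4.2; all function names and the toy graph are hypothetical.

```python
import random

def degree_guided_walk(adj, starts, target, seed=None, max_steps=100_000):
    """Multi-start random walk; the walker to move is drawn with
    probability proportional to the degree of its current node."""
    rng = random.Random(seed)
    frontier = list(starts)
    kept = set(frontier)   # ASSUMPTION: starting points are retained too
    for _ in range(max_steps):
        if len(kept) >= target:
            break
        # Step4.1 + Step4.2: degree-weighted choice of one start.
        weights = [len(adj[v]) for v in frontier]
        i = rng.choices(range(len(frontier)), weights=weights)[0]
        # Step4.3: uniform choice of a direct neighbour, which is kept
        # and replaces the chosen start (sorted for reproducibility).
        neighbour = rng.choice(sorted(adj[frontier[i]]))
        kept.add(neighbour)
        frontier[i] = neighbour
    return kept

def induced_subgraph(adj, kept):
    """Step4.4: keep an edge only if both of its end nodes were kept."""
    return {v: adj[v] & kept for v in kept}

# Hypothetical toy graph: two triangles joined through node "x".
demo = {
    "a1": {"a2", "a3", "x"}, "a2": {"a1", "a3"}, "a3": {"a1", "a2"},
    "x": {"a1", "b1"},
    "b1": {"b2", "b3", "x"}, "b2": {"b1", "b3"}, "b3": {"b1", "b2"},
}
kept = degree_guided_walk(demo, ["a1", "b1"], target=5, seed=0)
sample = induced_subgraph(demo, kept)
```

Because the weights are recomputed from the current frontier on every iteration, replacing a start by its neighbour automatically updates the selection intervals, which mirrors the instruction to repeat Step4.1 after each move.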

Finally, the retained member nodes and the edges among them form a clearer unbalanced social network in which the community structures of the original graph are well preserved: no community structure is lost, the association relations among communities are not destroyed, and the community size distribution is maintained. The sampled graph reduces the visual clutter caused by excessive nodes and edges, so researchers can observe the structure of the unbalanced social network more quickly, accurately and intuitively and analyze key member information, potential social circles, and the relations among social circles.

In summary, the invention provides a starting point selection method for graph sampling of unbalanced social networks. For a given number of starting points, it selects suitable graph sampling starting points, which helps the sampling process sufficiently visit the member nodes representing different social circles and the community structures they form, so that the structural features of the original graph are well preserved in the sampling result. The invention also provides a sampling process for graph sampling of unbalanced social networks. Given the sampling starting points, this process performs graph sampling by a random walk biased by node degree, so that the progress of the sampling from each of the multiple starting points can be controlled asynchronously, and the statistical distribution features of the original graph are well preserved in the sampling result.

Examples:

Data set preparation:

The mutual-follow data of users of 10 different daily-life sharing apps is selected. The data contains complex mutual-follow information among users; some users follow one another frequently and closely because of shared hobbies, and thereby form social circles.

Visualizing this data as a graph yields a social network containing complex community structures. The community structures in the graph are the social circles of users who follow one another closely because of shared hobbies. Since the number of people sharing each hobby varies greatly, the sizes of the community structures differ greatly, so this social network is a typical unbalanced social network.

The connections among the nodes inside the community structures of the unbalanced social network are complex and redundant. In order to analyze clearly the different social circles contained in the data, so that the apps can later be divided into more topic sections, graph sampling must be performed on the unbalanced social network to reduce redundant nodes and edges without losing the community structures or the edges between community structures.

Algorithm preparation:

The present embodiment selects 19 existing graph sampling methods, covering random-selection, traversal-based, and walk-based sampling methods, and both single-start and multi-start sampling methods. Some of these methods are classical, some are in common use, and some are recent. As far as possible, the parameters of each method use the recommended or default values given in the relevant article.

Experimental index preparation:

In this embodiment, the graph sampling methods are evaluated in two respects: how well they preserve the community structure of the unbalanced social network, and how well they preserve the distribution of its node statistical features. The evaluation indexes are accordingly prepared in two groups.

Three new indexes are designed to evaluate how well a graph sampling method preserves the community structure of the unbalanced social network: a community structure preservation index, an inter-community association preservation index, and a community scale distribution preservation index. The existing indexes and the new indexes are described as follows:

Index 1: the community structure preservation index MCN. It evaluates how well the community structures of the unbalanced social network are preserved by graph sampling; the fewer communities that are newly created or lost, the better the preservation. The index takes a value between 0 and 1, and the closer it is to 0, the better.

Index 2: the inter-community association preservation index MCR. It evaluates how well the association relationships between community structures of the unbalanced social network are preserved by graph sampling; if two community structures in the original graph are joined by connecting edges, the more of those connecting edges remain after sampling, the better the association relationship is preserved. The index takes a value between 0 and 1, and the closer it is to 1, the better.

Index 3: the community scale distribution preservation index MCD. It evaluates how similar the relative sizes of the communities in the unbalanced social network are before and after graph sampling. The index takes a value between 0 and 1, and the closer it is to 0, the better.

Index 4: the degree distribution difference index DD. It compares the node degree distribution of the sampled graph with that of the original graph. The index takes a value between 0 and 1, and the closer it is to 0, the better.

Index 5: the clustering coefficient distribution difference index CCD. It compares the node clustering coefficient distribution of the sampled graph with that of the original graph. The index takes a value between 0 and 1, and the closer it is to 0, the better.

Index 6: the PageRank distribution difference index PRD. It compares the node PageRank distribution of the sampled graph with that of the original graph. The index takes a value between 0 and 1, and the closer it is to 0, the better.
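The exact formulas of the distribution difference indexes are not given here. Purely as an illustration, one common way to obtain a difference value between 0 and 1 for two distributions is the Kolmogorov-Smirnov distance, that is, the largest gap between the two empirical cumulative distribution functions. The Python sketch below applies it to node degrees; the choice of this distance and the names `degrees` and `ks_distance` are assumptions and are not stated to be the DD index used in this embodiment.

```python
from bisect import bisect_right
from collections import Counter


def degrees(edges):
    """Degree of every node that appears in an undirected edge list.
    Isolated nodes do not appear in the list and are ignored in this sketch."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return list(deg.values())


def ks_distance(xs, ys):
    """Largest gap between the empirical CDFs of xs and ys (0 = identical)."""
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    xs, ys = sorted(xs), sorted(ys)
    return max(
        abs(bisect_right(xs, p) / len(xs) - bisect_right(ys, p) / len(ys))
        for p in set(xs) | set(ys)
    )


# Usage: compare the degree distribution of a sampled graph with the original.
original = [(0, 1), (0, 2), (0, 3), (1, 2)]
sampled = [(0, 1), (0, 2), (1, 2)]
difference = ks_distance(degrees(original), degrees(sampled))
```

The same distance could be applied to per-node clustering coefficients or PageRank values in place of degrees.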

Table 1. Sampling evaluation results

Analysis of results: As can be seen from Table 1, the graph sampling method for unbalanced social networks achieves the best value on all three structural evaluation indexes: community structure preservation, inter-community association preservation, and community scale distribution preservation. That is, among all the graph sampling methods compared, the method of the invention best preserves the structure of the unbalanced social network, namely the number of communities, the association relationships between community structures, and the scale distribution of community structures, while reducing the redundant nodes and connecting edges of the network. It therefore performs best at reducing the visual redundancy of the unbalanced social network and at improving the efficiency of analyzing its potential social circles.

Meanwhile, on the statistical feature indexes (degree distribution difference, clustering coefficient distribution difference, and PageRank distribution difference), the method also ranks near the top among the compared methods. This shows that the method also preserves the statistical features of the unbalanced social network well; that is, while reducing the redundant nodes and edges of the unbalanced social network, the graph sampling method maintains its overall statistical properties, such as average age, sex ratio, and regional proportions.

In this specification, the embodiments are described in a related manner; identical or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, since the system embodiments are substantially similar to the method embodiments, they are described relatively briefly, and the relevant parts of the description of the method embodiments may be referred to.

The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims

1. The graph sampling method for the unbalanced social network is characterized by comprising the following steps of:

2. The graph sampling method for unbalanced social networks according to claim 1, wherein the step S1 is specifically:

3. The graph sampling method for unbalanced social networks according to claim 2, wherein the betweenness centrality of the member node v in the step S12 is calculated by the following formula:

4. The graph sampling method for the unbalanced social network according to claim 1, wherein the judging method of the center node in the community in step S2 specifically comprises:

5. The graph sampling method for unbalanced social networks according to claim 1, wherein the step S3 specifically includes: adopting a greedy strategy in which an optimization function value is calculated for each of the bridge nodes obtained in the step S2 and the node with the minimum optimization function value is removed in each round, so that the total sum of the optimization function values remains maximal throughout the removal process, until the number of remaining nodes meets the required number of graph sampling starting points;

the calculation formula of the optimization function is as follows:

wherein the first optimization index Factor_bc(v) is expressed by the formula:

the second optimization index Factor_degree(v) is expressed by the formula:

the third optimization index Factor_community(v) is expressed by the formula:

6. The graph sampling method for unbalanced social networks according to claim 1, wherein the step S4 specifically includes:

7. The graph sampling method for unbalanced social networks according to claim 6, wherein the selection interval select_interval(v_i) in the step S41 is calculated by the following formula:

8. The graph sampling method for unbalanced social networks according to claim 6, wherein the step S42 is specifically: generating a random number in the interval (0, 1), determining which starting point node in the starting point set has a selection interval containing the random number, and selecting that starting point node as the starting point of the graph sampling.

9. The graph sampling method for unbalanced social networks according to claim 6, wherein the step S43 specifically includes:

10. The graph sampling method for unbalanced social networks of claim 6, wherein the step S44 specifically includes: