CN116595267A - Unbalanced social network-oriented graph sampling method - Google Patents

Info

Publication number
CN116595267A
CN116595267A CN202310635601.9A CN202310635601A CN116595267A CN 116595267 A CN116595267 A CN 116595267A CN 202310635601 A CN202310635601 A CN 202310635601A CN 116595267 A CN116595267 A CN 116595267A
Authority
CN
China
Prior art keywords
node
nodes
unbalanced
graph
social network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310635601.9A
Other languages
Chinese (zh)
Other versions
CN116595267B (en
Inventor
周芳芳
仇雨湦
刘亦文
张楚涵
武宜韬
赵颖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University
Priority to CN202310635601.9A
Publication of CN116595267A
Application granted
Publication of CN116595267B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9536 Search customisation based on social or collaborative filtering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/904 Browsing; Visualisation therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/70 Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • Health & Medical Sciences (AREA)
  • Economics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a graph sampling method for unbalanced social networks, comprising the following steps: step S1, identifying candidate seed nodes: identifying candidate seed nodes from the initial unbalanced social network graph; step S2, screening seed nodes: deleting the community center nodes from the candidate seed nodes and retaining the bridge nodes; step S3, selecting seed nodes: selecting among the bridge nodes obtained in step S2 by means of an optimization function to obtain the starting point nodes for graph sampling; step S4, performing graph sampling from the starting point nodes using a degree-guided random walk sampling method to obtain the sampled unbalanced social network graph. The method addresses the problems that, when existing graph sampling methods are applied to an unbalanced social network, community structures are easily lost, the association relations among community structures are easily destroyed, and the size distribution of community structures is easily distorted, so that analysts cannot quickly and accurately locate information in the unbalanced social network.

Description

Unbalanced social network-oriented graph sampling method
Technical Field
The invention belongs to the technical field of graph visualization, and particularly relates to a graph sampling method for an unbalanced social network.
Background
A graph is a common visualization means in which nodes and edges describe objects and the association relations among them in a two-dimensional plane. A social network is a complex graph consisting of many members and the association relations among them. The social relations of members in a social network are often quite complex and exhibit local aggregation: some members are tightly connected and form a community structure, while connections between members of different community structures are sparse. In real-world social networks, the community structures formed by tightly connected members usually differ in size, which means that real-world social networks are unbalanced. An unbalanced social network graph is a social network graph that contains complex community structures of differing sizes, so a real-world social network can also be called an unbalanced social network.
Members and member association relations within one community structure of an unbalanced social network are generally similar, and the community structure can be represented by a small number of member nodes and association relations. Node and edge redundancy therefore often exists and causes visual clutter.
There have been a number of studies on graph sampling methods, which can be classified into single-start graph sampling methods and multi-start graph sampling methods. The core idea of the single-start graph sampling method is to select one starting point node in the graph and then visit other nodes by statistical means such as traversal or random walk; all visited nodes and the edges connecting them are retained, and all unvisited nodes are removed. From a statistical point of view this yields a good sampling result and preserves many statistical characteristics of the graph, such as the average node degree. The multi-start graph sampling method selects several starting points on the basis of single-start graph sampling, performs graph sampling asynchronously from these starting points, and finally merges the resulting samples.
However, these methods have disadvantages. When facing graphs with complex relations, and in particular unbalanced social networks whose community structures differ greatly in size, they cannot balance sampling inside community structures against sampling between community structures. The single-start graph sampling method is limited by its single starting point: it has difficulty visiting all nodes in the graph and tends to retain too many member nodes inside the same community, so the sampling result can lose key community structures. The multi-start graph sampling method is limited by disconnected results: its starting points are spread across the graph as widely as possible, which lets it visit the nodes of the graph more thoroughly and reduces the chance of losing key community structures, but the merged sampling results of the individual starting points are often disconnected, which destroys the association relations among the community structures of the unbalanced social network. As a result, the sampled graph cannot faithfully reflect the original unbalanced social network, and the accuracy of subsequent visual analysis suffers.
Therefore, a graph sampling method for unbalanced social networks is needed that removes redundant member nodes and connecting edges while retaining the community structures, the association relations among community structures, the size distribution of community structures, and the related statistical characteristics. Such a method reduces the visual clutter of the unbalanced social network and displays its key information and overall distribution intuitively and clearly, so that analysts can quickly and accurately find core member information, potential social circles, and the association relations among different social circles in a two-dimensional visualization of the unbalanced social network.
Disclosure of Invention
The embodiment of the invention aims to provide a graph sampling method for unbalanced social networks, in order to solve the problems that, after existing graph sampling methods are applied to an unbalanced social network, community structures are easily lost, the association relations among community structures are easily destroyed, and the size distribution of community structures is easily distorted, so that analysts cannot quickly and accurately locate information in the unbalanced social network.
In order to solve the technical problems, the technical scheme adopted by the invention is that a graph sampling method facing an unbalanced social network comprises the following steps:
step S1, identifying candidate seed nodes: identifying candidate seed nodes from the initial unbalanced social network graph;
step S2, screening seed nodes: deleting the community center nodes from the candidate seed nodes and retaining the bridge nodes;
step S3, selecting seed nodes: selecting among the bridge nodes obtained in step S2 by means of an optimization function to obtain the starting point nodes for graph sampling;
step S4, performing graph sampling from the starting point nodes using a degree-guided random walk sampling method to obtain the sampled unbalanced social network graph.
Further, the step S1 specifically includes:
step S11, acquiring real-world unbalanced social network data, representing members by nodes and the association relations among members by connecting edges, and visualizing the data in a two-dimensional plane to obtain an initial unbalanced social network graph;
step S12, calculating the betweenness centrality of all member nodes, taking the average betweenness centrality of all member nodes as the division threshold, and retaining the member nodes whose betweenness centrality is higher than the division threshold as candidate seed nodes.
Further, the betweenness centrality of the member node v in step S12 is calculated by the following formula:
$$\mathrm{Betweenness\_Centrality}(v) = \sum_{s \neq v \neq t} \frac{\mathrm{Path}_{st}(v)}{\mathrm{Path}_{st}}$$
wherein, for each member node v, any other node s is selected as a start point and any other node t as an end point; $\mathrm{Path}_{st}$ represents the number of paths between node s and node t in the graph, and $\mathrm{Path}_{st}(v)$ represents the number of those paths that include node v.
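The following is a minimal sketch of step S12, not the patented implementation. It assumes the graph is undirected and unweighted, stored as a hypothetical `adj` dictionary mapping each node to the set of its neighbors, and it assumes that "paths" means shortest paths, as in the standard definition of betweenness centrality (the text above only says "paths"). The function names are illustrative.

```python
from collections import deque


def betweenness_centrality(adj):
    """Shortest-path betweenness (Brandes' algorithm) for an undirected,
    unweighted graph given as {node: set_of_neighbors}."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s: count shortest paths (sigma)
        # and record shortest-path predecessors.
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies from the farthest nodes back to s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # In an undirected graph every (s, t) pair was counted twice.
    return {v: c / 2 for v, c in bc.items()}


def candidate_seeds(adj):
    """Keep nodes whose betweenness exceeds the mean over all nodes."""
    bc = betweenness_centrality(adj)
    threshold = sum(bc.values()) / len(bc)
    return {v for v, c in bc.items() if c > threshold}
```

On a toy graph of two triangles joined through a single middle node, the middle node and the two triangle corners attached to it exceed the mean and become candidates, while the remaining triangle nodes score zero.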
Further, the method for identifying a community center node in step S2 specifically includes:
(1) Removing a member node and its directly connected edges from the unbalanced social network;
(2) Detecting whether the direct neighbors of the member node remain connected to one another after the node is removed;
(3) If the direct neighbors of the member node are no longer connected after the node is removed, the member node is a bridge node; otherwise, the member node is a center node of a community structure.
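The three-step test above can be sketched as follows. This is an illustrative sketch under stated assumptions: the graph is a hypothetical `adj` dictionary mapping each node to a set of neighbors, "connected" is read as "every direct neighbor can still reach every other direct neighbor", and a node with fewer than two neighbors is treated as not being a bridge node (the text does not cover that case).

```python
from collections import deque


def is_bridge_node(adj, v):
    """True if the direct neighbors of v can no longer all reach each
    other once v and its edges are removed from the graph."""
    neighbors = adj[v]
    if len(neighbors) < 2:
        # Assumption: a node with 0 or 1 neighbors cannot separate anything.
        return False
    start = next(iter(neighbors))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w != v and w not in seen:  # skipping v simulates its removal
                seen.add(w)
                queue.append(w)
    # v is a bridge node if at least one neighbor was not reached.
    return not neighbors <= seen


def screen_bridge_nodes(adj, candidates):
    """Drop community center nodes from the candidates, keep bridge nodes."""
    return {v for v in candidates if is_bridge_node(adj, v)}
```

A single breadth-first search from one neighbor is enough, because the neighbors are mutually connected exactly when they all fall into the same connected component of the graph without `v`.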
Further, step S3 specifically includes: adopting a greedy strategy, calculating the optimization function value of each bridge node obtained in step S2, and removing the node with the minimum optimization function value in each round, so that the total sum of the optimization function values stays as large as possible during removal, until the number of remaining nodes meets the required number of graph sampling starting points;
the calculation formula of the optimization function is as follows:
wherein $w_i$ is a weight coefficient, and $w_1$, $w_2$, $w_3$ respectively represent the weight coefficients of the first optimization index $\mathrm{Factor}_{bc}(v)$, the second optimization index $\mathrm{Factor}_{degree}(v)$, and the third optimization index $\mathrm{Factor}_{community}(v)$;
wherein the first optimization index $\mathrm{Factor}_{bc}(v)$ is expressed by the formula:
$$\mathrm{Factor}_{bc}(v) = \frac{\mathrm{Betweenness\_Centrality}(v)}{\sum_{u} \mathrm{Betweenness\_Centrality}(u)}$$
wherein $\mathrm{Betweenness\_Centrality}(v)$ represents the betweenness centrality of member node v among the bridge nodes obtained in step S2, and $\sum_{u} \mathrm{Betweenness\_Centrality}(u)$ represents the sum of the betweenness centralities of all bridge nodes obtained in step S2; the value of $\mathrm{Factor}_{bc}(v)$ lies in the interval (0, 1], and the larger the value of the first optimization index, the stronger the ability of member node v to connect multiple community structures;
the second optimization index $\mathrm{Factor}_{degree}(v)$ is expressed by the formula:
$$\mathrm{Factor}_{degree}(v) = \frac{\mathrm{Degree}(v)}{\sum_{u} \mathrm{Degree}(u)}$$
wherein $\mathrm{Degree}(v)$ represents the node degree of member node v, and $\sum_{u} \mathrm{Degree}(u)$ represents the sum of the node degrees of all bridge nodes obtained in step S2; the value of the second optimization index lies in the interval (0, 1), and the smaller its value, the stronger the ability of the member node to connect small communities;
the third optimization index $\mathrm{Factor}_{community}(v)$ is expressed by the formula:
$$\mathrm{Factor}_{community}(v) = \frac{\mathrm{Seed\_Ratio}(v)}{\sum_{u} \mathrm{Seed\_Ratio}(u)}$$
wherein $\mathrm{Seed\_Ratio}(v)$ represents the ratio of the number of bridge nodes among the direct neighbors of member node v to the total number of its direct neighbors, and $\sum_{u} \mathrm{Seed\_Ratio}(u)$ represents the sum of $\mathrm{Seed\_Ratio}$ over all bridge nodes obtained in step S2; the value of the third optimization index lies in the interval (0, 1), and the smaller its value, the more strongly member node v acts as a bridge end node.
Further, the step S4 specifically includes:
step S41, calculating the selection intervals of all starting points based on node degree;
step S42, randomly selecting a sampling starting point from the selection intervals by using a random number;
step S43, randomly selecting one direct neighbor node of the sampling starting point and retaining it;
step S44, generating the induced subgraph to complete the sampling of the unbalanced social network graph.
Further, the selection interval $\mathrm{Select\_Interval}(v_i)$ in step S41 is calculated by the formula:
where k is the total number of starting point nodes, $v_i$ represents the i-th starting point node in the starting point set, and $P(v_i)$ represents the probability that the starting point node $v_i$ is selected; $P(v_i)$ is calculated by the formula:
$$P(v_i) = \frac{\mathrm{Degree}(v_i)}{\sum_{u} \mathrm{Degree}(u)}$$
wherein $\mathrm{Degree}(v_i)$ represents the degree of the starting point node $v_i$, and $\sum_{u} \mathrm{Degree}(u)$ is the sum of the degrees of all starting point nodes obtained in step S3.
Further, step S42 specifically includes: generating a random number in (0, 1), determining which starting point node in the starting point set has a selection interval that contains the random number, and selecting that starting point node as the starting point of graph sampling.
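Steps S41 and S42 amount to degree-proportional (roulette-wheel) selection. The sketch below is illustrative only. The selection-interval formula itself is not reproduced in this text, so the code assumes the usual construction in which the unit interval is cut into consecutive half-open sub-intervals whose lengths are the probabilities $P(v_i)$. The `adj` dictionary (node to set of neighbors) and all function names are hypothetical.

```python
import random
from itertools import accumulate


def selection_intervals(adj, starts):
    """Assumed construction: consecutive sub-intervals [low, high) of [0, 1)
    whose lengths are P(v_i) = Degree(v_i) / sum of starting point degrees."""
    starts = list(starts)
    degrees = [len(adj[v]) for v in starts]
    total = sum(degrees)  # assumed non-zero: starting points have neighbors
    uppers = list(accumulate(d / total for d in degrees))
    lowers = [0.0] + uppers[:-1]
    return list(zip(starts, lowers, uppers))


def pick_start(intervals, r):
    """Return the starting point whose interval contains the number r."""
    for v, low, high in intervals:
        if low <= r < high:
            return v
    return intervals[-1][0]  # guards against round-off at the top end


def random_start(adj, starts, rng=random):
    """Draw one starting point with probability proportional to its degree."""
    return pick_start(selection_intervals(adj, starts), rng.random())
```

For starting points with degrees 2, 3 and 3, the intervals are [0, 0.25), [0.25, 0.625) and [0.625, 1.0), so a higher-degree starting point is proportionally more likely to be drawn.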
Further, the step S43 specifically includes:
For the starting point selected in step S42, one of its direct neighbor nodes is selected uniformly at random and retained; this direct neighbor node then replaces the original starting point in the starting point set as a new starting point, and step S41 is carried out again to recalculate the selection intervals of all nodes in the starting point set. Steps S41-S43 are repeated until the number of retained nodes reaches the number required for graph sampling.
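The loop over steps S41 to S43 can be sketched as below, as an illustration under assumptions rather than the patented implementation. The degree-proportional draw uses `random.Random.choices` with weights, which is equivalent to looking a random number up in degree-based intervals. The text does not say whether the initial starting points are themselves retained, so this sketch retains only the visited neighbors. The `max_steps` cap is a hypothetical safeguard against a target size that cannot be reached.

```python
import random


def degree_guided_walk(adj, starts, target_size, rng=None, max_steps=100_000):
    """Multi-start random walk: pick a walker with probability proportional
    to its degree, move it to a uniformly chosen neighbor, retain that
    neighbor, and repeat until target_size nodes are retained."""
    rng = rng or random.Random()
    starts = list(starts)
    kept = set()
    steps = 0
    while len(kept) < target_size and steps < max_steps:
        steps += 1
        # S41/S42: degree-proportional choice among the current starting points.
        weights = [len(adj[v]) for v in starts]
        i = rng.choices(range(len(starts)), weights=weights)[0]
        # S43: uniform choice among direct neighbors (sorted so that a
        # seeded generator gives reproducible results).
        nxt = rng.choice(sorted(adj[starts[i]]))
        kept.add(nxt)
        starts[i] = nxt  # the neighbor replaces the old starting point
    return kept
```

Because the chosen neighbor replaces its starting point, the weights are recomputed in every iteration, which mirrors the repeated execution of step S41.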
Further, the step S44 specifically includes:
traversing the connecting edges of the original unbalanced social network graph: if both end nodes of a connecting edge were retained in step S43, the edge is retained; otherwise, it is discarded. The result is a clearer unbalanced social network graph.
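The edge-filtering rule of step S44 is the standard induced-subgraph construction. A minimal sketch, assuming the original graph is available as a hypothetical list of `(u, v)` edge tuples and `kept` is the set of nodes retained by the walk:

```python
def induced_subgraph(edges, kept):
    """Keep an edge only if both of its end nodes were retained."""
    kept = set(kept)
    return [(u, v) for u, v in edges if u in kept and v in kept]
```

Retained nodes that end up with no retained edge stay in the sample as isolated nodes; whether such nodes should be drawn is not specified in the text.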
The beneficial effects of the invention are as follows:
The method provided by the invention samples unbalanced social networks accurately and efficiently. It removes redundant member nodes and redundant association relations among members while retaining the key members and key association relations. After graph sampling, the unbalanced community structures that are common in unbalanced social networks are well preserved: community structures are neither lost nor added during sampling, the association relations among community structures are effectively retained, and the size distribution of community structures is better preserved. Sampling an unbalanced social network with this method yields a clear and intuitive sample graph, which helps researchers quickly and accurately observe and analyze the key members, the potential social circles, and the association relations among different social circles.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a graph sampling method for unbalanced social networks according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
An unbalanced social network contains many members with complex association relations among them. Members of one social circle are often tightly connected and form a community structure, while the connections between community structures formed by members of different social circles are sparse. Graph sampling of the unbalanced social network removes redundant member nodes and association edges while retaining important information such as the community structures, the association relations among community structures, and the size distribution of community structures, which makes it easier to observe member information, potential social circles, and the association relations among different social circles clearly and intuitively. Fig. 1 shows the graph sampling method for unbalanced social networks provided by the embodiment of the invention, which specifically includes the following steps:
Step1 candidate seed node identification:
Unbalanced social network data of a real-world social application is acquired, members are represented by nodes and the association relations among members by connecting edges, and the data is visualized in a two-dimensional plane to obtain an initial visual overview of the unbalanced social network. The nodes of the unbalanced social network exhibit local aggregation: members of the same social circle are tightly connected and form a community structure, and an unbalanced social network generally contains multiple community structures of different sizes. This step identifies the member nodes that can serve as starting points of the sampling method.
Bridge nodes that connect multiple community structures are selected as starting points. The member represented by a bridge node sits between community structures that represent different social circles, that is, the member belongs to several social circles of the unbalanced social network at once. Taking bridge nodes as the starting points of graph sampling has two benefits. On the one hand, the nodes and community structures in the graph can be visited and retained more comprehensively. On the other hand, the association relations among community structures are well protected: because bridge nodes are the very nodes that link community structures, these association relations are effectively maintained in the sampled graph.
The betweenness centrality is calculated for all member nodes. For each member node v, any other node s is selected as a start point and any other node t as an end point. $\mathrm{Path}_{st}$ represents the number of paths between node s and node t in the graph, that is, the number of hop paths along which node s can reach node t via other nodes, and $\mathrm{Path}_{st}(v)$ represents the number of those hop paths that pass through node v. The betweenness centrality can be expressed by the following formula:
$$\mathrm{Betweenness\_Centrality}(v) = \sum_{s \neq v \neq t} \frac{\mathrm{Path}_{st}(v)}{\mathrm{Path}_{st}}$$
The larger the betweenness centrality of a member node, the more other member nodes it links, which matches the characteristics of a bridge node. Starting graph sampling from nodes with large betweenness centrality allows the nodes of the unbalanced social network to be visited comprehensively, so the sampling result is more representative of the whole graph.
The unbalanced social network has good connectivity, so for any member node, every pair of two other distinct nodes in the graph can be used to calculate its betweenness centrality. A betweenness centrality division threshold is defined to divide the member nodes: the average betweenness centrality of all nodes is taken as the division threshold, and member nodes whose betweenness centrality is higher than the threshold are retained as candidate seed nodes. Not all candidate seed nodes are bridge nodes, because a member node inside a social circle may also connect to many other nodes. Such a node is often the center node of a community structure, and compared with bridge nodes it can even have a negative effect by increasing redundancy in the graph sample. Step1 therefore needs to work together with Step2, which screens the starting points and deletes the community center nodes so that only bridge nodes are retained.
Step2 deleting community center nodes:
Step1 yields candidate starting points for the sampling method. Most of them are bridge nodes, which represent members who belong to several social circles at once and can link members of different social circles. This step retains these nodes and deletes the community center nodes from the candidate starting points. The center node of a community structure is a member node that lies inside one social circle and has strong social ability. Within the community structure that represents one social circle, member nodes are very closely connected, which shows as dense connecting edges inside the community structure, and if the center node is deleted, the remaining members are still closely connected. Bridge nodes, in contrast, connect several different community structures, and the connections between members of different community structures are sparse; if a bridge node is deleted, members of different social circles may no longer be connected. Based on these characteristics, the bridge nodes and the community center nodes among the starting points can be distinguished by the following steps:
(1) Removing a member node and a direct connection edge thereof in the unbalanced social network;
(2) Detecting whether the direct neighbor of the member node remains connected after the node is removed, namely detecting whether other members directly connected with the member node can be connected with each other without the help of the member;
(3) If the direct neighbors of the member node are no longer connected after the node is removed, the node is a bridge node; otherwise, the node is a center node of a community structure.
With this method, the starting point set obtained in Step1 is checked node by node and the community center nodes are deleted. After this step, only bridge nodes remain. They are ideal starting points and help the sampled graph retain the community structures of the original graph and the association relations among them.
Step3 starting point selection based on multi-objective decision:
The bridge nodes obtained in Step2 are ideal starting points for the graph sampling method. They are further selected according to three sampling objectives: for a given number of starting points, the best combination of starting points is calculated by means of a multi-objective decision optimization function.
First, three indices are defined to reflect the multiple objectives of graph sampling. The indices are defined as follows:
Step3.1 defines a first optimization index a: the ability of a node to connect multiple community structures;
The bridge nodes of the unbalanced social network have been obtained in the previous steps. These nodes often connect several different community structures, which helps the sampling process visit more nodes in the graph and helps prevent community structures from being lost after sampling. The more community structures a node connects, the more it helps retain them. The first optimization index a is defined to measure the ability to connect multiple community structures, and it can be expressed as follows:
$$\mathrm{Factor}_{bc}(v) = \frac{\mathrm{Betweenness\_Centrality}(v)}{\sum_{u} \mathrm{Betweenness\_Centrality}(u)}$$
wherein $\mathrm{Betweenness\_Centrality}(v)$ represents the betweenness centrality of a member node v among the bridge nodes obtained in Step2, and $\sum_{u} \mathrm{Betweenness\_Centrality}(u)$ represents the sum of the betweenness centralities of all bridge nodes obtained in Step2, so that the value of the index lies in the (0, 1) interval. The larger the index value, the stronger the ability of the member node to connect multiple community structures.
Step3.2 defines a second optimization index b: the ability of a node to connect small communities;
In an unbalanced social network, community structures often differ in size. A large community structure has many nodes and a large average degree, so its nodes are easily visited and retained during sampling. A small community has few nodes and a small average degree, so its nodes are not easily visited or retained during sampling, and the small community may be lost after sampling. To keep the community structures of the unbalanced social network from being lost by sampling, the bridge nodes from Step2 that connect small community structures are the more important ones. The index b is defined to measure the ability to connect small community structures, and it can be expressed as follows:
$$\mathrm{Factor}_{degree}(v) = \frac{\mathrm{Degree}(v)}{\sum_{u} \mathrm{Degree}(u)}$$
The index lies in the (0, 1) interval. The smaller the index value, the stronger the ability of the member node to connect small communities; through such a node, the nodes of a small community structure can be visited during graph sampling, so that the small community structure is preserved.
Step3.3 defines a third optimization index c: the degree to which a node is a bridge end node;
The bridge nodes obtained in Step2 can be divided into two types. The first type lies inside one community structure and also connects several other community structures; such a member has one main social circle and is additionally connected to several secondary social circles, and the node is called a bridge end node. The second type lies between several community structures and does not clearly belong to any of them; the social circles of such a member are not divided into primary and secondary, the member's social range is relatively wide and dispersed, and the member is hard to assign to the community structure of any one social circle. Such a node is called a mid-bridge node. Among the bridge nodes obtained in Step2, the bridge end nodes are the more important ones, because such a node directly represents a particular community structure and also helps the sampling process reach other community structures. A mid-bridge node functions similarly to a bridge end node but, unlike it, is not tied to any particular community structure. The index c is defined to measure the degree to which a node is a bridge end node, and it can be expressed as follows:
$$\mathrm{Factor}_{community}(v) = \frac{\mathrm{Seed\_Ratio}(v)}{\sum_{u \in \mathrm{Seeds}} \mathrm{Seed\_Ratio}(u)}$$
wherein $\mathrm{Seed\_Ratio}(v)$ represents the ratio of the number of bridge nodes among the direct neighbors of the node to the total number of its direct neighbors, and $\sum_{u \in \mathrm{Seeds}} \mathrm{Seed\_Ratio}(u)$ represents the sum of $\mathrm{Seed\_Ratio}$ over all bridge nodes obtained in Step2. $\mathrm{Seed\_Ratio}(v)$ can be expressed as:
$$\mathrm{Seed\_Ratio}(v) = \frac{|\mathrm{Neighbor}(v) \cap \mathrm{Seeds}|}{|\mathrm{Neighbor}(v)|}$$
$\mathrm{Neighbor}(v)$ represents the set of direct neighbors of the node, and $\mathrm{Seeds}$ is the set of bridge nodes obtained in Step2. The index c lies in the (0, 1) interval. The smaller the index, the more strongly the node acts as a bridge end node, because most of its direct neighbors are then not among the bridge nodes obtained in Step2, which means the node lies inside or at the edge of a particular community structure.
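The three indices can be computed together, as in the following illustrative sketch. It assumes a hypothetical `adj` dictionary (node to set of neighbors), a set `bridge_nodes` from Step2, and a dictionary `bc` of precomputed betweenness centrality values. Returning 0.0 when a denominator is zero is a choice made here for robustness and is not taken from the text.

```python
def optimization_indices(adj, bridge_nodes, bc):
    """Return {v: (factor_bc, factor_degree, factor_community)} for every
    bridge node, each value normalised by its sum over all bridge nodes."""
    seeds = set(bridge_nodes)
    degree = {v: len(adj[v]) for v in seeds}
    # Seed_Ratio(v) = |Neighbor(v) & Seeds| / |Neighbor(v)|
    seed_ratio = {v: len(adj[v] & seeds) / len(adj[v]) for v in seeds}

    total_bc = sum(bc[v] for v in seeds)
    total_degree = sum(degree.values())
    total_ratio = sum(seed_ratio.values())

    def share(value, total):
        # Assumption: define the share as 0.0 when the total is zero.
        return value / total if total else 0.0

    return {
        v: (
            share(bc[v], total_bc),
            share(degree[v], total_degree),
            share(seed_ratio[v], total_ratio),
        )
        for v in seeds
    }
```

For example, a node whose two neighbors are both bridge nodes has a Seed_Ratio of 1 and therefore a comparatively large third index, which marks it as a mid-bridge node rather than a bridge end node.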
Step3.4, defining an optimization function;
The optimization function is a weighted average of the three indexes defined in Step3.1, Step3.2 and Step3.3. The weight of each index can be adjusted within the range (0, 1) according to actual requirements. Given a required number of graph sampling starting points, a greedy strategy is adopted: the optimization function value is calculated for the bridge nodes obtained in Step2, and the node with the smallest value is removed in each round, so that the sum of the optimization function values of the remaining nodes stays as large as possible throughout the removal process. Removal continues until the number of remaining nodes equals the required number of graph sampling starting points. The calculation formula of the optimization function in this step is as follows:
where w_i is a weight coefficient, and w_1, w_2 and w_3 are the weight coefficients of the three optimization indexes. By default the three weight coefficients are equal, and they can be adjusted as required. After this step a given number of starting nodes is obtained, and the next step starts sampling the graph from these member nodes.
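The greedy removal can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the formula of the invention: the text here does not reproduce the exact optimization function, so the sketch assumes that the two smaller-is-better indexes (degree and index c) enter the weighted sum as one minus their value, and that the indexes are re-normalised over the remaining nodes in every round. The names `normalise`, `make_score` and `greedy_select` are hypothetical.

```python
def normalise(raw, nodes):
    """Divide each node's raw value by the sum over the current node set."""
    total = sum(raw[u] for u in nodes)
    return {u: (raw[u] / total if total else 0.0) for u in nodes}


def make_score(bc, degree, ratio, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Build a score function from raw betweenness, degree and Seed_Ratio values."""
    w1, w2, w3 = weights

    def score(v, remaining):
        f_bc = normalise(bc, remaining)[v]
        f_degree = normalise(degree, remaining)[v]
        f_community = normalise(ratio, remaining)[v]
        # ASSUMPTION: smaller-is-better indexes are inverted so that a larger
        # score always means a better starting point.
        return w1 * f_bc + w2 * (1 - f_degree) + w3 * (1 - f_community)

    return score


def greedy_select(candidates, score, k):
    """Repeatedly drop the lowest-scoring node until only k nodes remain."""
    remaining = set(candidates)
    while len(remaining) > k:
        # sorted() only makes tie-breaking deterministic.
        worst = min(sorted(remaining), key=lambda v: score(v, remaining))
        remaining.remove(worst)
    return remaining


# Hypothetical raw values for three bridge nodes.
bc = {"a": 10.0, "b": 1.0, "c": 5.0}
degree = {"a": 2, "b": 8, "c": 4}
ratio = {"a": 0.1, "b": 0.9, "c": 0.5}
score = make_score(bc, degree, ratio)
two_starts = greedy_select(bc, score, 2)
one_start = greedy_select(bc, score, 1)
```

Passing the score as a function keeps the greedy loop independent of the exact weighting, so a different combination of the three indexes can be substituted without changing `greedy_select`.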
Step4, degree-guided multi-start biased random walk sampling:
Step3 yields a number of starting points for graph sampling, and the graph is sampled from these starting points. The main steps are as follows:
Step4.1, calculating the selection interval of every starting point based on node degree;
For each starting point in the set of starting points obtained in Step3, the probability P(v_i) of being selected is calculated as:
$$P(v_i) = \frac{\mathrm{Degree}(v_i)}{\sum_{u} \mathrm{Degree}(u)}$$
where Degree(v_i) is the degree of starting point v_i, and ΣDegree(u) is the sum of the degrees of all starting points obtained in Step3. To make it convenient to select one starting point from the set of starting points, the selection interval SelectInterval(v_i) of each node is calculated as:
$$\mathrm{SelectInterval}(v_i) = \Big[\sum_{j=1}^{i-1} P(v_j),\ \sum_{j=1}^{i} P(v_j)\Big), \quad 1 \le i \le k$$
where k is the total number of starting points and v_i is the i-th starting point in the set of starting points. The selection intervals of all starting points are calculated; they do not overlap, so that exactly one starting point is selected in each round of graph sampling.
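A minimal Python sketch of Step4.1 is given below. It assumes that the selection intervals are the consecutive cumulative-probability ranges of the starting points, taken as half-open [low, high); the function name `selection_intervals` and the input format (a list of node and degree pairs) are choices made for this example.

```python
from itertools import accumulate


def selection_intervals(start_degrees):
    """start_degrees: list of (node, degree). Returns a list of (node, low, high)."""
    total = sum(d for _, d in start_degrees)
    probs = [d / total for _, d in start_degrees]      # P(v_i)
    uppers = list(accumulate(probs))                   # running sums of P(v_1..v_i)
    lowers = [0.0] + uppers[:-1]
    return [(node, lo, hi)
            for (node, _), lo, hi in zip(start_degrees, lowers, uppers)]


# Hypothetical starting points with degrees 1 and 3.
intervals = selection_intervals([("a", 1), ("b", 3)])
```

A starting point with a higher degree receives a wider interval, so it is more likely to contain the random number drawn in the next step.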
Step4.2, selecting a sampling starting point by using a random number;
First, a random number in (0, 1) is generated. The starting point whose selection interval contains this random number is then looked up in the set of starting points, and that node is selected as the starting point for this round of graph sampling.
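The lookup in Step4.2 can be sketched with a binary search over the upper bounds of the selection intervals. The function name `pick_start` and its parameters are illustrative assumptions; `upper_bounds` is assumed to hold the cumulative selection probabilities of the starting points in order.

```python
import bisect
import random


def pick_start(starts, upper_bounds, r=None):
    """Return the starting point whose interval [low, high) contains r."""
    if r is None:
        r = random.random()          # uniform in [0, 1)
    idx = bisect.bisect_right(upper_bounds, r)
    # Guard against floating-point rounding that leaves the last bound just below 1.
    return starts[min(idx, len(starts) - 1)]


# Two hypothetical starting points with selection probabilities 0.25 and 0.75.
chosen = pick_start(["a", "b"], [0.25, 1.0], r=0.1)
```

Passing `r` explicitly makes the selection reproducible for testing; omitting it draws a fresh random number.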
Step4.3, randomly selecting a direct neighbor for reservation;
For the starting point selected in Step4.2, one of its direct neighbor nodes is selected uniformly at random and retained, and this neighbor replaces the original starting point in the set of starting points as a new starting point. Because the set of starting points has changed, Step4.1 must be performed again to recalculate the selection intervals of all nodes in the set. Step4.1 to Step4.3 are repeated until the number of retained nodes reaches the number required for graph sampling.
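The loop over Step4.1 to Step4.3 can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the graph is a dict mapping each node to the set of its direct neighbors, the initial starting points are counted as retained nodes, and `random.Random.choices` with degree weights is used as an equivalent of building the selection intervals and drawing a random number. The function name and the `max_steps` safety limit are hypothetical.

```python
import random


def multi_start_biased_walk(graph, starts, target_size, rng=None, max_steps=100000):
    """Degree-guided multi-start random walk; returns the set of retained nodes."""
    if rng is None:
        rng = random.Random()
    starts = list(starts)
    kept = set(starts)  # ASSUMPTION: the starting points themselves are retained
    steps = 0
    while len(kept) < target_size and steps < max_steps:
        # Step4.1 and Step4.2: choose a starting point with probability proportional to degree.
        weights = [len(graph[v]) for v in starts]
        i = rng.choices(range(len(starts)), weights=weights, k=1)[0]
        # Step4.3: retain a uniformly random direct neighbor and let it replace the starting point.
        neighbor = rng.choice(sorted(graph[starts[i]]))
        kept.add(neighbor)
        starts[i] = neighbor
        steps += 1
    return kept


# Hypothetical 6-node ring: 0-1-2-3-4-5-0.
ring = {n: {(n - 1) % 6, (n + 1) % 6} for n in range(6)}
sampled_nodes = multi_start_biased_walk(ring, [0, 3], target_size=4, rng=random.Random(0))
```

Because only one walker moves per round, walkers that currently sit on high-degree nodes advance more often, which is how the degree guides the progress of each starting point.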
Step4.4, generating an induced subgraph;
After Step4.3, a given number of member nodes of the original unbalanced social network have been retained. This step additionally retains the connecting edges among these nodes in the original unbalanced social network. The connecting edges of the original unbalanced social network are traversed: if both end nodes of an edge were retained in Step4.3, the edge is retained; otherwise it is discarded.
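The edge filtering of Step4.4 can be sketched in a few lines of Python. The edge-list representation and the function name `induced_subgraph` are assumptions made for this example.

```python
def induced_subgraph(edges, kept_nodes):
    """Keep only the edges whose two end nodes were both retained."""
    kept = set(kept_nodes)
    return [(u, v) for u, v in edges if u in kept and v in kept]


# Hypothetical original edges and retained nodes.
original_edges = [(0, 1), (1, 2), (2, 3), (3, 0), (1, 3)]
sampled_edges = induced_subgraph(original_edges, {0, 1, 3})
```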
Finally, a given number of member nodes and the connecting edges among them are retained from the original unbalanced social network. These nodes and edges form a clearer unbalanced social network that preserves the community structure of the original graph well: community structures are not lost, the associations between communities are not broken, and the community scale distribution is maintained. The sampled graph reduces the visual clutter caused by numerous nodes and edges, so that researchers can observe the structure of the unbalanced social network more quickly, accurately and intuitively, and can analyze key member information, potential social circles and the relationships among social circles.
In summary, the invention provides a starting point selection method for graph sampling of unbalanced social networks. For a given number of starting points, it selects suitable graph sampling starting points, which helps the sampling process adequately visit member nodes representing different social circles and the community structures they form, so that the structural characteristics of the original graph are well preserved in the sampling result. The invention also provides a sampling process for graph sampling of unbalanced social networks. Given the sampling starting points, this process samples the graph by a random walk biased according to node degree, so that with multiple starting points the progress of the walk from each starting point is controlled asynchronously, and the statistical distribution characteristics of the original graph are well preserved in the sampling result.
Examples:
Data set preparation:
Mutual-follow data of users from 10 different daily-life-sharing apps are selected. The data contain complex mutual-follow information between users, and some users follow each other frequently and closely because of shared hobbies, forming social circles.
Visualizing the data as a graph yields a social network with complex community structures. Each community structure in the graph is a social circle of users who closely follow each other because of a shared hobby. Because the number of people sharing each hobby differs greatly, the community structures differ greatly in size, so the network is a typical unbalanced social network.
The nodes inside the community structures of the unbalanced social network are connected in a complex and redundant way. In order to analyze clearly the different social circles contained in the data, so that the apps can later be divided into more topic sections, the unbalanced social network needs to be graph-sampled, reducing redundant nodes and edges without losing the community structures or the edges between them.
Algorithm preparation:
This embodiment selects 19 existing graph sampling methods, covering random, traversal-based and random-walk-based sampling methods, and covering both single-start and multi-start sampling methods. Some of these methods are classical, some are commonly used, and some are recent. As far as possible, the parameters of each method use the recommended or default values from the relevant articles.
Evaluation index preparation:
In this embodiment the graph sampling methods are evaluated from two aspects: how well a method preserves the community structure of the unbalanced social network, and how well it preserves the distribution of node statistical features of the unbalanced social network. The evaluation indexes therefore also cover these two aspects.
Three new indexes are designed to evaluate how well a graph sampling method preserves the community structure of an unbalanced social network: a community structure preservation index, an inter-community association preservation index and a community scale distribution preservation index. The new indexes and the existing indexes are described as follows:
Index 1: the community structure preservation index MCN evaluates how well the community structures of the unbalanced social network are preserved by graph sampling. The fewer communities that are newly created or lost after sampling, the better the sampling preserves the community structure. The index takes a value between 0 and 1; the closer to 0, the better.
Index 2: the inter-community association preservation index MCR evaluates how well the associations between community structures of the unbalanced social network are preserved by graph sampling. If two community structures in the original graph are connected by edges, then the more of these connections that remain after sampling, the better the sampling preserves the associations between community structures. The index takes a value between 0 and 1; the closer to 1, the better.
Index 3: the community scale distribution preservation index MCD evaluates how similar the relative sizes of the communities in the unbalanced social network are before and after graph sampling. The index takes a value between 0 and 1; the closer to 0, the better.
Index 4: the degree distribution difference index DD compares the node degree distribution of the sampled graph with that of the original graph. The index takes a value between 0 and 1; the closer to 0, the better.
Index 5: the clustering coefficient distribution difference index CCD compares the node clustering coefficient distribution of the sampled graph with that of the original graph. The index takes a value between 0 and 1; the closer to 0, the better.
Index 6: the PageRank distribution difference index PRD compares the node PageRank distribution of the sampled graph with that of the original graph. The index takes a value between 0 and 1; the closer to 0, the better.
Table 1. Sampling evaluation results
Analysis of results: Table 1 shows that the graph sampling method for unbalanced social networks achieves the best values on the three structural evaluation indexes, namely community structure preservation, inter-community association preservation and community scale distribution preservation. This means that, among all the compared graph sampling methods, the method of the invention preserves the structure of the unbalanced social network best: while reducing redundant nodes and edges, it best preserves the number of communities, the associations between community structures and the scale distribution of community structures. It therefore performs best at reducing the visual redundancy of the unbalanced social network and at improving the efficiency of analyzing its potential social circles.
The method also ranks near the top among the compared methods on the statistical feature indexes, namely the degree distribution difference, the clustering coefficient distribution difference and the PageRank distribution difference. This shows that the method also performs well in preserving the statistical features of the unbalanced social network: while reducing redundant nodes and edges, it preserves overall statistical properties of the network such as average age, gender ratio and regional proportion.
In this specification, the embodiments are described in a related manner; identical and similar parts of the embodiments may be referred to each other, and each embodiment focuses on its differences from the other embodiments. In particular, because the system embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the corresponding parts of the description of the method embodiments.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims (10)

1. The graph sampling method for the unbalanced social network is characterized by comprising the following steps of:
step S1, identifying candidate seed nodes: identifying candidate seed nodes from the initial unbalanced social network diagram;
step S2, screening seed nodes: deleting center nodes in communities in the candidate seed nodes, and reserving bridge nodes;
step S3, seed node selection: selecting the bridge nodes obtained in the step S2 through an optimization function to obtain graph sampling starting point nodes;
step S4, performing graph sampling from the graph sampling starting point nodes by using a degree-guided random walk sampling method to obtain a sampled unbalanced social network graph.
2. The graph sampling method for unbalanced social networks according to claim 1, wherein the step S1 is specifically:
step S11, acquiring unbalanced social network data in the real world, representing members by nodes, representing association relations among the members by connecting edges, and performing two-dimensional plane visualization to obtain an initial unbalanced social network diagram;
step S12, calculating the betweenness centrality of all member nodes, taking the average betweenness centrality of all member nodes as a segmentation threshold, and retaining the member nodes whose betweenness centrality is higher than the segmentation threshold as candidate seed nodes.
3. The graph sampling method for unbalanced social networks according to claim 2, wherein the betweenness centrality of the member node v in the step S12 is calculated by the following formula:
$$\mathrm{Betweenness\_Centrality}(v) = \sum_{s \neq v \neq t} \frac{Path_{st}(v)}{Path_{st}}$$
wherein, for each member node v, any other node s is taken as a starting point and any other node t as an end point, Path_st represents the number of paths between node s and node t in the graph, and Path_st(v) represents the number of those paths between node s and node t that contain node v.
4. The graph sampling method for the unbalanced social network according to claim 1, wherein the judging method of the center node in the community in step S2 specifically comprises:
(1) Removing a member node and a direct connection edge thereof in the unbalanced social network;
(2) Detecting whether the direct neighbors of the member node remain connected after the node is removed;
(3) If the direct neighbors of the member node are no longer connected after the member node is removed, the member node is a bridge node; otherwise, the member node is a center node of a community structure.
5. The graph sampling method for unbalanced social networks according to claim 1, wherein the step S3 specifically includes: adopting a greedy strategy, calculating an optimization function value for the bridge nodes obtained in the step S2, and removing the node with the minimum optimization function value in each round, so that the sum of the optimization function values is always maximum during the removal process, until the number of remaining nodes meets the required number of graph sampling starting points;
the calculation formula of the optimization function is as follows:
wherein w_i is a weight coefficient, and w_1, w_2 and w_3 respectively represent the weight coefficients of the first optimization index Factor_bc(v), the second optimization index Factor_degree(v) and the third optimization index Factor_community(v);
wherein the first optimization index Factor_bc(v) is expressed by the formula:
$$\mathrm{Factor}_{bc}(v) = \frac{\mathrm{Betweenness\_Centrality}(v)}{\sum_{u} \mathrm{Betweenness\_Centrality}(u)}$$
wherein Betweenness_Centrality(v) represents the betweenness centrality of member node v among the bridge nodes obtained in step S2, and ΣBetweenness_Centrality(u) represents the sum of the betweenness centrality of all bridge nodes obtained in step S2; the value of Factor_bc(v) belongs to the (0, 1] interval, and the larger the value of the first optimization index, the stronger the capability of the member node v to connect a plurality of community structures;
the second optimization index Factor_degree(v) is expressed by the formula:
$$\mathrm{Factor}_{degree}(v) = \frac{\mathrm{Degree}(v)}{\sum_{u} \mathrm{Degree}(u)}$$
wherein Degree(v) represents the node degree of the member node v, and ΣDegree(u) represents the sum of the node degrees of all bridge nodes obtained in the step S2; the value of the second optimization index belongs to the (0, 1) interval, and the smaller the value of the second optimization index, the stronger the capability of the member node to connect small communities;
the third optimization index Factor_community(v) is expressed by the formula:
$$\mathrm{Factor}_{community}(v) = \frac{\mathrm{Seed\_Ratio}(v)}{\sum_{u} \mathrm{Seed\_Ratio}(u)}$$
wherein Seed_Ratio(v) represents the ratio of the number of bridge nodes among the direct neighbor nodes of the member node v to the number of its direct neighbor nodes, and ΣSeed_Ratio(u) represents the sum of Seed_Ratio over all the bridge nodes obtained in the step S2; the value of the third optimization index belongs to the (0, 1) interval, and the smaller the value of the third optimization index, the stronger the capability of the member node v to act as a bridge end node.
6. The graph sampling method for unbalanced social networks according to claim 1, wherein the step S4 specifically includes:
step S41, calculating the selection intervals of all starting points based on the degrees of the nodes;
step S42, selecting a sampling starting point from the selection intervals by using a random number;
step S43, randomly selecting a direct neighbor node of the sampling starting point for retention;
step S44, generating an induced subgraph to complete the unbalanced social network graph sampling.
7. The graph sampling method for unbalanced social networks according to claim 6, wherein the calculation formula of the selection interval SelectInterval(v_i) in the step S41 is expressed as:
$$\mathrm{SelectInterval}(v_i) = \Big[\sum_{j=1}^{i-1} P(v_j),\ \sum_{j=1}^{i} P(v_j)\Big), \quad 1 \le i \le k$$
where k is the total number of starting point nodes, and v_i represents the i-th starting point node in the set of starting point nodes; P(v_i) represents the probability that the starting point node v_i is selected, and the calculation formula of P(v_i) is expressed as:
$$P(v_i) = \frac{\mathrm{Degree}(v_i)}{\sum_{u} \mathrm{Degree}(u)}$$
wherein Degree(v_i) represents the degree of the starting point node v_i, and ΣDegree(u) is the sum of the degrees of all the starting point nodes obtained in step S3.
8. The graph sampling method for unbalanced social networks of claim 6, wherein the step S42 is specifically: generating a random number in (0, 1), looking up the starting point node in the set of starting points whose selection interval contains the random number, and selecting that starting point node as the starting point of the graph sampling.
9. The graph sampling method for unbalanced social networks according to claim 6, wherein the step S43 specifically includes:
according to the starting point selected in the step S42, selecting one of the direct neighbor nodes of the starting point uniformly at random for retention, replacing the original starting point in the set of starting points with the direct neighbor node as a new starting point, performing step S41 again to calculate the selection intervals of all nodes in the set of starting points, and repeating the steps S41 to S43 until the number of retained nodes reaches the number required for graph sampling.
10. The graph sampling method for unbalanced social networks of claim 6, wherein the step S44 specifically includes:
traversing the connecting edges of the original unbalanced social network graph, retaining a connecting edge if both of its nodes were retained in the step S43 and otherwise discarding it, and finally obtaining a clearer unbalanced social network graph.
CN202310635601.9A 2023-05-31 2023-05-31 Unbalanced social network-oriented graph sampling method Active CN116595267B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310635601.9A CN116595267B (en) 2023-05-31 2023-05-31 Unbalanced social network-oriented graph sampling method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310635601.9A CN116595267B (en) 2023-05-31 2023-05-31 Unbalanced social network-oriented graph sampling method

Publications (2)

Publication Number Publication Date
CN116595267A true CN116595267A (en) 2023-08-15
CN116595267B CN116595267B (en) 2024-01-19

Family

ID=87600668

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310635601.9A Active CN116595267B (en) 2023-05-31 2023-05-31 Unbalanced social network-oriented graph sampling method

Country Status (1)

Country Link
CN (1) CN116595267B (en)

Also Published As

Publication number Publication date
CN116595267B (en) 2024-01-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant