CN110119462B

CN110119462B - Community search method of attribute network

Info

Publication number: CN110119462B
Application number: CN201910266196.1A
Authority: CN
Inventors: 曲强; 罗捷桓
Original assignee: Hangzhou Zhongke Advanced Technology Research Institute Co ltd
Current assignee: Hangzhou Zhongke advanced technology development Co.,Ltd.
Priority date: 2019-04-03
Filing date: 2019-04-03
Publication date: 2021-07-23
Anticipated expiration: 2039-04-03
Also published as: CN110119462A

Abstract

The invention provides a community search method for an attribute network. The method comprises the following steps: defining a search area range according to the spatial positions of network users; and searching for a target community according to the connection closeness among network users in the attribute network, wherein the spatial positions of the users in the target community are within the defined search area range. The method can effectively find a target community that satisfies both structural cohesion and spatial cohesion, and can be used for behavior analysis of social network users, recommendation, disease prediction, and the like.

Description

Community search method of attribute network

Technical Field

The invention relates to the technical field of community search, in particular to a community search method of an attribute network.

Background

Attribute networks are used to model a variety of networks, including social networks, knowledge graphs, and protein interaction networks, among others. The increasing amount of data and the rich nature of these networks have presented enormous challenges to community search and have attracted much attention in recent years. Research on finding communities can be divided into community detection and community search. Community detection methods are commonly used to discover communities in a social network based on predefined implicit criteria, while community search is an online approach to finding cohesive communities that meet a given set of explicit criteria, such as k-core and k-truss based community search.

Spatial attributes are among the most important and useful features in attribute networks. In a spatially aware network, each node is accompanied by spatial information; for example, social networks such as Twitter and Foursquare can be modeled as networks in which each node (i.e., user) has one or more locations (e.g., a current location or a historical check-in location). Searching for communities with the users' location information in mind extends the understanding of user behavior from the virtual world to the real world.

However, the prior art generally considers only non-attributed networks and ignores the rich information carried by the vertices of attribute networks. In addition, existing research searches for communities in spatially aware networks using various measures of structural cohesion as the query constraint; for example, with the k-core or k-truss measure the user must specify a value of k for the community search, and the spatial closeness between users is not considered.

Therefore, the prior art needs to be improved so that network communities can be searched with both structural cohesion and spatial compactness taken into account, and so that the efficiency of community search is further improved.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provide a community searching method for the attribute network.

According to a first aspect of the invention, a community search method for an attribute network is provided. The method comprises the following steps:

step S1: defining a search area range according to the spatial position of a network user;

step S2: searching a target community according to the connection closeness among network users in the attribute network, wherein the spatial position of the users in the target community is within the range of the defined search area.

In one embodiment, step S1 includes the following sub-steps:

the attribute network is characterized by an undirected connected graph G = (V, E, S), wherein V represents a vertex set, E represents an edge set, S represents a spatial position set, and each vertex represents a network user;

in the undirected connected graph G, searching for a target community represented by a connected subgraph, wherein the vertex positions of the subgraph can be enclosed by a circle of diameter D and, relative to the other subgraphs of the undirected connected graph G, the vertices of the subgraph form the k-core of the highest order.

In one embodiment, in step S2, the target community represented by the connected subgraph is searched for according to the following steps:

step S21: constructing a quadtree index structure for the undirected connected graph G, wherein a root node corresponds to the whole space of G;

step S22: traversing the quad-tree index structure to obtain all nodes with the side length smaller than D and the parent node with the side length larger than D, and storing the nodes in a node list nodeList;

step S23: for the nodes in the node list nodeList, obtaining the maximum core number k_cur;

step S24: pruning from the node list every node N with N.DistMap[k_cur] > D, wherein N.DistMap[k_cur] denotes the entry for k_cur in the distance map of node N;

step S25: for the remaining nodes in nodeList, sorting them in ascending order according to the upper bound of the core number and verifying them in sequence, so as to find the k-core that has the highest order and can be enclosed by a circle of diameter D.

In one embodiment, in step S25, for a node N in the node list nodeList, the following steps are used for verification:

expanding N by length D, performing core decomposition in the expanded square region, and ignoring the vertices whose core number is less than k_cur;

verifying whether the remaining vertices in the expanded square region form a k-core of order higher than k_cur and, if so, recording the k-core and updating k_cur.

In one embodiment, the following steps are used to verify whether the remaining vertices in the expanded square region form a k-core of order higher than k_cur:

for one vertex in node N, placing it on the boundary of a circle of diameter D and rotating the circle;

when a new vertex enters the circle, checking whether a k-core of order higher than k_cur exists.

dividing the expanded square area into m × m cells, and searching for k-core in the expanded square area using a square covering s × s cells that can enclose a circle having a diameter D, where s, m are positive integers and s is smaller than m.

In one embodiment, the following steps are taken to verify whether the remaining vertices in the expanded square region form a k-core of order higher than k_cur:

when rotating the circle, stopping the rotation when a new vertex entering the circle satisfies the k_c-core, where k_c denotes the core number currently being verified.

In one embodiment, the target communities represented in the connected subgraph are searched according to the following steps:

searching all circles with the diameter D in the undirected connectivity graph G;

for all searched circles, checking the maximum core number of the vertices that can be enclosed by each circle, and taking the vertices enclosed by the circle with the largest maximum core number as the target community.

Compared with the prior art, the invention has the advantages that: the invention provides a solution for co-located community search with structural cohesion; in the community searching process, the spatial information and the local structure information are integrated together by constructing the index structure, so that the efficiency and effectiveness of community searching are improved.

Drawings

The invention is illustrated and described only by way of example and not by way of limitation in the scope of the invention as set forth in the following drawings, in which:

FIG. 1 is a flow diagram of a community search method for an attribute network according to one embodiment of the present invention;

FIG. 2 is a schematic diagram of an attribute network and co-located communities according to one embodiment of the present invention;

FIG. 3 is a schematic diagram of a distance-aware k-core quadtree, according to one embodiment of the present invention;

FIG. 4 is a diagram of constructing a distance map, according to one embodiment of the present invention;

FIG. 5 is a diagram of a quadtree-based co-located community search, according to one embodiment of the invention;

FIG. 6 is a schematic diagram of a quadtree-based co-located community search according to another embodiment of the present invention;

FIGS. 7(a)-7(c) are schematic diagrams of the correlation between diameter and community search time according to one embodiment of the present invention;

FIGS. 8(a)-8(b) are schematic diagrams of the correlation between the number of user locations and community search time according to one embodiment of the present invention;

FIGS. 9(a)-9(b) are schematic diagrams of the correlation between the location distribution of users and community search time according to one embodiment of the present invention;

FIGS. 10(a)-10(b) are diagrams of scalability results according to one embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions, design methods, and advantages of the present invention more apparent, the present invention will be further described in detail by specific embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not as a limitation. Thus, other examples of the exemplary embodiments may have different values.

Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.

One research goal of the present invention is to solve the search problem of the most cohesive co-located community (referred to herein as MC³), wherein the searched community satisfies the following two properties: structural cohesion, meaning that the members of the community are most closely connected; and spatial co-location, meaning that the members are geographically close to each other and are therefore spatially cohesive.

According to one embodiment of the invention, a community search method for an attribute network is provided. In brief, an undirected connected graph is used to characterize the attribute network, and the target community is determined by searching the undirected connected graph for a connected subgraph that satisfies the structural cohesion and spatial cohesion criteria. Specifically, referring to FIG. 1, the method comprises the following steps:

Step S110: the attribute network is represented using an undirected connected graph.

In the embodiment of the present invention, an undirected connected graph is taken as an example to characterize an undirected attribute network G = (V, E, S), where G has a vertex set V, an edge set E, and a spatial position set S. The degree of a vertex v in G (e.g., a user in a social network) is denoted deg_G(v). Each vertex v has a spatial position v.l = (x, y) ∈ S (e.g., the user's check-in position), where x and y are the coordinates along the x-axis and y-axis, respectively, in two-dimensional space.

For convenience of description, the definitions of the symbols involved in the present invention are summarized as follows:

G(V, E, S): a geo-social graph having a set of vertices V, a set of edges E, and a set of spatial locations S;

(v.x, v.y): the position of a vertex v in the vertex set V along the x-axis and the y-axis;

deg_G(v): the degree of a vertex v in G;

γ(N): the side length of node N in the index structure.

The research goal of the invention is to find, in an undirected connected graph G, a community represented by a connected subgraph that satisfies the following conditions: structural cohesion, i.e., the vertices of the connected subgraph are most densely connected; and spatial cohesion, i.e., the vertices of the connected subgraph are very compact in space.

In the embodiment of the present invention, the evaluation of structural cohesion is described taking k-core as an example, but it should be understood that the method of the present invention can also be extended to other measures of structural cohesion, such as k-truss, clique, and the like.

For ease of explanation, the following concepts are first introduced:

1) Definition of k-core

For k-core, given a non-negative integer k, the k-core of G is the largest subgraph of G, where the degree of each vertex v in the subgraph is no less than k.

Specifically, in the present embodiment, a connected k-core in G (denoted G_k) is used to represent a community, and the order of G_k is said to be k. Given a graph, the k-core can be obtained by an algorithm from the prior art, e.g., a linear core decomposition algorithm, whose complexity is denoted herein as O(|E|) (see, e.g., "An O(m) algorithm for cores decomposition of networks", Batagelj, V., Zaversnik, M., arXiv preprint cs/0310049 (2003)).

2) Definition of core number

For a vertex v in a given G, its core number is the highest order of a k-core containing v, denoted C_G[v].
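
As an illustration of the core number defined above, the following minimal Python sketch computes the core number of every vertex by repeatedly removing a vertex of minimum remaining degree. It is not the linear-time bucket implementation of the cited reference; a heap is used here for brevity. The function name `core_numbers` and the adjacency-dictionary input format are assumptions made for this example only.

```python
import heapq


def core_numbers(adj):
    """Return {vertex: core number} for an undirected graph.

    `adj` maps each vertex to the set of its neighbours (assumed format).
    Vertices are peeled in order of minimum remaining degree; the core
    number of a vertex is the largest minimum degree seen up to its removal.
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    heap = [(d, v) for v, d in degree.items()]
    heapq.heapify(heap)
    core = {}
    k = 0
    while heap:
        d, v = heapq.heappop(heap)
        if v in core or d != degree[v]:
            continue  # stale heap entry, the degree changed after it was pushed
        k = max(k, d)
        core[v] = k
        for u in adj[v]:
            if u not in core:
                degree[u] -= 1
                heapq.heappush(heap, (degree[u], u))
    return core
```

The k-core of a given order k then consists of the vertices whose core number is at least k, and the highest order present in the graph is the maximum of the returned values.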

3) Definition of co-located communities

In the embodiment of the invention, a co-located community refers to a connected subgraph (k-core) G_k whose vertex positions can be enclosed by a circle of a predetermined diameter D. It is desirable that the vertices of a co-located community lie close to one another, which reflects the "co-location" of the community.

In an embodiment of the invention, given an undirected attribute graph G and a diameter D, the co-located community search (MC³) returns a group of vertices and their positions that satisfies the following constraints: the positions of the vertices can be enclosed by a circle of diameter D; and the vertices form the k-core of the highest order.

FIG. 2 is a schematic diagram of an attribute network and co-located communities, where C₁ and C₂ are two co-located communities in the attribute network whose members can each be enclosed by a circle of diameter D. The members of C₁ include A, B, and C, and the members of C₂ include D, G, H, F, and E. C₂ is a 3-core, which is the core of the highest order (with respect to diameter D) among the two co-located communities; hence C₂ is the MC³ of the attribute network, i.e., the target community to be searched for.

Step S120: searching the undirected connected graph for a connected subgraph that satisfies the structural cohesion and spatial cohesion criteria.

The present invention aims to find the community that is structurally most cohesive and can be enclosed by a circle of diameter D. Various embodiments can be employed to search the attribute network for communities that meet the structural cohesion and spatial cohesion criteria.

Example 1: space priority mode

In this embodiment, for the attribute network, first, all possible circles (diameter D) in space are searched; then, checking the maximum number of kernels of the vertex which can be surrounded by the circle; finally, the maximum number of kernels in all circles is returned.

In particular, all possible circles are enumerated, for example, two positions are fixed in the set S of spatial positions, the distance of which is less than or equal to D, and a maximum of two circles of diameter D are obtained spatially from these two positions. Then, vertices are obtained from the map to which these positions belong, and the maximum number of kernels is calculated using a known linear kernel decomposition algorithm. This approach requires checking the worst case circle with an overhead of O (V)²) And is therefore very time consuming.
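
The circle enumeration of the space-first approach can be sketched in Python as follows. For every pair of positions at distance at most D, the sketch computes the (at most two) centres of circles of diameter D whose boundary passes through both positions, and lists the positions enclosed by each circle. The function names, the `points` dictionary format, and the tolerance `eps` are assumptions made for this illustration; the core number check on the enclosed vertices is left out.

```python
import math
from itertools import combinations


def circle_centers(p, q, diameter):
    """Centres of the circles of the given diameter passing through p and q."""
    r = diameter / 2.0
    dx, dy = q[0] - p[0], q[1] - p[1]
    dist = math.hypot(dx, dy)
    if dist == 0 or dist > diameter:
        return []  # identical points, or too far apart to share a circle
    mx, my = (p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0
    # Distance from the midpoint of pq to each centre (Pythagoras).
    h = math.sqrt(max(r * r - (dist / 2.0) ** 2, 0.0))
    ux, uy = -dy / dist, dx / dist  # unit vector perpendicular to pq
    centers = [(mx + h * ux, my + h * uy)]
    if h > 0:
        centers.append((mx - h * ux, my - h * uy))
    return centers


def points_inside(points, center, diameter, eps=1e-9):
    """Names of the points lying in the closed disc around `center`."""
    r = diameter / 2.0
    return {name for name, (x, y) in points.items()
            if math.hypot(x - center[0], y - center[1]) <= r + eps}


def candidate_circles(points, diameter):
    """Yield (centre, enclosed point names) for every pair-defined circle."""
    for (_, pa), (_, pb) in combinations(points.items(), 2):
        for c in circle_centers(pa, pb, diameter):
            yield c, points_inside(points, c, diameter)
```

Because one or two circles are generated per pair of positions, the number of candidate circles grows quadratically with the number of positions, which matches the worst-case cost noted above.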

Example 2: structure priority mode

In this embodiment, a structure-first search is performed, the idea being to use the network structure to speed up the search.

Specifically, first, kernel decomposition is performed to calculate the number of kernels per vertex; then, the maximum k value in the kernel is searched, denoted as k_maxAnd at k_max-in core, obtaining a position from a vertex; next, a search is performed at these positions by enumerating all possible circles (similar to the space-first method), after which the current best number of kernels (denoted k) can be obtained_cur) (ii) a Then, pair (k)_max-1) -core further examination. Repeating the above process until k is reached_cur-core. In this way, verification of a reduced number of circles is facilitated. However, there are still limitations to the efficiency of this approach since globally cohesive subgraphs may not have local cohesion.

Example 3: Distance-aware k-core quadtree approach

Neither the space-first approach nor the structure-first approach described above achieves good performance, because the MC³ problem requires both spatial and structural cohesion to be considered, whereas each of these approaches ignores either the structural characteristics or the spatial characteristics of the data.

In a preferred embodiment, a distance-aware k-core quadtree, referred to herein as DkQ-TREE (Distance-aware k-core Quadtree), is employed to search for target communities. Using a quadtree structure, an index that pre-computes local structural cohesion can be constructed to speed up the search and prune the search space.

The following describes in detail a tree index structure based on a spatially indexed quadtree, and a community search method built on this index structure to solve the MC³ problem.

1) Index structure of quadtree

Known linear k-core decomposition algorithms can only compute the global core number of vertices, so local cohesion information is unknown during the query. In a quadtree-based index structure, the structural information and spatial information are integrated to compute local cohesion (with respect to diameter D).

The quadtree structure is shown in FIG. 3. In brief, DkQ-TREE is constructed by letting the root node correspond to the whole space and dividing the whole space into four subspaces, each corresponding to one child node of the root node. Each node is then recursively subdivided into four child nodes. For example, in FIG. 3 the entire space is the root node (root), whose four child nodes correspond to {A, B, C}, {K, J}, {L}, and {D, E, F, G, H, I}, respectively, and these can similarly be further subdivided.

In this embodiment, the local cohesion and other useful information of each tree node are pre-computed using the quadtree, based on the spatial monotonicity of local cohesion. Spatial monotonicity means that, given a spatial region R (e.g., a square), if the vertices in this region can form a k-core of order at most h, then for any region R' within R, the order of the k-core formed by the vertices in R' is no greater than h. The property holds because a smaller region contains fewer vertices.

In each node N of DkQ-TREE, the core number of each vertex of the node is pre-computed in the subgraph extracted from the node's region, and the maximum core number of the vertices in the node is recorded and denoted LC_N. This calculation is performed because of the following principle (referred to herein as Lemma 1): given a query diameter D and a tree node N that can be enclosed by a circle of diameter D, the order of the MC³ is not less than LC_N.

Since N can be enclosed by a circle of diameter D, the above principle follows from the spatial monotonicity property. Thus, a lower-bound estimate of the order of the MC³ can be obtained from DkQ-TREE based on the pre-computed information.

However, the maximum core number in each node is still not sufficient to capture local cohesion. As can be seen from FIG. 3, the vertices of a node may form a k-core together with vertices that are not in the node. Therefore, for a given diameter D, no bound on the core number can be obtained for such nodes. For this reason, a distance map DistMap is further computed for each tree node. The idea is that, given a node N, for each value k > LC_N the node is extended to the vertex at the smallest distance d such that the vertices covered during the extension can form a k-core; this distance d and its corresponding k are recorded in the distance map.

The distance map helps prune the search space according to the following principle (referred to herein as Lemma 2): suppose the current order of the MC³ is k_cur; given a query diameter D and a node N, if N.DistMap[k_cur] > D, then N cannot contribute any vertex to the MC³, where N.DistMap[k_cur] denotes the distance recorded for order k_cur in the distance map of node N.

The above principle can also be shown using the spatial monotonicity property: if N.DistMap[k_cur] > D, then no k_cur-core can be found in the region obtained by extending the boundary of N by the diameter D, so the core number of every vertex in that region is less than k_cur and N can be pruned.

In addition, to quickly obtain a vertex from a location, when the vertex has multiple locations, a vertex map table may also be used to organize the mapping information.

In summary, in the embodiment of the present invention, for each node of DkQ-TREE, the stored information includes: a vertex in the node; the maximum number of cores in the node; a vertex mapping table; a distance mapping table.
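
The per-node information listed above might be held in a record such as the following Python sketch. The class name `DkQNode`, the field names, and the choice to map each vertex to a list of its locations are assumptions made for illustration; the source only lists the four kinds of stored information.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

Point = Tuple[float, float]


@dataclass
class DkQNode:
    """One node of a DkQ-TREE-like index (illustrative layout only)."""
    x: float      # lower-left corner of the node's square region
    y: float
    side: float   # side length, written γ(N) in the text
    vertices: Set[str] = field(default_factory=set)               # vertices in the node
    max_core: int = 0                                             # LC_N
    vertex_map: Dict[str, List[Point]] = field(default_factory=dict)  # vertex -> locations
    dist_map: Dict[int, float] = field(default_factory=dict)      # k -> extension distance
    children: List["DkQNode"] = field(default_factory=list)

    def contains(self, p: Point) -> bool:
        """True if p lies in the half-open square [x, x+side) x [y, y+side)."""
        return (self.x <= p[0] < self.x + self.side
                and self.y <= p[1] < self.y + self.side)
```

Mutable fields use `default_factory` so that each node gets its own containers rather than sharing one default object.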

2) Index construction of quadtrees

Still referring to the quadtree structure shown in FIG. 3, the whole space is the root node (root); the four child nodes of the root correspond to {A, B, C}, {K, J}, {L}, and {D, E, F, G, H, I}, respectively; the node {A, B, C} is further subdivided into {A}, {B}, and {C}, and the node {D, E, F, G, H, I} is further subdivided into {D}, {E}, {F}, and {G, H, I}. When a new node is obtained, core decomposition is performed on the vertices in the node and the maximum core number is stored. If the maximum core number is less than a certain value k_ε, the node is not split further. For example, in FIG. 3 the vertices {A, B, C} form a 2-core, and this region is split into {A}, {B}, and {C}. After splitting, none of the sub-regions can form a 2-core, so the nodes corresponding to these sub-regions are not split further.
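
The splitting rule described above can be sketched in Python as follows. Nodes are plain dictionaries, each vertex is assumed to have a single location, and the `min_side` limit and the single-vertex stop condition are safeguards added for this example to guarantee termination; they are not stated in the source. The helper `max_core` uses simple minimum-degree peeling rather than the linear-time algorithm.

```python
def max_core(vertices, adj):
    """Highest k for which the subgraph induced by `vertices` has a k-core."""
    alive = set(vertices)
    deg = {v: len(adj[v] & alive) for v in alive}
    k = 0
    while alive:
        v = min(alive, key=deg.get)  # peel a vertex of minimum remaining degree
        k = max(k, deg[v])
        alive.remove(v)
        for u in adj[v]:
            if u in alive:
                deg[u] -= 1
    return k


def build_quadtree(points, adj, x, y, side, k_eps, min_side):
    """Build the node for the square [x, x+side) x [y, y+side) recursively.

    points: {vertex: (px, py)}; adj: {vertex: set of neighbours}.
    A node is split only while its maximum core number is at least k_eps.
    """
    inside = {v for v, (px, py) in points.items()
              if x <= px < x + side and y <= py < y + side}
    node = {"x": x, "y": y, "side": side, "vertices": inside,
            "max_core": max_core(inside, adj), "children": []}
    half = side / 2.0
    if node["max_core"] >= k_eps and half >= min_side and len(inside) > 1:
        sub_points = {v: points[v] for v in inside}
        for cx, cy in ((x, y), (x + half, y), (x, y + half), (x + half, y + half)):
            node["children"].append(
                build_quadtree(sub_points, adj, cx, cy, half, k_eps, min_side))
    return node
```

With a triangle of three mutually connected vertices and k_eps = 2, the region holding the triangle keeps splitting until each sub-region holds a single vertex, mirroring the {A, B, C} example of FIG. 3.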

In addition, when a new node is obtained, its distance map (Distance Map) and vertex map (Vertex Map) are also constructed. Building the vertex map means recording the position of each vertex; for example, in FIG. 3, the vertex map records for vertex A that the location of v_A is A (v_A's locations: A), and the other vertices are handled similarly. The idea of building the distance map is, for each value k, to perform a binary search for the smallest distance by which the node must be extended so that the vertices covered during the extension can form a k-core. For example, referring to FIG. 4, a node contains only one vertex C. When the node is extended to vertex B, a 1-core is formed and the extension distance is d1; when it is extended to vertex A, a 2-core is first formed and the extension distance is d2. The distances d1 and d2 are stored in the distance map, for example in the format 1-core: d1; 2-core: d2.
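
A possible reading of the distance map construction is sketched below in Python. The node's square is assumed to grow by the same distance on all four sides, the candidate distances are the extensions at which some vertex is first covered, and a binary search is valid because enlarging the region can only add vertices, so the maximum core number never decreases. All function names and the `square = (x, y, side)` convention are assumptions for this example.

```python
def max_core(vertices, adj):
    """Highest k for which the subgraph induced by `vertices` has a k-core."""
    alive = set(vertices)
    deg = {v: len(adj[v] & alive) for v in alive}
    k = 0
    while alive:
        v = min(alive, key=deg.get)
        k = max(k, deg[v])
        alive.remove(v)
        for u in adj[v]:
            if u in alive:
                deg[u] -= 1
    return k


def expansion_needed(square, p):
    """Smallest d such that the square grown by d on every side covers p."""
    x, y, side = square
    return max(x - p[0], p[0] - (x + side), y - p[1], p[1] - (y + side), 0.0)


def build_dist_map(square, points, adj):
    """Return {k: smallest extension distance at which a k-core appears}."""
    need = {v: expansion_needed(square, p) for v, p in points.items()}
    candidates = sorted(set(need.values()))
    if not candidates:
        return {}

    def core_at(d):
        return max_core({v for v in need if need[v] <= d}, adj)

    local = core_at(0.0)            # LC_N: cohesion without any extension
    best = core_at(candidates[-1])  # cohesion once every vertex is covered
    dist_map = {}
    for k in range(local + 1, best + 1):
        lo, hi = 0, len(candidates) - 1
        while lo < hi:              # binary search over the sorted distances
            mid = (lo + hi) // 2
            if core_at(candidates[mid]) >= k:
                hi = mid
            else:
                lo = mid + 1
        dist_map[k] = candidates[lo]
    return dist_map
```

For a configuration like FIG. 4, with C inside the node and B and A at increasing distances, the result has the form {1: d1, 2: d2}.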

3) Quadtree-based community search method MC³Alg

In the embodiment of the invention, two algorithms are provided based on the quadtree index structure. To distinguish them, they are referred to as MC³Alg and MC³Alg+, where MC³Alg+ is an improvement of MC³Alg.

Briefly, the MC³Alg algorithm involves two iterative steps: pruning DkQ-TREE nodes, and discovering the MC³ from the nodes that cannot be pruned. Specifically, MC³Alg comprises the following steps:

step S211, pruning DkQ-TREE nodes

In this step, the lower bound of the order of the MC³ is obtained according to Lemma 1 above.

Specifically, given a diameter D, DkQ-TREE is traversed from top to bottom to obtain all nodes whose side length is less than D and whose parent nodes have a side length greater than D. These nodes are stored in a node list nodeList. Then, the maximum core number is obtained from the nodes in the node list; it serves as the lower bound and is denoted k_cur. Using this lower bound of the order of the MC³, the nodes in nodeList are further pruned according to Lemma 2 (i.e., given a query diameter D and a node N, if N.DistMap[k_cur] > D, then N cannot contribute any vertex to the MC³).
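
The node selection and pruning of step S211 could look roughly like the Python sketch below. Nodes are dictionaries with `side`, `max_core`, `dist_map`, and `children` keys (an assumed layout). Two conventions are assumptions of this example: a node whose own maximum core number already reaches k needs no extension, and an order that is missing from a node's distance map is treated as unreachable. A leaf whose side is still at least D is simply not collected in this sketch.

```python
import math


def collect_nodes(node, D, out=None):
    """Top-down traversal: collect the first nodes whose side length is below D."""
    out = [] if out is None else out
    if node["side"] < D:
        out.append(node)  # the parent, if any, had side >= D
    else:
        for child in node["children"]:
            collect_nodes(child, D, out)
    return out


def required_distance(node, k):
    """Extension distance needed before the node's region can hold a k-core."""
    if k <= node["max_core"]:
        return 0.0
    return node["dist_map"].get(k, math.inf)


def prune(node_list, D):
    """Return (k_cur, nodes kept after applying the Lemma 2 test)."""
    k_cur = max(n["max_core"] for n in node_list)  # lower bound from Lemma 1
    kept = [n for n in node_list if required_distance(n, k_cur) <= D]
    return k_cur, kept
```

After each verification that raises k_cur, the same test can be applied again to the remaining nodes to discard further candidates.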

In step S212, the target community is searched from the nodes remaining after pruning.

After pruning, the remaining nodes in nodeList are sorted according to the upper bound of the core number obtained from the distance map, and verification then starts from the best node N.

Specifically, given a node N, if N.DistMap[k₁] ≤ D ≤ N.DistMap[k₂], then k₁ is the upper bound of the core number of the vertices in N. First, N is extended by length D and core decomposition is performed on the extended square region. The vertices whose core number is less than k_cur can then safely be ignored, because they cannot be included in the MC³. To verify whether the remaining vertices in the extended region form a k-core of higher order, it is not necessary to check all possible circles as in the space-first approach. In one embodiment, a rotating circle method is used; the basic idea is to place each vertex of node N on the boundary of a circle of diameter D and then rotate the circle clockwise. When a vertex enters the circle, it is checked whether a k-core of order higher than k_cur exists. If so, the k-core is recorded and k_cur is updated. For example, referring to FIG. 5, with vertex G on the boundary of the circle and the circle rotated clockwise, the 2-core formed by {G, F, H, I} is found when F enters the circle.

After N has been verified, k_cur may be updated, and more nodes in nodeList are pruned based on the updated k_cur; verification then continues from the next best node. The above process is repeated until all nodes in nodeList have been processed.

For further clarity, the framework of MC³Alg is described as follows, with line numbers referring to its pseudocode. First, nodeList is obtained from DkQ-TREE (line 1). Then, the lower bound of the order of the MC³ is obtained, and φ is used to store N_max. For each node in nodeList, its distance map DistMap is obtained to check the distance by which the node must be extended to contain a k-core, and nodes are safely removed according to Lemma 2 (lines 5-8). The upper bound of the core number of the vertices in each node is then obtained (line 9). Next, nodeList is sorted in ascending order of the nodes' upper bounds (line 10); each node is extended by length D and its vertices are pruned as described above; for each vertex in N that is not pruned, the rotating circle method is used to check for a k-core and update φ (lines 11-15). Finally, the k-core with the highest order is stored in φ (line 16).

Still referring to FIG. 5, given a candidate node containing G, H, and I, with G on the boundary of the circle and I, H, F, E, and D in the region swept by the rotating circle, an ordered list {I, H, F, E, D} is obtained according to the order in which these vertices enter the circle. The circle is then rotated clockwise, and whenever a vertex in {I, H, F, E, D} enters the circle (i.e., lies on its boundary), the rotation stops and it is checked whether the circle contains a k-core. For example, when F enters the circle, a 2-core ({G, I, H, F}) is obtained in the circle. When the circle is rotated to vertex D, a 3-core ({G, H, F, E, D}) is obtained. After H and I are processed in the same manner, it can be seen that {G, H, F, E, D} is the k-core with the highest order in the node.
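
The order in which vertices enter the rotating circle can be derived geometrically, as in the following Python sketch. With the anchor vertex p fixed on the boundary, the centre of a circle of diameter D lies at p + (D/2)(cos θ, sin θ); a point q at distance d ≤ D from p, in direction α, is inside the circle exactly when |θ − α| ≤ arccos(d/D). For a clockwise rotation (θ decreasing), q therefore enters at θ = α + arccos(d/D). The function name, the starting direction, and the input format are assumptions of this example, and the k-core check at each stop is omitted.

```python
import math


def entry_order(anchor, others, diameter, start_angle=math.pi / 2):
    """Order in which points enter a circle that rotates clockwise about `anchor`.

    anchor: (x, y) kept on the circle boundary; others: {name: (x, y)}.
    start_angle: initial direction from the anchor to the circle centre.
    A point that is already inside at the start is listed at the turn at
    which it would re-enter after having left the circle.
    """
    events = []
    for name, (qx, qy) in others.items():
        dx, dy = qx - anchor[0], qy - anchor[1]
        d = math.hypot(dx, dy)
        if d == 0 or d > diameter:
            continue  # coincides with the anchor, or can never share a circle with it
        alpha = math.atan2(dy, dx)                 # direction from anchor to point
        beta = math.acos(min(d / diameter, 1.0))   # half-width of the "inside" interval
        enter = alpha + beta                       # centre direction at entry (clockwise)
        turn = (start_angle - enter) % (2 * math.pi)  # clockwise rotation needed
        events.append((turn, name))
    return [name for _, name in sorted(events)]
```

Sorting the candidate vertices once by this entry turn gives an ordered list of the kind {I, H, F, E, D} used in the example of FIG. 5.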

The computational complexity of the MC³Alg algorithm is analyzed as follows:

Assume that, on average, each unit space region contains n vertices and m edges, and that X nodes are obtained from DkQ-TREE for a given D.

First, the nodes are sorted according to the upper bound of the core number, with complexity O(X log X). Then, each node N with γ(N) = l is extended by length D, i.e., γ(N_ex) = 2D + l, and core decomposition is performed in this square area. The expanded square contains (2D + l)²m edges, so the cost of core decomposition is O((2D + l)²m). Next, a circle is rotated on each vertex in N. Each circle contains

[formula missing] vertices and

[formula missing] edges.

Note that the k-core verification performed in the circle can be divided into three steps:

the inspection, whose cost is [formula missing];

the core decomposition, whose cost is [formula missing];

the BFS (breadth-first search) check, whose cost is [formula missing].

Therefore, the k-core verification cost is at most [formula missing].

In the worst case, at most πD²n verifications are performed for each vertex in N (the number of vertices in N is l²n). Thus, the overall complexity of the MC³Alg algorithm is [formula missing].

4) Quadtree-based community search method MC³Alg+

MC³Alg is still not efficient enough and is limited on large-scale attribute networks. This is because, first, each node to be checked contains many vertices, and the rotating circle method must be applied to each of them; second, the extended area of a node contains many vertices, so the k-core must be verified many times while the circle rotates. To overcome these problems, a more efficient algorithm, referred to herein as MC³Alg+, is provided. The main difference between MC³Alg+ and MC³Alg lies in the cost of verifying a node, whereas the node pruning in DkQ-TREE is the same as in MC³Alg.

In MC³Alg+, for each node N to be examined, a binary search is performed to find the maximum core number in that node. As in MC³Alg, the upper bound of the core number is obtained from the distance map of N, and the lower bound is the current best order. During the binary search, it is checked whether the extended area of N contains a k-core for the current core number k_c. In this way, a larger k_c can be reached quickly, which has two benefits: first, fewer vertices in N need to be checked; second, fewer vertices of the extended region are introduced during circle rotation.

Next, to further reduce the vertices in N to be checked, the expanded square area is divided into m × m cells, and a small square is used to filter out vertices that cannot form a solution. The basic principle is that, instead of checking vertices one by one, a square covering s × s cells, which can enclose a circle of diameter D, is used to search for all k-cores in the expanded square area. The s × s square is moved from the upper left corner to the lower right corner of the expanded square area (comprising m × m cells), and at each position it is checked whether the square contains a k_c-core. All squares containing a k_c-core are recorded, and circle rotation is performed only for the vertices that lie both in N and in those squares. Here m and s are positive integers with s smaller than m; in practical applications, appropriate values of m and s can be set according to the diameter of the circle, the required search granularity, and the like. In this way, the verification granularity is a cell rather than a vertex, so verification is faster.
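
The cell-based filtering step can be illustrated with the following Python sketch. Vertices are bucketed into an m × m grid over the expanded square, an s × s window slides over the grid, and the vertices of every window that contains a k_c-core are kept. The function names, the scan order, and the assumptions that each vertex has one location and that all points lie inside the expanded square are specific to this example; intersecting the result with the vertices of N is left to the caller.

```python
def has_k_core(vertices, adj, k):
    """True if the subgraph induced by `vertices` contains a non-empty k-core."""
    alive = set(vertices)
    while True:
        weak = [v for v in alive if len(adj[v] & alive) < k]
        if not weak:
            return bool(alive)
        alive.difference_update(weak)  # peel vertices with too few neighbours


def candidate_vertices(points, adj, origin, side, m, s, k_c):
    """Vertices lying in some s x s window of cells that contains a k_c-core.

    points: {vertex: (px, py)} inside the square with lower-left corner
    `origin` and side length `side`; adj: {vertex: set of neighbours}.
    """
    cell = side / m
    grid = {}
    for v, (px, py) in points.items():
        i = min(int((px - origin[0]) / cell), m - 1)
        j = min(int((py - origin[1]) / cell), m - 1)
        grid.setdefault((i, j), set()).add(v)
    keep = set()
    for i0 in range(m - s + 1):
        for j0 in range(m - s + 1):
            window = set()
            for i in range(i0, i0 + s):
                for j in range(j0, j0 + s):
                    window |= grid.get((i, j), set())
            if has_k_core(window, adj, k_c):
                keep |= window
    return keep
```

Only the vertices returned here, restricted to those belonging to N, would then be used as boundary vertices for circle rotation.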

Finally, a binary circle-of-revolution method is proposed to examine candidate vertices to improve verification cost. And MC³The main difference of Alg is that when rotating a circle, the rotation is not stopped when a new vertex enters the circle, but rather a binary search strategy is used to deal with this problem. Specifically, the rotation is stopped when such a vertex is reached, and k is satisfied first from the start of entering the vertex to the vertex_c-core. Then, the circle with the vertex on the boundary is checked, if k exists_cCore, then record it and stop rotating; otherwise, starting from the checked circle, find that k can be satisfied_cThe next vertex of core. This approach is very efficient since large areas that do not contain any nuclei can be skipped.

Referring to the example of the binary search process shown in FIG. 6, given the same candidate nodes as in FIG. 4, a binary search is performed on the core number. First, there is an upper bound upper = 3 (from the distance map) and a lower bound lower = 2 (the current best value), so the current core number is k_c = ⌊(upper + lower) / 2⌋ = ⌊(3 + 2) / 2⌋ = 2.

Then, vertices G, H, I are set as boundary vertices. During the rotation, a bisection strategy is applied. First, an ordered list { I, H, F, E, D } is obtained according to the order in which the vertices enter the search circle; this list is denoted InAnglelList. Next, a binary search is performed on InAnglelList to find the vertex at which a 2-core is first satisfied. Because the rotated region { G, H, I } forms a 2-core, vertex H is found first, i.e., the circle is rotated to H and the 2-core { G, H, I } is found. It is recorded, and the lower bound is updated to lower = 2 + 1 = 3. Now k_c is 3; vertex G is set as the boundary vertex and the above process is repeated. When vertex D is on the boundary of the search circle, the rotated region { G, D, E, F, H } forms a 3-core. By rotating the circle directly to vertex D, the 3-core can be found within the circle. In this way, the rotation does not stop at vertices F and E but proceeds directly to vertex D. Finally, { G, D, E, F, H } is found to be the best core in the node.
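As a non-limiting illustration, the binary search over the ordered list of entering vertices might be sketched as follows. The function name `first_vertex_with_k_core` and the edge set are hypothetical: the edges are chosen only so that { G, H, I } forms a 2-core and { G, D, E, F, H } forms a 3-core, and they are not taken from FIG. 6. The sketch also makes the simplifying assumption that vertices only accumulate as the circle rotates (no vertex leaves the circle), which makes the k-core check monotone along the list.

```python
from collections import defaultdict


def has_k_core(vertices, edges, k):
    """Return True if the subgraph induced by `vertices` contains a non-empty k-core."""
    alive = set(vertices)
    adj = defaultdict(set)
    for u, v in edges:
        if u in alive and v in alive and u != v:
            adj[u].add(v)
            adj[v].add(u)
    stack = [v for v in alive if len(adj[v]) < k]
    while stack:
        v = stack.pop()
        if v not in alive:
            continue
        alive.remove(v)
        for w in adj[v]:
            adj[w].discard(v)
            if w in alive and len(adj[w]) < k:
                stack.append(w)
    return bool(alive)


def first_vertex_with_k_core(boundary, in_angle_list, edges, k_c):
    """Binary search for the first entering vertex at which the boundary vertex
    plus all vertices entered so far contain a k_c-core. Returns None if none."""
    lo, hi = 0, len(in_angle_list) - 1
    found = None
    while lo <= hi:
        mid = (lo + hi) // 2
        region = [boundary] + in_angle_list[:mid + 1]
        if has_k_core(region, edges, k_c):
            found = in_angle_list[mid]
            hi = mid - 1  # look for an earlier stopping vertex
        else:
            lo = mid + 1  # skip the part of the rotation without a k_c-core
    return found


# Hypothetical edges (not from FIG. 6).
example_edges = [
    ("G", "H"), ("G", "I"), ("H", "I"),              # triangle G-H-I
    ("G", "D"), ("G", "E"), ("G", "F"),
    ("H", "D"), ("H", "F"), ("D", "E"), ("E", "F"),  # 3-core on G, D, E, F, H
]
in_angle_list = ["I", "H", "F", "E", "D"]
stop_for_2_core = first_vertex_with_k_core("G", in_angle_list, example_edges, 2)
stop_for_3_core = first_vertex_with_k_core("G", in_angle_list, example_edges, 3)
```

With these assumed edges, the search for a 2-core stops at H and the search for a 3-core proceeds directly to D, mirroring the worked example above.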

For the MC³Alg+ algorithm, the computational complexity is analyzed as follows:

Under the same assumptions as for MC³Alg, a check is performed at each expanded node N_ex that needs to be examined (γ(N_ex) = 2D + l). Suppose the maximum core number obtained from the distance map is k_max; the binary search for k then takes at most log k_max iterations. The expanded square area is divided into T cells, and some vertices are filtered out using small squares covering s cells. Each small square covers

vertices and

edges, and the small square needs to be moved (T − s)² times. Therefore, the overhead of the moving process is at most

For each vertex in N, the overhead during the binary circle rotation is at most

(each circle covers

vertices and

edges). Thus, in the worst case, the overall complexity of MC³Alg+ is

To further verify the effect of the present invention, simulation experiments were performed to evaluate the technical effect of the above-described embodiments. The quadtree-based MC³Alg and MC³Alg+ algorithms, the structure-first method, and the space-first method were evaluated. However, since the structure-first and space-first methods run very slowly, their performance is reported in only one set of experiments below. The experimental conditions were set as follows:

1) Data set settings.

The experiments used four data sets: three real data sets (Gowalla, FourSquare, Flickr) and one synthetic data set (YoutubeSyn). In Gowalla, each vertex is a Gowalla user, and each edge represents a friendship between two users. Each user has many check-ins, and the most frequent check-in location is chosen as the user's location. Experiments were also conducted on this data set for the case where a user has multiple check-in locations. In FourSquare, each vertex is a user of the FourSquare website, and each edge represents a social relationship between two users. For each user, the most frequent check-in location is selected as the user's location. In Flickr, the vertices are users and the edges represent the "following" relationship between two users. The location at which a user has the most tagged photos is taken as the user's location. In YoutubeSyn, each vertex represents a YouTube user, and each edge is a "following" relationship between two users. Because this data set contains no location information, a location is generated for each user. Two distributions were used in the experiments to generate the locations: a random distribution and a Gaussian distribution. The details of the data sets are shown in Table 1, wherein,

is the average degree, and max_k is the maximum number of locations on a node.

Table 1: data set attributes

2) Parameter settings.

The parameter m (the number of grid cells in the extended search area) is set to 10. Experiments have shown that this parameter does not have a large impact on performance and that the best running time is achieved when m is 10, so m = 10 is used as the default value in all experiments. In the experiments on multiple user locations, for Gowalla the locations of a user are all of the user's check-ins, and for YoutubeSyn the user's locations are randomly generated. In the experiments on different distributions, locations are generated according to two distributions, a random distribution and a Gaussian distribution. For all data sets, the locations are mapped into a square of size [0,100] × [0,100].
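As a non-limiting illustration, generating synthetic user locations inside the [0,100] × [0,100] square under the two distributions might be sketched as follows. The function name `generate_locations`, the Gaussian mean and standard deviation, the clipping to the square, and the fixed seed are all assumptions made for this sketch; the original description does not specify how the locations were generated.

```python
import random


def generate_locations(n, distribution="random", side=100.0, seed=0):
    """Generate n (x, y) locations inside the square [0, side] x [0, side].

    distribution: "random" draws coordinates uniformly; "gaussian" draws them
    from a normal distribution centred in the square (assumed mean side/2 and
    standard deviation side/6) and clips them to the square.
    """
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    locations = []
    for _ in range(n):
        if distribution == "random":
            x, y = rng.uniform(0.0, side), rng.uniform(0.0, side)
        elif distribution == "gaussian":
            x = min(max(rng.gauss(side / 2, side / 6), 0.0), side)
            y = min(max(rng.gauss(side / 2, side / 6), 0.0), side)
        else:
            raise ValueError("unknown distribution: " + distribution)
        locations.append((x, y))
    return locations


uniform_locations = generate_locations(100, "random")
gaussian_locations = generate_locations(100, "gaussian")
```

Under a Gaussian distribution, the locations concentrate near the centre of the square, so some index nodes receive far more vertices than others.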

3) Experimental equipment.

The experiments were conducted on a machine equipped with an Intel i7-6700 3.40 GHz processor and 16 GB of memory, running Windows 10; all algorithms were implemented in Java.

The experimental results show that factors such as varying the diameter D, a vertex having multiple check-in locations, and varying the location distribution of users all influence the technical effect of the embodiments of the present invention.

FIGS. 7(a) to 7(c) are schematic diagrams showing the relationship between the diameter and the running time. Specifically, changing the diameter D affects the search area and the efficiency of the structure-first method, the space-first method, MC³Alg, and MC³Alg+. In FIGS. 7(a) to 7(c), the abscissa represents the diameter D, which varies from 2.5 to 12.5 (in the coordinates obtained after converting the actual coordinates into the [0,100] × [0,100] square search area), and the ordinate represents the running time in seconds (sec). FIGS. 7(a) to 7(c) show the running times of four algorithms, namely space-first (spatial), structure-first (structural), MC³Alg, and MC³Alg+. FIG. 7(a) shows the experimental results on Flickr, FIG. 7(b) shows the experimental results on FourSquare, and FIG. 7(c) shows the experimental results on Gowalla. It can be observed that MC³Alg+ always outperforms the other algorithms because it has the most pruning and optimization strategies, whereas the space-first and structure-first methods are very time-consuming and are therefore omitted from the subsequent experiments.

FIGS. 8(a) to 8(b) are schematic diagrams showing the relationship between the number of locations and the running time, in which the abscissa represents the number of locations and the ordinate represents the running time (sec). FIG. 8(a) shows the experimental results on the YoutubeSyn data set, and FIG. 8(b) shows the experimental results on the Gowalla data set. When a vertex has multiple check-in locations, more locations lead to more k-core checks. Therefore, the number of check-ins can affect the performance of MC³Alg and MC³Alg+. It can be observed that MC³Alg+ is less affected by multiple check-ins, because performing a binary search speeds up the rotation process of MC³Alg+. In addition, MC³Alg+ runs about 7 times faster than MC³Alg.

FIGS. 9(a) to 9(b) are schematic diagrams showing the relationship between the location distribution and the running time, in which the abscissa is the diameter and the ordinate is the running time. FIG. 9(a) corresponds to the Gaussian distribution on the YoutubeSyn data set, and FIG. 9(b) corresponds to the random distribution on the YoutubeSyn data set. It can be observed that MC³Alg+ always outperforms MC³Alg. It should be noted that the advantage of MC³Alg+ is more pronounced under the Gaussian distribution, because some nodes then contain a very large number of vertices, which gives MC³Alg a higher complexity when searching these nodes.

FIGS. 10(a) to 10(b) are schematic diagrams showing scalability, in which the abscissa is the percentage of vertices relative to the number of vertices of the entire data set (for example, 20% denotes an experiment performed on a sub-data set containing 20% of the vertices of a given data set) and the ordinate is the running time. FIG. 10(a) corresponds to the Flickr data set, and FIG. 10(b) corresponds to the FourSquare data set. The scalability of the embodiments of the present invention is verified by varying the size of these two data sets. It can be observed that both algorithms scale well with data set size, and that MC³Alg+ again runs fastest owing to its additional pruning strategies.

In summary, the present invention provides various embodiments for the problem of searching for the most cohesive co-located community, and integrates spatial information and local structure information in the preferred quadtree-based index structure (i.e., DkQ-TREE), thereby speeding up the search for the target community. Based on DkQ-TREE, two effective algorithms are provided, and their efficiency and effectiveness are demonstrated through extensive experiments on real and synthetic data sets. The community search method provided by the embodiments of the present invention can be used for behavior analysis, recommendation, disease prediction, and the like for social network users.

It should be noted that, although the steps are described in a specific order, the steps are not necessarily performed in the specific order, and in fact, some of the steps may be performed concurrently or even in a changed order as long as the required functions are achieved. Furthermore, those skilled in the art can make appropriate modifications to some embodiments, such as rotating a circle counterclockwise, setting an appropriate diameter D based on the scale of the attribute network, user requirements, query speed requirements, etc., without departing from the spirit of the invention.

The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.

The computer readable storage medium may be a tangible device that retains and stores instructions for use by an instruction execution device. The computer readable storage medium may include, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing.

Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

1. A community search method for an attribute network, comprising the following steps:

step S1: defining a search area range according to the spatial position of an attribute network user, wherein the attribute network is a social network;

step S2: searching for a target community according to the connection closeness among network users in the attribute network, wherein the spatial positions of the users in the target community are within the defined search area range;

wherein step S1 includes the following substeps:

searching for a target community represented by a connected subgraph in the undirected connected graph G, wherein the vertex position of the subgraph can be surrounded by a circle with the diameter D and the vertex in the subgraph forms the highest-order k-core relative to other subgraphs of the undirected connected graph G;

wherein, in step S2, searching for a target community represented by a connected subgraph according to the following steps:

2. The method according to claim 1, wherein in step S25, for a node N in the node list nodeList, the following steps are performed:

3. The method of claim 2, wherein whether the remaining vertices in the expanded square region contain a k-core of an order higher than k_cur is verified according to the following steps:

4. The method of claim 2, wherein whether the remaining vertices in the expanded square region contain a k-core of an order higher than k_cur is verified according to the following steps:

5. The method of claim 2, wherein whether the remaining vertices in the expanded square region contain a k-core of an order higher than k_cur is verified according to the following steps:

6. The method of claim 1, wherein searching for target communities represented in a connected subgraph is performed according to the following steps:

7. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.

8. A computer device comprising a memory and a processor, on which memory a computer program is stored which is executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 6 when executing the program.