CN114691958A

CN114691958A - Community retrieval method based on user geographical location diversity

Info

Publication number: CN114691958A
Application number: CN202210340345.6A
Authority: CN
Inventors: 陈政辉; 武泽文; 徐建
Original assignee: Hangzhou Dianzi University
Current assignee: Hangzhou Dianzi University
Priority date: 2022-04-02
Filing date: 2022-04-02
Publication date: 2022-07-01

Abstract

The invention discloses a community retrieval method based on user geographic location diversity, comprising the following steps: step one, preprocessing the data of a geo-social network, modeling and integrating the users' check-in data, and finding each user's geographic location area to serve as that user's area label; step two, decomposing the users in the location-based social network into one or more groups with close social connections using a community decomposition method, namely performing k-core decomposition on the social network; step three, establishing an EDC-index structure, that is, an index structure of user communities built according to the geographic location area information of the user nodes and the relations between area labels; step four, based on an input set of expected area labels, a social relationship constraint value k, and the number n of communities expected to be returned, querying the user communities whose geographic location diversity ranks in the top n and which satisfy the social relationship constraint; and step five, returning the query result.

Description

Community retrieval method based on user geographic position diversity

Technical Field

The invention belongs to the field of computer application, and particularly relates to a community retrieval method based on user geographic location diversity.

Background

In recent years, with the increasing popularity of smart terminals, social networks based on user location information have become more and more popular. In social networking applications, a user community is a subgraph of the network (typically represented as a graph) consisting of a set of users with dense internal connections and sparse external connections. User community analysis is an important component of social network analysis and research. Because location-based social networks accumulate a large amount of check-in data, and the geographic location of each check-in reflects the behavior characteristics of the user, aggregating cross-region user groups with similar behavior characteristics is important for applications such as cross-region point-of-interest recommendation and marketing. However, traditional user community analysis methods focus only on the social connections among users and ignore both the location information of the users and the cross-region relations between their geographic locations.

The invention has the innovation points that the diversity and the mutual relation of the geographic positions among users are considered during the user community retrieval, and a proper indexing technology is designed to accelerate the retrieval process.

Disclosure of Invention

The invention aims to overcome the defects of the prior art by fully considering the diversity of user geographic location information when aggregating user communities in a location-based social network. Given a social network G = (E, V), V is the set of user nodes and E is the set of connections between users, i.e., the set of edges between users. The geographic location attribute referred to in the present invention is not specific latitude and longitude information but the regional information of the user. For example, a user a_i (a_i ∈ V) denotes a user whose geographic location belongs to region A, and similarly a user c_j (c_j ∈ V) denotes a user whose geographic location belongs to region C. The user community considered in the invention is a k-core community, which satisfies three conditions: (1) the degree of every user node in the community is greater than or equal to k (k is an integer), i.e., every node has at least k neighbors; (2) the subgraph formed by the nodes of the community is connected, i.e., at least one path exists between any two nodes; (3) the k-core community is a maximal subgraph satisfying conditions (1) and (2). If there is a third user a_i' (a_i' ∈ V), the two edges (a_i, a_i') and (a_i, c_j) have different meanings: (a_i, a_i') represents a connection between users in the same region, and (a_i, c_j) represents a connection between users in different regions. The invention aims to return the top-ranked user communities according to the degree of connection among users of different regions within each community.

The invention provides a community retrieval method based on user geographic location diversity. Another innovation of the present invention is to speed up the query process by building an index of edges for all communities.

The invention adopts the following specific technical scheme:

a community retrieval method based on user geographic location diversity comprises the following steps:

step one, preprocessing the data of a geo-social network, modeling and integrating the users' check-in data, and finding each user's geographic location area to serve as that user's area label;

step two, decomposing the users in the location-based social network into one or more groups with close social connections using a community decomposition method, namely performing k-core decomposition on the social network;

step three, establishing an EDC-index structure, that is, an index structure of user communities built according to the geographic location area information of the user nodes and the relations between area labels;

step four, based on an input set of expected area labels, a social relationship constraint value k, and the number n of communities expected to be returned, querying the user communities whose geographic location diversity ranks in the top n and which satisfy the social relationship constraint;

and step five, returning the query result, namely returning the n user communities in the priority queue.

Further, the first step is as follows:

for each user in the data set, all corresponding check-in positions are acquired as the set L(U_i) = {l_1, l_2, …}. The check-in data is first cleaned: for each check-in position l_i, the offset dist(i) = avg(l_i − l_k), l_k ∈ L(U_i), is calculated, where l_k denotes the check-ins other than l_i. Noisy check-ins with a very large dist offset are deleted; a clustering algorithm is then used to find a circle that contains as many check-in places as possible with as small a radius as possible, and the geographic location area in which the center of that circle lies is taken as the user's area label.

Further, in step two, a k-core decomposition algorithm is used, and all graph nodes are sorted in reverse order of degree. For a given graph G, the core number (core-number) of a node v is defined as the integer ω such that v is a node of the ω-core of G. The node v with the minimum degree is then selected and its degree is recorded as d; all nodes with degree d and their edges are deleted, and it is checked whether the degrees of all remaining nodes are greater than d; if not, nodes whose degree is not greater than d and their edges continue to be deleted until the degrees of all remaining nodes are greater than d, and the core number of every node deleted in the current round is set to d. This process is repeated until all nodes have been processed.

Further, in step three, based on the core numbers obtained by the decomposition in step two, all graph nodes and edges are first sorted according to the following rules:

ordering of graph nodes: for a graph node label set L, if a subset A of L contains two labels l_i and l_j, the two label types are ranked according to some rule, for example l_i < l_j; if two labels belong to the same type, the nodes are ordered by the ID numbers of the graph nodes;

ordering of edges: when an edge is written as (u, v), the core number of u is always greater than or equal to that of v; two edges (u, v) and (u', v') are ordered as follows, where the Order() function denotes the rank of an edge:

first, the edges are compared by the core numbers of u and u': if the core number of u is greater than the core number of u', then Order((u, v)) < Order((u', v'));

second, if the core number of u equals the core number of u' and the label type rank of u is less than the label type rank of u', then Order((u, v)) < Order((u', v'));

third, if the core number of u equals the core number of u' and the label type rank of u equals the label type rank of u', but the label type rank of v is less than the label type rank of v', then Order((u, v)) < Order((u', v'));

finally, if the core number of u equals the core number of u', the label type rank of u equals the label type rank of u', and the label type rank of v equals the label type rank of v', the two edges are ordered by the IDs of the graph nodes;

after the sorting is finished, the EDC-index tree index is constructed; the construction uses a union-find (disjoint-set) data structure, edges are selected in the specified order during construction, and the merge operations are performed in sequence starting from the highest-ranked edges to generate the tree nodes of the index; while an index node is generated, the attributes of the tree node are computed along the way.

Furthermore, in step four, after the EDC-index is constructed, the whole social network has been decomposed into a plurality of k-core communities; the EDC-index stores only the subgraphs formed by k-core communities, and given a geographic label set and integers k and n, the query result is obtained by searching the k-list of the EDC-index.

Further, the specific query process is as follows:

4-1, searching for the k-list whose core number equals k, according to the input query value k;

4-2, using the EDD (Edge Diversity Degree) to represent the geographic location diversity of a user community; storing the search results in a minimum priority queue whose length is n; the priority queue holds the user communities found so far whose EDD values rank in the top n; each time a new user community is found whose EDD value is greater than that of the community at the top of the priority queue, inserting it into the priority queue;

4-3, for each tree node in the k-list, searching its corresponding subtree: first traversing the subtree and merging the geographic location labels of all child nodes from bottom to top; if the merged geographic location labels of a tree node contain the input label set Q, computing the EDD statistics for the tree node, that is, for the corresponding user community; after the EDD value is computed, if it is greater than the EDD value of the tree node at the top of the priority queue, pushing the tree node into the priority queue;

4-4, after traversing the subtree of one tree node in the k-list, searching the next tree node in the k-list until the whole EDC-index has been traversed.

The beneficial effects of the technical scheme are as follows:

Based on user check-in data, the invention uses a semantic distance conversion model to convert spatial positions into semantic geographic location labels, establishes an efficient indexing mechanism, and supports queries for social network user communities based on geographic location diversity under different scenarios. The method achieves higher query efficiency because it uses a label conversion model based on region information and an index structure designed for this query.

Drawings

In order to more clearly illustrate the technical process of the present invention, the drawings required by the present invention are further described below;

FIG. 1 is an overall flow diagram of the method of the present invention;

FIG. 2 is an example of a social network based on location information of the present invention;

FIG. 3 is a parameter diagram of node cases in the process of building an EDC-index structure;

FIG. 4 is a schematic diagram of the EDC-Index structure construction process of the present invention;

FIG. 5 is an exemplary diagram of an EDC-Index structure;

FIG. 6 is a flow chart of the present invention for conducting a geographically diverse user-community search based on entered query parameters.

Detailed Description

The invention will be further described with reference to the accompanying drawings.

As shown in fig. 1, the community retrieval method based on the diversity of the user geographic location of the present invention includes the following steps:

step one, preprocessing the data of a geo-social network, modeling and integrating the users' check-in data, and finding each user's geographic location area to serve as that user's area label, which establishes the basis for the subsequent diversity-based community retrieval;

step two, decomposing the users in the location-based social network into one or more groups with close social connections using a community decomposition method, namely performing k-core decomposition on the social network;

step three, establishing an EDC-index structure on the basis of steps one and two, that is, an index structure of user communities built according to the geographic location area information of the user nodes and the relations between area labels; this index structure accelerates the query process for diverse user communities;

step four, based on an input set of expected area labels, a social relationship constraint value k, and the number n of communities expected to be returned, querying the user communities whose geographic location diversity ranks in the top n and which satisfy the social relationship constraint;

and step five, returning the query result.

In the overall flow chart of the method of the present invention shown in fig. 1, the first step is specifically implemented as follows:

For each user in the data set, all corresponding check-in positions are acquired as the set L(U_i) = {l_1, l_2, …}. The check-in data is first cleaned: for each check-in position l_i, the offset dist(i) = avg(l_i − l_k), l_k ∈ L(U_i), is calculated, where l_k denotes the check-ins other than l_i. Noisy check-ins with a very large dist offset are deleted; a clustering algorithm is then used to find a circle that contains as many check-in places as possible with as small a radius as possible, and the geographic location area in which the center of that circle lies is taken as the user's area label.
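The cleaning and labelling of step one can be illustrated with a minimal Python sketch. Several details here are assumptions that the description does not fix: check-ins are treated as planar (x, y) points, a check-in counts as noise when its offset exceeds `factor` times the median offset, the centroid of the kept points stands in for the center of the circle found by the clustering algorithm, and `region_of` is a hypothetical callable that maps a point to an area label.

```python
from math import hypot
from statistics import mean, median


def mean_offset(point, others):
    """Average distance from one check-in to the user's other check-ins."""
    return mean(hypot(point[0] - o[0], point[1] - o[1]) for o in others)


def region_label(checkins, region_of, factor=3.0):
    """Return an area label for one user (illustrative only).

    checkins:  non-empty list of (x, y) points, assumed planar coordinates
    region_of: hypothetical callable mapping a point to an area label
    factor:    assumed noise threshold, as a multiple of the median offset
    """
    if len(checkins) < 3:
        kept = list(checkins)  # too few points to judge noise
    else:
        offsets = [
            mean_offset(p, checkins[:i] + checkins[i + 1:])
            for i, p in enumerate(checkins)
        ]
        cutoff = factor * median(offsets)
        kept = [p for p, d in zip(checkins, offsets) if d <= cutoff]
    # Stand-in for the clustering step: centroid of the kept check-ins.
    cx = mean(p[0] for p in kept)
    cy = mean(p[1] for p in kept)
    return region_of((cx, cy))
```

In this sketch a single far-away check-in is dropped before the center is computed, so it cannot pull the user's label into a different area.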

Fig. 2 shows an example of a geo-social network, in which a, b and c represent three different region labels. For ease of understanding, the graph nodes are named after their region labels, which are obtained by the modeling in step one. The user groups enclosed by dotted lines are two user groups with closer social relationships obtained by the k-core decomposition algorithm of step two, described below.

The concrete realization of the second step is as follows:

In this step, a k-core decomposition algorithm is used, and all graph nodes are sorted in reverse order of degree. For a given graph G, the core number (core-number) of a node v is defined in the present invention as the integer ω such that v is a node of the ω-core of G but not of the (ω+1)-core of G. The node v with the minimum degree is then selected and its degree is recorded as d; all nodes with degree d and their edges are deleted, and it is checked whether the degrees of all remaining nodes are greater than d. If not, nodes whose degree is not greater than d and their edges continue to be deleted until the degrees of all remaining nodes are greater than d, and the core-number of every node deleted in the current round is set to d. This process is repeated until all nodes have been processed.

Applying this step to the nodes of the middle connected component in Fig. 2, namely graph nodes a1, a2, a3, b1, b2, b3, c1 and c2, gives a core-number of 3 for nodes a2, b1, b2 and c2, a core-number of 2 for nodes a1 and b3, and a core-number of 1 for nodes a3 and c1.
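The peeling procedure of step two can be sketched in Python as follows. This is a plain, unoptimized illustration under the assumption that the graph is given as an adjacency dictionary; the function name `core_numbers` and the input format are illustrative and not taken from the description, and the small graph used in the usage note is hypothetical rather than the graph of Fig. 2.

```python
def core_numbers(adj):
    """Compute the core-number of every node by repeated peeling.

    adj: dict mapping each node to the set of its neighbours (undirected).
    """
    degree = {v: len(ns) for v, ns in adj.items()}
    remaining = set(adj)
    core = {}
    while remaining:
        # d is the minimum degree among the nodes still in the graph.
        d = min(degree[v] for v in remaining)
        # Delete every node whose degree is, or drops to, d or below.
        stack = [v for v in remaining if degree[v] <= d]
        while stack:
            v = stack.pop()
            if v not in remaining:
                continue  # already deleted in this round
            remaining.remove(v)
            core[v] = d
            for u in adj[v]:
                if u in remaining:
                    degree[u] -= 1
                    if degree[u] <= d:
                        stack.append(u)
    return core
```

For a hypothetical graph made of a 4-clique {a, b, c, d}, a node e joined to a and b, and a node f joined only to e, the sketch assigns core-number 3 to the clique nodes, 2 to e and 1 to f.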

The third step is realized as follows:

In this step, based on the core-numbers obtained by the decomposition in step two, all graph nodes and edges are first sorted according to the following rules:

Ordering of graph nodes: for a graph node label set L, if a subset A of L contains two labels l_i and l_j, the two label types are ranked according to some rule, for example l_i < l_j; if two labels are of the same type, the nodes are ordered by the ID numbers of the graph nodes.

Ordering of edges: when an edge is written as (u, v), the core-number of u is always greater than or equal to that of v. Two edges (u, v) and (u', v') are ordered as follows, where the Order() function denotes the rank of an edge:

First, the edges are compared by the core-numbers of u and u'. If the core-number of u is greater than the core-number of u', then Order((u, v)) < Order((u', v')).

Second, if the core-number of u equals the core-number of u' but the label type rank of u is less than the label type rank of u', then Order((u, v)) < Order((u', v')).

Third, if the core-number of u equals the core-number of u' and the label type rank of u equals the label type rank of u', but the label type rank of v is less than the label type rank of v', then Order((u, v)) < Order((u', v')).

Finally, if the core-number of u equals the core-number of u', the label type rank of u equals the label type rank of u', and the label type rank of v equals the label type rank of v', the two edges are ordered by the IDs of the graph nodes.
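The edge ordering rules above amount to a lexicographic sort key, which the following Python sketch makes explicit. The names `order_edges`, `core`, `label` and `label_rank` are illustrative. How an edge is oriented when both endpoints have the same core-number is not stated in the description, so the tie-break used in `orient` (label rank, then node ID) is an assumption.

```python
def order_edges(edges, core, label, label_rank):
    """Sort edges by the rules above; earlier in the list means smaller Order().

    edges:      iterable of (node, node) pairs
    core:       dict node -> core-number
    label:      dict node -> region label
    label_rank: dict label -> rank of the label type (smaller sorts first)
    """
    def orient(a, b):
        # Write the edge as (u, v) with core[u] >= core[v]; the tie-break
        # by label rank and node ID is an assumption of this sketch.
        ka = (-core[a], label_rank[label[a]], a)
        kb = (-core[b], label_rank[label[b]], b)
        return (a, b) if ka <= kb else (b, a)

    def key(edge):
        u, v = edge
        return (
            -core[u],              # larger core-number of u comes first
            label_rank[label[u]],  # then label type rank of u
            label_rank[label[v]],  # then label type rank of v
            u, v,                  # finally the node IDs
        )

    return sorted((orient(a, b) for a, b in edges), key=key)
```

Expressing the four rules as one tuple key keeps the comparison consistent and lets the built-in sort handle the cascade of tie-breaks.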

After sorting, the EDC-index tree index is constructed. The construction uses a union-find (disjoint-set) data structure but differs from the ordinary union-find operation: edges are selected in the specified order during construction, and the merge operations are performed in sequence starting from the highest-ranked edges to generate the tree nodes of the index. While an index node is generated, the attributes of the tree node, such as the label set of the subtree to which the node belongs, are computed along the way. Specifically, a tree node contains the following attributes:

a set of subtree nodes; a subtree label set; for each label, a list of the corresponding edges, grouped by the start node of each edge in the original graph and sorted by the end node of each edge; and pointers, including pointers to the parent node and to the left and right sibling nodes in the index tree. For each k-core value, a k-list is built in the index to facilitate subsequent lookups.
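The merging behaviour that the index construction relies on can be illustrated with a small union-find sketch in Python that also maintains one of the tree node attributes, the label set, while components are merged. This is deliberately partial: it does not build the tree hierarchy, the per-label edge lists, the sibling pointers or the k-list, and the class name `LabelledUnionFind` is hypothetical.

```python
class LabelledUnionFind:
    """Union-find whose components carry the set of region labels they contain."""

    def __init__(self, label):
        # label: dict node -> region label; every node starts as its own component.
        self.parent = {v: v for v in label}
        self.labels = {v: {lab} for v, lab in label.items()}

    def find(self, v):
        """Return the representative of v's component (with path halving)."""
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]
            v = self.parent[v]
        return v

    def union(self, a, b):
        """Merge the components of a and b and combine their label sets."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        # Keep the root that already holds the larger label set.
        if len(self.labels[ra]) < len(self.labels[rb]):
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.labels[ra] |= self.labels.pop(rb)
        return ra
```

Feeding the edges to `union` in the specified order merges two components per edge, and the label set of each merged component is available immediately, which is the kind of attribute the description says is computed along the way.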

Fig. 3 shows a table of tree node states during construction of the EDC-index structure for the example of Fig. 2. To keep the table simple, only the index of the middle connected component in Fig. 2, that is, of graph nodes a1, a2, a3, b1, b2, b3, c1 and c2, is shown. From top to bottom, each row of the table represents one step of the edge traversal, from (a2, b1) to (c1, c2); from left to right, the columns represent the tree nodes of the index tree. Every graph node is initialized as its own tree node; each time an edge is traversed, a union operation merges two tree nodes, and an index tree is finally obtained.

Fig. 4 is a schematic diagram illustrating the process of constructing the EDC-index structure by the graph nodes (a1, a2, a3, b1, b2, b3, c1, and c2), which is constructed based on the geo-social network illustrated in fig. 2, wherein each tree node is a user group with close social relationship, and each tree node is constructed with an inverted index of relevant edges.

FIG. 5 shows an EDC-index structure that has been constructed.

The concrete implementation of the fourth step is as follows:

in this step, based on the input expected area tag set Q, the social relationship constraint degree k value and the number n of communities expected to return, the user communities with the geographical location information diversity degree ranking in the top n are queried, which specifically includes the following steps:

after the EDC-index is constructed in the third step, the whole social network is decomposed into a plurality of k-core communities. A subgraph formed by a k-core community is stored in the EDC-index, obviously, a child node in the index is a component of a parent node k-core community, and therefore each tree node corresponds to a user community. Given a set of geotags, integers k and n, query results are conveniently obtained by searching the k-list of EDC-index. For example, if the user inputs that the k value is 3 and the n value is 9, only the list with the k value greater than or equal to 3 needs to be searched, and the first 9 communities are selected. And in the searching process, the EDC-index can be pruned by using the minimum value of the diversity of currently known geographic position information so as to accelerate the searching speed.

The query algorithm is divided into two steps:

Verification of labels: the k-core community corresponding to a tree node in the EDC-index also contains the users in the nodes of its subtree, so the geographic location labels contained in the subtree nodes must be considered as well when judging whether the query condition is met.

Calculation of geographic location diversity: if the community corresponding to a tree node meets the requirement of the geographical location label, then the calculation of the geographical location diversity is carried out, namely the number of edges of different geographical location labels in the subtree is counted. If the geographic location diversity calculated by the community corresponding to one tree node is higher than the existing n communities, the newly found communities are reserved, and the communities with the minimum diversity are removed.
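The label verification described above can be sketched as a bottom-up traversal in Python. The tree representation is an assumption of this sketch: `children` maps a tree node to its child nodes and `own_labels` maps a tree node to the labels of the users stored directly in it; neither name comes from the description.

```python
def collect_matches(root, children, own_labels, query):
    """Return the tree nodes whose merged subtree labels contain the query set.

    root:       identifier of the subtree root
    children:   dict tree node -> list of child tree nodes (assumed layout)
    own_labels: dict tree node -> set of labels stored directly in the node
    query:      set of required region labels (Q in the description)
    """
    matches = []

    def visit(node):
        # Merge labels bottom-up: a node's labels plus those of its subtree.
        merged = set(own_labels[node])
        for child in children.get(node, ()):
            merged |= visit(child)
        if query <= merged:
            matches.append(node)
        return merged

    visit(root)
    return matches
```

Only the nodes returned by this check would go on to the diversity calculation, so communities that cannot cover the requested labels are skipped without counting any edges.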

The specific process is as follows:

4-1 According to the input query value k, the k-list whose core-number equals k is searched.

4-2 The Edge Diversity Degree (EDD) is used to represent the geographic location diversity of a user community. The EDD is the ratio of the number of edges in a user community that connect users of different geographic location areas to the total number of edges in that community. The search results are stored in a minimum priority queue whose length is the input n. The priority queue holds the user communities found so far whose EDD values rank in the top n. Each time a new user community is found whose EDD value is greater than that of the community at the top of the priority queue, the new community is inserted into the priority queue. Because the length of the priority queue is n, the queue automatically deletes the community with the previous minimum EDD value, so that the queue always holds n or fewer user communities.

4-3 For each tree node in the k-list, its corresponding subtree is searched. First, the subtree is traversed and the geographic location labels of the child nodes are merged from bottom to top. If the merged geographic location labels of a tree node contain the input label set Q, EDD statistics are computed for that tree node, that is, for the corresponding user community. Because all edges are arranged in a fixed order, the EDD statistics can be computed conveniently. After the EDD value is computed, if it is greater than the EDD value of the tree node at the top of the priority queue, the tree node is pushed into the priority queue.
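The EDD ratio and the bounded minimum priority queue of steps 4-2 and 4-3 can be sketched in Python with the standard `heapq` module. The function names `edd` and `top_n`, and the representation of each community as a plain list of edges, are assumptions made for illustration; the description itself obtains the edges from the index rather than from such lists.

```python
import heapq


def edd(edges, label):
    """Share of edges whose two endpoints carry different region labels."""
    if not edges:
        return 0.0
    cross = sum(1 for u, v in edges if label[u] != label[v])
    return cross / len(edges)


def top_n(communities, label, n):
    """Keep the n communities with the highest EDD using a min-heap.

    communities: dict community name -> list of (u, v) edges (assumed layout)
    label:       dict node -> region label
    """
    heap = []  # (edd, name) pairs; heap[0] is the weakest community kept
    for name, edges in communities.items():
        score = edd(edges, label)
        if len(heap) < n:
            heapq.heappush(heap, (score, name))
        elif score > heap[0][0]:
            # Replace the current minimum, keeping the queue length at n.
            heapq.heapreplace(heap, (score, name))
    return sorted(heap, reverse=True)
```

Because the smallest kept EDD sits at the top of the heap, each candidate is compared against a single value, and a candidate that cannot enter the top n is rejected without reordering the queue.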

Fig. 6 shows the query process of step four: according to the input query parameters k, S and n, k-core decomposition, index construction, tree node traversal and related processes are performed, and the user groups with the greatest geographic location diversity in the geo-social network are finally output.

Claims

1. A community retrieval method based on user geographic location diversity is characterized by comprising the following steps:

step two, decomposing users in the position social network into one or more groups with close social connections by using a community decomposition method, namely performing k-core decomposition on the social network;

step three, establishing an EDC-index structure; establishing an index structure of a user community according to geographical position area information among user nodes or the relation among area labels;

2. The community retrieval method based on the diversity of the user geographic locations as claimed in claim 1, wherein the first step is as follows:

for each user in the data set, all corresponding check-in positions are acquired as the set L(U_i) = {l_1, l_2, …}; the check-in data is first cleaned, and for each check-in position l_i the offset dist(i) = avg(l_i − l_k), l_k ∈ L(U_i), is calculated, where l_k denotes the check-ins other than l_i; noisy check-ins with a very large dist offset are deleted, a clustering algorithm is then used to find a circle that contains as many check-in places as possible with as small a radius as possible, and the geographic location area in which the center of that circle lies is taken as the user's area label.

3. The community retrieval method based on the diversity of the user geographical locations as claimed in claim 1, wherein:

in step two, a k-core decomposition algorithm is used, and all graph nodes are sorted in reverse order of degree; for a given graph G, the core number (core-number) of a node v is defined as the integer ω such that v is a node of the ω-core of G; the node v with the minimum degree is then selected and its degree is recorded as d, all nodes with degree d and their edges are deleted, and it is checked whether the degrees of all remaining nodes are greater than d; if not, nodes whose degree is not greater than d and their edges continue to be deleted, and the core number of every node deleted in the current round is set to d; this process is repeated until all nodes have been processed.

4. The community retrieval method based on the diversity of the user geographical locations as claimed in claim 3, wherein:

in step three, based on the core numbers obtained by the decomposition in step two, all graph nodes and edges are first sorted according to the following rules:

ordering of graph nodes: for a graph node label set L, if a subset A of L contains two labels l_i and l_j, the two label types are ranked according to some rule, for example l_i < l_j; if two labels belong to the same type, the nodes are ordered by the ID numbers of the graph nodes;

first, the edges are compared by the core numbers of u and u': if the core number of u is greater than the core number of u', and the Order() function denotes the rank of an edge, then Order((u, v)) < Order((u', v'));

after the sorting is finished, the EDC-index tree index is constructed; the construction uses a union-find (disjoint-set) data structure, edges are selected in the specified order during construction, and the merge operations are performed in sequence starting from the highest-ranked edges to generate the tree nodes of the index; while an index node is generated, the attributes of the tree node are computed along the way.

5. The community retrieval method based on the diversity of the user geographical locations as claimed in claim 4, wherein:

in the fourth step, after the EDC-index is constructed, the whole social network is decomposed into a plurality of k-core communities; and only storing a subgraph formed by a k-core community in the EDC-index, giving a geographic label set, integers of k and n, and searching a k-list of the EDC-index to obtain a query result.

6. The community retrieval method based on the diversity of the user geographical locations as claimed in claim 5, wherein:

the specific query process is as follows:

4-2, the EDD (Edge Diversity Degree) is used to represent the geographic location diversity of a user community; the search results are stored in a minimum priority queue whose length is n; the priority queue holds the user communities found so far whose EDD values rank in the top n; each time a new user community is found whose EDD value is greater than that of the community at the top of the priority queue, it is inserted into the priority queue;