CN110674390B

CN110674390B - Confidence-based group discovery method and device

Info

Publication number: CN110674390B
Application number: CN201910747703.3A
Authority: CN
Inventors: 井雅琪; 李扬曦; 任博雅; 杨亚茹; 沈华伟; 佟玲玲; 时磊; 王永庆; 段运强; 段东圣
Original assignee: Institute of Computing Technology of CAS; National Computer Network and Information Security Management Center
Current assignee: Institute of Computing Technology of CAS; National Computer Network and Information Security Management Center
Priority date: 2019-08-14
Filing date: 2019-08-14
Publication date: 2022-05-20
Anticipated expiration: 2039-08-14
Also published as: CN110674390A

Abstract

The invention discloses a confidence-based group discovery method and device. The method comprises the following steps: step 1, setting constraint conditions of a group, and generating a candidate user set and a candidate network of the group based on the constraint conditions; step 2, obtaining the confidence that each candidate user belongs to the group based on the candidate user set and the candidate network; step 3, comparing the confidence of each candidate user with preset confidence thresholds to find new seed users and new candidate users; and step 4, acquiring the new seed users and repeating steps 1 to 4 until a preset number of iterations is reached.

Description

Confidence-based group discovery method and device

Technical Field

The invention relates to the technical field of computers, in particular to a group discovery method and device based on confidence.

Background

With the rapid development of the Internet, social networks have become an important platform for daily communication and information sharing. The group is an important mesoscopic structure of a social network. Group discovery and analysis are of theoretical significance and also promote the application and development of social networks: they make it possible to discover groups that engage in malicious behavior and endanger public security, and to guide reasonable management and control, and therefore have research significance and application value for social network services and security control. However, the massive data generated by users on social network platforms brings both opportunities and challenges to group discovery and behavior analysis. How to discover a specific group among a large number of network users and analyze its behavior is a problem that urgently needs to be solved.

Traditional group discovery algorithms are based on the idea of community structure cohesion and mainly consider the structural cohesion of social networks, namely that nodes in the same community are densely connected while nodes in different communities are sparsely connected. Classical group discovery algorithms include the LPA algorithm, the Louvain algorithm, and the CPM algorithm.

The LPA (Label Propagation Algorithm), proposed by Usha Nandini Raghavan et al. in 2007, is a graph-based semi-supervised learning algorithm. Its basic idea is to predict the label information of unlabeled nodes from that of labeled nodes, building a complete graph model from the relationships between samples. The LPA algorithm predicts and propagates labels for unlabeled data by using the intrinsic structure and distribution of the data and the labels of adjacent data. Its main strengths are simplicity and efficiency; its drawbacks are that the result of each run is unstable and the accuracy is low.
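
As an illustration of the label propagation idea described above, the following is a minimal sketch of the community-detection variant, in which every node starts with its own label and repeatedly adopts the most frequent label among its neighbours. The function name `label_propagation`, the adjacency-dict input format, and the `max_iters` and `seed` parameters are assumptions of this sketch and do not come from the patent.

```python
import random
from collections import Counter


def label_propagation(adjacency, max_iters=100, seed=0):
    """Asynchronous label propagation on an undirected graph.

    adjacency: dict mapping each node to a set of neighbour nodes.
    Returns a dict mapping each node to its final community label.
    """
    rng = random.Random(seed)
    labels = {node: node for node in adjacency}  # each node starts in its own community
    nodes = list(adjacency)
    for _ in range(max_iters):
        rng.shuffle(nodes)  # random update order is one source of unstable results
        changed = False
        for node in nodes:
            if not adjacency[node]:
                continue  # an isolated node keeps its own label
            counts = Counter(labels[nb] for nb in adjacency[node])
            best = max(counts.values())
            candidates = sorted(lab for lab, c in counts.items() if c == best)
            if labels[node] in candidates:
                continue  # already holds one of the most frequent labels
            labels[node] = rng.choice(candidates)  # ties are broken at random
            changed = True
        if not changed:
            break  # every node agrees with the majority of its neighbours
    return labels
```

The random update order and random tie-breaking are what make the output vary between runs, which matches the instability noted above.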

The Louvain algorithm is based on multi-level modularity optimization. Modularity was originally introduced to measure the quality of a community discovery result and characterizes how tightly connected the discovered communities are. The Louvain algorithm has two phases. In the first phase, the nodes of the network are traversed repeatedly, and each node is moved into the community that yields the largest modularity gain, until no node changes community. In the second phase, the result of the first phase is processed: each small community is merged into a super node to reconstruct the network, where the weight of an edge between two super nodes is the sum of the weights of the edges between the original nodes of the two communities.
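
The patent describes modularity only in words. For reference, the standard definition for a weighted undirected graph (not reproduced from the patent) is given below, where A_ij is the weight of the edge between nodes i and j, k_i is the sum of the weights of the edges attached to node i, m is the total edge weight of the graph, c_i is the community of node i, and δ equals 1 when its two arguments are equal and 0 otherwise:

```latex
Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)
```

The first phase of the Louvain algorithm greedily increases Q by moving single nodes between communities.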

The CPM (Clique Percolation Method) algorithm is the earliest overlapping community discovery algorithm and is based on clique percolation theory. The algorithm regards a community as a set of fully connected subgraphs that share nodes, and identifies the community structure of a network by clique percolation. It first finds all complete subgraphs with k nodes (k-cliques) and then builds a new graph whose nodes are the k-cliques; if two k-cliques share (k-1) nodes, an edge is added between the corresponding nodes of the new graph. Finally, each connected subgraph of the new graph is a community. The algorithm can be applied to bipartite graphs, directed graphs, and weighted graphs.
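
The three stages of clique percolation described above can be sketched as follows. This is a brute-force illustration suitable only for very small graphs; the function name `k_clique_communities` and the adjacency-dict input format are assumptions of this sketch.

```python
from itertools import combinations


def k_clique_communities(adjacency, k):
    """Clique percolation on a small undirected graph (brute force).

    adjacency: dict mapping each node to a set of neighbour nodes.
    Returns a list of communities (sets of nodes); communities may overlap.
    """
    # 1. Enumerate every complete subgraph with exactly k nodes.
    cliques = [
        frozenset(c)
        for c in combinations(sorted(adjacency), k)
        if all(v in adjacency[u] for u, v in combinations(c, 2))
    ]

    # 2. Build the clique graph with a union-find structure:
    #    two k-cliques are linked when they share k - 1 nodes.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)

    # 3. Each connected component of the clique graph is one community.
    groups = {}
    for i, clique in enumerate(cliques):
        groups.setdefault(find(i), set()).update(clique)
    return list(groups.values())
```

A node that belongs to cliques in two different components appears in both communities, which is how the method produces overlapping communities.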

The prior art has the following problems:

1) existing group discovery methods process the whole network and are therefore only suitable for small-scale social networks; for large-scale networks with millions of users or more, the amount of calculation is so large that the methods cannot be run in practice;

2) existing group discovery methods rely on the structural information of the network and do not consider other factors such as the geographic position, texts, and calls of a user, so the accuracy of group discovery is low.

Disclosure of Invention

The embodiment of the invention provides a group discovery method and device based on confidence coefficient, which are used for solving the problems in the prior art.

The embodiment of the invention provides a group discovery method based on confidence coefficient, which comprises the following steps:

step 1, setting a constraint condition of a group, and generating a candidate user set and a candidate network of the group based on the constraint condition;

step 2, obtaining the confidence degree of each candidate user belonging to the group based on the candidate user set and the candidate network;

step 3, comparing the confidence coefficient of the candidate user with a preset confidence coefficient threshold value to find a new seed user and a new candidate user;

step 4, acquiring the new seed users, and repeating steps 1 to 4 until a preset number of iterations is reached.

The embodiment of the present invention further provides a group discovery apparatus based on confidence, including:

the generating module is used for setting a constraint condition of the group and generating a candidate user set and a candidate network of the group based on the constraint condition;

the confidence coefficient determining module is used for comprehensively obtaining the confidence coefficient of each candidate user belonging to the group based on the candidate user set and the candidate network;

the comparison module is used for comparing the confidence coefficient of the candidate user with a preset confidence coefficient threshold value, finding a new seed user and a new candidate user and acquiring the new seed user;

and the calling module calls the generating module, the confidence coefficient determining module and the comparing module in sequence until the preset iteration times are reached.

The embodiment of the invention also provides a confidence-based group discovery apparatus, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the confidence-based group discovery method described above.

The embodiment of the invention also provides a computer-readable storage medium, wherein an implementation program for information transmission is stored on the computer-readable storage medium, and when the program is executed by a processor, the steps of the group discovery method based on the confidence degree are implemented.

By adopting the embodiment of the invention, the method is not limited by the network scale, and can quickly and effectively discover the specific group on a large-scale network; network structure information and attribute information of the users are fully utilized, and the accuracy rate of group discovery is improved; under the condition that the seed users are rare or even missing, related group users can still be found; the technical scheme of the embodiment of the invention has expandability and is also suitable for other attribute information of the user.

The foregoing description is only an overview of the technical solutions of the present invention. So that the technical means of the invention can be understood more clearly, and so that the above and other objects, features, and advantages of the invention are more readily apparent, embodiments of the invention are described below.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

FIG. 1 is a schematic diagram of a confidence-based group discovery method according to an embodiment of the invention;

FIG. 2 is a schematic diagram of the detailed process of a confidence-based group discovery method according to an embodiment of the present invention;

FIG. 3 is a network attribute diagram of a confidence-based group discovery method according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a confidence-based group discovery apparatus according to a first embodiment of the present invention;

FIG. 5 is a schematic diagram of a confidence-based group discovery apparatus according to a second embodiment of the present invention.

Detailed Description

In practicing group discovery, the inventors found that traditional group discovery algorithms can only process small-scale networks and have difficulty with large-scale networks. Moreover, traditional algorithms use only the structure information of the network, namely the association relations among users; other attribute information of users, such as geographical position, texts, and calls, is not fully utilized, so the accuracy of group discovery is low. To solve the above problems, an embodiment of the present invention provides a confidence-based group discovery method.

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

According to an embodiment of the present invention, a confidence-based group discovery method is provided. FIG. 1 is a schematic diagram of the confidence-based group discovery method according to the embodiment of the present invention. As shown in FIG. 1, the method specifically includes:

step S1, setting the constraint condition of the group, and generating the candidate user set and the candidate network of the group based on the constraint condition;

step S1 specifically includes the following processing:

1. defining a group, and setting a constraint condition of the group, wherein the constraint condition comprises at least one of the following conditions: seed user set, group keyword, region, time;

2. searching users who have communication with the seed user from the short text data and the call data of the seed user, filtering out users who do not accord with the region and time constraints, and adding the users into a candidate user set;

3. searching short text data in full text, finding out texts containing group keywords, and adding related users into a candidate user set;

4. associating the candidate users in the candidate user set through texts and calls to form a candidate network.
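
The four sub-steps of step S1 can be sketched as below. The `Record` schema, the function name `build_candidates`, the use of the record's region as the user's region, the exclusion of seed users from the candidate set, and the inclusion of seed users as nodes of the candidate network are all assumptions of this sketch; the patent does not specify a data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One short text or call record (hypothetical schema, not from the patent)."""
    sender: str
    receiver: str
    region: str
    timestamp: int   # for example, seconds since the epoch
    text: str = ""   # empty for call records


def build_candidates(records, seeds, keywords, regions, time_range):
    """Return (candidate_users, candidate_edges) under the group constraints."""
    start, end = time_range
    seeds = set(seeds)
    candidates = set()
    for r in records:
        # Sub-step 2: contacts of seed users, filtered by region and time.
        if r.sender in seeds or r.receiver in seeds:
            if r.region in regions and start <= r.timestamp <= end:
                candidates.update({r.sender, r.receiver})
        # Sub-step 3: full-text keyword search over the short texts.
        if any(kw in r.text for kw in keywords):
            candidates.update({r.sender, r.receiver})
    candidates -= seeds  # assumption: seeds are tracked separately from candidates
    # Sub-step 4: text and call links among candidates and seeds form the network.
    members = candidates | seeds
    edges = {
        (r.sender, r.receiver)
        for r in records
        if r.sender in members and r.receiver in members
    }
    return candidates, edges
```

Because only records that touch a seed user or contain a keyword contribute users, the later steps operate on this reduced network rather than on the full data set.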

Step S2, obtaining the confidence degree of each candidate user belonging to the group based on the candidate user set and the candidate network;

step S2 specifically includes the following processing:

1. for each user in the candidate user set, acquiring short text characteristics of the user, performing word segmentation and stop word removal processing on the short text content of each user to obtain a word set of the user, and calculating the matching degree of the word set and group keywords by using formula 1:

2. wherein s_key represents the set of group keywords, and s_user represents the word set of a user;

3. extracting multi-dimensional communication characteristics between each candidate user and the seed user in the candidate network, and performing normalization processing on each dimension communication characteristic, wherein the multi-dimensional communication characteristics specifically comprise at least one of the following characteristics: sending short text frequency, receiving short text frequency, calling frequency, receiving frequency, number of seed users in communication and total time of communication;

4. calculating, according to formula 2, the confidence that each candidate user belongs to the group from the matching degree and the communication features:

wherein u is the feature of the candidate user, α is the weight of the feature, and k is the total number of the features.
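
Formula 1 and formula 2 are not reproduced in this text, so the following sketch uses stand-ins that are assumptions rather than the patent's formulas: the matching degree is taken as the share of group keywords that appear in the user's word set, and the confidence is taken as a weighted sum of the k features u with weights α. The function names are hypothetical, and word segmentation and stop word removal are assumed to have been done already.

```python
def matching_degree(s_key, s_user):
    """Assumed stand-in for formula 1: share of group keywords found in the user's words."""
    s_key, s_user = set(s_key), set(s_user)
    if not s_key:
        return 0.0
    return len(s_key & s_user) / len(s_key)


def confidence(features, weights):
    """Assumed stand-in for formula 2: weighted sum of k features u_i with weights alpha_i."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length k")
    return sum(alpha * u for alpha, u in zip(weights, features))
```

Under these assumptions, if every feature lies in [0, 1] and the weights are non-negative and sum to 1, the confidence also lies in [0, 1], which is compatible with thresholds chosen between 0 and 1.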

Step S3, comparing the confidence of each candidate user with the preset confidence thresholds to find new seed users and new candidate users;

step S3 specifically includes:

1. defining a confidence threshold β and a confidence threshold γ, where 0 < β < γ < 1;

2. according to the confidence of each candidate user, screening out users whose confidence is greater than the confidence threshold β as newly discovered users of the group, and adding them to a discovered user set;

3. screening out, from the discovered user set, users whose confidence is greater than the confidence threshold γ, and using them as seed users to continue finding new candidate users.
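
The two-threshold screening of step S3 can be sketched as follows. The function name `screen_users` and the dict input are assumptions of this sketch; strict comparison follows the wording "larger than" used in this step.

```python
def screen_users(confidences, beta, gamma):
    """Split candidates by two thresholds with 0 < beta < gamma < 1.

    confidences: dict mapping user -> confidence.
    Returns (found_users, new_seeds); new_seeds is a subset of found_users.
    """
    if not 0 < beta < gamma < 1:
        raise ValueError("thresholds must satisfy 0 < beta < gamma < 1")
    found_users = {u for u, c in confidences.items() if c > beta}
    new_seeds = {u for u in found_users if confidences[u] > gamma}
    return found_users, new_seeds
```

For example, with β = 0.5 and γ = 0.8, a user with confidence 0.6 joins the discovered set but is not promoted to a seed user, while a user with confidence 0.9 is both.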

Step S4, acquiring the new seed users, and repeating steps S1 to S4 until the preset number of iterations is reached.
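
The overall iteration of steps S1 to S4 can be sketched as a loop. The function name `discover_group` and the `generate` and `score` callables are hypothetical placeholders for steps S1 and S2, and searching from the union of old and new seed users in each round is an assumption of this sketch.

```python
def discover_group(seeds, generate, score, beta, gamma, max_iters):
    """Iterate steps S1 to S4 for a preset number of rounds.

    generate(seeds) -> iterable of candidate users   (step S1, hypothetical callable)
    score(user, seeds) -> confidence of the user      (step S2, hypothetical callable)
    """
    seeds = set(seeds)
    found = set()
    for _ in range(max_iters):
        candidates = set(generate(seeds)) - seeds
        conf = {u: score(u, seeds) for u in candidates}
        found |= {u for u, c in conf.items() if c > beta}    # step S3: discovered users
        seeds |= {u for u, c in conf.items() if c > gamma}   # steps S3 and S4: new seeds
    # The discovered user set and the seed set jointly form the group.
    return found | seeds
```

Each round only scores users reachable from the current seed set, so the cost per round depends on the size of the candidate network rather than on the size of the whole network.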

The technical scheme of the embodiment of the invention addresses the problems that the prior art can hardly perform group discovery on a large-scale network and has low discovery accuracy. It is not limited by the network scale, makes full use of the attribute information of users and the structure information of the network, and achieves efficient and accurate discovery of a specific group on a large-scale network.

According to the technical scheme of the embodiment of the invention, related users can be retrieved from mass data based on constraint conditions such as seed users, manually defined group keywords, regions, and time, so as to form the candidate user set of the group; the association relations among the candidate users form the candidate network. For each user in the candidate user set, multidimensional features such as network structure features and short text features are obtained and combined to give the confidence of the user, which indicates how credible it is that the user belongs to the group. Group discovery is an iterative process with two confidence thresholds (β < γ). In each iteration, users whose confidence reaches the threshold β are screened out as discovered users of the group and added to the discovered user set; users whose confidence reaches γ are considered highly credible and can serve as seed users for finding new candidate users in the mass data. The final discovered user set and the seed set jointly form the group.

According to the method, the candidate user set of the group is obtained from the mass data according to constraints such as seed users, group keywords, regions, time and the like, so that the problem that the group is difficult to process directly on a large-scale network is avoided; the concept of confidence is introduced, multidimensional features are extracted from attributes such as structures and short texts, the confidence of the user is obtained comprehensively, and a threshold is set to screen the user; through multiple iterations, new users are continually discovered and replenished, and the discovered users and the seed set together form the population.

The above technical solutions of the embodiments of the present invention are described in detail below with reference to the accompanying drawings.

FIG. 2 is a schematic diagram of the detailed process of a confidence-based group discovery method according to an embodiment of the present invention. As shown in FIG. 2:

Step S1, generating a candidate user set and a candidate network of the group based on constraint conditions such as seed users, manually defined group keywords, regions, and time.

First, a group is defined and constraints are set, such as a seed user set, keywords of the group, regions, and time (at least one constraint condition exists). Users who have communicated with a seed user are the most likely to belong to the group, so such users are searched for in the short text data and call data of the seed users, users who do not meet the region and time constraints are filtered out, and the remaining users are added to the candidate user set. The keywords of the group describe its characteristics: if the short texts sent and received by a user contain the keywords, the user is likely to belong to the group. The short text data is therefore searched in full text, texts containing the keywords are found, and the related users are added to the candidate user set. The candidate users are associated through texts and calls to form the candidate network. The attribute graph of the candidate network is shown in FIG. 3.

Step S2, obtaining the confidence that each user belongs to the group by combining the multidimensional features of the candidate users.

In this step, multidimensional features such as short text features and network structure features are obtained for each user in the candidate user set. Specifically, for the short text features, the short text content of each user is subjected to word segmentation and stop word removal to obtain the word set of the user, and the matching degree between this set and the keyword set defined for the group is calculated by the following formula, where s_key represents the set of group keywords and s_user represents the word set of the user:

For the network structure features, the communication features between each candidate user and the seed users in the candidate network are extracted, including the short text sending frequency, the short text receiving frequency, the call making frequency, the call receiving frequency, the number of seed users communicated with, and the total communication time, and each feature dimension is normalized.
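
The extraction and normalization of the six communication features can be sketched as below. The record tuple layout, the function names, the reading of "frequency" as a simple count, the reading of "total communication time" as total call duration, and the choice of min-max scaling are all assumptions of this sketch; the patent does not name a normalization scheme.

```python
def communication_features(user, seeds, records):
    """Raw communication features between one candidate and the seed users.

    records: iterable of (sender, receiver, kind, duration) tuples, where kind is
    "text" or "call" and duration is the call length (0 for texts).
    Returns [texts sent, texts received, calls made, calls received,
             number of seed users contacted, total call time].
    """
    sent_text = recv_text = calls_made = calls_recv = 0
    total_time = 0.0
    seed_contacts = set()
    for sender, receiver, kind, duration in records:
        if sender == user and receiver in seeds:
            seed_contacts.add(receiver)
            if kind == "text":
                sent_text += 1
            else:
                calls_made += 1
                total_time += duration
        elif receiver == user and sender in seeds:
            seed_contacts.add(sender)
            if kind == "text":
                recv_text += 1
            else:
                calls_recv += 1
                total_time += duration
    return [sent_text, recv_text, calls_made, calls_recv, len(seed_contacts), total_time]


def min_max_normalize(rows):
    """Scale each feature column to [0, 1] across all candidates (assumed scheme)."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]
```

Normalizing each column separately keeps a feature with a large raw range, such as total call time, from dominating features that are small counts.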

The confidence that each user belongs to the group is calculated by combining the above features; it indicates how credible it is that the user belongs to the group. The confidence is calculated by the following formula, where u is a feature of the candidate user, α is the weight of the feature, and k is the total number of features:

Step S3, obtaining the discovered users and the new seed users of the group based on two confidence thresholds.

Two confidence thresholds β and γ are defined, where 0 < β < γ < 1. According to the confidence of each candidate user, users whose confidence is greater than the threshold β are screened out as newly discovered users of the group and added to the discovered user set. Among the discovered users, those whose confidence is greater than γ are considered more credible and can serve as seed users for finding new candidate users. The seed users generated in step S3 are fed back into steps S1 and S2, and the process repeats until the set number of iterations is reached. The final discovered user set and the seed set together form the discovered group.

Compared with the prior art, the technical scheme of the embodiment of the invention has the following beneficial effects:

1. the method is not limited by the network scale, and a specific group can be quickly and effectively found on a large-scale network;

2. network structure information and attribute information of the user, including geographical position, texts, calls and the like, are fully utilized, and the accuracy of group discovery is improved;

3. under the condition that the seed users are rare or even missing, related group users can still be found;

4. the method has expandability and is also suitable for other attribute information of the user.

Apparatus embodiment one

According to an embodiment of the present invention, a confidence-based group discovery apparatus is provided. FIG. 4 is a schematic diagram of the confidence-based group discovery apparatus according to the first embodiment of the present invention. As shown in FIG. 4, the apparatus specifically includes:

the generating module 40 is used for setting a constraint condition of the group and generating a candidate user set and a candidate network of the group based on the constraint condition; the generating module 40 is specifically configured to:

defining a group, and setting a constraint condition of the group, wherein the constraint condition comprises at least one of the following conditions: seed user set, group keyword, region, time;

searching users who have communication with the seed user from the short text data and the call data of the seed user, filtering out users who do not accord with the region and time constraints, and adding the users into a candidate user set;

searching short text data in full text, finding out texts containing group keywords, and adding related users into a candidate user set;

and according to the candidate user set, the candidate users are associated through texts and calls to form a candidate network.

The confidence coefficient determining module 42 is used for comprehensively obtaining the confidence coefficient of each candidate user belonging to the group based on the candidate user set and the candidate network; the confidence determination module 42 is specifically configured to:

for each user in the candidate user set, acquiring short text characteristics of the user, performing word segmentation and stop word removal processing on the short text content of each user to obtain a word set of the user, and calculating the matching degree of the word set and group keywords by using formula 1:

wherein s_key represents the set of group keywords, and s_user represents the word set of a user;

extracting the multidimensional communication characteristics between each candidate user and the seed user in the candidate network, and performing normalization processing on each multidimensional communication characteristic, wherein the multidimensional communication characteristics specifically comprise at least one of the following characteristics: sending short text frequency, receiving short text frequency, calling frequency, receiving frequency, number of seed users in communication and total time of communication;

and calculating the confidence degree of each candidate user belonging to the group according to the matching degree and the communication characteristics and a formula 2:

A comparison module 44, configured to compare the confidence level of the candidate user with a preset confidence level threshold, find a new seed user and a new candidate user, and obtain a new seed user; the comparison module 44 is specifically configured to:

defining two confidence thresholds β and γ, where 0 < β < γ < 1;

according to the confidence of each candidate user, screening out users whose confidence is greater than the confidence threshold β as newly discovered users of the group, and adding them to a discovered user set;

and screening out, from the discovered user set, users whose confidence is greater than the confidence threshold γ, and using them as seed users to continue finding new candidate users.

And the calling module 46 sequentially calls the generating module, the confidence coefficient determining module and the comparing module until the preset iteration number is reached.

The specific processing of each module in the above embodiments may be understood with reference to the method embodiments, and is not described herein again.

In summary, with the aid of the technical solution of the embodiments of the present invention, the candidate user set of the group is obtained from the mass data according to the constraints of the seed users, the group keywords, the regions, the time, and the like, so as to avoid the problem that the candidate user set is difficult to process directly on a large-scale network; the concept of confidence is introduced, multidimensional features are extracted from attributes such as structures and short texts, the confidence of the user is obtained comprehensively, and a threshold is set to screen the user; through multiple iterations, new users are continually discovered and replenished, and the discovered users and the seed set together form the population.

Apparatus embodiment two

An embodiment of the present invention provides a confidence-based group discovery apparatus, as shown in FIG. 5, including: a memory 50, a processor 52, and a computer program stored on the memory 50 and executable on the processor 52, wherein the computer program, when executed by the processor 52, implements the following method steps:

step S1 specifically includes the following processing:

1. defining a group, and setting constraints of the group, wherein the constraints comprise at least one of the following conditions: seed user set, group keyword, region, time;

2. searching users who have communication with the seed user from the short text data and the call data of the seed user, filtering out users who do not accord with the region and time constraints, and adding the users into a candidate user set;

step S2 specifically includes the following processing:

3. extracting the multidimensional communication characteristics between each candidate user and the seed user in the candidate network, and performing normalization processing on each multidimensional communication characteristic, wherein the multidimensional communication characteristics specifically comprise at least one of the following characteristics: sending short text frequency, receiving short text frequency, calling frequency, receiving frequency, number of seed users in communication and total time of communication;

step S3 specifically includes:

1. defining two confidence thresholds β and γ, where 0 < β < γ < 1;

3. screening out, from the discovered user set, users whose confidence is greater than the confidence threshold γ, and using them as seed users to continue finding new candidate users.

Apparatus embodiment three

An embodiment of the present invention provides a computer-readable storage medium, where an implementation program for information transmission is stored, and when executed by the processor 52, the implementation program implements the following method steps:

step S1 specifically includes the following processing:

step S2 specifically includes the following processing:

step S3 specifically includes:

1. defining two confidence thresholds β and γ, where 0 < β < γ < 1;

The computer-readable storage medium of this embodiment includes, but is not limited to: ROM, RAM, magnetic or optical disks, and the like.

It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A confidence-based group discovery method, comprising:

step 4, acquiring a new seed user, and repeatedly executing the steps 1-4 until the preset iteration times are reached;

the step of comprehensively obtaining the confidence degree of each candidate user belonging to the group based on the candidate user set and the candidate network specifically comprises the following steps:

for each user in the candidate user set, acquiring short text characteristics of the user, performing word segmentation and stop word removal processing on the short text content of each user to obtain a word set of the user, and calculating the matching degree of the word set and the group keywords by using a formula 1:

for each candidate user in the candidate network, extracting a multi-dimensional communication characteristic between the candidate user and the seed user, and performing normalization processing on each dimension communication characteristic, wherein the multi-dimensional communication characteristic specifically comprises at least one of the following: sending short text frequency, receiving short text frequency, calling frequency, receiving frequency, number of seed users in communication and total time of communication;

calculating the confidence degree of each candidate user belonging to the group according to the matching degree and the communication characteristics and a formula 2:

2. The method of claim 1, wherein setting constraint conditions of the group, and generating the candidate user set and the candidate network of the group based on the constraint conditions specifically comprises:

searching short text data in full text, finding out a text containing the group keywords, and adding related users into a candidate user set;

3. The method of claim 1, wherein the step of comparing the confidence level of the candidate user with a preset confidence level threshold to find a new seed user and a new candidate user specifically comprises:

defining a confidence threshold β and a confidence threshold γ, and 0< β < γ < 1;

4. A confidence-based group discovery apparatus, comprising:

the calling module is used for calling the generating module, the confidence coefficient determining module and the comparing module in sequence until the preset iteration times are reached;

the confidence determination module is specifically configured to:

wherein s_key represents the set of group keywords, and s_user represents the word set of a user;

5. The apparatus of claim 4, wherein the generation module is specifically configured to:

defining a group, and setting a constraint condition of the group, wherein the constraint condition comprises at least one of the following conditions: seed user set, group keywords, region and time;

6. The apparatus of claim 4, wherein the comparison module is specifically configured to:

7. A confidence-based group discovery apparatus, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the steps of the confidence-based group discovery method of any one of claims 1 to 3.

8. A computer-readable storage medium having stored thereon an implementation program for information transfer which, when executed by a processor, implements the steps of the confidence-based group discovery method of any one of claims 1 to 3.