CN110674390B - Confidence-based group discovery method and device - Google Patents

Confidence-based group discovery method and device

Info

Publication number
CN110674390B
Authority
CN
China
Prior art keywords
user
candidate
users
confidence
group
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910747703.3A
Other languages
Chinese (zh)
Other versions
CN110674390A (en)
Inventor
井雅琪
李扬曦
任博雅
杨亚茹
沈华伟
佟玲玲
时磊
王永庆
段运强
段东圣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
National Computer Network and Information Security Management Center
Original Assignee
Institute of Computing Technology of CAS
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS, National Computer Network and Information Security Management Center filed Critical Institute of Computing Technology of CAS
Priority to CN201910747703.3A priority Critical patent/CN110674390B/en
Publication of CN110674390A publication Critical patent/CN110674390A/en
Application granted granted Critical
Publication of CN110674390B publication Critical patent/CN110674390B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9536 Search customisation based on social or collaborative filtering

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a confidence-based group discovery method and device. The method comprises the following steps: step 1, setting constraint conditions of a group, and generating a candidate user set and a candidate network of the group based on the constraint conditions; step 2, obtaining, based on the candidate user set and the candidate network, the confidence that each candidate user belongs to the group; step 3, comparing the confidence of each candidate user with preset confidence thresholds to find new seed users and new candidate users; and step 4, acquiring the new seed users and repeating steps 1-4 until a preset number of iterations is reached.

Description

Confidence-based group discovery method and device
Technical Field
The invention relates to the technical field of computers, in particular to a group discovery method and device based on confidence.
Background
With the rapid development of the internet, social networks have become an important platform for daily communication and information sharing. Groups are an important mesoscopic structure of a social network. Group discovery and analysis have theoretical significance and also promote the application and development of social networks: they can uncover malicious groups that endanger public safety and guide reasonable management and control, which gives them research and application value for improving social network services and security control. However, the massive data that users generate on social network platforms brings both opportunities and challenges to group discovery and behavior analysis. How to discover a specific group among a large number of network users and analyze its behavior is a problem that urgently needs to be solved.
Traditional group discovery algorithms are based on the idea of community structure cohesion: nodes in the same community are densely connected, and nodes in different communities are sparsely connected. Classical group discovery algorithms include the LPA algorithm, the Louvain algorithm, and the CPM algorithm.
The LPA (Label Propagation Algorithm), proposed by Usha Nandini Raghavan et al. in 2007, is a graph-based semi-supervised learning algorithm. Its basic idea is to predict the label information of unlabeled nodes from the label information of labeled nodes, using a complete graph model built from the relationships between samples. The LPA algorithm predicts and propagates labels for unlabeled data using the intrinsic structure of the data, its distribution, and the labels of adjacent data. Its greatest strengths are simplicity and efficiency; its drawbacks are that results are unstable from iteration to iteration and that accuracy is low.
The Louvain algorithm is based on multi-level optimization of modularity. Modularity was originally introduced to measure the quality of a community discovery result, and it describes how tightly knit the discovered communities are. The Louvain algorithm has two phases. In the first phase, the nodes of the network are traversed repeatedly, and each node is moved into the community that yields the largest modularity gain, until no node changes community. In the second phase, the result of the first phase is processed: each small community is merged into a super node to rebuild the network, where the weight of an edge between two super nodes is the sum of the weights of the edges between the original nodes they contain.
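The modularity that the Louvain algorithm optimizes can be illustrated with a short sketch. The code below computes the standard Newman modularity Q = Σ_c [L_c/m − (d_c/2m)²] for an undirected, unweighted graph, where L_c is the number of edges inside community c, d_c is the total degree of its nodes, and m is the number of edges. The function name `modularity` and the input layout are illustrative assumptions; this is only the quality measure, not the two-phase Louvain procedure itself.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman modularity of an undirected, unweighted graph.

    edges: iterable of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every node to a community label.
    """
    m = 0
    degree = defaultdict(int)
    internal = defaultdict(int)  # edges with both endpoints in the same community
    for u, v in edges:
        m += 1
        degree[u] += 1
        degree[v] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    if m == 0:
        return 0.0
    total_degree = defaultdict(int)  # sum of node degrees per community
    for node, d in degree.items():
        total_degree[community[node]] += d
    return sum(
        internal[c] / m - (total_degree[c] / (2 * m)) ** 2
        for c in total_degree
    )
```

For two triangles joined by a single edge, placing each triangle in its own community gives Q = 5/14, while placing all nodes in one community gives Q = 0, which matches the intuition that the split describes the structure better.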
The CPM (Clique Percolation Method) algorithm was the earliest overlapping community discovery algorithm and is based on clique percolation theory. It regards a community as a set of fully connected subgraphs that share nodes, and it identifies the community structure of a network through clique percolation. The algorithm first finds all complete subgraphs with k nodes (k-cliques) and then builds a new graph whose nodes are the k-cliques; if two k-cliques share (k-1) nodes, an edge is added between the corresponding nodes of the new graph. Finally, each connected subgraph of the new graph is a community. The algorithm can also be applied to bipartite, directed, and weighted graphs.
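The clique percolation steps described above can be sketched in a few lines. The code below is a minimal brute-force illustration, assuming a small undirected graph given as a dict of neighbor sets; the function name `k_clique_communities` is hypothetical, and enumerating every k-node combination is only practical for tiny graphs.

```python
from itertools import combinations


def k_clique_communities(adj, k):
    """Return overlapping communities as sorted node lists.

    adj: dict mapping node -> set of neighbours (undirected graph).
    """
    nodes = sorted(adj)
    # Step 1: enumerate all complete subgraphs with k nodes (brute force).
    cliques = [
        frozenset(c)
        for c in combinations(nodes, k)
        if all(v in adj[u] for u, v in combinations(c, 2))
    ]

    # Step 2: connect two k-cliques when they share k-1 nodes (union-find).
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)

    # Step 3: each connected component of the clique graph is one community.
    groups = {}
    for i, clique in enumerate(cliques):
        groups.setdefault(find(i), set()).update(clique)
    return sorted(sorted(g) for g in groups.values())
```

With k = 3, two triangles that share an edge merge into one community, while a third triangle that touches them at a single node forms a second community that overlaps the first at that node.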
The prior art has the following problems:
1) existing group discovery methods process the whole network, so they are only suitable for small-scale social networks; for large-scale networks with millions of users or more, the amount of computation is so large that they cannot be run in practice;
2) existing group discovery methods rely on the structural information of the network and do not consider other factors such as a user's geographic position, texts, and calls, so the accuracy of group discovery is low.
Disclosure of Invention
The embodiment of the invention provides a group discovery method and device based on confidence coefficient, which are used for solving the problems in the prior art.
The embodiment of the invention provides a group discovery method based on confidence coefficient, which comprises the following steps:
step 1, setting a constraint condition of a group, and generating a candidate user set and a candidate network of the group based on the constraint condition;
step 2, obtaining the confidence degree of each candidate user belonging to the group based on the candidate user set and the candidate network;
step 3, comparing the confidence coefficient of the candidate user with a preset confidence coefficient threshold value to find a new seed user and a new candidate user;
step 4, acquiring the new seed users, and repeatedly executing steps 1-4 until the preset number of iterations is reached.
The embodiment of the present invention further provides a group discovery apparatus based on confidence, including:
the generating module is used for setting a constraint condition of the group and generating a candidate user set and a candidate network of the group based on the constraint condition;
the confidence coefficient determining module is used for comprehensively obtaining the confidence coefficient of each candidate user belonging to the group based on the candidate user set and the candidate network;
the comparison module is used for comparing the confidence coefficient of the candidate user with a preset confidence coefficient threshold value, finding a new seed user and a new candidate user and acquiring the new seed user;
and the calling module calls the generating module, the confidence coefficient determining module and the comparing module in sequence until the preset iteration times are reached.
The embodiment of the invention also provides a confidence-based group discovery apparatus, which comprises: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the confidence-based group discovery method described above.
The embodiment of the invention also provides a computer-readable storage medium, wherein an implementation program for information transmission is stored on the computer-readable storage medium, and when the program is executed by a processor, the steps of the group discovery method based on the confidence degree are implemented.
By adopting the embodiment of the invention, the method is not limited by the network scale, and can quickly and effectively discover the specific group on a large-scale network; network structure information and attribute information of the users are fully utilized, and the accuracy rate of group discovery is improved; under the condition that the seed users are rare or even missing, related group users can still be found; the technical scheme of the embodiment of the invention has expandability and is also suitable for other attribute information of the user.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a schematic illustration of a confidence-based group discovery method according to an embodiment of the invention;
FIG. 2 is a schematic diagram of a detailed process of a confidence-based group discovery method according to an embodiment of the present invention;
FIG. 3 is a network attribute diagram of a confidence-based group discovery method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a confidence-based group discovery apparatus according to a first embodiment of the present invention;
FIG. 5 is a schematic diagram of a confidence-based group discovery apparatus according to a second embodiment of the present invention.
Detailed Description
In carrying out group discovery in practice, the inventors found that traditional group discovery algorithms can only process small-scale networks and have difficulty with large-scale ones. Moreover, traditional algorithms use only the structural information of the network, namely the association relationships among users; other attribute information of users, such as geographic position, texts, and calls, is not fully used, so the accuracy of group discovery is low. To solve these problems, an embodiment of the present invention provides a confidence-based group discovery method.
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
According to an embodiment of the present invention, a confidence-based group discovery method is provided. Fig. 1 is a schematic diagram of the method. As shown in fig. 1, the method specifically includes:
step S1, setting the constraint condition of the group, and generating the candidate user set and the candidate network of the group based on the constraint condition;
step S1 specifically includes the following processing:
1. defining a group, and setting a constraint condition of the group, wherein the constraint condition comprises at least one of the following conditions: seed user set, group keyword, region, time;
2. searching the short text data and the call data of the seed users for users who have communicated with a seed user, filtering out users who do not satisfy the region and time constraints, and adding the remaining users to the candidate user set;
3. searching the short text data in full text, finding the texts that contain group keywords, and adding the related users to the candidate user set;
4. associating the candidate users in the candidate user set through texts and calls to form the candidate network.
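The candidate-set generation in items 1 to 3 above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the `Record` layout (one row per short text or call, carrying its own region and time), the function name `generate_candidates`, substring matching for keywords, and the exclusion of existing seed users from the result are all hypothetical choices. The region and time filter is applied to seed contacts only, as the text states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One communication record (hypothetical layout)."""
    sender: str
    receiver: str
    kind: str       # "text" or "call"
    region: str
    time: int       # e.g. a Unix timestamp
    text: str = ""  # empty for calls


def generate_candidates(records, seeds, keywords, region=None, time_range=None):
    """Build the candidate user set from seed contacts and keyword hits."""
    seeds = set(seeds)

    def satisfies_constraints(record):
        if region is not None and record.region != region:
            return False
        if time_range is not None:
            start, end = time_range
            if not start <= record.time <= end:
                return False
        return True

    candidates = set()
    for r in records:
        # Item 2: users who communicated with a seed user, within region/time.
        if satisfies_constraints(r):
            if r.sender in seeds:
                candidates.add(r.receiver)
            if r.receiver in seeds:
                candidates.add(r.sender)
        # Item 3: users related to a short text that contains a group keyword.
        if r.kind == "text" and any(k in r.text for k in keywords):
            candidates.update((r.sender, r.receiver))
    # Assumption: seed users already belong to the group, so they are excluded.
    return candidates - seeds
```

Because only records touching a seed user or a keyword are kept, the later steps work on a small candidate set rather than on the whole network.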
Step S2, obtaining the confidence degree of each candidate user belonging to the group based on the candidate user set and the candidate network;
step S2 specifically includes the following processing:
1. for each user in the candidate user set, acquiring short text characteristics of the user, performing word segmentation and stop word removal processing on the short text content of each user to obtain a word set of the user, and calculating the matching degree of the word set and group keywords by using a formula 1:
Figure BDA0002166140000000051
wherein s_key represents the set of group keywords, and s_user represents the word set of the user;
2. extracting multi-dimensional communication features between each candidate user and the seed users in the candidate network, and normalizing each dimension of the communication features, wherein the multi-dimensional communication features comprise at least one of the following: frequency of sending short texts, frequency of receiving short texts, frequency of outgoing calls, frequency of incoming calls, number of seed users communicated with, and total communication time;
3. calculating the confidence that each candidate user belongs to the group from the matching degree and the communication features according to formula 2:
Figure BDA0002166140000000052
wherein u is the feature of the candidate user, α is the weight of the feature, and k is the total number of the features.
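Formula 1 and formula 2 appear only as figures, so their exact expressions cannot be read from the text. The sketch below shows one form consistent with the symbol definitions, under two explicit assumptions: the matching degree is taken as the Jaccard overlap of s_key and s_user, and the confidence is taken as the weighted sum of the k features u with weights α. The function names are hypothetical.

```python
def matching_degree(s_key, s_user):
    """Overlap between the group keyword set and a user's word set.

    Assumption: Jaccard similarity stands in for formula 1.
    """
    s_key, s_user = set(s_key), set(s_user)
    union = s_key | s_user
    return len(s_key & s_user) / len(union) if union else 0.0


def confidence(features, weights):
    """Confidence of one candidate user.

    Assumption: a weighted sum of the k features stands in for formula 2.
    features: normalized feature values u_1..u_k (matching degree included).
    weights: feature weights alpha_1..alpha_k.
    """
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(a * u for a, u in zip(weights, features))
```

If every feature lies in [0, 1] and the weights are non-negative and sum to 1, the resulting confidence also lies in [0, 1], which is compatible with thresholds chosen in the open interval (0, 1).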
Step S3, according to the confidence of the candidate user, comparing with the preset confidence threshold value, finding out a new seed user and a new candidate user;
step S3 specifically includes:
1. defining a confidence threshold β and a confidence threshold γ, where 0 < β < γ < 1;
2. according to the confidence of each candidate user, screening out the users whose confidence is larger than the threshold β as newly discovered users of the group, and adding them to the discovered user set;
3. screening out the users in the discovered user set whose confidence is larger than the threshold γ, and taking them as seed users to continue finding new candidate users.
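The two-threshold screening of step S3 can be written as a small function. The name `screen_candidates` and the dict input are illustrative assumptions; the strict "larger than" comparisons follow the wording of the items above.

```python
def screen_candidates(confidences, beta, gamma):
    """Split candidates into discovered users and new seed users.

    confidences: dict mapping candidate user -> confidence.
    Returns (discovered, new_seeds), where new_seeds is a subset of discovered.
    """
    if not 0 < beta < gamma < 1:
        raise ValueError("thresholds must satisfy 0 < beta < gamma < 1")
    discovered = {u for u, c in confidences.items() if c > beta}
    new_seeds = {u for u in discovered if confidences[u] > gamma}
    return discovered, new_seeds
```

For example, with β = 0.5 and γ = 0.8, a user with confidence 0.6 is added to the discovered user set but is not used as a seed, while a user with confidence 0.9 is both discovered and promoted to seed.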
Step S4, acquiring the new seed users, and repeatedly executing steps S1-S4 until the preset number of iterations is reached.
The technical scheme of the embodiment of the invention addresses the prior-art problems that group discovery is difficult on a large-scale network and that discovery accuracy is low. It is not limited by network scale, makes full use of both the attribute information of users and the structural information of the network, and enables efficient and accurate discovery of a specific group on a large-scale network.
According to the technical scheme of the embodiment of the invention, related users are retrieved from massive data based on constraint conditions such as seed users, manually defined group keywords, region, and time, forming the candidate user set of the group; the association relationships among the candidate users form the candidate network. For each user in the candidate user set, multi-dimensional features such as network structure features and short text features are obtained and combined into a confidence, which indicates how credible it is that the user belongs to the group. Group discovery is an iterative process with two confidence thresholds (β < γ). In each iteration, users whose confidence reaches β are screened out as discovered users of the group and added to the discovered user set. Users whose confidence reaches γ are considered highly credible and serve as seed users for finding new candidate users in the massive data. The final discovered user set and the seed set together form the group.
In this method, the candidate user set of the group is obtained from massive data according to constraints such as seed users, group keywords, region, and time, which avoids processing the large-scale network directly. The concept of confidence is introduced: multi-dimensional features are extracted from attributes such as structure and short texts and combined into a confidence for each user, and thresholds are set to screen users. Over multiple iterations, new users are continually discovered and added, and the discovered users and the seed set together form the group.
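The iterative process summarized above can be sketched as a driver loop. This is an illustration under assumptions, not the patent's implementation: `expand` and `score` are hypothetical callables standing in for candidate generation and confidence calculation, and candidates are re-scored in every iteration in which they appear.

```python
def discover_group(seeds, expand, score, beta, gamma, max_iterations):
    """Iteratively grow a group from seed users.

    expand(seeds) -> set of users related to the current seed users.
    score(user, seeds) -> confidence in [0, 1] that the user belongs to the group.
    """
    if not 0 < beta < gamma < 1:
        raise ValueError("thresholds must satisfy 0 < beta < gamma < 1")
    seeds = set(seeds)
    discovered = set()
    for _ in range(max_iterations):
        candidates = expand(seeds) - seeds
        confidences = {user: score(user, seeds) for user in candidates}
        # Users above beta join the discovered user set.
        discovered |= {u for u, c in confidences.items() if c > beta}
        # Users above gamma also become seeds for the next iteration.
        seeds |= {u for u, c in confidences.items() if c > gamma}
    # The final discovered user set and the seed set together form the group.
    return discovered | seeds
```

Each iteration touches only the neighborhood of the current seed users, so the cost depends on the size of the candidate set rather than on the size of the whole network.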
The above technical solutions of the embodiments of the present invention are described in detail below with reference to the accompanying drawings.
FIG. 2 is a schematic diagram of the detailed process of the confidence-based group discovery method according to an embodiment of the present invention. As shown in FIG. 2:
Step S1, generating the candidate user set and the candidate network of the group based on constraint conditions such as seed users, manually defined group keywords, region, and time.
First, the group is defined and its constraints are set, such as a seed user set, group keywords, region, and time (at least one constraint must be present). Users who communicate with a seed user are the most likely to belong to the group, so users who have communicated with seed users are searched for in the seed users' short text data and call data, users who do not satisfy the region and time constraints are filtered out, and the remaining users are added to the candidate user set. The group keywords describe the characteristics of the group: if the short texts a user sends or receives contain a keyword, the user is likely to belong to the group. The short text data is therefore searched in full text for texts containing the keywords, and the related users are added to the candidate user set. The candidate users are associated through texts and calls to form the candidate network. The attribute graph of the candidate network is shown in fig. 3.
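One way to hold the candidate network is a directed edge table with per-kind counts, since sending and receiving are later treated as separate features. The sketch below is an assumption about representation only: the tuple layout of the records, the function name `build_candidate_network`, and the choice to keep only edges whose two endpoints are both in the given user set (candidates plus seeds) are hypothetical.

```python
from collections import Counter, defaultdict


def build_candidate_network(records, users):
    """Aggregate communication records into a directed, attributed edge table.

    records: iterable of (sender, receiver, kind) tuples, kind "text" or "call".
    users: set of users to keep (for example candidates plus seed users).
    Returns a dict mapping (sender, receiver) -> Counter of record kinds.
    """
    network = defaultdict(Counter)
    for sender, receiver, kind in records:
        if sender in users and receiver in users and sender != receiver:
            network[(sender, receiver)][kind] += 1
    return dict(network)
```

The edge counts can then be read directly as raw communication features, for example the number of short texts a candidate sent to a seed user.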
Step S2, obtaining the confidence that each user belongs to the group by combining the multi-dimensional features of the candidate users.
In this step, multi-dimensional features such as short text features and network structure features are obtained for each user in the candidate user set. Specifically, for the short text features, the short text content of each user is segmented into words and stop words are removed to obtain the user's word set, and the matching degree between this set and the keyword set defined for the group is calculated using the following formula, where s_key represents the set of group keywords and s_user represents the word set of the user:
Figure BDA0002166140000000071
For the network structure features, the communication features between each candidate user and the seed users in the candidate network are extracted, including the frequency of sending short texts, the frequency of receiving short texts, the frequency of outgoing calls, the frequency of incoming calls, the number of seed users communicated with, and the total communication time; each feature dimension is then normalized.
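The text does not say which normalization is used. As one possibility, the sketch below applies min-max scaling per feature dimension across all candidate users, so that every dimension falls in [0, 1]; the function name `min_max_normalize` and the choice of min-max scaling are assumptions.

```python
def min_max_normalize(feature_table):
    """Scale each feature dimension to [0, 1] across all candidate users.

    feature_table: dict mapping user -> list of raw feature values,
    with the same dimension order for every user.
    """
    if not feature_table:
        return {}
    columns = list(zip(*feature_table.values()))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return {
        user: [
            # A constant dimension carries no information; map it to 0.0.
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(values, lows, highs)
        ]
        for user, values in feature_table.items()
    }
```

Scaling each dimension separately keeps a large-valued feature such as total communication time from dominating small-valued ones such as the number of seed users contacted.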
The above features are combined to calculate the confidence of each user, which indicates how credible it is that the user belongs to the group. The confidence is calculated with the following formula, where u is a feature of the candidate user, α is the weight of the feature, and k is the total number of features:
Figure BDA0002166140000000081
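Both formulas are given only as figures. A form consistent with the symbol definitions, offered as an assumption rather than as the patent's exact expressions, takes the matching degree as a set-overlap ratio (Jaccard similarity is used here) and the confidence as a weighted sum of the k features:

```latex
\mathrm{match}(s_{key}, s_{user}) = \frac{|s_{key} \cap s_{user}|}{|s_{key} \cup s_{user}|},
\qquad
\mathrm{conf} = \sum_{i=1}^{k} \alpha_i u_i
```

Under these assumptions, s_key = {a, b, c} and s_user = {b, c, d, e} share 2 of 5 distinct words, giving a matching degree of 2/5 = 0.4. With k = 3, weights α = (0.5, 0.3, 0.2), and normalized features u = (0.4, 1.0, 0.5), the confidence is 0.5·0.4 + 0.3·1.0 + 0.2·0.5 = 0.20 + 0.30 + 0.10 = 0.60.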
Step S3, obtaining the discovered users and the new seed users of the group based on two confidence thresholds.
Two confidence thresholds β and γ are defined, where 0 < β < γ < 1. According to the confidence of each candidate user, the users whose confidence is larger than β are screened out as newly discovered users of the group and added to the discovered user set. Discovered users whose confidence is larger than γ are considered more credible and serve as seed users to continue finding new candidate users. The seed users generated in step S3 are fed back into steps S1 and S2, and the process repeats until the set number of iterations is reached. The final discovered user set and the seed set together form the discovered group.
Compared with the prior art, the technical scheme of the embodiment of the invention has the following beneficial effects:
1. the method is not limited by the network scale, and a specific group can be quickly and effectively found on a large-scale network;
2. network structure information and attribute information of the user, including geographical position, text, conversation and the like, are fully utilized, and the accuracy of group discovery is improved;
3. under the condition that the seed users are rare or even missing, related group users can still be found;
4. the method has expandability and is also suitable for other attribute information of the user.
Apparatus embodiment one
According to an embodiment of the present invention, there is provided a confidence-based group discovery apparatus, and fig. 4 is a schematic diagram of a confidence-based group discovery apparatus according to a first embodiment of the present invention, as shown in fig. 4, the confidence-based group discovery apparatus specifically includes:
the generating module 40 is used for setting a constraint condition of the group and generating a candidate user set and a candidate network of the group based on the constraint condition; the generating module 40 is specifically configured to:
defining a group, and setting a constraint condition of the group, wherein the constraint condition comprises at least one of the following conditions: seed user set, group keyword, region, time;
searching users who have communication with the seed user from the short text data and the call data of the seed user, filtering out users who do not accord with the region and time constraints, and adding the users into a candidate user set;
searching short text data in full text, finding out texts containing group keywords, and adding related users into a candidate user set;
and according to the candidate user set, the candidate users are associated through texts and calls to form a candidate network.
The confidence coefficient determining module 42 is used for comprehensively obtaining the confidence coefficient of each candidate user belonging to the group based on the candidate user set and the candidate network; the confidence determination module 42 is specifically configured to:
for each user in the candidate user set, acquiring short text characteristics of the user, performing word segmentation and stop word removal processing on the short text content of each user to obtain a word set of the user, and calculating the matching degree of the word set and group keywords by using a formula 1:
Figure BDA0002166140000000091
wherein s_key represents the set of group keywords, and s_user represents the word set of the user;
extracting the multidimensional communication characteristics between each candidate user and the seed user in the candidate network, and performing normalization processing on each multidimensional communication characteristic, wherein the multidimensional communication characteristics specifically comprise at least one of the following characteristics: sending short text frequency, receiving short text frequency, calling frequency, receiving frequency, number of seed users in communication and total time of communication;
and calculating the confidence degree of each candidate user belonging to the group according to the matching degree and the communication characteristics and a formula 2:
Figure BDA0002166140000000092
wherein u is the feature of the candidate user, α is the weight of the feature, and k is the total number of the features.
A comparison module 44, configured to compare the confidence level of the candidate user with a preset confidence level threshold, find a new seed user and a new candidate user, and obtain a new seed user; the comparison module 44 is specifically configured to:
defining two confidence thresholds β and γ, where 0 < β < γ < 1;
according to the confidence of each candidate user, screening out the users whose confidence is larger than the threshold β as newly discovered users of the group, and adding them to the discovered user set;
and screening out the users in the discovered user set whose confidence is larger than the threshold γ, and taking them as seed users to continue finding new candidate users.
The calling module 46 is used for calling the generating module, the confidence determining module, and the comparison module in sequence until the preset number of iterations is reached.
The specific processing of each module in the above embodiments may be understood with reference to the method embodiments, and is not described herein again.
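The four-module structure can be mirrored in code by composing three pluggable functions under one driver. The sketch below is only an illustration of that composition: the class name `GroupDiscoveryApparatus`, the field names, the callable signatures, and the default iteration count are all hypothetical, and the comments map each field to the module number used above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set, Tuple


@dataclass
class GroupDiscoveryApparatus:
    # Generating module 40: seed users -> candidate users.
    generate: Callable[[Set[str]], Set[str]]
    # Confidence determining module 42: (candidates, seeds) -> confidences.
    determine_confidence: Callable[[Set[str], Set[str]], Dict[str, float]]
    # Comparison module 44: confidences -> (discovered users, new seed users).
    compare: Callable[[Dict[str, float]], Tuple[Set[str], Set[str]]]
    iterations: int = 3  # preset number of iterations (illustrative default)

    def call(self, seeds):
        """Calling module 46: run modules 40, 42 and 44 in sequence."""
        seeds, discovered = set(seeds), set()
        for _ in range(self.iterations):
            candidates = self.generate(seeds)
            confidences = self.determine_confidence(candidates, seeds)
            newly_discovered, new_seeds = self.compare(confidences)
            discovered |= newly_discovered
            seeds |= new_seeds
        return discovered | seeds
```

Keeping the three modules as injected callables lets the feature set or the threshold rule be replaced without touching the iteration logic.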
In summary, with the technical solution of the embodiments of the present invention, the candidate user set of the group is obtained from massive data according to constraints such as seed users, group keywords, region, and time, which avoids processing the large-scale network directly. The concept of confidence is introduced: multi-dimensional features are extracted from attributes such as structure and short texts and combined into a confidence for each user, and thresholds are set to screen users. Over multiple iterations, new users are continually discovered and added, and the discovered users and the seed set together form the group.
Apparatus embodiment two
An embodiment of the present invention provides a confidence-based group discovery apparatus, as shown in fig. 5, including: a memory 50, a processor 52, and a computer program stored on the memory 50 and executable on the processor 52, wherein the computer program, when executed by the processor 52, implements the following method steps:
step S1, setting the constraint condition of the group, and generating the candidate user set and the candidate network of the group based on the constraint condition;
step S1 specifically includes the following processing:
1. defining a group, and setting constraints of the group, wherein the constraints comprise at least one of the following conditions: seed user set, group keyword, region, time;
2. searching the short text data and the call data of the seed users for users who have communicated with a seed user, filtering out users who do not satisfy the region and time constraints, and adding the remaining users to the candidate user set;
3. searching short text data in full text, finding out texts containing group keywords, and adding related users into a candidate user set;
4. and according to the candidate user set, the candidate users are associated through texts and calls to form a candidate network.
Step S2, obtaining the confidence degree of each candidate user belonging to the group based on the candidate user set and the candidate network;
step S2 specifically includes the following processing:
1. for each user in the candidate user set, acquiring short text characteristics of the user, performing word segmentation and stop word removal processing on the short text content of each user to obtain a word set of the user, and calculating the matching degree of the word set and group keywords by using a formula 1:
Figure BDA0002166140000000111
wherein s_key represents the set of group keywords, and s_user represents the word set of the user;
2. extracting multi-dimensional communication features between each candidate user and the seed users in the candidate network, and normalizing each dimension of the communication features, wherein the multi-dimensional communication features comprise at least one of the following: frequency of sending short texts, frequency of receiving short texts, frequency of outgoing calls, frequency of incoming calls, number of seed users communicated with, and total communication time;
3. calculating the confidence that each candidate user belongs to the group from the matching degree and the communication features according to formula 2:
Figure BDA0002166140000000112
wherein u is the feature of the candidate user, α is the weight of the feature, and k is the total number of the features.
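The two formulas are only available as images, so the following sketch rests on stated assumptions rather than on the patent's exact expressions: the matching degree is taken to be the share of group keywords that occur in the user's word set, and the confidence is taken to be a weighted sum of the k feature values u with weights α. The function names are hypothetical.

```python
def matching_degree(user_words, group_keywords):
    """Assumed overlap ratio: fraction of group keywords found in s_user."""
    s_user, s_key = set(user_words), set(group_keywords)
    if not s_key:
        return 0.0
    return len(s_user & s_key) / len(s_key)


def confidence(features, weights):
    """Assumed weighted sum of k feature values u_i with weights alpha_i."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(a * u for a, u in zip(weights, features))
```

Under these assumptions the feature vector would hold the matching degree followed by the normalized communication features, and choosing weights that sum to 1 keeps the confidence within [0, 1] so that it can be compared with the thresholds of step S3.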
Step S3, according to the confidence of the candidate user, comparing with the preset confidence threshold value, finding out a new seed user and a new candidate user;
step S3 specifically includes:
1. defining two confidence thresholds β and γ, wherein 0 < β < γ < 1;
2. according to the confidence of each candidate user, screening out the users whose confidence is greater than the confidence threshold β as newly discovered users of the group, and adding them to a discovered user set;
3. screening out, from the discovered user set, the users whose confidence is greater than the confidence threshold γ, and using them as seed users to continue discovering new candidate users.
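The two-threshold screening of step S3 can be sketched as follows. The function name `screen` and the dictionary layout (user mapped to confidence) are illustrative choices, not taken from the text.

```python
def screen(confidences, beta, gamma):
    """Split scored candidates into discovered users and new seed users."""
    if not 0 < beta < gamma < 1:
        raise ValueError("thresholds must satisfy 0 < beta < gamma < 1")
    # Users above beta are accepted as newly discovered group members.
    discovered = {u for u, c in confidences.items() if c > beta}
    # Only the discovered users above the stricter gamma become seeds.
    new_seeds = {u for u in discovered if confidences[u] > gamma}
    return discovered, new_seeds
```

Because γ is larger than β, every new seed user is also a discovered user, while users between the two thresholds join the group without driving further expansion.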
Step S4, acquiring the new seed users, and repeating steps S1 to S4 until the preset number of iterations is reached.
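The overall iteration of steps S1 to S4 might look like the sketch below. The callables `generate` and `score` stand in for steps S1 and S2 and are supplied by the caller; merging new seeds into the existing seed set and stopping early when no new seed appears are assumptions added for illustration.

```python
def discover_group(seeds, generate, score, beta, gamma, max_iterations):
    """Iterate S1 to S3, feeding high-confidence users back in as seeds."""
    seeds = set(seeds)
    discovered = set()
    for _ in range(max_iterations):
        candidates = generate(seeds)            # step S1 (caller-supplied)
        confidences = score(candidates, seeds)  # step S2 (caller-supplied)
        # Step S3: two-threshold screening.
        found = {u for u, c in confidences.items() if c > beta}
        new_seeds = {u for u in found if confidences[u] > gamma} - seeds
        discovered |= found
        if not new_seeds:
            break  # early stop: an added convenience, not from the text
        seeds |= new_seeds  # step S4: continue with the enlarged seed set
    return discovered
```

The preset iteration count bounds the amount of work regardless of network size, since each round only touches the neighbourhood of the current seed users.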
The technical solution of this embodiment of the invention addresses the problems that the prior art has difficulty performing group discovery on large-scale networks and has low discovery accuracy. The solution is not limited by network scale, makes full use of both the attribute information of users and the structure information of the network, and achieves efficient and accurate discovery of a specific group on a large-scale network.
Example III of the device
An embodiment of the present invention provides a computer-readable storage medium, where an implementation program for information transmission is stored, and when executed by the processor 52, the implementation program implements the following method steps:
step S1, setting the constraint condition of the group, and generating the candidate user set and the candidate network of the group based on the constraint condition;
step S1 specifically includes the following processing:
1. defining a group, and setting constraints of the group, wherein the constraints comprise at least one of the following conditions: seed user set, group keyword, region, time;
2. searching the short text data and the call data of the seed users for users who have communicated with a seed user, filtering out users who do not satisfy the region and time constraints, and adding the remaining users to a candidate user set;
3. performing a full-text search of the short text data to find texts containing the group keywords, and adding the related users to the candidate user set;
4. associating the users in the candidate user set with one another through their texts and calls to form a candidate network.
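One way to picture the candidate network of item 4 is as a weighted, undirected adjacency map built from text and call interactions. The function name, the `(sender, receiver)` pair format, and the use of interaction counts as edge weights are assumptions made for illustration. Since step S2 later measures communication between candidates and seed users, the caller would presumably pass the candidates together with the seeds as `nodes`.

```python
from collections import defaultdict


def build_candidate_network(interactions, nodes):
    """Return {user: {neighbor: interaction_count}} restricted to nodes."""
    network = defaultdict(lambda: defaultdict(int))
    for sender, receiver in interactions:
        # Keep only edges whose endpoints both belong to the node set.
        if sender in nodes and receiver in nodes and sender != receiver:
            network[sender][receiver] += 1
            network[receiver][sender] += 1
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {u: dict(nbrs) for u, nbrs in network.items()}
```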
Step S2, obtaining the confidence degree of each candidate user belonging to the group based on the candidate user set and the candidate network;
step S2 specifically includes the following processing:
1. for each user in the candidate user set, acquiring the short text features of the user: performing word segmentation and stop-word removal on the short text content of the user to obtain the word set of the user, and calculating the matching degree between the word set and the group keywords using formula 1:
[Formula 1: equation image BDA0002166140000000121, not reproduced here]
wherein s_key denotes the set of group keywords and s_user denotes the word set of the user;
2. extracting the multidimensional communication features between each candidate user and the seed users in the candidate network, and normalizing each feature, wherein the multidimensional communication features comprise at least one of the following: short text sending frequency, short text receiving frequency, outgoing call frequency, incoming call frequency, number of seed users communicated with, and total communication duration;
3. calculating the confidence that each candidate user belongs to the group from the matching degree and the communication features according to formula 2:
[Formula 2: equation image BDA0002166140000000131, not reproduced here]
wherein u is a feature of the candidate user, α is the weight of that feature, and k is the total number of features.
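The text requires each communication feature to be normalized but does not name a scheme. The sketch below assumes min-max scaling across all candidate users; the function name and the table layout (user mapped to a dictionary of feature values) are hypothetical.

```python
def min_max_normalize(feature_table):
    """Scale each feature column to [0, 1] across all candidate users."""
    users = list(feature_table)
    if not users:
        return {}
    names = list(feature_table[users[0]])
    normalized = {u: {} for u in users}
    for name in names:
        values = [feature_table[u][name] for u in users]
        lo, hi = min(values), max(values)
        span = hi - lo
        for u in users:
            # A constant column carries no information; map it to 0.0.
            normalized[u][name] = (
                (feature_table[u][name] - lo) / span if span else 0.0
            )
    return normalized
```

Scaling every feature to the same range keeps a raw quantity with large values, such as total communication duration, from dominating the confidence calculation purely because of its unit.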
Step S3, according to the confidence of the candidate user, comparing with the preset confidence threshold value, finding out a new seed user and a new candidate user;
step S3 specifically includes:
1. defining two confidence thresholds β and γ, wherein 0 < β < γ < 1;
2. according to the confidence of each candidate user, screening out the users whose confidence is greater than the confidence threshold β as newly discovered users of the group, and adding them to a discovered user set;
3. screening out, from the discovered user set, the users whose confidence is greater than the confidence threshold γ, and using them as seed users to continue discovering new candidate users.
Step S4, acquiring the new seed users, and repeating steps S1 to S4 until the preset number of iterations is reached.
The technical solution of this embodiment of the invention addresses the problems that the prior art has difficulty performing group discovery on large-scale networks and has low discovery accuracy. The solution is not limited by network scale, makes full use of both the attribute information of users and the structure information of the network, and achieves efficient and accurate discovery of a specific group on a large-scale network.
The computer-readable storage medium of this embodiment includes, but is not limited to: ROM, RAM, magnetic or optical disks, and the like.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (8)

1. A confidence-based group discovery method, comprising:
step 1, setting a constraint condition of a group, and generating a candidate user set and a candidate network of the group based on the constraint condition;
step 2, obtaining the confidence degree of each candidate user belonging to the group based on the candidate user set and the candidate network;
step 3, comparing the confidence coefficient of the candidate user with a preset confidence coefficient threshold value to find a new seed user and a new candidate user;
step 4, acquiring a new seed user, and repeatedly executing the steps 1-4 until the preset iteration times are reached;
the step of comprehensively obtaining the confidence degree of each candidate user belonging to the group based on the candidate user set and the candidate network specifically comprises the following steps:
for each user in the candidate user set, acquiring short text characteristics of the user, performing word segmentation and stop word removal processing on the short text content of each user to obtain a word set of the user, and calculating the matching degree of the word set and the group keywords by using a formula 1:
[Formula 1: equation image FDA0003544482210000011, not reproduced here]
wherein s_key denotes the set of group keywords and s_user denotes the word set of the user;
for each candidate user in the candidate network, extracting a multi-dimensional communication characteristic between the candidate user and the seed user, and performing normalization processing on each dimension communication characteristic, wherein the multi-dimensional communication characteristic specifically comprises at least one of the following: sending short text frequency, receiving short text frequency, calling frequency, receiving frequency, number of seed users in communication and total time of communication;
calculating the confidence degree of each candidate user belonging to the group according to the matching degree and the communication characteristics and a formula 2:
[Formula 2: equation image FDA0003544482210000012, not reproduced here]
wherein u is the feature of the candidate user, α is the weight of the feature, and k is the total number of the features.
2. The method of claim 1, wherein setting a constraint condition of the group, and generating the candidate user set and the candidate network of the group based on the constraint condition specifically comprises:
defining a group, and setting a constraint condition of the group, wherein the constraint condition comprises at least one of the following conditions: seed user set, group keyword, region, time;
searching users who have communication with the seed user from the short text data and the call data of the seed user, filtering out users who do not accord with the region and time constraints, and adding the users into a candidate user set;
searching short text data in full text, finding out a text containing the group keywords, and adding related users into a candidate user set;
and according to the candidate user set, the candidate users are associated through texts and calls to form a candidate network.
3. The method of claim 1, wherein the step of comparing the confidence level of the candidate user with a preset confidence level threshold to find a new seed user and a new candidate user specifically comprises:
defining a confidence threshold β and a confidence threshold γ, and 0< β < γ < 1;
according to the confidence of each candidate user, screening out the users whose confidence is greater than the confidence threshold β as newly discovered users of the group, and adding them to a discovered user set;
and screening out, from the discovered user set, the users whose confidence is greater than the confidence threshold γ, and using them as seed users to continue discovering new candidate users.
4. A confidence-based group discovery apparatus, comprising:
the generating module is used for setting a constraint condition of the group and generating a candidate user set and a candidate network of the group based on the constraint condition;
the confidence coefficient determining module is used for comprehensively obtaining the confidence coefficient of each candidate user belonging to the group based on the candidate user set and the candidate network;
the comparison module is used for comparing the confidence coefficient of the candidate user with a preset confidence coefficient threshold value, finding a new seed user and a new candidate user and acquiring the new seed user;
the calling module is used for calling the generating module, the confidence coefficient determining module and the comparing module in sequence until the preset iteration times are reached;
the confidence determination module is specifically configured to:
for each user in the candidate user set, acquiring short text characteristics of the user, performing word segmentation and stop word removal processing on the short text content of each user to obtain a word set of the user, and calculating the matching degree of the word set and the group keywords by using a formula 1:
[Formula 1: equation image FDA0003544482210000031, not reproduced here]
wherein s_key denotes the set of group keywords and s_user denotes the word set of the user;
for each candidate user in the candidate network, extracting a multi-dimensional communication characteristic between the candidate user and the seed user, and performing normalization processing on each dimension communication characteristic, wherein the multi-dimensional communication characteristic specifically comprises at least one of the following: sending short text frequency, receiving short text frequency, calling frequency, receiving frequency, number of seed users in communication and total time of communication;
calculating the confidence degree of each candidate user belonging to the group according to the matching degree and the communication characteristics and a formula 2:
[Formula 2: equation image FDA0003544482210000032, not reproduced here]
wherein u is the feature of the candidate user, α is the weight of the feature, and k is the total number of the features.
5. The apparatus of claim 4, wherein the generation module is specifically configured to:
defining a group, and setting a constraint condition of the group, wherein the constraint condition comprises at least one of the following conditions: seed user set, group keywords, region and time;
searching users who have communication with the seed user from the short text data and the call data of the seed user, filtering out users who do not accord with the region and time constraints, and adding the users into a candidate user set;
searching short text data in full text, finding out a text containing the group keywords, and adding related users into a candidate user set;
and according to the candidate user set, the candidate users are associated through texts and calls to form a candidate network.
6. The apparatus of claim 4, wherein the comparison module is specifically configured to:
defining a confidence threshold β and a confidence threshold γ, and 0< β < γ < 1;
according to the confidence of each candidate user, screening out the users whose confidence is greater than the confidence threshold β as newly discovered users of the group, and adding them to a discovered user set;
and screening out, from the discovered user set, the users whose confidence is greater than the confidence threshold γ, and using them as seed users to continue discovering new candidate users.
7. A confidence-based group discovery apparatus, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the steps of the confidence-based group discovery method of any one of claims 1 to 3.
8. A computer-readable storage medium, having stored thereon a program for implementing information transfer which, when executed by a processor, implements the steps of the confidence-based group discovery method of any one of claims 1 to 3.
CN201910747703.3A 2019-08-14 2019-08-14 Confidence-based group discovery method and device Active CN110674390B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910747703.3A CN110674390B (en) 2019-08-14 2019-08-14 Confidence-based group discovery method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910747703.3A CN110674390B (en) 2019-08-14 2019-08-14 Confidence-based group discovery method and device

Publications (2)

Publication Number Publication Date
CN110674390A CN110674390A (en) 2020-01-10
CN110674390B true CN110674390B (en) 2022-05-20

Family

ID=69068572

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910747703.3A Active CN110674390B (en) 2019-08-14 2019-08-14 Confidence-based group discovery method and device

Country Status (1)

Country Link
CN (1) CN110674390B (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101344940A (en) * 2008-08-21 2009-01-14 魏芳 Network overlapped corporation detection method based on global partition and local expansion
US9916629B2 (en) * 2013-04-09 2018-03-13 International Business Machines Corporation Identifying one or more relevant social networks for one or more collaboration artifacts
CN105721279B (en) * 2016-01-15 2019-03-26 中国联合网络通信有限公司广东省分公司 A kind of the relationship cycle method for digging and system of subscribers to telecommunication network
CN107103551A (en) * 2017-03-20 2017-08-29 重庆邮电大学 A kind of coauthorship network community division method of selected seed node
CN108153824B (en) * 2017-12-06 2020-04-24 阿里巴巴集团控股有限公司 Method and device for determining target user group

Also Published As

Publication number Publication date
CN110674390A (en) 2020-01-10


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant