CN111209513B

CN111209513B - Network user classification method based on graph link analysis

Info

Publication number: CN111209513B
Application number: CN202010018052.7A
Authority: CN
Inventors: 赵楠; 程佳; 陈南; 易运晖; 包晶晶
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2020-01-08
Filing date: 2020-01-08
Publication date: 2022-04-19
Anticipated expiration: 2040-01-08
Also published as: CN111209513A

Abstract

The invention discloses a network user classification method based on graph link analysis, which mainly comprises the following steps: constructing a network user topological graph; calculating the link compactness of each network user in the network user topological graph by using a graph link analysis formula; filtering the network users; setting a threshold value by using the activity of the network users to be classified; calculating the relevancy C by using the keywords of each network user to be classified; and classifying the network users to be classified. The invention has the advantage of high classification efficiency on the premise of ensuring the classification accuracy of the user.

Description

Network user classification method based on graph link analysis

Technical Field

The invention belongs to the technical field of physics, and further relates to a network user classification method based on graph link analysis in the technical field of network classification. The invention can be used for solving the classification problem of network users in the Internet.

Background

A large number of users exist in the network, and each network user pays attention to different information; at the same time, a large number of junk users also exist among the network users. Junk users often post useless information on websites, disrupting network order. Filtering the junk users out of a website effectively purifies the network environment and avoids their interference. On the other hand, classifying the active users in the network facilitates user management and plays a vital role in subsequent user expansion and website operation. At present, most user classification methods classify all users of a website according to user relationships and personal user information; these methods improve the accuracy of the classification results but reduce classification efficiency.

The patent document "Network community user group division method based on links and text content" (patent application No. CN201310084039.1, publication No. CN103218400A), filed by Beijing University of Technology, discloses a network user classification method based on links and text content. The method uses link-based analysis to analyze the network structure that network community users exhibit through their links, uses interest-based analysis to analyze the shared interest structure that users exhibit through their text content, and fuses the differences between the two results to obtain a comprehensive division of network community user groups. On this basis, each division result is evaluated separately, the accuracy of the overall division result is verified, and group members that do not meet the index requirements are screened out according to their degree of tightness. Although this method improves the accuracy of the classification results, it has the defect that the classified network users must be screened manually, which greatly reduces classification efficiency.

The patent document "Naive Bayes microblog user classification method based on feature weighting" (patent application No. 201810443273.1, publication No. CN108596276A), filed by Chongqing University of Posts and Telecommunications, discloses a naive Bayes microblog user classification method based on feature weighting. The method divides scattered microblog user data into a training data set and a test data set; computes the prior probability, conditional probability and information gain of each feature from the training data set; establishes a target optimization matrix according to the information gain ranking and determines the weight of each feature; and finally calculates the posterior probability of the test data, taking the class with the maximum posterior probability as the classification result. The defect of this method is that users are classified according to personal information that microblog users fill in arbitrarily, so the accuracy of the classification results obtained in practice is low.

Disclosure of Invention

The invention aims to address the defects of the prior art by providing a network user classification method based on graph link analysis that improves the classification efficiency of network users while ensuring their classification accuracy.

The specific idea of the invention is that a graph link analysis method is used for filtering the network users, calculating the activity of the network users and classifying the network users by combining the activity of the network users.

In order to achieve the purpose, the method comprises the following specific implementation steps:

(1) constructing a network user topological graph:

crawling link information of each user page in an open source programming website by using a web crawler tool, and importing the link information into a complex network modeling tool to generate a network user topological graph, wherein nodes in the topological graph represent network users, and links between the network users are represented as edges between the nodes;

(2) calculating the link compactness of each network user in the network user topological graph by using the following graph link analysis formula:

wherein S_i represents the link compactness of the i-th network user; d represents a damping factor with a value in the range [0.70, 0.85]; N represents the total number of network users in the network user topological graph; Σ represents the summation operation; j represents the serial number of a network user; u_ji represents the link relation between the j-th network user and the i-th network user, with u_ji = 1 if a link exists between the two network users and u_ji = 0 if no link exists; k_j represents the total number of links between the j-th network user and other users; and S_j represents the link compactness of the j-th network user;

(3) filtering the network users:

sorting all network users by link compactness from high to low, retaining the first 80 percent of the network users as network users to be classified, and deleting the remaining network users as junk users;

(4) setting a threshold value by utilizing the activity of the network users to be classified:

(4a) calculating the activity of each network user to be classified by using the following formula:

θ_m = 0.9 lg(d_m + 1)

wherein θ_m represents the activity of the m-th network user to be classified, lg represents the base-10 logarithm, and d_m represents the number of times the m-th network user to be classified logged in to the website during the activity evaluation period;

(4b) setting the minimum value of the activity of the network user to be classified during the activity evaluation period as a threshold value;

(5) calculating the relevancy C by using the keywords of each network user to be classified:

(5a) extracting keywords of each network user to be classified by using a keyword extraction tool;

(5b) calculating the relevancy C between each network user to be classified and the search keyword of the open source programming website where the user is located by using a cosine similarity formula;

(6) classifying the network users to be classified:

multiplying the relevancy C between each network user to be classified and the search keyword of the open source programming website where the user is located by the activity of that user, and taking each network user to be classified whose product is larger than the threshold value as an active user under the classification of that search keyword.
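Step (6) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `classify_active_users`, the dictionaries and the sample values are hypothetical, and the threshold is taken as the minimum activity, as in step (4b).

```python
def classify_active_users(relevancy, activity, threshold):
    """Return users whose relevancy * activity exceeds the threshold."""
    return [
        user
        for user in relevancy
        if relevancy[user] * activity[user] > threshold
    ]


# Hypothetical per-user values for one search keyword.
relevancy_by_user = {"u1": 0.8, "u2": 0.5, "u3": 0.1}
activity_by_user = {"u1": 1.8, "u2": 0.9, "u3": 0.6}
threshold = min(activity_by_user.values())  # step (4b): minimum activity
active = classify_active_users(relevancy_by_user, activity_by_user, threshold)
```

With these sample values only the first user's product is larger than the threshold, so only that user would be treated as active for the keyword.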

Compared with the prior art, the invention has the following advantages:

First, the invention filters the network users by graph link analysis and screens out the junk users. This overcomes the problem in the prior art that a large number of junk users increases the workload of user classification, so the invention classifies network users more efficiently.

Second, the invention analyzes user activity and classifies users in combination with their activity in the network. This overcomes the low accuracy of the prior art, which classifies users using only their personal interest data, so the classification results for network users are more accurate.

Drawings

FIG. 1 is a flow chart of the present invention.

Detailed Description

The specific steps of the implementation of the present invention will be further described with reference to FIG. 1.

Step 1, constructing a network user topological graph.

Link information for each user page in the open source programming website is crawled using a web crawler tool and imported into the third-party Python package NetworkX to generate a network user topological graph, in which nodes represent network users and links between network users are represented as edges between the nodes.
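As a minimal sketch of this step, the crawled link information can be held as (source user, target user) pairs and turned into a graph. The embodiment uses NetworkX; the sketch below uses a plain adjacency dictionary instead so that it runs without third-party packages. The function name `build_user_graph`, the sample user names, and the choices to treat links as directed and to ignore self-links are assumptions of this sketch.

```python
from collections import defaultdict


def build_user_graph(links):
    """Build a directed user graph from (source, target) link pairs.

    Nodes are network users; an edge source -> target means the source
    user's page links to the target user. Self-links are ignored
    (an assumption of this sketch, not stated in the source).
    """
    graph = defaultdict(set)
    for source, target in links:
        graph[source]  # touching the key registers the user as a node
        graph[target]
        if source != target:
            graph[source].add(target)
    return dict(graph)


# Hypothetical crawled links between four users.
links = [("alice", "bob"), ("bob", "carol"), ("carol", "alice"), ("dave", "alice")]
graph = build_user_graph(links)
```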

Step 2, calculating the link compactness of each network user by using the following graph link analysis formula.

Wherein S_i represents the link compactness of the i-th network user; d represents a damping factor with a value in the range 0.70 to 0.85; N represents the total number of network users in the network user topological graph; Σ represents the summation operation; j represents the serial number of a network user; u_ji represents the link relation between the j-th network user and the i-th network user, with u_ji = 1 if a link exists between the two network users and u_ji = 0 if no link exists; k_j represents the total number of links between the j-th network user and other users; and S_j represents the link compactness of the j-th network user.
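The formula itself is not reproduced in this text. The symbols defined above are consistent with a PageRank-style recurrence, S_i = (1 - d)/N + d Σ_j u_ji S_j / k_j, and the sketch below iterates that assumed form; it should be read as an illustration under that assumption, not as the formula of the invention. The function name `link_compactness`, the fixed iteration count, the use of out-links for k_j, and the handling of users with k_j = 0 are all hypothetical choices.

```python
def link_compactness(graph, d=0.85, iterations=50):
    """Iterate the assumed form S_i = (1 - d) / N + d * sum_j(u_ji * S_j / k_j).

    graph maps each user to the set of users it links to, so u_ji = 1
    exactly when i is in graph[j], and k_j = len(graph[j]).
    Users with k_j = 0 contribute nothing (a simplification of this sketch).
    """
    users = list(graph)
    n = len(users)
    scores = {user: 1.0 / n for user in users}
    for _ in range(iterations):
        updated = {user: (1.0 - d) / n for user in users}
        for j, targets in graph.items():
            if not targets:
                continue
            share = d * scores[j] / len(targets)
            for i in targets:
                updated[i] += share
        scores = updated
    return scores


# Hypothetical four-user graph: each user maps to the users it links to.
graph = {
    "alice": {"bob"},
    "bob": {"carol"},
    "carol": {"alice"},
    "dave": {"alice"},
}
scores = link_compactness(graph, d=0.85)
```

In this sample graph the user that no one links to ends up with the lowest score, which is the property the filtering step relies on.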

Step 3, filtering the network users.

All network users are sorted by link compactness from high to low; the first 80 percent are retained as the network users to be classified, and the remaining network users are deleted as junk users.
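The filtering rule can be sketched in a few lines. The function name `filter_users` and the sample scores are hypothetical, and the source does not say how fractional counts or ties are handled, so rounding the retained count down and breaking ties by input order are assumptions.

```python
def filter_users(scores, keep_percent=80):
    """Split users into (kept, junk) by link compactness, highest first.

    The kept count is rounded down using integer arithmetic; how the
    source handles fractional counts and ties is not specified.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    keep_count = len(ranked) * keep_percent // 100
    return ranked[:keep_count], ranked[keep_count:]


# Hypothetical link compactness values for five users.
scores = {"u1": 0.30, "u2": 0.25, "u3": 0.20, "u4": 0.15, "u5": 0.10}
kept, junk = filter_users(scores)
```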

Step 4, setting a threshold value by utilizing the activity of the network users to be classified.

First, the activity of each network user to be classified is calculated by using the following formula.

θ_i = 0.9 lg(d_i + 1)

Wherein θ_i represents the activity of the i-th network user to be classified, lg represents the base-10 logarithm, and d_i represents the number of times the i-th network user to be classified logged in to the website during the activity evaluation period. In an embodiment of the present invention, the activity evaluation period is half a year.

Second, the minimum value of the activity of the network users to be classified during the activity evaluation period is set as the threshold value.
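A short worked sketch of the activity formula and the threshold follows. The function name `activity` and the login counts are hypothetical; the formula and the minimum-value rule are taken from the text above.

```python
import math


def activity(login_count):
    """Activity = 0.9 * lg(d + 1), with lg the base-10 logarithm."""
    return 0.9 * math.log10(login_count + 1)


# Hypothetical login counts over the half-year evaluation period.
logins = {"u1": 99, "u2": 9, "u3": 0}
activities = {user: activity(count) for user, count in logins.items()}
threshold = min(activities.values())
```

Because the logarithm compresses large login counts, a user with 99 logins scores only twice the activity of a user with 9 logins, and a user who never logged in has an activity of zero.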

Step 5, calculating the relevancy C by using the keywords of each network user to be classified.

First, the keywords of each network user to be classified are extracted using a keyword extraction tool. In the embodiment of the invention, the project description text of each user project in the open source programming website is collected, and keywords are extracted from it using the keyword extraction tool RAKE (Rapid Automatic Keyword Extraction).

Second, the relevancy C between each network user to be classified and the search keyword of the open source programming website where the user is located is calculated by using the following cosine similarity formula.

C_n = (q · w_n) / (‖q‖₂ ‖w_n‖₂)

Wherein C_n represents the relevancy between the n-th network user to be classified and the search keyword of the open source programming website where the user is located, q represents the search keyword of the open source programming website, · represents the dot product operation, w_n represents the keywords of the n-th network user to be classified, and ‖·‖₂ represents the 2-norm operation.
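The cosine similarity between a search keyword and a user's keywords can be sketched as follows. The source does not state how keywords are turned into vectors, so representing both sides as term-count vectors is an assumption, as are the function name `relevancy` and the sample keywords. Keyword extraction with RAKE is not shown; the keyword lists are taken as given.

```python
import math
from collections import Counter


def relevancy(search_keywords, user_keywords):
    """Cosine similarity C = (q . w) / (||q||_2 * ||w||_2).

    Both keyword lists are turned into term-count vectors (an
    assumption of this sketch). Returns 0.0 when either vector is empty.
    """
    q = Counter(search_keywords)
    w = Counter(user_keywords)
    dot = sum(q[term] * w[term] for term in q)
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_w = math.sqrt(sum(v * v for v in w.values()))
    if norm_q == 0 or norm_w == 0:
        return 0.0
    return dot / (norm_q * norm_w)


# Hypothetical search keywords and extracted user keywords.
c = relevancy(["python", "graph"], ["graph", "python", "crawler", "network"])
```

The result lies between 0 and 1 for count vectors: it is 0 when the user shares no keyword with the search terms and 1 when the two keyword distributions are identical.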

Step 6, classifying the network users to be classified.

Claims

1. A network user classification method based on graph link analysis is characterized in that a graph link analysis method is utilized to filter network users, analyze the activity of the network users and classify the network users by combining the activity of the network users; the method comprises the following specific steps:

(1) constructing a network user topological graph:

wherein S_i represents the link compactness of the i-th network user; d represents a damping factor with a value in the range [0.70, 0.85]; N represents the total number of network users in the network user topological graph; Σ represents the summation operation; j represents the serial number of a network user; u_ji represents the link relation between the j-th network user and the i-th network user, with u_ji = 1 if a link exists between the two network users and u_ji = 0 if no link exists; k_j represents the total number of links between the j-th network user and other users; and S_j represents the link compactness of the j-th network user;

(3) filtering the network users:

θ_m = 0.9 lg(d_m + 1)

(6) classifying the network users to be classified:

2. The method for classifying network users based on graph link analysis according to claim 1, wherein the cosine similarity formula in step (5b) is as follows: