CN111209513B - Network user classification method based on graph link analysis - Google Patents

Network user classification method based on graph link analysis

Info

Publication number
CN111209513B
Authority
CN
China
Prior art keywords
network
user
classified
users
network user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010018052.7A
Other languages
Chinese (zh)
Other versions
CN111209513A (en)
Inventor
赵楠
程佳
陈南
易运晖
包晶晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN202010018052.7A
Publication of CN111209513A
Application granted
Publication of CN111209513B
Active (current legal status)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a network user classification method based on graph link analysis, which mainly comprises the following steps: constructing a network user topological graph; calculating the link compactness of each network user in the network user topological graph by using a graph link analysis formula; filtering the network users; setting a threshold value by using the activity of the network users to be classified; calculating the relevancy C by using the keywords of each network user to be classified; and classifying the network users to be classified. The invention has the advantage of high classification efficiency on the premise of ensuring the classification accuracy of the user.

Description

Network user classification method based on graph link analysis
Technical Field
The invention belongs to the technical field of physics, and further relates to a network user classification method based on graph link analysis in the technical field of network classification. The invention can be used for solving the classification problem of network users in the Internet.
Background
A network contains a large number of users whose interests differ, and it also contains many junk users. Junk users often post useless information on websites, disrupting network order. Filtering junk users out of a website effectively cleans up the network environment and avoids their interference. In addition, classifying the active users in the network facilitates user management and plays a vital role in subsequent user growth and website operation. At present, most user classification methods classify all users of a website according to user relationships and personal information; these methods improve the accuracy of the classification results but reduce classification efficiency.
The patent document "Network community user group division method based on links and text content" (patent application No. CN201310084039.1, publication No. CN103218400A), filed by Beijing University of Technology, discloses a network user classification method based on links and text content. The method uses link-based analysis to analyze the network structure that network community users exhibit through their links, uses interest-based analysis to analyze the common interest structure that users exhibit through their text content, and fuses the differences between the two results to obtain a comprehensive division of network community user groups. On this basis, each division result is evaluated separately, the accuracy of the overall division result is verified, and group members that do not meet the index requirements are screened out according to their degree of tightness. Although this method improves the accuracy of user classification and group division, it still has the drawback that the classified network users must be screened manually, which greatly reduces classification efficiency.
The patent document "Naive Bayes microblog user classification method based on feature weighting" (patent application No. 201810443273.1, publication No. CN108596276A), filed by Chongqing University of Posts and Telecommunications, discloses a feature-weighted naive Bayes method for classifying microblog users. The method first divides scattered microblog user data into a training data set and a test data set; it then computes the prior probability, conditional probability, and information gain of each feature from the training data set, establishes a target optimization matrix according to the information gain ranking, and determines the weight of each feature; finally, it calculates the posterior probability of the test data, and the class with the maximum posterior probability is the classification result. The drawback of this method is that it classifies users according to personal information that microblog users fill in arbitrarily, so the classification results obtained in practice have low accuracy.
Disclosure of Invention
The invention aims to provide a network user classification method based on graph link analysis aiming at the defects of the prior art, which is used for solving the problem of improving the classification efficiency of network users while ensuring the classification accuracy of the network users.
The specific idea of the invention is that a graph link analysis method is used for filtering the network users, calculating the activity of the network users and classifying the network users by combining the activity of the network users.
In order to achieve the purpose, the method comprises the following specific implementation steps:
(1) constructing a network user topological graph:
crawling link information of each user page in an open source programming website by using a web crawler tool, and importing the link information into a complex network modeling tool to generate a network user topological graph, wherein nodes in the topological graph represent network users, and links between the network users are represented as edges between the nodes;
(2) calculating the link compactness of each network user in the network user topological graph by using the following graph link analysis formula:
Figure GDA0003480660920000021
wherein $S_i$ represents the link compactness of the i-th network user; $d$ represents a damping factor with a value in [0.70, 0.85]; $N$ represents the total number of network users in the network user topological graph; $\sum$ represents the summation operation; $j$ represents the serial number of a network user; $u_{ji}$ represents the link relation between the j-th network user and the i-th network user, with $u_{ji} = 1$ if a link relation exists between the two network users and $u_{ji} = 0$ if no link exists; $k_j$ represents the total number of links between the j-th network user and other users; and $S_j$ represents the link compactness of the j-th network user;
(3) filtering the network users:
sequencing the link compactness of all network users from high to low, reserving the first 80 percent of the network users as network users to be classified, and deleting the rest network users as junk users;
(4) setting a threshold value by utilizing the activity of the network users to be classified:
(4a) calculating the activity of each network user to be classified by using the following formula:
$\theta_m = 0.9 \lg(d_m + 1)$
wherein $\theta_m$ represents the activity of the m-th network user to be classified, lg represents the base-10 logarithm, and $d_m$ represents the number of times the m-th network user to be classified logged in to the website during the activity evaluation period;
(4b) setting the minimum value of the activity of the network user to be classified during the activity evaluation period as a threshold value;
(5) calculating the relevance C by using the keywords of each network user to be classified:
(5a) extracting keywords of each network user to be classified by using a keyword extraction tool;
(5b) calculating the correlation C of each network user to be classified and the search keyword of the open source programming website where the user is located by utilizing a cosine similarity formula;
(6) classifying the network users to be classified:
and multiplying the relevancy C of each network user to be classified and the search keyword of the open source programming website where the user is located by the activity of the user, and taking the network user to be classified with the product larger than a threshold value as an active user under the open source programming website search keyword classification.
Compared with the prior art, the invention has the following advantages:
First, the invention filters the network users by graph link analysis and screens out junk users. This overcomes the problem in the prior art that a large number of junk users increases the workload of user classification, so the invention classifies network users more efficiently.
Second, the invention analyzes user activity and classifies users in combination with their activity in the network. This overcomes the problem in the prior art that classifying users only by their personal interest data gives low accuracy, so the classification results for network users are more accurate.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The specific steps of the implementation of the present invention will be further described with reference to fig. 1.
Step 1, constructing a network user topological graph.
The link information of each user page in the open source programming website is crawled with a web crawler tool and imported into the Python third-party package NetworkX to generate a network user topological graph, wherein nodes in the topological graph represent network users, and links between the network users are represented as edges between the nodes.
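As a rough illustration of Step 1, the sketch below turns crawled link pairs into an adjacency map. It uses only the standard library rather than NetworkX, treats links as undirected (an assumption; the text does not state the direction), and the names `build_user_graph` and `crawled_links` are hypothetical.

```python
from collections import defaultdict

def build_user_graph(links):
    """Build an undirected adjacency map from (user, linked_user) pairs."""
    graph = defaultdict(set)
    for src, dst in links:
        if src == dst:
            continue  # a page linking to itself adds no edge
        graph[src].add(dst)
        graph[dst].add(src)
    return dict(graph)

# Hypothetical crawl result: each pair means one user page links to another.
crawled_links = [
    ("alice", "bob"),
    ("bob", "carol"),
    ("alice", "carol"),
    ("dave", "alice"),
]
graph = build_user_graph(crawled_links)
```

With NetworkX, the same list of pairs could be passed to `Graph.add_edges_from` to obtain an equivalent graph object.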
Step 2, calculating the link compactness of each network user by using the following graph link analysis formula.
Figure GDA0003480660920000041
wherein $S_i$ represents the link compactness of the i-th network user; $d$ represents a damping factor with a value range of 0.70 to 0.85; $N$ represents the total number of network users in the network user topological graph; $\sum$ represents the summation operation; $j$ represents the serial number of a network user; $u_{ji}$ represents the link relation between the j-th network user and the i-th network user, with $u_{ji} = 1$ if a link relation exists between the two network users and $u_{ji} = 0$ if no link exists; $k_j$ represents the total number of links between the j-th network user and other users; and $S_j$ represents the link compactness of the j-th network user.
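The formula image itself is not reproduced in this text. The symbols defined above match the standard PageRank recurrence $S_i = \frac{1-d}{N} + d \sum_{j} \frac{u_{ji}}{k_j} S_j$, so the sketch below assumes that form; it is not confirmed to be the patented formula. The graph is assumed to be an undirected adjacency map, and `link_compactness`, `max_iter`, and `tol` are hypothetical names.

```python
def link_compactness(graph, d=0.85, max_iter=200, tol=1e-12):
    """Iterate S_i = (1 - d) / N + d * sum_j u_ji * S_j / k_j until it settles."""
    nodes = sorted(graph)
    n = len(nodes)
    scores = {user: 1.0 / n for user in nodes}
    for _ in range(max_iter):
        updated = {}
        for i in nodes:
            # graph[i] holds exactly the users j with u_ji = 1; k_j = len(graph[j]).
            shared = sum(scores[j] / len(graph[j]) for j in graph[i])
            updated[i] = (1.0 - d) / n + d * shared
        change = sum(abs(updated[u] - scores[u]) for u in nodes)
        scores = updated
        if change < tol:
            break
    return scores

# Toy star-shaped graph: every other user links only to "hub".
star = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
scores = link_compactness(star)
```

In this toy graph the hub ends up with the highest score, which is the behaviour the filtering step relies on. NetworkX also provides a `pagerank` function with an `alpha` damping parameter, which may serve as a practical alternative, although it treats some edge cases differently from this sketch.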
Step 3, filtering the network users.
The link compactness values of all the network users are sorted from high to low; the top 80 percent of the network users are retained as the network users to be classified, and the remaining network users are deleted as junk users.
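The filtering rule can be sketched as follows. Rounding the 80 percent cut-off down to a whole number of users is an assumption, since the text does not say how fractional counts are handled; `filter_users` and the sample scores are hypothetical.

```python
def filter_users(scores, keep_percent=80):
    """Split users into (to_classify, junk) by descending link compactness."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Integer arithmetic avoids floating-point surprises; rounds down (assumption).
    keep_count = len(ranked) * keep_percent // 100
    return ranked[:keep_count], ranked[keep_count:]

# Hypothetical link compactness values for five users.
scores = {"u1": 0.30, "u2": 0.25, "u3": 0.20, "u4": 0.15, "u5": 0.10}
to_classify, junk = filter_users(scores)
```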
Step 4, setting a threshold value by utilizing the activity of the network users to be classified.
First, the activity of each network user to be classified is calculated by using the following formula.
$\theta_i = 0.9 \lg(d_i + 1)$
wherein $\theta_i$ represents the activity of the i-th network user to be classified, lg represents the base-10 logarithm, and $d_i$ represents the number of times the i-th network user to be classified logged in to the website during the activity evaluation period. In an embodiment of the present invention, the activity evaluation period is selected to be half a year.
Second, the minimum value of the activity of the network users to be classified during the activity evaluation period is set as the threshold value.
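A minimal sketch of Step 4, reading the formula as the product of 0.9 and the base-10 logarithm of the login count plus one. The login counts and the names `activity` and `logins` are hypothetical.

```python
import math

def activity(login_count):
    """Activity = 0.9 * lg(d + 1), with lg taken as the base-10 logarithm."""
    return 0.9 * math.log10(login_count + 1)

# Hypothetical login counts over the evaluation period (half a year in the embodiment).
logins = {"alice": 99, "bob": 9, "carol": 0}
activities = {user: activity(count) for user, count in logins.items()}

# The threshold is the smallest activity among the users to be classified.
threshold = min(activities.values())
```

Under this reading, a user who never logged in has activity 0, so the threshold in this example is 0.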
Step 5, calculating the relevance C by using the keywords of each network user to be classified.
First, the keywords of each network user to be classified are extracted by using a keyword extraction tool. In the embodiment of the invention, the project description text of each user project in the open source programming website is extracted, and keywords are extracted from the project description text by using the keyword extraction tool RAKE (Rapid Automatic Keyword Extraction).
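To give a feel for what a RAKE-style extractor does, the sketch below implements a simplified version of the idea: split the text into candidate phrases at stop words and punctuation, score each word by degree divided by frequency, and rank phrases by the sum of their word scores. It is not the RAKE tool used in the embodiment; the tiny stop-word list and the name `rake_keywords` are hypothetical.

```python
import re
from collections import defaultdict

# Deliberately tiny stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "the", "to", "with"}

def rake_keywords(text, top_k=3):
    """Return the top_k candidate phrases ranked by a RAKE-style score."""
    phrases = []
    # Punctuation ends a phrase; stop words also end a phrase.
    for fragment in re.split(r"[^\w\s-]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOP_WORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself
    scored = {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}
    ranked = sorted(scored.items(), key=lambda item: (-item[1], item[0]))
    return [phrase for phrase, _ in ranked[:top_k]]

description = "A web crawler for graph link analysis in Python."
keywords = rake_keywords(description, top_k=2)
```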
Second, the relevance C between each network user to be classified and the search keyword of the open source programming website where the user is located is calculated by using the following cosine similarity formula.
$$C_n = \frac{q \cdot w_n}{\|q\|_2 \, \|w_n\|_2}$$
wherein $C_n$ represents the relevance between the n-th network user to be classified and the search keyword of the open source programming website where the user is located, $q$ represents the search keyword of the open source programming website, $\cdot$ represents the dot product operation, $w_n$ represents the keywords of the n-th network user to be classified, and $\|\cdot\|_2$ represents the 2-norm operation.
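The cosine similarity between a search keyword and a user's keywords can be sketched as below. The text does not say how keywords are turned into vectors, so this example assumes simple term-count vectors over the words involved; `cosine_similarity` and the sample terms are hypothetical.

```python
import math
from collections import Counter

def cosine_similarity(query_terms, user_terms):
    """Cosine of the angle between two term-count vectors."""
    q = Counter(query_terms)
    w = Counter(user_terms)
    dot = sum(q[term] * w[term] for term in q)  # missing terms count as 0
    norm_q = math.sqrt(sum(c * c for c in q.values()))
    norm_w = math.sqrt(sum(c * c for c in w.values()))
    if norm_q == 0 or norm_w == 0:
        return 0.0  # no keywords on one side: treat as unrelated
    return dot / (norm_q * norm_w)

same = cosine_similarity(["python"], ["python"])
partial = cosine_similarity(["python", "graph"], ["python"])
unrelated = cosine_similarity(["python"], ["java"])
```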
Step 6, classifying the network users to be classified.
The relevance C between each network user to be classified and the search keyword of the open source programming website where the user is located is multiplied by the activity of that user, and each network user to be classified whose product is greater than the threshold value is taken as an active user under the classification of that search keyword.
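The final decision rule can be sketched as follows, assuming relevance and activity values have already been computed for one search keyword. The sample values and the name `classify_active_users` are hypothetical.

```python
def classify_active_users(relevance, activity, threshold):
    """Return users whose relevance * activity exceeds the threshold."""
    return sorted(
        user for user in relevance
        if relevance[user] * activity[user] > threshold
    )

# Hypothetical per-user values for a single search keyword.
activity = {"alice": 1.8, "bob": 0.9, "carol": 0.6}
relevance = {"alice": 0.9, "bob": 0.1, "carol": 0.5}
threshold = min(activity.values())  # minimum activity, as set in Step 4

active = classify_active_users(relevance, activity, threshold)
```

Because a cosine similarity of non-negative vectors is at most 1, the user with the minimum activity can never exceed the threshold under this rule, so at least one user is always left out of the active class.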

Claims (2)

1. A network user classification method based on graph link analysis is characterized in that a graph link analysis method is utilized to filter network users, analyze the activity of the network users and classify the network users by combining the activity of the network users; the method comprises the following specific steps:
(1) constructing a network user topological graph:
crawling link information of each user page in an open source programming website by using a web crawler tool, and importing the link information into a complex network modeling tool to generate a network user topological graph, wherein nodes in the topological graph represent network users, and links between the network users are represented as edges between the nodes;
(2) calculating the link compactness of each network user in the network user topological graph by using the following graph link analysis formula:
Figure FDA0003480660910000011
wherein $S_i$ represents the link compactness of the i-th network user; $d$ represents a damping factor with a value in [0.70, 0.85]; $N$ represents the total number of network users in the network user topological graph; $\sum$ represents the summation operation; $j$ represents the serial number of a network user; $u_{ji}$ represents the link relation between the j-th network user and the i-th network user, with $u_{ji} = 1$ if a link relation exists between the two network users and $u_{ji} = 0$ if no link exists; $k_j$ represents the total number of links between the j-th network user and other users; and $S_j$ represents the link compactness of the j-th network user;
(3) filtering the network users:
sequencing the link compactness of all network users from high to low, reserving the first 80 percent of the network users as network users to be classified, and deleting the rest network users as junk users;
(4) setting a threshold value by utilizing the activity of the network users to be classified:
(4a) calculating the activity of each network user to be classified by using the following formula:
$\theta_m = 0.9 \lg(d_m + 1)$
wherein $\theta_m$ represents the activity of the m-th network user to be classified, lg represents the base-10 logarithm, and $d_m$ represents the number of times the m-th network user to be classified logged in to the website during the activity evaluation period;
(4b) setting the minimum value of the activity of the network user to be classified during the activity evaluation period as a threshold value;
(5) calculating the relevance C by using the keywords of each network user to be classified:
(5a) extracting keywords of each network user to be classified by using a keyword extraction tool;
(5b) calculating the correlation C of each network user to be classified and the search keyword of the open source programming website where the user is located by utilizing a cosine similarity formula;
(6) classifying the network users to be classified:
and multiplying the relevancy C of each network user to be classified and the search keyword of the open source programming website where the user is located by the activity of the user, and taking the network user to be classified with the product larger than a threshold value as an active user under the open source programming website search keyword classification.
2. The method for classifying network users based on graph link analysis according to claim 1, wherein the cosine similarity formula in step (5b) is as follows:
$$C_n = \frac{q \cdot w_n}{\|q\|_2 \, \|w_n\|_2}$$
wherein $C_n$ represents the relevance between the n-th network user to be classified and the search keyword of the open source programming website where the user is located, $q$ represents the search keyword of the open source programming website, $\cdot$ represents the dot product operation, $w_n$ represents the keywords of the n-th network user to be classified, and $\|\cdot\|_2$ represents the 2-norm operation.
CN202010018052.7A 2020-01-08 2020-01-08 Network user classification method based on graph link analysis Active CN111209513B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010018052.7A CN111209513B (en) 2020-01-08 2020-01-08 Network user classification method based on graph link analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010018052.7A CN111209513B (en) 2020-01-08 2020-01-08 Network user classification method based on graph link analysis

Publications (2)

Publication Number Publication Date
CN111209513A CN111209513A (en) 2020-05-29
CN111209513B true CN111209513B (en) 2022-04-19

Family

ID=70787173

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010018052.7A Active CN111209513B (en) 2020-01-08 2020-01-08 Network user classification method based on graph link analysis

Country Status (1)

Country Link
CN (1) CN111209513B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106339469A (en) * 2016-08-29 2017-01-18 乐视控股(北京)有限公司 Method and device for recommending data
CN106980692A (en) * 2016-05-30 2017-07-25 国家计算机网络与信息安全管理中心 A kind of influence power computational methods based on microblogging particular event

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10990579B2 (en) * 2017-11-30 2021-04-27 Wipro Limited Method and system for providing response to user input

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
根据多维特征的网络用户分类研究;窦伊男;《信息科技辑》;20101231;全文 *

Also Published As

Publication number Publication date
CN111209513A (en) 2020-05-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant