CN111209513B - Network user classification method based on graph link analysis - Google Patents
Network user classification method based on graph link analysis
- Publication number
- CN111209513B
- Authority
- CN
- China
- Prior art keywords
- network
- user
- classified
- users
- network user
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
- G06F16/951—Indexing; Web crawling techniques
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a network user classification method based on graph link analysis, which mainly comprises the following steps: constructing a network user topological graph; calculating the link compactness of each network user in the network user topological graph by using a graph link analysis formula; filtering the network users; setting a threshold value by using the activity of the network users to be classified; calculating the relevancy C by using the keywords of each network user to be classified; and classifying the network users to be classified. The invention has the advantage of high classification efficiency on the premise of ensuring the classification accuracy of the user.
Description
Technical Field
The invention belongs to the technical field of physics, and further relates to a network user classification method based on graph link analysis in the technical field of network classification. The invention can be used for solving the classification problem of network users in the Internet.
Background
A network hosts a large number of users whose interests differ, and a large number of junk users are also among them. Junk users often post useless information on websites, disrupting network order. Filtering the junk users out of a website effectively cleans up the network environment and avoids their interference. In addition, classifying the active users in the network facilitates user management and plays a vital role in subsequent user growth and website operation. At present, most user classification methods classify all users of a website according to user relationships and users' personal information. These methods improve the accuracy of the classification results but reduce classification efficiency.
The patent document "network community user group division method based on links and text content" (patent application No. CN201310084039.1, publication No. CN103218400A), filed by Beijing University of Technology, discloses a network user classification method based on links and text content. The method uses a link-based analysis method to analyze the network structure that network community users exhibit through their links, uses an interest-based analysis method to analyze the shared-interest structure that users exhibit through their text content, and fuses the differences between the two results to obtain a comprehensive division of network community users into groups. On this basis, each division result is evaluated separately, the accuracy of the overall division result is verified, and group members that do not meet the index requirements are screened out according to their degree of tightness. Although the method improves the accuracy of the classification results for user classification and group division, it has the drawback that the classified network users must be screened manually, which greatly reduces classification efficiency.
The patent document "naive Bayes microblog user classification method based on feature weighting" (patent application No. 201810443273.1, publication No. CN108596276A), filed by Chongqing University of Posts and Telecommunications, discloses a naive Bayes microblog user classification method based on feature weighting. The method first divides scattered microblog user data into a training data set and a test data set. It then computes, on the training data set, the prior probability, conditional probability and information gain of each feature, builds a target optimization matrix according to the information gain ranking, and determines the weight of each feature. Finally, it calculates the posterior probability of the test data, and the class with the maximum posterior probability is the classification result. The drawback of this method is that it classifies users according to personal information that microblog users fill in arbitrarily, so the accuracy of the classification results obtained in practical applications is low.
Disclosure of Invention
The invention aims to provide a network user classification method based on graph link analysis aiming at the defects of the prior art, which is used for solving the problem of improving the classification efficiency of network users while ensuring the classification accuracy of the network users.
The specific idea of the invention is that a graph link analysis method is used for filtering the network users, calculating the activity of the network users and classifying the network users by combining the activity of the network users.
In order to achieve the purpose, the method comprises the following specific implementation steps:
(1) constructing a network user topological graph:
crawling link information of each user page in an open source programming website by using a web crawler tool, and importing the link information into a complex network modeling tool to generate a network user topological graph, wherein nodes in the topological graph represent network users, and links between the network users are represented as edges between the nodes;
(2) calculating the link compactness of each network user in the network user topological graph by using the following graph link analysis formula:
$$S_i = \frac{1-d}{N} + d\sum_{j=1}^{N}\frac{u_{ji}\,S_j}{k_j}$$
wherein $S_i$ represents the link compactness of the i-th network user; $d$ represents a damping factor with a value in [0.70, 0.85]; $N$ represents the total number of network users in the network user topological graph; $\Sigma$ represents the summation operation; $j$ represents the serial number of a network user; $u_{ji}$ represents the link relation between the j-th network user and the i-th network user, taking the value 1 if a link exists between the two network users and 0 if no link exists; $k_j$ represents the total number of links between the j-th network user and other users; and $S_j$ represents the link compactness of the j-th network user;
(3) filtering the network users:
sequencing the link compactness of all network users from high to low, reserving the first 80 percent of the network users as network users to be classified, and deleting the rest network users as junk users;
(4) setting a threshold value by utilizing the activity of the network users to be classified:
(4a) calculating the activity of each network user to be classified by using the following formula:
$\theta_m = 0.9\,\lg(d_m + 1)$
wherein $\theta_m$ represents the activity of the m-th network user to be classified, lg represents the base-10 logarithm, and $d_m$ represents the number of times the m-th network user to be classified logged in to the website during the activity evaluation period;
(4b) setting the minimum value of the activity of the network user to be classified during the activity evaluation period as a threshold value;
(5) calculating the relevance C by using the keywords of each network user to be classified:
(5a) extracting keywords of each network user to be classified by using a keyword extraction tool;
(5b) calculating the relevance C between each network user to be classified and the search keyword of the open source programming website where the user is located, by using a cosine similarity formula;
(6) classifying the network users to be classified:
and multiplying the relevancy C of each network user to be classified and the search keyword of the open source programming website where the user is located by the activity of the user, and taking the network user to be classified with the product larger than a threshold value as an active user under the open source programming website search keyword classification.
Compared with the prior art, the invention has the following advantages:
First, the invention filters the network users by graph link analysis and screens out the junk users. This overcomes the prior-art problem that a large number of junk users increases the workload of user classification, so the network users are classified more efficiently.
Second, the invention analyzes the activity of each user and classifies users in combination with their activity in the network. This overcomes the prior-art problem that classifying users solely on their personal interest data yields low accuracy, so the classification results for network users are more accurate.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The specific steps of the implementation of the present invention will be further described with reference to fig. 1.
Step 1, constructing a network user topological graph.
Link information of each user page in the open source programming website is crawled with a web crawler tool and imported into the Python third-party package NetworkX to generate a network user topological graph, wherein nodes in the topological graph represent network users, and links between network users are represented as edges between the nodes.
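The graph construction in Step 1 can be sketched as follows. The embodiment names NetworkX; this sketch uses a plain dictionary of neighbour sets instead so that it has no dependencies. The function name `build_user_graph`, the sample user names, and the choice of an undirected graph are assumptions made for illustration and are not taken from the patent.

```python
from collections import defaultdict


def build_user_graph(link_pairs):
    """Build an undirected adjacency map {user: set_of_neighbours}.

    `link_pairs` is an iterable of (user_a, user_b) tuples, one per link
    found on a crawled user page. Treating links as undirected is an
    assumption of this sketch.
    """
    graph = defaultdict(set)
    for a, b in link_pairs:
        if a == b:
            continue  # ignore self-links
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)


# Hypothetical crawl output: each pair means "these two user pages are linked".
links = [
    ("alice", "bob"),
    ("bob", "carol"),
    ("alice", "carol"),
    ("dave", "alice"),
    ("alice", "bob"),  # duplicate links collapse into one edge
]
graph = build_user_graph(links)
```

With NetworkX available, the same structure could be produced by adding the pairs as edges of a graph object; the dictionary form is used here only to keep the sketch self-contained.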
Step 2, calculating the link compactness of each network user by using the following graph link analysis formula.
$$S_i = \frac{1-d}{N} + d\sum_{j=1}^{N}\frac{u_{ji}\,S_j}{k_j}$$
Wherein $S_i$ represents the link compactness of the i-th network user; $d$ represents a damping factor with a value in the range 0.70 to 0.85; $N$ represents the total number of network users in the network user topological graph; $\Sigma$ represents the summation operation; $j$ represents the serial number of a network user; $u_{ji}$ represents the link relation between the j-th network user and the i-th network user, taking the value 1 if a link exists between the two network users and 0 if no link exists; $k_j$ represents the total number of links between the j-th network user and other users; and $S_j$ represents the link compactness of the j-th network user.
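A minimal sketch of the Step 2 computation follows. It assumes the formula has the PageRank-style form S_i = (1 - d)/N + d · Σ_j u_ji · S_j / k_j, which matches the symbol definitions above, and solves it by fixed-point iteration. The iteration scheme, the tolerance, the uniform starting values and the function name `link_compactness` are assumptions of this sketch.

```python
def link_compactness(graph, d=0.85, tol=1e-10, max_iter=200):
    """Iterate S_i = (1 - d)/N + d * sum_j u_ji * S_j / k_j until it settles.

    `graph` maps each user to the set of users it is linked with
    (undirected), so u_ji = 1 exactly when j is in graph[i] and
    k_j = len(graph[j]).
    """
    users = sorted(graph)
    n = len(users)
    score = {u: 1.0 / n for u in users}  # assumed uniform starting point
    for _ in range(max_iter):
        new_score = {}
        for i in users:
            # Only neighbours contribute, because u_ji is 0 for everyone else.
            incoming = sum(score[j] / len(graph[j]) for j in graph[i])
            new_score[i] = (1 - d) / n + d * incoming
        change = sum(abs(new_score[u] - score[u]) for u in users)
        score = new_score
        if change < tol:
            break
    return score


# Hypothetical star-shaped graph: "a" is linked with everyone else.
star = {"a": {"b", "c", "d"}, "b": {"a"}, "c": {"a"}, "d": {"a"}}
scores = link_compactness(star, d=0.85)
```

In this example the hub user `a` ends up with the highest score and the three leaf users share the same lower score, which is the behaviour the filtering step relies on.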
Step 3, filtering the network users.
The link compactness values of all network users are sorted from high to low; the top 80 percent of the network users are retained as the network users to be classified, and the remaining network users are deleted as junk users.
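The filtering in Step 3 can be sketched as below. The patent does not say how to round when 80 percent of the user count is not a whole number or how to break ties; rounding up and breaking ties by user name are assumptions of this sketch, as are the function name and the sample scores.

```python
def filter_users(scores, keep_percent=80):
    """Split users into (kept, junk) by link compactness, highest first.

    Rounding the cut-off up and breaking ties by user name are
    assumptions; the source only states "the first 80 percent".
    """
    ranked = sorted(scores, key=lambda user: (-scores[user], user))
    keep = (len(ranked) * keep_percent + 99) // 100  # integer ceiling
    return ranked[:keep], ranked[keep:]


# Hypothetical link compactness values.
compactness = {"a": 0.40, "b": 0.25, "c": 0.20, "d": 0.10, "e": 0.05}
to_classify, junk = filter_users(compactness)
```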
Step 4, setting a threshold value by using the activity of the network users to be classified.
First, the activity of each network user to be classified is calculated by using the following formula.
$\theta_i = 0.9\,\lg(d_i + 1)$
Wherein $\theta_i$ represents the activity of the i-th network user to be classified, lg represents the base-10 logarithm, and $d_i$ represents the number of times the i-th network user to be classified logged in to the website during the activity evaluation period. In an embodiment of the present invention, the activity evaluation period is half a year.
Second, the minimum activity among the network users to be classified during the activity evaluation period is set as the threshold value.
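The activity formula and threshold of Step 4 amount to a few lines of arithmetic. In the sketch below the login counts and user names are made-up illustration data, and `activity` is a hypothetical helper name.

```python
import math


def activity(login_count):
    """Activity = 0.9 * lg(login_count + 1), with lg the base-10 logarithm."""
    return 0.9 * math.log10(login_count + 1)


# Hypothetical login counts over the activity evaluation period.
logins = {"alice": 99, "bob": 9, "carol": 0}
theta = {user: activity(count) for user, count in logins.items()}

# The threshold is the smallest activity among the users to be classified.
threshold = min(theta.values())
```

Adding 1 inside the logarithm keeps the activity defined (and equal to 0) for a user who never logged in during the period.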
Step 5, calculating the relevance C by using the keywords of each network user to be classified.
First, keywords of each network user to be classified are extracted by using a keyword extraction tool. In the embodiment of the invention, the project description text of each user project in the open source programming website is extracted, and keywords are extracted from the project description text by using the keyword extraction tool RAKE (Rapid Automatic Keyword Extraction).
Second, the relevance C between each network user to be classified and the search keyword of the open source programming website where the user is located is calculated by using the following cosine similarity formula.
$$C_n = \frac{q \cdot w_n}{\lVert q \rVert_2\,\lVert w_n \rVert_2}$$
Wherein $C_n$ represents the relevance between the n-th network user to be classified and the search keyword of the open source programming website where the user is located, $q$ represents the search keyword of the open source programming website, $\cdot$ represents the dot product operation, $w_n$ represents the keywords of the n-th network user to be classified, and $\lVert\cdot\rVert_2$ represents the 2-norm operation.
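The cosine similarity in Step 5 can be sketched as follows. The patent does not specify how keywords are turned into vectors; this sketch assumes simple term-count vectors over keywords that have already been extracted (for example by a RAKE tool), and returns 0 when either side has no keywords. The function name and sample keywords are hypothetical.

```python
import math
from collections import Counter


def cosine_relevance(query_keywords, user_keywords):
    """Cosine similarity between two keyword lists as term-count vectors."""
    q = Counter(query_keywords)
    w = Counter(user_keywords)
    dot = sum(q[term] * w[term] for term in q)  # missing terms count as 0
    norm_q = math.sqrt(sum(c * c for c in q.values()))
    norm_w = math.sqrt(sum(c * c for c in w.values()))
    if norm_q == 0 or norm_w == 0:
        return 0.0  # assumed convention for an empty keyword list
    return dot / (norm_q * norm_w)


# Hypothetical search keyword and per-user keywords.
query = ["python"]
relevance_same = cosine_relevance(query, ["python"])
relevance_partial = cosine_relevance(query, ["python", "web"])
relevance_none = cosine_relevance(query, ["cooking"])
```

Because term counts are never negative, the result always lies between 0 and 1.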
Step 6, classifying the network users to be classified.
The relevance C between each network user to be classified and the search keyword of the open source programming website where the user is located is multiplied by the activity of that user, and each network user to be classified whose product is larger than the threshold value is taken as an active user under the classification of that search keyword.
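The final decision in Step 6 is a single comparison per user. The sketch below uses made-up relevance and activity values; `classify_active_users` is a hypothetical helper name.

```python
def classify_active_users(relevance, activity, threshold):
    """Return users whose relevance * activity exceeds the threshold."""
    return sorted(
        user for user in relevance
        if relevance[user] * activity[user] > threshold
    )


# Hypothetical per-user values for one search keyword.
relevance = {"alice": 0.8, "bob": 0.5, "carol": 0.9}
activity = {"alice": 1.8, "bob": 0.9, "carol": 0.3}
threshold = min(activity.values())  # minimum activity, as in Step 4

active_users = classify_active_users(relevance, activity, threshold)
```

One consequence worth noting, assuming the relevance lies between 0 and 1: the least active user's product can never exceed that user's own activity, so this user is never classified as active under a strict "larger than" comparison.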
Claims (2)
1. A network user classification method based on graph link analysis is characterized in that a graph link analysis method is utilized to filter network users, analyze the activity of the network users and classify the network users by combining the activity of the network users; the method comprises the following specific steps:
(1) constructing a network user topological graph:
crawling link information of each user page in an open source programming website by using a web crawler tool, and importing the link information into a complex network modeling tool to generate a network user topological graph, wherein nodes in the topological graph represent network users, and links between the network users are represented as edges between the nodes;
(2) calculating the link compactness of each network user in the network user topological graph by using the following graph link analysis formula:
$$S_i = \frac{1-d}{N} + d\sum_{j=1}^{N}\frac{u_{ji}\,S_j}{k_j}$$
wherein $S_i$ represents the link compactness of the i-th network user; $d$ represents a damping factor with a value in [0.70, 0.85]; $N$ represents the total number of network users in the network user topological graph; $\Sigma$ represents the summation operation; $j$ represents the serial number of a network user; $u_{ji}$ represents the link relation between the j-th network user and the i-th network user, taking the value 1 if a link exists between the two network users and 0 if no link exists; $k_j$ represents the total number of links between the j-th network user and other users; and $S_j$ represents the link compactness of the j-th network user;
(3) filtering the network users:
sequencing the link compactness of all network users from high to low, reserving the first 80 percent of the network users as network users to be classified, and deleting the rest network users as junk users;
(4) setting a threshold value by utilizing the activity of the network users to be classified:
(4a) calculating the activity of each network user to be classified by using the following formula:
$\theta_m = 0.9\,\lg(d_m + 1)$
wherein $\theta_m$ represents the activity of the m-th network user to be classified, lg represents the base-10 logarithm, and $d_m$ represents the number of times the m-th network user to be classified logged in to the website during the activity evaluation period;
(4b) setting the minimum value of the activity of the network user to be classified during the activity evaluation period as a threshold value;
(5) calculating the relevance C by using the keywords of each network user to be classified:
(5a) extracting keywords of each network user to be classified by using a keyword extraction tool;
(5b) calculating the correlation C of each network user to be classified and the search keyword of the open source programming website where the user is located by utilizing a cosine similarity formula;
(6) classifying the network users to be classified:
and multiplying the relevancy C of each network user to be classified and the search keyword of the open source programming website where the user is located by the activity of the user, and taking the network user to be classified with the product larger than a threshold value as an active user under the open source programming website search keyword classification.
2. The method for classifying network users based on graph link analysis according to claim 1, wherein the cosine similarity formula in step (5b) is as follows:
$$C_n = \frac{q \cdot w_n}{\lVert q \rVert_2\,\lVert w_n \rVert_2}$$
wherein $C_n$ represents the relevance between the n-th network user to be classified and the search keyword of the open source programming website where the user is located, $q$ represents the search keyword of the open source programming website, $\cdot$ represents the dot product operation, $w_n$ represents the keywords of the n-th network user to be classified, and $\lVert\cdot\rVert_2$ represents the 2-norm operation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010018052.7A CN111209513B (en) | 2020-01-08 | 2020-01-08 | Network user classification method based on graph link analysis |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010018052.7A CN111209513B (en) | 2020-01-08 | 2020-01-08 | Network user classification method based on graph link analysis |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111209513A CN111209513A (en) | 2020-05-29 |
CN111209513B (en) | 2022-04-19 |
Family
ID=70787173
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010018052.7A Active CN111209513B (en) | 2020-01-08 | 2020-01-08 | Network user classification method based on graph link analysis |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111209513B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106339469A (en) * | 2016-08-29 | 2017-01-18 | 乐视控股(北京)有限公司 | Method and device for recommending data |
CN106980692A (en) * | 2016-05-30 | 2017-07-25 | 国家计算机网络与信息安全管理中心 | A kind of influence power computational methods based on microblogging particular event |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10990579B2 (en) * | 2017-11-30 | 2021-04-27 | Wipro Limited | Method and system for providing response to user input |
Non-Patent Citations (1)
Title |
---|
根据多维特征的网络用户分类研究;窦伊男;《信息科技辑》;20101231;全文 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||