CN108647800B - Online social network user missing attribute prediction method based on node embedding - Google Patents

Info

Publication number
CN108647800B
Authority
CN
China
Prior art keywords
user
attribute
social network
data
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810222943.7A
Other languages
Chinese (zh)
Other versions
CN108647800A (en)
Inventor
傅晨波
张剑
宣琦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2018-03-19
Filing date
2018-03-19
Publication date
2022-01-11
Application filed by Zhejiang University of Technology ZJUT
Priority to CN201810222943.7A
Publication of CN108647800A
Application granted
Publication of CN108647800B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Tourism & Hospitality (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Business, Economics & Management (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Marketing (AREA)
  • General Engineering & Computer Science (AREA)
  • Primary Health Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Development Economics (AREA)
  • Game Theory and Decision Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A node-embedding-based method for predicting missing attributes of online social network users comprises the following steps. S1: extract user information from the online social network, including friend lists, online behaviors, and related attribute data. S2: construct a network model and embed the nodes of the network into a Euclidean space with the node2vec algorithm, obtaining embedding vectors that represent network structure features. S3: construct vectors representing the user's other features from the user's online behavior and public attribute data. S4: concatenate the vectors obtained in S2 and S3 into a final vector representing the user's features. S5: predict the user's missing attribute by training a logistic regression model. The method makes full use of the structural features of users' social networks, combines them with the users' behavior and attribute data, achieves higher accuracy in predicting missing user data, and has practical application value.

Description

Online social network user missing attribute prediction method based on node embedding
Technical Field
The invention relates to the field of data mining and network science, in particular to a node embedding-based online social network user missing attribute prediction method.
Background
The rapid development of the economy and of internet technology has given rise to online social networks such as microblogging and consumer review platforms. These networks have hundreds of millions of users, and the massive data on user attributes, user behaviors, and user social relationships that they generate have become an important resource for data mining research and applications. In an online social network, each user has various attributes, including an ID, a nickname, and a registration time. Behavior data, such as comments, likes, and reposts on a microblog, are likewise specific to each user. Network information security is receiving increasing attention. Predicting user attributes in online social networks can strengthen user identification in virtual communities and is important for deterring and combating cybercrime.
In an online social network, the social networks of different types of users tend to have different structures, while those of users with the same attributes and behaviors tend to be similar. This makes it possible to infer a user's missing attributes from the structure of the user's social network. Existing methods for predicting missing user attributes in online social networks rely mainly on the user's public attributes, social relationships, and behavior data, and do not fully consider the structural features of the user's social network. Besides extracting traditional network structure indices, a currently popular approach is to obtain an embedding vector for each node with a graph embedding method such as node2vec (see document [1]: A. Grover, J. Leskovec. node2vec: Scalable feature learning for networks. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016) to represent network structure features. With this approach, the structural features of users' social networks can be fully exploited, and missing attributes can be predicted with higher accuracy even when part of the users' public attributes and behavior data are missing.
Disclosure of Invention
To remedy the shortcomings of existing missing-attribute prediction in online social networks, the invention provides a node-embedding-based method for predicting missing attributes of online social network users. The network structure of each user is embedded into a Euclidean space by the node embedding method node2vec, which extracts the structural features of the user's social network; combined with the user's public attributes and behavior data, this allows the user's missing attributes to be predicted more accurately.
The technical scheme adopted by the invention for solving the technical problems is as follows:
A node-embedding-based method for predicting missing attributes of online social network users comprises the following steps:
S1: data collection and processing, namely crawling user data from an online social network, wherein the user data comprises the user's friend list, behaviors, and attribute data;
S2: forming a network from the user data collected in S1, and embedding each node into a Euclidean space with node2vec to obtain an embedding vector representing the structural features of the user's social network;
S3: processing the user's behavior and known attribute data to form a vector representing the user's features other than social network structure;
S4: concatenating the vector representing the user's structural features with the vectors of the other features to obtain a final vector representing the user's features;
S5: defining a training set and a test set, and predicting the user's missing attribute by training a logistic regression classifier.
Further, in step S5, after training stops or the model converges, the model accuracy is checked with the samples in the test set.
Still further, in step S1, the user's behavior data includes likes, reposts, and comments on microblogs, and reviews, ratings, and consumption behavior on review websites; the user's attribute data includes gender, place of residence, and occupation.
Further, in step S2, before the nodes of the online social network are embedded with the node2vec algorithm, the return probability parameter p, the leaving probability parameter q, and the embedding vector dimension N of the algorithm need to be determined. The parameter p controls the probability that a node2vec random walk returns to the node it just came from, which emphasizes local features of the network; the parameter q controls the probability that the walk moves on to other nodes, which emphasizes global features of the network. Different values of p, q, and N affect the quality of the extracted network structure features, so several sets of parameters need to be compared.
Still further, in steps S3 and S4, the vectors representing the user's other features are constructed by normalizing the selected behavior data and known attribute data and arranging them in a fixed order. Once the user's behavior feature vector and attribute feature vector have been obtained, they are concatenated with the social network structure feature vector in a fixed order to obtain the embedded vector representing the user. Note that the vectors representing the other features must be normalized to a scale comparable with the values of the social network structure vector; otherwise some features are over-emphasized and the influence of the remaining features is weakened.
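The role of p and q can be illustrated with a minimal sketch of the second-order biased walk described in the node2vec paper cited as document [1]. The unnormalized weights 1/p, 1, and 1/q follow that paper; the toy graph, function names, and walk length are assumptions made for illustration and are not taken from the patent. Turning the walks into embedding vectors (node2vec feeds them to a skip-gram model) is not shown.

```python
import random


def transition_weights(graph, prev, cur, p, q):
    """Unnormalized node2vec weights for leaving `cur` after arriving from `prev`."""
    weights = {}
    for nxt in graph[cur]:
        if nxt == prev:              # step back to the node we just came from
            weights[nxt] = 1.0 / p
        elif nxt in graph[prev]:     # stay within one hop of `prev` (local view)
            weights[nxt] = 1.0
        else:                        # move further away from `prev` (outward)
            weights[nxt] = 1.0 / q
    return weights


def biased_walk(graph, start, length, p, q, rng):
    """Generate one biased random walk of at most `length` nodes."""
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbours = sorted(graph[cur])
        if not neighbours:
            break
        if len(walk) == 1:
            # The first step has no previous node, so it is uniform.
            walk.append(rng.choice(neighbours))
            continue
        w = transition_weights(graph, walk[-2], cur, p, q)
        walk.append(rng.choices(neighbours, weights=[w[n] for n in neighbours])[0])
    return walk


# Hypothetical toy friendship graph (adjacency sets), for illustration only.
toy_graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}
weights = transition_weights(toy_graph, prev="a", cur="b", p=1.0, q=2.0)
walk = biased_walk(toy_graph, "a", 10, p=1.0, q=2.0, rng=random.Random(0))
```

With p = 1 and q = 2, a walk that arrived at b from a weights the outward step to d half as much as returning to a or moving to the shared neighbour c, so a larger q keeps walks closer to where they started.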
Further, in step S5, a logistic regression model
y_u = 1 / (1 + e^{-h_u})    (1)
is used to classify the embedded vectors, thereby predicting the user's missing attribute. Here y_u is the model output and represents the probability that user u has the attribute: if y_u > 0.5, user u is taken to have the attribute; otherwise user u is taken not to have it. h_u = a x_u + b, where a and b are the parameters to be trained and x_u is the vector representing the features of user u. If the target missing attribute can take multiple values, each attribute value needs to be binarized; for example, predicting the city where a user lives can be converted into the binary classification problem of predicting whether the user lives in a given city, and the missing attribute is then predicted through multiple such binary classifications. The maximum number of training iterations and the maximum tolerance error need to be set before model training. After the trained model is obtained, it is applied to the samples in the test set, and the prediction accuracy is used as the performance index of the model.
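As a rough illustration of the decision rule and of binarizing a multi-valued attribute, the sketch below assumes the standard logistic (sigmoid) form for the model output. The parameter values, city names, and function names are hypothetical and serve only to make the example runnable.

```python
import math


def predict_probability(x, a, b):
    """Model output y_u for feature vector x, with h_u = a . x + b."""
    h = sum(ai * xi for ai, xi in zip(a, x)) + b
    return 1.0 / (1.0 + math.exp(-h))


def has_attribute(x, a, b, threshold=0.5):
    """User is assigned the attribute only when y_u exceeds the threshold."""
    return predict_probability(x, a, b) > threshold


def binarize(labels, target):
    """Turn a multi-valued attribute into a 0/1 label for one target value."""
    return [1 if label == target else 0 for label in labels]


# Hypothetical attribute values: one binary problem is built per city.
cities = ["Hangzhou", "Shanghai", "Hangzhou", "Beijing"]
lives_in_hangzhou = binarize(cities, "Hangzhou")

# Hypothetical parameters; h_u = 0 gives a probability of exactly 0.5.
prob = predict_probability([0.0, 0.0], a=[0.3, -0.2], b=0.0)
```

One such binary classifier would be trained per attribute value; how the per-value outputs are combined into a single prediction is not specified in the text.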
The beneficial effects of the invention are as follows: the structural features of users' social networks in the online social network are fully exploited and combined with the users' behavior and attribute features, so that users' missing attributes can be predicted more accurately. Even when other user data are missing, the prediction of missing attributes can still reach relatively high accuracy.
Drawings
Fig. 1 is a flowchart of a node-embedding-based online social network user missing attribute prediction method according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Referring to Fig. 1, this embodiment of the node-embedding-based method for predicting missing attributes of online social network users uses the dataset publicly released by Yelp to predict user gender. After screening, the dataset contains 2006 users and 31521 friendships. In addition to personal information such as user ID and nickname, the dataset records each user's dining times, places, and chosen cuisines, as well as the user's friend network.
The invention comprises the following steps:
S1: for the user gender prediction problem, extract the relevant user data from the Yelp dataset, including each user's friend list, user name, and consumption records, and process the data;
S2: build a network model from the extracted users and their friend lists, and embed the nodes of the network into an N-dimensional Euclidean space with the node2vec algorithm, obtaining an embedding vector x^(0) that represents the structural features of the user's social network;
S3: construct two vectors, x^(1) and x^(2), representing user features from the gender distribution of the user's friends and from the user's consumption behavior, respectively;
S4: concatenate the vectors obtained in S2 and S3 to obtain the final vector x_u representing the user's features;
S5: split the users in the dataset into a training set and a test set at a ratio of 9:1, train the logistic regression model on the training samples, and finally test the classification accuracy of the model on the test samples.
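The network construction in S2 can be sketched as follows. This is a minimal illustration that assumes friendships are treated as undirected edges and that friends outside the screened user set are dropped; neither choice is stated in the text, and the user IDs are made up.

```python
def build_friend_network(friend_lists):
    """Build an undirected adjacency map from per-user friend lists.

    Assumption: friends that are not themselves in `friend_lists`
    (i.e. outside the screened user set) are ignored.
    """
    users = set(friend_lists)
    graph = {user: set() for user in users}
    for user, friends in friend_lists.items():
        for friend in friends:
            if friend in users and friend != user:
                graph[user].add(friend)
                graph[friend].add(user)   # friendship is treated as mutual
    return graph


# Hypothetical friend lists; "u9" is not a screened user and is dropped.
friend_lists = {
    "u1": ["u2", "u3"],
    "u2": ["u1"],
    "u3": [],
    "u4": ["u9"],
}
graph = build_friend_network(friend_lists)
edge_count = sum(len(neighbours) for neighbours in graph.values()) // 2
```

The resulting adjacency map is the kind of structure a node embedding method such as node2vec would take as input.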
In S1, predicting user gender first requires a gender label for each user, but the Yelp dataset does not include this attribute. To solve this problem, this embodiment infers each user's gender from the user's name and uses the result as the gender label. If the user is male, the gender label is 0; if the user is female, the gender label is 1. This embodiment also uses the user's consumption behavior as one of the classification features. The Yelp dataset contains each of the user's restaurant visits together with the average price level of the corresponding restaurant; the price levels are denoted 1, 2, 3, and 4 from low to high.
In S2, the relationships between users and their friends are modeled as a network, and the parameters p, q, and N need to be determined before node2vec is used. After repeated comparison, this embodiment selects p = 1, q = 2, and N = 128 as the parameters for running the node2vec algorithm. In this way, a 128-dimensional vector x^(0) describing the user's social network structure is obtained for each user.
In S3 and S4, to construct the gender ratio of a user's friends, first count the genders of each user's friends and construct the two-dimensional vector x^(1) = [N_male, N_female] / N_total, where N_male is the number of the user's male friends, N_female is the number of the user's female friends, and N_total is the user's total number of friends. For the consumption behavior feature, first count the number of times the user has visited restaurants at each of the 4 price levels, denoted T_1, T_2, T_3, and T_4, and construct the vector x^(2) = [T_1, T_2, T_3, T_4] / T_total to characterize the user's consumption behavior, where T_total is the user's total number of visits. Finally, the three vectors describing the user's features are concatenated, namely
x_u = [x^(0), x^(1), x^(2)]    (2)
The resulting 134-dimensional vector (128 + 2 + 4) is used to characterize user u.
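The feature construction above can be sketched in a few lines. The function names and sample inputs are illustrative assumptions; the gender coding (0 for male, 1 for female) and the price levels 1 to 4 follow the embodiment, and an all-zero placeholder stands in for the 128-dimensional node2vec embedding.

```python
def friend_gender_ratio(friend_labels):
    """Friend gender vector [N_male, N_female] / N_total (0 = male, 1 = female)."""
    total = len(friend_labels)
    if total == 0:
        return [0.0, 0.0]            # assumption: users without friends get zeros
    female = sum(friend_labels)
    return [(total - female) / total, female / total]


def price_level_ratio(visit_levels):
    """Consumption vector [T_1, T_2, T_3, T_4] / T_total over price levels 1..4."""
    counts = [0, 0, 0, 0]
    for level in visit_levels:
        counts[level - 1] += 1
    total = len(visit_levels)
    if total == 0:
        return [0.0, 0.0, 0.0, 0.0]  # assumption: users without visits get zeros
    return [count / total for count in counts]


def splice(x0, x1, x2):
    """Concatenate the three feature vectors in a fixed order."""
    return list(x0) + list(x1) + list(x2)


x0 = [0.0] * 128                          # placeholder for the node2vec embedding
x1 = friend_gender_ratio([0, 1, 1, 0, 1])  # two male and three female friends
x2 = price_level_ratio([1, 2, 2, 4])       # four visits across price levels
x_u = splice(x0, x1, x2)
```

Because both added vectors are ratios, their entries lie in [0, 1] and each sums to 1 whenever the user has at least one friend or visit, which keeps them on a bounded scale next to the embedding.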
In S5, the screened Yelp dataset is randomly divided at a ratio of approximately 9:1, giving 1800 users as training set samples and the remaining 206 users as test set samples. After the gender label of each training sample has been determined, model (1) is trained. In this embodiment, model (1) uses the L2 norm as a penalty term to prevent overfitting, the maximum number of training iterations is 5000, and the maximum tolerance error is 0.001. After model (1) stops training or converges, the classification accuracy of the model is tested on the samples in the test set. On the gender prediction problem for users of the Yelp dataset, the method of the invention achieves an accuracy of 71.2%.
The method makes full use of the structure of the user's social network in the online social network and combines it with the user's behavior and public attribute features, so the accuracy of missing attribute prediction can be improved to a certain extent. In addition, when only the structure of the online social network is available and other user information is missing, the network structure features alone still allow missing attributes to be predicted with a certain accuracy. The present embodiment is to be considered illustrative and not restrictive. Those skilled in the art will understand that various changes, modifications, and equivalents may be made without departing from the spirit and scope of the invention as defined in the appended claims.
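The training setup in S5 can be sketched with a small pure-Python logistic regression. Only the L2 penalty, the 5000-iteration cap, the 0.001 tolerance, and the 9:1 split come from the embodiment; the optimizer (batch gradient descent), learning rate, penalty strength, stopping test on the gradient, and the synthetic two-dimensional data are all assumptions made so the example runs on its own.

```python
import math
import random


def sigmoid(h):
    """Numerically stable logistic function."""
    if h >= 0:
        return 1.0 / (1.0 + math.exp(-h))
    e = math.exp(h)
    return e / (1.0 + e)


def train_logistic(X, y, l2=0.01, lr=0.5, max_iter=5000, tol=1e-3):
    """Batch gradient descent on the L2-penalized logistic loss."""
    dim, n = len(X[0]), len(X)
    a, b = [0.0] * dim, 0.0
    for _ in range(max_iter):
        grad_a = [l2 * aj for aj in a]       # gradient of the L2 penalty term
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(aj * xj for aj, xj in zip(a, xi)) + b) - yi
            for j in range(dim):
                grad_a[j] += err * xi[j] / n
            grad_b += err / n
        a = [aj - lr * g for aj, g in zip(a, grad_a)]
        b -= lr * grad_b
        # Stop early once the largest gradient component is below the tolerance.
        if max(abs(g) for g in grad_a + [grad_b]) < tol:
            break
    return a, b


def accuracy(X, y, a, b):
    """Fraction of samples whose thresholded output matches the label."""
    hits = 0
    for xi, yi in zip(X, y):
        predicted = sigmoid(sum(aj * xj for aj, xj in zip(a, xi)) + b) > 0.5
        hits += int(predicted == bool(yi))
    return hits / len(X)


# Synthetic stand-in data: two Gaussian clusters, not the Yelp features.
rng = random.Random(0)
samples = []
for i in range(200):
    label = i % 2
    centre = 1.0 if label else -1.0
    samples.append(([rng.gauss(centre, 0.5), rng.gauss(centre, 0.5)], label))

rng.shuffle(samples)
split = len(samples) * 9 // 10               # 9:1 split into train and test
train_set, test_set = samples[:split], samples[split:]

a, b = train_logistic([x for x, _ in train_set], [t for _, t in train_set])
test_accuracy = accuracy([x for x, _ in test_set], [t for _, t in test_set], a, b)
```

In practice a library solver would normally replace the hand-written loop; the point here is only how the penalty, iteration cap, tolerance, and held-out test accuracy fit together.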

Claims (1)

1. A node-embedding-based method for predicting missing attributes of online social network users, characterized in that the prediction method comprises the following steps:
S1: data collection and processing, namely crawling user data from an online social network, wherein the user data comprises the user's friend list, behaviors, and attribute data;
S2: forming a network from the user data collected in S1, and embedding each node into a Euclidean space with node2vec to obtain an embedding vector representing the structural features of the user's social network;
S3: processing the user's behavior and known attribute data to form a vector representing the user's features other than the social network structure features;
S4: concatenating the embedding vector representing the structural features of the user's social network with the vectors of the other features to obtain a final vector representing the user's features;
S5: defining a training set and a test set, and predicting the user's missing attribute by training a logistic regression classifier;
in S1, the user's behavior data includes likes, reposts, and comments on microblogs, and reviews, ratings, and consumption behavior on review websites; the user's attribute data includes gender, place of residence, and occupation;
in S5, after training stops or the model converges, the accuracy of the model is checked with the samples in the test set;
in S2, before the nodes of the online social network are embedded with the node2vec algorithm, the return probability parameter p, the leaving probability parameter q, and the embedding vector dimension N of the algorithm need to be determined; the parameter p controls the probability that a node2vec random walk returns to the node it just came from, which emphasizes local features of the network; the parameter q controls the probability that the walk moves on to other nodes, which emphasizes global features of the network;
in S3 and S4, the vectors representing the user's other features are constructed by normalizing the selected user behavior data and known attribute data and arranging the normalized data in a fixed order; after the user's behavior feature vector and attribute feature vector are obtained, they are concatenated with the social network structure feature vector in a fixed order to obtain the embedded vector representing the user;
in S5, a logistic regression model
y_u = 1 / (1 + e^{-h_u})    (1)
is used to classify the embedded vectors, thereby predicting the user's missing attribute; y_u is the model output and represents the probability that user u has the attribute: if y_u > 0.5, user u has the attribute, otherwise user u does not; h_u = a x_u + b, where a and b are the parameters to be trained and x_u is the vector representing the features of user u; if the target missing attribute has multiple values, each attribute value needs to be binarized, and the user's missing attribute is predicted through multiple binary classifications; the maximum number of training iterations and the maximum tolerance error are set before model training; after the trained model is obtained, it is applied to the samples in the test set, and the prediction accuracy is taken as the performance index of the model.
CN201810222943.7A 2018-03-19 2018-03-19 Online social network user missing attribute prediction method based on node embedding Active CN108647800B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810222943.7A CN108647800B (en) 2018-03-19 2018-03-19 Online social network user missing attribute prediction method based on node embedding

Publications (2)

Publication Number Publication Date
CN108647800A (en) 2018-10-12
CN108647800B (en) 2022-01-11

Family

ID=63744300

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810222943.7A Active CN108647800B (en) 2018-03-19 2018-03-19 Online social network user missing attribute prediction method based on node embedding

Country Status (1)

Country Link
CN (1) CN108647800B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109685647B (en) * 2018-12-27 2021-08-10 阳光财产保险股份有限公司 Credit fraud detection method and training method and device of model thereof, and server
CN109903087A (en) * 2019-02-13 2019-06-18 广州视源电子科技股份有限公司 The method, apparatus and storage medium of Behavior-based control feature prediction user property value
CN110134881A (en) * 2019-05-28 2019-08-16 东北师范大学 A kind of friend recommendation method and system based on the insertion of multiple information sources figure
CN111091410B (en) * 2019-11-04 2022-03-11 南京光普信息技术有限公司 Node embedding and user behavior characteristic combined net point sales prediction method
CN111160483B (en) * 2019-12-31 2023-03-17 杭州师范大学 Network relation type prediction method based on multi-classifier fusion model
CN113742665B (en) * 2020-06-05 2024-03-26 国家计算机网络与信息安全管理中心 User identity recognition model construction and user identity verification methods and devices
CN112132326B (en) * 2020-08-31 2023-12-01 浙江工业大学 Social network friend prediction method based on random walk penalty mechanism

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107122455A (en) * 2017-04-26 2017-09-01 中国人民解放军国防科学技术大学 A kind of network user's enhancing method for expressing based on microblogging
CN107145977A (en) * 2017-04-28 2017-09-08 电子科技大学 A kind of method that structured attributes deduction is carried out to online social network user

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120226623A1 (en) * 2010-10-01 2012-09-06 Linkedln Corporation Methods and systems for exploring career options

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
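node2vec: Scalable Feature Learning for Networks; Aditya Grover; ACM; 2016-08-13; full text *
Research on network-based personalized recommendation algorithms (基于网络的个性化推荐研究算法); 方宽; China Master's Theses Full-text Database, Information Science and Technology; 2016-03-15 (No. 3); main text pp. 53-56 *
Network representation learning (DeepWalk, LINE, node2vec, SDNE); blog repost, www.thutmose.cn; CSDN, https://thutmose.blog.csdn.net/article/details/79251772; 2018-02-04; full text *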

Also Published As

Publication number Publication date
CN108647800A (en) 2018-10-12

Similar Documents

Publication Publication Date Title
CN108647800B (en) Online social network user missing attribute prediction method based on node embedding
WO2022041979A1 (en) Information recommendation model training method and related device
Fayazi et al. Uncovering crowdsourced manipulation of online reviews
Mcauley et al. Discovering social circles in ego networks
Fire et al. Computationally efficient link prediction in a variety of social networks
Nettleton Data mining of social networks represented as graphs
CN111143704B (en) Online community friend recommendation method and system integrating user influence relationship
Xia et al. Design of reciprocal recommendation systems for online dating
Yigit et al. Extended topology based recommendation system for unidirectional social networks
Bhargava et al. Unsupervised modeling of users' interests from their Facebook profiles and activities
Zhao et al. Detecting profilable and overlapping communities with user-generated multimedia contents in LBSNs
Huang et al. Information fusion oriented heterogeneous social network for friend recommendation via community detection
Leng et al. Dynamically aggregating individuals’ social influence and interest evolution for group recommendations
CN113656699B (en) User feature vector determining method, related equipment and medium
Liu et al. A hybrid book recommendation algorithm based on context awareness and social network
Liu et al. [Retracted] Deep Learning and Collaborative Filtering‐Based Methods for Students’ Performance Prediction and Course Recommendation
Karpov et al. Detecting automatically managed accounts in online social networks: Graph embeddings approach
CN117251586A (en) Multimedia resource recommendation method, device and storage medium
Liu et al. From strangers to neighbors: Link prediction in microblogs using social distance game
Sabet Social Media Posts Popularity Prediction During Long-Running Live Events A case study on Fashion Week
Chen et al. From tie strength to function: Home location estimation in social network
Bide et al. Cross event detection and topic evolution analysis in cross events for man-made disasters in social media streams
CN112528130A (en) Information recommendation method, device, equipment and computer readable storage medium
CN110134881A (en) A kind of friend recommendation method and system based on the insertion of multiple information sources figure
季一木 et al. Collaborative filtering recommendation algorithm based on interactive data classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant