CN112989199B - A Collaborative Network Link Prediction Method Based on Multi-Dimensional Proximity Attribute Network - Google Patents


Info

Publication number
CN112989199B
Authority
CN
China
Prior art keywords
network
proximity
features
author
loss function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110343021.3A
Other languages
Chinese (zh)
Other versions
CN112989199A (en
Inventor
吴江
贺超城
欧桂燕
左任衔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU
Priority to CN202110343021.3A
Publication of CN112989199A
Application granted
Publication of CN112989199B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/70 Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a collaborative network link prediction method based on a multidimensional proximity attribute network, belonging to the field of collaborator recommendation. The method comprises the following steps: preserving multidimensional proximity features, local network features and global network features by using an autoencoder model, a joint probability model and an attribute Skip-Gram model, respectively, wherein the multidimensional proximity features include cognitive proximity features, geographic proximity features and institutional proximity features; combining the loss function of the autoencoder model, the loss function of the local network features, the loss function of the global network features and the L2-norm loss function into an overall objective function, and optimizing the overall objective function by stochastic gradient descent to learn representations of the network nodes; and predicting collaborative network links from the cosine similarity of the vectors corresponding to the network nodes. By jointly considering network features and node attribute information, the method improves the accuracy of collaborative network link prediction.

Description

A collaborative network link prediction method based on a multidimensional proximity attribute network

Technical Field

The present invention belongs to the field of collaborator recommendation and, more specifically, relates to a collaborative network link prediction method based on a multidimensional proximity attribute network.

Background Art

Collaborator recommendation is important for promoting scientific research collaboration. Existing studies mainly focus on recommending collaborations among all co-authors. Existing collaborator recommendation methods are mainly based on network models, content models and hybrid models. Methods based on network models incorporate local network features (e.g., common neighbors) or global network features (e.g., random walk with restart, RWR). Methods based on content models recommend authors by extracting content features (e.g., LDA-based similarity). Methods based on hybrid models combine network features and content features. Attribute network embedding models, which combine network features with node attribute information, have shown good performance. The literature identifies five dimensions of proximity in scientific research collaboration. However, existing collaborator recommendation methods include only social proximity, which is a network feature, or cognitive proximity, which is a text feature.

Patent document CN104573103B proposes a collaborator recommendation method for heterogeneous networks of scientific literature, but that method recommends collaborators by considering only the mutual willingness of a pair of authors to collaborate, and does not take the perspective of the main author.

Summary of the Invention

In view of the defects of the prior art, the purpose of the present invention is to provide a collaborative network link prediction method based on a multidimensional proximity attribute network. It aims to solve the problem that existing collaborator recommendation methods do not consider multidimensional proximity features and cannot process those features with functions so as to reflect research similarity and the probability of collaboration between authors, which results in relatively low accuracy when predicting author collaboration.

To achieve the above purpose, the present invention provides a collaborative network link prediction method based on a multidimensional proximity attribute network, comprising the following steps:

preserving multidimensional proximity features, local network features and global network features by using an autoencoder model, a joint probability model and an attribute Skip-Gram model, respectively, wherein the multidimensional proximity features include cognitive proximity features, geographic proximity features and institutional proximity features;

combining the loss function of the autoencoder model, the loss function of the local network features, the loss function of the global network features and the L2-norm loss function into an overall objective function, and optimizing the overall objective function by stochastic gradient descent to learn representations of the network nodes;

predicting collaborative network links from the cosine similarity of the vectors corresponding to the network nodes;

wherein the network nodes represent authors; the cognitive proximity features characterize an author's cognitive level in the research field; the geographic proximity features characterize the positional relationships among authors; the institutional proximity features characterize the similarity of the languages used at the authors' locations; the local network features characterize the probability of collaboration between authors; and the global network features express research similarity through the likelihood of the authors' proximity vectors.

Preferably, the cognitive proximity feature is expressed as:

[Formula image BDA0002999835890000021]

where $Cp_{i,y}$ is the cumulative sum of the cognitive vectors of the papers published by author $a_i$ in year $y$; $y_0$ is the base year; and $Y$ is the year interval;

the geographic proximity feature is expressed as $G_G = (V_G, E_G)$, where $V_G$ is the set of geographic nodes and $E_G$ is the set of geographic edges.

Preferably, institutional proximity is measured by the continuous aggregate index of common language.

Preferably, the autoencoder model is:

$h_i = \sigma_1(W^{(1)} x_i + b^{(1)})$

[Formula image BDA0002999835890000022]

where $x_i$ is the author's proximity feature vector; $h_i$ is the hidden-layer representation of the encoder; the quantity shown in formula image BDA0002999835890000023 is the reconstruction produced by the decoder; $\theta = \{W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}\}$ are the model parameters; and $\sigma_1(\cdot)$ is the tanh activation function;

the loss function of the autoencoder model is:

[Formula image BDA0002999835890000031]

where $n$ is the total number of authors.

Preferably, the loss function of the local network features is:

[Formula image BDA0002999835890000032]

where $p_{ij}$ is the joint probability of author $a_i$ and author $a_j$, and $e_{ij}$ is the edge between author $a_i$ and author $a_j$.

Preferably, the loss function of the global network features is:

[Formula image BDA0002999835890000033]

where $a_{i+j}$ is a context node in the generated node sequence; $w$ is the window size; the conditional probability $p(a_{i+j} \mid x_i)$ is the likelihood of the context $a_{i+j}$ given the proximity vector of node $i$; $G$ is the set of all network nodes; $C$ is the set of all random walk sequences; and $a_i$ represents an author.

In general, compared with the prior art, the above technical solution conceived by the present invention has the following beneficial effects:

The present invention comprehensively covers the attribute features of research collaborators from the perspective of multidimensional proximity. Cognitive proximity, geographic proximity and institutional proximity are pre-trained and expressed as low-dimensional vectors, so that the attributes specific to each author can be considered comprehensively without damaging the network features. The network features are preserved at the same time, including the local network and the global network: the local network accurately reflects the authors' willingness to collaborate, while the global network better highlights research similarity. On this basis, optimization targets the minimum of the sum of the loss functions of the network features and the node attribute features, which improves the accuracy of collaborative network link prediction.

Brief Description of the Drawings

FIG. 1(a) is a schematic diagram of the distance-weighted network of the GPN provided by an embodiment of the present invention;

FIG. 1(b) is a schematic diagram of the geographic proximity network of the GPN provided by an embodiment of the present invention;

FIG. 1(c) is a schematic diagram of the transition probabilities of GPN node D provided by an embodiment of the present invention;

FIG. 2(a) is a schematic diagram of the institutional proximity network provided by an embodiment of the present invention;

FIG. 2(b) is a schematic diagram of the transition probabilities of IPN node d provided by an embodiment of the present invention;

FIG. 3 is a schematic diagram of the construction of the scientific research collaboration network provided by an embodiment of the present invention;

FIG. 4 is a schematic diagram of the joint optimization framework provided by an embodiment of the present invention.

Detailed Description

To make the purpose, technical solution and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are intended only to explain the present invention and not to limit it.

The present invention provides a collaborative network link prediction method based on a multidimensional proximity attribute network, comprising the following steps:

preserving multidimensional proximity features, local network features and global network features by using an autoencoder model, a joint probability model and an attribute Skip-Gram model, respectively, wherein the multidimensional proximity features include cognitive proximity features, geographic proximity features and institutional proximity features;

combining the loss function of the autoencoder model, the loss function of the local network features, the loss function of the global network features and the L2-norm loss function into an overall objective function, and optimizing the overall objective function by stochastic gradient descent to learn representations of the network nodes;

predicting collaborative network links from the cosine similarity of the vectors corresponding to the network nodes;

wherein the network nodes represent authors; the cognitive proximity features characterize an author's cognitive level in the research field; the geographic proximity features characterize the positional relationships among authors; the institutional proximity features characterize the similarity of the languages used at the authors' locations; the local network features characterize the existing collaboration relationships between authors; and the global network features capture the neighborhood similarity between authors.

A detailed description follows. It first introduces the representation learning methods for multidimensional proximity used in the present invention, and then describes the framework of the ARCR model provided by the present invention (a research collaboration recommendation model: attribute-aware research recommendation).

1. The multidimensional proximity feature representations provided by the present invention include a cognitive proximity feature representation, a geographic proximity feature representation and an institutional proximity feature representation.

(1) Cognitive proximity feature representation:

Scientific papers are linked texts: they contain not only text but also citation links. Both the text content and the link information are essential for measuring the similarity of scientific papers. In the present invention, P2V (paper-to-vector) is used to represent text content features and citation link features, yielding the cognitive proximity representation of a research subject. Because the cognitive basis of a research subject changes over time, a time-weight decay factor is applied: for author $a_i$, the cognitive proximity feature representation is obtained from the cognitive vectors of the papers published in each year:

[Formula image BDA0002999835890000051]

where $Cp_{i,y}$ is the cumulative sum of the cognitive vectors of the papers published by author $a_i$ in year $y$; $y_0$ is the base year; and $Y$ is the year interval.
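The time-weighted aggregation described above can be sketched as follows. The exact formula is contained in a formula image that is not reproduced in this text, so the exponential decay weight, the `decay` parameter and the function name `cognitive_proximity` are assumptions made purely for illustration.

```python
import numpy as np

def cognitive_proximity(yearly_vectors, base_year, num_years, decay=0.5):
    """Combine per-year cognitive vectors into one time-decayed vector.

    yearly_vectors maps a year to the summed P2V vector of that year's
    papers (Cp_{i,y} in the text). The exponential weighting is an
    assumption; the patent's own weighting is in an omitted formula image.
    """
    last_year = base_year + num_years - 1
    dim = len(next(iter(yearly_vectors.values())))
    total = np.zeros(dim)
    for year in range(base_year, last_year + 1):
        vec = yearly_vectors.get(year)
        if vec is None:
            continue  # the author published nothing in this year
        weight = np.exp(-decay * (last_year - year))  # older years count less
        total += weight * np.asarray(vec, dtype=float)
    return total
```

With this weighting, the most recent year in the interval has weight 1 and each earlier year is discounted by a further factor of `exp(-decay)`.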

(2) Geographic proximity feature representation:

The Geographical Proximity Network (GPN) is defined as $G_G = (V_G, E_G)$, where $V_G$ is the set of geographic nodes and each $v_1 \in V_G$ represents a city; $E_G$ is the set of geographic edges and each $e_1 \in E_G$ represents the geographic connection between two cities, $e_1 = (u_1, v_1)$, where $u_1$ is a city other than $v_1$. Each edge $e_1$ is associated with an edge weight (formula image BDA0002999835890000052), and the quantity shown in formula image BDA0002999835890000053 is the distance between city $u_1$ and city $v_1$.

FIG. 1(a) shows the distance-weighted network of the GPN, FIG. 1(b) shows the geographic proximity network of the GPN, and FIG. 1(c) shows the transition probabilities of network node D; in the figures, A, B, C and D represent different cities. A biased random walk is performed on the GPN, with transition probabilities proportional to the edge weights. City node sequences are generated according to these transition probabilities (as shown in FIG. 1(c)), and the attribute Skip-Gram model is then applied to the sequences. Two cities that are geographically close are more likely to co-occur in the sampled sequences and therefore obtain similar vector representations.
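A minimal sketch of the biased random walk on the GPN is given below. The edge-weight formula is in an omitted formula image, so the sketch assumes a weight equal to the inverse of the distance; the four-city distance matrix and the function names are hypothetical.

```python
import numpy as np

def transition_probabilities(weights):
    """Normalize each row of a non-negative weight matrix to sum to 1."""
    weights = np.asarray(weights, dtype=float)
    return weights / weights.sum(axis=1, keepdims=True)

def biased_random_walk(probs, start, length, rng):
    """Generate a node sequence whose steps follow the transition matrix."""
    walk = [start]
    while len(walk) < length:
        current = walk[-1]
        walk.append(int(rng.choice(len(probs), p=probs[current])))
    return walk

# Hypothetical distances (km) between cities A, B, C, D (indices 0..3).
distances = np.array([
    [0.0, 150.0, 300.0, 100.0],
    [150.0, 0.0, 250.0, 200.0],
    [300.0, 250.0, 0.0, 400.0],
    [100.0, 200.0, 400.0, 0.0],
])
# Assumed edge weight: inverse distance, with no self-loops.
weights = np.zeros_like(distances)
mask = distances > 0
weights[mask] = 1.0 / distances[mask]

probs = transition_probabilities(weights)
walk = biased_random_walk(probs, start=3, length=10, rng=np.random.default_rng(0))
```

Sequences such as `walk` would then be fed to a Skip-Gram style model so that nearby cities, which co-occur more often, receive similar vectors.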

(3) Institutional proximity feature representation:

Research collaboration is easier when research subjects share a similar cultural background, and language is the core of culture. Jacques Melitz integrated common native language, common spoken language, common official language and linguistic similarity into a continuous aggregate index of common language, which outperforms the traditional dummy variable model. This index is therefore used to measure the institutional proximity of research subjects.

The Institutional Proximity Network (IPN) is defined as $G_I = (V_I, E_I)$, where $V_I$ is the set of institutional nodes and each $v_2 \in V_I$ represents a specific country; $E_I$ is the set of institutional edges and each institutional edge $e_2 \in E_I$ represents the institutional proximity between two countries, $e_2 = (u_2, v_2)$, where $u_2$ is a country other than $v_2$. Each edge $e_2$ is associated with an edge weight (formula image BDA0002999835890000061), and the quantity shown in formula image BDA0002999835890000062 is the continuous aggregate index of common language between the pair of countries.

FIG. 2(a) shows the institutional proximity network and FIG. 2(b) shows the transition probabilities of network node d. Node sequences are generated on the IPN by random walks, and the attribute Skip-Gram algorithm is then applied to the node sequences. Two countries with greater institutional proximity are more likely to co-occur in the sampled sequences and therefore obtain similar representations.

2. Scientific research collaboration network (coauthorship network)

The scientific research collaboration network reflects social proximity and is built from the co-authorship information of papers published jointly by research subjects. FIG. 3 is a schematic diagram of its construction: when two research subjects have one co-authored paper, they are connected by an edge of weight 1; when two research subjects have k co-authored papers, they are connected by an edge of weight k.
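The edge-weighting rule above can be illustrated with a short sketch. The input format (one author list per paper) and the function name `build_coauthor_edges` are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations

def build_coauthor_edges(papers):
    """Return a Counter mapping each author pair to its number of joint papers.

    papers is an iterable of author lists, one list per paper. Each pair is
    stored as a sorted tuple so that (a, b) and (b, a) are the same edge.
    """
    edges = Counter()
    for authors in papers:
        for a, b in combinations(sorted(set(authors)), 2):
            edges[(a, b)] += 1  # one more co-authored paper for this pair
    return edges
```

For instance, two papers written jointly by `a1` and `a2` produce a single edge `("a1", "a2")` with weight 2, while a single-author paper contributes no edge.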

3. Problem statement

$G = (A, E, X)$ is an attributed scientific research collaboration network, where $A = \{a_1, a_2, \ldots, a_n\}$ is the set of authors; each edge $e_{ij} = (a_i, a_j) \in E$ represents a research collaboration between authors $a_i$ and $a_j$; $X \in \mathbb{R}^{n \times m}$ is the node attribute matrix; and $x_i$ is the proximity feature vector of author $a_i$. The node attribute information is the union of the cognitive, geographic and institutional proximity vectors. The goal is to learn a mapping function $f: a_i \to h_i \in \mathbb{R}^d$ that represents each author $a_i$ as a low-dimensional vector $h_i$, where $d \ll n$, while preserving the network features and the node attribute information. The network features include local and global network features: the local network features indicate whether an edge exists between two authors, and the global network features describe the high-order neighborhood similarity of the nodes.

4. Autoencoder based on proximity features

The proximity features are attribute features, and an autoencoder is used to preserve the node attribute information. The autoencoder model consists of three layers: an input layer, a hidden layer and an output layer. The representation functions of the autoencoder are:

$h_i = \sigma_1(W^{(1)} x_i + b^{(1)})$

[Formula image BDA0002999835890000071]

where $x_i$ is the author's proximity feature vector; $h_i \in \mathbb{R}^d$ is the hidden-layer representation of the encoder; the quantity shown in formula image BDA0002999835890000072 is the reconstruction produced by the decoder; $\theta = \{W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}\}$ are the model parameters; and $\sigma_1(\cdot)$ is the tanh activation function.

The model parameters are learned by minimizing the following loss function:

[Formula image BDA0002999835890000073]

To preserve the high nonlinearity in the attribute information, K hidden layers are used in the encoder:

[Formula image BDA0002999835890000074]

[Formula image BDA0002999835890000075]

where the quantity shown in formula image BDA0002999835890000076 is the desired low-dimensional hidden representation of author $a_i$. Correspondingly, K hidden layers are also used in the decoder.
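The encoder and decoder described above can be sketched in NumPy as follows. Only the first encoder equation appears in the text; the decoder form (tanh layers mirroring the encoder), the squared-error reconstruction loss and all function names are assumptions standing in for the omitted formula images.

```python
import numpy as np

def init_layers(sizes, rng):
    """Create (W, b) pairs for consecutive layer sizes, e.g. [8, 6, 4]."""
    return [(rng.normal(0.0, 0.1, size=(n_out, n_in)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(x, layers):
    """Apply h = tanh(W h_prev + b) for every layer in turn."""
    for W, b in layers:
        x = np.tanh(W @ x + b)
    return x

def reconstruction_loss(X, encoder, decoder):
    """Sum of squared reconstruction errors over all authors (assumed form)."""
    loss = 0.0
    for x in X:
        h = forward(x, encoder)        # low-dimensional representation h_i
        x_hat = forward(h, decoder)    # reconstruction of x_i
        loss += np.sum((x_hat - x) ** 2)
    return loss
```

A usage example with toy sizes: `encoder = init_layers([8, 6, 4], rng)` and `decoder = init_layers([4, 6, 8], rng)` give a K = 2 encoder that maps 8-dimensional attribute vectors to 4-dimensional representations.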

5. Local network features

The network features include local network features and global network features. The local network features are handled as follows.

The following likelihood is maximized to preserve the local network features: $L_f = \prod_{e_{ij} > 0} p_{ij}$;

where $p_{ij}$ is the joint probability of $a_i$ and $a_j$:

[Formula image BDA0002999835890000077]

Equivalently, the negative likelihood can be minimized:

[Formula image BDA0002999835890000078]
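As an illustration of the local objective, the sketch below assumes that the joint probability is a sigmoid of the inner product of the two hidden representations (a common first-order proximity choice) and that the minimized quantity is the negative log of $L_f$. Both forms are assumptions, because the corresponding formula images are not reproduced here.

```python
import numpy as np

def local_loss(H, edges):
    """Negative log-likelihood of the observed edges.

    H is an (n, d) array of hidden representations and edges is a list of
    (i, j) index pairs with e_ij > 0. Assumes p_ij = sigmoid(h_i . h_j).
    """
    loss = 0.0
    for i, j in edges:
        score = H[i] @ H[j]
        # -log(sigmoid(score)) computed stably as log(1 + exp(-score))
        loss += np.logaddexp(0.0, -score)
    return loss
```

Under this assumption, authors who are connected by an edge are pushed toward representations with a large inner product, since that lowers the loss.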

6. Global network features

To preserve the global network features, an attribute-based Skip-Gram model is used. For all random walk sequences $c \in C$, given the current node $a_i$ and its proximity features $x_i$, the following negative log-likelihood is minimized:

[Formula image BDA0002999835890000079]

where $a_{i+j}$ is a context node in the generated node sequence; $w$ is the window size; the conditional probability $p(a_{i+j} \mid x_i)$ is the likelihood of the context $a_{i+j}$ given the proximity vector of node $i$; $G$ is the set of all network nodes; and $C$ is the set of all random walk sequences;

[Formula image BDA0002999835890000081]

where $f(\cdot)$ is the function computed by the encoder part of the autoencoder model, and the quantity shown in formula image BDA0002999835890000082 is the representation corresponding to the context node shown in formula image BDA0002999835890000083. Because this formula is expensive to compute, negative sampling is used and $p(a_{i+j} \mid x_i)$ is replaced with:

[Formula image BDA0002999835890000084]

where $\sigma_2(\cdot)$ is the sigmoid activation function; $|neg|$ is the number of negative samples; the symbol shown in formula image BDA0002999835890000085 denotes the expectation; the noise distribution is given in formula image BDA0002999835890000086; and $d_a$ is the degree of node $a$.
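The negative-sampling step can be sketched as below. The noise distribution proportional to $d_a^{3/4}$ and the per-pair loss are the standard Skip-Gram negative-sampling forms and are assumed here, since the patent's own expressions are in omitted formula images; the function names are hypothetical.

```python
import numpy as np

def noise_distribution(degrees, power=0.75):
    """Sampling probabilities for negative nodes, assumed proportional to d_a^0.75."""
    scaled = np.asarray(degrees, dtype=float) ** power
    return scaled / scaled.sum()

def neg_sampling_loss(center, context, negatives):
    """Loss for one (node, context) pair with a set of sampled negatives.

    center is f(x_i), context is the vector of the true context node, and
    negatives is a (k, d) array of vectors for the sampled negative nodes.
    """
    # -log sigmoid(context . center)
    positive = np.logaddexp(0.0, -(context @ center))
    # -sum log sigmoid(-negative . center)
    negative = np.logaddexp(0.0, negatives @ center).sum()
    return positive + negative
```

Negative node indices would be drawn with, for example, `rng.choice(n, size=k, p=noise_distribution(degrees))`, so that high-degree authors are sampled more often.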

7. Joint optimization framework

FIG. 4 shows the joint optimization framework. Because the autoencoder model, the joint probability model and the attribute-aware Skip-Gram model share the same encoder layers, the three models are tightly coupled. The final representation (formula image BDA0002999835890000087) captures both the network features and the node attribute information.

The three objective functions are combined to obtain the overall objective function of the joint model:

[Formula image BDA0002999835890000088]

The loss function above is optimized with the stochastic gradient descent algorithm, iteratively optimizing the two coupled components ($\alpha L_f + \beta L_{ae} + \gamma L_{reg}$, and $L_h$); $L_{reg}$ is given by:

[Formula image BDA0002999835890000089]
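The way the loss terms are combined can be sketched as follows. The overall objective and $L_{reg}$ are given only as formula images, so the unweighted $L_h$ term, the half-sum-of-squares regularizer and the function names are assumptions; the default weights follow the values reported in the Example (α = 1, β = 10, γ = 10).

```python
import numpy as np

def l2_regularizer(weight_matrices):
    """Assumed L2-norm penalty: half the sum of squared weights."""
    return 0.5 * sum(np.sum(W ** 2) for W in weight_matrices)

def total_objective(l_h, l_f, l_ae, l_reg, alpha=1.0, beta=10.0, gamma=10.0):
    """Weighted sum of the global, local, autoencoder and regularization losses."""
    return l_h + alpha * l_f + beta * l_ae + gamma * l_reg
```

In a training loop, one step would update the shared encoder using the `alpha * l_f + beta * l_ae + gamma * l_reg` component and another step would use `l_h`, alternating between the two as the text describes.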

8. Scientific research collaboration recommendation

Based on the low-dimensional hidden representation of author $a_i$ (formula image BDA0002999835890000091) and the low-dimensional hidden representation of author $a_j$ (formula image BDA0002999835890000092), the cosine similarity between $h_i$ and $h_j$ is calculated (formula image BDA0002999835890000093). Finally, the top k most similar authors are recommended to the target author as potential research collaborators.
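The recommendation step can be illustrated with a short NumPy sketch. The function name `recommend_top_k` and the small constant added to avoid division by zero are choices made for this example only.

```python
import numpy as np

def recommend_top_k(H, target, k):
    """Return the indices of the k authors most similar to the target author.

    H is an (n, d) array whose rows are the learned hidden representations.
    Similarity is the cosine of the angle between two rows.
    """
    norms = np.linalg.norm(H, axis=1)
    sims = (H @ H[target]) / (norms * norms[target] + 1e-12)
    sims[target] = -np.inf  # never recommend the target author to themselves
    return [int(i) for i in np.argsort(-sims)[:k]]
```

In practice, authors who already collaborate with the target could also be filtered out before ranking if only new links are of interest; the text does not state whether this is done.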

Example

Data source: the Web of Science Core Collection was used to collect papers published from 2010 to 2019 in pharmaceutical categories ("Biochemistry & Molecular Biology", "Medicine, Research & Experimental", "Pharmacology & Pharmacy" and "Toxicology"). The search query was "WC=A AND PY=B AND LANGUAGE='English'", where A is the pharmaceutical field and B is 2010 to 2019. Single-author papers and non-journal papers were excluded, leaving 528,118 papers. After author name disambiguation, 162,196 authors who had published more than 5 papers were selected. Geographic information was obtained with the Google Maps API, and language information was obtained from the CEPII language data. The dataset was divided into two parts by publication year: data from before 2018 served as the training set, and data from 2018 to 2019 served as the test set.

Implementation: a three-layer autoencoder was built, with hidden dimensions $d^{(1)} = 600$, $d^{(2)} = 512$ and $d^{(3)} = 256$ for the first, second and third layers, respectively. The geographic proximity vector and the institutional proximity vector each had 64 dimensions, and the cognitive proximity vector had 256 dimensions. The weight of the local network feature loss function was set to α = 1, the weight of the autoencoder loss function was set to β = 10, and the weight γ of the L2-norm regularization was set to 10. 100 authors were randomly selected as target nodes, and ARCR was run.
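The setup described above can be sketched as follows. Treating the "union" of the three proximity vectors as a concatenation (giving 256 + 64 + 64 = 384 input dimensions) is an assumption, as are the record format with a `year` field and the function names.

```python
import numpy as np

# Dimensions reported in the Example.
COGNITIVE_DIM, GEO_DIM, INST_DIM = 256, 64, 64
HIDDEN_DIMS = (600, 512, 256)

def build_attribute_vector(cognitive, geographic, institutional):
    """Concatenate the three proximity vectors into one attribute vector x_i."""
    parts = [np.asarray(cognitive), np.asarray(geographic), np.asarray(institutional)]
    expected = (COGNITIVE_DIM, GEO_DIM, INST_DIM)
    if tuple(len(p) for p in parts) != expected:
        raise ValueError(f"expected vector lengths {expected}")
    return np.concatenate(parts)

def split_by_year(papers, first_test_year=2018):
    """Papers before first_test_year form the training set, the rest the test set."""
    train = [p for p in papers if p["year"] < first_test_year]
    test = [p for p in papers if p["year"] >= first_test_year]
    return train, test
```

Splitting by publication year rather than at random keeps the evaluation faithful to link prediction: the model is trained on past collaborations and tested on collaborations that appear later.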

In summary, the present invention has the following advantages:

From the perspective of multidimensional proximity, the present invention comprehensively covers the attribute features of research collaborators. Cognitive proximity, geographic proximity and institutional proximity are pre-trained and represented as low-dimensional vectors, so that the attributes specific to each author are taken into account without destroying the network features. The retained network features comprise the local network and the global network: the local network accurately reflects each author's willingness to collaborate, while the global network better highlights the similarity of their research. On this basis, optimization targets the minimum of the sum of the network feature and node attribute feature loss functions, which improves the accuracy of collaboration network link prediction.

Those skilled in the art will readily understand that the above is merely a preferred embodiment of the present invention and is not intended to limit it. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (6)

1. A collaborative network link prediction method based on a multidimensional proximity attribute network, characterized by comprising the following steps:
preserving multidimensional proximity features, local network features and global network features by using an autoencoder model, a joint probability model and an attribute Skip-Gram model, respectively; wherein the multidimensional proximity features include cognitive proximity features, geographic proximity features, and institutional proximity features;
combining the loss function of the autoencoder model, the loss function of the local network features, the loss function of the global network features and the L2-norm regularization loss into an overall objective function, and optimizing the overall objective function by stochastic gradient descent, thereby realizing representation learning of the network nodes;
performing collaborative network link prediction through the cosine similarity of the vectors corresponding to the network nodes;
wherein each network node represents an author; the multidimensional proximity feature representation includes a cognitive proximity feature representation, a geographic proximity feature representation, and an institutional proximity feature representation; the cognitive proximity features characterize the cognitive level of the authors in the scientific domain; the geographic proximity features characterize the positional relationship among the authors; the institutional proximity features characterize the similarity of the languages of the places where the authors are located; the local network features represent the collaborative relationships existing among the authors; the global network features represent research similarity through likelihood values of the author proximity vectors;
wherein the geographic proximity network in the geographic proximity features is defined as $G_G = (V_G, E_G)$, where $V_G$ is the set of geographic nodes and each $v_1 \in V_G$ represents a city; $E_G$ is the set of geographic edges, and each $e_1 \in E_G$ represents the geographic relationship between two cities, $e_1 = (u_1, v_1)$, where $u_1$ is a city different from $v_1$; edge $e_1$ is associated with an edge weight
[Figure FDA0004178690190000011]
where
[Figure FDA0004178690190000012]
is the distance between city $u_1$ and city $v_1$; a biased random walk is performed on the geographic proximity network, with the transition probability proportional to the edge weight; a city node sequence is generated according to the transition probability, and finally the attribute Skip-Gram model is executed; two cities that are geographically closer are more likely to co-occur in the sampling result and therefore obtain similar vector representations; wherein the institutional proximity network in the institutional proximity features is defined as $G_I = (V_I, E_I)$, where $V_I$ is the set of institutional nodes and each $v_2 \in V_I$ represents a specific country; $E_I$ is the set of institutional edges; each institutional edge $e_2 \in E_I$ represents the institutional proximity between two countries, $e_2 = (u_2, v_2)$, where $u_2$ is a country different from $v_2$; edge $e_2$ is associated with an institutional edge weight
[Figure FDA0004178690190000021]
where
[Figure FDA0004178690190000022]
is the continuous aggregated index of common language between the country pair; a random walk is performed on the institutional proximity network to generate a country node sequence, and the attribute Skip-Gram algorithm is executed on the country node sequence; two countries with greater institutional proximity are more likely to co-occur in the sampling result and therefore obtain similar representations;
the autoencoder model, the joint probability model and the attribute Skip-Gram model share the same encoder layer;
the overall objective function is:
$L = L_h + \alpha L_f + \beta L_{ae} + \gamma L_{reg}$
the overall objective function is optimized with a stochastic gradient descent algorithm by iteratively optimizing the two coupled components $\alpha L_f + \beta L_{ae} + \gamma L_{reg}$ and $L_h$; where $L_f$ is the loss function of the local network features; $L_h$ is the loss function of the global network features; $L_{ae}$ is the loss function of the autoencoder model; and $L_{reg}$ is the L2-norm regularization loss.
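The biased random walk on the geographic proximity network in claim 1 can be sketched as below. The claim only states that the transition probability is proportional to the edge weight and that closer cities should co-occur more often; the exact weight formula is in an image that did not survive extraction, so the choice weight = 1 / distance is an assumption, as are the city names, distances and function names.

```python
import random

def biased_random_walk(graph, start, length, rng):
    """Walk up to `length` nodes; the next node is drawn with probability
    proportional to the weight of the edge leading to it."""
    walk = [start]
    while len(walk) < length:
        neighbors = graph.get(walk[-1], {})
        if not neighbors:
            break  # dead end: no outgoing edges
        nodes = list(neighbors)
        weights = [neighbors[n] for n in nodes]
        walk.append(rng.choices(nodes, weights=weights, k=1)[0])
    return walk

# Hypothetical inter-city distances in km (illustrative values only).
distances = {
    ("Wuhan", "Changsha"): 300.0,
    ("Wuhan", "Beijing"): 1050.0,
    ("Changsha", "Beijing"): 1340.0,
}

# Undirected weighted graph; weight = 1 / distance is an assumed form.
graph = {}
for (u, v), d in distances.items():
    graph.setdefault(u, {})[v] = 1.0 / d
    graph.setdefault(v, {})[u] = 1.0 / d

rng = random.Random(0)  # fixed seed so the sketch is reproducible
walks = [
    biased_random_walk(graph, city, length=5, rng=rng)
    for city in graph
    for _ in range(10)
]
```

The resulting `walks` play the role of the city node sequences that are then fed to the Skip-Gram model; the same routine could be reused for the institutional proximity network with common-language index values as weights.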
2. The collaborative network link prediction method according to claim 1, wherein the institutional proximity is measured by a continuous aggregated index of common language.
3. The collaborative network link prediction method according to claim 2, wherein the autoencoder model is: $h_i = \sigma_1(W^{(1)} x_i + b^{(1)})$,
[Figure FDA0004178690190000023]
wherein $x_i$ is the author's proximity feature vector and $h_i$ is the hidden-layer representation of the encoder;
[Figure FDA0004178690190000024]
is the reconstruction produced by the decoder; $\theta = \{W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}\}$ are the model parameters; and $\sigma_1(\cdot)$ is the tanh activation function.
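A minimal sketch of the encoder and decoder in claim 3 follows. The encoder line mirrors the stated formula $h_i = \sigma_1(W^{(1)} x_i + b^{(1)})$ with tanh. The decoder formula is in an image that is not recoverable here, so the form x_hat = tanh(W2 h + b2) is an assumption, as are the toy dimensions and the random initialization.

```python
import math
import random

def affine(W, b, x):
    """Compute W x + b for a matrix W (list of rows) and vectors b, x."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def encode(W1, b1, x):
    """h = tanh(W1 x + b1), the encoder stated in claim 3."""
    return [math.tanh(z) for z in affine(W1, b1, x)]

def decode(W2, b2, h):
    """x_hat = tanh(W2 h + b2); decoder form and activation are assumed."""
    return [math.tanh(z) for z in affine(W2, b2, h)]

def random_matrix(rows, cols, rng):
    """Small random weights in [-0.1, 0.1] (illustrative initialization)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
in_dim, hidden_dim = 6, 3  # toy sizes, far smaller than in the embodiment
W1, b1 = random_matrix(hidden_dim, in_dim, rng), [0.0] * hidden_dim
W2, b2 = random_matrix(in_dim, hidden_dim, rng), [0.0] * in_dim

x = [0.5, -0.2, 0.1, 0.0, 0.3, -0.4]  # hypothetical proximity feature vector
h = encode(W1, b1, x)
x_hat = decode(W2, b2, h)
```

In the embodiment, `h` would be the representation shared with the joint probability model and the Skip-Gram model, since all three use the same encoder layer.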
4. The collaborative network link prediction method according to claim 3, wherein the loss function of the autoencoder model is:
Figure FDA0004178690190000025
where n is the total number of authors.
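Because the loss formula of claim 4 is only available as an image, the sketch below assumes the usual autoencoder reconstruction error: the sum, over all n authors, of the squared Euclidean distance between each input vector and its reconstruction. Any scaling constant in the original formula is not known, and the function name is illustrative.

```python
def reconstruction_loss(X, X_hat):
    """Sum over authors of squared distances between x_i and its reconstruction."""
    return sum(
        (a - b) ** 2
        for x, x_hat in zip(X, X_hat)
        for a, b in zip(x, x_hat)
    )

# Two toy authors (hypothetical vectors); only the second is reconstructed imperfectly.
X = [[1.0, 0.0], [0.0, 1.0]]
X_hat = [[1.0, 0.0], [0.0, 0.0]]
loss = reconstruction_loss(X, X_hat)
```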
5. The collaborative network link prediction method according to claim 2, wherein the loss function of the local network features is:
[Figure FDA0004178690190000032]
wherein $p_{ij}$ is the joint probability of author $a_i$ and author $a_j$; and $e_{ij}$ is the edge connecting author $a_i$ and author $a_j$.
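The formula of claim 5 is likewise only present as an image. The sketch below assumes a widely used first-order proximity form: the joint probability of two authors is the sigmoid of the dot product of their hidden representations, and the loss is the negative sum of log joint probabilities over the observed collaboration edges. Both the form and the names are assumptions, not the patent's confirmed definition.

```python
import math

def joint_probability(h_i, h_j):
    """Sigmoid of the dot product of two representations (assumed form of p_ij)."""
    dot = sum(a * b for a, b in zip(h_i, h_j))
    return 1.0 / (1.0 + math.exp(-dot))

def local_loss(edges, embeddings):
    """Negative log-likelihood of the observed collaboration edges."""
    return -sum(
        math.log(joint_probability(embeddings[i], embeddings[j]))
        for i, j in edges
    )

# One collaboration edge and two candidate embeddings (hypothetical values).
edges = [("a1", "a2")]
aligned = {"a1": [1.0, 0.0], "a2": [1.0, 0.0]}
opposed = {"a1": [1.0, 0.0], "a2": [-1.0, 0.0]}
loss_aligned = local_loss(edges, aligned)
loss_opposed = local_loss(edges, opposed)
```

Under this form, the loss is lower when co-authors have similar representations, so minimizing it pulls collaborating authors together in the embedding space.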
6. The collaborative network link prediction method according to claim 2, wherein the loss function of the global network features is:
[Figure FDA0004178690190000031]
wherein $a_{i+j}$ is a context node in the generated node sequence and $w$ is the window size; the conditional probability $p(a_{i+j} \mid x_i)$ is the likelihood of context $a_{i+j}$ given the proximity vector of node $i$; $G$ is the set of all network nodes; $C$ is the set of all random walk sequences; and $a_i$ represents an author.
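The definitions in claim 6 match a Skip-Gram negative log-likelihood over random walk sequences. The sketch below assumes that the conditional probability is a softmax over all nodes in G, computed from the encoder output of the center node and a separate context vector per node. The softmax form, the context vectors and all names are assumptions, since the formula image is not recoverable.

```python
import math

def log_softmax_prob(h_center, context_vectors, target):
    """log p(target | center) under a softmax over all nodes (assumed form)."""
    scores = {
        node: sum(a * b for a, b in zip(h_center, vec))
        for node, vec in context_vectors.items()
    }
    m = max(scores.values())  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    return scores[target] - log_z

def global_loss(walks, window, hidden, context_vectors):
    """Negative log-likelihood of every context node within `window` steps."""
    loss = 0.0
    for walk in walks:
        for i, center in enumerate(walk):
            for j in range(-window, window + 1):
                if j == 0 or not 0 <= i + j < len(walk):
                    continue
                loss -= log_softmax_prob(hidden[center], context_vectors, walk[i + j])
    return loss

# Toy inputs (hypothetical): encoder outputs and context vectors for three authors.
hidden = {"a": [0.2, 0.1], "b": [0.1, 0.3], "c": [-0.2, 0.4]}
context = {"a": [0.3, 0.0], "b": [0.0, 0.3], "c": [-0.1, 0.2]}
loss = global_loss([["a", "b", "c"]], window=1, hidden=hidden, context_vectors=context)
```

A full softmax over all nodes is expensive for large networks; practical Skip-Gram implementations often replace it with negative sampling or hierarchical softmax, which the patent text shown here does not specify.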
CN202110343021.3A 2021-03-30 2021-03-30 A Collaborative Network Link Prediction Method Based on Multi-Dimensional Proximity Attribute Network Active CN112989199B (en)

Publications (2)

Publication Number Publication Date
CN112989199A CN112989199A (en) 2021-06-18
CN112989199B true CN112989199B (en) 2023-05-30

Family

ID=76338483

Country Status (1)

Country Link
CN (1) CN112989199B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8732101B1 (en) * 2013-03-15 2014-05-20 Nara Logics, Inc. Apparatus and method for providing harmonized recommendations based on an integrated user profile
CN109145087A (en) * 2018-07-30 2019-01-04 大连理工大学 A kind of scholar's recommendation and collaborative forecasting method based on expression study and competition theory
CN111368074A (en) * 2020-02-24 2020-07-03 西安电子科技大学 A Link Prediction Method Based on Network Structure and Text Information
CN112069306A (en) * 2020-07-22 2020-12-11 中国科学院计算机网络信息中心 Paper partner recommendation method based on author writing tree and graph neural network

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8170971B1 (en) * 2011-09-28 2012-05-01 Ava, Inc. Systems and methods for providing recommendations based on collaborative and/or content-based nodal interrelationships
CN109101629A (en) * 2018-08-14 2018-12-28 合肥工业大学 A kind of network representation method based on depth network structure and nodal community
US11544530B2 (en) * 2018-10-29 2023-01-03 Nec Corporation Self-attentive attributed network embedding
CN110245682B (en) * 2019-05-13 2021-07-27 华中科技大学 A Topic-Based Network Representation Learning Method
CN111709474A (en) * 2020-06-16 2020-09-25 重庆大学 A Graph Embedding Link Prediction Method Fusing Topology Structure and Node Attributes
CN112256870A (en) * 2020-10-15 2021-01-22 大连理工大学 Attribute network representation learning method based on self-adaptive random walk

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant