CN112380360B - Node query method based on meta-path in heterogeneous information network

Info

Publication number: CN112380360B
Application number: CN202011260846.0A
Authority: CN
Inventors: 汤颖; 徐珊
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2020-11-12
Filing date: 2020-11-12
Publication date: 2022-03-18
Anticipated expiration: 2040-11-12
Also published as: CN112380360A

Abstract

A meta-path-based method for querying similar nodes in a heterogeneous network, comprising: (1) generating a path greedy tree: the greedy tree is expanded according to the input source node and short text description, and semantic matching of the short text is performed during expansion; (2) determining the meta-path sequences: the greedy tree is traversed to obtain the edge type sequences, the node type sequences are then determined from the edge type sequences, and the paths connecting the input node pair are separated from the generated tree; (3) calculating the importance of each meta-path: a formula for meta-path importance is defined from the factors that affect it, and the importance is calculated using the number of instance nodes in the leaf nodes of the greedy tree; (4) combining multiple meta-paths to generate query instances: instance node pairs that conform to the semantics of a meta-path are highly similar under that meta-path, so the query results are obtained by finding the node pairs that are highly similar under every meta-path.

Description

Node Query Method Based on Meta-Path in Heterogeneous Information Network

Technical Field

The invention relates to a meta-path-based node query method for heterogeneous information networks.

Background

Real-world systems typically consist of a large number of interacting components of multiple types, as in human social activity, communication networks, and biological networks. In such systems the components are connected to one another to form a network, usually called an information network. Most traditional information networks are homogeneous: the nodes are entities of a single type, connected by relationships of a single type. In practice, however, most information networks are heterogeneous, meaning that their nodes and relationships are of more than one type. Ignoring the types of nodes or edges in such a network often discards the important semantic information they carry. Data in many fields takes the form of a heterogeneous information network, for example academic literature networks, medical information networks, and Twitter networks.

To represent the different semantic relationships between two nodes in a heterogeneous information network, the concept of a meta-path was proposed and has been applied to heterogeneous information network mining tasks. A meta-path is a sequence of node types and edge types and can be regarded as a simple graph pattern expressing a particular semantics. Because meta-paths express the connection semantics between nodes of different types well, they are widely used in heterogeneous information network mining, especially in node query tasks based on meta-path similarity.

In existing work, meta-paths are mostly specified by domain experts for a particular application scenario, so they usually do not generalize: different tasks and datasets require different meta-paths. Some methods generate meta-paths automatically, for example by treating meta-path length as a parameter and generating all meta-paths within a maximum length. This keeps the number of meta-paths finite, but it does not guarantee that every meta-path within the length threshold is a "key" path. Automatic generation methods also mostly focus on expressing the connections between nodes as meta-paths and ignore the influence of the nodes' own attributes on those connections. These limitations in how meta-paths are generated and represented restrict the accuracy of node queries.

Summary of the Invention

To ensure that the generated meta-paths are key meta-paths, and to overcome the problem that meta-path semantics do not include the attributes of the nodes themselves, the present invention provides a node query method based on key meta-paths for star-structured film and television heterogeneous information networks. The key meta-paths are generated automatically by combining node attributes with meta-path importance.

The invention defines the importance of a meta-path by combining its length, rarity, and strength, and incorporates the short text describing movie content when calculating the strength support function. Meta-paths are generated by expanding a greedy tree, in four stages: the first stage generates the path greedy tree from the input source node, target node, and short text description; the second stage generates the type sequences of the meta-paths from the resulting path greedy tree; the third stage calculates the importance of each meta-path; the fourth stage combines multiple meta-paths to perform the node query.

The overall process of the meta-path-based method for querying similar nodes in a heterogeneous network comprises the following steps:

Step 1: Generate the path greedy tree. The greedy tree is expanded according to the input source node and short text description, and semantic matching of the short text is performed during expansion.

1.1 Build the root node of the greedy tree. Each object node of the greedy tree holds two pieces of information. The first is the list of node pairs generated during path expansion; node pairs are stored as a dictionary, with the source node as the key and the target nodes as the value. The second is a flag indicating whether the current greedy tree object can still be expanded downward: True means the object can continue to expand; False means the object is the end point of a path or the meta-path length threshold has been reached. Edges connecting greedy tree objects are labeled with the edge types of the heterogeneous information network. Because the root node has not yet been expanded, the value corresponding to its source node is empty.
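The object described in step 1.1 can be pictured as a small record type. The following is a minimal sketch and not the patented implementation; the class name `GreedyNode`, its field names, and the helper `make_root` are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class GreedyNode:
    """One object in the path greedy tree (hypothetical class name)."""
    # Source instance -> target instances reached so far along this branch.
    pairs: Dict[str, List[str]]
    # True: the object can still be expanded downward.
    # False: it is a path end point or the length threshold was reached.
    expandable: bool = True
    # Edge type labelling the edge from the parent; None for the root.
    edge_type: Optional[str] = None
    children: List["GreedyNode"] = field(default_factory=list)


def make_root(source: str) -> GreedyNode:
    """The root has not been expanded, so the source maps to an empty value."""
    return GreedyNode(pairs={source: []})
```

Keeping the edge type on the child object, rather than on a separate edge object, is only a convenience of this sketch.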

1.2 Recursively expand the greedy tree. During expansion, the edge type of the greedy tree is used to determine whether the next node is a movie node. If it is, the semantic matching process of step 1.3 is performed; if it is not, the recursive expansion of step 1.2 continues. Expansion stops when the target node appears in the value list of a leaf node of the greedy tree or the path reaches the length threshold.
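One way to read the recursive expansion is as a depth-limited search that groups neighbours by edge type. The sketch below runs on a toy graph and makes several assumptions that are not in the text: the graph data, the dictionary layout of each tree object, the simplification of tracking reached instances as a flat set instead of a source-to-target dictionary, and the predicate `movie_ok`, which stands in for the semantic match of step 1.3.

```python
from collections import defaultdict

# Toy heterogeneous graph (hypothetical data): node -> [(edge_type, neighbour)].
GRAPH = {
    "actor_a": [("acts_in", "movie_1"), ("acts_in", "movie_2")],
    "movie_1": [("acted_by", "actor_a"), ("acted_by", "actor_b"),
                ("directed_by", "director_x")],
    "movie_2": [("acted_by", "actor_a"), ("directed_by", "director_x")],
    "actor_b": [("acts_in", "movie_1")],
    "director_x": [("directs", "movie_1"), ("directs", "movie_2")],
}
NODE_TYPE = {"actor_a": "actor", "actor_b": "actor", "director_x": "director",
             "movie_1": "movie", "movie_2": "movie"}


def expand(frontier, target, depth, max_len, movie_ok):
    """Return the child objects reachable from `frontier`, one per edge type."""
    children = []
    if depth >= max_len:
        return children
    reached = defaultdict(set)
    for node in frontier:
        for edge_type, nxt in GRAPH.get(node, []):
            # Hook for step 1.3: movie instances must pass the semantic match.
            if NODE_TYPE[nxt] == "movie" and not movie_ok(nxt):
                continue
            reached[edge_type].add(nxt)
    for edge_type, nodes in sorted(reached.items()):
        # Stop when the target is reached or the length threshold is hit.
        done = target in nodes or depth + 1 == max_len
        children.append({
            "edge_type": edge_type,
            "instances": sorted(nodes),
            "expandable": not done,
            "children": [] if done else expand(nodes, target, depth + 1,
                                               max_len, movie_ok),
        })
    return children


tree = expand({"actor_a"}, target="actor_b", depth=0, max_len=2,
              movie_ok=lambda m: m == "movie_1")
```

Here `movie_2` is pruned by the stand-in predicate, so only the branch through `movie_1` is expanded further.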

1.3 Semantically match the short text input with the query against the movie synopses. A film and television information network is a star-structured network: there is one central object type, all other object types are connected to it, and the attributes of the central object can affect every type of relationship in the network. In a film and television information network the central object is the movie. Connections between filmmakers all arise through movies, and movie content carries rich semantics that can bring out, from the perspective of node attributes, the characteristics of the connections between nodes. A movie synopsis summarizes the movie's content in a short passage and can be treated as short text data; the query text is also short text data. Finding the movie synopses that are semantically similar to the short text entered with the query therefore yields the movie content that matches the query semantics, from which meta-paths conforming to the short text semantics are generated. The semantic matching between a movie synopsis and the user's short text proceeds as follows:

(1) Segment the short text input with the query using the TextRank-based open-source jieba word segmentation algorithm. Denote the input short text as Q; after segmentation it is represented by the word sequence [q_0, q_1, …, q_i, …, q_n], where q_i is the i-th word and n is the length of the word sequence.

(2) Use the Directional Skip-Gram (DSG) model to obtain the word vector of each word, denoted V_{q_i}.

(3) After the word vectors are obtained, compute their mean by formula (1) to obtain the sentence vector.
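Formula (1) itself does not appear in this text. Going only by the description in step (3), a plausible reconstruction is the plain arithmetic mean of the word vectors; the symbol V_Q for the query sentence vector and the use of |Q| for the number of words are assumptions made here for readability.

```latex
V_Q = \frac{1}{|Q|} \sum_{q_i \in Q} V_{q_i} \qquad (1)
```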

(4) Segment the movie's synopsis. Denote a movie synopsis as T; segmentation gives the word sequence [t_0, …, t_j, …, t_m], and the TF-IDF value of each word is used as its weight, giving the weight sequence [w_0, …, w_j, …, w_m].

(5) Apply named entity recognition to handle personal names in the movie synopsis, and delete the words recognized as personal names from the segmentation result.

(6) Perform part-of-speech analysis on the segmented synopsis words, filter out verbs and modifiers such as adjectives and adverbs, and retain the nouns.

(7) Use the DSG model to obtain the word vector V_{t_j} of each word t_j, then compute the weighted average sentence vector V_T by formula (2).
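Formula (2) is likewise not shown in this text. Based on the phrase "weighted average" and the TF-IDF weights w_j from step (4), one reading is the following; dividing by the sum of the weights rather than by the word count is an assumption.

```latex
V_T = \frac{\sum_{j} w_j \, V_{t_j}}{\sum_{j} w_j} \qquad (2)
```

Cosine similarity is unchanged when a vector is rescaled by a positive constant, so the choice of normalizer does not alter a cosine-based comparison of V_T with the query vector, provided the weights sum to a positive value.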

(8) Compute the similarity of the two texts with the cosine similarity measure:

sim(V_Q, V_T) = (V_Q · V_T) / (‖V_Q‖ ‖V_T‖)    (3)
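The vector arithmetic of the matching steps can be sketched with plain Python lists. This is an illustrative sketch only: the two-dimensional vectors and the weights are made-up stand-ins for DSG embeddings and TF-IDF values, the function names are hypothetical, and normalizing the weighted mean by the sum of the weights is an assumption.

```python
import math


def mean_vector(vectors):
    """Unweighted mean of word vectors (sentence vector of the query text)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def weighted_mean_vector(vectors, weights):
    """Weighted mean of word vectors (sentence vector of a synopsis)."""
    total = sum(weights)
    return [sum(w * x for w, x in zip(weights, col)) / total
            for col in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Made-up 2-d "word vectors" for a query and a synopsis.
v_q = mean_vector([[1.0, 0.0], [0.0, 1.0]])
v_t = weighted_mean_vector([[1.0, 0.0], [0.0, 1.0]], weights=[3.0, 1.0])
similarity = cosine(v_q, v_t)
```

With these inputs the query vector is [0.5, 0.5], the synopsis vector is [0.75, 0.25], and the similarity is 2/√5, roughly 0.894.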

Step 2: Determine the meta-path sequences. The greedy tree is first traversed to obtain the edge type sequences, and the node type sequences are then determined from the edge type sequences. Traversing the generated greedy tree separates out the paths that connect the input node pair. Let L be the path set, in which all possible meta-path edge sequences are stored. The root node is recorded as the j-th node of the i-th layer, with i = 0 and j = 0.

2.1 Traverse downward from the root node. The root node is the current node, and the j-th node of layer i+1 of the greedy tree is the next node, with j = 0. Put the edge connecting the current node and the next node into the current path sequence l, and record the length of the next node's dictionary value, that is, of its target node set, as the out-degree of that node.

2.2 Update the current node to be the next node of the previous step, and take the j-th leaf node of layer i+1 of the greedy tree as the next node, with j = 0. If the flag of the next node is True, put the edge connecting the current node and the next node into the current path sequence l, record the length of the next node's dictionary value (its target node set) as the out-degree of that node, save the current path sequence l in the set L, set j = j + 1, and go to step 2.3. Otherwise, check whether the next node has any expansion edges. If it does, put the edge connecting the current node and the next node into the current path sequence l, record the length of the next node's dictionary value as its out-degree, set i = i + 1 and j = 0, and repeat step 2.2. If it does not, set j = j + 1 and go to step 2.3.

2.3 Update the next node to be the j-th leaf node of layer i+1 of the greedy tree. If the flag of the next node is True, put the edge connecting the current node and the next node into the current path sequence l, record the length of the next node's dictionary value (its target node set) as the out-degree of that node, save the current path sequence l in the set L, set j = j + 1, and go to step 2.3. Otherwise, check whether the next node has any expansion edges. If it does, put the edge connecting the current node and the next node into the current path sequence l, record the length of the next node's dictionary value as its out-degree, set i = i + 1 and j = 0, and repeat step 2.2. If it does not, set j = j + 1 and go to step 2.3.

2.4 After the traversal is complete, the set of meta-paths expressed as edge type sequences, L = {l_0, …, l_i, …}, is obtained. For each meta-path l_i = {t_0, …, t_j, …} in L, the node types are determined from the edge types t_j, finally giving the complete meta-paths containing both the node type sequence and the edge type sequence.
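The traversal of steps 2.1 to 2.4 can be sketched as a depth-first walk. The sketch swaps the explicit layer and index counters i and j for recursion, which visits the same branches. The toy tree, the `hit` field marking a branch that reached the target node, and the `EDGE_SCHEMA` mapping from edge type to endpoint node types are all assumptions introduced for illustration.

```python
# Hypothetical schema: edge type -> (source node type, target node type).
EDGE_SCHEMA = {
    "acts_in": ("actor", "movie"),
    "acted_by": ("movie", "actor"),
    "directed_by": ("movie", "director"),
}

# Toy greedy tree (hypothetical data); "hit" marks a branch that reached the target.
TREE = {
    "edge_type": None, "hit": False, "children": [
        {"edge_type": "acts_in", "hit": False, "children": [
            {"edge_type": "acted_by", "hit": True, "children": []},
            {"edge_type": "directed_by", "hit": False, "children": []},
        ]},
    ],
}


def collect_edge_sequences(node, prefix=()):
    """Depth-first walk returning the edge type sequence of every hit branch."""
    path = prefix if node["edge_type"] is None else prefix + (node["edge_type"],)
    found = [path] if node["hit"] else []
    for child in node["children"]:
        found.extend(collect_edge_sequences(child, path))
    return found


def node_types(edge_seq):
    """Derive the node type sequence from an edge type sequence (step 2.4)."""
    types = [EDGE_SCHEMA[edge_seq[0]][0]]
    for edge in edge_seq:
        types.append(EDGE_SCHEMA[edge][1])
    return types


L = collect_edge_sequences(TREE)
meta_paths = [(node_types(l), list(l)) for l in L]
```

For the toy tree, the only branch that reaches the target yields the edge sequence acts_in, acted_by and therefore the node type sequence actor, movie, actor.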

Step 3: Calculate the importance of each meta-path. A formula for meta-path importance is first defined from the factors that affect it; the importance is then calculated using the number of instance nodes in the leaf nodes of the greedy tree. The importance of a meta-path is:

I_{s,t}(P) = S_{s,t}(P) * R_{s,t}(P) * Penalty(|P|),  P ∈ P_{s→t}    (4)

The importance thus has three parts: S_{s,t}(P), R_{s,t}(P), and Penalty(|P|).

3.1 Calculate the length penalty function. The meta-path length |P| is taken from the meta-paths obtained in step 2, and β^|P| is used as the penalty function, where β is a decay coefficient set to 0.5.

3.2 Calculate the meta-path rarity.

The rarity function evaluates, in a given heterogeneous information network G = (V, E), how rare the meta-path P is among the other node pairs similar to the input node pair <s, t>. D_{s,t} denotes the node pairs similar to the input node pair and is defined as:

D_{s,t} = D_t ∪ D_s    (5)

where D_t does not include t and D_s does not include s. The meta-path rarity is then calculated by formula (8).

3.3 Calculate the meta-path strength. The support function of meta-path importance is:

S_{s,t}(P) = Strength(P) * MNIs_{s,t}(P)    (9)

where MNIs(P) is the minimum number of instances in the meta-path P, computed as in formula (10), with p_i the number of instances of the i-th node on the meta-path.

Strength(P) is the strength coefficient of the meta-path P, computed as defined in formula (11). Let A be the node with the minimum number of instances obtained from formula (10), O(A) the out-degree of A, and I(A) the in-degree of A. When A is a movie node, its out-degree is calculated by formula (12), where p_A is the instance set of A: the out-degree of A is the sum, over every node in A's instance set, of the similarity between that node's vector and the short text vector.
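Formulas (10) and (12) are not shown in this text, but their verbal descriptions are specific enough to suggest the following forms. These are reconstructions from the wording only and should be read as assumptions; formula (11) is not reconstructed here because the text does not describe its form.

```latex
\mathrm{MNIs}_{s,t}(P) = \min_{i} \, p_i \qquad (10)

O(A) = \sum_{x \in p_A} \mathrm{sim}(V_x, V_Q) \qquad (12)
```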

3.4 Calculate the meta-path importance. After the length decay coefficient, rarity, and strength of the meta-path have been obtained in steps 3.1, 3.2, and 3.3 respectively, the final meta-path importance is calculated by formula (4).
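As a worked illustration of formula (4) with the β^|P| penalty of step 3.1, the sketch below multiplies the three factors for two candidate meta-paths. The values of S and R are made-up inputs, since their defining formulas are not reproduced here, and the function and variable names are hypothetical.

```python
def penalty(path_length, beta=0.5):
    """Length penalty beta ** |P| from step 3.1."""
    return beta ** path_length


def importance(strength_support, rarity, path_length, beta=0.5):
    """Formula (4): I = S * R * Penalty(|P|)."""
    return strength_support * rarity * penalty(path_length, beta)


# Made-up inputs per meta-path: (S, R, |P|).
candidates = {
    "actor-movie-actor": (4.0, 0.5, 2),
    "actor-movie-director-movie-actor": (6.0, 0.5, 4),
}
scores = {name: importance(*args) for name, args in candidates.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

The longer path has the larger support value but is penalized by 0.5^4, so it scores 0.1875 against 0.5 for the shorter path.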

Step 4: Combine multiple meta-paths to generate query instances. Instance node pairs that conform to the semantics of a meta-path are highly similar under that meta-path. To obtain the query result instances it is therefore only necessary to find the node pairs that are highly similar under every meta-path.

4.1 Calculate the meta-path-based similarity of node pairs. The similarity of a node pair under a given meta-path is calculated by formula (13),

where ρ_{e_i}(v_i, C_{i+1}) denotes the set of nodes x of type C_{i+1} connected to node v_i by edge e_i; P^{i…n} denotes the sub-sequence of the meta-path from node type C_i to C_n; and α is a fixed parameter set to 0.5. When the type C_{i+1} of node x is movie, |ρ_{e_i}(v_i, C_{i+1})| in formula (13) is replaced by the sum of the similarities between the movie synopses and the query short text, ∑ sim(V_x, V_Q).

4.2 Calculate the similarity between entities s and t. A linear aggregation function computes the similarity σ(s,t|P) between entities s and t, using the importance of each meta-path as the weight of the corresponding similarity,

where I_j denotes the importance of meta-path P_j.
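The aggregation formula is not shown in this text. Given that it is described as linear and weighted by importance, one plausible form is the weighted sum below; the set symbol for the collection of meta-paths is introduced here for clarity, and whether the weights are additionally normalized is not stated, so no normalization is assumed.

```latex
\sigma(s, t \mid \mathcal{P}) = \sum_{P_j \in \mathcal{P}} I_j \, \sigma(s, t \mid P_j)
```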

4.3 Obtain the query instances from the similarity matrices. Once the meta-path-based node similarities are available, a similarity matrix is computed for each meta-path. If there are m filmmaker nodes, the similarity matrix has size m×m; the similarity matrix of meta-path P is denoted S_P.

The similarity matrix of each meta-path is built when that meta-path is first generated, and the matrices can be reused. Each time multiple meta-paths are combined in a query, it is only necessary to select the similarity matrices of those meta-paths and record the indices and values of the positions that are non-zero in all of the matrices. The indices give the node pairs that satisfy the semantics of all the meta-paths, and calculating the similarity of these node pairs gives the query result.
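The matrix lookup can be sketched with nested lists. The matrices and weights below are made-up values, the function names are hypothetical, and scoring the surviving pairs by an importance-weighted sum is an assumption that follows the linear aggregation of step 4.2.

```python
def common_nonzero(matrices):
    """Index pairs (i, j) whose entry is non-zero in every matrix."""
    m = len(matrices[0])
    return [(i, j) for i in range(m) for j in range(m)
            if all(mat[i][j] != 0 for mat in matrices)]


def combine(matrices, weights):
    """Importance-weighted similarity for node pairs shared by all meta-paths."""
    return {(i, j): sum(w * mat[i][j] for w, mat in zip(weights, matrices))
            for i, j in common_nonzero(matrices)}


# Made-up 3x3 similarity matrices for two meta-paths over three filmmakers.
S_P1 = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.25], [0.0, 0.25, 1.0]]
S_P2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.5], [0.0, 0.5, 1.0]]
result = combine([S_P1, S_P2], weights=[0.5, 0.25])
```

Filmmakers 0 and 1 are related under the first meta-path only, so the pair (0, 1) is dropped, while the pair (1, 2) is non-zero in both matrices and receives a combined score of 0.25.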

This concludes the steps of the meta-path-based method for querying similar nodes in a heterogeneous network.

Combining the above techniques, the invention provides a meta-path-based node query method for heterogeneous information networks. To address the problems that traditional meta-paths do not generalize and cannot be distinguished as key or non-key, it proposes an importance measure that combines meta-path length, rarity, and strength, and uses this importance to decide whether a generated meta-path is a "key meta-path". In addition, so that generated meta-paths are constrained by the semantics of the short text, the invention uses movie synopsis information: it semantically matches the movie synopses against the short text description, obtains meta-paths constrained by the short text semantics, and calculates their importance. The similar-node query results in the heterogeneous information network are then obtained from the computed meta-paths and their importance.

The advantages of the invention are: (1) A novel algorithmic approach. The invention uses meta-path importance to judge whether a meta-path is a key meta-path, which overcomes the lack of generality of meta-paths specified by domain experts. (2) Meta-path semantics enriched along several dimensions. The semantics of node attributes are added during meta-path generation: short text is used to constrain the movie content, so that the generated meta-paths carry not only their own relational semantics but also semantics related to node attributes. (3) Simple and fast implementation. By recursively expanding a greedy tree, the invention extracts meta-paths from user input and calculates their importance in real time, without data labeling or model training, which greatly improves the efficiency of automatic meta-path generation.

Brief Description of the Drawings

Figure 1 is the overall flowchart of the method of the invention.

Detailed Description

The invention defines the importance of a meta-path by combining its length, rarity, and strength, and incorporates the short text describing movie content when calculating the strength support function. Meta-paths are generated by expanding a greedy tree, in four stages. The first stage generates the path greedy tree from the input source node, target node, and short text description. The second stage generates the type sequences of the meta-paths from the resulting path greedy tree. The third stage calculates the importance of each meta-path. The fourth stage combines multiple meta-paths to perform the node query.

The overall process of the meta-path-based method for querying similar nodes in a heterogeneous network is shown in Figure 1. It comprises Steps 1 to 4 as set out in the Summary of the Invention above.

The content described in the embodiments of this specification merely enumerates forms in which the inventive concept may be realized. The scope of protection of the invention should not be regarded as limited to the specific forms stated in the embodiments; it also extends to equivalent technical means that a person skilled in the art could conceive on the basis of the inventive concept.

Claims

1. A meta-path-based method for querying similar nodes in a heterogeneous network, comprising the following steps:

step 1: generating a path greedy tree; expanding the greedy tree according to the input source node and short text description; performing semantic matching of the short text while expanding the greedy tree;

1.1 building the root node of the greedy tree; each object node of the greedy tree comprises two pieces of information: one is the list of node pairs generated during path expansion, wherein the node pairs are stored as a dictionary, the source node being stored as the dictionary key and the target nodes as the dictionary value; the other is a flag for judging whether the current greedy tree object can still be expanded downward, wherein True indicates that the current object can continue to be expanded downward and False indicates that the current object is the end point of a path or has reached the meta-path length threshold; the edges connecting greedy tree objects are labeled with the edge types of the heterogeneous information network; the root node of the greedy tree has not been expanded, so the value corresponding to its source node is empty;

1.2 recursively expanding the greedy tree; during expansion, judging from the edge type of the greedy tree whether the next node is a movie node; if so, performing the semantic matching process of step 1.3; if not, continuing the recursive expansion of step 1.2 until the target node appears in the value list of a leaf node of the greedy tree or the path reaches the length threshold;

1.3 semantically matching the short text input with the query against the movie synopses; the film and television information network is a star-structured network in which there is one central object, all other types of objects are connected to the central object, and the attributes of the central object can affect every type of relationship in the network; in a film and television information network the central object is the movie, the connections between filmmakers all arise through movies, and the movie content carries rich semantics that can bring out, from the perspective of node attributes, the characteristics of the connections between nodes; a movie synopsis summarizes the movie content in a short passage and can be treated as short text data; the input query text is also short text data, so finding the movie synopses semantically similar to the short text entered with the query yields the movie content matching the query semantics, from which meta-paths conforming to the short text semantics are generated; the semantic matching between a movie synopsis and the short text entered by the user comprises:

(1) segmenting the short text input with the query using the TextRank-based open-source jieba word segmentation algorithm; the input short text is denoted Q, and after word segmentation it is represented by the word sequence [q_0, q_1, …, q_i, …, q_n], where q_i is the i-th word and n is the length of the word sequence;

(2) obtaining the word vector of each word using the Directional Skip-Gram (DSG) model, denoted V_{q_i};

(3) after the word vectors are obtained, calculating the mean of the word vectors by formula (1) to obtain the sentence vector;

(4) segmenting the synopsis of the movie; a movie synopsis is denoted T, and word segmentation gives the word sequence [t_0, …, t_j, …, t_m], with the TF-IDF value of each word as its weight, the weight sequence being [w_0, …, w_j, …, w_m];

(5) applying named entity recognition to handle the personal names in the movie synopsis, and deleting the words recognized as personal names from the word segmentation result;

(6) performing part-of-speech analysis on the segmented words of the movie synopsis, filtering out verbs and modifiers such as adjectives and adverbs, and retaining nouns;

(7) obtaining the word vector V_{t_j} of each word t_j using the DSG model, then calculating the weighted average sentence vector V_T by formula (2);

(8) obtaining the similarity of the two texts based on the cosine similarity measure, the calculation formula being:

sim(V_Q, V_T) = (V_Q · V_T) / (‖V_Q‖ ‖V_T‖)    (3)

step 2: determining a meta-path sequence; firstly traversing a greedy tree to obtain an edge type sequence, and then determining a node type sequence according to the edge type sequence; traversing the generated greedy tree, and separating a path connecting the input node pairs from the greedy tree; l is a path set, and all possible meta-path edge sequences are stored in L; recording the root node as the jth node of the ith layer, wherein i is 0, and j is 0;

2.1 traversing from the root node downwards; the root node is a current node, the jth node of the i +1 th layer of the greedy tree is a next node, and j is 0; putting an edge connecting a current node and a next node into a current path sequence l, and recording the dictionary value of the next node, namely the length of a target node set as the output degree of the node;

2.2, updating the current node as the next node of the previous step, wherein the next node is the jth leaf node of the i +1 th layer of the greedy tree, and j is 0; if the mark of the next node is True, putting the edge connecting the current node and the next node into the current path sequence l, and recording the dictionary value of the next node, namely the length of the target node set as the output degree of the node; storing the current path sequence L in the set L, and making j equal to j +1, and performing step 2.3; otherwise, judging whether the next node has an extension edge, if so, putting the edge connecting the current node and the next node into the current path sequence l, and recording the dictionary value of the next node, namely the length of the target node set as the output degree of the node; repeating step 2.2 by making i ═ i +1 and j ═ 0; if the next node has no extended edge, let j equal to j +1, go to step 2.3;

2.3 updating the next node to be the j-th leaf node of the (i+1)-th layer of the greedy tree; if the mark of the next node is True, putting the edge connecting the current node and the next node into the current path sequence l, recording the dictionary value of the next node, namely the size of its target node set, as the out-degree of the node, storing the current path sequence l in the set L, setting j = j + 1, and performing step 2.3 again; otherwise, judging whether the next node has an extension edge: if so, putting the edge connecting the current node and the next node into the current path sequence l, recording the dictionary value of the next node as the out-degree of the node, setting i = i + 1 and j = 0, and repeating step 2.2; if the next node has no extension edge, setting j = j + 1 and going to step 2.3;

2.4 after completing the traversal, obtaining the meta-path set $L = \{l_0, \dots, l_i, \dots\}$ containing the edge type sequences; for each meta-path $l_i = \{t_0, \dots, t_j, \dots\}$ in L, determining the node types according to the edge types $t_j$ therein; finally, obtaining the complete meta-path containing both the node type sequence and the edge type sequence;
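The traversal of steps 2.1 to 2.4 can be sketched as a depth-first walk. The sketch below is only an illustration under stated assumptions: the `TreeNode` class, its field names and the example edge type strings are hypothetical, recursion replaces the explicit layer and position counters i and j, and the out-degree bookkeeping is omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TreeNode:
    # Type of the edge leading from the parent to this node (None for the root).
    edge_type: Optional[str] = None
    # True if this node reaches the input target node.
    mark: bool = False
    children: List["TreeNode"] = field(default_factory=list)


def collect_edge_sequences(root):
    """Return every edge type sequence from the root to a marked node (the set L)."""
    paths = []

    def visit(node, current):
        for child in node.children:
            extended = current + [child.edge_type]
            if child.mark:
                # A path connecting the input node pair: store it in L.
                paths.append(extended)
            elif child.children:
                # The node has extension edges: descend one layer.
                visit(child, extended)
            # Otherwise this is an unmarked dead end; move on to the next sibling.

    visit(root, [])
    return paths
```

For a root with one marked child reached by a hypothetical edge type `rates`, and one unmarked child reached by `acts_in` that has a marked child reached by `directed_by`, the function returns the two sequences `["acts_in", "directed_by"]` and `["rates"]` in child order.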

Step 3: calculating the importance of the meta-path; first, defining the calculation formula of meta-path importance according to the factors that influence it; then calculating the importance of each meta-path from the number of instance nodes in the leaf nodes of the greedy tree, the importance formula being:

$I_{s,t}(P) = S_{s,t}(P) \cdot R_{s,t}(P) \cdot \mathrm{Penalty}(|P|), \quad P \in P_{s \to t}$ (4)

wherein the importance is composed of three factors: the support $S_{s,t}(P)$, the rarity $R_{s,t}(P)$ and the length penalty $\mathrm{Penalty}(|P|)$;

3.1 calculating the length penalty function; the meta-path length |P| is obtained from the meta-path determined in step 2, and $\beta^{|P|}$ is used as the penalty function, where the attenuation coefficient β is 0.5;

3.2 calculating the rarity of the meta-path;

the rarity function evaluates, in a given heterogeneous information network $G = (V, E)$, how rare the meta-path P is among the other node pairs similar to the input node pair <s, t>; $D_{s,t}$ denotes the set of node pairs similar to the input node pair, defined as:

$D_{s,t} = D_t \cup D_s$ (5)

wherein,

$D_t$ does not include t and $D_s$ does not include s; the rarity of the meta-path is then obtained by formula (8);

3.3 calculating the meta-path strength; the support function of meta-path importance is:

$S_{s,t}(P) = \mathrm{Strength}(P) \cdot \mathrm{MNIs}_{s,t}(P)$ (9)

wherein $\mathrm{MNIs}(P)$ is the minimum number of instances over the nodes of the meta-path P, calculated as shown in formula (10), and $P_i$ is the number of instances of the i-th node on the meta-path;

$\mathrm{Strength}(P)$ is the strength coefficient of the meta-path P, calculated as defined by formula (11); assume that the node with the minimum number of instances obtained by formula (10) is A, with out-degree O(A) and in-degree I(A); when node A is a movie node, the out-degree of node A is calculated by formula (12), where $p_A$ is the instance set of node A: the similarities between the vector of each node in the instance set of A and the short text vector are summed to obtain the out-degree of node A;

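when A is Movie, $O(A) = \sum_{x \in p_A} \mathrm{sim}(V_x, V_Q)$ (12), where $V_x$ is the vector of instance node x and $V_Q$ is the short text vector;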

3.4 calculating the importance of the meta-path; the length penalty, the rarity and the strength are calculated in steps 3.1, 3.2 and 3.3 respectively, and the final meta-path importance is then calculated according to formula (4);
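The way formulas (4) and (9) combine the three factors can be illustrated with a short Python sketch. The strength coefficient and the rarity are passed in as plain numbers, since their defining formulas (11) and (8) are not reproduced in this text; the function and parameter names are hypothetical, and reading MNIs as a simple minimum over per-node instance counts is an assumption based on the description of formula (10).

```python
def length_penalty(path_length, beta=0.5):
    """Penalty(|P|) = beta ** |P|, with attenuation coefficient beta (step 3.1)."""
    return beta ** path_length


def min_instances(instance_counts):
    """MNIs(P): the smallest instance count over the nodes of the meta-path."""
    return min(instance_counts)


def importance(strength, rarity, instance_counts, path_length, beta=0.5):
    """Formula (4): I = S * R * Penalty, with S = Strength * MNIs as in formula (9)."""
    support = strength * min_instances(instance_counts)
    return support * rarity * length_penalty(path_length, beta)
```

With β = 0.5, each extra edge halves the penalty factor, so of two meta-paths with equal support and rarity the shorter one receives the higher importance.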

Step 4: generating query instances by combining a plurality of meta-paths; instance node pairs that conform to the semantics of a meta-path have high similarity under that meta-path; therefore, to obtain query result instances, it is only necessary to find the node pairs with high semantic similarity under each meta-path;

4.1 calculating the similarity of node pairs based on meta-paths; the similarity calculation formula of the node pairs according to different meta-paths is as follows:

where $\rho_{e_i}(v_i, C_{i+1})$ denotes the set of nodes x of type $C_{i+1}$ that are connected to node $v_i$ through edge $e_i$; $P^{i \dots n}$ denotes the part of the meta-path from node type $C_i$ to $C_n$; α is a fixed parameter set to 0.5; when the type $C_{i+1}$ of node x is Movie, the sum of the similarities between the movie synopsis texts and the query short text, $\sum \mathrm{sim}(V_x, V_Q)$, is used in place of $|\rho_{e_i}(v_i, C_{i+1})|$ in formula (13);

4.2 calculating the similarity between the entities s and t of a node pair; the similarity $\sigma(s, t \mid P)$ between s and t is calculated with a linear aggregation function that uses the importance of each meta-path as the weight of the corresponding similarity, the aggregation function being:

wherein $I_j$ denotes the importance corresponding to meta-path $P_j$;
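An importance-weighted linear aggregation of this kind can be sketched as follows. The sketch assumes a plain weighted sum of the per-meta-path similarities; whether the weights are additionally normalized is not stated in this text, so the unnormalized form and the function name `aggregate_similarity` are assumptions for illustration only.

```python
def aggregate_similarity(similarities, importances):
    """Importance-weighted linear combination of per-meta-path similarities.

    similarities[j] is the similarity of s and t under meta-path P_j,
    importances[j] is the importance I_j of that meta-path.
    """
    if len(similarities) != len(importances):
        raise ValueError("one importance value is needed per meta-path")
    return sum(i_j * s_j for i_j, s_j in zip(importances, similarities))
```

A meta-path with a larger importance therefore contributes proportionally more to the final similarity of the node pair.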

4.3 obtaining query instances according to the similarity matrices; after the meta-path-based node similarities are obtained, a similarity matrix is calculated for each meta-path; if the number of movie nodes is m, the size of the similarity matrix is m × m, and the similarity matrix of meta-path P is denoted $S_P$:

A similarity matrix is constructed when each meta-path is generated for the first time, and these matrices can be reused. When a plurality of meta-paths are combined for a query, it is only necessary to select the similarity matrices of the corresponding meta-paths and record the indexes and values of the positions that are non-zero in all of the matrices; the node pairs satisfying the semantics of all the meta-paths are obtained from these indexes, and their similarities are calculated to obtain the query result.
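The combination step can be sketched in Python as an intersection of non-zero positions across the selected matrices. This is a minimal illustration, assuming each similarity matrix is an m × m list of lists and that the final score of a pair is an importance-weighted sum of its per-meta-path values; the function name `combine_meta_paths` and the dictionary result format are hypothetical.

```python
def combine_meta_paths(matrices, importances):
    """Return {(row, col): score} for positions that are non-zero in every matrix."""
    if not matrices or len(matrices) != len(importances):
        raise ValueError("need one importance value per similarity matrix")
    m = len(matrices[0])
    results = {}
    for r in range(m):
        for c in range(m):
            values = [matrix[r][c] for matrix in matrices]
            # Keep only node pairs that satisfy the semantics of all meta-paths.
            if all(v != 0 for v in values):
                results[(r, c)] = sum(w * v for w, v in zip(importances, values))
    return results
```

Sorting the returned pairs by score in descending order gives a ranked query result; because the per-meta-path matrices are computed once and reused, a new combination only costs one pass over the m × m positions.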