CN112380360A

CN112380360A - Node query method based on meta-path in heterogeneous information network

Info

Publication number: CN112380360A
Application number: CN202011260846.0A
Authority: CN
Inventors: 汤颖; 徐珊
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2020-11-12
Filing date: 2020-11-12
Publication date: 2021-02-19
Anticipated expiration: 2040-11-12
Also published as: CN112380360B

Abstract

The meta-path-based method for querying similar nodes in a heterogeneous network comprises the following steps: 1. generating a path greedy tree: the greedy tree is expanded according to the input source node and the short text description, and semantic matching of short texts is performed during the expansion; 2. determining the meta-path sequence: the greedy tree is first traversed to obtain an edge type sequence, and the node type sequence is then determined from the edge type sequence; during the traversal, the paths connecting the input node pair are separated from the greedy tree; 3. calculating the importance of the meta-path: a formula for meta-path importance is first defined according to the factors that influence it, and the importance is then calculated from the number of instance nodes in the leaf nodes of the greedy tree; 4. generating query instances by combining multiple meta-paths: instance node pairs that conform to the semantics of a meta-path have higher similarity under that meta-path, so to obtain query result instances it suffices to find the node pairs with higher similarity under each meta-path.

Description

Node query method based on meta-path in heterogeneous information network

Technical Field

The invention relates to a node query method based on a meta path and oriented to a heterogeneous information network.

Background

Real-world systems are typically composed of a large number of interacting multi-type components, such as human social events, communications, and biological networks. In such systems, the components form a network by being connected to each other, and such a network is generally referred to as an information network. Most conventional information networks are homogeneous networks, that is, nodes in the information networks are entities of the same type, and the entities are connected through relationships of the same type. However, in practical applications most information networks are heterogeneous, i.e. the nodes and relations in the network are not of a single type. Without distinguishing the type of node or edge in the network, important semantic information contained therein is often lost. Data in many fields exists in the form of heterogeneous information networks, such as academic literature information networks, medical information networks, Twitter information networks, and the like.

In the heterogeneous information network, in order to represent different semantic relationships between two nodes, the concept of meta-path is proposed and applied to related tasks of heterogeneous information network mining. A meta-path is a sequence of node types and edge types that can be viewed as a simple graph schema that expresses different semantics. The meta-path can well express the connection semantics among different types of nodes in the heterogeneous information network, so that the meta-path is widely applied to various tasks of heterogeneous information network mining, in particular to a node query task based on meta-path similarity.

In most existing work, meta-paths are specified by domain experts for specific application scenarios, so they generally lack universality, and different tasks and data sets require different meta-paths. Some methods generate meta-paths automatically, for example by taking the meta-path length as a parameter and generating all meta-paths within a maximum length threshold. This keeps the number of meta-paths finite, but does not guarantee that every meta-path within the length threshold is a "critical" path. Most automatic generation methods also focus on representing the connections between nodes as meta-paths without considering the influence of node attributes on those connections. These limitations in the generated meta-paths reduce the accuracy of node queries.

Disclosure of Invention

In order to ensure that the generated meta-paths are key meta-paths and to overcome the problem that meta-path semantics do not capture node attributes, the invention provides a node query method based on key meta-paths for film and television heterogeneous information networks with a star structure, in which key meta-paths are generated automatically by combining node attributes with meta-path importance.

The invention defines the importance of a meta-path by combining factors such as its length, rarity and strength, and incorporates the short text describing the movie content when calculating the strength support function. Meta-paths are generated by expanding a greedy tree, and the process comprises four stages: the first stage generates the path greedy tree from the input source node, target node and short text description; the second stage generates the type sequence of each meta-path from the resulting path greedy tree; the third stage calculates the importance of each meta-path; the fourth stage combines multiple meta-paths to perform the node query.

The general flow of the heterogeneous network similar node query method based on the meta-path specifically comprises the following steps:

step 1: generating a path greedy tree; expanding the greedy tree according to the input source node and the short text description; performing semantic matching of short texts in the greedy tree expanding process;

1.1 building the greedy tree root node; each object node of the greedy tree holds two pieces of information: the first is the list of node pairs generated during path expansion, stored as a dictionary in which the source node is the key and the target nodes are the value; the second is a flag indicating whether the current greedy tree object can be expanded further: True means the object can continue to be expanded downwards, and False means the object is the end point of a path or has reached the meta-path length threshold; the edges connecting greedy tree objects are labelled with the edge types of the heterogeneous information network; the root node of the greedy tree has not yet been expanded, so the value corresponding to its source node is null;

1.2 recursively expanding the greedy tree; during expansion, the edge type of the greedy tree is used to judge whether the next node is a movie node; if it is a movie node, the semantic matching process of step 1.3 is performed; if it is not, the recursive expansion of step 1.2 continues until the target node appears in the value list of a greedy tree leaf node or the path reaches the length threshold;
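The node structure of step 1.1 and the recursion of step 1.2 can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `GreedyNode`, the function `expand`, and the adjacency-list layout of `graph` are assumptions made for the example, and the semantic matching applied at movie nodes (step 1.3) is deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class GreedyNode:
    # Node pairs as a dictionary: source node -> list of reached target nodes.
    pairs: dict
    # True: may still be expanded; False: path end point or length threshold reached.
    expandable: bool = True
    # Child objects, keyed by the edge type that labels the connecting edge.
    children: dict = field(default_factory=dict)


def expand(node, graph, target, depth, max_len):
    """Recursively expand a greedy tree node.

    graph is a hypothetical adjacency list:
    {node_id: [(edge_type, neighbour_id), ...]}.
    """
    reached = {t for targets in node.pairs.values() for t in targets}
    if target in reached or depth >= max_len:
        node.expandable = False
        return
    by_type = {}
    for src, targets in node.pairs.items():
        # The root stores an empty value, so expansion starts from the source itself.
        frontier = targets or [src]
        for v in frontier:
            for edge_type, nxt in graph.get(v, []):
                by_type.setdefault(edge_type, {}).setdefault(src, []).append(nxt)
    for edge_type, pairs in by_type.items():
        child = GreedyNode(pairs=pairs)
        node.children[edge_type] = child
        expand(child, graph, target, depth + 1, max_len)
```

A root would be created as `GreedyNode(pairs={source: []})` and expanded with `expand(root, graph, target, 0, max_len)`; grouping the children by edge type mirrors the statement that tree edges are labelled with the edge types of the network.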

1.3 performing semantic matching between the short text entered with the query and the movie text introduction; the film and television information network is a star-structured network: it has a central object, every other type of object is connected to the central object, and the attributes of the central object can influence all types of relationships in the network; in a film and television information network the central object is the movie, and connections between film personnel arise through movies; moreover, movie content carries rich semantics, which can highlight the characteristics of the connections between nodes from the perspective of node attributes; a movie introduction summarizes the movie content in a short text and can therefore be treated as short text data; the input query text is also short text data, so by finding the movie introductions that are semantically similar to the short text entered with the query, the movie content that conforms to the query semantics can be obtained and a meta-path that conforms to the short text semantics can be generated; the semantic matching procedure between the movie introduction and the short text entered by the user is as follows:

(1) segmenting the short text entered with the query into words using the open-source jieba word segmentation algorithm based on TextRank; the input short text is denoted Q, and after word segmentation it can be represented by the word sequence [q_0, q_1, …, q_i, …, q_n], where q_i is the i-th word and n is the length of the word sequence;

(2) obtaining the word vector of each word using a Directional Skip-Gram model (DSG for short), denoted V_{q_i};

(3) After the word vectors are obtained, calculating the mean value of the word vectors through a formula (1) to obtain sentence vectors;

(4) segmenting the movie text introduction into words; the movie text introduction is denoted T, the word sequence obtained after segmentation is [t_0, …, t_j, …, t_m], and the TF-IDF value of each word is used as its weight, giving the weight sequence [w_0, …, w_j, …, w_m];

(5) The named entity recognition technology is adopted to process the names in the movie text introduction, and words recognized as the names are deleted from the word segmentation result;

(6) performing part-of-speech analysis on the words after word segmentation of the movie brief introduction, filtering out modifiers such as verbs, adjectives and adverbs, and keeping nouns;

(7) obtaining the word vector V_{t_j} of each word t_j using the DSG model, and then calculating the weighted-average sentence vector V_T by formula (2);

(8) obtaining the similarity of the two texts using cosine similarity, calculated as follows:

sim(V_Q, V_T) = (V_Q · V_T) / (‖V_Q‖ ‖V_T‖) (3)
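The vector side of the matching procedure can be illustrated with a small sketch. It assumes the word vectors and TF-IDF weights have already been produced by the earlier sub-steps; the function names are hypothetical, and dividing the weighted sum by the total weight is an assumption, since formula (2) is not reproduced in the text.

```python
import math


def mean_vector(vectors):
    """Unweighted mean of word vectors (sentence vector of the query text)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def weighted_mean_vector(vectors, weights):
    """Weighted mean of word vectors; the weights stand in for TF-IDF values.

    Normalising by the sum of the weights is an assumption of this sketch.
    """
    total = sum(weights)
    return [
        sum(w * x for w, x in zip(weights, col)) / total
        for col in zip(*vectors)
    ]


def cosine(u, v):
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

With a query vector `v_q = mean_vector(query_word_vectors)` and an introduction vector `v_t = weighted_mean_vector(intro_word_vectors, tfidf_weights)`, the match score is `cosine(v_q, v_t)`.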

step 2: determining a meta-path sequence; the greedy tree is first traversed to obtain an edge type sequence, and the node type sequence is then determined from the edge type sequence; while traversing the generated greedy tree, the paths connecting the input node pair are separated from the greedy tree; L is the path set, in which all possible meta-path edge sequences are stored; the root node is recorded as the j-th node of layer i, where i = 0 and j = 0;

2.1 traversing downwards from the root node; the root node is the current node, and the j-th node of layer i+1 of the greedy tree is the next node, with j = 0; the edge connecting the current node and the next node is put into the current path sequence l, and the length of the dictionary value of the next node, i.e. the size of its target node set, is recorded as the out-degree of the node;

2.2 updating the current node to the next node of the previous step; the next node becomes the j-th leaf node of layer i+1 of the greedy tree, with j = 0; if the flag of the next node is True, the edge connecting the current node and the next node is put into the current path sequence l, and the length of the dictionary value of the next node, i.e. the size of its target node set, is recorded as the out-degree of the node; the current path sequence l is stored in the set L, j is set to j + 1, and step 2.3 is performed; otherwise, it is judged whether the next node has an extension edge; if so, the edge connecting the current node and the next node is put into the current path sequence l, the out-degree of the node is recorded in the same way, and step 2.2 is repeated with i = i + 1 and j = 0; if the next node has no extension edge, j is set to j + 1 and step 2.3 is performed;

2.3 updating the next node to be the j-th leaf node of layer i+1 of the greedy tree; if the flag of the next node is True, the edge connecting the current node and the next node is put into the current path sequence l, and the length of the dictionary value of the next node, i.e. the size of its target node set, is recorded as the out-degree of the node; the current path sequence l is stored in the set L, j is set to j + 1, and step 2.3 is performed again; otherwise, it is judged whether the next node has an extension edge; if so, the edge connecting the current node and the next node is put into the current path sequence l, the out-degree of the node is recorded in the same way, and step 2.2 is repeated with i = i + 1 and j = 0; if the next node has no extension edge, j is set to j + 1 and step 2.3 is performed again;

2.4 after the traversal is complete, the meta-path set L = {l_0, …, l_i, …} containing the edge type sequences is obtained; for each meta-path l_i = {t_0, …, t_j, …} in L, the node types are determined from the edge types t_j it contains; finally, the complete meta-path containing both the node type sequence and the edge type sequence is obtained;
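The outcome of step 2 can be illustrated with a compact depth-first sketch. It does not reproduce the layer-and-index bookkeeping of steps 2.1 to 2.3; instead it assumes a hypothetical nested-dictionary tree in which each node has a `targets` set and a `children` mapping keyed by edge type, and a hypothetical `SCHEMA` table that maps each edge type to its source and target node types.

```python
def edge_sequences(node, target, prefix=()):
    """Return the edge-type sequences of root-to-leaf paths whose leaf reaches target."""
    if not node["children"]:
        return [list(prefix)] if target in node["targets"] else []
    found = []
    for edge_type, child in node["children"].items():
        found.extend(edge_sequences(child, target, prefix + (edge_type,)))
    return found


# Hypothetical network schema: edge type -> (source node type, target node type).
SCHEMA = {
    "acts_in": ("Actor", "Movie"),
    "directed_by": ("Movie", "Director"),
}


def node_types(edge_seq, schema=SCHEMA):
    """Derive the node type sequence of a non-empty edge type sequence."""
    types = [schema[edge_seq[0]][0]]
    types += [schema[e][1] for e in edge_seq]
    return types
```

Each returned edge sequence corresponds to one l_i in the set L, and `node_types` performs the lookup described in step 2.4.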

step 3: calculating the importance of the meta-path; a formula for meta-path importance is first defined according to the factors that influence it; the importance is then calculated from the number of instance nodes in the leaf nodes of the greedy tree, using the following formula:

I_{s,t}(P) = S_{s,t}(P) · R_{s,t}(P) · Penalty(|P|), P ∈ P_{s→t} (4)

where the importance is composed of three factors: S_{s,t}(P), R_{s,t}(P) and Penalty(|P|);

3.1 calculating the length penalty function; the meta-path length |P| is obtained from the meta-path determined in step 2, and β^{|P|} is used as the penalty function, where β is an attenuation coefficient set to 0.5;

3.2 calculating the rarity of the meta-path;

The rarity function evaluates, in a given heterogeneous information network G = (V, E), how rare the meta-path P is among other node pairs similar to the input node pair <s, t>; D_{s,t} denotes the set of node pairs similar to the input node pair and is defined as:

D_{s,t} = D_t ∪ D_s (5)

where D_t does not include t and D_s does not include s; the rarity of the meta-path is calculated by formula (8);

3.3 calculating the meta-path strength; the support function of the meta-path importance is:

S_{s,t}(P) = Strength(P) · MNIs_{s,t}(P) (9)

where MNIs(P) is the minimum number of instances in the meta-path P, calculated as shown in formula (10), and P_i is the number of instances of the i-th node on the meta-path;

Strength(P) is the strength coefficient of the meta-path P, and formula (11) defines how it is calculated; suppose the node with the minimum number of instances obtained by formula (10) is A, the out-degree of node A is O(A), and the in-degree of node A is I(A); when node A is a movie node, its out-degree is calculated by formula (12), where p_A is the instance set of node A: the similarities between the vector of each node in the instance set of node A and the short text vector are summed to give the out-degree of node A;

3.4 calculating the importance of the meta-path; the length attenuation coefficient, the rarity and the strength of the meta-path are calculated in steps 3.1, 3.2 and 3.3 respectively, and the final meta-path importance is then calculated according to formula (4);
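How the three factors combine in formulas (4) and (9) can be shown with a short sketch. The rarity, the strength coefficient and the minimum instance count are taken as already-computed inputs here, because formulas (8), (10), (11) and (12) are not reproduced in the text; the function names are hypothetical.

```python
def length_penalty(path_len, beta=0.5):
    """Penalty(|P|) = beta ** |P|, with beta = 0.5 as stated in step 3.1."""
    return beta ** path_len


def importance(strength, mnis, rarity, path_len, beta=0.5):
    """Combine the factors as in formulas (9) and (4).

    strength, mnis and rarity are assumed to be precomputed elsewhere.
    """
    support = strength * mnis  # formula (9): S = Strength(P) * MNIs(P)
    return support * rarity * length_penalty(path_len, beta)  # formula (4)
```

For example, `importance(0.5, 4, 0.5, 2)` multiplies a support of 2.0 by a rarity of 0.5 and a length penalty of 0.25. Because β is below 1, each extra hop halves the importance, so long meta-paths need correspondingly higher support or rarity to rank as key meta-paths.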

step 4: generating query instances by combining multiple meta-paths; instance node pairs that conform to the semantics of a meta-path have higher similarity under that meta-path; therefore, to obtain query result instances, it suffices to find the node pairs with higher semantic similarity under each meta-path;

4.1 calculating the similarity of node pairs based on meta-paths; the similarity calculation formula of the node pairs according to different meta-paths is as follows:

where ρ_{e_i}(v_i, C_{i+1}) denotes the set of nodes x of type C_{i+1} that are connected to node v_i via edge e_i; P^{i…n} denotes the sub-path of the meta-path from node type C_i to C_n; α is a fixed parameter set to 0.5; when the type C_{i+1} of node x is movie, the sum Σ sim(V_x, V_Q) of the similarities between the short text and the movie text introductions is used in place of |ρ_{e_i}(v_i, C_{i+1})| in formula (13);

4.2 calculating the similarity between the node pair entities s and t; calculating the similarity sigma (s, t | P) between the entities s and t by using a linear aggregation function, taking the importance corresponding to the meta-path as the weight of the similarity, wherein the aggregation function is as follows:

where I_j denotes the importance corresponding to the meta-path P_j;
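The linear aggregation of step 4.2 can be sketched as a weighted sum. The aggregation formula itself is not reproduced in the text, so the unnormalised form used below is an assumption, and the function name is hypothetical.

```python
def aggregate_similarity(importances, similarities):
    """Weighted sum of per-meta-path similarities, weighted by importance I_j.

    importances[j] is I_j and similarities[j] is the similarity of (s, t)
    under meta-path P_j. No normalisation is applied (an assumption).
    """
    return sum(i * s for i, s in zip(importances, similarities))
```

If a bounded score is preferred, the result could be divided by `sum(importances)`; whether the original formula does so cannot be determined from the text.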

4.3 obtaining query instances from the similarity matrices; once the meta-path-based node similarities have been obtained, a similarity matrix is calculated for each meta-path; if the number of movie nodes is m, the similarity matrix has size m × m, and the similarity matrix of the meta-path P is denoted S_P:

A similarity matrix is constructed for each meta-path when that meta-path is first generated, and these matrices can be reused; when several meta-paths are combined in a query, only the similarity matrices of the corresponding meta-paths need to be selected; the indices and values of the entries that are non-zero at the same position in all selected matrices are recorded; the node pairs that satisfy the semantics of all the meta-paths are obtained from these indices, and their similarity is calculated to give the query result.
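The matrix selection described above amounts to intersecting the non-zero positions of several m × m matrices. The sketch below uses plain nested lists and a hypothetical function name; a real implementation would more likely use sparse matrices.

```python
def common_nonzero(matrices):
    """Return {(row, col): [value in each matrix]} for cells non-zero in every matrix."""
    m = len(matrices[0])
    result = {}
    for r in range(m):
        for c in range(m):
            values = [mat[r][c] for mat in matrices]
            if all(v != 0 for v in values):
                result[(r, c)] = values
    return result
```

Each key of the result identifies a node pair that satisfies every selected meta-path, and the stored values are the per-meta-path similarities that step 4.2 then aggregates.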

The steps of the heterogeneous network similar node query method based on the meta path are ended.

The invention provides a heterogeneous information network node query method based on meta-path by integrating the above technologies. In order to solve the problem that the traditional meta-path has no universality and cannot distinguish whether the meta-path is a key meta-path, an importance calculation method integrating three factors of meta-path length, rarity and strength is provided, and whether the generated meta-path is the 'key meta-path' is determined according to the importance of the meta-path. In addition, in order to make the generated meta-path be constrained by the short text semantics, the invention utilizes the movie text introduction information to carry out semantic matching on the movie text introduction and the short text description, thereby obtaining the meta-path which is constrained by the short text semantics and calculating the corresponding importance. And obtaining a similar node query result under the heterogeneous information network based on the computed meta-path and the importance thereof.

The invention has the following advantages: (1) The approach is novel. The invention uses meta-path importance to judge whether a meta-path is a key meta-path, which effectively overcomes the lack of universality of meta-paths specified by domain experts. (2) Meta-path semantics are enriched in multiple dimensions. During meta-path generation, semantics from the node attribute dimension are added: short text is used to constrain the movie content, so that the generated meta-path contains not only its own relational semantics but also semantics related to node attributes. (3) The algorithm is simple and quick to implement. By recursively expanding the greedy tree, the method extracts meta-paths from the user input in real time and calculates their importance without data labelling or model training, which greatly improves the efficiency of automatic meta-path generation.

Drawings

FIG. 1 is a general flow diagram of the process of the present invention.

Detailed Description

The invention defines the importance of a meta-path by combining factors such as its length, rarity and strength, and incorporates the short text describing the movie content when calculating the strength support function. Meta-paths are generated by expanding a greedy tree, and the process comprises four stages. The first stage generates the path greedy tree from the input source node, target node and short text description. The second stage generates the type sequence of each meta-path from the resulting path greedy tree. The third stage calculates the importance of each meta-path. The fourth stage combines multiple meta-paths to perform the node query.

The overall flow of the heterogeneous network similar node query method based on the meta-path is shown in fig. 1, and specifically includes the following steps:

1.3 performing semantic matching between the short text entered with the query and the movie text introduction; the film and television information network is a star-structured network: it has a central object, every other type of object is connected to the central object, and the attributes of the central object can influence all types of relationships in the network; in a film and television information network the central object is the movie, and connections between film personnel arise through movies; moreover, movie content carries rich semantics, which can highlight the characteristics of the connections between nodes from the perspective of node attributes; a movie introduction summarizes the movie content in a short text and can therefore be treated as short text data; the input query text is also short text data, so by finding the movie introductions that are semantically similar to the short text entered with the query, the movie content that conforms to the query semantics can be obtained and a meta-path that conforms to the short text semantics can be generated; the semantic matching procedure between the movie introduction and the short text entered by the user is as follows:

(1) segmenting the short text entered with the query into words using the open-source jieba word segmentation algorithm based on TextRank; the input short text is denoted Q, and after word segmentation it can be represented by the word sequence [q_0, q_1, …, q_i, …, q_n], where q_i is the i-th word and n is the length of the word sequence;

I_{s,t}(P) = S_{s,t}(P) · R_{s,t}(P) · Penalty(|P|), P ∈ P_{s→t} (4)

where the importance is composed of three factors: S_{s,t}(P), R_{s,t}(P) and Penalty(|P|);

3.2 calculating the rarity of the meta-path;

D_{s,t} = D_t ∪ D_s (5)

S_{s,t}(P) = Strength(P) · MNIs_{s,t}(P) (9)

where MNIs(P) is the minimum number of instances in the meta-path P, calculated as shown in formula (10), and P_i is the number of instances of the i-th node on the meta-path;

Strength(P) is the strength coefficient of the meta-path P, and formula (11) defines how it is calculated; suppose the node with the minimum number of instances obtained by formula (10) is A, the out-degree of node A is O(A), and the in-degree of node A is I(A); when node A is a movie node, its out-degree is calculated by formula (12), where p_A is the instance set of node A: the similarities between the vector of each node in the instance set of node A and the short text vector are summed to give the out-degree of node A;

where I_j denotes the importance corresponding to the meta-path P_j;

The embodiments described in this specification are merely illustrative of implementations of the inventive concept and the scope of the present invention should not be considered limited to the specific forms set forth in the embodiments but rather by the equivalents thereof as may occur to those skilled in the art upon consideration of the present inventive concept.

Claims

1. The heterogeneous network similar node query method based on the meta-path comprises the following steps:

1.2 recursively expanding the greedy tree; in the process of expanding the greedy tree, judging whether the next node is a movie node or not according to the edge type of the greedy tree; if so, performing the semantic matching process of the step 1.3; if not, continuing the process of recursively expanding the greedy tree in the step 1.2 until the target node appears in the value list of the greedy leaf node or the path reaches the length threshold value;

1.3, semantic matching is carried out on the short text and the movie text introduction which are input by the query; the film and television information network is a network with a star structure, wherein a central object exists, the other types of objects are connected with the central object, and the attribute of the central object can influence the relationship of all types in the network; in a film and television information network, the central object is a film, the contact among film persons is generated through the film, meanwhile, the film content contains rich semantics, and the semantics can highlight the characteristic of the connection relation among nodes from the aspect of node attributes; the movie introduction summarizes movie contents by using a short text, and can be understood as short text data; meanwhile, the input query text is short text data, so that a movie introduction similar to the short text query semantics input during query is found, movie contents conforming to the query semantics can be obtained, and a meta-path conforming to the short text semantics is generated; the step of semantic matching between the movie profile and the short text entered by the user comprises:

(1) segmenting the input short text into words using the open-source jieba word segmentation algorithm based on TextRank; the input short text is denoted Q, and after word segmentation it can be represented by the word sequence [q_0, q_1, …, q_i, …, q_n], where q_i is the i-th word and n is the length of the word sequence;

(7) obtaining the word vector V_{t_j} of each word t_j using a DSG model, and then calculating the weighted-average sentence vector V_T by formula (2);

I_{s,t}(P) = S_{s,t}(P) · R_{s,t}(P) · Penalty(|P|), P ∈ P_{s→t} (4)

where the importance is composed of three factors: S_{s,t}(P), R_{s,t}(P) and Penalty(|P|);

3.1 calculating the length penalty function; the meta-path length |P| is obtained from the meta-path determined in step 2, and β^{|P|} is used as the penalty function, where β is an attenuation coefficient set to 0.5;

3.2 calculating the rarity of the meta-path;

D_{s,t} = D_t ∪ D_s (5)

S_{s,t}(P) = Strength(P) · MNIs_{s,t}(P) (9)

where I_j denotes the importance corresponding to the meta-path P_j;