CN112380360A - Node query method based on meta-path in heterogeneous information network - Google Patents


Info

Publication number
CN112380360A
Authority
CN
China
Prior art keywords
node
path
meta
similarity
calculating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011260846.0A
Other languages
Chinese (zh)
Other versions
CN112380360B (en)
Inventor
汤颖
徐珊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT
Priority to CN202011260846.0A
Publication of CN112380360A
Application granted
Publication of CN112380360B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/383 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/322 Trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A meta-path-based method for querying similar nodes in a heterogeneous network comprises the following steps: 1. generating a path greedy tree: the greedy tree is expanded according to the input source node and short text description, and semantic matching of the short text is performed during expansion; 2. determining the meta-path sequence: the greedy tree is first traversed to obtain an edge type sequence, and the node type sequence is then determined from the edge type sequence; during traversal, the paths connecting the input node pair are separated from the greedy tree; 3. calculating the importance of each meta-path: a formula for meta-path importance is first defined according to the factors that influence it, and the importance is then calculated from the number of instance nodes in the leaf nodes of the greedy tree; 4. generating query instances by combining multiple meta-paths: instance node pairs that conform to a meta-path's semantics have higher similarity under that meta-path, so to obtain query result instances it is only necessary to find the node pairs with higher similarity under the semantics of each meta-path.

Description

Node query method based on meta-path in heterogeneous information network
Technical Field
The invention relates to a node query method based on a meta path and oriented to a heterogeneous information network.
Background
Real-world systems are typically composed of a large number of interacting components of multiple types, such as human social activities, communication systems, and biological networks. In such systems the components form a network by being connected to each other, and such a network is generally referred to as an information network. Most conventional information networks are homogeneous networks, that is, the nodes are entities of the same type and are connected through relationships of the same type. In practical applications, however, most information networks are heterogeneous, i.e. the nodes and relations in the network are not of a single type. If the types of nodes and edges are not distinguished, important semantic information contained in the network is often lost. Data in many fields exists in the form of heterogeneous information networks, such as academic literature information networks, medical information networks, and Twitter information networks.
In the heterogeneous information network, in order to represent different semantic relationships between two nodes, the concept of meta-path is proposed and applied to related tasks of heterogeneous information network mining. A meta-path is a sequence of node types and edge types that can be viewed as a simple graph schema that expresses different semantics. The meta-path can well express the connection semantics among different types of nodes in the heterogeneous information network, so that the meta-path is widely applied to various tasks of heterogeneous information network mining, in particular to a node query task based on meta-path similarity.
In most existing work, meta-paths are specified by domain experts for a particular application scenario, so they generally lack universality, and different tasks and data sets require different meta-paths. Some methods generate meta-paths automatically, for example by taking the meta-path length as a parameter and generating all meta-paths within a maximum length threshold. This prevents the number of meta-paths from growing without bound, but does not guarantee that every meta-path within the length threshold is a "critical" path. Automatic generation methods also mostly focus on representing the connection relationship between nodes as a meta-path, without considering how node attributes influence that relationship. The resulting meta-path representations limit the accuracy of node queries.
Disclosure of Invention
To ensure that the generated meta-paths are key meta-paths, and to overcome the problem that meta-path semantics do not include node attributes, the invention provides a key-meta-path-based node query method for a film and television heterogeneous information network with a star structure, in which key meta-paths are generated automatically by combining node attributes with meta-path importance.
The invention defines the importance of a meta-path by combining factors such as its length, rarity, and strength, and incorporates the short text describing movie content when calculating the strength support function. Meta-paths are generated by expanding a greedy tree, in four stages: the first stage generates the path greedy tree from the input source node, target node, and short text description; the second stage generates the type sequence of each meta-path from the resulting greedy tree; the third stage calculates the importance of each meta-path; the fourth stage combines multiple meta-paths to perform the node query.
The general flow of the heterogeneous network similar node query method based on the meta-path specifically comprises the following steps:
step 1: generating a path greedy tree; expanding the greedy tree according to the input source node and the short text description; performing semantic matching of short texts in the greedy tree expanding process;
1.1 building the greedy tree root node; each object node of the greedy tree holds two pieces of information. The first is the list of node pairs generated during path expansion, stored as a dictionary in which the source node is the key and the target nodes are the value. The second is a flag indicating whether the current greedy tree object can be expanded further: True means the object can continue to be expanded downwards, and False means the object is the end point of a path or the meta-path length threshold has been reached. The edges connecting greedy tree objects are labelled with the edge types of the heterogeneous information network. The root node of the greedy tree has not yet been expanded, so the value corresponding to its source node is null;
1.2 recursively expanding the greedy tree; during expansion, the edge type is used to judge whether the next node is a movie node. If it is a movie node, the semantic matching process of 1.3 is performed; if it is not, the recursive expansion of 1.2 continues until the target node appears in the value list of a greedy tree leaf node or the path reaches the length threshold;
1.3 semantically matching the short text entered with the query against the movie text introductions; the film and television information network has a star structure: there is a central object, every other type of object is connected to it, and the attributes of the central object can influence all types of relationships in the network. In a film and television information network the central object is the movie. Relationships among film personnel arise through movies, and movie content carries rich semantics that highlight, from the perspective of node attributes, the characteristics of the connections between nodes. A movie introduction summarizes the movie content in a short text and can be treated as short text data, and the query text is also short text data. Finding the movie introductions that are semantically similar to the query short text therefore yields the movies whose content matches the query semantics, from which meta-paths conforming to the short text semantics are generated. The semantic matching between a movie introduction and the short text entered by the user proceeds as follows:
(1) segment the query short text with an open-source word segmentation algorithm based on TextRank; the input short text is denoted Q, and after segmentation it is represented by the word sequence $[q_0, q_1, \dots, q_i, \dots, q_n]$, where $q_i$ is the i-th word and n is the length of the word sequence;
(2) obtain the word vector of each word using a Directional Skip-Gram model (DSG for short), and denote it $V_{q_i}$;
(3) after the word vectors are obtained, calculate their mean by formula (1) to obtain the sentence vector, where the sum runs over all n words of the sequence;
$V_Q = \frac{1}{n} \sum_{i} V_{q_i}$ (1)
(4) segment the text introduction of the movie; the introduction is denoted T, the word sequence after segmentation is $[t_0, \dots, t_j, \dots, t_m]$, and the TF-IDF value of each word is used as its weight, giving the weight sequence $[w_0, \dots, w_j, \dots, w_m]$;
(5) apply named entity recognition to the personal names in the movie text introduction, and delete the words recognized as names from the segmentation result;
(6) perform part-of-speech analysis on the segmented words of the movie introduction, filter out verbs and modifiers such as adjectives and adverbs, and keep the nouns;
(7) obtain the word vector $V_{t_j}$ of each word $t_j$ using the DSG model, then calculate the weighted-average sentence vector $V_T$ by formula (2);
[Formula (2): equation image missing; weighted average of the word vectors $V_{t_j}$ using the weights $w_j$]
(8) the similarity of the two texts is obtained from the cosine similarity measure, calculated as:
$\mathrm{sim}(V_Q, V_T) = \frac{V_Q \cdot V_T}{\lVert V_Q \rVert \, \lVert V_T \rVert}$ (3)
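The semantic matching of step 1.3 can be illustrated with a minimal Python sketch. The toy two-dimensional vectors stand in for DSG word vectors and the weights stand in for TF-IDF values; both are made up for illustration. The weighted form `sum(w_j * V_tj) / sum(w_j)` is an assumption about formula (2), whose equation image is not available, and the function names are hypothetical.

```python
import math

def mean_vector(vectors):
    """Formula (1): unweighted mean of the query word vectors."""
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]

def weighted_mean_vector(vectors, weights):
    """Assumed form of formula (2): sum(w_j * V_tj) / sum(w_j)."""
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[k] for v, w in zip(vectors, weights)) / total
            for k in range(dim)]

def cosine_similarity(a, b):
    """Formula (3): cosine of the angle between two sentence vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Toy stand-ins for DSG word vectors and TF-IDF weights (illustrative only).
query_vectors = [[1.0, 0.0], [0.0, 1.0]]
movie_vectors = [[1.0, 0.0], [0.0, 1.0]]
movie_weights = [3.0, 1.0]

v_q = mean_vector(query_vectors)                          # [0.5, 0.5]
v_t = weighted_mean_vector(movie_vectors, movie_weights)  # [0.75, 0.25]
similarity = cosine_similarity(v_q, v_t)
```

In a real pipeline the word segmentation, name filtering, and part-of-speech filtering of sub-steps (1) and (4) to (6) would run before the vectors are looked up; they are omitted here because they depend on external tools.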
step 2: determining the meta-path sequence; the greedy tree is first traversed to obtain edge type sequences, and the node type sequences are then determined from the edge type sequences. Traversing the generated greedy tree separates the paths that connect the input node pair from the rest of the tree. L is the path set, in which all possible meta-path edge sequences are stored. The root node is recorded as the j-th node of the i-th layer, with i = 0 and j = 0;
2.1 traversing downwards from the root node; the root node is the current node, and the j-th node of layer i + 1 of the greedy tree, with j = 0, is the next node. The edge connecting the current node and the next node is put into the current path sequence l, and the length of the next node's dictionary value, i.e. of its target node set, is recorded as the out-degree of that node;
2.2 updating the current node to the next node of the previous step; the next node is now the j-th leaf node of layer i + 1 of the greedy tree, with j = 0. If the flag of the next node is True, the edge connecting the current node and the next node is put into the current path sequence l, the length of the next node's target node set is recorded as its out-degree, the current path sequence l is stored in the set L, j is set to j + 1, and step 2.3 is performed. Otherwise, it is checked whether the next node has an extension edge. If it has, the edge connecting the current node and the next node is put into the current path sequence l, the length of the next node's target node set is recorded as its out-degree, and step 2.2 is repeated with i = i + 1 and j = 0. If the next node has no extension edge, j is set to j + 1 and step 2.3 is performed;
2.3 updating the next node to the j-th leaf node of layer i + 1 of the greedy tree; if the flag of the next node is True, the edge connecting the current node and the next node is put into the current path sequence l, the length of the next node's target node set is recorded as its out-degree, the current path sequence l is stored in the set L, j is set to j + 1, and step 2.3 is performed. Otherwise, it is checked whether the next node has an extension edge. If it has, the edge connecting the current node and the next node is put into the current path sequence l, the length of the next node's target node set is recorded as its out-degree, and step 2.2 is repeated with i = i + 1 and j = 0. If the next node has no extension edge, j is set to j + 1 and step 2.3 is performed;
2.4 after the traversal is complete, the meta-path set $L = \{l_0, \dots, l_i, \dots\}$ containing the edge type sequences is obtained; for each meta-path $l_i = \{t_0, \dots, t_j, \dots\}$ in L, the node types are determined from its edge types $t_j$; finally, the complete meta-path containing both the node type sequence and the edge type sequence is obtained;
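The greedy tree object of step 1.1 and the traversal of step 2 can be sketched roughly as follows. This is a simplified recursive depth-first walk, not the index-based procedure of steps 2.1 to 2.3. The class and function names, the example edge types, and the `schema` mapping from an edge type to its (source type, target type) pair are all hypothetical. The sketch follows the reading of steps 2.2 and 2.3 in which a node whose flag is True ends a path that is stored in L.

```python
from dataclasses import dataclass, field

@dataclass
class GreedyTreeNode:
    # Node pairs reached so far: source node -> list of target nodes (step 1.1).
    pairs: dict = field(default_factory=dict)
    # Flag as read in steps 2.2/2.3: True marks a node whose path is stored.
    flag: bool = False
    # Children keyed by the edge type that labels the connecting edge.
    children: dict = field(default_factory=dict)

def collect_edge_sequences(node, edges=(), degrees=()):
    """Depth-first walk returning (edge sequence, out-degrees) per stored path."""
    results = []
    for edge_type, child in node.children.items():
        # Out-degree of the child: size of its target node set.
        out_degree = sum(len(targets) for targets in child.pairs.values())
        next_edges = edges + (edge_type,)
        next_degrees = degrees + (out_degree,)
        if child.flag:
            results.append((next_edges, next_degrees))
        elif child.children:
            results.extend(collect_edge_sequences(child, next_edges, next_degrees))
    return results

def node_types_for(edge_sequence, schema):
    """Step 2.4: derive the node type sequence from the edge type sequence."""
    types = [schema[edge_sequence[0]][0]]
    for edge_type in edge_sequence:
        types.append(schema[edge_type][1])
    return types

# Hypothetical example: an actor, two movies, and one director.
root = GreedyTreeNode(pairs={"actor_a": None})  # root value is null (step 1.1)
movie = GreedyTreeNode(pairs={"actor_a": ["m1", "m2"]})
director = GreedyTreeNode(pairs={"actor_a": ["director_b"]}, flag=True)
movie.children["directed_by"] = director
root.children["acted_in"] = movie

schema = {"acted_in": ("Actor", "Movie"), "directed_by": ("Movie", "Director")}
paths = collect_edge_sequences(root)
node_types = node_types_for(paths[0][0], schema)
```

Keeping the out-degrees next to each edge sequence mirrors the bookkeeping in steps 2.1 to 2.3, where the size of each node's target set is recorded as the path is built.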
step 3: calculating the importance of the meta-path; a formula for meta-path importance is first defined according to the factors that influence it, and the importance is then calculated from the number of instance nodes in the leaf nodes of the greedy tree. The importance of a meta-path is calculated as:
$I_{s,t}(P) = S_{s,t}(P) \cdot R_{s,t}(P) \cdot \mathrm{Penalty}(|P|), \quad P \in P_{s \to t}$ (4)
where the importance is the product of three components: the strength support $S_{s,t}(P)$, the rarity $R_{s,t}(P)$, and the length penalty $\mathrm{Penalty}(|P|)$;
3.1 calculating the length penalty function; the meta-path length $|P|$ is obtained from the meta-path determined in step 2, and $\beta^{|P|}$ is used as the penalty function, where β is an attenuation coefficient set to 0.5;
3.2 calculating the rarity of the meta-path;
The rarity function evaluates, in a given heterogeneous information network $G = (V, E)$, how rare the meta-path P is among the other node pairs that are similar to the input node pair <s, t>. $D_{s,t}$ denotes the set of node pairs similar to the input node pair and is defined as:
$D_{s,t} = D_t \cup D_s$ (5)
where
[Formula (6): equation image missing]
[Formula (7): equation image missing]
$D_t$ does not include t and $D_s$ does not include s. The rarity of the meta-path is calculated by formula (8);
[Formula (8): equation image missing]
3.3 calculating the meta-path strength; the meta-path strength support function is:
$S_{s,t}(P) = \mathrm{Strength}(P) \cdot \mathrm{MNI}_{s,t}(P)$ (9)
where $\mathrm{MNI}_{s,t}(P)$ is the minimum number of instances over the nodes of meta-path P, calculated by formula (10), and $p_i$ is the number of instances of the i-th node on the meta-path;
$\mathrm{MNI}_{s,t}(P) = \min_i \, p_i$ (10)
$\mathrm{Strength}(P)$ is the strength coefficient of meta-path P, calculated as defined by formula (11). Let A be the node with the minimum number of instances obtained from formula (10), with out-degree O(A) and in-degree I(A). When node A is a movie node, its out-degree is calculated by formula (12), where $p_A$ is the instance set of node A: the similarities between the vector of each node in the instance set of A and the short text vector are summed to give the out-degree of A;
[Formula (11): equation image missing]
$O(A) = \sum_{x \in p_A} \mathrm{sim}(V_x, V_Q)$ (12)
3.4 calculating the importance of the meta-path; the length attenuation coefficient, the rarity, and the strength of the meta-path are calculated in steps 3.1, 3.2, and 3.3 respectively, and the final meta-path importance is then calculated by formula (4);
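How the three factors of step 3 combine can be shown with a short Python sketch of formulas (4) and (9) and the penalty of step 3.1. The equation images for the rarity and the strength coefficient are not available, so the sketch takes both as already-computed inputs. The function names and the numeric values are illustrative assumptions, and the path length is assumed to be the number of edges on the meta-path.

```python
def length_penalty(path_length, beta=0.5):
    """Step 3.1: Penalty(|P|) = beta ** |P|, with beta = 0.5 as in the text."""
    return beta ** path_length

def min_instance_count(instance_counts):
    """Minimum number of instances over the nodes of the meta-path (MNI)."""
    return min(instance_counts)

def meta_path_importance(strength, rarity, instance_counts, path_length, beta=0.5):
    """Formula (4): I = S * R * Penalty, with S = Strength * MNI (formula (9))."""
    support = strength * min_instance_count(instance_counts)
    return support * rarity * length_penalty(path_length, beta)

# Illustrative values only: strength and rarity are assumed precomputed.
importance = meta_path_importance(
    strength=0.8,
    rarity=0.5,
    instance_counts=[4, 2, 3],  # instances per node position on the meta-path
    path_length=2,              # assumed to count edges
)
```

Because the penalty halves with every additional edge when β = 0.5, longer meta-paths need proportionally higher strength or rarity to rank above shorter ones.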
step 4: generating query instances by combining multiple meta-paths; instance node pairs that conform to a meta-path's semantics have higher similarity under that meta-path, so to obtain query result instances it is only necessary to find the node pairs with higher similarity under the semantics of each meta-path;
4.1 calculating the meta-path-based similarity of node pairs; the similarity of a node pair under a given meta-path is calculated as:
[Formula (13): equation image missing]
where $\rho_{e_i}(v_i, C_{i+1})$ denotes the set of nodes x of type $C_{i+1}$ connected to node $v_i$ by edge $e_i$; $P_{i \dots n}$ denotes the part of the meta-path from node type $C_i$ to $C_n$; α is a fixed parameter set to 0.5. When the type $C_{i+1}$ of node x is movie, the similarity sum $\sum \mathrm{sim}(V_x, V_Q)$ between the short text and the movie text introductions is used in place of $|\rho_{e_i}(v_i, C_{i+1})|$ in formula (13);
4.2 calculating the similarity between the entities s and t of a node pair; the similarity σ(s, t | P) between entities s and t is calculated with a linear aggregation function that uses the importance of each meta-path as the weight of its similarity. The aggregation function is:
[Formula (14): equation image missing]
where $I_j$ denotes the importance of meta-path $P_j$;
4.3 obtaining query instances from the similarity matrices; once the meta-path-based node similarities have been obtained, a similarity matrix is calculated for each meta-path. If there are m film and television nodes, the similarity matrix has size m × m; the similarity matrix of meta-path P is denoted $S_P$:
[Formula (15): equation image missing]
The similarity matrix of each meta-path is constructed when the meta-path is first generated and can be reused afterwards. When several meta-paths are combined in a query, only the similarity matrices of those meta-paths need to be selected. The indices and values of the positions that are non-zero in all of the selected matrices are recorded; these indices give the node pairs that satisfy the semantics of every meta-path, and calculating the similarity of those node pairs yields the query result.
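The matrix combination of steps 4.2 and 4.3 might look like the following Python sketch. The equation image for formula (14) is not available, so the importance-weighted sum used here is an assumed form of the linear aggregation function. The matrices, the importance values, and the function name are made up for illustration.

```python
def combine_meta_paths(matrices, importances):
    """Return {(i, j): score} for positions that are non-zero in every matrix.

    The score is an importance-weighted sum of the per-meta-path similarities,
    an assumed form of the linear aggregation of step 4.2.
    """
    size = len(matrices[0])
    results = {}
    for i in range(size):
        for j in range(size):
            values = [matrix[i][j] for matrix in matrices]
            if all(value != 0 for value in values):
                results[(i, j)] = sum(
                    weight * value for weight, value in zip(importances, values)
                )
    return results

# Illustrative 3 x 3 similarity matrices for two meta-paths.
s_p1 = [[1.0, 0.5, 0.0],
        [0.5, 1.0, 0.2],
        [0.0, 0.2, 1.0]]
s_p2 = [[1.0, 0.0, 0.0],
        [0.0, 1.0, 0.4],
        [0.0, 0.4, 1.0]]

scores = combine_meta_paths([s_p1, s_p2], importances=[0.5, 0.25])
# Rank node pairs by aggregated similarity, highest first.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Node pair (0, 1) is dropped because it is zero under the second meta-path, which matches the requirement that a result must satisfy the semantics of every selected meta-path.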
This concludes the steps of the meta-path-based method for querying similar nodes in a heterogeneous network.
Integrating the above techniques, the invention provides a meta-path-based node query method for heterogeneous information networks. To address the problems that conventional meta-paths lack universality and that key meta-paths cannot be told apart from others, an importance measure combining three factors (meta-path length, rarity, and strength) is proposed, and the importance of a generated meta-path determines whether it is a "key meta-path". In addition, so that the generated meta-paths are constrained by the short text semantics, the invention semantically matches the movie text introductions against the short text description, obtaining meta-paths constrained by the short text semantics together with their importance. The similar-node query result in the heterogeneous information network is then obtained from the computed meta-paths and their importance.
The advantages of the invention are: (1) The approach is novel. The invention uses meta-path importance to judge whether a meta-path is a key meta-path, which overcomes the lack of universality of meta-paths specified by domain experts. (2) Meta-path semantics are enriched in multiple dimensions. Semantics from the node attribute dimension are added during meta-path generation: short text is used to constrain the movie content, so that the generated meta-paths carry not only relationship semantics but also semantics related to node attributes. (3) The algorithm is simple and fast to implement. The method extracts meta-paths from the user input in real time and calculates their importance by recursively expanding the greedy tree; no data labelling or model training is needed, which greatly improves the efficiency of automatic meta-path generation.
Drawings
FIG. 1 is a general flow diagram of the process of the present invention.
Detailed Description
To ensure that the generated meta-paths are key meta-paths, and to overcome the problem that meta-path semantics do not include node attributes, the invention provides a key-meta-path-based node query method for a film and television heterogeneous information network with a star structure, in which key meta-paths are generated automatically by combining node attributes with meta-path importance.
The invention defines the importance of a meta-path by combining factors such as its length, rarity, and strength, and incorporates the short text describing movie content when calculating the strength support function. Meta-paths are generated by expanding a greedy tree, in four stages. The first stage generates the path greedy tree from the input source node, target node, and short text description. The second stage generates the type sequence of each meta-path from the resulting greedy tree. The third stage calculates the importance of each meta-path. The fourth stage combines multiple meta-paths to perform the node query.
The overall flow of the heterogeneous network similar node query method based on the meta-path is shown in fig. 1, and specifically includes the following steps:
step 1: generating a path greedy tree; expanding the greedy tree according to the input source node and the short text description; performing semantic matching of short texts in the greedy tree expanding process;
1.1 building the greedy tree root node; each object node of the greedy tree holds two pieces of information. The first is the list of node pairs generated during path expansion, stored as a dictionary in which the source node is the key and the target nodes are the value. The second is a flag indicating whether the current greedy tree object can be expanded further: True means the object can continue to be expanded downwards, and False means the object is the end point of a path or the meta-path length threshold has been reached. The edges connecting greedy tree objects are labelled with the edge types of the heterogeneous information network. The root node of the greedy tree has not yet been expanded, so the value corresponding to its source node is null;
1.2 recursively expanding the greedy tree; during expansion, the edge type is used to judge whether the next node is a movie node. If it is a movie node, the semantic matching process of 1.3 is performed; if it is not, the recursive expansion of 1.2 continues until the target node appears in the value list of a greedy tree leaf node or the path reaches the length threshold;
1.3 semantically matching the short text entered with the query against the movie text introductions; the film and television information network has a star structure: there is a central object, every other type of object is connected to it, and the attributes of the central object can influence all types of relationships in the network. In a film and television information network the central object is the movie. Relationships among film personnel arise through movies, and movie content carries rich semantics that highlight, from the perspective of node attributes, the characteristics of the connections between nodes. A movie introduction summarizes the movie content in a short text and can be treated as short text data, and the query text is also short text data. Finding the movie introductions that are semantically similar to the query short text therefore yields the movies whose content matches the query semantics, from which meta-paths conforming to the short text semantics are generated. The semantic matching between a movie introduction and the short text entered by the user proceeds as follows:
(1) segment the query short text with an open-source word segmentation algorithm based on TextRank; the input short text is denoted Q, and after segmentation it is represented by the word sequence $[q_0, q_1, \dots, q_i, \dots, q_n]$, where $q_i$ is the i-th word and n is the length of the word sequence;
(2) obtain the word vector of each word using a Directional Skip-Gram model (DSG for short), and denote it $V_{q_i}$;
(3) after the word vectors are obtained, calculate their mean by formula (1) to obtain the sentence vector, where the sum runs over all n words of the sequence;
$V_Q = \frac{1}{n} \sum_{i} V_{q_i}$ (1)
(4) segment the text introduction of the movie; the introduction is denoted T, the word sequence after segmentation is $[t_0, \dots, t_j, \dots, t_m]$, and the TF-IDF value of each word is used as its weight, giving the weight sequence $[w_0, \dots, w_j, \dots, w_m]$;
(5) apply named entity recognition to the personal names in the movie text introduction, and delete the words recognized as names from the segmentation result;
(6) perform part-of-speech analysis on the segmented words of the movie introduction, filter out verbs and modifiers such as adjectives and adverbs, and keep the nouns;
(7) obtain the word vector $V_{t_j}$ of each word $t_j$ using the DSG model, then calculate the weighted-average sentence vector $V_T$ by formula (2);
[Formula (2): equation image missing; weighted average of the word vectors $V_{t_j}$ using the weights $w_j$]
(8) the similarity of the two texts is obtained from the cosine similarity measure, calculated as:
$\mathrm{sim}(V_Q, V_T) = \frac{V_Q \cdot V_T}{\lVert V_Q \rVert \, \lVert V_T \rVert}$ (3)
step 2: determining a meta-path sequence; firstly traversing a greedy tree to obtain an edge type sequence, and then determining a node type sequence according to the edge type sequence; traversing the generated greedy tree, and separating a path connecting the input node pairs from the greedy tree; l is a path set, and all possible meta-path edge sequences are stored in L; recording the root node as the jth node of the ith layer, wherein i is 0, and j is 0;
2.1 traversing downward from the root node; the root node is the current node, and the $j$-th node of layer $i+1$ of the greedy tree is the next node, with $j = 0$; the edge connecting the current node and the next node is put into the current path sequence $l$, and the length of the dictionary value of the next node, i.e., the size of its target node set, is recorded as the out-degree of the node;
2.2 updating the current node to the next node of the previous step, the next node now being the $j$-th leaf node of layer $i+1$ of the greedy tree, with $j = 0$; if the mark of the next node is True, the edge connecting the current node and the next node is put into the current path sequence $l$, and the size of the target node set of the next node is recorded as its out-degree; the current path sequence $l$ is stored in the set $L$, $j$ is set to $j + 1$, and step 2.3 is performed; otherwise, it is checked whether the next node has an extension edge; if so, the edge connecting the current node and the next node is put into the current path sequence $l$, the size of the target node set of the next node is recorded as its out-degree, $i$ is set to $i + 1$ and $j$ to 0, and step 2.2 is repeated; if the next node has no extension edge, $j$ is set to $j + 1$ and step 2.3 is performed;
2.3 updating the next node to the $j$-th leaf node of layer $i+1$ of the greedy tree; if the mark of the next node is True, the edge connecting the current node and the next node is put into the current path sequence $l$, and the size of the target node set of the next node is recorded as its out-degree; the current path sequence $l$ is stored in the set $L$, $j$ is set to $j + 1$, and step 2.3 is performed again; otherwise, it is checked whether the next node has an extension edge; if so, the edge connecting the current node and the next node is put into the current path sequence $l$, the size of the target node set of the next node is recorded as its out-degree, $i$ is set to $i + 1$ and $j$ to 0, and step 2.2 is repeated; if the next node has no extension edge, $j$ is set to $j + 1$ and step 2.3 is performed again;
2.4 after the traversal is complete, the meta-path set $L = \{l_0, \dots, l_i, \dots\}$ containing the edge type sequences is obtained; for each meta-path $l_i = \{t_0, \dots, t_j, \dots\}$ in $L$, the node types are determined according to the edge types $t_j$ therein; finally, the complete meta-paths containing both the node type sequence and the edge type sequence are obtained;
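The traversal of step 2 can be sketched as a depth-first walk. The code below is an illustration under loud assumptions: `TreeNode`, its `reaches_target` field, `collect_edge_sequences`, `node_types_for` and the `edge_schema` mapping are hypothetical names, and plain recursion replaces the explicit layer index i and sibling index j bookkeeping of steps 2.1 to 2.3.

```python
from dataclasses import dataclass, field

@dataclass
class TreeNode:
    # Hypothetical greedy-tree object: node pairs plus a path-closing mark.
    pairs: dict = field(default_factory=dict)     # source node -> target nodes
    reaches_target: bool = False                  # assumed: this node closes a path
    children: list = field(default_factory=list)  # [(edge_type, TreeNode), ...]

def collect_edge_sequences(root):
    """Return every edge-type sequence that ends at a path-closing node."""
    sequences = []

    def walk(node, prefix):
        for edge_type, child in node.children:
            current = prefix + [edge_type]
            if child.reaches_target:
                sequences.append(current)  # store the finished sequence l in L
            else:
                walk(child, current)       # follow extension edges, if any

    walk(root, [])
    return sequences

def node_types_for(edge_sequence, edge_schema):
    """Derive the node-type sequence from (source type, target type) pairs."""
    types = [edge_schema[edge_sequence[0]][0]]
    for edge_type in edge_sequence:
        types.append(edge_schema[edge_type][1])
    return types

# Toy movie network (hypothetical edge names and types).
edge_schema = {
    "stars_in": ("Actor", "Movie"),
    "starred_by": ("Movie", "Actor"),
    "directed_by": ("Movie", "Director"),
}
target_leaf = TreeNode(reaches_target=True)
dead_end = TreeNode()  # no extension edge and not a target
movie = TreeNode(children=[("starred_by", target_leaf), ("directed_by", dead_end)])
root = TreeNode(children=[("stars_in", movie)])

edge_sequences = collect_edge_sequences(root)
node_sequences = [node_types_for(seq, edge_schema) for seq in edge_sequences]
print(edge_sequences, node_sequences)
```

Recursion keeps the sketch short; an iterative version with explicit i and j counters would follow the numbered steps more literally.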
Step 3: calculating the importance of the meta-paths; first, a formula for meta-path importance is defined according to the factors that influence it; the importance of a meta-path is calculated from the number of instance nodes in the leaf nodes of the greedy tree, and the importance formula is:
$$I_{s,t}(P) = S_{s,t}(P) \cdot R_{s,t}(P) \cdot \mathrm{Penalty}(|P|), \quad P \in P_{s \to t} \qquad (4)$$
where the importance is the product of three factors: $S_{s,t}(P)$, $R_{s,t}(P)$ and $\mathrm{Penalty}(|P|)$;
3.1 calculating the length penalty function; the path length $|P|$ is obtained from the meta-path found in step 2, and $\beta^{|P|}$ is used as the penalty function, where $\beta$ is an attenuation coefficient set to 0.5;
3.2 calculating the rarity of the meta-path;
the rarity function evaluates, in a given heterogeneous information network $G = (V, E)$, how rare the meta-path $P$ is among other node pairs similar to the input node pair $\langle s, t \rangle$; $D_{s,t}$ denotes the set of node pairs similar to the input node pair and is defined as:
$$D_{s,t} = D_t \cup D_s \qquad (5)$$
where
Figure BDA0002774593100000091
Figure BDA0002774593100000092
$D_t$ does not include $t$, and $D_s$ does not include $s$; the rarity of the meta-path is calculated by formula (8);
Figure BDA0002774593100000093
3.3 calculating the meta-path strength; the support function of meta-path importance is:
$$S_{s,t}(P) = \mathrm{Strength}(P) \cdot \mathrm{MNIs}_{s,t}(P) \qquad (9)$$
where $\mathrm{MNIs}_{s,t}(P)$ is the minimum number of instances among the nodes on the meta-path $P$, calculated as shown in formula (10), and $p_i$ is the number of instances of the $i$-th node on the meta-path;
Figure BDA0002774593100000094
$\mathrm{Strength}(P)$ calculates the strength coefficient of the meta-path $P$, and formula (11) defines the calculation method; assuming that the node with the minimum number of instances obtained by formula (10) is $A$, the out-degree of node $A$ is $O(A)$ and its in-degree is $I(A)$; when node $A$ is a movie node, its out-degree is calculated by formula (12), where $p_A$ is the instance set of node $A$: the similarities between the vector of each node in the instance set of $A$ and the short-text vector are summed to obtain the out-degree of node $A$;
Figure BDA0002774593100000095
Figure BDA0002774593100000096
3.4 calculating the importance of the meta-path; the length attenuation coefficient, the rarity and the strength of the meta-path are calculated in steps 3.1, 3.2 and 3.3 respectively, and the final meta-path importance is then calculated according to formula (4);
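As a worked illustration of formulas (4) and (9), the sketch below multiplies the three factors together. The strength coefficient and the rarity are passed in as plain numbers because formulas (8) and (11) are available only as figures; every function name and all input values are hypothetical.

```python
def min_instance_count(instance_counts):
    """Smallest number of instances among the nodes on a meta-path."""
    return min(instance_counts)

def support(strength, instance_counts):
    """Support term: strength coefficient times the minimum instance count."""
    return strength * min_instance_count(instance_counts)

def length_penalty(path_length, beta=0.5):
    """Length penalty beta ** |P|; longer meta-paths are discounted more."""
    return beta ** path_length

def meta_path_importance(strength, rarity, instance_counts, path_length, beta=0.5):
    """Importance as the product of support, rarity and length penalty."""
    return (
        support(strength, instance_counts)
        * rarity
        * length_penalty(path_length, beta)
    )

# Hypothetical inputs: strength 2.0, rarity 0.5, three node types, two edges.
importance = meta_path_importance(
    strength=2.0, rarity=0.5, instance_counts=[4, 3, 7], path_length=2
)
print(importance)  # 2.0 * 3 * 0.5 * 0.25 = 0.75
```

With $\beta = 0.5$, each additional edge halves the importance, so a short meta-path outranks a longer one that has the same support and rarity.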
Step 4: generating query instances by combining multiple meta-paths; instance node pairs that conform to the semantics of a meta-path have high similarity under that meta-path; therefore, to obtain the query result instances, it suffices to find the node pairs with high semantic similarity under every meta-path;
4.1 calculating the similarity of node pairs based on meta-paths; the similarity calculation formula of the node pairs according to different meta-paths is as follows:
Figure BDA0002774593100000101
where $\rho_{e_i}(v_i, C_{i+1})$ denotes the set of nodes $x$ of type $C_{i+1}$ that are connected to node $v_i$ by edge $e_i$; $P_{i \dots n}$ denotes the part of the meta-path from node type $C_i$ to $C_n$; $\alpha$ is a fixed parameter set to 0.5; when the type $C_{i+1}$ of node $x$ is movie, the similarity sum $\sum \mathrm{sim}(V_x, V_Q)$ between the short text and the movie text introductions is used instead of $|\rho_{e_i}(v_i, C_{i+1})|$ in formula (13);
4.2 calculating the similarity between the entities $s$ and $t$ of a node pair; the similarity $\sigma(s, t \mid P)$ between the entities $s$ and $t$ is calculated with a linear aggregation function that uses the importance of each meta-path as the weight of its similarity; the aggregation function is:
Figure BDA0002774593100000102
where $I_j$ denotes the importance of the meta-path $P_j$;
4.3 obtaining the query instances from the similarity matrices; after the meta-path-based node similarities are obtained, a similarity matrix is calculated for each meta-path; if the number of film and television nodes is $m$, the similarity matrix has size $m \times m$, and the similarity matrix of the meta-path $P$ is denoted $S_P$:
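The linear aggregation of step 4.2 can be illustrated with a short sketch. It assumes a plain importance-weighted sum with no normalisation, which the text does not confirm; the meta-path labels, the scores and the function name are hypothetical.

```python
def aggregate_similarity(per_path_similarity, importance):
    """Importance-weighted sum of the per-meta-path similarities of one node pair.

    Assumption: no normalisation by the total importance is applied.
    """
    return sum(
        importance[path] * score for path, score in per_path_similarity.items()
    )

# Hypothetical similarities of one node pair (s, t) under two meta-paths.
per_path_similarity = {"Actor-Movie-Actor": 0.8, "Actor-Movie-Director-Movie-Actor": 0.4}
importance = {"Actor-Movie-Actor": 0.75, "Actor-Movie-Director-Movie-Actor": 0.25}

score = aggregate_similarity(per_path_similarity, importance)
print(round(score, 3))
```

If the importances do not sum to one, dividing by their total would keep the aggregated score on the same scale as the individual similarities; that variant is likewise an assumption.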
Figure BDA0002774593100000103
a similarity matrix is constructed for each meta-path the first time that meta-path is generated, and these matrices can be reused; when several meta-paths are combined in a query, only the similarity matrices of the corresponding meta-paths need to be selected; the indexes and values of the positions whose values are non-zero in all of the selected matrices are recorded; the node pairs that satisfy the semantics of all the meta-paths are obtained from those indexes, and the similarity of each such node pair is calculated to give the query result.
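The matrix combination described above can be sketched as follows. The sketch uses nested Python lists instead of a matrix library, assumes the final score is the importance-weighted sum of the per-path values, and uses hypothetical names and toy data throughout.

```python
def combine_meta_paths(matrices, importances):
    """Score the node pairs whose entries are non-zero in every matrix.

    matrices:    list of m x m similarity matrices (nested lists), one per meta-path
    importances: one weight per matrix, in the same order
    Returns {(i, j): aggregated similarity}.
    """
    m = len(matrices[0])
    results = {}
    for i in range(m):
        for j in range(m):
            values = [matrix[i][j] for matrix in matrices]
            if all(value != 0 for value in values):
                results[(i, j)] = sum(w * v for w, v in zip(importances, values))
    return results

def ranked(results):
    """Node pairs sorted from most to least similar."""
    return sorted(results.items(), key=lambda item: item[1], reverse=True)

# Toy similarity matrices for m = 3 nodes under two meta-paths.
S_P1 = [[1.0, 0.5, 0.0],
        [0.5, 1.0, 0.25],
        [0.0, 0.25, 1.0]]
S_P2 = [[1.0, 0.0, 0.5],
        [0.0, 1.0, 0.5],
        [0.5, 0.5, 1.0]]

results = combine_meta_paths([S_P1, S_P2], [0.5, 0.5])
print(ranked(results))
```

Because each matrix is built once and only read here, the same matrices can serve any later combination of meta-paths, which is the reuse the text describes.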
This concludes the steps of the meta-path-based method for querying similar nodes in a heterogeneous network.
The embodiments described in this specification are merely illustrative of implementations of the inventive concept and the scope of the present invention should not be considered limited to the specific forms set forth in the embodiments but rather by the equivalents thereof as may occur to those skilled in the art upon consideration of the present inventive concept.

Claims (1)

1. A meta-path-based method for querying similar nodes in a heterogeneous network, comprising the following steps:
Step 1: generating the path greedy tree; the greedy tree is expanded according to the input source node and the short text description, and semantic matching of the short text is performed during the expansion;
1.1 building the root node of the greedy tree; each object node of the greedy tree holds two pieces of information: one is the list of node pairs generated during path expansion, stored as a dictionary in which the source node is the key and the target nodes are the value; the other is a mark indicating whether the current greedy tree object can be expanded further: when the mark is True the current object can continue to be expanded downward, and when the mark is False the current object is the end point of a path or has reached the meta-path length threshold; the edges connecting greedy tree objects are labeled with the edge types of the heterogeneous information network; the root node of the greedy tree has not been expanded, so the value corresponding to its source node is empty;
1.2 recursively expanding the greedy tree; during expansion, it is determined from the edge type whether the next node is a movie node; if so, the semantic matching of step 1.3 is performed; if not, the recursive expansion of step 1.2 continues until the target node appears in the value list of a greedy tree leaf node or the path reaches the length threshold;
1.3 performing semantic matching between the short text entered with the query and the movie text introductions; the film and television information network has a star structure: there is a central object, the objects of all other types are connected to it, and the attributes of the central object can influence relationships of every type in the network; in a film and television information network the central object is the movie, and connections among film personnel arise through movies; the movie content also carries rich semantics, which can highlight the characteristics of the connections between nodes from the perspective of node attributes; a movie introduction summarizes the movie content in a short text and can therefore be treated as short text data; the input query text is also short text data, so finding the movie introductions that are semantically similar to the query short text yields the movie content that matches the query semantics, from which meta-paths that conform to the short-text semantics are generated; the semantic matching between a movie introduction and the short text entered by the user comprises the following steps:
(1) performing word segmentation on the input short text using the open-source Jieba word segmentation algorithm based on TextRank; the input short text is denoted $Q$, and after word segmentation it can be represented by the word sequence $[q_0, q_1, \dots, q_i, \dots, q_n]$, where $q_i$ is the $i$-th word and $n$ is the length of the word sequence;
(2) obtaining the word vector of each word using the Directional Skip-Gram model (DSG for short), denoted $V_{q_i}$;
(3) after the word vectors are obtained, calculating their mean by formula (1) to obtain the sentence vector;
Figure FDA0002774593090000011
(4) performing word segmentation on the text introduction of the movie; the text introduction is denoted $T$, the word sequence obtained after segmentation is $[t_0, \dots, t_j, \dots, t_m]$, and the TF-IDF value of each word is used as its weight, giving the weight sequence $[w_0, \dots, w_j, \dots, w_m]$;
(5) processing the names in the movie text introduction with named entity recognition, and deleting the words recognized as names from the word segmentation result;
(6) performing part-of-speech analysis on the segmented words of the movie introduction, filtering out verbs and modifiers such as adjectives and adverbs, and keeping the nouns;
(7) obtaining the word vector $V_{t_j}$ of each word $t_j$ using the DSG model, and then calculating the weighted-average sentence vector $V_T$ by formula (2):
Figure FDA0002774593090000021
(8) obtaining the similarity of the two texts based on the cosine similarity measure, with the calculation formula as follows:
$$\mathrm{sim}(V_Q, V_T) = \frac{V_Q \cdot V_T}{\lVert V_Q \rVert \, \lVert V_T \rVert} \qquad (3)$$
Step 2: determining the meta-path sequences; the greedy tree is first traversed to obtain the edge type sequences, and the node type sequences are then determined from the edge type sequences; traversing the generated greedy tree separates out the paths that connect the input node pair; $L$ is the path set, in which all possible meta-path edge sequences are stored; the root node is denoted as the $j$-th node of the $i$-th layer, with $i = 0$ and $j = 0$;
2.1 traversing downward from the root node; the root node is the current node, and the $j$-th node of layer $i+1$ of the greedy tree is the next node, with $j = 0$; the edge connecting the current node and the next node is put into the current path sequence $l$, and the length of the dictionary value of the next node, i.e., the size of its target node set, is recorded as the out-degree of the node;
2.2 updating the current node to the next node of the previous step, the next node now being the $j$-th leaf node of layer $i+1$ of the greedy tree, with $j = 0$; if the mark of the next node is True, the edge connecting the current node and the next node is put into the current path sequence $l$, and the size of the target node set of the next node is recorded as its out-degree; the current path sequence $l$ is stored in the set $L$, $j$ is set to $j + 1$, and step 2.3 is performed; otherwise, it is checked whether the next node has an extension edge; if so, the edge connecting the current node and the next node is put into the current path sequence $l$, the size of the target node set of the next node is recorded as its out-degree, $i$ is set to $i + 1$ and $j$ to 0, and step 2.2 is repeated; if the next node has no extension edge, $j$ is set to $j + 1$ and step 2.3 is performed;
2.3 updating the next node to the $j$-th leaf node of layer $i+1$ of the greedy tree; if the mark of the next node is True, the edge connecting the current node and the next node is put into the current path sequence $l$, and the size of the target node set of the next node is recorded as its out-degree; the current path sequence $l$ is stored in the set $L$, $j$ is set to $j + 1$, and step 2.3 is performed again; otherwise, it is checked whether the next node has an extension edge; if so, the edge connecting the current node and the next node is put into the current path sequence $l$, the size of the target node set of the next node is recorded as its out-degree, $i$ is set to $i + 1$ and $j$ to 0, and step 2.2 is repeated; if the next node has no extension edge, $j$ is set to $j + 1$ and step 2.3 is performed again;
2.4 after the traversal is complete, the meta-path set $L = \{l_0, \dots, l_i, \dots\}$ containing the edge type sequences is obtained; for each meta-path $l_i = \{t_0, \dots, t_j, \dots\}$ in $L$, the node types are determined according to the edge types $t_j$ therein; finally, the complete meta-paths containing both the node type sequence and the edge type sequence are obtained;
Step 3: calculating the importance of the meta-paths; first, a formula for meta-path importance is defined according to the factors that influence it; the importance of a meta-path is calculated from the number of instance nodes in the leaf nodes of the greedy tree, and the importance formula is:
$$I_{s,t}(P) = S_{s,t}(P) \cdot R_{s,t}(P) \cdot \mathrm{Penalty}(|P|), \quad P \in P_{s \to t} \qquad (4)$$
where the importance is the product of three factors: $S_{s,t}(P)$, $R_{s,t}(P)$ and $\mathrm{Penalty}(|P|)$;
3.1 calculating the length penalty function; the path length $|P|$ is obtained from the meta-path found in step 2, and $\beta^{|P|}$ is used as the penalty function, where $\beta$ is an attenuation coefficient set to 0.5;
3.2 calculating the rarity of the meta-path;
the rarity function evaluates, in a given heterogeneous information network $G = (V, E)$, how rare the meta-path $P$ is among other node pairs similar to the input node pair $\langle s, t \rangle$; $D_{s,t}$ denotes the set of node pairs similar to the input node pair and is defined as:
$$D_{s,t} = D_t \cup D_s \qquad (5)$$
where
Figure FDA0002774593090000031
Figure FDA0002774593090000032
$D_t$ does not include $t$, and $D_s$ does not include $s$; the rarity of the meta-path is calculated by formula (8);
Figure FDA0002774593090000033
3.3 calculating the meta-path strength; the support function of meta-path importance is:
$$S_{s,t}(P) = \mathrm{Strength}(P) \cdot \mathrm{MNIs}_{s,t}(P) \qquad (9)$$
where $\mathrm{MNIs}_{s,t}(P)$ is the minimum number of instances among the nodes on the meta-path $P$, calculated as shown in formula (10), and $p_i$ is the number of instances of the $i$-th node on the meta-path;
Figure FDA0002774593090000034
$\mathrm{Strength}(P)$ calculates the strength coefficient of the meta-path $P$, and formula (11) defines the calculation method; assuming that the node with the minimum number of instances obtained by formula (10) is $A$, the out-degree of node $A$ is $O(A)$ and its in-degree is $I(A)$; when node $A$ is a movie node, its out-degree is calculated by formula (12), where $p_A$ is the instance set of node $A$: the similarities between the vector of each node in the instance set of $A$ and the short-text vector are summed to obtain the out-degree of node $A$;
Figure FDA0002774593090000041
Figure FDA0002774593090000042
3.4 calculating the importance of the meta-path; the length attenuation coefficient, the rarity and the strength of the meta-path are calculated in steps 3.1, 3.2 and 3.3 respectively, and the final meta-path importance is then calculated according to formula (4);
Step 4: generating query instances by combining multiple meta-paths; instance node pairs that conform to the semantics of a meta-path have high similarity under that meta-path; therefore, to obtain the query result instances, it suffices to find the node pairs with high semantic similarity under every meta-path;
4.1 calculating the similarity of node pairs based on meta-paths; the similarity calculation formula of the node pairs according to different meta-paths is as follows:
Figure FDA0002774593090000043
where $\rho_{e_i}(v_i, C_{i+1})$ denotes the set of nodes $x$ of type $C_{i+1}$ that are connected to node $v_i$ by edge $e_i$; $P_{i \dots n}$ denotes the part of the meta-path from node type $C_i$ to $C_n$; $\alpha$ is a fixed parameter set to 0.5; when the type $C_{i+1}$ of node $x$ is movie, the similarity sum $\sum \mathrm{sim}(V_x, V_Q)$ between the short text and the movie text introductions is used instead of $|\rho_{e_i}(v_i, C_{i+1})|$ in formula (13);
4.2 calculating the similarity between the entities $s$ and $t$ of a node pair; the similarity $\sigma(s, t \mid P)$ between the entities $s$ and $t$ is calculated with a linear aggregation function that uses the importance of each meta-path as the weight of its similarity; the aggregation function is:
Figure FDA0002774593090000044
where $I_j$ denotes the importance of the meta-path $P_j$;
4.3 obtaining the query instances from the similarity matrices; after the meta-path-based node similarities are obtained, a similarity matrix is calculated for each meta-path; if the number of film and television nodes is $m$, the similarity matrix has size $m \times m$, and the similarity matrix of the meta-path $P$ is denoted $S_P$:
Figure FDA0002774593090000045
a similarity matrix is constructed for each meta-path the first time that meta-path is generated, and these matrices can be reused; when several meta-paths are combined in a query, only the similarity matrices of the corresponding meta-paths need to be selected; the indexes and values of the positions whose values are non-zero in all of the selected matrices are recorded; the node pairs that satisfy the semantics of all the meta-paths are obtained from those indexes, and the similarity of each such node pair is calculated to give the query result.
CN202011260846.0A 2020-11-12 2020-11-12 Node query method based on meta-path in heterogeneous information network Active CN112380360B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011260846.0A CN112380360B (en) 2020-11-12 2020-11-12 Node query method based on meta-path in heterogeneous information network

Publications (2)

Publication Number Publication Date
CN112380360A true CN112380360A (en) 2021-02-19
CN112380360B CN112380360B (en) 2022-03-18

Family

ID=74583173

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011260846.0A Active CN112380360B (en) 2020-11-12 2020-11-12 Node query method based on meta-path in heterogeneous information network

Country Status (1)

Country Link
CN (1) CN112380360B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114357123A (en) * 2022-03-18 2022-04-15 北京创新乐知网络技术有限公司 Data matching method, device and equipment based on hierarchical structure and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106777339A (en) * 2017-01-13 2017-05-31 深圳市唯特视科技有限公司 A kind of method that author is recognized based on heterogeneous network incorporation model
CN106802956A (en) * 2017-01-19 2017-06-06 山东大学 A kind of film based on weighting Heterogeneous Information network recommends method
US20170286190A1 (en) * 2016-03-31 2017-10-05 International Business Machines Corporation Structural and temporal semantics heterogeneous information network (hin) for process trace clustering
CN108304496A (en) * 2018-01-11 2018-07-20 上海交通大学 Node similarity relation detection method based on composite unit path in Heterogeneous Information net

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
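吴瑶等 (Wu Yao et al.): "多元图融合的异构信息网嵌入" [Heterogeneous information network embedding with multi-graph fusion], 《计算机研究与发展》 (Journal of Computer Research and Development) *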

Also Published As

Publication number Publication date
CN112380360B (en) 2022-03-18


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant