CN112380360B - Node query method based on meta-path in heterogeneous information network - Google Patents
- Publication number
- CN112380360B (application CN202011260846.0A)
- Authority
- CN
- China
- Prior art keywords
- node
- path
- meta
- similarity
- calculating
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/38—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/383—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/322—Trees
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Library & Information Science (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A meta-path-based method for querying similar nodes in a heterogeneous network comprises the following steps: 1. generating a path greedy tree: the greedy tree is expanded according to the input source node and the short text description, and semantic matching of short texts is performed during the expansion; 2. determining the meta-path sequence: the greedy tree is first traversed to obtain the edge type sequences, and the node type sequences are then determined from the edge type sequences; that is, the generated greedy tree is traversed and the paths connecting the input node pair are separated from it; 3. calculating the importance of each meta-path: a formula for meta-path importance is first defined according to the factors that influence it, and the importance is then calculated from the number of instance nodes in the leaf nodes of the greedy tree; 4. generating query instances by combining multiple meta-paths: instance node pairs that conform to the semantics of a meta-path have high similarity under that meta-path, so obtaining the query result instances only requires finding the node pairs that have high similarity under every meta-path.
Description
Technical Field
The invention relates to a meta-path-based node query method for heterogeneous information networks.
Background
Real-world systems are typically composed of a large number of interacting multi-type components, such as human social events, communications, and biological networks. In such systems, the components form a network by being connected to each other, and such a network is generally referred to as an information network. Most conventional information networks are homogeneous networks, that is, nodes in the information networks are entities of the same type, and the entities are connected through relationships of the same type. However, in practical applications most information networks are heterogeneous, i.e. the nodes and relations in the network are not of a single type. Without distinguishing the type of node or edge in the network, important semantic information contained therein is often lost. Data in many fields exists in the form of heterogeneous information networks, such as academic literature information networks, medical information networks, Twitter information networks, and the like.
In the heterogeneous information network, in order to represent different semantic relationships between two nodes, the concept of meta-path is proposed and applied to related tasks of heterogeneous information network mining. A meta-path is a sequence of node types and edge types that can be viewed as a simple graph schema that expresses different semantics. The meta-path can well express the connection semantics among different types of nodes in the heterogeneous information network, so that the meta-path is widely applied to various tasks of heterogeneous information network mining, in particular to a node query task based on meta-path similarity.
In most existing work, meta-paths are specified by domain experts for a particular application scenario, so they usually do not generalize, and different tasks and data sets require different meta-paths. Some methods generate meta-paths automatically, for example by taking the meta-path length as a parameter and generating all meta-paths within a maximum length threshold. This approach keeps the number of meta-paths finite, but it does not guarantee that every meta-path within the length threshold is a "key" path. Moreover, methods for automatically generating meta-paths mostly focus on representing the connection relationships between nodes as meta-paths, without considering how the attributes of the nodes influence those relationships. The resulting meta-path representations limit the accuracy of node queries.
Disclosure of Invention
To ensure that the generated meta-paths are key meta-paths, and to overcome the problem that meta-path semantics do not include node attributes, the invention provides a key-meta-path-based node query method for film and television heterogeneous information networks with a star structure, in which key meta-paths are generated automatically by combining node attributes with meta-path importance.
The invention defines the importance of a meta-path by combining factors such as its length, rarity, and strength, and incorporates the short text describing the movie content when calculating the strength support function. Meta-paths are generated by expanding a greedy tree, in four stages: the first stage generates the path greedy tree from the input source node, target node, and short text description; the second stage derives the type sequences of the meta-paths from the resulting path greedy tree; the third stage calculates the importance of each meta-path; the fourth stage combines multiple meta-paths to perform the node query.
The general flow of the heterogeneous network similar node query method based on the meta-path specifically comprises the following steps:
Step 1: generating a path greedy tree; the greedy tree is expanded according to the input source node and the short text description, and semantic matching of short texts is performed during the expansion;
1.1 building the root node of the greedy tree; each object node of the greedy tree holds two pieces of information: one is the list of node pairs generated during path expansion, stored as a dictionary in which the source node is the key and the target nodes are the value; the other is a flag indicating whether the current greedy tree object can be expanded further; when the flag is True, the current object can continue to be expanded downwards; when the flag is False, the current object is the end point of a path or has reached the meta-path length threshold; the edges connecting greedy tree objects are labelled with the edge types of the heterogeneous information network; the root node of the greedy tree has not been expanded, so the value corresponding to its source node is empty;
1.2 recursively expanding the greedy tree; during expansion, the edge type is used to judge whether the next node is a movie node; if it is a movie node, the semantic matching process of 1.3 is performed; if it is not a movie node, the recursive expansion of 1.2 continues until the target node appears in the value list of a leaf node of the greedy tree or the path reaches the length threshold;
1.3 performing semantic matching between the short text input with the query and the movie text introduction; the film and television information network has a star structure: there is a central object, objects of all other types are connected to it, and the attributes of the central object can influence relationships of every type in the network; in a film and television information network the central object is the movie, and the relationships among film personnel arise through movies; at the same time, the movie content carries rich semantics, which can highlight the characteristics of the connection relationships between nodes from the perspective of node attributes; a movie introduction summarizes the movie content in a short text and can therefore be treated as short text data; the query text is also short text data, so finding the movie introductions that are semantically similar to the short text input with the query yields the movie content that conforms to the query semantics, and thereby generates meta-paths that conform to the short text semantics; the semantic matching procedure between the movie introduction and the short text entered by the user is as follows:
(1) segmenting the short text input with the query using an open-source word segmentation algorithm based on TextRank; the input short text is denoted $Q$, and after word segmentation it can be represented by the word sequence $[q_0, q_1, \ldots, q_i, \ldots, q_n]$, where $q_i$ is the $i$-th word and $n$ is the length of the word sequence;
(2) obtaining the word vector of each word using a Directional Skip-Gram model (DSG for short), denoted $V_{q_i}$;
(3) after the word vectors are obtained, calculating their mean by formula (1) to obtain the sentence vector $V_Q$;
(4) segmenting the movie text introduction, denoted $T$, to obtain the word sequence $[t_0, \ldots, t_j, \ldots, t_m]$, and taking the TF-IDF value of each word as its weight, giving the weight sequence $[w_0, \ldots, w_j, \ldots, w_m]$;
(5) processing the personal names in the movie text introduction with named entity recognition, and deleting the words recognized as names from the word segmentation result;
(6) performing part-of-speech analysis on the segmented words of the movie introduction, filtering out words such as verbs, adjectives, and adverbs, and keeping the nouns;
(7) obtaining the word vector $V_{t_j}$ of each word $t_j$ using the DSG model, then calculating the weighted average sentence vector $V_T$ by formula (2);
(8) obtaining the similarity of the two texts based on the cosine similarity measure, calculated as follows:
$$sim(V_Q, V_T) = \frac{V_Q \cdot V_T}{\lVert V_Q \rVert \, \lVert V_T \rVert} \quad (3)$$
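The sentence-vector and cosine-similarity steps of 1.3 can be illustrated with a minimal Python sketch. It assumes tiny hand-written three-dimensional vectors (`TOY_VECTORS`) in place of real DSG embeddings, and the function names are hypothetical; it covers only the unweighted mean and the cosine measure, not segmentation, named entity recognition, or part-of-speech filtering.

```python
import math

# Hypothetical 3-dimensional word vectors standing in for DSG embeddings.
TOY_VECTORS = {
    "space": [0.9, 0.1, 0.0],
    "war": [0.7, 0.3, 0.1],
    "robot": [0.8, 0.0, 0.2],
    "love": [0.0, 0.9, 0.3],
}


def mean_vector(words, vectors):
    """Unweighted mean of the word vectors of `words` (words without a vector are skipped)."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        raise ValueError("no word has a vector")
    dim = len(known[0])
    return [sum(v[k] for v in known) / len(known) for k in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Query short text and two movie introductions, already segmented (toy data).
v_q = mean_vector(["space", "war"], TOY_VECTORS)
v_close = mean_vector(["robot", "space"], TOY_VECTORS)
v_far = mean_vector(["love"], TOY_VECTORS)

sim_close = cosine_similarity(v_q, v_close)
sim_far = cosine_similarity(v_q, v_far)
```

With these toy vectors the science-fiction introduction scores higher against the query than the romance one, which is the behaviour the matching step relies on.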
Step 2: determining the meta-path sequence; the greedy tree is first traversed to obtain the edge type sequences, and the node type sequences are then determined from the edge type sequences; that is, the generated greedy tree is traversed and the paths connecting the input node pair are separated from it; $L$ is the path set, in which all possible meta-path edge sequences are stored; the root node is recorded as the $j$-th node of the $i$-th layer, with $i = 0$ and $j = 0$;
2.1 traversing downwards from the root node; the root node is the current node, and the $j$-th node of layer $i+1$ of the greedy tree is the next node, with $j = 0$; the edge connecting the current node and the next node is put into the current path sequence $l$, and the length of the dictionary value of the next node, i.e. of its target node set, is recorded as the out-degree of that node;
2.2 updating the current node to the next node of the previous step; the next node is then the $j$-th leaf node of layer $i+1$ of the greedy tree, with $j = 0$; if the flag of the next node is True, the edge connecting the current node and the next node is put into the current path sequence $l$, and the length of the dictionary value of the next node, i.e. of its target node set, is recorded as the out-degree of that node; the current path sequence $l$ is stored in the set $L$, $j$ is set to $j + 1$, and step 2.3 is performed; otherwise, it is judged whether the next node has an extension edge; if so, the edge connecting the current node and the next node is put into the current path sequence $l$, the length of the dictionary value of the next node is recorded as the out-degree of that node, $i$ is set to $i + 1$ and $j$ to 0, and step 2.2 is repeated; if the next node has no extension edge, $j$ is set to $j + 1$ and step 2.3 is performed;
2.3 updating the next node to the $j$-th leaf node of layer $i+1$ of the greedy tree; if the flag of the next node is True, the edge connecting the current node and the next node is put into the current path sequence $l$, and the length of the dictionary value of the next node, i.e. of its target node set, is recorded as the out-degree of that node; the current path sequence $l$ is stored in the set $L$, $j$ is set to $j + 1$, and step 2.3 is performed; otherwise, it is judged whether the next node has an extension edge; if so, the edge connecting the current node and the next node is put into the current path sequence $l$, the length of the dictionary value of the next node is recorded as the out-degree of that node, $i$ is set to $i + 1$ and $j$ to 0, and step 2.2 is repeated; if the next node has no extension edge, $j$ is set to $j + 1$ and step 2.3 is performed;
2.4 after the traversal is complete, the meta-path set $L = \{l_0, \ldots, l_i, \ldots\}$ containing the edge type sequences is obtained; for each meta-path $l_i = \{t_0, \ldots, t_j, \ldots\}$ in $L$, the node types are determined from the edge types $t_j$; finally, the complete meta-paths containing both the node type sequence and the edge type sequence are obtained;
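The greedy tree object of step 1.1 and the path extraction of step 2 can be sketched in Python as follows. This is a simplified illustration, not the layer-and-index bookkeeping of steps 2.1 to 2.3: it walks the tree depth-first and keeps the edge type sequence of every path whose end node has the target in its dictionary values. The class name `GreedyTreeNode`, the function name, and the toy edge types are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class GreedyTreeNode:
    pairs: dict        # source node -> set of nodes reached along this path
    expandable: bool   # False once the path ends or hits the length threshold
    children: dict = field(default_factory=dict)  # edge type -> child node


def collect_edge_sequences(root, target):
    """Return (edge type sequence, out-degrees) for every path that reaches `target`."""
    results = []

    def visit(node, edges, degrees):
        for edge_type, child in node.children.items():
            # The out-degree is the size of the child's target node set.
            reached = set().union(*child.pairs.values())
            next_edges = edges + [edge_type]
            next_degrees = degrees + [len(reached)]
            if target in reached:
                results.append((next_edges, next_degrees))
            if child.expandable:
                visit(child, next_edges, next_degrees)

    visit(root, [], [])
    return results


# Toy tree for the source node "actor_A"; the root value is empty, as in step 1.1.
root = GreedyTreeNode(pairs={"actor_A": set()}, expandable=True)
movies = GreedyTreeNode(pairs={"actor_A": {"movie_1", "movie_2"}}, expandable=True)
co_actors = GreedyTreeNode(pairs={"actor_A": {"actor_B", "actor_C"}}, expandable=False)
directors = GreedyTreeNode(pairs={"actor_A": {"director_D"}}, expandable=False)
root.children["acts_in"] = movies
movies.children["acted_by"] = co_actors
movies.children["directed_by"] = directors

sequences = collect_edge_sequences(root, "actor_B")
```

Here only the path `acts_in`, `acted_by` connects `actor_A` to `actor_B`, so it is the single edge sequence placed in the result.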
Step 3: calculating the importance of the meta-path; a formula for meta-path importance is first defined according to the factors that influence it; the importance is then calculated from the number of instance nodes in the leaf nodes of the greedy tree; the importance formula is:
$$I_{s,t}(P) = S_{s,t}(P) \cdot R_{s,t}(P) \cdot Penalty(|P|), \quad P \in P_{s \to t} \quad (4)$$
where the importance is composed of three parts: $S_{s,t}(P)$, $R_{s,t}(P)$, and $Penalty(|P|)$;
3.1 calculating the length penalty function; the meta-path length $|P|$ is obtained from the meta-path found in step 2, and $\beta^{|P|}$ is used as the penalty function, where $\beta$ is an attenuation coefficient set to 0.5;
3.2 calculating the rarity of the meta-path;
the rarity function evaluates, in a given heterogeneous information network $G = (V, E)$, how rare the meta-path $P$ is among other node pairs similar to the input node pair $\langle s, t \rangle$; $D_{s,t}$ denotes the set of node pairs similar to the input node pair and is defined as:
$$D_{s,t} = D_t \cup D_s \quad (5)$$
where $D_t$ does not include $t$ and $D_s$ does not include $s$; the rarity of the meta-path is calculated by formula (8);
3.3 calculating the meta-path strength; the meta-path importance support function is:
$$S_{s,t}(P) = Strength(P) \cdot MNIs_{s,t}(P) \quad (9)$$
where $MNIs_{s,t}(P)$ is the minimum number of instances in the meta-path $P$, calculated as shown in formula (10), in which $P_i$ is the number of instances of the $i$-th node on the meta-path;
$Strength(P)$ is the strength coefficient of the meta-path $P$, and formula (11) defines how it is calculated; suppose the node with the minimum number of instances obtained by formula (10) is $A$, with out-degree $O(A)$ and in-degree $I(A)$; when node $A$ is a movie node, its out-degree is calculated by formula (12), in which $p_A$ is the instance set of node $A$: the similarities between the vector of each node in the instance set of $A$ and the short text vector are summed to give the out-degree of $A$;
3.4 calculating the importance of the meta-path; the length attenuation coefficient, the rarity, and the strength of the meta-path are calculated in steps 3.1, 3.2, and 3.3 respectively, and the final meta-path importance is then calculated according to formula (4);
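How the three factors of step 3 combine can be shown with a small Python sketch of formulas (4) and (9) and the length penalty of 3.1. The rarity and the strength coefficient are taken as plain inputs here, because formulas (8) and (11) are not reproduced in this text; the function names and the numbers in the example are hypothetical.

```python
def length_penalty(path_length, beta=0.5):
    """Penalty beta ** |P| from step 3.1; longer meta-paths are penalized more."""
    return beta ** path_length


def min_instance_count(instance_counts):
    """Minimum number of instances over the nodes of the meta-path (role of formula (10))."""
    return min(instance_counts)


def movie_out_degree(similarities):
    """Out-degree of a movie node: sum of movie-to-short-text similarities (role of formula (12))."""
    return sum(similarities)


def strength_support(strength, instance_counts):
    """Support function of formula (9): strength coefficient times minimum instance count."""
    return strength * min_instance_count(instance_counts)


def importance(strength, instance_counts, rarity, path_length, beta=0.5):
    """Importance of formula (4): support * rarity * length penalty."""
    support = strength_support(strength, instance_counts)
    return support * rarity * length_penalty(path_length, beta)


# Toy meta-path of length 2 with instance counts 3, 2 and 5 along its nodes.
example_importance = importance(
    strength=0.8, instance_counts=[3, 2, 5], rarity=0.5, path_length=2
)
example_out_degree = movie_out_degree([0.9, 0.4])
```

For the toy values the support is 0.8 * 2 = 1.6, the penalty is 0.5 ** 2 = 0.25, and the importance is 1.6 * 0.5 * 0.25 = 0.2.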
Step 4: generating query instances by combining multiple meta-paths; instance node pairs that conform to the semantics of a meta-path have high similarity under that meta-path; therefore, obtaining the query result instances only requires finding the node pairs that have high similarity under every meta-path;
4.1 calculating the meta-path-based similarity of node pairs; the similarity of a node pair under a given meta-path is calculated by formula (13),
where $\rho_{e_i}(v_i, C_{i+1})$ denotes the set of nodes $x$ of type $C_{i+1}$ that are connected to node $v_i$ by edge $e_i$; $P_{i \ldots n}$ denotes the part of the meta-path from $C_i$ to $C_n$; $\alpha$ is a fixed parameter set to 0.5; when the type $C_{i+1}$ of node $x$ is movie, the sum of similarities between the short text and the movie text introductions, $\sum sim(V_x, V_Q)$, is used in place of $|\rho_{e_i}(v_i, C_{i+1})|$ in formula (13);
4.2 calculating the similarity between the entities $s$ and $t$ of a node pair; the similarity $\sigma(s, t \mid P)$ between the entities $s$ and $t$ is calculated with a linear aggregation function that uses the importance of each meta-path as the weight of its similarity; the aggregation function is as follows:
where $I_j$ denotes the importance corresponding to meta-path $P_j$;
4.3 obtaining query instances from the similarity matrices; once the meta-path-based node similarities have been obtained, a similarity matrix is calculated for each meta-path; if the number of film and television nodes is $m$, the similarity matrix has size $m \times m$, and the similarity matrix of meta-path $P$ is denoted $S_P$:
a similarity matrix is constructed the first time each meta-path is generated, and these matrices can be reused; when several meta-paths are combined for a query, it suffices to select the similarity matrices of the corresponding meta-paths and record the indices and values of the positions whose values are non-zero in all of the matrices; the node pairs that satisfy the semantics of all the meta-paths are obtained from these indices, and their similarities are calculated to give the query result.
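The matrix combination of step 4.3 can be sketched in Python as below. The sketch assumes that the linear aggregation of step 4.2 is a plain importance-weighted sum without normalization, which is an assumption because the aggregation formula is not reproduced in this text; the function name, the matrices, and the importance values are hypothetical toy data.

```python
def combine_meta_paths(matrices, importances):
    """Keep index pairs that are non-zero in every matrix; score them by weighted sum."""
    size = len(matrices[0])
    results = {}
    for i in range(size):
        for j in range(size):
            values = [matrix[i][j] for matrix in matrices]
            # A node pair satisfies all meta-paths only if no matrix has a zero here.
            if all(value != 0 for value in values):
                results[(i, j)] = sum(w * v for w, v in zip(importances, values))
    return results


# Toy 3 x 3 similarity matrices for two meta-paths over the same three nodes.
S_A = [
    [1.0, 0.6, 0.0],
    [0.6, 1.0, 0.2],
    [0.0, 0.2, 1.0],
]
S_B = [
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.4],
    [0.5, 0.4, 1.0],
]

combined = combine_meta_paths([S_A, S_B], importances=[0.5, 0.3])
# Rank off-diagonal pairs (distinct nodes) by aggregated similarity, highest first.
ranked = sorted(
    ((pair, score) for pair, score in combined.items() if pair[0] != pair[1]),
    key=lambda item: item[1],
    reverse=True,
)
```

Because each per-path matrix is built once and stored, a new combination of meta-paths only needs this selection and weighted sum rather than a fresh traversal of the network.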
The steps of the heterogeneous network similar node query method based on the meta path are ended.
By integrating the above techniques, the invention provides a meta-path-based node query method for heterogeneous information networks. To address the problems that traditional meta-paths do not generalize and that it cannot be determined whether a meta-path is a key meta-path, an importance calculation method combining three factors (meta-path length, rarity, and strength) is provided, and whether a generated meta-path is a "key meta-path" is determined from its importance. In addition, so that the generated meta-paths are constrained by the short text semantics, the invention uses the movie text introductions, performing semantic matching between each introduction and the short text description, thereby obtaining meta-paths constrained by the short text semantics and calculating their importance. Similar-node query results in the heterogeneous information network are then obtained from the computed meta-paths and their importance.
The advantages of the invention are: (1) The approach is novel. The invention uses meta-path importance to judge whether a meta-path is a key meta-path, effectively overcoming the lack of generality of meta-paths specified by domain experts. (2) Meta-path semantics are enriched along multiple dimensions. During meta-path generation, semantics from the node attribute dimension are added: short text is used to constrain the movie content, so that the generated meta-paths contain not only their own relationship semantics but also semantics related to node attributes. (3) The algorithm is simple and fast to implement. By recursively expanding the greedy tree, the method extracts meta-paths from user input in real time and calculates their importance without data labelling or model training, which greatly improves the efficiency of automatic meta-path generation.
Drawings
FIG. 1 is a general flow diagram of the process of the present invention.
Detailed Description
To ensure that the generated meta-paths are key meta-paths, and to overcome the problem that meta-path semantics do not include node attributes, the invention provides a key-meta-path-based node query method for film and television heterogeneous information networks with a star structure, in which key meta-paths are generated automatically by combining node attributes with meta-path importance.
The invention defines the importance of a meta-path by combining factors such as its length, rarity, and strength, and incorporates the short text describing the movie content when calculating the strength support function. Meta-paths are generated by expanding a greedy tree, in four stages. The first stage generates the path greedy tree from the input source node, target node, and short text description. The second stage derives the type sequences of the meta-paths from the resulting path greedy tree. The third stage calculates the importance of each meta-path. The fourth stage combines multiple meta-paths to perform the node query.
The overall flow of the meta-path-based method for querying similar nodes in a heterogeneous network is shown in FIG. 1 and comprises the following steps:
Step 1: generating a path greedy tree; the greedy tree is expanded according to the input source node and the short text description, and semantic matching of short texts is performed during the expansion;
1.1 building the root node of the greedy tree; each object node of the greedy tree holds two pieces of information: one is the list of node pairs generated during path expansion, stored as a dictionary in which the source node is the key and the target nodes are the value; the other is a flag indicating whether the current greedy tree object can be expanded further; when the flag is True, the current object can continue to be expanded downwards; when the flag is False, the current object is the end point of a path or has reached the meta-path length threshold; the edges connecting greedy tree objects are labelled with the edge types of the heterogeneous information network; the root node of the greedy tree has not been expanded, so the value corresponding to its source node is empty;
1.2 recursively expanding the greedy tree; during expansion, the edge type is used to judge whether the next node is a movie node; if it is a movie node, the semantic matching process of 1.3 is performed; if it is not a movie node, the recursive expansion of 1.2 continues until the target node appears in the value list of a leaf node of the greedy tree or the path reaches the length threshold;
1.3 performing semantic matching between the short text input with the query and the movie text introduction; the film and television information network has a star structure: there is a central object, objects of all other types are connected to it, and the attributes of the central object can influence relationships of every type in the network; in a film and television information network the central object is the movie, and the relationships among film personnel arise through movies; at the same time, the movie content carries rich semantics, which can highlight the characteristics of the connection relationships between nodes from the perspective of node attributes; a movie introduction summarizes the movie content in a short text and can therefore be treated as short text data; the query text is also short text data, so finding the movie introductions that are semantically similar to the short text input with the query yields the movie content that conforms to the query semantics, and thereby generates meta-paths that conform to the short text semantics; the semantic matching procedure between the movie introduction and the short text entered by the user is as follows:
(1) segmenting the short text input with the query using an open-source word segmentation algorithm based on TextRank; the input short text is denoted $Q$, and after word segmentation it can be represented by the word sequence $[q_0, q_1, \ldots, q_i, \ldots, q_n]$, where $q_i$ is the $i$-th word and $n$ is the length of the word sequence;
(2) obtaining the word vector of each word using a Directional Skip-Gram model (DSG for short), denoted $V_{q_i}$;
(3) after the word vectors are obtained, calculating their mean by formula (1) to obtain the sentence vector $V_Q$;
(4) segmenting the movie text introduction, denoted $T$, to obtain the word sequence $[t_0, \ldots, t_j, \ldots, t_m]$, and taking the TF-IDF value of each word as its weight, giving the weight sequence $[w_0, \ldots, w_j, \ldots, w_m]$;
(5) processing the personal names in the movie text introduction with named entity recognition, and deleting the words recognized as names from the word segmentation result;
(6) performing part-of-speech analysis on the segmented words of the movie introduction, filtering out words such as verbs, adjectives, and adverbs, and keeping the nouns;
(7) obtaining the word vector $V_{t_j}$ of each word $t_j$ using the DSG model, then calculating the weighted average sentence vector $V_T$ by formula (2);
(8) obtaining the similarity of the two texts based on the cosine similarity measure, calculated as follows:
$$sim(V_Q, V_T) = \frac{V_Q \cdot V_T}{\lVert V_Q \rVert \, \lVert V_T \rVert} \quad (3)$$
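The TF-IDF weighting of sub-steps (4) and (7) can be illustrated with the following Python sketch. Two points are assumptions rather than statements of the patent: the smoothed IDF variant used here, and the normalization of the weighted average by the sum of the weights (formula (2) is not reproduced in this text). The function names, the toy corpus, and the two-dimensional vectors are hypothetical.

```python
import math
from collections import Counter


def tf_idf_weights(doc_tokens, corpus):
    """TF-IDF weight for each distinct token of doc_tokens, using a smoothed IDF."""
    counts = Counter(doc_tokens)
    n_docs = len(corpus)
    weights = {}
    for token, count in counts.items():
        tf = count / len(doc_tokens)
        df = sum(1 for doc in corpus if token in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        weights[token] = tf * idf
    return weights


def weighted_sentence_vector(weights, vectors):
    """Weighted average of word vectors, normalized by the total weight."""
    usable = [(w, vectors[t]) for t, w in weights.items() if t in vectors]
    total = sum(w for w, _ in usable)
    if total == 0:
        raise ValueError("no weighted word has a vector")
    dim = len(usable[0][1])
    return [sum(w * v[k] for w, v in usable) / total for k in range(dim)]


# Toy corpus of already segmented and filtered movie introductions.
corpus = [
    ["robot", "war", "robot"],
    ["love", "story"],
    ["war", "story"],
]
# Hypothetical 2-dimensional word vectors standing in for DSG embeddings.
toy_vectors = {"robot": [1.0, 0.0], "war": [0.0, 1.0]}

weights = tf_idf_weights(corpus[0], corpus)
v_t = weighted_sentence_vector(weights, toy_vectors)
```

In the first introduction "robot" is both more frequent and rarer across the corpus than "war", so it receives the larger weight and pulls the sentence vector toward its own direction.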
step 2: determining a meta-path sequence; firstly traversing a greedy tree to obtain an edge type sequence, and then determining a node type sequence according to the edge type sequence; traversing the generated greedy tree, and separating a path connecting the input node pairs from the greedy tree; l is a path set, and all possible meta-path edge sequences are stored in L; recording the root node as the jth node of the ith layer, wherein i is 0, and j is 0;
2.1 traversing from the root node downwards; the root node is a current node, the jth node of the i +1 th layer of the greedy tree is a next node, and j is 0; putting an edge connecting a current node and a next node into a current path sequence l, and recording the dictionary value of the next node, namely the length of a target node set as the output degree of the node;
2.2, updating the current node as the next node of the previous step, wherein the next node is the jth leaf node of the i +1 th layer of the greedy tree, and j is 0; if the mark of the next node is True, putting the edge connecting the current node and the next node into the current path sequence l, and recording the dictionary value of the next node, namely the length of the target node set as the output degree of the node; storing the current path sequence L in the set L, and making j equal to j +1, and performing step 2.3; otherwise, judging whether the next node has an extension edge, if so, putting the edge connecting the current node and the next node into the current path sequence l, and recording the dictionary value of the next node, namely the length of the target node set as the output degree of the node; repeating step 2.2 by making i ═ i +1 and j ═ 0; if the next node has no extended edge, let j equal to j +1, go to step 2.3;
2.3 updating the next node to be the jth leaf node of the i +1 th layer of the greedy tree; if the mark of the next node is True, putting the edge connecting the current node and the next node into the current path sequence l, and recording the dictionary value of the next node, namely the length of the target node set as the output degree of the node; storing the current path sequence L in the set L, and making j equal to j +1, and performing step 2.3; otherwise, judging whether the next node has an extension edge, if so, putting the edge connecting the current node and the next node into the current path sequence l, and recording the dictionary value of the next node, namely the length of the target node set as the output degree of the node; repeating step 2.2 by making i ═ i +1 and j ═ 0; if the next node has no extended edge, let j equal to j +1, go to step 2.3;
2.4 after completing the traversal, obtaining the meta-path set $L = \{l_0, \ldots, l_i, \ldots\}$ containing the edge type sequences; for each meta-path $l_i = \{t_0, \ldots, t_j, \ldots\}$ in L, determining the node types according to the edge types $t_j$ therein; finally, obtaining the complete meta-path containing the node type sequence and the edge type sequence;
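The traversal of steps 2.1 to 2.4 can be sketched as a depth-first walk. The following is a minimal illustration only, not the patented procedure itself: the `GreedyNode` class, its field names and the edge-type labels are hypothetical, and the sketch follows steps 2.2 and 2.3 in treating a True mark as the signal to store the current path.

```python
from dataclasses import dataclass, field


@dataclass
class GreedyNode:
    """Hypothetical greedy-tree node: node-pair dictionary, mark, children."""
    pairs: dict                      # source node -> list of target nodes
    flag: bool = False               # True: the path ending here is stored
    children: list = field(default_factory=list)  # (edge_type, GreedyNode)


def collect_edge_sequences(root):
    """Return (edge-type sequence, out-degree sequence) for every marked node."""
    found = []

    def visit(node, path, degrees):
        for edge_type, child in node.children:
            # Out-degree: size of the child's target-node set.
            degree = len({t for targets in child.pairs.values() for t in targets})
            new_path = path + [edge_type]
            new_degrees = degrees + [degree]
            if child.flag:
                found.append((new_path, new_degrees))
            elif child.children:
                visit(child, new_path, new_degrees)
            # An unmarked child without extension edges is skipped.

    visit(root, [], [])
    return found


# Toy tree with made-up edge types.
leaf = GreedyNode(pairs={"a1": ["a2", "a3"]}, flag=True)
mid = GreedyNode(pairs={"a1": ["m1"]}, children=[("acted_by", leaf)])
dead_end = GreedyNode(pairs={"a1": ["d1"]})
root = GreedyNode(pairs={"a1": []},
                  children=[("acts_in", mid), ("directed_by", dead_end)])
paths = collect_edge_sequences(root)
```

Recursion replaces the explicit i and j counters of the text; the visiting order (children left to right, descending before moving to the next sibling) is the same.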
step 3: calculating the importance of the meta-path; first, defining the formula for meta-path importance according to the factors that influence it; the importance of a meta-path is calculated from the number of instance nodes in the leaf nodes of the greedy tree, and the formula is as follows:
$I_{s,t}(P) = S_{s,t}(P) \cdot R_{s,t}(P) \cdot \mathrm{Penalty}(|P|), \quad P \in P_{s \to t}$ (4)
wherein the importance consists of three parts: the strength term $S_{s,t}(P)$, the rarity term $R_{s,t}(P)$ and the length penalty $\mathrm{Penalty}(|P|)$;
3.1 calculating the length penalty function; obtaining the meta-path length $|P|$ from the meta-path obtained in step 2, and using $\beta^{|P|}$ as the penalty function, where β is an attenuation coefficient set to 0.5;
3.2 calculating the rarity of the meta-path;
the rarity function evaluates, in a given heterogeneous information network $G = (V, E)$, how rare the meta-path P is among the other node pairs similar to the input node pair $\langle s, t \rangle$; $D_{s,t}$ denotes the set of node pairs similar to the input node pair and is defined as:
$D_{s,t} = D_t \cup D_s$ (5)
wherein,
$D_t$ does not include t, and $D_s$ does not include s; the rarity of the meta-path is obtained by calculation through formula (8);
3.3 calculating the meta path strength; the meta path importance support function is:
$S_{s,t}(P) = \mathrm{Strength}(P) \cdot \mathrm{MNIs}_{s,t}(P)$ (9)
wherein $\mathrm{MNIs}(P)$ is the minimum number of instances among the nodes on the meta-path P, calculated as shown in formula (10), and $P_i$ is the number of instances of the i-th node on the meta-path;
$\mathrm{Strength}(P)$ is the strength coefficient of the meta-path P, defined by formula (11); let A be the node with the minimum number of instances obtained by formula (10), with out-degree O(A) and in-degree I(A); when node A is a movie node, its out-degree is calculated by formula (12), where $p_A$ is the instance set of node A: the similarities between the vector of each node in the instance set of node A and the short text vector are summed to obtain the out-degree of node A;
3.4 calculating the importance of the meta-path; calculating the length attenuation coefficient, the rarity and the strength of the meta-path through steps 3.1, 3.2 and 3.3 respectively, and then calculating the final meta-path importance according to formula (4);
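As a rough illustration of how formula (4) combines its three factors, the sketch below takes the strength term and the rarity term as already-computed inputs, because formulas (8), (10) and (11) are not reproduced in this text. The function names and the example values are hypothetical.

```python
def length_penalty(path_length, beta=0.5):
    """Penalty(|P|) = beta ** |P|, with beta = 0.5 as stated in step 3.1."""
    return beta ** path_length


def meta_path_importance(strength_term, rarity_term, path_length, beta=0.5):
    """Formula (4): product of strength term, rarity term and length penalty."""
    return strength_term * rarity_term * length_penalty(path_length, beta)


# Two made-up meta-paths with equal strength and rarity but different lengths.
short_path = meta_path_importance(strength_term=4.0, rarity_term=0.5, path_length=2)
long_path = meta_path_importance(strength_term=4.0, rarity_term=0.5, path_length=4)
```

With β = 0.5, each additional edge halves the importance, so longer meta-paths need a proportionally larger strength or rarity to rank as high as shorter ones.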
step 4: generating query instances by combining multiple meta-paths; instance node pairs that conform to the semantics of a meta-path have high similarity under that meta-path; therefore, to obtain the query result instances, it suffices to find the node pairs that have high semantic similarity under every meta-path;
4.1 calculating the similarity of node pairs based on meta-paths; the similarity calculation formula of the node pairs according to different meta-paths is as follows:
where $\rho_{e_i}(v_i, C_{i+1})$ denotes the set of nodes x of type $C_{i+1}$ that are connected to node $v_i$ via edge $e_i$; $P_{i \ldots n}$ denotes the part of the meta-path from node type $C_i$ to $C_n$; α is a fixed parameter set to 0.5; when the type $C_{i+1}$ of node x is movie, the similarity sum $\sum \mathrm{sim}(V_x, V_Q)$ between the movie text introductions and the short text is used in place of $|\rho_{e_i}(v_i, C_{i+1})|$ in formula (13);
4.2 calculating the similarity between the entities s and t of a node pair; the similarity $\sigma(s, t \mid P)$ between the entities s and t is calculated with a linear aggregation function, using the importance of each meta-path as the weight of its similarity; the aggregation function is as follows:
wherein $I_j$ denotes the importance corresponding to meta-path $P_j$;
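The linear aggregation of step 4.2 can be illustrated as an importance-weighted sum of the per-meta-path similarities. This is a sketch under assumptions: the aggregation formula itself is not reproduced in this text, so whether the weights are normalized is unknown, and the version below leaves them unnormalized. The function name and the numbers are hypothetical.

```python
def aggregate_similarity(path_similarities, importances):
    """Weighted sum: each meta-path similarity times that meta-path's importance."""
    if len(path_similarities) != len(importances):
        raise ValueError("one importance value is required per meta-path")
    return sum(w * s for w, s in zip(importances, path_similarities))


# Similarities of one node pair under two meta-paths, with their importances.
combined = aggregate_similarity(path_similarities=[0.8, 0.4],
                                importances=[0.5, 0.25])
```

If normalized scores are wanted, dividing the result by the sum of the importances is one option; that choice does not change the ranking of node pairs for a fixed set of meta-paths.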
4.3 obtaining the query instances from the similarity matrices; after the meta-path-based node similarities are obtained, a similarity matrix is calculated for each meta-path; if the number of film and television nodes is m, the similarity matrix has size m × m, and the similarity matrix of the meta-path P is denoted $S_P$:
when each meta-path is generated for the first time, its similarity matrix is constructed, and these matrices can be reused; when multiple meta-paths are combined for a query, only the similarity matrices of the corresponding meta-paths need to be selected; the indexes and values of the positions that are non-zero in all of the selected matrices are recorded; the node pairs satisfying the semantics of all the meta-paths are obtained from these indexes, and their similarities are calculated to obtain the query result.
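The matrix-combination step can be sketched in plain Python as follows. The matrices, their values and the function name are made up for illustration; a real implementation would likely use sparse matrices, which the text does not specify.

```python
def shared_nonzero_pairs(matrices):
    """Map (row, col) -> list of values for cells that are non-zero in every matrix."""
    if not matrices:
        return {}
    size = len(matrices[0])
    shared = {}
    for i in range(size):
        for j in range(size):
            values = [matrix[i][j] for matrix in matrices]
            if all(value != 0 for value in values):
                shared[(i, j)] = values
    return shared


# Similarity matrices of two hypothetical meta-paths over m = 3 nodes.
s_p1 = [[1.0, 0.5, 0.0],
        [0.5, 1.0, 0.2],
        [0.0, 0.2, 1.0]]
s_p2 = [[1.0, 0.0, 0.3],
        [0.0, 1.0, 0.4],
        [0.3, 0.4, 1.0]]
candidates = shared_nonzero_pairs([s_p1, s_p2])
```

Only the node pair (1, 2), its mirror (2, 1) and the diagonal survive here, because every other cell is zero under at least one meta-path. The recorded values can then be passed to the aggregation of step 4.2.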
This concludes the steps of the meta-path-based similar node query method for heterogeneous networks.
The embodiments described in this specification are merely illustrative of implementations of the inventive concept and the scope of the present invention should not be considered limited to the specific forms set forth in the embodiments but rather by the equivalents thereof as may occur to those skilled in the art upon consideration of the present inventive concept.
Claims (1)
1. The heterogeneous network similar node query method based on the meta-path comprises the following steps:
step 1: generating a path greedy tree; expanding the greedy tree according to the input source node and the short text description; performing semantic matching of short texts in the greedy tree expanding process;
1.1 building the root node of the greedy tree; each object node of the greedy tree contains two parts of information: one is the list of node pairs generated during path expansion, stored in dictionary form with the source node as the dictionary key and the target nodes as the dictionary value; the other is a mark indicating whether the current greedy tree object can be expanded further downward, where True indicates that the current object can continue to be expanded and False indicates that the current object is the end point of a path or that the meta-path length threshold has been reached; the edges connecting greedy tree objects are labeled with the edge types of the heterogeneous information network; the root node of the greedy tree is not expanded, and the value corresponding to its source node is null;
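The node layout described in step 1.1 might be represented as below. This is only a sketch: the class name, field names, helper functions and the sample identifiers are hypothetical, and an empty list stands in for the null value of the root's source node.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class GreedyTreeNode:
    """Node-pair dictionary plus the expandability mark of step 1.1."""
    pairs: Dict[str, List[str]]      # source node -> target nodes
    flag: bool = True                # True: can still be expanded downward
    children: List[Tuple[str, "GreedyTreeNode"]] = field(default_factory=list)


def make_root(source):
    """Root node: not expanded yet, the source node's value is empty."""
    return GreedyTreeNode(pairs={source: []})


def add_child(parent, edge_type, pairs, flag=True):
    """Attach a child under an edge labeled with an edge type of the network."""
    child = GreedyTreeNode(pairs=pairs, flag=flag)
    parent.children.append((edge_type, child))
    return child


root = make_root("actor_1")
movies = add_child(root, "acts_in", {"actor_1": ["movie_7", "movie_9"]})
```

Storing the edge type alongside each child keeps the edge-type sequence recoverable by walking from the root, which is what step 2 relies on.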
1.2 recursively expanding the greedy tree; during expansion, judging from the edge type whether the next node is a movie node; if so, performing the semantic matching process of step 1.3; if not, continuing the recursive expansion of step 1.2 until the target node appears in the value list of a leaf node of the greedy tree or the path reaches the length threshold;
1.3 performing semantic matching between the short text input with the query and the movie text introductions; the film and television information network has a star structure, in which there is a central object to which all other object types are connected, and the attributes of the central object can influence every relationship type in the network; in a film and television information network the central object is the movie, and the relationships among film practitioners arise through movies; moreover, the movie content carries rich semantics, which highlight the characteristics of the connections between nodes from the perspective of node attributes; a movie introduction summarizes the movie content in a short text and can be treated as short text data; the query text is also short text data, so finding the movie introductions that are semantically similar to the query short text yields the movie content conforming to the query semantics, and hence the meta-paths conforming to the short text semantics; the semantic matching between the movie introduction and the short text entered by the user comprises the following steps:
(1) segmenting the short text input with the query using an open-source TextRank-based word segmentation algorithm; the input short text is denoted Q, and after word segmentation it can be represented by the word sequence $[q_0, q_1, \ldots, q_i, \ldots, q_n]$, wherein $q_i$ is the i-th word and n is the length of the word sequence;
(2) obtaining the word vector of each word using a Directional Skip-Gram model (DSG for short), denoted $V_{q_i}$;
(3) After the word vectors are obtained, calculating the mean value of the word vectors through a formula (1) to obtain sentence vectors;
(4) segmenting the text introduction of the movie; the text introduction is denoted T, the word sequence obtained after segmentation is $[t_0, \ldots, t_j, \ldots, t_m]$, and the TF-IDF value of each word is taken as its weight, giving the weight sequence $[w_0, \ldots, w_j, \ldots, w_m]$;
(5) processing the person names in the movie text introduction using named entity recognition, and deleting the words recognized as person names from the word segmentation result;
(6) performing part-of-speech analysis on the words after word segmentation of the movie brief introduction, filtering out verbs, adjectives and adverb modifiers, and keeping nouns;
(7) obtaining the word vector $V_{t_j}$ of each word $t_j$ using the DSG model, and then calculating the weighted average sentence vector $V_T$ by formula (2);
(8) obtaining the similarity of the two texts based on the cosine similarity measure, the calculation formula being as follows:
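The vector arithmetic behind the semantic matching steps can be sketched in plain Python. This is an illustration under assumptions: word segmentation, the DSG word vectors and the TF-IDF weights are taken as given inputs, the two-dimensional vectors are made up, and since formula (2) is not reproduced here the weighted average is normalized by the sum of the weights.

```python
import math


def mean_vector(vectors):
    """Unweighted mean of the word vectors (query sentence vector)."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def weighted_mean_vector(vectors, weights):
    """Weighted mean of the word vectors, e.g. with TF-IDF weights."""
    total = sum(weights)
    return [sum(w * x for w, x in zip(weights, column)) / total
            for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy two-dimensional "word vectors" for a query and a movie introduction.
v_q = mean_vector([[1.0, 0.0], [0.0, 1.0]])
v_t = weighted_mean_vector([[1.0, 0.0], [0.0, 1.0]], weights=[3.0, 1.0])
similarity = cosine_similarity(v_q, v_t)
```

Because cosine similarity is invariant to positive scaling of either vector, the normalization chosen for the weighted average does not affect the resulting similarity.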
step 2: determining a meta-path sequence; firstly traversing a greedy tree to obtain an edge type sequence, and then determining a node type sequence according to the edge type sequence; traversing the generated greedy tree, and separating a path connecting the input node pairs from the greedy tree; l is a path set, and all possible meta-path edge sequences are stored in L; recording the root node as the jth node of the ith layer, wherein i is 0, and j is 0;
2.1 traversing downward from the root node; the root node is the current node, and the j-th node of layer i+1 of the greedy tree is the next node, with j = 0; putting the edge connecting the current node and the next node into the current path sequence l, and recording the length of the next node's dictionary value, i.e. the size of its target node set, as the out-degree of the node;
2.2 updating the current node to the next node of the previous step, the next node now being the j-th leaf node of layer i+1 of the greedy tree, with j = 0; if the mark of the next node is True, putting the edge connecting the current node and the next node into the current path sequence l, recording the size of the next node's target node set as the out-degree of the node, storing the current path sequence l in the set L, letting j = j+1, and going to step 2.3; otherwise, judging whether the next node has an extension edge; if so, putting the edge connecting the current node and the next node into the current path sequence l, recording the size of the next node's target node set as the out-degree of the node, letting i = i+1 and j = 0, and repeating step 2.2; if the next node has no extension edge, letting j = j+1 and going to step 2.3;
2.3 updating the next node to be the j-th leaf node of layer i+1 of the greedy tree; if the mark of the next node is True, putting the edge connecting the current node and the next node into the current path sequence l, recording the size of the next node's target node set as the out-degree of the node, storing the current path sequence l in the set L, letting j = j+1, and repeating step 2.3; otherwise, judging whether the next node has an extension edge; if so, putting the edge connecting the current node and the next node into the current path sequence l, recording the size of the next node's target node set as the out-degree of the node, letting i = i+1 and j = 0, and repeating step 2.2; if the next node has no extension edge, letting j = j+1 and repeating step 2.3;
2.4 after completing the traversal, obtaining the meta-path set $L = \{l_0, \ldots, l_i, \ldots\}$ containing the edge type sequences; for each meta-path $l_i = \{t_0, \ldots, t_j, \ldots\}$ in L, determining the node types according to the edge types $t_j$ therein; finally, obtaining the complete meta-path containing the node type sequence and the edge type sequence;
step 3: calculating the importance of the meta-path; first, defining the formula for meta-path importance according to the factors that influence it; the importance of a meta-path is calculated from the number of instance nodes in the leaf nodes of the greedy tree, and the formula is as follows:
$I_{s,t}(P) = S_{s,t}(P) \cdot R_{s,t}(P) \cdot \mathrm{Penalty}(|P|), \quad P \in P_{s \to t}$ (4)
wherein the importance consists of three parts: the strength term $S_{s,t}(P)$, the rarity term $R_{s,t}(P)$ and the length penalty $\mathrm{Penalty}(|P|)$;
3.1 calculating the length penalty function; obtaining the meta-path length $|P|$ from the meta-path obtained in step 2, and using $\beta^{|P|}$ as the penalty function, where β is an attenuation coefficient set to 0.5;
3.2 calculating the rarity of the meta-path;
the rarity function evaluates, in a given heterogeneous information network $G = (V, E)$, how rare the meta-path P is among the other node pairs similar to the input node pair $\langle s, t \rangle$; $D_{s,t}$ denotes the set of node pairs similar to the input node pair and is defined as:
$D_{s,t} = D_t \cup D_s$ (5)
wherein,
$D_t$ does not include t, and $D_s$ does not include s; the rarity of the meta-path is obtained by calculation through formula (8);
3.3 calculating the meta path strength; the meta path importance support function is:
$S_{s,t}(P) = \mathrm{Strength}(P) \cdot \mathrm{MNIs}_{s,t}(P)$ (9)
wherein $\mathrm{MNIs}(P)$ is the minimum number of instances among the nodes on the meta-path P, calculated as shown in formula (10), and $P_i$ is the number of instances of the i-th node on the meta-path;
$\mathrm{Strength}(P)$ is the strength coefficient of the meta-path P, defined by formula (11); let A be the node with the minimum number of instances obtained by formula (10), with out-degree O(A) and in-degree I(A); when node A is a movie node, its out-degree is calculated by formula (12), where $p_A$ is the instance set of node A: the similarities between the vector of each node in the instance set of node A and the short text vector are summed to obtain the out-degree of node A;
3.4 calculating the importance of the meta-path; calculating the length attenuation coefficient, the rarity and the strength of the meta-path through steps 3.1, 3.2 and 3.3 respectively, and then calculating the final meta-path importance according to formula (4);
step 4: generating query instances by combining multiple meta-paths; instance node pairs that conform to the semantics of a meta-path have high similarity under that meta-path; therefore, to obtain the query result instances, it suffices to find the node pairs that have high semantic similarity under every meta-path;
4.1 calculating the similarity of node pairs based on meta-paths; the similarity calculation formula of the node pairs according to different meta-paths is as follows:
where $\rho_{e_i}(v_i, C_{i+1})$ denotes the set of nodes x of type $C_{i+1}$ that are connected to node $v_i$ via edge $e_i$; $P_{i \ldots n}$ denotes the part of the meta-path from node type $C_i$ to $C_n$; α is a fixed parameter set to 0.5; when the type $C_{i+1}$ of node x is movie, the similarity sum $\sum \mathrm{sim}(V_x, V_Q)$ between the movie text introductions and the query short text is used in place of $|\rho_{e_i}(v_i, C_{i+1})|$ in formula (13);
4.2 calculating the similarity between the entities s and t of a node pair; the similarity $\sigma(s, t \mid P)$ between the entities s and t is calculated with a linear aggregation function, using the importance of each meta-path as the weight of its similarity; the aggregation function is as follows:
wherein $I_j$ denotes the importance corresponding to meta-path $P_j$;
4.3 obtaining the query instances from the similarity matrices; after the meta-path-based node similarities are obtained, a similarity matrix is calculated for each meta-path; if the number of film and television nodes is m, the similarity matrix has size m × m, and the similarity matrix of the meta-path P is denoted $S_P$:
When each meta-path is generated for the first time, its similarity matrix is constructed, and these matrices can be reused; when multiple meta-paths are combined for a query, only the similarity matrices of the corresponding meta-paths need to be selected; the indexes and values of the positions that are non-zero in all of the selected matrices are recorded; the node pairs satisfying the semantics of all the meta-paths are obtained from these indexes, and their similarities are calculated to obtain the query result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011260846.0A CN112380360B (en) | 2020-11-12 | 2020-11-12 | Node query method based on meta-path in heterogeneous information network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112380360A (en) | 2021-02-19 |
CN112380360B (en) | 2022-03-18 |
Family
ID=74583173
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011260846.0A (Active; CN112380360B) | Node query method based on meta-path in heterogeneous information network | 2020-11-12 | 2020-11-12 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112380360B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114357123B (en) * | 2022-03-18 | 2022-06-10 | 北京创新乐知网络技术有限公司 | Data matching method, device and equipment based on hierarchical structure and storage medium |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10656979B2 (en) * | 2016-03-31 | 2020-05-19 | International Business Machines Corporation | Structural and temporal semantics heterogeneous information network (HIN) for process trace clustering |
CN106777339A (en) * | 2017-01-13 | 2017-05-31 | 深圳市唯特视科技有限公司 | A kind of method that author is recognized based on heterogeneous network incorporation model |
CN106802956B (en) * | 2017-01-19 | 2020-06-05 | 山东大学 | Movie recommendation method based on weighted heterogeneous information network |
CN108304496B (en) * | 2018-01-11 | 2022-02-25 | 上海交通大学 | Node similarity relation detection method based on combined element path in heterogeneous information network |
Also Published As
Publication number | Publication date |
---|---|
CN112380360A (en) | 2021-02-19 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||