CN112417224A - Graph embedding method and system for random walk based on entropy drive - Google Patents

Graph embedding method and system for random walk based on entropy drive

Info

Publication number
CN112417224A
CN112417224A
Authority
CN
China
Prior art keywords
node
graph
random
walk
random walk
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011358300.9A
Other languages
Chinese (zh)
Inventor
王芳
冯丹
方鹏
徐湘灏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN202011358300.9A priority Critical patent/CN112417224A/en
Publication of CN112417224A publication Critical patent/CN112417224A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 - Details of database functions independent of the retrieved data types
    • G06F 16/901 - Indexing; Data structures therefor; Storage structures
    • G06F 16/9024 - Graphs; Linked lists
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 - Details of database functions independent of the retrieved data types
    • G06F 16/903 - Querying
    • G06F 16/90335 - Query processing

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a graph embedding method and system based on entropy-driven random walk. The method comprises the following steps: randomly selecting a node u that has not been walked from the graph data as the initial node of a random walk; starting from the current node u, randomly selecting a node v among the neighbor nodes of u as the next-hop candidate, moving from u to v with probability P(u,v), and otherwise backtracking to node u with probability 1 - P(u,v); when the coefficient of determination between the information entropy of the walk path and the path length L reaches its threshold, recording the node sequence of the walk path as a sample; after all vertices have been sampled for one round, ending sampling if the change ΔD_r(p||q) of the relative entropy between the distribution of node occurrence counts in the samples of two consecutive rounds and the degree distribution of the nodes in the original graph is no greater than δ; and sequentially inputting the collected samples into the word embedding model Skip-Gram to generate a vector representation of each node. The invention can embed graph data efficiently and scalably while guaranteeing the effectiveness of the node embedding representations.

Description

Graph embedding method and system for random walk based on entropy drive
Technical Field
The invention belongs to the technical field of graph embedding, and particularly relates to a graph embedding method and system based on entropy-driven random walk.
Background
With the continuous advance of social informatization, the graph, as an important data structure, is widely used in many fields. For example, in a social network, vertices can represent users and edges can represent following or interaction between users; in a chemical molecular network, vertices represent atoms and edges represent chemical bonds; in an e-commerce network, vertices represent users and edges represent users' browsing and rating of items; and so on. In recent years, analysis tasks based on graph data have become increasingly important, and graph embedding has attracted attention as an effective graph analysis technique, with representatives such as DeepWalk, Node2vec, Struc2vec, VERSE and DiaRW. Graph embedding aims to map large-scale, high-dimensional graph data into a low-dimensional dense vector space while retaining the structural and attribute features of the original graph; more importantly, the result can be applied effectively to machine learning tasks such as link prediction, node classification, recommendation systems and visualization.
Random walk has been widely applied in graph embedding techniques because of its high flexibility and its reliance on only local information of the graph. Specifically, random walks are performed on the input graph data to collect node sequences: starting from an initial node, a neighbor node is randomly selected and added to the walk path, and the process is repeated to generate context samples; finally, the samples are processed by a word embedding model from natural language processing to generate a vector representation of each node. However, existing graph embedding techniques generally suffer from large computational overhead and difficulty in guaranteeing effectiveness, and even more so for large-scale graphs. DeepWalk was the first graph embedding technique to perform representation learning on the nodes of a graph based on random walks: node sequences are obtained by truncated random walks, and the Skip-Gram model from natural language processing is then used to learn embedded vector representations of the network nodes. Node2vec is an extension of DeepWalk that introduces a biased random walk to capture the characteristics of nodes in the graph and generate low-dimensional embedding vectors. However, random-walk-based graph embedding techniques generally rely on the walks to collect a large number of node pairs to generate context samples and guarantee embedding quality; for example, running Node2vec with 20 threads on a common commercial machine takes several months to complete the embedded representation of a graph with 1 million nodes and 5 million edges, which undoubtedly poses a great challenge to computing resources.
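For illustration only, the following minimal Python sketch shows the generic sampling loop described above, in the style of DeepWalk with fixed-length truncated walks; the function names, the use of networkx, and the parameter values are assumptions of this sketch, not part of the patent.

```python
# Minimal sketch of the generic random-walk sampling loop described above
# (fixed-length truncated walks in the style of DeepWalk); all names and
# parameter values are illustrative.
import random
import networkx as nx

def truncated_random_walk(graph, start, length=80):
    """Uniformly random walk of a fixed length starting from `start`."""
    path = [start]
    while len(path) < length:
        neighbors = list(graph.neighbors(path[-1]))
        if not neighbors:          # dead end: stop early
            break
        path.append(random.choice(neighbors))
    return path

def collect_walks(graph, walks_per_node=10, length=80):
    """One corpus of walks: every node is used as a start node r times."""
    walks = []
    nodes = list(graph.nodes())
    for _ in range(walks_per_node):
        random.shuffle(nodes)
        for node in nodes:
            walks.append(truncated_random_walk(graph, node, length))
    return walks

if __name__ == "__main__":
    g = nx.karate_club_graph()
    corpus = collect_walks(g, walks_per_node=2, length=20)
    print(len(corpus), "walks collected; first walk:", corpus[0][:10])
```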
In addition, an analysis of existing random-walk-based graph embedding techniques reveals that they preset both the random walk path length (generally L = 80) and the number of samplings per node (generally r = 10), which introduces a large amount of low-quality information and greatly limits the effectiveness and scalability of graph embedding. To explore the influence of the walk path length and the number of walks per node on the representativeness of the sampled information, the information entropy is related to the walk path length, and the relative entropy to the number of node samplings, respectively. Information entropy expresses how much information a random process contains, and researchers have pointed out that the local and global information of a random walk over graph data can be measured by information entropy. On real graph data, as the walk path length L increases, the information entropy of the walk path gradually converges; the entropy corresponding to the fixed walk length L = 80 adopted by existing random-walk-based graph embedding methods has already converged, at which point newly added nodes contribute little to the information expressed by the walk path, and a large amount of redundant information is inevitably generated. As described above, random-walk-based graph embedding techniques generate context samples from the node paths collected by the walks and then load them into a word embedding model for training.
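As a hedged illustration of the convergence observation above, the sketch below computes the Shannon entropy of the empirical node-frequency distribution of every walk prefix; the helper names are invented for this example and are not taken from the patent.

```python
# Illustrative check (not the patent's exact procedure) of how the Shannon
# entropy of node frequencies along a walk changes as the path length grows.
import math
from collections import Counter

def path_entropy(path):
    """Shannon entropy of the empirical node-frequency distribution of a path."""
    counts = Counter(path)
    total = len(path)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def entropy_profile(path):
    """Entropy of the walk prefix after each step, to visualise convergence."""
    return [path_entropy(path[:i]) for i in range(1, len(path) + 1)]

# Example: entropies = entropy_profile(truncated_random_walk(g, 0, length=80))
# On real graphs the profile typically flattens well before L = 80, which is
# the observation that motivates an adaptive stopping rule.
```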
Previous studies ("DeepWalk: Online learning of social representations", published by Perozzi, Al-Rfou et al. in 2014) have shown that the distribution of node occurrence counts in the generated context samples is similar to the degree distribution of the nodes in the original graph structure, and both follow a power-law distribution. Inspired by this, on the real graph data Wiki-Vote, the kernel density distributions of node occurrence counts in the context samples collected by Node2vec and of node degrees in the original graph data were studied as the walk iteration round r increases; as r grows, the distribution of node occurrence counts and the node degree distribution of the original graph gradually show a certain similarity, which confirms the earlier findings. Since the relative entropy D is an effective tool for evaluating the difference between two distributions, the relative entropy of the two distributions was further computed after each round of walks; it can be found that when r = 7, the relative entropy of the two distributions already tends to be stable, indicating that the context samples collected by the random walks are sufficient. If the usual fixed number of walks r = 10 is used, information redundancy is generated, which reduces the training performance of the word embedding model and in turn affects the accuracy of downstream tasks. Therefore, the "one-size-fits-all" strategy adopted by existing random-walk-based graph embedding methods, which presets the random walk path length and the number of samplings per node from empirical values, affects the effectiveness of the training model on the one hand and seriously drags down the computation performance on the other.
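The comparison described in this paragraph can be sketched as follows; the smoothing constant and the helper names are assumptions of this illustration, not taken from the patent.

```python
# Sketch of the relative-entropy (KL divergence) comparison described above:
# occurrence frequencies of nodes in the collected walks vs. the degree
# distribution of the original graph. Smoothing and names are illustrative.
import math
from collections import Counter

def degree_distribution(graph):
    total = sum(dict(graph.degree()).values())
    return {v: d / total for v, d in graph.degree()}

def occurrence_distribution(walks, nodes, eps=1e-12):
    counts = Counter(n for walk in walks for n in walk)
    total = sum(counts.values()) or 1
    return {v: (counts[v] + eps) / total for v in nodes}

def relative_entropy(p, q):
    """D(p || q) = sum_i p_i * log(p_i / q_i), summed over the support of p."""
    return sum(pi * math.log(pi / q[v]) for v, pi in p.items() if pi > 0)

# After each sampling round r one can compute D_r = relative_entropy(p, q_r)
# and stop once |D_r - D_{r-1}| falls below a small threshold, instead of
# always running a fixed r = 10 rounds.
```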
Therefore, how to design an efficient and scalable graph embedding method and system is a problem to be solved by those skilled in the art.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a graph embedding method and system based on entropy-driven random walk, and aims to solve the problem of poor effectiveness and scalability caused by the "one-size-fits-all" strategy of the prior art, which presets the random walk path length and the number of samplings per node from empirical values.
The invention provides a graph embedding method based on entropy-driven random walk, which comprises the following steps:
(1) randomly selecting a node in the graph data as the starting point of a random walk according to a mixed-attribute-aware random walk strategy, and extracting context samples for each node;
(2) sequentially inputting the context samples into the widely used word embedding model Skip-Gram, so as to embed the high-dimensional graph data into a low-dimensional vector space.
Further, the step (1) specifically comprises:
(1.1) capturing the characteristics of each node in the graph data based on a mixed-attribute-aware random walk strategy to complete the context sampling;
(1.2) as each node is walked, evaluating the information of the random walk path based on the information entropy to determine the length of the walk path;
(1.3) after all vertices in the graph have been sampled once, measuring, based on the relative entropy, the change in the difference between the generated context samples and the node degree distribution of the original graph data to determine whether random walk sampling ends.
Further, the step (1.1) specifically includes:
(1.1.1) randomly selecting a node that has not been walked from the graph data as the initial node;
(1.1.2) if the current walk node is u, randomly selecting a node v from all neighbor nodes of u;
(1.1.3) determining the probability p of the current walk node u transferring to the random node v from the number Cm(u, v) of common neighbors of u and v, the degree deg(u) of u and the degree deg(v) of v; if the probability is accepted, adding the random node v to the walk path, and if it is rejected, the walk moves to v and then backtracks to the current walk node u (a minimal sketch of this accept-or-backtrack step is given after this list).
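A minimal sketch of the accept-or-backtrack step of (1.1.1)-(1.1.3) follows. The patent only states that the transition probability is derived from Cm(u, v), deg(u) and deg(v) and normalized by a standard function; the concrete weight alpha() and the logistic normalization Z() used here are illustrative placeholders, not the patent's formulas.

```python
# Hedged sketch of step (1.1): one accept/backtrack move of the
# mixed-attribute-aware walk. alpha() and Z() are assumed placeholders.
import math
import random
import networkx as nx

def alpha(graph, u, v):
    """Assumed mixed-attribute weight: common-neighbor count plus degree bias."""
    cm = len(list(nx.common_neighbors(graph, u, v)))
    return (cm + 1) * graph.degree(v) / graph.degree(u)

def Z(x):
    """Assumed normalization: logistic function mapping the weight into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def walk_step(graph, u):
    """Pick a random neighbor v; accept with probability P(u, v), else backtrack."""
    neighbors = list(graph.neighbors(u))
    if not neighbors:
        return u                      # isolated node: nowhere to go
    v = random.choice(neighbors)
    p_uv = Z(alpha(graph, u, v))
    if random.random() < p_uv:
        return v                      # walk moves to v
    return u                          # walk backtracks to u
```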
Further, the step (1.2) specifically includes:
(1.2.1) denoting the random walk path starting from the current walk node u as W_u, and selecting nodes to join W_u through the walk strategy of step (1.1);
(1.2.2) as each random node v_i joins the walk path, calculating the frequency P(v_i) with which v_i occurs in W_u; since information entropy can measure how much information a random process contains, the information entropy H(W_u) of the walk path, which varies with the walk length L, is determined from P(v_i);
(1.2.3) deciding, through the coefficient of determination R^2(H, L) between the information entropy H of the walk path and the walk length L, whether the random walk starting from the current walk node u terminates: if R^2(H, L) < μ, where 0.99 ≤ μ < 1, go to step (1.2.1); otherwise, record the node sequence of the walk path and go to step (1.1.1) (a sketch of this entropy-based termination check is given after this list).
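The following sketch illustrates the entropy-based termination of step (1.2). Fitting the entropy values H against the path length L with ordinary least squares and stopping once R^2(H, L) ≥ μ is one plausible reading of the coefficient-of-determination criterion; the patent does not spell out the exact fit, so this is an assumption of the example.

```python
# Hedged sketch of step (1.2): stop a walk once the coefficient of
# determination R^2 between the entropy values H and the path length L is
# high enough. The OLS fit is my reading of "coefficient of determination".
import math
from collections import Counter

def path_entropy(path):
    counts = Counter(path)
    n = len(path)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def r_squared(xs, ys):
    """R^2 of an ordinary least-squares line fitted to (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 1.0  # constant series: treat as fully explained
    return (sxy * sxy) / (sxx * syy)

def should_stop(entropies, l_min=10, mu=0.99):
    """Terminate when L >= L_min and R^2(H, L) >= mu."""
    L = len(entropies)
    if L < l_min:
        return False
    lengths = list(range(1, L + 1))
    return r_squared(lengths, entropies) >= mu
```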
Further, the step (1.3) specifically includes:
(1.3.1) when all nodes in the graph data have completed one round of random walk sampling, obtaining the frequency q(v_i) with which node v_i occurs in the collected context samples;
(1.3.2) based on the similarity between the node occurrence frequency q(v_i) in the context samples and the degree distribution p(v_i) of nodes in the original graph data, characterizing the difference between q(v_i) and p(v_i) by the relative entropy D(p||q);
(1.3.3) as the iteration round r increases, observing the difference ΔD_r(p||q) of D(p||q) between two adjacent rounds: if ΔD_r(p||q) ≥ δ, where 0 < δ ≤ 0.01, go to step (1.1); otherwise, sampling ends (a sketch of this stopping criterion is given after this list).
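A sketch of the round-level stopping criterion of step (1.3) follows; `run_one_round` stands in for one pass of the entropy-driven walks over all nodes and is an assumed callback, and the smoothing constant is likewise an assumption of this example.

```python
# Hedged sketch of step (1.3): after each full round of walks, compare the
# relative entropy D_r(p || q) with the previous round and stop sampling once
# the change drops to delta or below.
import math
from collections import Counter

def relative_entropy(p, q, eps=1e-12):
    """D(p || q) with a small epsilon to guard against unvisited nodes."""
    return sum(pi * math.log(pi / max(q.get(v, 0.0), eps))
               for v, pi in p.items() if pi > 0)

def sample_until_stable(graph, run_one_round, delta=0.01, max_rounds=50):
    degs = dict(graph.degree())
    total_deg = sum(degs.values())
    p = {v: d / total_deg for v, d in degs.items()}   # degree distribution
    walks, prev_d = [], None
    for r in range(max_rounds):
        walks.extend(run_one_round(graph))             # one round over all nodes
        counts = Counter(n for w in walks for n in w)
        total = sum(counts.values())
        q = {v: counts[v] / total for v in graph.nodes()}
        d = relative_entropy(p, q)
        if prev_d is not None and abs(d - prev_d) <= delta:
            break                                      # distributions stabilized
        prev_d = d
    return walks
```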
Further, the step (2) is specifically: sequentially inputting the extracted context samples into the word embedding model Skip-Gram, generating a vector representation of each node through the word embedding model, and thereby embedding the graph data into a low-dimensional vector space.
Skip-Gram is the word embedding model commonly used in current random-walk-based graph embedding techniques; it predicts the context of a given input word, i.e., the goal of the model is to compute the probability that other words occur given a word. On a graph data structure, learning with Skip-Gram over the node paths collected by the random walks yields vector representations of the nodes simply and efficiently (a minimal training sketch is given below).
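One possible way to run the Skip-Gram step is shown below with gensim's Word2Vec (sg=1); the library choice and all hyper-parameter values are assumptions of this sketch rather than requirements of the patent.

```python
# Illustrative Skip-Gram training on the collected node sequences.
from gensim.models import Word2Vec

def train_skip_gram(walks, dim=128, window=5):
    # Node ids must be strings for gensim's vocabulary handling.
    sentences = [[str(n) for n in walk] for walk in walks]
    model = Word2Vec(sentences, vector_size=dim, window=window,
                     sg=1, min_count=0, workers=4, epochs=5)
    # Assuming integer node ids in the original graph.
    return {int(node): model.wv[node] for node in model.wv.index_to_key}

# embeddings = train_skip_gram(corpus)   # dict: node id -> 128-d vector
```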
The invention also provides a graph embedding system based on entropy-driven random walk, which comprises a processor and a computer-readable storage medium, wherein the computer-readable storage medium is used for storing an executable program; the processor is used for reading the executable program in the computer readable storage medium and executing the graph embedding method.
Generally, compared with the prior art, the above technical solution conceived by the present invention has the following beneficial effects:
(1) In the graph embedding method based on entropy-driven random walk, considering that real-world graph data generally exhibits scale-free characteristics and that common neighbors in graph data can describe the similarity between nodes, a mixed-attribute-aware random walk strategy is proposed to capture the characteristics of the nodes during the random walk.
(2) The length of the walk path and the number of repeated samplings per node are determined by the information entropy and the relative entropy respectively, which removes the need of existing models to set these hyper-parameters by manual experience. Compared with the "one-size-fits-all" strategy adopted by traditional graph embedding methods, this is more flexible: the effectiveness of the sampled context information can be guaranteed without repeated trials to obtain empirical values, the generation of redundant information during sampling is reduced, the sampling efficiency is greatly improved, the training performance of the word embedding model is improved without losing accuracy on downstream tasks, and the scalability of the graph embedding model is guaranteed.
Drawings
FIG. 1 is a flowchart of an implementation of a graph embedding method based on entropy-driven random walk according to an embodiment of the present invention;
FIG. 2 is a diagram of random walk examples provided by an embodiment of the present invention;
fig. 3 is a schematic diagram of a graph embedding method based on entropy-driven random walk according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides a graph embedding method and system based on entropy-driven random walk, and aims to make the learned node embedding vectors retain the attribute information of the original graph as far as possible, and to determine the path length and the number of walks of the random walk from the information collected during the walk, so that the accuracy of downstream tasks is improved and the scalability of the system is guaranteed.
To achieve the above object, the present invention provides a graph embedding method based on entropy-driven random walk, which comprises the following steps:
(1) randomly selecting a node in the graph data as the starting point of a random walk according to the random walk strategy, and extracting context samples for each node;
(2) sequentially inputting the extracted context samples into a word embedding model to generate a vector representation of each node, thereby embedding the high-dimensional graph data into a low-dimensional vector space.
Specifically, the step (1) comprises the following steps:
(1.1) capturing the characteristics of each node in the graph data based on a mixed attribute-aware random walk strategy to complete context sampling;
(1.1.1) randomly selecting a node that has not been walked from the graph data as the initial node;
(1.1.2) if the current walk node is u, randomly selecting a node v from all neighbor nodes of u;
(1.1.3) because real-world graph data is generally scale-free and common neighbors in graph data can represent the similarity between nodes, and because high-degree nodes generally play an important role in graph data structures (for example, information dissemination in social networks and infectious disease control in disease propagation networks), a random walk over graph data visits high-degree nodes many times and can capture richer node features from them. Therefore, the probability p of transferring from node u to node v is determined from the number Cm(u, v) of common neighbors of u and v, the degree deg(u) of u and the degree deg(v) of v; if the probability is accepted, node v is added to the walk path, and if it is rejected, the walk moves to v and then backtracks to node u;
(1.2) evaluating information of the random walk path along with the random walk of each node based on the information entropy to determine the length of the walk path;
(1.2.1) a random walk path starting from node u as the start node is denoted W_u; nodes are selected through step (1.1) and joined to W_u;
(1.2.2) as each node v_i is added to the walk path, the frequency P(v_i) with which v_i occurs in W_u is calculated; since information entropy can measure how much information a random process contains, the information entropy H(W_u) of the walk path, varying with the walk length L, is determined from P(v_i);
(1.2.3) whether the random walk starting from node u terminates is decided through the coefficient of determination R^2(H, L) between the information entropy H and the walk length L of the walk path: if R^2(H, L) < μ, where 0.99 ≤ μ < 1, go to step (1.2.1); otherwise, record the node sequence of the walk path and go to step (1.1.1);
(1.3) after all vertices in the graph have been sampled once, the generated context samples are measured based on the relative entropy to determine whether random walk sampling ends;
(1.3.1) after all nodes in the graph data have completed one round of random walk sampling, the frequency q(v_i) with which node v_i occurs in the collected context samples can be determined;
(1.3.2) researchers have previously found that the node occurrence frequency q(v_i) in the context samples is similar to the degree distribution p(v_i) of nodes in the original graph data; therefore, the difference between q(v_i) and p(v_i) is characterized by the relative entropy D(p||q);
(1.3.3) as the iteration round r increases, the difference ΔD_r(p||q) of D(p||q) between two adjacent rounds is observed: if ΔD_r(p||q) ≥ δ, where 0 < δ ≤ 0.01, go to step (1.1); otherwise, sampling ends;
specifically, the step (2) includes:
(2.1) sequentially inputting the extracted context samples into the word embedding model Skip-Gram, or into other models such as GloVe or CBOW;
(2.2) generating a vector representation of each node through the word embedding model, thereby embedding the graph data into a low-dimensional vector space.
The invention also provides a graph embedding system based on entropy-driven random walk, which comprises a processor and a computer-readable storage medium, wherein the computer-readable storage medium is used for storing an executable program; the processor is used for reading the executable program in the computer-readable storage medium and executing the graph embedding method based on entropy-driven random walk provided by the first aspect of the invention.
To further illustrate the graph embedding method and system based on entropy-driven random walk provided by the embodiments of the present invention, a detailed description is given below with reference to the accompanying drawings and specific examples:
as shown in FIG. 1, the invention discloses a graph embedding method of entropy-driven random walk, which comprises the following steps:
(1) randomly selecting a node u that has not been walked from the graph data as the initial node of a random walk, and, to facilitate word embedding model training, setting a minimum walk path length threshold L_min and an evaluation threshold μ for the walk path length L;
(2) starting random walk from a current node u, and randomly selecting a node v from neighbor nodes of the node u as a next hop candidate node of the walk;
(3) when selecting the next-hop node, in order to reduce the walk overhead, a backtracking strategy frequently used in random walks is adopted: the walk moves from u to v with probability P(u,v), and otherwise backtracks to node u with probability 1 - P(u,v);
FIG. 2 is an example of a random walk over graph data. Suppose the current walk node is v_0; the next-hop node is selected from v_0's neighbor nodes {v_1, v_2, v_3, v_4}. Among existing graph embedding techniques, DeepWalk selects the next-hop node with uniform probability, but this cannot effectively capture the characteristics of nodes in the graph structure, in particular the pronounced scale-free property that graph structures commonly exhibit; Node2vec performs a biased random walk that accounts for homophily and structural equivalence between nodes by setting two hyper-parameters, but tuning these parameters consumes a large amount of overhead; the recent graph embedding method DiaRW proposes a degree-biased random walk in the sampling phase, which considers the scale-free property of the graph data structure but ignores implicit features between nodes, such as node similarity. Based on these considerations, the invention proposes a mixed-attribute-aware random walk strategy to capture node characteristics: real-world graph data is generally scale-free, and common neighbors in the graph data can represent the similarity between nodes. High-degree nodes generally play an important role in graph data structures, for example in information dissemination in social networks and infectious disease control in disease propagation networks. During a random walk over graph data, high-degree nodes are visited many times, and richer node features in the graph can be captured from them.
By defining the similarity between nodes from their common neighbors Cm(u, v) and combining it with the degrees of node u and node v, high-degree nodes are exploited as walk weights so as to capture richer information and reduce redundancy in the subsequent walk; the weight α(u, v) of the transition from node u to node v can thus be defined as a function of Cm(u, v), deg(u) and deg(v). Taking FIG. 2 as an example, when the next-hop node is selected from the neighbors of v_0 under the walk strategy of the invention, v_1 is added to the walk path as the next-hop node.
In order to reduce the overhead generated during the walk, the invention adopts a backtracking strategy frequently used in random walk research. In each step of the walk, a node v is randomly selected as the candidate node; with probability P(u,v) the walk transfers from u to v, and otherwise it backtracks to node u with probability 1 - P(u,v). The transition probability from u to v can therefore be defined as P(u,v) = Z(α(u,v)), where Z is a normalization function widely used in machine learning and well suited to the backtracking strategy adopted in the invention;
(4) if the walk path length satisfies L ≥ L_min, the coefficient of determination R^2(H, L) between the information entropy of the walk path and the path length L can be determined as nodes join the walk path; if R^2(H, L) ≥ μ, for example μ = 0.99, the node sequence of the walk path is recorded into the context samples and the procedure goes to step (5); otherwise it goes to step (2);
In conventional random-walk-based graph embedding techniques, the length of the walk path is preset with an empirical value (generally L = 80) before the random walk. As analyzed above, when L is too large a great deal of redundant information is generated and the computation and storage efficiency of the sampling process is directly affected, while when L is too small the validity of the information is hard to guarantee. Based on this finding, the invention uses the information entropy to measure the effectiveness of the information in the collected path W_u as nodes join during the walk.
As node v_i joins the collected path, the frequency P(v_i) with which v_i occurs in W_u can be determined; since information entropy can measure how much information a random process contains, the information entropy H(W_u) of the walk path, varying with the walk length, is determined from P(v_i). As mentioned above, as L grows the information entropy of the walk path gradually stabilizes, indicating that nodes newly joining the path contribute little to the expression of information. Therefore, the coefficient of determination R^2(H, L) between the information entropy H and the walk length L is used to decide whether the random walk starting from node u terminates;
(5) if there are still nodes that have not been walked, go to step (1); otherwise the current round of random walks is finished and the procedure goes to step (6);
(6) after one round of sampling has been completed over all vertices, the generated context samples are measured based on the relative entropy to determine whether random walk sampling ends. If the change ΔD_r(p||q) between the relative entropy D_r of the current round r (between the distribution of node occurrence counts in the collected context samples and the node degree distribution of the original graph data) and that of the previous round D_{r-1} satisfies ΔD_r(p||q) ≥ δ, for example δ = 0.01, go to step (1) to start a new round of random walks; otherwise, sampling ends;
the existing graph embedding technology based on random walk generally needs to perform multiple rounds of sampling on nodes to ensure the training quality of generated context samples in a word embedding model. As the walk path adopts a fixed length, in the existing graph embedding technology, a preset node walk frequency (generally set to r being 10) is often adopted to sample nodes for multiple times, but serious performance problems are brought along with the sampling, and as mentioned above, as the iteration frequency r of the nodes increases, the distribution of the node appearance frequency in the collected upper and lower samples gradually tends to be stable and gradually shows similarity with the distribution of the node degree in the original graph. Therefore, the present invention intends to solve the performance problem caused by the fixed node walk times adopted in the prior art graph embedding technology by describing the variation of the two distributions.
Relative entropy has been widely used in various fields as a tool for quantifying the difference between two distributions; based on this, the invention uses the relative entropy to determine the number of node walks. When all nodes in the graph data have completed one round of random walk sampling, the frequency q(v_i) with which node v_i occurs in the collected context samples can be determined, while the degree distribution p(v_i) of nodes in the original graph data remains unchanged across iteration rounds. Therefore, the relative entropy D(p||q) between q(v_i) and p(v_i) can be determined for each round r; as the iteration round r increases, the difference ΔD_r(p||q) between D_r(p||q) and D_{r-1}(p||q) is observed, and whether sampling ends is decided according to ΔD_r(p||q);
(7) the extracted context samples are sequentially input into the word embedding model Skip-Gram, a vector representation of each node is generated through the word embedding model, and the graph data is thereby embedded into a low-dimensional vector space.
By adopting the graph embedding method based on entropy-driven random walk, not only can the feature information of the nodes be fully collected, but the information redundancy of the random walk process can also be effectively reduced, and the effectiveness of the collected context sample information is guaranteed. Compared with existing random-walk-based graph embedding methods, namely DeepWalk ("DeepWalk: Online Learning of Social Representations", Perozzi, Al-Rfou et al., 2014), Node2vec ("node2vec: Scalable Feature Learning for Networks", Grover and Leskovec, 2016) and DiaRW ("Degree-biased random walk for large-scale network embedding", Zhang, Shi et al., 2019), the graph embedding running time of the invention is shortened by 12.8 times on average over several graph data sets of different scales, while the memory occupied during the random walk is reduced by 68.9% on average; in the downstream tasks of link prediction and node classification, the node representation vectors generated by the method of the invention also improve the respective scores compared with the above techniques. In summary, the graph embedding method based on entropy-driven random walk can guarantee the effectiveness of graph data embedding while providing good scalability.
According to a second aspect of the invention, there is also provided a system comprising a processor and a computer-readable storage medium for storing an executable program; the processor is configured to read the executable program in the computer-readable storage medium and execute the graph embedding method based on entropy-driven random walk. FIG. 3 is a schematic diagram of the graph embedding method of the present invention, which specifically includes: input of the graph data, collection of context samples based on random walks, and word embedding training that generates the node representation vectors.
The method performs random walks on the input graph data set, captures the characteristics of nodes based on the mixed-attribute-aware random walk strategy, adds nodes to the collected walk paths during the walk, determines the length of each walk through the information entropy, and, after each round of random walks is finished, determines the number of walks per node through the relative entropy between the collected context samples and the original graph data. After the random walk sampling ends, the collected context samples are input into the word embedding model Skip-Gram to generate a vector representation of each node, thereby embedding the graph data into a low-dimensional vector space (an end-to-end sketch of this pipeline is given below).
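The end-to-end assembly below wires together the hedged helpers sketched earlier (walk_step, path_entropy, should_stop, sample_until_stable, train_skip_gram); it is an illustrative composition of those sketches, not the patent's reference implementation, and assumes the earlier definitions are in scope.

```python
# End-to-end sketch of the pipeline in FIG. 3, built from the earlier
# hedged helpers; all names and thresholds are illustrative.
import random

def entropy_driven_walk(graph, start, l_min=10, mu=0.99):
    """One walk: grow the path until R^2(H, L) >= mu (see earlier sketch)."""
    path, entropies = [start], [0.0]
    while not should_stop(entropies, l_min=l_min, mu=mu):
        nxt = walk_step(graph, path[-1])       # accept-or-backtrack move
        path.append(nxt)
        entropies.append(path_entropy(path))
    return path

def one_round(graph):
    """One sampling round: every node starts exactly one entropy-driven walk."""
    nodes = list(graph.nodes())
    random.shuffle(nodes)
    return [entropy_driven_walk(graph, u) for u in nodes]

def embed_graph(graph, delta=0.01, dim=128):
    walks = sample_until_stable(graph, one_round, delta=delta)
    return train_skip_gram(walks, dim=dim)
```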
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (9)

1. A graph embedding method based on entropy-driven random walk, characterized by comprising the following steps:
(1) randomly selecting a node in the graph data as the starting point of a random walk according to a mixed-attribute-aware random walk strategy, and extracting context samples for each node;
(2) sequentially inputting the context samples into a word embedding model so as to embed the high-dimensional graph data into a low-dimensional vector space.
2. The graph embedding method according to claim 1, wherein the step (1) specifically includes:
(1.1) capturing the characteristics of each node in the graph data based on the mixed-attribute-aware random walk strategy so as to complete the context sampling;
(1.2) as each node is walked, evaluating the information of the random walk path through the information entropy to determine the length of the walk path;
(1.3) when all vertices in the graph have been sampled once, measuring through the relative entropy the change in the difference between the generated context samples and the node degree distribution of the original graph data to determine whether random walk sampling ends.
3. The graph embedding method according to claim 2, wherein the step (1.1) specifically comprises:
(1.1.1) randomly selecting a node that has not been walked from the graph data as the initial node;
(1.1.2) if the current walk node is u, randomly selecting a node v from all neighbor nodes of u;
(1.1.3) determining the probability p of the current walk node u transferring to the random node v from the number Cm(u, v) of common neighbors of u and v, the degree deg(u) of u and the degree deg(v) of v; if the probability is accepted, adding the random node v to the walk path, and if it is rejected, the walk moves to v and then backtracks to the current walk node u.
4. The graph embedding method of claim 3, wherein a backtracking strategy in random walks is adopted: the walk moves from u to v with probability P(u,v), and otherwise backtracks to node u with probability 1 - P(u,v); wherein Cm(u, v) is the number of common neighbors of the current walk node u and the random node v, deg(u) is the degree of the current walk node u, and deg(v) is the degree of the random node v.
5. The graph embedding method according to any one of claims 2 to 4, wherein the step (1.2) specifically comprises:
(1.2.1) denoting the random walk path starting from the current walk node u as W_u, and selecting nodes to join W_u through step (1.1);
(1.2.2) as each random node v_i joins the walk path, calculating the frequency P(v_i) with which v_i occurs in W_u; since information entropy can measure how much information a random process contains, the information entropy H(W_u) of the walk path, which varies with the walk length L, is determined from P(v_i);
(1.2.3) deciding, through the coefficient of determination R^2(H, L) between the information entropy H of the walk path and the walk length L, whether the random walk starting from the current walk node u terminates: if R^2(H, L) < μ, where 0.99 ≤ μ < 1, go to step (1.2.1); otherwise, record the node sequence of the walk path and go to step (1.1.1).
6. The graph embedding method according to any one of claims 2 to 5, wherein the step (1.3) specifically comprises:
(1.3.1) when all nodes in the graph data have completed one round of random walk sampling, obtaining the frequency q(v_i) with which node v_i occurs in the collected context samples;
(1.3.2) based on the similarity between the node occurrence frequency q(v_i) in the context samples and the degree distribution p(v_i) of nodes in the original graph data, characterizing the difference between q(v_i) and p(v_i) by the relative entropy D(p||q);
(1.3.3) as the iteration round r increases, observing the difference ΔD_r(p||q) of D(p||q) between two adjacent rounds: if ΔD_r(p||q) ≥ δ, where 0 < δ ≤ 0.01, go to step (1.1); otherwise, sampling ends.
7. The graph embedding method according to any one of claims 1 to 6, wherein the step (2) is specifically: sequentially inputting the extracted context samples into a word embedding model, generating a vector representation of each node through the word embedding model, and embedding the graph data into a low-dimensional vector space.
8. The graph embedding method of claim 7, wherein the word embedding model is a Skip-Gram model.
9. An entropy-driven random walk-based graph embedding system, comprising a processor and a computer-readable storage medium for storing an executable program; the processor is used for reading an executable program in a computer readable storage medium and executing the graph embedding method of any one of claims 1 to 8.
CN202011358300.9A 2020-11-27 2020-11-27 Graph embedding method and system for random walk based on entropy drive Pending CN112417224A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011358300.9A CN112417224A (en) 2020-11-27 2020-11-27 Graph embedding method and system for random walk based on entropy drive

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011358300.9A CN112417224A (en) 2020-11-27 2020-11-27 Graph embedding method and system for random walk based on entropy drive

Publications (1)

Publication Number Publication Date
CN112417224A true CN112417224A (en) 2021-02-26

Family

ID=74843096

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011358300.9A Pending CN112417224A (en) 2020-11-27 2020-11-27 Graph embedding method and system for random walk based on entropy drive

Country Status (1)

Country Link
CN (1) CN112417224A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022198713A1 (en) * 2021-03-25 2022-09-29 上海交通大学 Graphics processing unit-based graph sampling and random walk acceleration method and system

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110347881A (en) * 2019-06-19 2019-10-18 西安交通大学 A kind of group's discovery method for recalling figure insertion based on path

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110347881A (en) * 2019-06-19 2019-10-18 西安交通大学 A kind of group's discovery method for recalling figure insertion based on path

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022198713A1 (en) * 2021-03-25 2022-09-29 上海交通大学 Graphics processing unit-based graph sampling and random walk acceleration method and system

Similar Documents

Publication Publication Date Title
CN111797321B (en) Personalized knowledge recommendation method and system for different scenes
CN109165664A (en) A kind of attribute missing data collection completion and prediction technique based on generation confrontation network
Fu et al. Deep reinforcement learning framework for category-based item recommendation
CN112214689A (en) Method and system for maximizing influence of group in social network
CN109471982B (en) Web service recommendation method based on QoS (quality of service) perception of user and service clustering
Zhang et al. Learning to walk with dual agents for knowledge graph reasoning
CN109740106A (en) Large-scale network betweenness approximation method based on graph convolution neural network, storage device and storage medium
CN110866134A (en) Image retrieval-oriented distribution consistency keeping metric learning method
Shen et al. The Application of Artificial Intelligence to The Bayesian Model Algorithm for Combining Genome Data
CN114520743A (en) Method and system for detecting network abnormal flow and storable medium
CN112417224A (en) Graph embedding method and system for random walk based on entropy drive
de Castro et al. BAIS: A Bayesian Artificial Immune System for the effective handling of building blocks
Priya et al. Community Detection in Networks: A Comparative study
Thamarai et al. An evolutionary computation approach for project selection in analogy based software effort estimation
CN113128689A (en) Entity relationship path reasoning method and system for regulating knowledge graph
Sinha et al. Completely automated cnn architecture design based on vgg blocks for fingerprinting localisation
CN116484979A (en) Information-driven-oriented distributed graph representation learning method and system
CN111126443A (en) Network representation learning method based on random walk
CN115577290A (en) Distribution network fault classification and source positioning method based on deep learning
CN113779385A (en) Friend attention degree measurement sequencing method and system based on complex network graph embedding
Chen et al. Efficient model evaluation in the search-based approach to latent structure discovery
CN114139937A (en) Indoor thermal comfort data generation method, system, equipment and medium
Cai et al. Application of improved wavelet neural network in MBR flux prediction
Zhang et al. PA-FEAT: Fast Feature Selection for Structured Data via Progress-Aware Multi-Task Deep Reinforcement Learning
Nugroho et al. Decision tree using ant colony for classification of health data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination