CN116090525A - Embedded vector representation method and system based on hierarchical random walk sampling strategy - Google Patents


Info

Publication number
CN116090525A
CN116090525A
Authority
CN
China
Prior art keywords
node
sequence
walk
layer
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211423375.XA
Other languages
Chinese (zh)
Other versions
CN116090525B (en)
Inventor
郭仕钧
徐圣兵
谢锐
王振友
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN202211423375.XA priority Critical patent/CN116090525B/en
Publication of CN116090525A publication Critical patent/CN116090525A/en
Application granted granted Critical
Publication of CN116090525B publication Critical patent/CN116090525B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an embedded vector representation method and system based on a hierarchical random walk sampling strategy. The method comprises the following steps: setting node parameters of a network structure and randomly selecting a node in the network structure to obtain a starting node; carrying out neighborhood division on the starting node to obtain node layers; carrying out random walk processing on the node layers according to a node selection rule to obtain node walk sequences; and inputting the obtained node walk sequences into a word2vec model for vectorized representation training to obtain the network embedded vector representations corresponding to all walk nodes. The system comprises: a selection module, a division module, a walk module, and a training module. By using the method and the system, neighbor node information can be fully considered, and network embedded vector representation learning can be realized through random walk sampling based on hierarchical priority. The embedded vector representation method and system based on the hierarchical random walk sampling strategy can be widely applied to the technical field of computer data mining.

Description

Embedded vector representation method and system based on hierarchical random walk sampling strategy
Technical Field
The invention relates to the technical field of computer data mining, in particular to an embedded vector representation method and system based on a hierarchical random walk sampling strategy.
Background
Clustering, link prediction, and classification are common tasks in complex network analysis. The attribute information of nodes in a complex network is difficult to acquire, whereas the structural information of the network is easy to acquire; analysis based on network structure information is therefore receiving more and more attention. For these machine learning problems, the primary task is to establish a set of feature vectors capable of accurately expressing the network structure information, i.e., to construct a feature vector representation for nodes and edges, known as a network embedded vector representation. The commonly used approaches at the present stage include manual feature extraction, machine learning, and dimensionality reduction. The earliest machine learning method for graph embedded vectors is the DeepWalk algorithm, from which the node2vec algorithm was further developed as a semi-supervised method for scalable feature learning in networks. node2vec is mainly characterized by a return parameter p and an in-out parameter q, which turn the uniform random walk of DeepWalk into a biased random walk; the walk sequences obtained by sampling are still used to generate the network embedded vector representation. The node2vec algorithm balances breadth-first and depth-first walks over the nodes, but it does not fully exploit the local information of a node's neighbors, nor does it take into account the fact that some nodes in an actual network are more important than others, so the importance of nodes is not reflected in the sampling.
Disclosure of Invention
In order to solve the technical problems, the invention aims to provide an embedded vector representation method and system based on a hierarchical random walk sampling strategy, which can fully consider neighbor node information and further realize network embedded vector characterization learning through hierarchical priority-based random walk sampling.
The first technical scheme adopted by the invention is as follows: the embedded vector representation method based on the hierarchical random walk sampling strategy comprises the following steps:
setting node parameters of a network structure, and randomly selecting nodes in the network structure to obtain an initial node;
carrying out neighborhood division processing on the initial node to obtain a node layer;
according to the node selection rule, carrying out random walk processing on the node layer to obtain a node walk sequence;
and inputting the obtained node walk sequence into a word2vec model for vectorization characterization training to obtain network embedded vector characterization corresponding to all the walk nodes.
Further, the step of setting a node parameter of the network structure and randomly selecting a node in the network structure to obtain a starting node specifically includes:
setting network structure node parameters, wherein the network structure node parameters comprise an upper limit on the number of walk sequences per starting node, the total length of a node walk sequence in one walk, the length of the node walk sequence walked in the child node layer, and the length of the node walk sequence walked in the grandchild node layer;
and randomly selecting a node in the network structure according to a starting node selection condition, wherein the starting node selection condition is that, in the currently acquired walk sequence set, the number of walk sequences taking the node as a starting node must be smaller than the upper limit on the number of walk sequences per starting node.
Further, the step of performing neighborhood division processing on the starting node to obtain a node layer specifically includes:
preprocessing the initial node to obtain a preprocessed initial node;
connecting the preprocessed direct neighbor nodes of the initial node, dividing the range of the sub-node layer, and generating the sub-node layer;
connecting the indirect neighbor nodes of the preprocessed starting node, dividing the range of the grandchild node layer, and generating the grandchild node layer;
and integrating the child node layer and the grandchild node layer to construct a node layer.
Further, the step of preprocessing the starting node to obtain a preprocessed starting node specifically includes:
judging the attribute of the starting node;
if the starting node is judged to have a self-loop, removing the self-loop;
if the starting node is judged to be an isolated node, retaining the starting node;
and integrating the starting node after self-loop removal and the retained isolated node to obtain the preprocessed starting node.
Further, the step of performing random walk processing on the node layer according to the node selection rule to obtain a node walk sequence specifically includes:
adding weight to the node layer according to the node selection rule to obtain a node layer with a weight value;
the node layer with the weight value comprises a child node layer with the weight value and a grandchild node layer with the weight value;
carrying out random walk processing on the sub-node layers with the weight values according to the range of the sub-node layers to obtain a sub-node walk sequence;
carrying out random walk processing on the grandchild node layer with the weight value according to the grandchild node layer range to obtain a grandchild node walk sequence;
integrating the child node walk sequence and the grandchild node walk sequence to obtain the node walk sequence.
Further, the step of performing random walk processing on the child node layer with the weight value according to the child node layer range to obtain a child node walk sequence specifically includes:
initializing the child node layer, wherein the initialization comprises defining the starting node as the head node and the current node, and initializing a list for storing the walk sequence of one walk and a list for temporarily storing the set of sampled nodes to be empty;
selecting a free child node within the child node layer range according to the node weight values, pointing the current node to the free child node, and embedding it into the list storing the walk sequence of one walk and the list temporarily storing the set of sampled nodes;
judging the node sequence of the free child node pointed to by the current node;
if the free child node pointed to by the current node is judged to have a sibling node that is not in the list temporarily storing the set of sampled nodes, embedding the sibling node into that list, wherein sibling nodes are nodes in the same layer connected by an edge;
if the free child node pointed to by the current node is judged to have no sibling node, or all its sibling nodes are already in the list temporarily storing the set of sampled nodes, pointing the current node back to the head node and embedding the head node into that list;
and outputting the child node walk sequence when the length of the list temporarily storing the set of sampled nodes equals the product of the total length of the node walk sequence in one walk and the length proportion for the child node layer.
Further, the step of performing random walk processing on the grandchild node layer with the weight value according to the grandchild node layer range to obtain a grandchild node walk sequence specifically includes:
acquiring a child node walk sequence, selecting the last node in the list as a root node, and selecting the penultimate node in the list as the root node if the last node is the starting node;
defining a root node as a head node of a grandchild node layer;
acquiring a node sequence of a head node of a grandchild node layer pointed by a current node and defining the node sequence as a grandchild node layer range;
selecting a free grandchild node within the grandchild node layer range according to the node weight values, wherein the free grandchild node is not in the list temporarily storing the set of sampled nodes;
acquiring and judging the node sequence of the free grandchild node pointed to by the current node;
if the free grandchild node pointed to by the current node is judged to have a sibling node that is not in the list temporarily storing the set of sampled nodes, embedding the sibling node into that list;
if the free grandchild node pointed to by the current node is judged to have no sibling node, or all its sibling nodes are already in the list temporarily storing the set of sampled nodes, pointing the current node to the head node of the grandchild node layer and embedding it into that list;
and outputting the grandchild node walk sequence when the length of the list temporarily storing the set of sampled nodes equals the sum of the product of the total walk length in one walk and the child node layer proportion and the product of the total walk length in one walk and the grandchild node layer proportion.
Further, the step of inputting the obtained node walk sequence into a word2vec model to perform vectorization characterization training to obtain network embedded vector characterization corresponding to all the walk nodes specifically includes:
inputting the node walk sequence into a word2vec model for coding processing and constructing a parameter matrix, wherein the parameter matrix comprises a central word matrix and surrounding word matrices;
multiplying the encoded node walk sequence with a central word matrix to obtain a central word vector;
multiplying the coded node walk sequence with a surrounding word matrix to obtain a surrounding word vector;
and combining the central word vector and the surrounding word vectors, and carrying out normalization processing to obtain network embedded vector characterization corresponding to all the wandering nodes.
The second technical scheme adopted by the invention is as follows: an embedded vector representation system based on a hierarchical random walk sampling strategy, comprising:
the selecting module is used for setting the node parameters of the network structure and randomly selecting nodes in the network structure to obtain initial nodes;
the dividing module is used for carrying out neighborhood division processing on the initial node to obtain a node layer;
the wander module is used for carrying out random wander processing on the node layer according to the node selection rule to obtain a node wander sequence;
the training module is used for inputting the obtained node walk sequence into a word2vec model to perform vectorization characterization training, and obtaining network embedded vector characterizations corresponding to all the walk nodes.
The method and the system have the following beneficial effects. The invention first inputs the relevant network structure parameters; after a current node is selected as the starting node in the network according to the preset starting node selection conditions, neighborhood division is performed and node selection rules are set, so that neighbor node information is fully considered and the local information of each node in the network is fully represented. The walk then proceeds in turn within the child node layer range and the grandchild node layer range to obtain walk sequences, with the weight relations among the nodes in the network taken into account during the walk. Finally, a fixed number of node walk sequences taking each node in the network as the starting node is obtained, realizing random walk sampling based on hierarchical priority and yielding a network embedded vector representation that can be input to downstream machine learning tasks. The hierarchical priority method also embodies the consideration of node weights: priority corresponds to the importance degree of the nodes.
Drawings
FIG. 1 is a flow chart of steps of an embedded vector representation method of the present invention based on a hierarchical random walk sampling strategy;
FIG. 2 is a block diagram of an embedded vector representation system based on a hierarchical random walk sampling strategy in accordance with the present invention;
FIG. 3 is a flowchart outlining the overall process of implementing the embedded vector representation learning method of the present invention;
FIG. 4 is a flowchart of the random walk process performed within the child node layer according to the present invention;
FIG. 5 is a flowchart of the random walk process performed within the grandchild node layer according to the present invention;
FIG. 6 is a flowchart illustrating steps of performing a neighborhood division process on an originating node according to the present invention.
Detailed Description
The invention will now be described in further detail with reference to the drawings and to specific examples. The step numbers in the following embodiments are set for convenience of illustration only, and the order between the steps is not limited in any way, and the execution order of the steps in the embodiments may be adaptively adjusted according to the understanding of those skilled in the art.
First, the symbols appearing in the algorithm are explained as follows:
  • walk_all — the collection list storing all obtained walk sequences;
  • node_choice — the node selection rule;
  • num_walk — the upper limit on the number of walk sequences taking any node in the network as the starting node;
  • w — the total length of the walk sequence in one walk;
  • s — the proportion of the walk sequence walked in the child node layer;
  • g — the proportion of the walk sequence walked in the grandchild node layer;
  • head — the head node;
  • head2 — the starting node of the grandchild node layer walk;
  • p — the current node;
  • walk — the list storing the walk sequence of one walk;
  • have_walked — the list temporarily storing the set of sampled nodes;
  • son — the child node layer walk range;
  • gson — the grandchild node layer walk range.
Referring to fig. 1, the present invention provides an embedded vector representation method based on a hierarchical random walk sampling strategy, the method comprising the steps of:
s1, inputting parameters;
Specifically, the input parameters include num_walk, w, s, and g. The total length of the walk sequence generated with each node as the starting node is set to w; the proportion of the walk sequence length generated during the child node layer walk to the preset total walk length is set to s; the proportion generated during the grandchild node layer walk is set to g; the sum of s and g is 1, and the walk length of either layer is greater than or equal to 1.
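A minimal sketch of this parameter setup, using the variable names from the symbol table above; the concrete values here are illustrative assumptions, not values specified by the patent:

```python
# Illustrative parameter setup for the hierarchical walk (step S1).
num_walk = 10     # upper limit of walk sequences per starting node
w = 20            # total length of the walk sequence in one walk
s = 0.6           # proportion of the walk spent in the child node layer
g = 0.4           # proportion spent in the grandchild node layer (s + g = 1)

# constraints stated in step S1: proportions sum to 1, each layer walks >= 1 step
assert abs(s + g - 1.0) < 1e-9
assert s * w >= 1 and g * w >= 1
```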
S2, selecting nodes;
Specifically, a node is randomly selected from the network as the starting node, where the node satisfies the condition that, in the currently obtained walk sequence set, the number of walk sequences taking the node as the starting node is smaller than num_walk; if no node satisfying the condition exists, the walk process ends.
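The selection condition of step S2 can be sketched as follows; the helper name `choose_start_node` and the convention that each sequence in walk_all begins with its starting node are assumptions for illustration:

```python
import random
from collections import Counter

def choose_start_node(nodes, walk_all, num_walk):
    """Randomly pick a node that still has fewer than num_walk walk
    sequences starting from it; return None when every node is exhausted,
    which ends the walk process (step S2)."""
    counts = Counter(seq[0] for seq in walk_all)  # sequences per starting node
    candidates = [n for n in nodes if counts[n] < num_walk]
    return random.choice(candidates) if candidates else None
```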
S3, node neighborhood division;
Specifically, referring to fig. 6, the neighborhood of the starting node is divided. If the current node has a self-loop, the self-loop is removed (a self-loop is judged by an edge connecting a node to itself). If the current node is an isolated node (judged by having no incident edges), a walk sequence of length 1 with the node as its head is generated and returned directly, without subsequent processing;
the direct neighbor nodes of the starting node are taken as the child node layer, and the indirect neighbor nodes of the starting node are taken as the grandchild node layer; the two layers are defined as the neighborhood of the starting node, which limits the range of the walk. Meanwhile, the edges between all nodes in the two layers are retained, and any two nodes in the same layer connected by an edge are regarded as sibling nodes;
in a network structure, the information carried by any node and its adjacent nodes is important; fully considering this neighbor information allows the local information of each node in the network to be fully reflected.
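The neighborhood division of step S3 can be sketched over an adjacency-set representation; the function name and the undirected simple-graph assumption are illustrative, not from the patent:

```python
def divide_neighborhood(adj, start):
    """Split the starting node's neighborhood into a child layer (direct
    neighbors) and a grandchild layer (neighbors of neighbors), dropping
    any self-loop on the starting node, as step S3 describes.
    adj maps node -> set of neighbor nodes."""
    children = adj[start] - {start}            # direct neighbors, self-loop removed
    grandchildren = set()
    for c in children:
        grandchildren |= adj[c]
    grandchildren -= children | {start}        # keep indirect neighbors only
    return children, grandchildren
```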
S4, a starting node selection rule;
Specifically, the node_choice node selection rule applies a node weighting algorithm to the obtained child node layer and grandchild node layer, so that every node in the two layers is associated with its importance degree in the network. When nodes are subsequently sampled, the node weights are taken into account: each time a node is sampled, the probability of selecting it is proportional to its importance degree, so the larger a node's weight, the larger its probability of being selected;
Further, the node weighting algorithm is the PageRank algorithm, and the calculation process is as follows:
an initial value PR = 1/N is assigned to every node in the network, where N is the total number of nodes in the network, and the following formula is iterated:

PR(A) = PR(T1)/L(T1) + PR(T2)/L(T2) + ... + PR(Tn)/L(Tn)

in the above formula, PR(A) represents the PR value of the current node A; T1, ..., Tn are the other nodes connected to A by an edge; PR(Ti) represents the PR value of node Ti, and L(Ti) represents the number of edges of node Ti; the formula is applied once per iteration;
the iteration continues until the PR values of all nodes no longer change, and the PR value is taken as the weight of the node: the larger the PR value, the larger the weight of the node in the network, and the smaller the PR value, the smaller the weight.
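A minimal power-iteration sketch of the weighting in step S4. Note the damping factor d (commonly 0.85) is the standard PageRank addition that guarantees convergence; it is an assumption here, since the description's formula omits it:

```python
def pagerank(adj, iters=50, d=0.85):
    """Iterate PR values as in step S4: each node's score is built from
    PR(Ti)/L(Ti) over its neighbors Ti. adj maps node -> set of neighbors
    (undirected graph assumed); returns the node -> PR weight mapping."""
    n = len(adj)
    pr = {v: 1.0 / n for v in adj}              # initial values PR = 1/N
    for _ in range(iters):
        new = {}
        for v in adj:
            incoming = sum(pr[t] / len(adj[t]) for t in adj[v] if adj[t])
            new[v] = (1 - d) / n + d * incoming  # damped update (assumption)
        pr = new
    return pr
```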
S5, walking within the child node layer range son, taking node weights into account during the walk;
Specifically, referring to fig. 4, the starting node is defined as the head node head, and the current node p points to head; the walk sequence walk is initialized to empty, the set of already sampled nodes have_walked is initialized to empty, and the child node layer walk range son is initialized to all nodes of the child node layer. During the loop, if walk is not empty and its last node is head, that last node is deleted from walk before the node pointed to by p is put into walk and have_walked;
first, a node not yet in have_walked is randomly selected from son (by the node_choice rule), p is pointed to it, and the node pointed to by p is put into the walk sequence walk and into have_walked. It is then checked whether the current node has sibling nodes in son: if one or more sibling nodes exist and at least one of them is not in have_walked, a node is selected from those sibling nodes (by the node_choice rule), the current node p is pointed to it, and it is put into walk and have_walked; otherwise, the current node p is pointed back to the head node head, which is put into walk and have_walked. The next node is repeatedly selected for walking according to this rule; each time a node is put into the walk sequence walk, the walk length is checked against s·w. When the walk length equals s·w, the child node layer walk ends; if all nodes in son are already in have_walked while the walk length is still smaller than s·w, the operations are repeated starting from the re-initialization of have_walked, and so on.
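The weighted selection used at each step of the child-layer walk above can be sketched as follows; this shows only the node_choice sampling, not the full loop, and the helper name is an assumption:

```python
import random

def pick_free_node(candidates, have_walked, weights):
    """node_choice rule (steps S4/S5): among the candidates not yet in
    have_walked, sample one with probability proportional to its weight
    (e.g. its PR value). Returns None when all candidates were sampled,
    signalling that the caller should fall back to the head node."""
    free = [c for c in candidates if c not in have_walked]
    if not free:
        return None
    return random.choices(free, weights=[weights[c] for c in free], k=1)[0]
```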
S6, walking within the grandchild node layer range gson, taking node weights into account during the walk;
Specifically, referring to fig. 5, the last node of the walk sequence walk is obtained; if the last node in walk is the starting node, the second-to-last node is obtained instead. The obtained node serves as the root and is stored as the head node head2 of the grandchild node layer, and p points to head2. The intersection of the neighbor nodes of head2 and the grandchild node layer is initialized as the grandchild node layer walk range gson; if the last node in walk is not the node pointed to by p, p is put into walk, and have_walked is re-initialized to empty;
a node not in have_walked is selected from gson (by the node_choice rule), p is pointed to it, and the node pointed to by p is put into walk and have_walked. It is then checked whether the current node has sibling nodes in gson: if one or more sibling nodes exist and at least one of them is not in have_walked, a node not in have_walked is selected from those sibling nodes (by the node_choice rule), p is pointed to it, and it is put into walk and have_walked;
otherwise, p is pointed to the head node head2, and the node pointed to by p is put into walk and have_walked. The rule is repeated for p to select the next node to walk, and each time a node is put into the walk sequence, it is checked whether the walk length has reached s·w + g·w. When the walk length equals s·w + g·w, the grandchild node layer walk ends; if all nodes in gson (the intersection of the neighbors of head2 and the grandchild node layer) are already in have_walked and the walk length has not reached the stop condition, head is added into walk; at this time, the current node p is pointed to head, and the following operations are performed:
a node is randomly obtained from the child node layer (by the node_choice rule) and defined as head2; it is put into the walk sequence, p points to head2, and the intersection of the neighbors of head2 and the grandchild node layer is taken as the grandchild node layer walk range gson. The above operations are repeated (starting from the re-initialization of have_walked); if the walk length still has not reached the stop condition after the current round completes, head is put into the walk sequence and the operations are repeated until the stop condition is reached.
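The root-selection rule that opens step S6 can be sketched in isolation; the function name is an assumption for illustration:

```python
def grandchild_root(walk, start):
    """Pick the root for the grandchild-layer walk (head2 in step S6):
    the last node of the child-layer walk sequence, or the second-to-last
    node when the last one happens to be the starting node."""
    return walk[-2] if walk[-1] == start else walk[-1]
```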
S7, acquiring a network embedded vector.
Specifically, after the above method has been applied to all nodes in the network the predefined number of times, the same number of random walk sequences is obtained for each node (walk_all). The walk sequences are put into a word2vec model for training to obtain the network embedded vector representation. The training algorithm of the word2vec model is the skip-gram algorithm, whose function is to predict the context of a word given the word wt. The training process inputs the one-hot code of wt into the input layer and constructs the parameter matrices: multiplying the one-hot code by the center word matrix W yields a 1×N-dimensional center word vector; multiplying the center word vector by the surrounding word matrix U yields the final un-normalized output value vector y = U·tanh(W·x + p) + q, where p and q represent the bias vectors of the hidden layer and the output layer. Softmax normalization is then applied to y, back-propagation is performed, and the W and U matrices are updated; the larger the normalized probability, the larger the weight of the word. The loss function is finally minimized. In this correspondence, a node is equivalent to a word, a random walk sequence is equivalent to a sentence, and the co-occurrence of nodes within a walk sequence is equivalent to the context between words, with the correlation between nodes reflected in the weight parameters.
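A minimal forward pass of the scoring formula in step S7, y = U·tanh(W·x + p) + q followed by softmax. The matrix shapes (W: N×V, U: V×N for vocabulary size V and embedding size N) and the list-of-rows representation are illustrative assumptions:

```python
import math

def skipgram_forward(x, W, U, p, q):
    """One skip-gram forward pass as in step S7: hidden (center word)
    vector h = tanh(W·x + p); un-normalized scores y = U·h + q;
    then softmax normalization over the scores."""
    h = [math.tanh(sum(wr[j] * x[j] for j in range(len(x))) + p[i])
         for i, wr in enumerate(W)]                  # 1×N center word vector
    y = [sum(ur[i] * h[i] for i in range(len(h))) + q[k]
         for k, ur in enumerate(U)]                  # un-normalized output y
    m = max(y)
    e = [math.exp(v - m) for v in y]                 # numerically stable softmax
    z = sum(e)
    return [v / z for v in e]
```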
In summary, referring to fig. 3, the invention proposes an embedded vector representation learning method implemented based on a hierarchical-priority random walk sampling strategy. Some key nodes may exist in a network that are associated with most of the other nodes; the existence of such nodes can be effectively represented through hierarchical traversal. Meanwhile, the hierarchical priority method also embodies the consideration of node weights, with priority corresponding to the importance degree of the nodes. Each node in the network is used as a starting node to generate a specified number of walk sequences according to fixed rules: the relevant parameters are input first; after the current node is selected in the network, its neighborhood is divided and the node selection rules are set; walk sequences are then obtained in turn within the child node layer range and the grandchild node layer range, with the weight relations of the nodes in the network considered during the walk. Finally, a fixed number of walk sequences taking each node in the network as the starting node is obtained and put into a machine learning model for training, yielding a network embedded vector representation that can be input to downstream machine learning tasks.
Referring to fig. 2, an embedded vector representation system based on a hierarchical random walk sampling strategy, comprising:
the selecting module is used for setting the node parameters of the network structure and randomly selecting nodes in the network structure to obtain initial nodes;
the dividing module is used for carrying out neighborhood division processing on the initial node to obtain a node layer;
the walk module is used for carrying out random walk processing on the node layer according to the node selection rule to obtain a node walk sequence;
the training module is used for inputting the obtained node walk sequence into a word2vec model to perform vectorization characterization training, and obtaining network embedded vector characterizations corresponding to all the walk nodes.
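The four modules above can be sketched as a pipeline in code (class names, internals, and the stubbed training step are illustrative assumptions, not the patent's implementation):

```python
import random

# Illustrative wiring of the selecting, dividing, walk, and training modules.
class SelectModule:
    def __init__(self, adj, max_walks_per_node=2):
        self.adj, self.max_walks = adj, max_walks_per_node
    def pick_start(self, counts, rng):
        # an initial node is eligible while it has fewer walks than the upper limit
        eligible = [n for n in self.adj if counts.get(n, 0) < self.max_walks]
        return rng.choice(eligible) if eligible else None

class DivideModule:
    def layers(self, adj, start):
        child = set(adj[start]) - {start}              # direct neighbors, self-loop removed
        grand = (set().union(*(adj[c].keys() for c in child)) - child - {start}
                 if child else set())                  # indirect neighbors
        return child, grand

class WalkModule:
    def walk(self, start, child, grand, rng, length=4):
        seq, pool = [start], list(child) + list(grand)
        while pool and len(seq) < length:              # unweighted stand-in rule
            cur = rng.choice(pool); seq.append(cur); pool.remove(cur)
        return seq

class TrainModule:
    def train(self, walks):
        # placeholder for word2vec training on the walk corpus
        vocab = {n for w in walks for n in w}
        return {n: [0.0] for n in vocab}               # stub embeddings

adj = {0: {1: 1.0, 2: 1.0}, 1: {0: 1.0, 3: 1.0}, 2: {0: 1.0}, 3: {1: 1.0}}
rng = random.Random(0)
sel, div, wlk, trn = SelectModule(adj), DivideModule(), WalkModule(), TrainModule()
counts, walks = {}, []
while (s := sel.pick_start(counts, rng)) is not None:
    child, grand = div.layers(adj, s)
    walks.append(wlk.walk(s, child, grand, rng))
    counts[s] = counts.get(s, 0) + 1
emb = trn.train(walks)
assert set(emb) <= set(adj)
```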
In summary, on a social network platform the invention can treat each user as a node, with all users forming a network and the friendship between users forming the connecting edges between nodes, and then predict whether a connecting edge is likely to exist between two nodes that are not yet connected. For two users on a social platform, whether they are likely to be friends can be judged through their common friends; on this basis, relationships other than friendship, such as followed bloggers, and attributes such as favorite sports can be added to the network, so as to achieve accurate recommendation for users.
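The common-friends judgment described above can be illustrated with a minimal sketch (the user names and the common-neighbor scoring rule are assumptions for illustration, not the patent's prediction method):

```python
# Toy friendship network: each user maps to the set of their friends.
friends = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "carol", "dave"},
    "carol": {"alice", "bob"},
    "dave": {"bob"},
}

def common_friend_score(u, v):
    """Number of mutual friends between two users."""
    return len(friends[u] & friends[v])

# alice and dave are not yet friends, but share bob as a common friend,
# so an edge (friendship) between them may be predicted.
assert "dave" not in friends["alice"]
score = common_friend_score("alice", "dave")
assert score == 1
```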
The content of the method embodiment is applicable to the system embodiment; the functions specifically realized by the system embodiment are the same as those of the method embodiment, and the beneficial effects achieved are likewise the same as those of the method embodiment.
While preferred embodiments of the present invention have been described in detail, the invention is not limited to these embodiments; various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the invention, and such modifications and substitutions are intended to fall within the scope of the invention as defined by the appended claims.

Claims (9)

1. The embedded vector representation method based on the hierarchical random walk sampling strategy is characterized by comprising the following steps of:
setting node parameters of a network structure, and randomly selecting nodes in the network structure to obtain an initial node;
carrying out neighborhood division processing on the initial node to obtain a node layer;
according to the node selection rule, carrying out random walk processing on the node layer to obtain a node walk sequence;
and inputting the obtained node walk sequence into a word2vec model for vectorization characterization training to obtain network embedded vector characterization corresponding to all the walk nodes.
2. The embedded vector representation method based on the hierarchical random walk sampling strategy according to claim 1, wherein the step of setting node parameters of the network structure and randomly selecting nodes in the network structure to obtain an initial node specifically comprises the steps of:
setting network structure node parameters, wherein the network structure node parameters comprise an upper limit on the number of walk sequences per initial node, the total length of the node walk sequence in one walk, the total walk length of the node walk sequence within the child node layer, and the total walk length of the node walk sequence within the grandchild node layer;
and randomly selecting a node in the network structure according to an initial node selection condition, wherein the initial node selection condition is that, in the currently acquired set of walk sequences, the number of walk sequences taking the node as their initial node must be smaller than the upper limit on the number of walk sequences per initial node.
3. The embedded vector representation method based on the hierarchical random walk sampling strategy according to claim 2, wherein the step of performing neighborhood division processing on the starting node to obtain a node layer specifically comprises the following steps:
preprocessing the initial node to obtain a preprocessed initial node;
connecting the preprocessed direct neighbor nodes of the initial node, dividing the range of the sub-node layer, and generating the sub-node layer;
connecting the indirect neighbor nodes of the preprocessed starting node, dividing the range of the grandchild node layer, and generating the grandchild node layer;
and integrating the child node layer and the grandchild node layer to construct a node layer.
4. The embedded vector representation method based on the hierarchical random walk sampling strategy according to claim 3, wherein the step of preprocessing the start node to obtain a preprocessed start node specifically comprises the following steps:
judging the attribute of the initial node;
judging that a self-loop exists at the initial node, and removing the self-loop;
judging that the initial node is an isolated node, and retaining the initial node;
and integrating the initial node after self-loop removal and the retained isolated node to obtain the preprocessed initial node.
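The preprocessing of claim 4, removing self-loops while retaining isolated nodes, can be sketched as follows (the adjacency-dict representation is an assumption for illustration):

```python
# Sketch of initial-node preprocessing: drop self-loops, keep isolated nodes.
def preprocess(adj):
    cleaned = {}
    for node, nbrs in adj.items():
        # remove any self-loop edge; isolated nodes keep an empty neighbor dict
        cleaned[node] = {n: w for n, w in nbrs.items() if n != node}
    return cleaned

adj = {0: {0: 1.0, 1: 2.0}, 1: {0: 2.0}, 2: {}}   # node 0 has a self-loop; node 2 is isolated
clean = preprocess(adj)
assert 0 not in clean[0]                # self-loop removed
assert clean[0] == {1: 2.0}
assert 2 in clean and clean[2] == {}    # isolated node retained
```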
5. The embedded vector representation method based on the hierarchical random walk sampling strategy according to claim 4, wherein the step of performing random walk processing on the node layer according to the node selection rule to obtain the node walk sequence specifically comprises the following steps:
adding weight to the node layer according to the node selection rule to obtain a node layer with a weight value;
the node layer with the weight value comprises a child node layer with the weight value and a grandchild node layer with the weight value;
carrying out random walk processing on the sub-node layers with the weight values according to the range of the sub-node layers to obtain a sub-node walk sequence;
carrying out random walk processing on the grandchild node layer with the weight value according to the grandchild node layer range to obtain a grandchild node walk sequence;
integrating the sub-node walk sequence and the grandchild node walk sequence to obtain the node walk sequence.
6. The embedded vector representation method based on the hierarchical random walk sampling strategy according to claim 5, wherein the step of performing random walk processing on the sub-node layer with the weight value according to the sub-node layer range to obtain the sub-node walk sequence specifically comprises the following steps:
initializing the sub-node layer, wherein the initializing process comprises defining the initial node as the head node and the current node, and initializing a list for storing the walk sequence of one walk and a list for temporarily storing the set of already sampled nodes to be empty;
selecting a free child node within the sub-node layer range according to the weight value of the node, and embedding the node sequence in which the current node points to the free child node into the list for storing the walk sequence of one walk and the list for temporarily storing the set of already sampled nodes;
judging the node sequence of the free child node pointed to by the current node;
judging that the node sequence of the free child node pointed to by the current node has a brother node which is not in the list for temporarily storing the set of already sampled nodes, and embedding the brother node into the list for temporarily storing the set of already sampled nodes, wherein a brother node is a node joined to an adjacent node by a connecting edge;
judging that the node sequence of the free child node pointed to by the current node has no brother node absent from the list for temporarily storing the set of already sampled nodes, and embedding the node sequence in which the current node points to the head node into the list for temporarily storing the set of already sampled nodes;
and outputting the sub-node walk sequence when the length of the list for temporarily storing the set of already sampled nodes equals the product of the total length of the node walk sequence in one walk and the total walk length of the node walk sequence within the sub-node layer.
7. The embedded vector representation method based on the hierarchical random walk sampling strategy according to claim 6, wherein the step of performing random walk processing on a grandchild node layer with a weight value according to a grandchild node layer range to obtain a grandchild node walk sequence specifically comprises the following steps:
acquiring a child node walk sequence, selecting the last node in the list as a root node, and selecting the penultimate node in the list as the root node if the last node is the starting node;
defining a root node as a head node of a grandchild node layer;
acquiring a node sequence of a head node of a grandchild node layer pointed by a current node and defining the node sequence as a grandchild node layer range;
selecting a free grandchild node in the grandchild node layer range according to the weight value of the node, wherein the free grandchild node is not in a list for temporarily storing the sampled set of nodes;
acquiring the node sequence of the free grandchild node pointed to by the current node and judging it;
judging that the node sequence of the free grandchild node pointed to by the current node has a brother node which is not in the list for temporarily storing the set of already sampled nodes, and embedding the brother node into the list for temporarily storing the set of already sampled nodes;
judging that the node sequence of the free grandchild node pointed to by the current node has no brother node absent from the list for temporarily storing the set of already sampled nodes, and embedding the node sequence in which the current node points to the head node of the grandchild node layer into the list for temporarily storing the set of already sampled nodes;
and outputting the grandchild node walk sequence when the length of the list for temporarily storing the set of already sampled nodes equals the sum of the product of the total length of the node walk sequence in one walk and the total walk length within the child node layer and the product of the total length of the node walk sequence in one walk and the total walk length within the grandchild node layer.
8. The embedded vector representation method based on the hierarchical random walk sampling strategy according to claim 7, wherein the step of inputting the obtained node walk sequence into a word2vec model to perform vectorization characterization training to obtain network embedded vector characterization corresponding to all the walk nodes specifically comprises the following steps:
inputting the node walk sequence into a word2vec model for coding processing and constructing a parameter matrix, wherein the parameter matrix comprises a central word matrix and surrounding word matrices;
multiplying the encoded node walk sequence with a central word matrix to obtain a central word vector;
multiplying the coded node walk sequence with a surrounding word matrix to obtain a surrounding word vector;
and combining the central word vector and the surrounding word vectors, and carrying out normalization processing to obtain the network embedded vector characterization corresponding to all the walk nodes.
9. An embedded vector representation system based on a hierarchical random walk sampling strategy is characterized by comprising the following modules:
the selecting module is used for setting the node parameters of the network structure and randomly selecting nodes in the network structure to obtain initial nodes;
the dividing module is used for carrying out neighborhood division processing on the initial node to obtain a node layer;
the walk module is used for carrying out random walk processing on the node layer according to the node selection rule to obtain a node walk sequence;
the training module is used for inputting the obtained node walk sequence into a word2vec model to perform vectorization characterization training, and obtaining network embedded vector characterizations corresponding to all the walk nodes.
CN202211423375.XA 2022-11-15 2022-11-15 Embedded vector representation method and system based on hierarchical random walk sampling strategy Active CN116090525B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211423375.XA CN116090525B (en) 2022-11-15 2022-11-15 Embedded vector representation method and system based on hierarchical random walk sampling strategy


Publications (2)

Publication Number Publication Date
CN116090525A true CN116090525A (en) 2023-05-09
CN116090525B CN116090525B (en) 2024-02-13

Family

ID=86201382

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211423375.XA Active CN116090525B (en) 2022-11-15 2022-11-15 Embedded vector representation method and system based on hierarchical random walk sampling strategy

Country Status (1)

Country Link
CN (1) CN116090525B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105741175A (en) * 2016-01-27 2016-07-06 电子科技大学 Method for linking accounts in OSNs (On-line Social Networks)
CN111222053A (en) * 2019-11-27 2020-06-02 腾讯音乐娱乐科技(深圳)有限公司 Object recommendation method and device and related equipment
WO2020261234A1 (en) * 2019-06-28 2020-12-30 Tata Consultancy Services Limited System and method for sequence labeling using hierarchical capsule based neural network
CN114255050A (en) * 2021-12-21 2022-03-29 上海淇玥信息技术有限公司 Method and device for identifying service abnormal user and electronic equipment
CN114580130A (en) * 2022-04-28 2022-06-03 北京建筑大学 Link prediction method and device based on adjacent information entropy and random walk
CN114707066A (en) * 2022-04-01 2022-07-05 福州大学 Scenic spot recommendation method and system based on community perception and self-adaptive random walk
WO2022179384A1 (en) * 2021-02-26 2022-09-01 山东英信计算机技术有限公司 Social group division method and division system, and related apparatuses


Also Published As

Publication number Publication date
CN116090525B (en) 2024-02-13

Similar Documents

Publication Publication Date Title
CN113360915B (en) Intelligent contract multi-vulnerability detection method and system based on source code diagram representation learning
CN112380319B (en) Model training method and related device
CN106570128A (en) Mining algorithm based on association rule analysis
CN112487807A (en) Text relation extraction method based on expansion gate convolution neural network
CN109447261B (en) Network representation learning method based on multi-order proximity similarity
CN113761893B (en) Relation extraction method based on mode pre-training
CN107579816B (en) Method for generating password dictionary based on recurrent neural network
CN106294739A (en) A kind of based on k2tree and the large-scale graph data processing method of multivalued decision diagram
CN112651436A (en) Optimization method and device based on uncertain weight graph convolution neural network
CN114063992B (en) Modeling method and system of low-code development platform
Adenis et al. State splitting and state merging in probabilistic finite state automata
CN113705099A (en) Social platform rumor detection model construction method and detection method based on contrast learning
US20230289618A1 (en) Performing knowledge graph embedding using a prediction model
CN117763363A (en) Cross-network academic community resource recommendation method based on knowledge graph and prompt learning
CN114254108B (en) Method, system and medium for generating Chinese text countermeasure sample
CN116090525B (en) Embedded vector representation method and system based on hierarchical random walk sampling strategy
CN113076319B (en) Dynamic database filling method based on outlier detection technology and bitmap index
CN107766076B (en) Software module clustering method for probability selection
CN110110137A (en) Method and device for determining music characteristics, electronic equipment and storage medium
CN115984025A (en) Influence propagation estimation method and system based on deep learning graph network model
CN116245162A (en) Neural network pruning method and system based on improved adaptive genetic algorithm
Xu Deep mining method for high-dimensional big data based on association rule
CN114780852A (en) Sequence recommendation algorithm based on bidirectional coding and state multiplexing
CN112613325A (en) Password semantic structuralization realization method based on deep learning
US7680871B2 (en) Approximating function properties with expander graphs

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant