CN116662565A - Heterogeneous information network keyword generation method based on contrast learning pre-training - Google Patents

Heterogeneous information network keyword generation method based on contrast learning pre-training Download PDF

Info

Publication number
CN116662565A
Authority
CN
China
Prior art keywords
text
node
representation
encoder
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310587606.9A
Other languages
Chinese (zh)
Inventor
曾维新
赵翔
吴丹
王宇恒
方阳
谭真
肖卫东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202310587606.9A priority Critical patent/CN116662565A/en
Publication of CN116662565A publication Critical patent/CN116662565A/en
Pending legal-status Critical Current

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 - Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 - Ontology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 - Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/0455 - Auto-encoder networks; Encoder-decoder networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/0895 - Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Animal Behavior & Ethology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a heterogeneous information network keyword generation method based on contrast learning pre-training, which comprises the following steps: encoding text into a low-dimensional vector with a text encoder to generate a text representation; encoding the structural features, heterogeneous features and self-supervision information of the heterogeneous information network with a graph encoder to obtain a graph representation; pre-training and aligning the text representation and the graph representation through contrast learning; introducing automatically generated learnable continuous prompt vectors, providing the resulting natural language prompts to the text encoder, comparing them with the structural and heterogeneous feature representations generated by the graph encoder to produce the weights used in classification, and fusing to obtain a single representation; and generating keywords for the heterogeneous information network from the obtained single representation. The method achieves superior and significant generation performance on the keyword generation task for heterogeneous information networks.

Description

Heterogeneous information network keyword generation method based on contrast learning pre-training
Technical Field
The application relates to the technical field of knowledge graph networks in natural language processing, in particular to a heterogeneous information network keyword generation method based on contrast learning pre-training.
Background
Heterogeneous information networks are ubiquitous. Interactions between users and items in social networks, knowledge graphs, and search and recommendation systems can be modeled as networks with multiple types of nodes and edges. A text heterogeneous information network is a network whose nodes carry text information, such as the titles and abstracts of paper nodes in an academic network, which can provide rich auxiliary information for downstream tasks. Most current work on heterogeneous information networks ignores such text information and maps the nodes of the graph to low-dimensional representations based only on structural information. To fill this gap, some models for mining heterogeneous information networks propose integrating text information into node representations. They mainly design frameworks that combine the structural information of nodes with textual information to generate a single node representation.
The text network embedding models mentioned above face many limitations. First, they can only classify nodes whose labels were seen during training; in other words, they are not suited to few-shot (small-sample) learning settings. In few-shot learning, a pre-trained model must be transferred to classify nodes with unseen labels at test time. In practice, only a few labels are typically available, which poses a serious challenge to maintaining performance. Second, previous methods that use text information were originally designed for homogeneous information networks, and no effort has been made to solve the few-shot learning problem on text heterogeneous information networks.
To address the few-shot learning problem, natural language processing research (e.g., ChatGPT) has proposed prompt learning, which reformulates the downstream task to resemble a pre-training task. Prompt learning, with or without fine-tuning, makes it easy to apply prior knowledge to new tasks, thereby improving few-shot learning. Recently, prompt learning has also been adopted in multimodal settings to align image and text data. However, no prompt-learning-based technique has yet been used to process graph and text data together.
In view of the above, a heterogeneous information network keyword generation method based on contrast learning pre-training is provided, which applies prompt learning to graph data, solves the few-shot learning problem on text heterogeneous information networks, and obtains more efficient and accurate results on the heterogeneous information network keyword generation task.
Disclosure of Invention
The present application aims to solve at least one of the technical problems existing in the prior art. Therefore, the application discloses a heterogeneous information network keyword generation method based on contrast learning pre-training. The method employs a text encoder to encode the text information and a graph encoder to encode the structural features, heterogeneous features and self-supervision information; a contrastive learning mechanism aligns the text representations with the network representations, and a prompt learning framework with learnable continuous vectors addresses the few-shot problem on text heterogeneous information networks.
A heterogeneous information network keyword generation method based on contrast learning pre-training comprises the following steps:
step 1, a text encoder is adopted to encode the text into a low-dimensional vector, and a text representation is generated;
step 2, encoding structural features, heterogeneous features and self-supervision information of the heterogeneous information network by adopting a graph encoder to obtain a graph representation;
step 3, pre-training and aligning the text representation and the graph representation through contrast learning;
step 4, introducing automatically generated learnable continuous prompt vectors, providing the resulting natural language prompts to the text encoder, comparing them with the structural and heterogeneous feature representations generated by the graph encoder to produce the weights used in classification, and fusing to obtain a single representation;
and 5, performing a keyword generation task of the heterogeneous information network by using the obtained single representation.
Specifically, the text encoder uses a Sentence-BERT model to generate a fixed-size text representation.
Specifically, the step 2 specifically includes the following steps:
step 201, sampling heterogeneous subgraphs, wherein for a given node, the subgraphs around the node need to be sampled first;
step 202, capturing the structural information of the subgraph using an autoencoder, wherein, given the adjacency matrix A of the subgraph, A is first processed by the encoder to generate a multi-layer latent representation, and the decoder then reverses this process to obtain a reconstructed output Â; the autoencoder aims to minimize the reconstruction error between input and output so that nodes with similar structures have similar representations, and the loss function L_structure is calculated as:
L_structure = ||(Â - A) ⊙ B||²,
where B is a penalty applied to the non-zero elements to mitigate the sparsity problem, ⊙ denotes element-wise multiplication, and ||·||² denotes the (squared-norm) regularization operation;
step 203, exploring the heterogeneous characteristics of the heterogeneous information network: nodes of the same type are grouped together, and a Bi-LSTM is applied to each group to model the type-specific characteristics; given the node group V_{T_j} of type T_j, the representation h_v^{T_j} of node v is calculated as:
h_v^{T_j} = ( Σ_{v'∈V_{T_j}} Bi-LSTM{v}[v'] ) / |V_{T_j}|,
where Bi-LSTM{v} denotes applying the Bi-LSTM to the type group of node v, Bi-LSTM{v}[v'] is the resulting hidden state of node v', and |V_{T_j}| denotes the number of nodes in the group V_{T_j};
an attention mechanism is then applied to aggregate the groups of all types and generate the representation h_v of the given node:
α_{v,j} = exp(δ(u^T h_v^{T_j})) / Σ_{T_k∈{T}} exp(δ(u^T h_v^{T_k})),   h_v = Σ_{T_j∈{T}} α_{v,j} h_v^{T_j},
where δ denotes the activation function (LeakyReLU is used), u ∈ R^d is a weight parameter, u^T denotes the transpose of u, h_v^{T_j} is the type-T_j representation of node v, {T} denotes the set of types, and α_{v,j} denotes the attention weight;
step 204, pre-training the subgraph based on self-supervision information, introducing two pre-training tasks, a masked node modeling task and an edge reconstruction task, so as to realize graph exploration at the node level and the edge level.
Specifically, the masked node modeling task sorts the nodes according to their ranking and randomly selects a preset proportion of the nodes to be replaced with the [MASK] token; the sorted nodes are fed into a Transformer encoder, with the representations generated by the Bi-LSTM serving as token representations and the ordering information serving as position vectors; the hidden state learned by the Transformer encoder is fed into a feed-forward layer to predict the target node, expressed mathematically as:
p_v = softmax(W_MNM z_v),
where z_v is the output of the feed-forward layer FeedForward(·) applied to the hidden state, softmax(·) denotes the activation function, W_MNM ∈ R^{V_v × d} is the classification weight shared with the input node representation matrix, V_v is the number of nodes in the subgraph, d is the dimension of the hidden vector, and p_v is the predictive distribution of v over all nodes; in training, the cross entropy between the one-hot label and the prediction is used, and the loss function L_MNM is calculated as:
L_MNM = - Σ_i y_i log p_i,
where y_i and p_i are the i-th components of the one-hot label vector y and the prediction vector p, y denoting the set of labels and p the set of prediction probabilities;
the edge reconstruction task samples positive edges and negative edges in the subgraph, wherein the positive edges are edges which do exist in the original subgraph, the negative edges do not exist in the original subgraph, and the positive edges and the negative edges are given a union N S The score of the edge reconstruction is calculated by the inner product between a pair of nodes, i.eIs to calculate a score, h v Is a representation of node v, e is an inner product, h u Is a representation of the node u, employing the binary cross entropy between the predicted edge and the true edge to calculate the loss function L for edge reconstruction ER
N S Representing the number of node pairs, binaryCrossEntropy () represents the binary cross entropy, e uv Representing the actual scores of node u and node v, (u, v) representing the conjoined edges of node u and node v.
Furthermore, the subgraph around a node is sampled with a random-walk-with-restart strategy, which iteratively traverses the neighborhood of the given node v and returns to the starting node v with a certain probability; in order to sample nodes of higher importance, the random walk strategy reaches high-ranked nodes first, and in order to give the graph encoder heterogeneity, the traversal is constrained to sample all types of nodes.
In particular, the contrastive learning is used to align the text representations with the graph representations during training; the learning objective is designed as a contrastive loss function: given a batch of text-subgraph pairs, the similarity scores of matched text-subgraph pairs are maximized while the scores of unmatched text-subgraph pairs are minimized.
In the contrastive learning process, given a node v, the node representation learned by the graph encoder is denoted as H, and the weight vectors generated by the text encoder are denoted as {w_i}_{i=1}^{K}, where K represents the number of categories and each weight w_i is learned from a prompt; the prediction probability is calculated as
p(y = i | v) = exp(<w_i, H> / τ) / Σ_{j=1}^{K} exp(<w_j, H> / τ),
where τ is a learned temperature hyper-parameter, <·,·> denotes the similarity score, and <w_i, H> denotes the similarity score between the text weight vector w_i and the node representation vector H.
Specifically, the automatically generated learnable continuous prompt vectors introduced in step 4 replace discrete text words with continuous vectors learned end-to-end from the data, and the prompt P input to the text encoder is designed as
P = [V_1][V_2]…[V_M][CLASS],
where [CLASS] denotes the class label of the node, each [V_m] is a word vector with the same dimension as the word representations in the training stage, and M is a hyper-parameter specifying the number of continuous text vectors in the prompt; after the continuous prompt P is input to the text encoder Text(·), a classification weight vector representing the node concept is obtained, and the prediction probability is calculated as
p(y = i | v) = exp(<Text(P_i), H> / τ) / Σ_{j=1}^{K} exp(<Text(P_j), H> / τ),
where the class label in each prompt P_i is replaced by the word vector of the i-th class name, and Text(P_i) denotes the vector obtained after feeding the prompt P_i into the text encoder.
Preferably, a more accurate prompt vector is obtained in step 4 by using a residual connection between the text encoder and the graph encoder to exploit the context subgraph of the given node: the text representations of the category labels and the node representations in the subgraph are input to a text-subgraph self-attention layer, which helps the text features find the context nodes most relevant to the given node;
after obtaining the output D_e of the text-subgraph attention layer, the text features are updated by means of the residual connection,
Text(P) ← Text(P) + λ·D_e,
where λ is a learnable parameter controlling the extent of the residual connection.
Preferably, λ is initialized to the small value 10^-4 so that the prior linguistic knowledge in the text features can be retained to the maximum extent.
Compared with the prior art, the method has the following advantages: a prompt learning framework is provided that utilizes the text information in a text heterogeneous information network while handling the few-shot learning problem; and a graph encoder is introduced that captures the structural and heterogeneous characteristics of the heterogeneous information network while preserving the self-supervision information at the node level and edge level of the network subgraphs. Therefore, the heterogeneous information network keyword generation method based on contrast learning pre-training achieves superior and significant generation performance on the heterogeneous information network keyword generation task.
Drawings
FIG. 1 shows a schematic flow diagram of an embodiment of the present application;
FIG. 2 illustrates a schematic diagram of a pre-training framework in an embodiment of the application;
FIG. 3 illustrates a schematic diagram of a prompt learning optimization framework in accordance with an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described in further detail below with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
Preliminary knowledge:
let g= (V, E, T) represent a heterogeneous information network, where V and E represent node sets and edge sets, respectively; t (T) V And T E Representing a node type set and an edge type set, respectively. One heterogeneous information network is |T V |>1 and/or |T E |>1.
As shown in fig. 1, the method for generating heterogeneous information network keywords based on contrast learning pre-training according to the embodiment of the application includes:
step 1, a text encoder is adopted to encode the text into a low-dimensional vector, and a text representation is generated;
step 2, encoding structural features, heterogeneous features and self-supervision information of the heterogeneous information network by adopting a graph encoder to obtain a graph representation;
step 3, pre-training and aligning the text representation and the graph representation through contrast learning;
step 4, introducing automatically generated learnable continuous prompt vectors, providing the resulting natural language prompts to the text encoder, comparing them with the structural and heterogeneous feature representations generated by the graph encoder to produce the weights used in classification, and fusing to obtain a single representation;
and 5, generating keywords of the heterogeneous information network by using the obtained single representation.
The method mainly comprises a text encoder and a graph encoder, which encode the text and the network subgraph, respectively, into low-dimensional vectors. In an embodiment, a Sentence-BERT model is used as the text encoder to generate the text representation; for the graph encoder, the subgraph to be processed is first sampled, with all node types forced to be sampled to ensure heterogeneity, then an autoencoder mechanism is applied to explore the structural features and a Bi-LSTM is applied to the nodes grouped by type to characterize the heterogeneity of the graph.
Two graph pre-training tasks, namely masked node modeling and edge reconstruction, are introduced to utilize the self-supervision information at the node level and the edge level. After that, a contrastive learning framework is introduced to align the two representations. Specifically, given a pair consisting of a text and a subgraph, they are matched if they both belong to the same given node. The contrastive learning framework is used to maximize the similarity score of matched text-subgraph pairs and minimize the similarity score of unmatched text-subgraph pairs.
The pre-trained model described above needs to be transferred to downstream tasks to accommodate the few-shot setting. Specifically, in the optimization stage, for each new classification task, the classification weights can be generated by providing natural language sentences describing the classes of interest to the text encoder and comparing them with the structural and heterogeneous feature representations generated by the graph encoder. How should the prompts, which are crucial for downstream tasks, be designed? Subtle changes to the words in a prompt may affect the performance of the model. In this embodiment, manual prompts such as "a paper of [CLASS] domain" are not designed; instead, automatically generated learnable continuous prompt vectors are introduced. The automatic prompt mechanism in this embodiment yields more task-relevant and efficient transfer of the pre-trained model.
The specific technical scheme is as follows.
Text encoder
The pre-training framework of this embodiment is shown in fig. 2. It consists of two encoders, namely a text encoder and a graph encoder. The text encoder maps natural language text to a low-dimensional representation vector. A fixed-size text representation is generated using a Sentence-BERT (SBERT) model.
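By way of illustration only, the following minimal sketch shows how such a fixed-size text representation could be produced; the sentence-transformers checkpoint name, the projection to 512 dimensions and the (untrained) projection layer are assumptions of this sketch, not details fixed by the present application.

```python
# Minimal sketch of the text encoder (assumed setup, not the disclosed implementation).
from sentence_transformers import SentenceTransformer
import torch

sbert = SentenceTransformer("all-MiniLM-L6-v2")   # assumed SBERT checkpoint
proj = torch.nn.Linear(sbert.get_sentence_embedding_dimension(), 512)  # map into a shared 512-d space

def encode_text(sentences):
    """Encode a list of node texts (e.g. title + abstract) into 512-d vectors."""
    emb = torch.as_tensor(sbert.encode(sentences))            # (N, d_sbert)
    return torch.nn.functional.normalize(proj(emb), dim=-1)   # (N, 512), unit length for cosine similarity

print(encode_text(["A paper about heterogeneous information networks."]).shape)  # torch.Size([1, 512])
```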
Graph encoder
The graph encoder maps the network data into a low-dimensional representation.
Heterogeneous sub-graph sampling
For a given node, the subgraphs around the node need to be sampled and then processed by the graph encoder to generate the node representation. After sampling the sub-graph, the nodes in the sub-graph will be ranked by a centrality index that evaluates the importance of the nodes.
A random-walk-with-restart sampling strategy is employed. It iteratively traverses the neighborhood of a given node v, returning to the starting node v with a certain probability. To sample the more important nodes, the walk strategy is made to reach high-ranked nodes first. To give the encoder heterogeneity, the traversal is constrained to sample all types of nodes.
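For illustration, a minimal sketch of such a restart random walk is given below; the restart probability, the sampling budget, the use of node degree as the centrality/ranking measure and the node attribute name "type" are assumptions of this sketch.

```python
# Sketch of restart random-walk sub-graph sampling over a NetworkX graph whose nodes
# carry a "type" attribute (assumed). Preference is given to uncovered node types and
# to higher-degree (more central) neighbours.
import random
import networkx as nx

def sample_subgraph(G, start, budget=32, restart_p=0.5, max_steps=1000):
    sampled, cur = {start}, start
    for _ in range(max_steps):
        if len(sampled) >= budget:
            break
        if random.random() < restart_p:
            cur = start                                  # restart at the anchor node
        nbrs = list(G.neighbors(cur))
        if not nbrs:
            cur = start
            continue
        covered = {G.nodes[s]["type"] for s in sampled}
        # visit neighbours of an uncovered type first, then the more central (higher-degree) ones
        nbrs.sort(key=lambda n: (G.nodes[n]["type"] in covered, -G.degree(n)))
        cur = nbrs[0]
        sampled.add(cur)
    return G.subgraph(sampled).copy()
```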
Structural module
An autoencoder is first employed to capture the structural information of the subgraph. Given the adjacency matrix A of the subgraph, A is first processed by the encoder to generate a multi-layer latent representation. The decoder then reverses this process to obtain a reconstructed output Â. The autoencoder aims to minimize the reconstruction error between input and output, so that nodes with similar structures have similar representations. Mathematically,
L_structure = ||(Â - A) ⊙ B||²,
where B is a penalty applied to the non-zero elements to mitigate the sparsity problem, ⊙ denotes element-wise multiplication, and ||·||² denotes the (squared-norm) regularization operation.
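A minimal PyTorch-style sketch of this structural module is shown below; the layer sizes, the two-layer encoder/decoder and the penalty weight beta are assumptions of the sketch.

```python
# Sketch of the structure-preserving auto-encoder and its reconstruction loss
# L_structure = || (A_hat - A) * B ||^2, where B penalises non-zero entries more heavily.
import torch
import torch.nn as nn

class StructureAE(nn.Module):
    def __init__(self, n_nodes, hidden=256, latent=512):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_nodes, hidden), nn.ReLU(), nn.Linear(hidden, latent))
        self.dec = nn.Sequential(nn.Linear(latent, hidden), nn.ReLU(), nn.Linear(hidden, n_nodes))

    def forward(self, A):
        H = self.enc(A)          # multi-layer latent representation of each adjacency row
        return self.dec(H), H    # reconstructed adjacency and node embeddings

def structure_loss(A, A_hat, beta=5.0):
    B = torch.where(A > 0, torch.full_like(A, beta), torch.ones_like(A))
    return torch.sum(((A_hat - A) * B) ** 2)

A = (torch.rand(8, 8) > 0.7).float()          # toy adjacency matrix
A_hat, H = StructureAE(n_nodes=8)(A)
print(structure_loss(A, A_hat))
```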
Heterogeneous module
To explore the heterogeneous features of the network, nodes of the same type are first grouped together. This operation may destroy the structure of the subgraph, but the previously employed autoencoder already preserves the structural features. A Bi-LSTM is then applied to each group to model the type-specific features. The Bi-LSTM is able to capture interactions between node features and has strong sequence representation capability. Given the node group V_{T_j} of type T_j, the representation h_v^{T_j} of node v is calculated as:
h_v^{T_j} = ( Σ_{v'∈V_{T_j}} Bi-LSTM{v}[v'] ) / |V_{T_j}|,
where Bi-LSTM{v} denotes applying the Bi-LSTM to the type group of node v, Bi-LSTM{v}[v'] is the resulting hidden state of node v', and |V_{T_j}| denotes the number of nodes in the group V_{T_j}.
An attention mechanism is then applied to aggregate the groups of all types and generate the representation h_v of the given node:
α_{v,j} = exp(δ(u^T h_v^{T_j})) / Σ_{T_k∈{T}} exp(δ(u^T h_v^{T_k})),   h_v = Σ_{T_j∈{T}} α_{v,j} h_v^{T_j},
where δ denotes the activation function (LeakyReLU is used), u ∈ R^d is a weight parameter, u^T denotes the transpose of u, h_v^{T_j} is the type-T_j representation of node v, {T} denotes the set of types, and α_{v,j} denotes the attention weight.
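The sketch below illustrates this heterogeneity module in PyTorch; the hidden sizes, the mean over the Bi-LSTM outputs as the per-type representation and the single attention vector u are assumptions of the sketch.

```python
# Sketch of the heterogeneity module: a Bi-LSTM per type group, followed by attention
# over the per-type representations h_v^{T_j} to produce the node representation h_v.
import torch
import torch.nn as nn

class HeteroModule(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.bilstm = nn.LSTM(dim, dim // 2, bidirectional=True, batch_first=True)
        self.u = nn.Parameter(torch.randn(dim))           # attention weight vector u in R^d
        self.act = nn.LeakyReLU()

    def forward(self, groups):
        """groups: dict {type_name: (n_type, dim) tensor of node features of that type}."""
        per_type = []
        for feats in groups.values():
            out, _ = self.bilstm(feats.unsqueeze(0))      # (1, n_type, dim)
            per_type.append(out.mean(dim=1).squeeze(0))   # mean over the group -> h_v^{T_j}
        H = torch.stack(per_type)                         # (|T|, dim)
        alpha = torch.softmax(self.act(H @ self.u), dim=0)   # attention weights alpha_{v,j}
        return (alpha.unsqueeze(-1) * H).sum(dim=0)       # h_v = sum_j alpha_{v,j} h_v^{T_j}

groups = {"paper": torch.randn(4, 512), "author": torch.randn(3, 512)}
print(HeteroModule()(groups).shape)   # torch.Size([512])
```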
Self-supervised pre-training
The subgraph is further pre-trained based on the self-supervision information. Specifically, two pre-training tasks are introduced, masked node modeling (MNM) and edge reconstruction (ER), to enable node-level and edge-level graph exploration.
For the masked node modeling task, the nodes are sorted according to their rank, and 15% of the nodes are randomly selected and replaced with the [MASK] token. The sorted nodes are fed into the Transformer encoder, where the representations generated by the Bi-LSTM serve as token representations and the ordering information serves as position vectors. The hidden state learned by the Transformer encoder is fed into the feed-forward layer to predict the target node. Mathematically,
p_v = softmax(W_MNM z_v),    (6)
where z_v is the output of the feed-forward layer FeedForward(·) applied to the hidden state, softmax(·) denotes the activation function, W_MNM ∈ R^{V_v × d} is the classification weight shared with the input node representation matrix, V_v is the number of nodes in the subgraph, d is the dimension of the hidden vector, and p_v is the predictive distribution of v over all nodes. In training, the cross entropy between the one-hot label and the prediction is used, and the loss function L_MNM is calculated as
L_MNM = - Σ_i y_i log p_i,
where y_i and p_i are the i-th components of the one-hot label vector y and the prediction vector p, y denoting the set of labels and p the set of prediction probabilities.
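A compact sketch of this masked-node-modeling objective is given below; the encoder depth, the zero-initialized mask token and the layer sizes are assumptions of the sketch, while the 15% mask ratio follows the text above.

```python
# Sketch of masked node modelling: masked Bi-LSTM node representations pass through a
# Transformer encoder and a feed-forward layer; p_v = softmax(W_MNM z_v) is trained with
# cross entropy against the identity of the masked node.
import torch
import torch.nn as nn

d, n_nodes = 512, 16
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True), num_layers=2)
feedforward = nn.Linear(d, d)
W_mnm = nn.Linear(d, n_nodes, bias=False)        # classification weights over the sub-graph nodes
mask_token = nn.Parameter(torch.zeros(d))

node_repr = torch.randn(1, n_nodes, d)           # Bi-LSTM representations, ordered by centrality rank
mask_idx = torch.randperm(n_nodes)[: int(0.15 * n_nodes)]
inp = node_repr.clone()
inp[0, mask_idx] = mask_token                    # replace 15% of the nodes with the [MASK] token

h = encoder(inp)                                 # hidden states of the Transformer encoder
z = feedforward(h[0, mask_idx])                  # z_v = FeedForward(h_v)
log_p = torch.log_softmax(W_mnm(z), dim=-1)      # log p_v
loss_mnm = nn.functional.nll_loss(log_p, mask_idx)   # cross entropy against the true node index
print(loss_mnm)
```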
The edge reconstruction task samples positive edges and negative edges in the subgraph, where positive edges are edges that actually exist in the original subgraph and negative edges do not. In practice, |N_S| can be set to 6, with the same number of positive and negative edges. Given the union N_S of the positive and negative edges, the edge reconstruction score is calculated by the inner product between a pair of nodes, i.e., <h_u, h_v>, where h_u and h_v are the representations of nodes u and v. The binary cross entropy between the predicted edges and the true edges is adopted to calculate the loss function L_ER for edge reconstruction:
L_ER = (1/|N_S|) Σ_{(u,v)∈N_S} BinaryCrossEntropy(e_uv, <h_u, h_v>),
where |N_S| denotes the number of node pairs, BinaryCrossEntropy(·) denotes the binary cross entropy, e_uv denotes the ground-truth value for node u and node v, and (u, v) denotes the edge connecting node u and node v.
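The following sketch illustrates the edge-reconstruction loss; treating the inner-product score as a logit for the binary cross entropy is an assumption of the sketch.

```python
# Sketch of edge reconstruction: inner-product scores <h_u, h_v> for positive (existing)
# and negative (non-existing) edges, trained with binary cross entropy.
import torch
import torch.nn as nn

def edge_reconstruction_loss(H, pos_edges, neg_edges):
    """H: (n, d) node embeddings; pos_edges / neg_edges: lists of (u, v) index pairs."""
    edges = pos_edges + neg_edges
    labels = torch.tensor([1.0] * len(pos_edges) + [0.0] * len(neg_edges))
    u = torch.stack([H[a] for a, _ in edges])
    v = torch.stack([H[b] for _, b in edges])
    scores = (u * v).sum(dim=-1)                  # inner product <h_u, h_v>
    return nn.functional.binary_cross_entropy_with_logits(scores, labels)

H = torch.randn(10, 512)
print(edge_reconstruction_loss(H, pos_edges=[(0, 1), (2, 3), (4, 5)], neg_edges=[(0, 7), (1, 8), (2, 9)]))
```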
Pre-training by contrast learning
This embodiment aligns the representation spaces of the text and the graph during training, with the learning objective designed as a contrastive loss function. Specifically, given a batch of text-subgraph pairs, this embodiment maximizes the similarity scores of matched text-subgraph pairs while minimizing the scores of unmatched pairs. For example, given the subgraph of a node, if the text information is the abstract of that node, the text-subgraph pair is matched; text information unrelated to the node is unmatched. The similarity score is calculated using cosine similarity.
In a contrastive learning setting, high-quality negative examples help improve model performance. Therefore, within a training batch, the texts and subgraphs are selected from nodes with the same labels, making them harder to distinguish.
Fig. 3 illustrates the prompt learning optimization framework. This embodiment can be applied to a few-shot experimental environment. When faced with samples of a new label type, the pre-trained model can predict whether the subgraph of a node matches a text description. This is accomplished by comparing the node representation generated by the graph encoder with the classification weights generated by the text encoder. A text description can be used to specify the node class of interest, even when the class has few samples. Given a node v, the node representation learned by the graph encoder is denoted as H, and the weight vectors generated by the text encoder are denoted as {w_i}_{i=1}^{K}, where K represents the number of categories. Each weight w_i is learned from a prompt, e.g., "a paper of [CLASS] domain", where the [CLASS] token may be a specific class name, such as "Information Retrieval", "Database" or "Data Mining". To accommodate downstream tasks, the prompt may also be designed as "The two nodes are [CLASS]", where [CLASS] is a binary token such as "connected" or "unconnected". Mathematically, the prediction probability can be calculated as
p(y = i | v) = exp(<w_i, H> / τ) / Σ_{j=1}^{K} exp(<w_j, H> / τ),
where τ is a learned temperature hyper-parameter, <·,·> denotes the similarity score, and <w_i, H> denotes the similarity score between the text weight vector w_i and the node representation vector H.
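A minimal sketch of the batched contrastive pre-training objective is given below; the symmetric (text-to-graph and graph-to-text) cross entropy and the temperature value are assumptions of the sketch.

```python
# Sketch of contrastive pre-training: in a batch of N matched text/sub-graph pairs,
# the i-th text should be most similar to the i-th sub-graph (matched pairs on the diagonal).
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb, graph_emb, tau=0.07):
    text_emb = F.normalize(text_emb, dim=-1)
    graph_emb = F.normalize(graph_emb, dim=-1)
    logits = text_emb @ graph_emb.t() / tau       # cosine similarity of every text/sub-graph pair
    targets = torch.arange(text_emb.size(0))      # matched pairs sit on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

print(contrastive_loss(torch.randn(8, 512), torch.randn(8, 512)))
```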
Continuous prompt
Traditional prompt learning methods adopt manual prompts designed by experts; this embodiment instead selects continuous vectors that can be learned end-to-end from the data to replace discrete text words. Specifically, the prompt P input to the text encoder is designed as
P = [V_1][V_2]…[V_M][CLASS],    (10)
where [CLASS] denotes the class label of the node, each [V_m] is a word vector with the same dimension as the word representations in the training phase, and M is a hyper-parameter specifying the number of continuous text vectors in the prompt. After the continuous prompt P is input to the text encoder Text(·), a classification weight vector representing the node concept is obtained. Mathematically, the prediction probability is calculated as
p(y = i | v) = exp(<Text(P_i), H> / τ) / Σ_{j=1}^{K} exp(<Text(P_j), H> / τ),
where the class label in each prompt P_i is replaced by the word vector of the i-th class name, and Text(P_i) denotes the vector obtained after feeding the prompt P_i into the text encoder.
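For illustration, the sketch below builds such continuous prompts and the resulting classification probabilities; the stand-in mean-pooling text encoder, the Gaussian initialization scale and the temperature are assumptions of the sketch.

```python
# Sketch of the continuous prompt P_i = [V_1]...[V_M][CLASS_i] and of
# p(y=i|v) = softmax_i(<Text(P_i), H>/tau).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContinuousPrompt(nn.Module):
    def __init__(self, class_name_emb, M=8, d=512):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(M, d) * 0.02)     # learnable vectors [V_1]...[V_M]
        self.register_buffer("cls_emb", class_name_emb)       # (K, d) word vectors of the class names

    def forward(self, text_encoder):
        K = self.cls_emb.size(0)
        prompts = torch.cat([self.ctx.unsqueeze(0).expand(K, -1, -1),
                             self.cls_emb.unsqueeze(1)], dim=1)   # (K, M+1, d)
        return text_encoder(prompts)                              # (K, d) classification weights w_i

def predict(weights, H, tau=0.07):
    sims = F.normalize(weights, dim=-1) @ F.normalize(H, dim=-1)  # <w_i, H>
    return F.softmax(sims / tau, dim=0)

toy_text_encoder = lambda prompts: prompts.mean(dim=1)   # stand-in for Text(.)
prompt = ContinuousPrompt(class_name_emb=torch.randn(5, 512))
print(predict(prompt(toy_text_encoder), H=torch.randn(512)))   # probabilities over 5 classes
```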
Residual connection
Considering the context nodes of a given node, e.g., the author nodes of a paper node, helps the text encoder become more accurate. Therefore, to further prompt the pre-trained language model, a residual connection between the text encoder and the graph encoder is employed to exploit the context subgraph of the given node. The text representations of the category labels and the node representations in the subgraph are first input to a text-subgraph self-attention layer, which helps the text features find the context nodes most relevant to the given node. After obtaining the output D_e of the text-subgraph attention layer, the text features are updated through the residual connection as follows:
Text(P) ← Text(P) + λ·D_e    (12)
where λ is a learnable parameter controlling the extent of the residual connection. λ is initialized to the small value 10^-4 so that the prior linguistic knowledge in the text features can be retained to the maximum extent.
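A minimal sketch of this residual connection is given below; modelling the text-subgraph attention with a standard multi-head cross-attention layer is an assumption of the sketch, while the 1e-4 initialization of λ follows the text above.

```python
# Sketch of the residual connection Text(P) <- Text(P) + lambda * D_e, where D_e is the
# output of a text-to-sub-graph attention layer (class text features attend over the
# context sub-graph's node representations).
import torch
import torch.nn as nn

class TextGraphResidual(nn.Module):
    def __init__(self, d=512):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
        self.lmbda = nn.Parameter(torch.tensor(1e-4))    # small init keeps the prior language knowledge

    def forward(self, text_feat, node_repr):
        """text_feat: (K, d) class weights Text(P); node_repr: (n, d) context sub-graph nodes."""
        q = text_feat.unsqueeze(0)                       # queries: class text features
        kv = node_repr.unsqueeze(0)                      # keys/values: context sub-graph nodes
        D_e, _ = self.attn(q, kv, kv)                    # most relevant context per class
        return text_feat + self.lmbda * D_e.squeeze(0)   # residual update of the text features

print(TextGraphResidual()(torch.randn(5, 512), torch.randn(20, 512)).shape)   # torch.Size([5, 512])
```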
To optimize the text vectors, training minimizes a standard cross-entropy classification loss. Gradients can be back-propagated through the text encoder Text(·) to exploit the rich knowledge encoded in its parameters. The choice of continuous text vectors also allows the word representation space to be explored fully, thereby improving the learning of task-related text.
This embodiment considers a real-world data set, namely the OAG. The OAG is an academic network with four types of nodes; titles and abstracts are selected as text, and the corresponding paper nodes are classified into five categories: (1) information retrieval, (2) database, (3) data mining, (4) machine learning, and (5) natural language processing.
The data set was divided into 80% training data set, 10% validation data set and 10% test data set. Table 1 summarizes the information of the data sets described above.
Table 1: data set statistics.
The vector dimensions of all representations are fixed at 512. For the text encoder, the vocabulary size is 49,152, and each text sequence is fixed at 77 tokens, including the [SOS] and [EOS] tokens. The text vectors in the optimization process are initialized from a zero-mean Gaussian distribution with standard deviation 0.02. The number of text tokens at training time is set to 8. Training uses stochastic gradient descent with an initial learning rate of 0.002, decayed with a cosine annealing rule. The maximum number of training epochs is set to 200. To mitigate the exploding gradients that may be encountered in early training iterations, a warm-up technique fixes the learning rate to 1e-5 during the first training epoch.
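A minimal sketch of this optimization schedule is shown below; the placeholder model and the way warm-up and cosine decay are combined in a single LambdaLR schedule are assumptions of the sketch.

```python
# Sketch of the training schedule: SGD at lr 0.002, cosine annealing over 200 epochs,
# and a first warm-up epoch fixed at 1e-5.
import math
import torch

model = torch.nn.Linear(512, 512)                 # placeholder for the learnable prompt parameters
opt = torch.optim.SGD(model.parameters(), lr=0.002)

def lr_factor(epoch, max_epochs=200, warmup_lr=1e-5, base_lr=0.002):
    if epoch == 0:
        return warmup_lr / base_lr                # warm-up: fix the learning rate to 1e-5
    return 0.5 * (1 + math.cos(math.pi * epoch / max_epochs))   # cosine annealing afterwards

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=lr_factor)

for epoch in range(200):
    # ... forward pass, cross-entropy loss and loss.backward() over the training batches go here ...
    opt.step()
    opt.zero_grad()
    sched.step()
```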
Experiments for pre-training and the downstream tasks were run on an Intel(R) Xeon(R) Platinum 8268 CPU and a Tesla V100.
This example evaluates the performance of the proposed method and the baseline model on the keyword generation task. ACC and Macro-F1 values are used as evaluation metrics (averaged over five runs).
Table 2 shows the experimental results of the keyword generation task; the highest score is shown in bold.
This embodiment is applied to a task not previously considered in graph evaluation, namely keyword generation. The method of the present application is able to generate keywords because it has autoregressive generation capability under the pre-training and fine-tuning framework. In practice, the keywords of the paper nodes in the OAG are used as ground-truth labels for training. Fine-tuning starts with the text "[CLS] [MASK]", where [CLS] is the first word of the keyword, and the next word is predicted at the [MASK] token. The process stops when [SEP] is generated. Traditional text network representation methods cannot generate keywords. Therefore, for comparison, the experiment uses KeyBERT, a model that generates keywords based on BERT representations; it takes the description of each node as input. The F1 values of the Top-1 and Top-3 predicted keywords are used as evaluation metrics.
Table 2 shows the results of the keyword generation task. The method of the present application outperforms KeyBERT because it combines text information with subgraph information, whereas KeyBERT can only make use of text information.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Claims (10)

1. The heterogeneous information network keyword generation method based on contrast learning pre-training is characterized by comprising the following steps of:
step 1, a text encoder is adopted to encode the text into a low-dimensional vector, and a text representation is generated;
step 2, encoding structural features, heterogeneous features and self-supervision information of the heterogeneous information network by adopting a graph encoder to obtain a graph representation;
step 3, pre-training and aligning the text representation and the graph representation through contrast learning;
step 4, introducing automatically generated learnable continuous prompt vectors, providing the resulting natural language prompts to the text encoder, comparing them with the structural and heterogeneous feature representations generated by the graph encoder to produce the weights used in classification, and fusing to obtain a single representation;
and 5, generating keywords of the heterogeneous information network by using the obtained single representation.
2. The method for generating heterogeneous information network keywords based on contrast learning pre-training according to claim 1, wherein the step 2 specifically comprises the following steps:
step 201, sampling heterogeneous subgraphs, wherein for a given node, the subgraphs around the node need to be sampled first;
step 202, capturing the structural information of the subgraph using an autoencoder, wherein, given the adjacency matrix A of the subgraph, A is first processed by the encoder to generate a multi-layer latent representation, and the decoder then reverses this process to obtain a reconstructed output Â; the autoencoder aims to minimize the reconstruction error between input and output so that nodes with similar structures have similar representations, and the loss function L_structure is calculated as:
L_structure = ||(Â - A) ⊙ B||²,
where B is a penalty applied to the non-zero elements to mitigate the sparsity problem, ⊙ denotes element-wise multiplication, and ||·||² denotes the (squared-norm) regularization operation;
step 203, exploring the heterogeneous characteristics of the heterogeneous information network: nodes of the same type are grouped together, and a Bi-LSTM is applied to each group to model the type-specific characteristics; given the node group V_{T_j} of type T_j, the representation h_v^{T_j} of node v is calculated as:
h_v^{T_j} = ( Σ_{v'∈V_{T_j}} Bi-LSTM{v}[v'] ) / |V_{T_j}|,
where Bi-LSTM{v} denotes applying the Bi-LSTM to the type group of node v, Bi-LSTM{v}[v'] is the resulting hidden state of node v', and |V_{T_j}| denotes the number of nodes in the group V_{T_j};
an attention mechanism is then applied to aggregate the groups of all types and generate the representation h_v of the given node:
α_{v,j} = exp(δ(u^T h_v^{T_j})) / Σ_{T_k∈{T}} exp(δ(u^T h_v^{T_k})),   h_v = Σ_{T_j∈{T}} α_{v,j} h_v^{T_j},
where δ denotes the activation function (LeakyReLU is used), u ∈ R^d is a weight parameter, u^T denotes the transpose of u, h_v^{T_j} is the type-T_j representation of node v, {T} denotes the set of types, and α_{v,j} denotes the attention weight;
step 204, pre-training the subgraph based on self-supervision information, introducing two pre-training tasks, a masked node modeling task and an edge reconstruction task, so as to realize graph exploration at the node level and the edge level.
3. The method for generating heterogeneous information network keywords based on contrast learning pre-training of claim 2, wherein the masked node modeling task sorts the nodes according to their ranking and randomly selects a preset proportion of the nodes to be replaced with the [MASK] token; the sorted nodes are fed into the Transformer encoder, with the representations generated by the Bi-LSTM serving as token representations and the ordering information serving as position vectors; the hidden state learned by the Transformer encoder is fed into the feed-forward layer to predict the target node, expressed mathematically as:
p_v = softmax(W_MNM z_v),
where z_v is the output of the feed-forward layer FeedForward(·) applied to the hidden state, softmax(·) denotes the activation function, W_MNM ∈ R^{V_v × d} is the classification weight shared with the input node representation matrix, V_v is the number of nodes in the subgraph, d is the dimension of the hidden vector, and p_v is the predictive distribution of v over all nodes; in training, the cross entropy between the one-hot label and the prediction is used, and the loss function L_MNM is calculated as:
L_MNM = - Σ_i y_i log p_i,
where y_i and p_i are the i-th components of the one-hot label vector y and the prediction vector p, y denoting the set of labels and p the set of prediction probabilities;
the edge reconstruction task samples positive edges and negative edges in the subgraph, where positive edges are edges that actually exist in the original subgraph and negative edges do not; given the union N_S of the positive and negative edges, the edge reconstruction score is calculated by the inner product between a pair of nodes, i.e., <h_u, h_v>, where h_u and h_v are the representations of nodes u and v; the binary cross entropy between the predicted edges and the true edges is adopted to calculate the loss function L_ER for edge reconstruction:
L_ER = (1/|N_S|) Σ_{(u,v)∈N_S} BinaryCrossEntropy(e_uv, <h_u, h_v>),
where |N_S| denotes the number of node pairs, BinaryCrossEntropy(·) denotes the binary cross entropy, e_uv denotes the ground-truth value for node u and node v, and (u, v) denotes the edge connecting node u and node v.
4. The method for generating heterogeneous information network keywords based on contrast learning pre-training according to claim 2, wherein the subgraph around a node is sampled with a random-walk-with-restart strategy, which iteratively traverses the neighborhood of the given node v and returns to the starting node v with a certain probability; in order to sample nodes of higher importance, the random walk strategy reaches high-ranked nodes first, and in order to give the graph encoder heterogeneity, the traversal is constrained to sample all types of nodes.
5. The method for generating heterogeneous information network keywords based on contrast learning pre-training of claim 1, wherein the contrastive learning is used to align the text representations with the graph representations during training; the learning objective is designed as a contrastive loss function: given a batch of text-subgraph pairs, the similarity scores of matched text-subgraph pairs are maximized while the scores of unmatched text-subgraph pairs are minimized.
6. The method for generating heterogeneous information network keywords based on contrast learning pre-training of claim 5, wherein in the contrastive learning process, given a node v, the node representation learned by the graph encoder is denoted as H, and the weight vectors generated by the text encoder are denoted as {w_i}_{i=1}^{K}, where K represents the number of categories and each weight w_i is learned from a prompt; the prediction probability is calculated as:
p(y = i | v) = exp(<w_i, H> / τ) / Σ_{j=1}^{K} exp(<w_j, H> / τ),
where τ is a learned temperature hyper-parameter, <·,·> denotes the similarity score, and <w_i, H> denotes the similarity score between the text weight vector w_i and the node representation vector H.
7. The method for generating heterogeneous information network keywords based on contrast learning pre-training according to claim 2, wherein the automatically generated learnable continuous prompt vectors introduced in step 4 are continuous vectors learned end-to-end from the data that replace discrete text words, and the prompt P input to the text encoder is designed as:
P = [V_1][V_2]…[V_M][CLASS],
where [CLASS] denotes the class label of the node, each [V_m] is a word vector with the same dimension as the word representations in the training stage, and M is a hyper-parameter specifying the number of continuous text vectors in the prompt; after the continuous prompt P is input to the text encoder Text(·), a classification weight vector representing the node concept is obtained, and the prediction probability is calculated as
p(y = i | v) = exp(<Text(P_i), H> / τ) / Σ_{j=1}^{K} exp(<Text(P_j), H> / τ),
where the class label in each prompt P_i is replaced by the word vector of the i-th class name, and Text(P_i) denotes the vector obtained after feeding the prompt P_i into the text encoder.
8. The contrast learning pre-training-based heterogeneous information network keyword generation method of claim 7, wherein a more accurate prompt vector is obtained in step 4 by using a residual connection between the text encoder and the graph encoder to exploit the context subgraph of the given node: the text representations of the category labels and the node representations in the subgraph are input to a text-subgraph self-attention layer, which helps the text features find the context nodes most relevant to the given node;
after obtaining the output D_e of the text-subgraph attention layer, the text features are updated by means of the residual connection,
Text(P) ← Text(P) + λ·D_e,
where λ is a learnable parameter controlling the extent of the residual connection.
9. The method for generating heterogeneous information network keywords based on contrast learning pre-training of claim 1, wherein the text encoder uses a Sentence-BERT model to generate a text representation of a fixed size.
10. The method for generating heterogeneous information network keywords based on contrast learning pre-training of claim 8, wherein λ is initialized to 10^-4.
CN202310587606.9A 2023-05-23 2023-05-23 Heterogeneous information network keyword generation method based on contrast learning pre-training Pending CN116662565A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310587606.9A CN116662565A (en) 2023-05-23 2023-05-23 Heterogeneous information network keyword generation method based on contrast learning pre-training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310587606.9A CN116662565A (en) 2023-05-23 2023-05-23 Heterogeneous information network keyword generation method based on contrast learning pre-training

Publications (1)

Publication Number Publication Date
CN116662565A true CN116662565A (en) 2023-08-29

Family

ID=87721709

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310587606.9A Pending CN116662565A (en) 2023-05-23 2023-05-23 Heterogeneous information network keyword generation method based on contrast learning pre-training

Country Status (1)

Country Link
CN (1) CN116662565A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116994098A (en) * 2023-09-27 2023-11-03 西南交通大学 Large model prompt learning method based on category attribute knowledge enhancement
CN116994098B (en) * 2023-09-27 2023-12-05 西南交通大学 Large model prompt learning method based on category attribute knowledge enhancement
CN117576710A (en) * 2024-01-15 2024-02-20 西湖大学 Method and device for generating natural language text based on graph for big data analysis
CN117576710B (en) * 2024-01-15 2024-05-28 西湖大学 Method and device for generating natural language text based on graph for big data analysis

Similar Documents

Publication Publication Date Title
CN111581510B (en) Shared content processing method, device, computer equipment and storage medium
CN109840287B (en) Cross-modal information retrieval method and device based on neural network
CN116304066B (en) Heterogeneous information network node classification method based on prompt learning
CN109829104A (en) Pseudo-linear filter model information search method and system based on semantic similarity
CN116662565A (en) Heterogeneous information network keyword generation method based on contrast learning pre-training
CN109977220B (en) Method for reversely generating abstract based on key sentence and key word
CN114896388A (en) Hierarchical multi-label text classification method based on mixed attention
CN113515632A (en) Text classification method based on graph path knowledge extraction
CN115495555A (en) Document retrieval method and system based on deep learning
CN112732862B (en) Neural network-based bidirectional multi-section reading zero sample entity linking method and device
CN116127090B (en) Aviation system knowledge graph construction method based on fusion and semi-supervision information extraction
CN118171149B (en) Label classification method, apparatus, device, storage medium and computer program product
CN111858984A (en) Image matching method based on attention mechanism Hash retrieval
CN115203507A (en) Event extraction method based on pre-training model and oriented to document field
CN114461890A (en) Hierarchical multi-modal intellectual property search engine method and system
CN117763363A (en) Cross-network academic community resource recommendation method based on knowledge graph and prompt learning
CN114281982B (en) Book propaganda abstract generation method and system adopting multi-mode fusion technology
CN115422939A (en) Fine-grained commodity named entity identification method based on big data
CN114298055B (en) Retrieval method and device based on multilevel semantic matching, computer equipment and storage medium
CN116662566A (en) Heterogeneous information network link prediction method based on contrast learning mechanism
CN114758283A (en) Video label classification method, system and computer readable storage medium
CN117828024A (en) Plug-in retrieval method, device, storage medium and equipment
CN113486143A (en) User portrait generation method based on multi-level text representation and model fusion
CN116955818A (en) Recommendation system based on deep learning
CN115827871A (en) Internet enterprise classification method, device and system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination