CN110705260A - Text vector generation method based on unsupervised graph neural network structure - Google Patents

Text vector generation method based on unsupervised graph neural network structure

Info

Publication number
CN110705260A
Authority
CN
China
Prior art keywords
word
document
node
text
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910905090.1A
Other languages
Chinese (zh)
Other versions
CN110705260B (en)
Inventor
段大高
闫光宇
韩忠明
杨伟杰
刘文文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Technology and Business University
Original Assignee
Beijing Technology and Business University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Technology and Business University filed Critical Beijing Technology and Business University
Priority to CN201910905090.1A priority Critical patent/CN110705260B/en
Publication of CN110705260A publication Critical patent/CN110705260A/en
Application granted granted Critical
Publication of CN110705260B publication Critical patent/CN110705260B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/088: Non-supervised learning, e.g. competitive learning
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text vector generation method based on an unsupervised graph neural network structure. First, stop-word processing is carried out on all collected text corpora using a stop-word corpus, keywords are selected from the processed corpora, the document-keyword weights and the weights between keywords are calculated and stored, and a text-keyword network adjacency matrix is constructed. Second, trained word vectors are used as word node features, and the initial node features of each text are calculated from the keywords appearing in the document, giving a text-keyword network feature matrix. Finally, a negative-sample adjacency matrix and feature matrix corresponding to the positive sample are constructed, the loss defined by the loss function and the constructed network model is driven to convergence by gradient descent, and the converged text node feature vectors are taken to obtain unsupervised-GNN-based text representation vectors. The invention fully considers discontinuous global word co-occurrence and long-distance semantics in the corpus, as well as the overall relevance of a single document to the whole document-keyword set.

Description

Text vector generation method based on unsupervised graph neural network structure
Technical Field
The invention relates to the technical field of data mining and natural language processing, in particular to a text vector generation method based on an unsupervised graph neural network, which can be applied to extracting document vectors and also can be applied to downstream tasks such as text classification, clustering and text similarity calculation.
Background
Text has become a hot research topic on many platforms. Since most texts are unstructured or semi-structured data, text mining has long been one of the important research directions of data mining in multiple fields. Meanwhile, with the gradual popularization of the internet, the volume of web text keeps growing and the amount of information grows ever faster, so it becomes increasingly difficult to extract the information a user needs from massive data.
Conventional methods represent a document by the average of all word vectors it contains, or adopt a doc2vec model. Recently, deep learning models, including convolutional neural networks (CNNs) and recurrent neural networks (RNNs), have been widely used to learn text representations. Because CNNs and RNNs give priority to locality and sequentiality, these deep learning models capture semantic and syntactic information in locally continuous word sequences well, but they ignore discontinuous global word co-occurrence and long-distance semantics in the corpus, as well as the overall relevance of a single document to the whole document-keyword set. To address this problem, a novel text vector generation method based on an unsupervised graph neural network is proposed.
Disclosure of Invention
The invention aims to provide a text vector generation method based on an unsupervised graph neural network structure, which represents texts as vectors using an unsupervised graph neural network so that the resulting document representation vectors can be used for downstream tasks such as classification and clustering, thereby solving the problems of the prior art.
In order to achieve the purpose, the invention provides the following scheme: the invention provides a text vector generation method based on an unsupervised graph neural network structure, which comprises the following steps:
step one, obtaining keywords: performing word segmentation and stop-word removal on all texts in a corpus to obtain a document set, then calculating and storing the word frequency of each word in the document set, and taking the words whose word frequency is greater than or equal to n as keywords, wherein n > 1;
step two, judging the semantic relevance of words in the corpus: m consecutive words of an unprocessed single document are defined as a word co-occurrence window, wherein m > 1 and an unprocessed document is one that has undergone word segmentation but not stop-word removal; #W(i) is the number of word co-occurrence windows, over all documents in the corpus, that contain word i; #W(i, j) is the number of windows, over all documents in the corpus, in which words i and j appear in the same co-occurrence window; and #W is the total number of word co-occurrence windows in all documents of the corpus. The formula of the inter-keyword mutual information PMI(i, j) is as follows:
PMI(i, j) = log( p(i, j) / ( p(i) × p(j) ) )
p(i, j) = #W(i, j) / #W
p(i) = #W(i) / #W
wherein p(i) represents the proportion of co-occurrence windows containing word i among all co-occurrence windows, and p(i, j) represents the proportion of co-occurrence windows containing both words i and j among all co-occurrence windows; PMI(i, j) > 0 indicates high semantic relevance of the words in the corpus, and PMI(i, j) < 0 indicates little or no semantic relevance in the corpus;
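By way of illustration only, the window counting and PMI calculation of step two could be sketched as follows in Python; the function and variable names are assumptions, the base-10 logarithm is assumed (matching the worked example later in the description), and end-of-sentence padding follows that example.

```python
import math
from itertools import combinations

def pmi_scores(tokenized_docs, keywords, m=5):
    """Slide a window of m words over each (unprocessed) document and
    compute PMI(i, j) for every keyword pair co-occurring in a window."""
    keywords = set(keywords)
    total_windows = 0          # #W
    word_windows = {}          # #W(i)
    pair_windows = {}          # #W(i, j)
    for tokens in tokenized_docs:
        padded = tokens + [None] * (m - 1)   # pad so every position starts a window
        for start in range(len(tokens)):
            window = padded[start:start + m]
            total_windows += 1
            present = keywords & {w for w in window if w is not None}
            for w in present:
                word_windows[w] = word_windows.get(w, 0) + 1
            for i, j in combinations(sorted(present), 2):
                pair_windows[(i, j)] = pair_windows.get((i, j), 0) + 1
    pmi = {}
    for (i, j), n_ij in pair_windows.items():
        p_ij = n_ij / total_windows
        p_i = word_windows[i] / total_windows
        p_j = word_windows[j] / total_windows
        pmi[(i, j)] = math.log10(p_ij / (p_i * p_j))
    return pmi
```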
step three, calculating and storing the document-word weight TF-IDF (word frequency-inverse text frequency), wherein the formula is as follows:
TF-IDF=tf(t,Di)×idf(t)
idf(t) = log( M / n_t + 0.01 )
wherein tf(t, D_i) represents the word frequency of the keyword t in the i-th document, M represents the total number of documents, n_t represents the number of documents in the document set in which keyword t appears, and idf represents the inverse text frequency;
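A minimal sketch of this document-word weighting, assuming the base-10 logarithm and the 0.01 smoothing term given above (all names are illustrative):

```python
import math

def tf_idf(term_counts, M):
    """term_counts[d][t] = word frequency of keyword t in document d;
    M = total number of documents. Returns weights[d][t] = TF-IDF."""
    n_t = {}                                   # number of documents containing t
    for counts in term_counts.values():
        for t in counts:
            n_t[t] = n_t.get(t, 0) + 1
    return {d: {t: tf * math.log10(M / n_t[t] + 0.01)
                for t, tf in counts.items()}
            for d, counts in term_counts.items()}

# e.g. with 'sky' appearing in 2 of M=3 documents and tf(sky, D2)=30,
# tf_idf({"D1": {"sky": 1}, "D2": {"sky": 30}}, M=3)["D2"]["sky"]
# gives 30 * log10(3/2 + 0.01).
```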
step four, constructing an adjacency matrix of the document-word complex network;
step five, training all keywords in the text set, and expressing the keywords by word vectors; setting the vector of the initial document i to be equal to the sum of word vectors of all keywords in the document i divided by the number of the keywords in the document i;
constructing a node feature matrix X of the document-word complex network, wherein each row corresponds to a node and each column to a feature dimension; a keyword node feature is the word vector of the keyword, and a document node feature is the initial document vector; the node feature matrix X is defined as the positive-sample feature matrix and A as the positive-sample adjacency matrix; the negative sample uses the feature matrix with its rows shuffled as its feature matrix and the same adjacency matrix as the positive sample, i.e. X̃ = row-shuffled X and Ã = A;
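A negative sample of this kind can be produced simply by permuting the rows of X while keeping A unchanged; the following numpy sketch (names assumed) shows one way to do it.

```python
import numpy as np

def make_negative_sample(X, A, rng=None):
    """Return (X_tilde, A_tilde): the feature matrix with its rows shuffled
    and the unchanged adjacency matrix."""
    rng = np.random.default_rng(rng)
    perm = rng.permutation(X.shape[0])   # random reassignment of node features
    return X[perm], A                    # A_tilde = A
```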
Step six, defining a loss function L:
L = -(1 / (N + M)) × ( Σ_{i=1..N} E_{(X,A)}[ log D(h_i, s) ] + Σ_{j=1..M} E_{(X̃,Ã)}[ log( 1 - D(h̃_j, s) ) ] )

wherein N represents the number of positive-sample nodes, M represents the number of negative-sample nodes, (X, A) represents the positive-sample network and (X̃, Ã) represents the negative-sample network; h_i is the representation vector of node i after local features are extracted from the positive sample, and h̃_j is the representation vector of node j after local features are extracted from the negative sample; the local features of the positive and negative samples are extracted with the same convolution method ε, ε(X, A) represents the node feature matrix after the positive sample is processed, and s = R(ε(X, A)) represents the global feature of the positive sample, wherein R is the processing applied to the node feature matrix after the positive sample is processed; D denotes the discriminator, which judges whether h_i (or h̃_j) and s are similar: a score approaching 1 indicates similarity, and a score approaching 0 indicates dissimilarity;
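Read as code, this loss is a binary cross-entropy that pushes discriminator scores of positive-sample nodes toward 1 and of negative-sample nodes toward 0. The PyTorch sketch below is one possible reading of the formula above, not the patented implementation; the function and tensor names are assumptions.

```python
import torch

def infomax_loss(pos_scores, neg_scores):
    """pos_scores: D(h_i, s) for the N positive-sample nodes, in (0, 1);
    neg_scores: D(h~_j, s) for the M negative-sample nodes, in (0, 1)."""
    eps = 1e-8  # numerical safety for the logarithm
    pos_term = torch.log(pos_scores + eps).sum()
    neg_term = torch.log(1.0 - neg_scores + eps).sum()
    n = pos_scores.numel() + neg_scores.numel()      # N + M
    return -(pos_term + neg_term) / n                # loss to be minimized
```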
step seven, constructing a graph neural network model
Processing the positive-sample node feature matrix X and the adjacency matrix A to obtain the negative-sample feature matrix X̃ and adjacency matrix Ã; extracting the local features of each node of the positive sample and the negative sample; obtaining the global feature s of the positive sample; passing the node representations and s through the discriminator D; calculating the loss function; and updating ε, R and D by gradient descent on the loss function until the loss converges;
and taking the converged text node feature vector to generate an unsupervised GNN-based text representation vector.
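Putting steps five to seven together, training could look like the loop sketched below; `encoder` and `discriminator` stand for ε and D (concrete choices for them, and for the readout R, are sketched after the preferred formulas further down), `infomax_loss` is the loss sketch given above, and the patience-based convergence test mirrors the embodiment, where training stops once the loss has not improved for five passes. All names are assumptions; since the preferred readout R is a parameter-free average, only ε and D carry learnable parameters here.

```python
import torch

def train(encoder, readout, discriminator, X, A, epochs=200, patience=5, lr=1e-3):
    params = list(encoder.parameters()) + list(discriminator.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    best, since_best = float("inf"), 0
    for _ in range(epochs):
        X_neg = X[torch.randperm(X.size(0))]        # negative sample: shuffled rows
        H = encoder(X, A)                           # local features, positive sample
        H_neg = encoder(X_neg, A)                   # local features, negative sample
        s = readout(H)                              # global feature of positive sample
        loss = infomax_loss(discriminator(H, s), discriminator(H_neg, s))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if loss.item() < best - 1e-6:
            best, since_best = loss.item(), 0
        else:
            since_best += 1
            if since_best >= patience:              # loss considered converged
                break
    with torch.no_grad():
        # converged node feature vectors; the document rows are the text vectors
        return encoder(X, A)
```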
Preferably, the process of step four specifically comprises: firstly, taking each document as a node in a network, and taking each keyword as a node; then, edges among the nodes are constructed, and the edge weight between node i and node j is defined as A_ij, with the formula:

A_ij = PMI(i, j), if i and j are both keywords and PMI(i, j) > 0;
A_ij = TF-IDF_ij (the document-word weight of step three), if i is a document and j is a keyword appearing in it (and symmetrically);
A_ij = 1, if i = j;
A_ij = 0, otherwise.
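One possible implementation of this edge-weight rule is sketched below; `tfidf` and `pmi` are assumed to be the dictionaries produced in steps three and two, and the node ordering (documents first, then keywords) follows the example in the embodiment.

```python
import numpy as np

def build_adjacency(doc_ids, keywords, tfidf, pmi):
    """Rows/columns ordered as documents followed by keywords."""
    nodes = list(doc_ids) + list(keywords)
    index = {n: k for k, n in enumerate(nodes)}
    A = np.eye(len(nodes))                      # A_ii = 1
    for d in doc_ids:                           # document-word edges: TF-IDF weights
        for w, weight in tfidf.get(d, {}).items():
            if w in index:
                A[index[d], index[w]] = A[index[w], index[d]] = weight
    for (i, j), value in pmi.items():           # word-word edges: positive PMI only
        if value > 0 and i in index and j in index:
            A[index[i], index[j]] = A[index[j], index[i]] = value
    return A, nodes
```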
preferably, the specific process of the step six is as follows:
local features are extracted by applying a single-layer GCN structure: the formula of the node feature matrix epsilon (X, A) after the positive sample processing is as follows:
ε(X, A) = σ( D̂^(-1/2) Â D̂^(-1/2) X θ )

wherein Â = A + I_N, I_N is the identity matrix, D̂ is the degree matrix of Â (D̂_ii = Σ_j Â_ij), and θ is a learnable parameter matrix;
for global features, averaging is performed on all node features of the positive sample by using an averaging method:
s = R(ε(X, A)) = σ( (1/N) × Σ_{i=1..N} h_i )
wherein sigma is a nonlinear Sigmoid function, and N represents the number of nodes;
for the discriminator, a simple bilinear scoring function is applied:
D(h_i, s) = σ( h_i^T W s )
where W is a learnable scoring matrix and σ is a nonlinear Sigmoid function.
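The three preferred components above (single-layer GCN ε, averaging readout R and bilinear discriminator D) could be realized as in the following PyTorch sketch. It is a minimal reading of the formulas, not the patented implementation; class names, initialization ranges and the absence of a batch dimension are all assumptions.

```python
import torch
import torch.nn as nn

class GCNEncoder(nn.Module):
    """epsilon(X, A) = sigma(D^-1/2 (A + I) D^-1/2 X theta), with sigma = ReLU."""
    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.theta = nn.Linear(in_dim, hidden_dim, bias=False)   # learnable theta

    def forward(self, X, A):
        A_hat = A + torch.eye(A.size(0))                 # A + I_N
        d_inv_sqrt = torch.diag(A_hat.sum(dim=1).pow(-0.5))
        return torch.relu(d_inv_sqrt @ A_hat @ d_inv_sqrt @ self.theta(X))

def readout(H):
    """s = sigmoid of the mean of all node features (no learnable parameters)."""
    return torch.sigmoid(H.mean(dim=0))

class BilinearDiscriminator(nn.Module):
    """D(h_i, s) = sigmoid(h_i^T W s), giving one score per node."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.W = nn.Parameter(torch.empty(hidden_dim, hidden_dim))
        nn.init.uniform_(self.W, -0.35, 0.35)            # arbitrary small init

    def forward(self, H, s):
        return torch.sigmoid(H @ self.W @ s)
```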
The invention discloses the following technical effects: in the process of compressing and converting text content, constructing a text-keyword network is an effective method: after the keywords of each text are found, the large collection of texts and keywords is converted into one large complex network. This greatly compresses the scale of the text while losing as little of the basic information in the text as possible. The constructed text-keyword network is then learned with a graph neural network, yielding a new text representation vector that contains not only the keyword information of the document but also the keyword weights and the structural information of the graph. In the method of the invention, the data are easy to obtain, the keyword processing is simple and effective, the text-keyword complex network adjacency matrix and node feature matrix are convenient to obtain, the negative sample is easy to construct, the graph neural network model is easy to compute, and discontinuous global word co-occurrence and long-distance semantics in the corpus, as well as the overall relevance of a single document to the whole document-keyword set, are fully considered.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is an exemplary diagram of text-keywords in step four according to the embodiment of the present invention;
fig. 3 is an exemplary diagram of a network model in step seven of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Referring to fig. 1 to 3, the present invention provides a method for generating a text vector based on an unsupervised graph neural network, which specifically includes the following steps:
Step one, acquiring a large amount of text corpora as a corpus, taking the data set 20Newsgroups (20NG) as an example, with download address http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz, which includes 18846 documents. The present invention borrows 3 documents from these data to build a complex network example.
Obtaining the stop-word corpus, taking the stop-word list compiled by a Sina (Xinlang) user as an example, with download address http://blog.sina.com.cn/s/blog_a19a3770102wjau.html, which includes 891 stop words such as 'about', 'above', 'also', 'I', 'wait', 'to', 'the', etc. The invention uses these data to remove stop words: all texts are processed with the stop-word list, and whenever one of the 891 stop words such as 'about', 'above', 'also', 'I', 'wait', 'to', 'the' appears in a text, it is deleted from that text, finally giving the document set after stop-word removal. For example, document D1 is "I wait to fly in the sky"; the words are deleted in the order of the stop-word list: 'about' is first searched for in the document and deleted if present, then 'above' is deleted, and so on until the last word in the stop-word list has been processed. Since 'I', 'wait', 'to', 'in', 'the' are stop words, document D1 after stop-word removal is "fly sky". The word frequency (TF) of each word in the document set after stop-word removal is then calculated and stored, where the word frequency is the number of times a word occurs in an article, and the words with word frequency greater than or equal to 5 are taken as keywords. Then document D1 contains keywords {w1, w2}, document D2 contains keywords {w1, w3}, and document D3 contains keywords {w3, w4}.
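The preprocessing of step one (stop-word removal followed by frequency-based keyword selection) could be sketched as follows. The threshold of 5 matches the example; whether the frequency is counted per article or over the whole document set is not entirely clear from the text, so the sketch counts it per article and keeps a word if it reaches the threshold in any article, which is an assumption. All names are illustrative.

```python
from collections import Counter

def preprocess(docs, stop_words, min_tf=5):
    """docs: list of token lists; returns (cleaned docs, keyword set)."""
    stop = {w.lower() for w in stop_words}
    cleaned = [[w for w in doc if w.lower() not in stop] for doc in docs]
    keywords = set()
    for doc in cleaned:
        tf = Counter(doc)                      # per-article word frequency
        keywords.update(w for w, c in tf.items() if c >= min_tf)
    return cleaned, keywords

# e.g. preprocess([["I", "wait", "to", "fly", "in", "the", "sky"]],
#                 ["I", "wait", "to", "in", "the"], min_tf=1)
# leaves ["fly", "sky"] as the cleaned document.
```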
Step two, calculating and storing the inter-keyword mutual information PMI(i, j), the formula of which is as follows:
PMI(i, j) = log( p(i, j) / ( p(i) × p(j) ) )
p(i, j) = #W(i, j) / #W
p(i) = #W(i) / #W
the size of a word co-occurrence window is regulated to be 5, the invention defines continuous 5 words of an unprocessed single document as a word co-occurrence window, wherein # W (i) is the number of the word co-occurrence windows containing the occurrence of a word i in all documents in a corpus, # W (i, j) is the number of the word co-occurrence windows containing the occurrence of the words i and j in all documents in the corpus and # W is the total number of the word co-occurrence windows in all the documents in the corpus, namely W is the same as the number of the words in all the documents. A positive PMI value implies high semantic relevance of words in the corpus, while a negative PMI value indicates little or no semantic relevance in the corpus. For example for original document D1In the "I way to fly in the sky", the first word co-occurrence window is "I way to fly in", the second word co-occurrence window is "way to fly in the same", the third word co-occurrence window is "to fly in the sky", the fourth word co-occurrence window is "fly in the sky", the fifth word co-occurrence window is "in the sky", the sixth word co-occurrence window is "the sky", and the seventh word co-occurrence window is "sky". Wherein, represents the word appearing after the word from the word to the word in the word sky, if the sentence is the endSentence then represents the automatically filled space symbol. If 'sky' and 'fly' are document D1The keywords do not appear in other co-occurrence windows at the same time, thenAssuming that # W is 100000, W (sky) is 5, and W (fly) is 40, the method is as follows
Figure BDA0002213042240000084
PMI (sky, fly) ═ log1000 ═ 3. Then PMI (w) is calculated1,w2)=3,PMI(w1,w3)=2,PMI(w1,w4)=-1,PMI(w2,w3)=1,PMI(w2,w4)=-2,PMI(w3,w4)=2。
Step three, calculating and storing the weight TF-IDF (word frequency-inverse text frequency) of the document-word, wherein the formula is as follows:
TF-IDF=tf(t,Di)×idf(t)
wherein tf(t, D_i) is the word frequency of keyword t in the i-th document, M is the total number of documents, n_t is the number of documents in the document set in which keyword t appears, and idf(t) = log( M / n_t + 0.01 ) denotes the inverse text frequency. The purpose of adding 0.01 is to prevent the document-word weight from being 0 when n_t = M. (The text frequency refers to the number of articles in the whole corpus in which a certain keyword appears; the inverse text frequency is its reciprocal and is mainly used to reduce the effect of words that are common in all documents but have little influence on any particular document.)
For the 3 selected documents, the total number of documents M = 3. For 'sky', if its word frequency in document D2 is tf(sky, D2) = 30 and it is a keyword in 2 documents, i.e. n_sky = 2, then idf(sky) = log( 3/2 + 0.01 ) = log 1.51, and the document-word weight is calculated as TF-IDF(sky, D2) = 30 × log 1.51 ≈ 5.37 (taking the base-10 logarithm as in step two).
Step four, constructing a document-word complex network adjacency matrix:
each document is taken as a node in the network and each keyword is also taken as a node. Constructing edges between nodes, and defining the edge weight between the node i and the node j as AijThe formula is as follows:
suppose that 3 of the documents are selected to build a document-word complex network, document D1Containing a keyword { w1,w2Document D2Containing a keyword { w1,w3Document D3Containing a keyword { w3,w4},PMI(w1,w2)=3,PMI(w1,w3)=2,PMI(w1,w4)=-1,PMI(w2,w3)=1,PMI(w2,w4)=-2,PMI(w3,w4)=2;
Figure BDA0002213042240000101
Figure BDA0002213042240000102
Figure BDA0002213042240000103
The complex network adjacency matrix is a:
Figure BDA0002213042240000104
since there are 3 documents and 4 keywords, the adjacency matrix A is a 7 × 7 matrix with the order { D }1、D2、D3、W1、w2、w3、w4}。
Step five, obtaining pre-trained word vectors (e.g. trained with word2vec), taking the GloVe word vectors trained on Wikipedia as an example, with download address http://nlp.stanford.edu/data/glove.6B.zip. For the example, the word vector dimension is taken as 3. All keywords in the text set are represented by the trained word vectors. The initial vector of document i is equal to the sum of the word vectors of all keywords in document i divided by the number of keywords in document i. For example, document D1 contains keywords {w1, w2}; if the node representation vector of w1 is (1, 2, 3) and that of w2 is (3, 1, 2), then the representation vector of D1 is ( (1, 2, 3) + (3, 1, 2) ) / 2 = (2, 1.5, 2.5).
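This keyword-vector averaging is straightforward to reproduce; a minimal numpy sketch with the example values (names assumed):

```python
import numpy as np

def initial_doc_vector(doc_keywords, word_vectors):
    """Average of the word vectors of the keywords contained in a document."""
    return np.mean([word_vectors[w] for w in doc_keywords], axis=0)

# reproducing the example: w1 = (1, 2, 3), w2 = (3, 1, 2)
word_vectors = {"w1": np.array([1.0, 2.0, 3.0]), "w2": np.array([3.0, 1.0, 2.0])}
print(initial_doc_vector(["w1", "w2"], word_vectors))   # -> [2.  1.5 2.5]
```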
Similarly, the node feature matrix X of the document-word complex network is constructed, wherein each row corresponds to a node and each column to a feature dimension; the keyword node features are the word vectors of the keywords and the document node features are the initial document vectors. A is defined as the positive-sample adjacency matrix and the node feature matrix X as the positive-sample feature matrix; here X is the 7 × 3 matrix whose rows are the feature vectors of D1, D2, D3, w1, w2, w3, w4 in turn. The negative sample uses X with its rows shuffled as its feature matrix X̃, and its adjacency matrix is the same as that of the positive sample, i.e. Ã = A.
For example, if the features of nodes w1 and w2 are (1, 2, 3) and (3, 1, 2), then after shuffling the features of w1 and w2 may be (3, 1, 2) and (1, 2, 3) respectively; that is, for all node features, the dimensions and the contained values do not change, only the feature originally belonging to node w_i may no longer represent w_i. In other words, X̃ can be X with any permutation of its rows.
Step six: defining a loss function L:
L = -(1 / (N + M)) × ( Σ_{i=1..N} E_{(X,A)}[ log D(h_i, s) ] + Σ_{j=1..M} E_{(X̃,Ã)}[ log( 1 - D(h̃_j, s) ) ] )

wherein N represents the number of positive-sample nodes, M represents the number of negative-sample nodes, (X, A) represents the positive-sample network and (X̃, Ã) represents the negative-sample network; h_i is the representation vector of node i after local features are extracted from the positive sample, and h̃_j is the representation vector of node j after local features are extracted from the negative sample; the local features of the positive and negative samples are extracted with the same convolution method ε, ε(X, A) represents the node feature matrix after the positive sample is processed, and s = R(ε(X, A)) represents the global feature of the positive sample, wherein R is the processing applied to the node feature matrix after the positive sample is processed; D denotes the discriminator, which judges whether h_i (or h̃_j) and s are similar: a score approaching 1 indicates similarity, and a score approaching 0 indicates dissimilarity.
The node feature matrix ε(X, A) for extracting local features with a single-layer GCN structure is calculated as:

ε(X, A) = σ( D̂^(-1/2) Â D̂^(-1/2) X θ )

wherein Â = A + I_N, I_N is the identity matrix, D̂ is the degree matrix of Â (i.e. D̂_ii = Σ_j Â_ij), and θ is a learnable parameter matrix. For the nonlinearity σ, the ReLU function is used, i.e. for an input a, ReLU(a) = max(0, a).
For the global feature, a simple averaging method is applied over all node features of the positive sample:

s = R(ε(X, A)) = σ( (1/N) × Σ_{i=1..N} h_i )

where σ is a nonlinear Sigmoid function, N represents the number of nodes, and h_i is the i-th row of ε(X, A).
For the discriminator, a simple bilinear scoring function is applied:
D(h_i, s) = σ( h_i^T W s )
where W is a learnable scoring matrix and σ is a nonlinear Sigmoid function.
Step seven: constructing the graph neural network model.
Step 1: shuffle the rows of the positive-sample node feature matrix X to obtain the negative-sample feature matrix X̃, with adjacency matrix Ã = A. Step 2: extract the local features of each node of the positive sample, H = ε(X, A). Step 3: extract the local features of each node of the negative sample, H̃ = ε(X̃, Ã). Step 4: obtain the positive-sample global feature s and pass it, together with the node representations, through the discriminator D. Step 5: calculate the Loss; if the Loss has not converged, repeat steps 1, 2, 3, 4 and 5 until the Loss converges.
For example, in the first training pass: in the first step, A and X are input and the negative-sample matrices X̃ and Ã are obtained. In the second step, the node features of the positive sample are extracted; the normalized adjacency matrix D̂^(-1/2) Â D̂^(-1/2) is calculated first.
the learnable concealment layer number is set to 6, and θ is initialized to [ -0.5692, -0.6487,0.3339], [ -0.1275, -0.4908, -0.1815], [ -0.7875, -0.4873, -0.3584], [0.7263,0.3383,0.4262], [0.2747,0.5978, -0.5178], [0.6429, -0.4805, -0.2120] ] whose dimension is ([6,3 ]).
And calculating to obtain:
h [ [ [ [ -0.7087, -0.5292, -1.1873,4.3132,1.2835,0.4477], [ -0.9484, -0.5760, -1.3368,4.7530,2.1082,0.9065], [ -1.2537, -0.6958, -1.6096,5.6373,3.0459,1.2284], [ -0.5677, -0.5290, -1.1280, 4.1638, 0.7651,0.0254], [ -0.5199, -0.4663, -0.9613,3.5127,0.8292, -0.0187], [ -1.0183, -0.6611, -1.5236,5.4620,2.1234,0.9007], [ -1.1076, -0.6147, -1.3856,4.8227,2.7644,0.9474] ], its dimension is ([1,7,6 ]).
In the third step, the negative-sample node feature matrix H̃ is calculated in the same way; its dimension is ([1, 7, 6]).
In the fourth step, the global feature s is calculated; its dimension is ([1, 6]).
In the fifth step, W is initialized to [[-0.2551, 0.1770, -0.2642, -0.1486, 0.2632, -0.3471], [0.3415, 0.0906, -0.0688, 0.3749, -0.0906, -0.0022], [0.0060, -0.3362, -0.1600, -0.2831, 0.0946, -0.3743], [-0.2553, -0.0699, -0.1703, 0.2189, 0.2910, -0.2692], [0.2505, 0.2865, 0.1543, -0.1312, 0.2121, -0.1092], [0.2638, -0.3451, 0.1210, 0.2282, 0.3422, -0.0979]], whose dimension is ([1, 6, 6]).
The discriminator score is calculated as:
[1.7562,2.3045,2.9415,1.3836,1.1912,2.5069,2.5062,2.3998,1.1779,1.3023,2.6828,1.5789,2.8675,2.0953]. The Loss is calculated to be 1.483.
In the second training pass: in the first step, A, X, X̃ and Ã are the same as before; in the second step, D̂^(-1/2) Â D̂^(-1/2) is likewise unchanged, while θ becomes [[-0.5682, -0.6477, 0.3349], [-0.1285, -0.4918, -0.1825], [-0.7865, -0.4863, -0.3574], [0.7253, 0.3373, 0.4252], [0.2737, 0.5968, -0.5188], [0.6419, -0.4815, -0.2130]], whose dimension is ([6, 3]), and the obtained H = [[-0.7035, -0.5294, -1.1803, 4.3039, 1.2742, 0.4384], [-0.9422, -0.5762, -1.3290, 4.7432, 2.0984, 0.8967], [-1.2459, -0.6958, -1.6004, 5.6259, 3.0346, 1.2170], [-0.5631, -0.5292, -1.1212, 4.1545, 0.7558, 0.0162], [-0.5158, -0.4664, -0.9555, 3.5047, 0.8211, -0.0206], [-1.0114, -0.6612, -1.5147, 5.4508, 2.1122, 0.8895], [-1.1007, -0.6147, -1.3776, 4.8128, 2.7545, 0.9375]], whose dimension is ([1, 7, 6]). In the third step, H̃ is obtained in the same way, and in the fourth step the global feature s is likewise obtained by calculation.
In the fifth step, the learned weight matrix W becomes [[-0.2541, 0.1780, -0.2632, -0.1476, 0.2642, -0.3461], [0.3425, 0.0916, -0.0678, 0.3759, -0.0896, -0.0012], [0.0070, -0.3352, -0.1590, -0.2821, 0.0956, -0.3733], [-0.2563, -0.0709, -0.1713, 0.2179, 0.2900, -0.2702], [0.2495, 0.2855, 0.1533, -0.1322, 0.2111, -0.1102], [0.2628, -0.3461, 0.1200, 0.2272, 0.3412, -0.0989]], whose dimension is ([1, 6, 6]). The new discriminator scores [1.7128, 2.2531, 2.8782, 1.3445, 1.1597, 2.4496, 2.4513, 1.3324, 2.7238, 2.8783, 2.3160, 1.1949, 2.5691, 1.1294] can then be calculated, giving an updated Loss of 1.1571.
The 3rd training pass gives a Loss of 1.1187, then 1.1330, …, with 0.6888 at the 54th pass, 0.6988 at the 55th, 0.6938 at the 56th, 0.6944 at the 57th, 0.6992 at the 58th and 0.6973 at the 59th. With the patience set to 5, i.e. the Loss of passes 55 to 59 is never better than that of pass 54, the Loss is considered to have converged.
The converged text node feature vectors (the first three row vectors obtained in the 54th training pass), [-0.0229, 0.0709, 0.1803, 0.1003, -0.0238, 0.3006], [-0.0212, 0.0884, 0.1985, 0.1419, -0.0237, 0.3395] and [-0.0267, -0.0027, 0.0808, -0.0097, -0.0214, 0.1106], are then taken to generate the unsupervised-GNN-based text representation vectors.
The principle is as follows: first, a large amount of text corpora is collected and a stop-word corpus is downloaded. Stop-word processing is carried out on all collected text corpora using the stop-word corpus. The word frequency of the words in each document is then calculated and stored, the words with word frequency greater than n are taken as keywords, the text-keyword weights TF-IDF and the inter-keyword weights PMI are calculated and stored, and the network nodes and node edge weights are defined to obtain the text-keyword network adjacency matrix. Second, the trained word vectors are used as word node features, and the initial node features of each text are calculated from the keywords appearing in the document, giving the text-keyword network feature matrix. Then a negative-sample adjacency matrix and feature matrix corresponding to the positive sample are constructed, the defined loss function and the constructed network model steps are used with gradient descent to drive the loss to convergence, and the converged text node feature vectors are taken to obtain unsupervised-GNN-based text representation vectors. With this method, the data are easy to obtain, the keyword processing is simple and effective, the text-keyword complex network adjacency matrix and node feature matrix are convenient to obtain, the negative sample is easy to construct, the graph neural network model is easy to compute, discontinuous global word co-occurrence and long-distance semantics in the corpus and the overall relevance of a single document to the whole document-keyword set are fully considered, and the user can conveniently extract the required information from mass data.
In the description of the present invention, it is to be understood that the terms "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like, indicate orientations or positional relationships based on those shown in the drawings, are merely for convenience of description of the present invention, and do not indicate or imply that the referenced devices or elements must have a particular orientation, be constructed and operated in a particular orientation, and thus, are not to be construed as limiting the present invention.
The above-described embodiments are merely illustrative of the preferred embodiments of the present invention, and do not limit the scope of the present invention, and various modifications and improvements of the technical solutions of the present invention can be made by those skilled in the art without departing from the spirit of the present invention, and the technical solutions of the present invention are within the scope of the present invention defined by the claims.

Claims (3)

1. A text vector generation method based on an unsupervised graph neural network structure is characterized by comprising the following steps:
step one, obtaining keywords: performing word segmentation and stop-word removal on all texts in a corpus to obtain a document set, then calculating and storing the word frequency of each word in the document set, and selecting the words whose word frequency is greater than or equal to n as keywords, wherein n > 1;
step two, judging the semantic relevance of words in the corpus: m consecutive words of an unprocessed single document are defined as a word co-occurrence window, wherein m > 1 and an unprocessed document is one that has undergone word segmentation but not stop-word removal; #W(i) is the number of word co-occurrence windows, over all documents in the corpus, that contain word i; #W(i, j) is the number of windows, over all documents in the corpus, in which words i and j appear in the same co-occurrence window; and #W is the total number of word co-occurrence windows in all documents of the corpus. The formula of the inter-keyword mutual information PMI(i, j) is as follows:
PMI(i, j) = log( p(i, j) / ( p(i) × p(j) ) )
p(i, j) = #W(i, j) / #W
p(i) = #W(i) / #W
wherein p(i) represents the proportion of co-occurrence windows containing word i among all co-occurrence windows, and p(i, j) represents the proportion of co-occurrence windows containing both words i and j among all co-occurrence windows; PMI(i, j) > 0 indicates high semantic relevance of the words in the corpus, and PMI(i, j) < 0 indicates little or no semantic relevance in the corpus;
step three, calculating and storing the document-word weight TF-IDF (word frequency-inverse text frequency), wherein the formula is as follows:
TF-IDF=tf(t,Di)×idf(t)
idf(t) = log( M / n_t + 0.01 )
wherein tf(t, D_i) represents the word frequency of the keyword t in the i-th document, M represents the total number of documents, n_t represents the number of documents in the document set in which keyword t appears, and idf represents the inverse text frequency;
step four, constructing an adjacency matrix of the document-word complex network;
step five, training all keywords in the text set, expressing the keywords by word vectors, and setting the vectors of the initial document i to be equal to the sum of the word vectors of all the keywords in the document i divided by the number of the keywords in the document i;
constructing a node feature matrix X of the document-word complex network, wherein each row corresponds to a node and each column to a feature dimension; a keyword node feature is the word vector of the keyword, and a document node feature is the initial document vector; the node feature matrix X is defined as the positive-sample feature matrix and A as the positive-sample adjacency matrix; the negative sample uses the feature matrix with its rows shuffled as its feature matrix and the same adjacency matrix as the positive sample, i.e. X̃ = row-shuffled X and Ã = A;
Step six, defining a loss function L:
L = -(1 / (N + M)) × ( Σ_{i=1..N} E_{(X,A)}[ log D(h_i, s) ] + Σ_{j=1..M} E_{(X̃,Ã)}[ log( 1 - D(h̃_j, s) ) ] )

wherein N represents the number of positive-sample nodes, M represents the number of negative-sample nodes, (X, A) represents the positive-sample network and (X̃, Ã) represents the negative-sample network; h_i is the representation vector of node i after local features are extracted from the positive sample, and h̃_j is the representation vector of node j after local features are extracted from the negative sample; the local features of the positive and negative samples are extracted with the same convolution method ε, ε(X, A) represents the node feature matrix after the positive sample is processed, and s = R(ε(X, A)) represents the global feature of the positive sample, wherein R is the processing applied to the node feature matrix after the positive sample is processed; D denotes the discriminator, which judges whether h_i (or h̃_j) and s are similar: a score approaching 1 indicates similarity, and a score approaching 0 indicates dissimilarity;
step seven, constructing a graph neural network model
Processing the positive-sample node feature matrix X and the adjacency matrix A to obtain the negative-sample feature matrix X̃ and adjacency matrix Ã; extracting the local features of each node of the positive sample and the negative sample; obtaining the global feature s of the positive sample; passing the node representations and s through the discriminator D; calculating the loss function; and updating ε, R and D by gradient descent on the loss function until the loss converges;
and taking the converged text node feature vector to generate an unsupervised GNN-based text representation vector.
2. The unsupervised graph neural network structure-based text vector generation method of claim 1, wherein: the process of step four specifically comprises: firstly, taking each document as a node in a network, and taking each keyword as a node; then, edges among the nodes are constructed, and the edge weight between node i and node j is defined as A_ij, with the formula:

A_ij = PMI(i, j), if i and j are both keywords and PMI(i, j) > 0;
A_ij = TF-IDF_ij, if i is a document and j is a keyword appearing in it (and symmetrically);
A_ij = 1, if i = j;
A_ij = 0, otherwise.
3. the unsupervised graph neural network structure-based text vector generation method of claim 1, wherein: the concrete process of the step six is as follows:
local features are extracted by applying a single-layer GCN structure: the formula of the node feature matrix epsilon (X, A) after the positive sample processing is as follows:
ε(X, A) = σ( D̂^(-1/2) Â D̂^(-1/2) X θ )

wherein Â = A + I_N, I_N is the identity matrix, D̂ is the degree matrix of Â (D̂_ii = Σ_j Â_ij), and θ is a learnable parameter matrix;
for global features, averaging is performed on all node features of the positive sample by using an averaging method:

s = R(ε(X, A)) = σ( (1/N) × Σ_{i=1..N} h_i )

wherein σ is a nonlinear Sigmoid function, and N represents the number of nodes;
for the discriminator, a simple bilinear scoring function is applied:
D(h_i, s) = σ( h_i^T W s )
where W is a learnable scoring matrix and σ is a nonlinear Sigmoid function.
CN201910905090.1A 2019-09-24 2019-09-24 Text vector generation method based on unsupervised graph neural network structure Expired - Fee Related CN110705260B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910905090.1A CN110705260B (en) 2019-09-24 2019-09-24 Text vector generation method based on unsupervised graph neural network structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910905090.1A CN110705260B (en) 2019-09-24 2019-09-24 Text vector generation method based on unsupervised graph neural network structure

Publications (2)

Publication Number Publication Date
CN110705260A true CN110705260A (en) 2020-01-17
CN110705260B CN110705260B (en) 2023-04-18

Family

ID=69196022

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910905090.1A Expired - Fee Related CN110705260B (en) 2019-09-24 2019-09-24 Text vector generation method based on unsupervised graph neural network structure

Country Status (1)

Country Link
CN (1) CN110705260B (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111461301A (en) * 2020-03-30 2020-07-28 北京沃东天骏信息技术有限公司 Serialized data processing method and device, and text processing method and device
CN111492370A (en) * 2020-03-19 2020-08-04 香港应用科技研究院有限公司 Device and method for recognizing text images of a structured layout
CN111552803A (en) * 2020-04-08 2020-08-18 西安工程大学 Text classification method based on graph wavelet network model
CN111694957A (en) * 2020-05-29 2020-09-22 新华三大数据技术有限公司 Question list classification method and device based on graph neural network and storage medium
CN112000788A (en) * 2020-08-19 2020-11-27 腾讯云计算(长沙)有限责任公司 Data processing method and device and computer readable storage medium
CN112016438A (en) * 2020-08-26 2020-12-01 北京嘀嘀无限科技发展有限公司 Method and system for identifying certificate based on graph neural network
CN112214993A (en) * 2020-09-03 2021-01-12 拓尔思信息技术股份有限公司 Graph neural network-based document processing method and device and storage medium
CN112364141A (en) * 2020-11-05 2021-02-12 天津大学 Scientific literature key content potential association mining method based on graph neural network
CN112465006A (en) * 2020-11-24 2021-03-09 中国人民解放军海军航空大学 Graph neural network target tracking method and device
CN112860897A (en) * 2021-03-12 2021-05-28 广西师范大学 Text classification method based on improved ClusterGCN
CN113220884A (en) * 2021-05-19 2021-08-06 西北工业大学 Graph neural network text emotion classification method based on double sliding windows
WO2021184396A1 (en) * 2020-03-19 2021-09-23 Hong Kong Applied Science and Technology Research Institute Company Limited Apparatus and method for recognizing image-based content presented in a structured layout
CN114119191A (en) * 2020-08-28 2022-03-01 马上消费金融股份有限公司 Wind control method, overdue prediction method, model training method and related equipment
CN114357271A (en) * 2021-12-03 2022-04-15 天津大学 Microblog popularity prediction method based on text and time sequence information fused by graph neural network
CN114818737A (en) * 2022-06-29 2022-07-29 北京邮电大学 Method, system and storage medium for extracting semantic features of scientific and technological paper data text
CN115759183A (en) * 2023-01-06 2023-03-07 浪潮电子信息产业股份有限公司 Related method and related device for multi-structure text graph neural network

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108334605A (en) * 2018-02-01 2018-07-27 腾讯科技(深圳)有限公司 File classification method, device, computer equipment and storage medium
CN108572961A (en) * 2017-03-08 2018-09-25 北京嘀嘀无限科技发展有限公司 A kind of the vectorization method and device of text
US20180357531A1 (en) * 2015-11-27 2018-12-13 Devanathan GIRIDHARI Method for Text Classification and Feature Selection Using Class Vectors and the System Thereof
CN109299270A (en) * 2018-10-30 2019-02-01 云南电网有限责任公司信息中心 A kind of text data unsupervised clustering based on convolutional neural networks
CN110134786A (en) * 2019-05-14 2019-08-16 南京大学 A kind of short text classification method based on theme term vector and convolutional neural networks

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180357531A1 (en) * 2015-11-27 2018-12-13 Devanathan GIRIDHARI Method for Text Classification and Feature Selection Using Class Vectors and the System Thereof
CN108572961A (en) * 2017-03-08 2018-09-25 北京嘀嘀无限科技发展有限公司 A kind of the vectorization method and device of text
CN108334605A (en) * 2018-02-01 2018-07-27 腾讯科技(深圳)有限公司 File classification method, device, computer equipment and storage medium
CN109299270A (en) * 2018-10-30 2019-02-01 云南电网有限责任公司信息中心 A kind of text data unsupervised clustering based on convolutional neural networks
CN110134786A (en) * 2019-05-14 2019-08-16 南京大学 A kind of short text classification method based on theme term vector and convolutional neural networks

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11403488B2 (en) 2020-03-19 2022-08-02 Hong Kong Applied Science and Technology Research Institute Company Limited Apparatus and method for recognizing image-based content presented in a structured layout
CN111492370A (en) * 2020-03-19 2020-08-04 香港应用科技研究院有限公司 Device and method for recognizing text images of a structured layout
WO2021184396A1 (en) * 2020-03-19 2021-09-23 Hong Kong Applied Science and Technology Research Institute Company Limited Apparatus and method for recognizing image-based content presented in a structured layout
CN111492370B (en) * 2020-03-19 2023-05-26 香港应用科技研究院有限公司 Apparatus and method for recognizing text image of structured layout
CN111461301A (en) * 2020-03-30 2020-07-28 北京沃东天骏信息技术有限公司 Serialized data processing method and device, and text processing method and device
CN111552803B (en) * 2020-04-08 2023-03-24 西安工程大学 Text classification method based on graph wavelet network model
CN111552803A (en) * 2020-04-08 2020-08-18 西安工程大学 Text classification method based on graph wavelet network model
CN111694957B (en) * 2020-05-29 2024-03-12 新华三大数据技术有限公司 Method, equipment and storage medium for classifying problem sheets based on graph neural network
CN111694957A (en) * 2020-05-29 2020-09-22 新华三大数据技术有限公司 Question list classification method and device based on graph neural network and storage medium
CN112000788A (en) * 2020-08-19 2020-11-27 腾讯云计算(长沙)有限责任公司 Data processing method and device and computer readable storage medium
CN112000788B (en) * 2020-08-19 2024-02-09 腾讯云计算(长沙)有限责任公司 Data processing method, device and computer readable storage medium
CN112016438A (en) * 2020-08-26 2020-12-01 北京嘀嘀无限科技发展有限公司 Method and system for identifying certificate based on graph neural network
CN112016438B (en) * 2020-08-26 2021-08-10 北京嘀嘀无限科技发展有限公司 Method and system for identifying certificate based on graph neural network
CN114119191A (en) * 2020-08-28 2022-03-01 马上消费金融股份有限公司 Wind control method, overdue prediction method, model training method and related equipment
CN112214993A (en) * 2020-09-03 2021-01-12 拓尔思信息技术股份有限公司 Graph neural network-based document processing method and device and storage medium
CN112214993B (en) * 2020-09-03 2024-02-06 拓尔思信息技术股份有限公司 File processing method, device and storage medium based on graphic neural network
CN112364141A (en) * 2020-11-05 2021-02-12 天津大学 Scientific literature key content potential association mining method based on graph neural network
CN112465006A (en) * 2020-11-24 2021-03-09 中国人民解放军海军航空大学 Graph neural network target tracking method and device
CN112860897A (en) * 2021-03-12 2021-05-28 广西师范大学 Text classification method based on improved ClusterGCN
CN113220884B (en) * 2021-05-19 2023-01-31 西北工业大学 Graph neural network text emotion classification method based on double sliding windows
CN113220884A (en) * 2021-05-19 2021-08-06 西北工业大学 Graph neural network text emotion classification method based on double sliding windows
CN114357271A (en) * 2021-12-03 2022-04-15 天津大学 Microblog popularity prediction method based on text and time sequence information fused by graph neural network
CN114357271B (en) * 2021-12-03 2024-09-03 天津大学 Microblog heat prediction method based on fusion text of graphic neural network and time sequence information
CN114818737A (en) * 2022-06-29 2022-07-29 北京邮电大学 Method, system and storage medium for extracting semantic features of scientific and technological paper data text
CN114818737B (en) * 2022-06-29 2022-11-18 北京邮电大学 Method, system and storage medium for extracting semantic features of scientific and technological paper data text
CN115759183A (en) * 2023-01-06 2023-03-07 浪潮电子信息产业股份有限公司 Related method and related device for multi-structure text graph neural network

Also Published As

Publication number Publication date
CN110705260B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
CN110705260B (en) Text vector generation method based on unsupervised graph neural network structure
CN107133213B (en) Method and system for automatically extracting text abstract based on algorithm
CN108052588B (en) Method for constructing automatic document question-answering system based on convolutional neural network
CN104615767B (en) Training method, search processing method and the device of searching order model
CN107168954B (en) Text keyword generation method and device, electronic equipment and readable storage medium
CN104102626B (en) A kind of method for short text Semantic Similarity Measurement
CN102622338B (en) Computer-assisted computing method of semantic distance between short texts
WO2017090051A1 (en) A method for text classification and feature selection using class vectors and the system thereof
CN111125367B (en) Multi-character relation extraction method based on multi-level attention mechanism
CN110717042A (en) Method for constructing document-keyword heterogeneous network model
CN107577671A (en) A kind of key phrases extraction method based on multi-feature fusion
CN111552803A (en) Text classification method based on graph wavelet network model
CN108073571B (en) Multi-language text quality evaluation method and system and intelligent text processing system
CN109815400A (en) Personage&#39;s interest extracting method based on long text
CN109710916A (en) A kind of tag extraction method, apparatus, electronic equipment and storage medium
Mahmoud et al. A text semantic similarity approach for Arabic paraphrase detection
CN112818113A (en) Automatic text summarization method based on heteromorphic graph network
CN112784602B (en) News emotion entity extraction method based on remote supervision
CN112966523A (en) Word vector correction method based on semantic relation constraint and computing system
Qun et al. End-to-end neural text classification for tibetan
Wint et al. Deep learning based sentiment classification in social network services datasets
CN111353032B (en) Community question and answer oriented question classification method and system
WO2022228127A1 (en) Element text processing method and apparatus, electronic device, and storage medium
Han et al. CNN-BiLSTM-CRF model for term extraction in Chinese corpus
Jia et al. Attention in character-based BiLSTM-CRF for Chinese named entity recognition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20230418