CN110705260A - Text vector generation method based on unsupervised graph neural network structure - Google Patents
- Publication number
- CN110705260A CN110705260A CN201910905090.1A CN201910905090A CN110705260A CN 110705260 A CN110705260 A CN 110705260A CN 201910905090 A CN201910905090 A CN 201910905090A CN 110705260 A CN110705260 A CN 110705260A
- Authority
- CN
- China
- Prior art keywords
- word
- document
- node
- text
- matrix
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a text vector generation method based on an unsupervised graph neural network structure. First, stop-word processing is performed on all collected text corpora using a stop-word corpus, keywords are selected from the processed corpora, the text-keyword weights and the weights between keywords are calculated and stored, and a text-keyword network adjacency matrix is constructed. Secondly, trained word vectors are used as word node features, and the initial node features of each text are calculated from the keywords appearing in the document, yielding a text-keyword network feature matrix. Finally, a negative-sample adjacency matrix and feature matrix corresponding to the positive sample are constructed, the loss is made to converge by gradient descent using the loss function and the constructed network model, and after convergence the text node feature vectors are taken as the unsupervised-GNN-based text representation vectors. The invention fully considers the non-continuous global word co-occurrence and long-distance semantics in the corpus and the overall relevance of a single document to all document-keyword sets.
Description
Technical Field
The invention relates to the technical field of data mining and natural language processing, in particular to a text vector generation method based on an unsupervised graph neural network, which can be applied to extracting document vectors and also can be applied to downstream tasks such as text classification, clustering and text similarity calculation.
Background
Text has become a research hotspot on many platforms today, and since most text is unstructured or semi-structured data, text mining has long been one of the important angles of data mining in multiple fields. Meanwhile, with the gradual popularization of the internet, the volume of web text grows ever larger and the amount of information grows ever faster, making it more and more difficult to extract the information a user needs from massive data.
Conventional methods represent a document by averaging all the word vectors it contains, or adopt the doc2vec model. Recently, deep learning models, including Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), have been widely used to learn text representations. Because CNNs and RNNs prioritize locality and sequentiality, these models capture semantic and syntactic information in locally continuous word sequences well, but ignore non-continuous global word co-occurrence and long-distance semantics in the corpus, as well as the overall relevance of a single document to all document-keyword sets. To address this problem, a novel unsupervised graph-neural-network-based text vector generation method is provided.
Disclosure of Invention
The invention aims to provide a text vector generation method based on an unsupervised graph neural network structure, which represents text vectors via an unsupervised graph neural network so that the document representation vectors can then be used for downstream tasks such as classification and clustering, thereby solving the problems of the prior art.
In order to achieve the purpose, the invention provides the following scheme: the invention provides a text vector generation method based on an unsupervised graph neural network structure, which comprises the following steps:
step one, obtaining keywords: performing word segmentation and stop-word removal on all texts in a corpus to obtain a document set, then calculating and storing the frequency of each word in the document set, and taking words whose frequency is greater than or equal to n as keywords, where n > 1;
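A minimal sketch of step one under illustrative data; the documents, the threshold n = 2 and the helper name are ours, not the patent's:

```python
from collections import Counter

# Hypothetical mini document set (already word-segmented and stop-word filtered).
docs = [
    ["fly", "sky", "bird"],
    ["fly", "cloud", "sky"],
    ["fly", "sky", "cloud", "fly"],
]

def select_keywords(docs, n=2):
    """Count each word's frequency over the whole document set and
    keep words whose frequency is >= n (n > 1), as in step one."""
    freq = Counter(w for doc in docs for w in doc)
    return {w for w, c in freq.items() if c >= n}

keywords = select_keywords(docs, n=2)
```

With these toy documents, 'fly', 'sky' and 'cloud' pass the threshold while 'bird' does not.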
step two, judging the semantic relevance of words in the corpus: defining m continuous words of an unprocessed single document as a word co-occurrence window, where m > 1 and an "unprocessed" document is one that has been word-segmented but not stop-word filtered; letting #W(i) be the number of word co-occurrence windows over all documents in the corpus in which word i occurs, #W(i, j) the number of windows in which words i and j occur in the same co-occurrence window simultaneously, and #W the total number of word co-occurrence windows over all documents in the corpus, the inter-keyword pointwise mutual information PMI(i, j) is:

PMI(i, j) = log( p(i, j) / ( p(i) × p(j) ) )

where p(i) = #W(i)/#W is the proportion of windows containing word i among all co-occurrence windows, and p(i, j) = #W(i, j)/#W is the proportion of windows containing both words i and j; PMI(i, j) > 0 indicates high semantic relevance of the words in the corpus, and PMI(i, j) < 0 indicates little or no semantic relevance;
step three, calculating and storing the document-word weight TF-IDF (term frequency-inverse document frequency):

TF-IDF = tf(t, D_i) × idf(t),  idf(t) = log( M / n_t + 0.01 )

where tf(t, D_i) is the frequency of keyword t in the i-th document D_i, M is the total number of documents, n_t is the number of documents in the document set in which keyword t appears, and idf(t) is the inverse document frequency;
step four, constructing an adjacency matrix of the document-word complex network;
step five, training all keywords in the text set, and expressing the keywords by word vectors; setting the vector of the initial document i to be equal to the sum of word vectors of all keywords in the document i divided by the number of the keywords in the document i;
constructing the node feature matrix X of the document-word complex network, in which rows are nodes and columns are features: a keyword node's feature is the keyword's word vector, and a document node's feature is the initial document vector; defining the node feature matrix X as the positive-sample feature matrix and A as the positive-sample adjacency matrix; the negative sample uses the row-shuffled feature matrix X̃ as its feature matrix and the same adjacency matrix as the positive sample, i.e. Ã = A;
step six, defining the loss function L:

L = −(1/(N + M)) [ Σ_{i=1..N} log D(h_i, s) + Σ_{j=1..M} log(1 − D(h̃_j, s)) ]

where N is the number of positive-sample nodes and M the number of negative-sample nodes; (X, A) denotes the positive-sample network and (X̃, Ã) the negative-sample network; h_i is the representation vector of node i after local feature extraction on the positive sample, and h̃_j the representation vector of node j after local feature extraction on the negative sample. The local features of the positive and negative samples are processed with the same convolution method ε, and ε(X, A) is the node feature matrix after positive-sample processing. s = R(ε(X, A)) is the global feature of the positive sample, where R denotes the processing (readout) applied to the processed node feature matrix. D(h, s) is a discriminator judging whether h and s are similar: D approaching 1 indicates similarity, D approaching 0 indicates dissimilarity;
step seven, constructing a graph neural network model
processing the positive-sample node feature matrix X and adjacency matrix A to obtain the negative-sample feature matrix X̃ and adjacency matrix Ã; extracting the local features of each node of the positive and negative samples; obtaining the global feature of the positive sample; calculating the loss function through the discriminator D; and updating ε, R and D by gradient descent until the loss converges;
and taking the converged text node feature vector to generate an unsupervised GNN-based text representation vector.
Preferably, the process of step four specifically comprises: firstly, taking each document as a node in the network and each keyword as a node; then constructing edges among the nodes, with the edge weight between node i and node j defined as A_ij:

A_ij = PMI(i, j) if i and j are both keywords and PMI(i, j) > 0; A_ij = the TF-IDF weight if i is a document and j is a keyword appearing in it; A_ij = 1 if i = j; A_ij = 0 otherwise.
preferably, the specific process of the step six is as follows:
local features are extracted by applying a single-layer GCN structure; the node feature matrix after positive-sample processing is ε(X, A) = ReLU( D̂^(−1/2) Â D̂^(−1/2) X θ ), where Â = A + I_N, D̂ is the degree matrix of Â, and θ is a learnable parameter matrix;
for the global feature, an averaging method is applied to all node features of the positive sample: s = σ( (1/N) Σ_{i=1..N} h_i )
wherein sigma is a nonlinear Sigmoid function, and N represents the number of nodes;
for the discriminator, a simple bilinear scoring function is applied: D(h_i, s) = σ( h_i^T W s )
where W is a learnable scoring matrix and σ is a nonlinear Sigmoid function.
The invention discloses the following technical effects: in compressing and converting text content, constructing a text-keyword network is an effective method: after the keywords of each text are found, the large collection of texts and keywords is converted into one large complex network. This greatly compresses the text scale while losing as little of the basic information in the text as possible. The constructed text-keyword network is then learned with a graph neural network to obtain new text representation vectors that contain not only the keyword information in each document but also the keyword weights and the structural information of the graph. In the process of the invention, the data is easy to obtain, the keyword processing is simple and effective, the text-keyword complex network adjacency matrix and node feature matrix are convenient to obtain, the negative sample and the graph neural network model are easy to compute, and the non-continuous global word co-occurrence and long-distance semantics in the corpus and the overall relevance of a single document to all document-keyword sets are fully considered.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is an exemplary diagram of text-keywords in step four according to the embodiment of the present invention;
fig. 3 is an exemplary diagram of a network model in step seven of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Referring to fig. 1 to 3, the present invention provides a method for generating a text vector based on an unsupervised graph neural network, which specifically includes the following steps:
step one, acquiring a large amount of text corpora as a corpus, taking the 20Newsgroups (20NG) data set as an example, downloadable from:
http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz, which includes 18846 documents. The present invention borrows 3 documents from this data to build a complex network example.
Obtaining the stop-word corpus, taking the stop-word list compiled by a Sina user as an example, download address: http://blog.sina.com.cn/s/blog_a19a3770102wjau.html, which includes 891 stop words such as 'about', 'above', 'also', 'I', 'wait', 'to', 'the', etc. The invention uses this data to remove stop words: all texts are processed against the stop-word list, and whenever any of the 891 stop words appears in a text it is deleted, finally yielding the document set after stop-word removal. For example, document D_1 is: "I wait to fly in the sky". Following the order of the stop-word list, 'about' is first looked up in the document and deleted if present; then 'above' is deleted; and so on, until the last word in the stop-word list has been processed. Since 'I', 'wait', 'to', 'in', 'the' are stop words, document D_1 after stop-word removal is: "fly sky". The word frequency (TF) of each word in the stop-word-processed document set is then calculated and stored, where word frequency is the number of times a word occurs in an article, and words with word frequency greater than or equal to 5 are taken as keywords. Then document D_1 contains keywords {w_1, w_2}, document D_2 contains keywords {w_1, w_3}, and document D_3 contains keywords {w_3, w_4}.
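The stop-word removal in the worked example can be sketched as follows; the tiny stop-word set here is an illustrative stand-in for the 891-entry list cited above:

```python
# Illustrative stop-word removal for the worked example D_1 = "I wait to fly in the sky".
stop_words = {"i", "wait", "to", "in", "the", "about", "above", "also"}

def remove_stop_words(text, stop_words):
    """Keep only the tokens that are not in the stop-word list."""
    return [w for w in text.lower().split() if w not in stop_words]

d1 = remove_stop_words("I wait to fly in the sky", stop_words)  # -> ['fly', 'sky']
```

Filtering by set membership in one pass is equivalent to the list-ordered deletion described above, since deleting stop words in any order leaves the same residue.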
Step two, calculating and storing the inter-keyword pointwise mutual information:

PMI(i, j) = log( p(i, j) / ( p(i) × p(j) ) ),  p(i) = #W(i)/#W,  p(i, j) = #W(i, j)/#W

The word co-occurrence window size is set to 5, i.e. the invention defines 5 continuous words of an unprocessed single document as one word co-occurrence window, where #W(i) is the number of word co-occurrence windows over all documents in the corpus in which word i occurs, #W(i, j) is the number of windows in which words i and j occur simultaneously, and #W is the total number of word co-occurrence windows over all documents in the corpus. A positive PMI value implies high semantic relevance of words in the corpus, while a negative PMI value indicates little or no semantic relevance. For example, for the original document D_1 "I wait to fly in the sky", the first word co-occurrence window is "I wait to fly in", the second is "wait to fly in the", the third is "to fly in the sky", the fourth is "fly in the sky _", the fifth is "in the sky _ _", the sixth is "the sky _ _ _", and the seventh is "sky _ _ _ _", where '_' represents an automatically filled placeholder once the sentence has ended. If 'sky' and 'fly' are keywords of document D_1 and do not appear together in any other co-occurrence window, then #W(sky, fly) = 2. Assuming #W = 100000, #W(sky) = 5 and #W(fly) = 40, then PMI(sky, fly) = log( (2/100000) / ( (5/100000) × (40/100000) ) ) = log 1000 = 3 (logarithm to base 10). Similarly, PMI(w_1, w_2) = 3, PMI(w_1, w_3) = 2, PMI(w_1, w_4) = −1, PMI(w_2, w_3) = 1, PMI(w_2, w_4) = −2, PMI(w_3, w_4) = 2.
Step three, calculating and storing the document-word weight TF-IDF (term frequency-inverse document frequency):

TF-IDF = tf(t, D_i) × idf(t),  idf(t) = log( M / n_t + 0.01 )

where tf(t, D_i) is the frequency of keyword t in the i-th document, M is the total number of documents, and n_t is the number of documents in the document set in which keyword t appears; idf(t) is the inverse document frequency, and the purpose of adding 0.01 is to prevent the document-word weight from being 0 when n_t = M. (Document frequency is the number of articles in the whole corpus in which a keyword appears; the inverse document frequency is based on its reciprocal and mainly serves to down-weight words that are common across all documents but have little influence on any single document.)

For the 3 selected documents, the document number M = 3. For 'sky', if its word frequency in document D_2 is tf(sky, D_2) = 30 and it is a keyword in n_sky = 2 documents, then TF-IDF(sky, D_2) = 30 × log(3/2 + 0.01) = 30 × log 1.51 ≈ 5.37.
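The TF-IDF weight for the worked example can be sketched as follows; the base-10 logarithm is our assumption, chosen to match the PMI example's base:

```python
import math

def tf_idf(tf, M, n_t):
    """TF-IDF = tf(t, D_i) * idf(t), with idf(t) = log(M / n_t + 0.01);
    base-10 log is an assumption here. The +0.01 keeps the weight
    non-zero when n_t = M."""
    return tf * math.log10(M / n_t + 0.01)

# Worked example: tf(sky, D_2) = 30, M = 3 documents, n_sky = 2.
w_sky_d2 = tf_idf(30, 3, 2)  # approximately 5.37
```

Note that without the +0.01 correction, a keyword appearing in every document (n_t = M) would get weight exactly 0.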
Step four, constructing a document-word complex network adjacency matrix:
each document is taken as a node in the network and each keyword is also taken as a node. Constructing edges between nodes, and defining the edge weight between the node i and the node j as AijThe formula is as follows:
suppose that 3 of the documents are selected to build a document-word complex network, document D1Containing a keyword { w1,w2Document D2Containing a keyword { w1,w3Document D3Containing a keyword { w3,w4},PMI(w1,w2)=3,PMI(w1,w3)=2,PMI(w1,w4)=-1,PMI(w2,w3)=1,PMI(w2,w4)=-2,PMI(w3,w4)=2; The complex network adjacency matrix is a:
since there are 3 documents and 4 keywords, the adjacency matrix A is a 7 × 7 matrix with the order { D }1、D2、D3、W1、w2、w3、w4}。
Step five, acquiring pre-trained word vectors (such as word2vec), here taking the GloVe word vectors trained on Wikipedia as an example, download address: http://nlp.stanford.edu/data/glove.6B.zip. For the example the word vector dimension is 3. All keywords in the text set are represented by the trained word vectors. The initial vector of document i equals the sum of the word vectors of all keywords in document i divided by the number of keywords in document i. For example, document D_1 contains keywords {w_1, w_2}; if the node representation vector of w_1 is (1, 2, 3) and that of w_2 is (3, 1, 2), then the vector of D_1 is ((1, 2, 3) + (3, 1, 2)) / 2 = (2, 1.5, 2.5). Similarly, the node feature matrix X of the document-word complex network is constructed, with rows as nodes and columns as features: a keyword node's feature is the keyword's word vector, and a document node's feature is the initial document vector. A is defined as the positive-sample adjacency matrix, and the node feature matrix X as the positive-sample feature matrix.
the negative samples use the feature matrix mixed by lines as the feature matrix, and the adjacent matrix uses the same adjacent matrix as the positive samples, i.e.For w1,w2The nodes are characterized by (1, 2, 3), (3, 1, 2), w after confusion1,w2The node features may be (3, 1, 2) or (1, 2, 3), that is, for all node features, the dimensions and the included values thereof do not change, and only the original wiThe characteristics of the node may no longer represent wi. Namely, it isCan be as follows:
Step six: defining the loss function L:

L = −(1/(N + M)) [ Σ_{i=1..N} log D(h_i, s) + Σ_{j=1..M} log(1 − D(h̃_j, s)) ]

where N is the number of positive-sample nodes and M the number of negative-sample nodes; (X, A) denotes the positive-sample network and (X̃, Ã) the negative-sample network; h_i is the representation vector of node i after local feature extraction on the positive sample, and h̃_j the representation vector of node j after local feature extraction on the negative sample. The local features of the positive and negative samples are processed with the same convolution method ε, and ε(X, A) is the node feature matrix after positive-sample processing. s = R(ε(X, A)) is the global feature of the positive sample, where R denotes the processing of the node feature matrix after positive-sample processing. D(h, s) is a discriminator judging whether h and s are similar: D approaching 1 indicates similarity, D approaching 0 indicates dissimilarity.
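The loss just defined can be sketched numerically; the discriminator scores fed in below are illustrative values, not numbers from the embodiment:

```python
import math

def loss_fn(pos_scores, neg_scores):
    """Binary cross-entropy form of L over discriminator scores
    D(h_i, s) for positives and D(h~_j, s) for negatives:
    L = -(1/(N+M)) * ( sum log D(h_i, s) + sum log(1 - D(h~_j, s)) )."""
    n, m = len(pos_scores), len(neg_scores)
    total = sum(math.log(p) for p in pos_scores)
    total += sum(math.log(1.0 - q) for q in neg_scores)
    return -total / (n + m)

# Positives scored near 1 and negatives near 0 give a small loss.
loss = loss_fn([0.9, 0.8], [0.2, 0.1])
```

Driving positive scores toward 1 and negative scores toward 0 minimizes L, which is exactly the behaviour gradient descent enforces in step seven.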
The local features are extracted by applying a single-layer GCN structure; the node feature matrix is:

ε(X, A) = σ( D̂^(−1/2) Â D̂^(−1/2) X θ )

where Â = A + I_N, I_N is the identity matrix, D̂ is the degree matrix of Â, i.e. D̂_ii = Σ_j Â_ij, and θ is a learnable parameter matrix. For the nonlinearity σ, the ReLU function is used, i.e. for an input a, ReLU(a) = max(0, a).
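A toy sketch of this single-layer propagation rule with plain-list arithmetic; the 2-node graph, features and θ are illustrative:

```python
import math

def gcn_layer(A, X, theta):
    """Single-layer GCN sketch: ReLU( D^-1/2 (A+I) D^-1/2 X theta )."""
    n = len(A)
    # Add self-loops: A_hat = A + I.
    A_hat = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    d = [sum(row) for row in A_hat]  # degrees of A_hat
    # Symmetric normalisation: A_norm[i][j] = A_hat[i][j] / sqrt(d_i * d_j).
    A_norm = [[A_hat[i][j] / math.sqrt(d[i] * d[j]) for j in range(n)] for i in range(n)]

    def matmul(P, Q):
        return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
                 for j in range(len(Q[0]))] for i in range(len(P))]

    Z = matmul(matmul(A_norm, X), theta)
    return [[max(0.0, v) for v in row] for row in Z]  # ReLU

A = [[0, 1], [1, 0]]                 # two connected nodes
X = [[1.0, 0.0], [0.0, 1.0]]         # one-hot features
theta = [[1.0, -1.0], [1.0, -1.0]]   # toy parameter matrix
H = gcn_layer(A, X, theta)           # -> [[1.0, 0.0], [1.0, 0.0]]
```

The negative column is zeroed by ReLU, and the symmetric normalisation averages each node with its neighbour, which is why both output rows coincide here.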
For the global feature, all node features of the positive sample are averaged with a simple averaging method:

s = σ( (1/N) Σ_{i=1..N} h_i )
where σ is a nonlinear Sigmoid function and N represents the number of nodes.
For the discriminator, a simple bilinear scoring function is applied:

D(h_i, s) = σ( h_i^T W s )
where W is a learnable scoring matrix and σ is a nonlinear Sigmoid function.
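The readout and the bilinear discriminator can be sketched together; the matrices below are illustrative, and the identity W is just a convenient initial value:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def readout(H):
    """Global feature: Sigmoid of the node-wise mean of H."""
    n = len(H)
    return [sigmoid(sum(row[j] for row in H) / n) for j in range(len(H[0]))]

def discriminator(h, s, W):
    """Bilinear score D(h, s) = sigmoid(h^T W s)."""
    Ws = [sum(W[i][j] * s[j] for j in range(len(s))) for i in range(len(W))]
    return sigmoid(sum(h[i] * Ws[i] for i in range(len(h))))

H = [[1.0, 0.0], [0.0, 1.0]]          # toy node feature matrix
s = readout(H)                         # sigmoid([0.5, 0.5])
score = discriminator([1.0, 0.0], s, [[1.0, 0.0], [0.0, 1.0]])
```

Because the score passes through a Sigmoid, it always lies in (0, 1) and can be compared directly against the similar/dissimilar targets 1 and 0 of the loss.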
Step seven: step of constructing graph neural network model
Mixing the positive sample node feature matrix X line by line to obtain a negative sample feature matrixAdjacency matrix
Pass discriminatorAnd (4) calculating the Loss, and if the Loss does not converge, repeating the steps 1,2, 3,4 and 5 until the Loss converges.
To obtain
In the second step, the node features of the positive sample are extracted, first calculating ε(X, A) = ReLU( D̂^(−1/2) Â D̂^(−1/2) X θ ). The learnable hidden dimension is set to 6, and θ is initialized to [[-0.5692, -0.6487, 0.3339], [-0.1275, -0.4908, -0.1815], [-0.7875, -0.4873, -0.3584], [0.7263, 0.3383, 0.4262], [0.2747, 0.5978, -0.5178], [0.6429, -0.4805, -0.2120]], whose dimension is ([6,3]).
Calculation yields:
H = [[[-0.7087, -0.5292, -1.1873, 4.3132, 1.2835, 0.4477], [-0.9484, -0.5760, -1.3368, 4.7530, 2.1082, 0.9065], [-1.2537, -0.6958, -1.6096, 5.6373, 3.0459, 1.2284], [-0.5677, -0.5290, -1.1280, 4.1638, 0.7651, 0.0254], [-0.5199, -0.4663, -0.9613, 3.5127, 0.8292, -0.0187], [-1.0183, -0.6611, -1.5236, 5.4620, 2.1234, 0.9007], [-1.1076, -0.6147, -1.3856, 4.8227, 2.7644, 0.9474]]], its dimension being ([1,7,6]).
In the third step, H̃ = ε(X̃, Ã) is calculated in the same way; its dimension is ([1,7,6]).
In the fourth step, the global feature s = σ( (1/N) Σ_{i=1..N} h_i ) is calculated; its dimension is (1, 6).
In the fifth step, the discriminator scoring matrix is initialized:
W = [[[-0.2551, 0.1770, -0.2642, -0.1486, 0.2632, -0.3471], [0.3415, 0.0906, -0.0688, 0.3749, -0.0906, -0.0022], [0.0060, -0.3362, -0.1600, -0.2831, 0.0946, -0.3743], [-0.2553, -0.0699, -0.1703, 0.2189, 0.2910, -0.2692], [0.2505, 0.2865, 0.1543, -0.1312, 0.2121, -0.1092], [0.2638, -0.3451, 0.1210, 0.2282, 0.3422, -0.0979]]], its dimension being ([1,6,6]).
The discriminator score is calculated as:
[1.7562,2.3045,2.9415,1.3836,1.1912,2.5069,2.5062,2.3998,1.1779,1.3023,2.6828,1.5789,2.8675,2.0953]. The Loss is calculated to be 1.483.
Second training pass: in the first step, A, X, X̃ and Ã are the same as before; in the second step, θ becomes:
[[-0.5682, -0.6477, 0.3349], [-0.1285, -0.4918, -0.1825], [-0.7865, -0.4863, -0.3574], [0.7253, 0.3373, 0.4252], [0.2737, 0.5968, -0.5188], [0.6419, -0.4815, -0.2130]], its dimension being ([6,3]); the resulting H = [[[-0.7035, -0.5294, -1.1803, 4.3039, 1.2742, 0.4384], [-0.9422, -0.5762, -1.3290, 4.7432, 2.0984, 0.8967], [-1.2459, -0.6958, -1.6004, 5.6259, 3.0346, 1.2170], [-0.5631, -0.5292, -1.1212, 4.1545, 0.7558, 0.0162], [-0.5158, -0.4664, -0.9555, 3.5047, 0.8211, -0.0206], [-1.0114, -0.6612, -1.5147, 5.4508, 2.1122, 0.8895], [-1.1007, -0.6147, -1.3776, 4.8128, 2.7545, 0.9375]]], its dimension being ([1,7,6]). The third step H̃ and the fourth step s are likewise obtained by calculation.
In the fifth step, the learned weight matrix W becomes:
[[[-0.2541, 0.1780, -0.2632, -0.1476, 0.2642, -0.3461], [0.3425, 0.0916, -0.0678, 0.3759, -0.0896, -0.0012], [0.0070, -0.3352, -0.1590, -0.2821, 0.0956, -0.3733], [-0.2563, -0.0709, -0.1713, 0.2179, 0.2900, -0.2702], [0.2495, 0.2855, 0.1533, -0.1322, 0.2111, -0.1102], [0.2628, -0.3461, 0.1200, 0.2272, 0.3412, -0.0989]]], its dimension being ([1,6,6]). A new discriminator score [1.7128, 2.2531, 2.8782, 1.3445, 1.1597, 2.4496, 2.4513, 1.3324, 2.7238, 2.8783, 2.3160, 1.1949, 2.5691, 1.1294] is calculated, giving an updated Loss of 1.1571.
The 3rd training pass gives Loss = 1.1187, then 1.1330, …, 0.6888 at the 54th pass, 0.6988 at the 55th, 0.6938 at the 56th, 0.6944 at the 57th, 0.6992 at the 58th, and 0.6973 at the 59th. With the patience set to 5 — that is, passes 55 to 59 all failed to improve on pass 54 — the Loss is considered to have converged.
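The convergence rule used above (stop once the loss has not improved for 5 consecutive passes) can be sketched as a training-loop skeleton; the stand-in loss sequence is illustrative, not the embodiment's numbers:

```python
def train(one_pass, max_passes=200, patience=5):
    """Run `one_pass` (any callable returning that pass's loss) and stop
    when the loss has not improved for `patience` consecutive passes."""
    best, since_best, history = float("inf"), 0, []
    for _ in range(max_passes):
        loss = one_pass()
        history.append(loss)
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
            if since_best >= patience:
                break  # converged: no improvement for `patience` passes
    return best, history

# Toy decreasing-then-flat loss sequence standing in for the real model.
fake_losses = iter([1.483, 1.157, 1.119, 0.69, 0.70, 0.70, 0.70, 0.70, 0.70, 0.5])
best, history = train(lambda: next(fake_losses))
```

Note that the final 0.5 is never reached: the loop stops at the fifth non-improving pass, just as passes 55-59 end training in the embodiment.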
The converged text node feature vectors (the first three row vectors obtained at the 54th pass), [-0.0229, 0.0709, 0.1803, 0.1003, -0.0238, 0.3006], [-0.0212, 0.0884, 0.1985, 0.1419, -0.0237, 0.3395] and [-0.0267, -0.0027, 0.0808, -0.0097, -0.0214, 0.1106], are taken to generate the unsupervised-GNN-based text representation vectors.
The principle is as follows: firstly, a large amount of text corpora are collected, and stop word corpora are downloaded. And performing stop word processing on all the collected text corpora by using the stop word corpora. And then calculating and storing the word frequency of the words in each document, taking the words with the word frequency larger than n as keywords, calculating and storing a text keyword weight TF-IDF and a weight PMI among the keywords, and defining network nodes and node edge weights to obtain a text-keyword network adjacency matrix. Secondly, the trained word vectors are used as word node characteristics, and initial node characteristics of the text are calculated by using keywords appearing in the document to obtain a text-keyword network characteristic matrix. And then constructing a negative sample adjacency matrix and a characteristic matrix corresponding to the positive sample, utilizing the defined loss function and the constructed network model step, utilizing gradient descent to make loss convergence, and taking the converged text node characteristic vector to obtain a text expression vector based on the unsupervised GNN. By adopting the method, the data is easy to obtain, the keyword processing process is simple and effective, the text-keyword complex network adjacency matrix node characteristic matrix is convenient to obtain, the negative sample is easy to construct, the graph neural network model is easy to calculate, the discontinuous global word co-occurrence and long-distance semantics in the corpus and the total correlation of a single document to all document-keyword sets are fully considered, and the user can conveniently extract the required information from mass data.
In the description of the present invention, it is to be understood that the terms "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like, indicate orientations or positional relationships based on those shown in the drawings, are merely for convenience of description of the present invention, and do not indicate or imply that the referenced devices or elements must have a particular orientation, be constructed and operated in a particular orientation, and thus, are not to be construed as limiting the present invention.
The above-described embodiments are merely illustrative of the preferred embodiments of the present invention, and do not limit the scope of the present invention, and various modifications and improvements of the technical solutions of the present invention can be made by those skilled in the art without departing from the spirit of the present invention, and the technical solutions of the present invention are within the scope of the present invention defined by the claims.
Claims (3)
1. A text vector generation method based on an unsupervised graph neural network structure is characterized by comprising the following steps:
step one, obtaining keywords: performing word segmentation processing and stop-word removal on all texts in a corpus to obtain a document set, then calculating and storing the word frequency of each word in the document set, and selecting the words whose word frequency is greater than or equal to n as keywords, wherein n is greater than 1;
step two, judging the semantic relevance of words in the corpus: defining m continuous words of an unprocessed single document as a word co-occurrence window, wherein m is greater than 1 and an unprocessed document is one that has undergone word segmentation processing but not stop-word removal; setting #W(i) as the number of word co-occurrence windows, over all documents in the corpus, in which word i appears, #W(i, j) as the number of windows in which words i and j appear together in the same co-occurrence window, and #W as the total number of word co-occurrence windows in all documents of the corpus; the formula of the inter-keyword pointwise mutual information PMI(i, j) is as follows:

PMI(i, j) = log( p(i, j) / ( p(i) × p(j) ) ), with p(i) = #W(i) / #W and p(i, j) = #W(i, j) / #W

wherein p(i) represents the proportion of windows containing word i among all co-occurrence windows, and p(i, j) represents the proportion of windows containing both words i and j; PMI(i, j) > 0 indicates high semantic relevance of the words in the corpus, and PMI(i, j) < 0 indicates little or no semantic relevance in the corpus;
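As a minimal sketch of the window counting and PMI calculation in step two (the sliding-window scheme, function names, and window size here are illustrative assumptions):

```python
import math
from itertools import combinations

def pmi_counts(documents, window=3):
    """Count co-occurrence windows: #W (total), #W(i), and #W(i, j)."""
    total, single, pair = 0, {}, {}
    for doc in documents:
        for start in range(max(1, len(doc) - window + 1)):
            win = set(doc[start:start + window])
            total += 1
            for w in win:
                single[w] = single.get(w, 0) + 1
            for i, j in combinations(sorted(win), 2):
                pair[(i, j)] = pair.get((i, j), 0) + 1
    return total, single, pair

def pmi(i, j, total, single, pair):
    """PMI(i, j) = log( p(i, j) / (p(i) * p(j)) )."""
    key = (i, j) if i <= j else (j, i)
    if key not in pair:
        return float("-inf")  # never co-occur: no semantic relevance
    p_ij = pair[key] / total
    return math.log(p_ij / ((single[i] / total) * (single[j] / total)))

# one 4-word document with window size 3 yields two windows
total, single, pair = pmi_counts([["a", "b", "c", "d"]], window=3)
```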
step three, calculating and storing the document-word weight TF-IDF (term frequency-inverse document frequency), wherein the formula is as follows:

TF-IDF = tf(t, D_i) × idf(t), with idf(t) = log( M / n_t )

wherein tf(t, D_i) represents the frequency of keyword t in the i-th document D_i, M represents the total number of documents, n_t represents the number of documents in the document set that contain keyword t, and idf(t) represents the inverse document frequency;
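A short sketch of the TF-IDF weight in step three; whether tf is normalized by document length is not fixed by the text, so the normalization below is an illustrative assumption.

```python
import math

def tf_idf(term, doc, documents):
    """TF-IDF = tf(t, D_i) * idf(t), with idf(t) = log(M / n_t).
    tf is normalized by document length here (an illustrative choice)."""
    tf = doc.count(term) / len(doc)
    m = len(documents)                              # M: total number of documents
    n_t = sum(1 for d in documents if term in d)    # documents containing the term
    return tf * math.log(m / n_t)

docs = [["graph", "net"], ["graph", "graph"], ["text", "net"]]
w = tf_idf("graph", docs[1], docs)   # tf = 1.0, idf = log(3/2)
```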
step four, constructing an adjacency matrix of the document-word complex network;
step five, training word vectors for all keywords in the text set, representing each keyword by its word vector, and setting the initial vector of document i equal to the sum of the word vectors of all keywords in document i divided by the number of keywords in document i;
constructing a node feature matrix X of the document-word complex network, wherein each row corresponds to a node and each column to a feature: the feature of a keyword node is the word vector of the keyword, and the feature of a document node is the initial document vector; defining the node feature matrix X as the positive-sample feature matrix and A as the positive-sample adjacency matrix; the negative sample uses a row-shuffled copy of X, denoted X̃, as its feature matrix, and uses the same adjacency matrix as the positive sample, i.e. Ã = A;
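The negative-sample construction above (shuffle the rows of the feature matrix, reuse the adjacency matrix) can be sketched as follows; the function name and the fixed random seed are assumptions for reproducibility.

```python
import numpy as np

def corrupt(x, seed=0):
    """Negative-sample feature matrix X~: the rows of X in shuffled order.
    The adjacency matrix A is reused unchanged (A~ = A)."""
    rng = np.random.default_rng(seed)
    return x[rng.permutation(x.shape[0])]

x = np.arange(8.0).reshape(4, 2)   # toy positive-sample feature matrix
x_neg = corrupt(x)                 # same rows, different node assignment
```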
step six, defining a loss function L:

L = −(1/(N+M)) [ Σ_{i=1..N} log D(h_i, s) + Σ_{j=1..M} log( 1 − D(h̃_j, s) ) ]

wherein N represents the number of positive-sample nodes and M the number of negative-sample nodes; (X, A) represents the positive-sample network and (X̃, Ã) the negative-sample network; h_i is the representation vector of node i after local features are extracted from the positive sample, and h̃_j is the representation vector of node j after local features are extracted from the negative sample; the local features of the positive and negative samples are extracted by the same convolution ε, with ε(X, A) representing the node feature matrix after the positive sample is processed; s = R(ε(X, A)) represents the global feature of the positive sample, wherein R is a readout applied to the processed node feature matrix; D(h_i, s) is a discriminator judging whether h_i and s are similar: a value approaching 1 indicates similarity, and a value approaching 0 indicates dissimilarity;
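The loss of step six is a binary cross-entropy over discriminator scores; a minimal numeric sketch (function name and the epsilon guard are assumptions):

```python
import numpy as np

def loss_fn(pos_scores, neg_scores, eps=1e-9):
    """Push positive-sample scores D(h_i, s) toward 1 and
    negative-sample scores D(h~_j, s) toward 0."""
    total = np.log(pos_scores + eps).sum() + np.log(1.0 - neg_scores + eps).sum()
    return -total / (len(pos_scores) + len(neg_scores))

good = loss_fn(np.array([0.99, 0.98]), np.array([0.01, 0.02]))  # near-perfect discriminator
bad = loss_fn(np.array([0.5, 0.5]), np.array([0.5, 0.5]))       # uninformative discriminator
```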
step seven, constructing a graph neural network model
Processing the positive sample node feature matrix X and the adjacency matrix A to obtain a negative sample feature matrixAdjacency matrixExtracting local features of each node of the positive sample and the negative sample to obtain the global features of the positive sample, and determining the global features of the positive sample by a discriminatorCalculating a loss function, and updating epsilon and R, D by using the loss function with gradient reduction until loss is converged;
and taking the converged text node feature vector to generate an unsupervised GNN-based text representation vector.
2. The unsupervised graph neural network structure-based text vector generation method of claim 1, wherein the process of step four is specifically as follows: firstly, taking each document as a node in the network and each keyword as a node; then constructing edges between the nodes, the edge weight between node i and node j being defined as A_ij, with the formula:

A_ij = PMI(i, j)   when i and j are both keywords and PMI(i, j) > 0;
A_ij = TF-IDF      when i is a document and j is a keyword appearing in it;
A_ij = 1           when i = j;
A_ij = 0           otherwise.
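The adjacency construction of claim 2 can be sketched as below; the piecewise weighting follows the TextGCN-style convention consistent with the description (TF-IDF on document-word edges, positive PMI on word-word edges, self-loops on the diagonal), which is an assumption about the patent's exact formula.

```python
import numpy as np

def build_adjacency(n_docs, n_words, tfidf, pmi):
    """Document-word adjacency matrix: nodes 0..n_docs-1 are documents,
    nodes n_docs..n_docs+n_words-1 are keywords."""
    n = n_docs + n_words
    a = np.eye(n)                              # A_ij = 1 when i = j
    for (d, w), v in tfidf.items():            # document-word edges
        a[d, n_docs + w] = a[n_docs + w, d] = v
    for (i, j), v in pmi.items():              # word-word edges, keep PMI > 0 only
        if v > 0:
            a[n_docs + i, n_docs + j] = a[n_docs + j, n_docs + i] = v
    return a

# 2 documents, 2 keywords; one TF-IDF edge, one positive and one negative PMI pair
a = build_adjacency(2, 2, tfidf={(0, 0): 0.5}, pmi={(0, 1): 0.3, (1, 0): -0.2})
```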
3. The unsupervised graph neural network structure-based text vector generation method of claim 1, wherein the concrete process of step six is as follows:
local features are extracted by applying a single-layer GCN structure, the node feature matrix ε(X, A) after the positive sample is processed being:

ε(X, A) = σ( D^(−1/2) A D^(−1/2) X Θ )

wherein D is the degree matrix of A (D_ii = Σ_j A_ij), Θ is a learnable weight matrix, and σ is a nonlinear activation function;
for the global feature, all node features of the positive sample are averaged by an averaging readout:

s = R(ε(X, A)) = σ( (1/N) Σ_{i=1..N} h_i )

wherein σ is a nonlinear Sigmoid function and N represents the number of nodes;
for the discriminator, a simple bilinear scoring function is applied:

D(h_i, s) = σ( h_iᵀ W s )

wherein W is a learnable scoring matrix and σ is a nonlinear Sigmoid function.
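The three components of claim 3 (single-layer GCN, averaging readout, bilinear discriminator) can be sketched together in one forward pass; the symmetric normalization and the random toy inputs are illustrative assumptions, not the patent's trained parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gcn_layer(x, a, theta):
    """Single-layer GCN: sigma( D^-1/2 A D^-1/2 X Theta )."""
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a.sum(axis=1)))
    return sigmoid(d_inv_sqrt @ a @ d_inv_sqrt @ x @ theta)

def readout(h):
    """Global feature s: sigmoid of the mean of all node features."""
    return sigmoid(h.mean(axis=0))

def discriminator(h_i, s, w):
    """Bilinear score sigma(h_i^T W s), a similarity value in (0, 1)."""
    return sigmoid(h_i @ w @ s)

rng = np.random.default_rng(0)
a = np.eye(4) + 0.5 * (np.ones((4, 4)) - np.eye(4))  # toy graph with self-loops
x = rng.standard_normal((4, 3))                       # node features
theta = rng.standard_normal((3, 2))                   # GCN weight matrix
h = gcn_layer(x, a, theta)                            # local node representations
s = readout(h)                                        # global feature
score = discriminator(h[0], s, rng.standard_normal((2, 2)))
```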
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910905090.1A CN110705260B (en) | 2019-09-24 | 2019-09-24 | Text vector generation method based on unsupervised graph neural network structure |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110705260A true CN110705260A (en) | 2020-01-17 |
CN110705260B CN110705260B (en) | 2023-04-18 |
Family
ID=69196022
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910905090.1A Expired - Fee Related CN110705260B (en) | 2019-09-24 | 2019-09-24 | Text vector generation method based on unsupervised graph neural network structure |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110705260B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108334605A (en) * | 2018-02-01 | 2018-07-27 | 腾讯科技(深圳)有限公司 | File classification method, device, computer equipment and storage medium |
CN108572961A (en) * | 2017-03-08 | 2018-09-25 | 北京嘀嘀无限科技发展有限公司 | A kind of the vectorization method and device of text |
US20180357531A1 (en) * | 2015-11-27 | 2018-12-13 | Devanathan GIRIDHARI | Method for Text Classification and Feature Selection Using Class Vectors and the System Thereof |
CN109299270A (en) * | 2018-10-30 | 2019-02-01 | 云南电网有限责任公司信息中心 | A kind of text data unsupervised clustering based on convolutional neural networks |
CN110134786A (en) * | 2019-05-14 | 2019-08-16 | 南京大学 | A kind of short text classification method based on theme term vector and convolutional neural networks |
Cited By (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11403488B2 (en) | 2020-03-19 | 2022-08-02 | Hong Kong Applied Science and Technology Research Institute Company Limited | Apparatus and method for recognizing image-based content presented in a structured layout |
CN111492370A (en) * | 2020-03-19 | 2020-08-04 | 香港应用科技研究院有限公司 | Device and method for recognizing text images of a structured layout |
WO2021184396A1 (en) * | 2020-03-19 | 2021-09-23 | Hong Kong Applied Science and Technology Research Institute Company Limited | Apparatus and method for recognizing image-based content presented in a structured layout |
CN111492370B (en) * | 2020-03-19 | 2023-05-26 | 香港应用科技研究院有限公司 | Apparatus and method for recognizing text image of structured layout |
CN111461301A (en) * | 2020-03-30 | 2020-07-28 | 北京沃东天骏信息技术有限公司 | Serialized data processing method and device, and text processing method and device |
CN111552803B (en) * | 2020-04-08 | 2023-03-24 | 西安工程大学 | Text classification method based on graph wavelet network model |
CN111552803A (en) * | 2020-04-08 | 2020-08-18 | 西安工程大学 | Text classification method based on graph wavelet network model |
CN111694957B (en) * | 2020-05-29 | 2024-03-12 | 新华三大数据技术有限公司 | Method, equipment and storage medium for classifying problem sheets based on graph neural network |
CN111694957A (en) * | 2020-05-29 | 2020-09-22 | 新华三大数据技术有限公司 | Question list classification method and device based on graph neural network and storage medium |
CN112000788A (en) * | 2020-08-19 | 2020-11-27 | 腾讯云计算(长沙)有限责任公司 | Data processing method and device and computer readable storage medium |
CN112000788B (en) * | 2020-08-19 | 2024-02-09 | 腾讯云计算(长沙)有限责任公司 | Data processing method, device and computer readable storage medium |
CN112016438A (en) * | 2020-08-26 | 2020-12-01 | 北京嘀嘀无限科技发展有限公司 | Method and system for identifying certificate based on graph neural network |
CN112016438B (en) * | 2020-08-26 | 2021-08-10 | 北京嘀嘀无限科技发展有限公司 | Method and system for identifying certificate based on graph neural network |
CN114119191A (en) * | 2020-08-28 | 2022-03-01 | 马上消费金融股份有限公司 | Wind control method, overdue prediction method, model training method and related equipment |
CN112214993A (en) * | 2020-09-03 | 2021-01-12 | 拓尔思信息技术股份有限公司 | Graph neural network-based document processing method and device and storage medium |
CN112214993B (en) * | 2020-09-03 | 2024-02-06 | 拓尔思信息技术股份有限公司 | File processing method, device and storage medium based on graphic neural network |
CN112364141A (en) * | 2020-11-05 | 2021-02-12 | 天津大学 | Scientific literature key content potential association mining method based on graph neural network |
CN112465006A (en) * | 2020-11-24 | 2021-03-09 | 中国人民解放军海军航空大学 | Graph neural network target tracking method and device |
CN112860897A (en) * | 2021-03-12 | 2021-05-28 | 广西师范大学 | Text classification method based on improved ClusterGCN |
CN113220884B (en) * | 2021-05-19 | 2023-01-31 | 西北工业大学 | Graph neural network text emotion classification method based on double sliding windows |
CN113220884A (en) * | 2021-05-19 | 2021-08-06 | 西北工业大学 | Graph neural network text emotion classification method based on double sliding windows |
CN114357271A (en) * | 2021-12-03 | 2022-04-15 | 天津大学 | Microblog popularity prediction method based on text and time sequence information fused by graph neural network |
CN114357271B (en) * | 2021-12-03 | 2024-09-03 | 天津大学 | Microblog heat prediction method based on fusion text of graphic neural network and time sequence information |
CN114818737A (en) * | 2022-06-29 | 2022-07-29 | 北京邮电大学 | Method, system and storage medium for extracting semantic features of scientific and technological paper data text |
CN114818737B (en) * | 2022-06-29 | 2022-11-18 | 北京邮电大学 | Method, system and storage medium for extracting semantic features of scientific and technological paper data text |
CN115759183A (en) * | 2023-01-06 | 2023-03-07 | 浪潮电子信息产业股份有限公司 | Related method and related device for multi-structure text graph neural network |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110705260B (en) | Text vector generation method based on unsupervised graph neural network structure | |
CN107133213B (en) | Method and system for automatically extracting text abstract based on algorithm | |
CN108052588B (en) | Method for constructing automatic document question-answering system based on convolutional neural network | |
CN104615767B (en) | Training method, search processing method and the device of searching order model | |
CN107168954B (en) | Text keyword generation method and device, electronic equipment and readable storage medium | |
CN104102626B (en) | A kind of method for short text Semantic Similarity Measurement | |
CN102622338B (en) | Computer-assisted computing method of semantic distance between short texts | |
WO2017090051A1 (en) | A method for text classification and feature selection using class vectors and the system thereof | |
CN111125367B (en) | Multi-character relation extraction method based on multi-level attention mechanism | |
CN110717042A (en) | Method for constructing document-keyword heterogeneous network model | |
CN107577671A (en) | A kind of key phrases extraction method based on multi-feature fusion | |
CN111552803A (en) | Text classification method based on graph wavelet network model | |
CN108073571B (en) | Multi-language text quality evaluation method and system and intelligent text processing system | |
CN109815400A (en) | Personage's interest extracting method based on long text | |
CN109710916A (en) | A kind of tag extraction method, apparatus, electronic equipment and storage medium | |
Mahmoud et al. | A text semantic similarity approach for Arabic paraphrase detection | |
CN112818113A (en) | Automatic text summarization method based on heteromorphic graph network | |
CN112784602B (en) | News emotion entity extraction method based on remote supervision | |
CN112966523A (en) | Word vector correction method based on semantic relation constraint and computing system | |
Qun et al. | End-to-end neural text classification for tibetan | |
Wint et al. | Deep learning based sentiment classification in social network services datasets | |
CN111353032B (en) | Community question and answer oriented question classification method and system | |
WO2022228127A1 (en) | Element text processing method and apparatus, electronic device, and storage medium | |
Han et al. | CNN-BiLSTM-CRF model for term extraction in Chinese corpus | |
Jia et al. | Attention in character-based BiLSTM-CRF for Chinese named entity recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20230418 |