CN112214993A - Graph neural network-based document processing method and device and storage medium - Google Patents
Info
- Publication number
- CN112214993A (application number CN202010916293.3A)
- Authority
- CN
- China
- Prior art keywords
- document
- semantic
- graph
- neural network
- encoder
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- Biomedical Technology (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Data Mining & Analysis (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The invention belongs to the technical field of document processing and retrieval, and aims to solve the technical problems that conventional keyword retrieval cannot mine the semantic relations among words, sentences, and documents and therefore yields poor retrieval results. The invention relates to a document processing method, device, electronic device, and non-volatile computer storage medium based on a graph neural network. The method uses a supervised graph neural network to generate a deep semantic vector from a semantic word graph, applies a binarization encoder to convert the semantic vector into binary code, generates character feature strings from the code, and constructs an inverted index over them. The processed document can then be retrieved with high performance and matched semantically through the character-feature index during retrieval, effectively improving the relevance of semantic search results.
Description
Technical Field
The invention relates to the technical field of document processing and retrieval, and in particular to a graph neural network-based document processing method and device, an electronic device, and a non-volatile computer storage medium.
Background
Currently, deep learning techniques have made great progress in the field of information retrieval. Models such as Word2vec can capture semantic relations among different words and effectively solve the matching problem of similar words; models such as BERT can perform better semantic encoding and retrieval on sentence-level and paragraph-level texts (within a thousand characters).
The open-source Elastic Search retrieval system includes a semantic indexing and retrieval technology: continuous text is encoded with encoder models such as averaged word2vec, LSA, InferSent, Universal Sentence Encoder, ELMo, and BERT to form dense vectors; the dense vectors are then encoded into strings, converting dense-vector computation into a string retrieval problem, so that a traditional engine can be used to retrieve the deep dense vectors.
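The dense-vector-to-string trick described above can be illustrated with a small sketch (pure Python; the bucket count, value range, and `d{i}_{b}` token format are illustrative assumptions, not the actual Elastic Search encoding):

```python
def vector_to_tokens(vec, buckets=8, lo=-1.0, hi=1.0):
    """Quantize each dimension of a dense vector into a bucket and emit
    one 'dim_bucket' token per dimension, so a plain inverted index can
    match vectors that fall into the same buckets."""
    step = (hi - lo) / buckets
    tokens = []
    for i, x in enumerate(vec):
        # clamp into [0, buckets-1] so out-of-range values still map to a bucket
        b = min(buckets - 1, max(0, int((x - lo) / step)))
        tokens.append(f"d{i}_{b}")
    return tokens
```

Two vectors that agree on many tokens are close in many dimensions, which is what lets a traditional string-matching engine approximate dense-vector similarity.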
For example, the Chinese patent application with publication No. CN101576904A discloses a system and method for calculating text-content similarity based on a weighted graph. The system includes an input unit for inputting the document set whose similarity is to be calculated; a construction unit for constructing the weighted graph; a calculating unit for calculating the similarity between any two nodes of the graph obtained by the construction unit; and an output unit for returning the similarity result to the user. However, that invention must construct semantic relations between documents from a document set, and it neither considers nor resolves the semantic relations between synonyms.
Although deep learning can perform better semantic encoding and retrieval of words, sentences, and paragraphs, current deep neural network techniques cannot adequately encode and retrieve long texts such as patents and papers. A dense vector representation of a long text can be obtained by directly applying word-vector averaging, BERT encoding, and similar techniques, but the results are poor.
Disclosure of Invention
The invention aims to solve the technical problems that conventional keyword retrieval cannot mine the semantic relations among words, sentences, and documents and yields poor retrieval results. It provides a graph neural network-based document processing method, document processing device, electronic device, and non-volatile computer storage medium that allow a processed document to be retrieved with high performance and matched semantically through a character-feature index during retrieval, effectively improving the relevance of semantic search results.
The first aspect of the invention provides a document processing method based on a graph neural network, comprising the following steps:
extracting a group of keywords representing document semantics from the document, calculating context co-occurrence relations among the keywords, and generating a semantic word graph of the document;
inputting the semantic word graph sequentially into a graph convolutional neural network and a binarization encoder, converting the semantic word graph of the document into a deep semantic vector and a binarized vector code;
grouping the binary vector codes to generate a group of character feature strings, storing the obtained character feature strings as a character feature document in a full-text search engine, and establishing an index of the document in the full-text search engine.
In a preferred implementation of the embodiment of the present invention, extracting a group of keywords representing document semantics from the document includes: looking up the words that appear in the document against a dictionary word library, calculating each word's weight from at least one of its part of speech, number of occurrences, position, and frequency in historical documents, and selecting the several words with the highest weights as keywords.
In a further preferred implementation of the embodiment of the present invention, calculating the context co-occurrence relations between the keywords and generating the semantic word graph of the document includes: taking the extracted keywords as the nodes of the semantic word graph, then constructing edges between nodes from the context-adjacency or window co-occurrence relations between the keywords; for short texts, edges are established between the nodes of adjacent words; for long texts, edges are established between the nodes of words that appear within a fixed context window.
In a preferred implementation of the embodiment of the present invention, the method further includes training the graph convolutional neural network and the neural network of the binarization encoder, where the training process includes: passing the semantic word graphs of a query document and a comparison document respectively into a semantic word graph encoder based on the graph convolutional neural network, and passing the semantic vectors generated by the encoder into the binarization encoder; the binarization encoder comprises an encoder and a decoder, where the encoder binarizes the semantic vector X generated by the graph convolutional network to produce Φ(X) and the binarized code, and the decoder reconstructs a semantic embedding vector Y from Φ(X); the semantic embedding vectors Y reconstructed for the query document and the comparison document are then used to predict the semantic relation between the document pair through matching functions.
In a further preferred implementation of the embodiment of the present invention, inputting the semantic word graph into the graph convolutional network and converting it into a deep semantic vector includes: the graph convolutional network comprises two layers, and each node's vector incorporates the attribute information of its neighboring nodes; the semantic word graph encoder takes a semantic word graph and a pre-trained model as input; the binarization encoder converts the real-valued semantic embedding vector into binary code while keeping the semantic information substantially unchanged, with little loss.
In a preferred implementation of the embodiment of the present invention, storing the character feature strings as a character-feature document in a full-text search engine and establishing an index of the document specifically includes: treating the encoded character feature code as a piece of character-feature text in which each character feature is separated by a specific symbol, storing it as a feature field of the input document in the full-text search engine, and indexing each feature code as an independent word.
In a preferred implementation of the embodiment of the present invention, the method further includes: when the full-text search engine performs semantic retrieval based on the character feature codes, a semantic word graph is first generated for the query document and fed into the neural network encoder model to produce character feature codes; a semantic query statement is then constructed and submitted to the full-text search engine to obtain the retrieval results.
The second aspect of the present invention also provides a graph neural network-based document processing apparatus, including:
a semantic word graph generating unit for extracting a group of keywords representing document semantics from the document, calculating context cooccurrence relation between the keywords, and generating a semantic word graph of the document;
an encoding unit for inputting the semantic word graph sequentially into the graph convolutional neural network and the binarization encoder, converting the semantic word graph of the document into a deep semantic vector and a binarized vector code;
and a character feature document processing unit for grouping the binarized vector code to generate a group of character feature strings, storing the character feature strings as a character-feature document in a full-text search engine, and establishing an index of the document in the full-text search engine.
The third aspect of the present invention also provides an electronic device, including: a memory, a processor, and a computer program; wherein the computer program is stored in the memory and configured to be executed by the processor to implement any of the graph neural network-based document processing methods provided in the first aspect.
The fourth aspect of the present invention also provides a nonvolatile computer storage medium characterized by a computer program stored thereon; the computer program is executed by a processor to implement the graph neural network-based document processing method according to any one of the first aspect.
By adopting a supervised graph neural network to generate a deep semantic vector from the semantic word graph, applying a binarization encoder to convert the semantic vector into binary code, generating character feature strings, and constructing an inverted index, the method allows the processed document to be retrieved with high performance and matched semantically through the character-feature index during retrieval, effectively improving the relevance of semantic search results.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention may be realized and attained by the structure and/or process particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
Fig. 1 is a flowchart corresponding to a document processing method based on a graph neural network according to an embodiment of the present invention.
FIG. 2 is a flowchart corresponding to another method for processing documents based on a graph neural network according to an embodiment of the present invention.
Fig. 3 is a system architecture diagram of semantic retrieval based on deep learning according to an embodiment of the present invention.
FIG. 4 is a semantic word graph of a case document corresponding to a document processing method based on a graph neural network according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of a model training process of a deep neural network according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of a graph convolution neural network according to an embodiment of the present invention.
Fig. 7 is a schematic diagram of a binarization encoder according to an embodiment of the present invention.
Fig. 8 is a flowchart of fast semantic retrieval based on character feature codes according to an embodiment of the present invention.
FIG. 9 is a block diagram of a document processing apparatus based on a graph neural network according to an embodiment of the present invention.
Fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the invention.
Detailed Description
The following detailed description of the embodiments of the present invention will be provided with reference to the drawings and examples, so that how to apply the technical means to solve the technical problems and achieve the technical effects can be fully understood and implemented. It should be noted that the detailed description is only for the purpose of making the invention easier and clearer for those skilled in the art, and is not intended to be a limiting explanation of the invention; moreover, as long as there is no conflict, the embodiments and the features of the embodiments of the present invention may be combined with each other, and the technical solutions formed are all within the scope of the present invention.
Additionally, the steps illustrated in the flowcharts of the drawings may be performed in a computer system such as a set of computer-executable instructions, and, although a logical ordering is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order different from that illustrated herein.
The technical scheme of the invention is described in detail by the figures and the specific embodiments as follows:
Embodiments
As shown in fig. 1, the present embodiment provides a document processing method based on a graph neural network, the document processing method including:
s110, extracting a group of keywords representing document semantics from the document, calculating context co-occurrence relations among the keywords, and generating a semantic word graph of the document;
s120, the semantic word graph is sequentially input into a graph convolution neural network and a binarization encoder, and the semantic word graph of the document is converted into a depth semantic vector and a binarization vector code;
and S130, grouping the binary vector codes to generate a group of character feature strings, storing the obtained character feature strings as a character feature document in a full-text retrieval engine, and establishing an index of the document in the full-text retrieval engine.
In a preferred implementation of this embodiment, in step S110, extracting a group of keywords representing document semantics from the document includes: looking up the words that appear in the document against a dictionary word library, calculating each word's weight from at least one of its part of speech, number of occurrences, position, and frequency in historical documents, and selecting the several words with the highest weights as keywords.
In a further preferred implementation of this embodiment, in step S110, calculating the context co-occurrence relations between the keywords and generating the semantic word graph of the document includes: taking the extracted keywords as the nodes of the semantic word graph, then constructing edges between nodes from the context-adjacency or window co-occurrence relations between the keywords; for short texts, edges are established between the nodes of adjacent words; for long texts, edges are established between the nodes of words that appear within a fixed context window.
Therefore, in the technical scheme of this embodiment, a deep semantic vector is generated from the semantic word graph by a supervised graph neural network; a binarization encoder converts the semantic vector into binary code, from which character feature strings are generated and an inverted index is constructed. The processed document can then be retrieved with high performance and matched semantically through the character-feature index during retrieval, effectively improving the relevance of semantic search results.
As shown in fig. 2, the present embodiment provides a graph neural network-based document processing method, which, in addition to the steps S110, S120, and S130 mentioned in the corresponding document processing method of fig. 1, further includes:
s140, training the neural network corresponding to the graph convolution neural network and the binarization encoder, wherein the training process comprises the following steps: semantic word graphs between the query document and the comparison document are respectively transmitted into a semantic word graph encoder based on a graph convolution neural network, and semantic vectors generated by the semantic word graph encoder are transmitted into a binarization encoder; the binarization encoder comprises an encoder and a decoder, wherein the encoder is used for carrying out binarization processing on a semantic vector X generated by the graph convolution neural network to generate phi (X) and generate binarization encoding, and the decoder is used for reconstructing a semantic embedded vector Y from the phi (X); and reconstructing a semantic embedded vector Y based on the query document and the comparison document, and predicting the semantic relation between the document pairs through a matching function.
In a preferred embodiment, inputting the semantic word graph into the graph convolutional network and converting it into a deep semantic vector includes: the graph convolutional network comprises two layers, and each node's vector incorporates the attribute information of its neighboring nodes; the semantic word graph encoder takes a semantic word graph and a pre-trained model as input; the binarization encoder converts the real-valued semantic embedding vector into binary code while keeping the semantic information substantially unchanged, with little loss.
In a preferred embodiment, storing the character feature strings as a character-feature document in a full-text search engine and establishing an index of the document includes: treating the encoded character feature code as a piece of character-feature text in which each character feature is separated by a specific symbol, storing it as a feature field of the input document in the full-text search engine, and indexing each feature code as an independent word.
In addition, the graph neural network-based document processing method provided by this embodiment is mainly an improvement on document processing; it can also improve the subsequent retrieval, for example:
In a preferred implementation of the embodiment of the present invention, the method further includes: when the full-text search engine performs semantic retrieval based on the character feature codes, a semantic word graph is first generated for the query document and fed into the neural network encoder model to produce character feature codes; a semantic query statement is then constructed and submitted to the full-text search engine to obtain the retrieval results.
Because the graph neural network-based document processing of figures 1 and 2 converts the semantic vector into binary code, generates character feature strings, and constructs an inverted index, retrieval can be performed with high performance and semantic matching through the character-feature index, effectively improving the relevance of semantic search results.
In order to make the technical solution of the present embodiment easier to understand by those skilled in the art, the above-mentioned graph neural network-based document processing method is further explained with reference to specific embodiments in conjunction with fig. 3-8.
As shown in fig. 3, the system architecture for semantic retrieval based on deep learning provided by this embodiment includes:
the method comprises the steps of inputting a document library 110 to be retrieved and a query document 120 into a semantic word graph building module 130, and a deep neural network model 140 connected with the semantic word graph building module 130, wherein the deep neural network model 140 also receives a related document training set 150 with labels; the deep neural network model 140 processes and inputs the contents of the document library 110 to be retrieved and the query document 120 to the corresponding document semantic vector character feature coding module 160 and the corresponding document semantic vector character feature coding module 170, and then outputs the retrieval result through the deep meaning retrieval module 190 after the full-text retrieval database 180 is retrieved. More specifically: 1.
1) First, a group of keywords representing the document's semantics is extracted from the document.
the extraction method of the keywords comprises the steps of preparing a large dictionary, matching words appearing in the documents in the dictionary, calculating the weights of the words according to information such as the parts of speech, the appearance times, the positions and the frequency of historical documents of the words, and selecting TOP N words with the highest weights as the keywords.
For short texts such as titles, at most 10 keywords are selected; for texts such as abstracts, 25 keywords; for long texts, 50 keywords.
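The weighting-and-selection step can be sketched as follows (pure Python; the scoring formula is an illustrative stand-in for the part-of-speech, position, and historical-frequency signals named above):

```python
from collections import Counter

def extract_keywords(words, vocab, top_n=10, position_bonus=2.0):
    """Score dictionary words by occurrence count, boosting words that
    first appear early in the document (a hypothetical stand-in for the
    richer weighting signals), and keep the TOP N by weight."""
    counts = Counter(w for w in words if w in vocab)
    first_pos = {}
    for i, w in enumerate(words):
        if w in vocab and w not in first_pos:
            first_pos[w] = i
    scores = {
        w: c + (position_bonus if first_pos[w] < len(words) // 4 else 0.0)
        for w, c in counts.items()
    }
    # sort by descending score, then alphabetically for a stable result
    return [w for w, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]]
```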
The following example is a patent abstract text and its keywords:
Keywords:
horizontal plate shape; parallel transistor; base plate; insulating elastic block; upper cover plate; vertical support plate; connecting block; outer side wall; inner side wall; clamping ball plunger; transistor; clamping block; steel ball; concave hole; bottom surface; top surface; fixing; nesting; two sides; middle; box cover; disassembly; clinging
Abstract text:
the invention discloses a box cover type horizontal plate-shaped parallel transistor device which comprises a connecting bottom plate, wherein vertical supporting plates are fixed in the middle of two sides of the top surface of the connecting bottom plate, a plurality of lower insulating elastic blocks are fixed on the connecting bottom plate between the two vertical supporting plates, a connecting block is fixed in the middle of the outer side walls of the two vertical supporting plates, a clamping ball plunger is screwed on the outer side wall of the connecting block, clamping blocks are arranged on two sides of the bottom surface of an upper cover plate and tightly attached to the outer side wall of the connecting block, a steel ball of the clamping ball plunger is nested in a concave hole formed in the inner side wall of the clamping block, the upper cover plate is positioned above all the lower insulating elastic blocks and the two vertical supporting plates, and upper insulating elastic blocks corresponding to the lower insulating elastic blocks are fixed on the. The transistor can be horizontally placed at a position needing to be placed, so that the placing requirement is met, meanwhile, the transistor can be installed or replaced only by upwards pulling the upper cover plate, and the transistor is convenient to install and detach.
2) Calculate the context co-occurrence relations among the keywords and generate the semantic word graph of the document.
The semantic word graph is generated as follows: the extracted keywords become the nodes of the graph, and edges between nodes are constructed from the context-adjacency or window co-occurrence relations between keywords;
for short texts, edges are established between the nodes of adjacent words;
for long texts, edges are established between the nodes of words that appear within a fixed context window; the window size is preferably 3-5. A schematic example is shown in fig. 4.
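The construction just described can be sketched as follows (pure Python; the token-stream input and the node/edge representation are assumptions):

```python
def build_word_graph(keywords, tokens, window=3):
    """Nodes are the extracted keywords; an edge links two keywords
    whenever both fall inside the same sliding context window over the
    token stream (the window counts the word itself plus the following
    window-1 tokens). window=2 reduces to plain adjacency for short texts."""
    kw = set(keywords)
    edges = set()
    for i, w in enumerate(tokens):
        if w not in kw:
            continue
        for v in tokens[i + 1 : i + window]:
            if v in kw and v != w:
                edges.add(tuple(sorted((w, v))))  # undirected edge
    return {"nodes": sorted(kw), "edges": sorted(edges)}
```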
2. The model training process of the deep neural network: the goal of training is to learn a document semantic encoding model that maps a semantic word graph to a fixed-length vector representation. This vector encodes the document's semantic information and can be used to compute similarity between documents or to search for similar documents.
The entire neural network is a supervised learning architecture for the text-relevance analysis task; the input training corpus consists of document pairs labeled as semantically related or unrelated.
The neural network comprises a semantic word graph encoder based on a graph convolutional neural network (GCN) and a binarization encoder based on an autoencoder structure.
The whole training process, as shown in fig. 5, includes:
1) The semantic word graphs of the query document and the comparison document are each passed into a parameter-sharing semantic word graph encoder based on a graph convolutional neural network (GCN), and the semantic vectors it generates (128 or 256 dimensions) are passed into the binarization encoder;
2) The binarization encoder comprises an encoder and a decoder: the encoder binarizes the semantic vector X generated by the GCN into Φ(X), from which the final binary code is produced; the decoder reconstructs the semantic embedding vector Y from Φ(X).
3) Let U and V be the semantic vectors Y reconstructed by the binarization encoder for the query document and the comparison document, respectively. U and V are combined through 3 matching functions: concatenation, element-wise vector difference (absolute value), and element-wise vector product. The result then passes through a fully connected layer to a classification function (e.g., 3-way softmax) that predicts the semantic relation between the document pair.
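A minimal sketch of the three matching functions applied to U and V (pure Python; the downstream fully connected layer and softmax are omitted):

```python
def match_features(u, v):
    """Combine two reconstructed document vectors with the three matching
    functions named in the text: concatenation, element-wise absolute
    difference, and element-wise product. The result feeds the classifier."""
    concat = list(u) + list(v)
    abs_diff = [abs(a - b) for a, b in zip(u, v)]
    product = [a * b for a, b in zip(u, v)]
    return concat + abs_diff + product
```

For d-dimensional inputs this yields a 4d-dimensional feature vector, which is a common input layout for sentence-pair classifiers.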
3. The semantic word graph encoder based on a graph convolutional neural network (GCN) is shown in fig. 6. The GCN used is a two-layer network, and each node's embedding incorporates the attribute information of its neighboring nodes.
The input of the GCN encoder is a semantic word graph and a pre-trained model. The pre-trained models here include, but are not limited to, word embeddings, BERT, and the like.
The output of the GCN encoder is a 128-dimensional semantic embedding vector, which serves as the input of the binarization encoder.
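A numerical sketch of such a two-layer GCN forward pass (NumPy; the row-normalization, self-loops, and ReLU are common choices assumed here, since the text does not give the exact propagation rule):

```python
import numpy as np

def gcn_forward(adj, feats, w1, w2):
    """Each layer aggregates a node's (normalized) neighbourhood and
    applies a learned projection, so a node's vector absorbs its
    neighbours' attribute information; mean pooling over nodes yields
    one graph-level embedding."""
    a_hat = adj + np.eye(adj.shape[0])                 # add self-loops
    a_hat = a_hat / a_hat.sum(axis=1, keepdims=True)   # row-normalize
    h = np.maximum(a_hat @ feats @ w1, 0.0)            # layer 1 + ReLU
    h = a_hat @ h @ w2                                 # layer 2
    return h.mean(axis=0)                              # graph embedding
```

With w2 shaped to 128 columns, the pooled output matches the 128-dimensional embedding described above.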
4. The binarization encoder model is shown in fig. 7. The semantic embedding vector generated by the graph neural network encoder is a dense, real-valued representation, which is not suitable for processing by a traditional full-text search engine.
The goal of the binarization encoder is to convert the real-valued semantic embedding vector into binary code while keeping the semantic information substantially unchanged, with little loss; each vector then requires only 128 or 256 bits.
The model is based on an encoder-decoder architecture, consisting of two parts:
the encoder is responsible for performing binarization processing on the semantic embedded vector X to generate a binarization code phi (X);
the decoder is responsible for reconstructing the decoder of the semantic embedded vector Y from the binarized encoding Φ (x).
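A minimal sketch of this encoder-decoder structure follows. It is illustrative only: the weights here are random rather than learned, and thresholding a linear projection at zero is an assumed binarization rule, not necessarily the one used in the patent.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, BITS = 128, 128                       # embedding size and code length
W_enc = rng.normal(size=(DIM, BITS))       # encoder weights (would be learned)
W_dec = rng.normal(size=(BITS, DIM))       # decoder weights (would be learned)

def encode(x):
    """Φ(X): binarize the encoder's linear projection by its sign."""
    return (x @ W_enc > 0).astype(np.uint8)

def decode(code):
    """Reconstruct the semantic embedding Y from the binary code Φ(X)."""
    return code @ W_dec

x = rng.normal(size=(DIM,))                # a semantic embedding from the GCN
code = encode(x)                           # 128 bits
y = decode(code)                           # reconstructed embedding Y
```

In training, the reconstruction loss between x and y drives the code to preserve semantic information; here only the forward data flow is shown.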
5. From binary code to character feature codes: the binary code is processed in groups so that it can be stored and retrieved more effectively by a full-text engine.
The processing method is as follows: the 128-bit (or 256-bit) binary code is divided into groups of 4 bits, giving 32 (or 64) groups.
Each group is assigned a subscript value from 1 to 32 (or 64) in order. The value of each group ranges from 0 to 15 and is encoded in hexadecimal.
The character encoding format is therefore "subscript-hexadecimal value", e.g., 1-0F, 32-00.
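The grouping and encoding rule can be written out directly. The `to_character_features` helper is illustrative; the "subscript-hexadecimal value" format follows the examples 1-0F and 32-00 given in the text.

```python
def to_character_features(bits):
    """Split a 128- or 256-bit code into 4-bit groups and encode each as
    'index-HH', where index is the 1-based group subscript and HH is the
    group's value (0-15) in two-digit hexadecimal."""
    assert len(bits) % 4 == 0
    features = []
    for i in range(0, len(bits), 4):
        group = bits[i:i + 4]
        value = int("".join(str(b) for b in group), 2)  # 0..15
        index = i // 4 + 1                              # group subscript, 1-based
        features.append(f"{index}-{value:02X}")
    return features

codes = to_character_features([0, 0, 0, 0, 1, 1, 1, 1] + [0] * 120)
# codes[:2] == ["1-00", "2-0F"]; codes[-1] == "32-00"
```

A 128-bit code thus yields exactly 32 short tokens, each drawn from a tiny vocabulary (32 subscripts × 16 values), which is easy for a full-text engine to index.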
6. Storage of the character feature codes: the character feature codes obtained after encoding can be regarded as a passage of character feature text. The individual character features are separated by a specific symbol and stored in the full-text retrieval engine as a feature field of the input document, and each feature code is treated as an independent word and indexed separately.
7. Fast semantic retrieval based on character feature codes
Fast semantic retrieval based on character feature codes first generates a semantic word graph for the query document, feeds it into the neural network encoder model to generate character feature codes, then constructs a semantic query statement in combination with other retrieval conditions, and submits it to the full-text retrieval engine to obtain the retrieval results. The specific flow, shown in FIG. 8, comprises the following steps:
Input the document to be queried.
Construct the semantic word graph of the document to be queried according to the method above.
Input the constructed semantic word graph into the (trained) deep neural network model.
Obtain the semantic vector character feature codes of the document after processing by the deep neural network model.
Combine the encoded semantic vector character features with the additional retrieval conditions (e.g., merged semantic query conditions) to construct a semantic query statement.
Match the semantic query statement against the data in the full-text retrieval database.
Output the semantic retrieval result based on the comparison against the full-text retrieval database.
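The query-construction step can be sketched as follows. The field name `sem_features` and the OR-of-tokens query form are illustrative assumptions, not the patent's exact query syntax; the idea is that documents sharing more feature tokens with the query rank higher under standard full-text scoring.

```python
def build_semantic_query(feature_codes, extra_conditions=None):
    """Turn a query document's character feature codes into a full-text
    query string, optionally AND-ed with other retrieval conditions."""
    # Each code (e.g. "1-0F") is an independently indexed token.
    semantic = " OR ".join(f'sem_features:"{c}"' for c in feature_codes)
    query = f"({semantic})"
    if extra_conditions:
        query += " AND " + " AND ".join(extra_conditions)
    return query

q = build_semantic_query(["1-0F", "2-03"], ["year:2020"])
# q == '(sem_features:"1-0F" OR sem_features:"2-03") AND year:2020'
```

Because the semantic part is an ordinary keyword disjunction, it can be freely combined with conventional conditions (dates, fields, authors) in one engine query.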
As shown in fig. 9, the present embodiment further provides a graph neural network-based document processing apparatus 200, where the document processing apparatus 200 includes:
a semantic word graph generating unit 210, configured to extract a set of keywords representing the document semantics from a document, calculate the context co-occurrence relationships between the keywords, and generate the semantic word graph of the document;
wherein the semantic word graph is sequentially input into the graph convolutional neural network and the binarization encoder, and the semantic word graph of the document is converted into a depth semantic vector and a binarized vector code; and
a character feature document processing unit 230, configured to group the binarized vector codes into a set of character feature strings, store the obtained character feature strings as a character feature document in the full-text search engine, and create an index for the document in the full-text search engine.
It should be noted that each module of the document processing apparatus 200 provided in this embodiment can perform the corresponding functions of the graph neural network-based document processing method described with reference to FIGS. 1 to 8; for the specific processes and technical effects, reference may be made to the description above, which is not repeated here.
As shown in fig. 10, the present embodiment further provides an electronic device 300, where the electronic device 300 includes: memory 310, processor 320, and computer programs; wherein the computer program is stored in the memory 310 and configured to be executed by the processor 320 to implement any of the graph neural network-based document processing methods provided above.
In addition, the present embodiment also provides a nonvolatile computer storage medium on which a computer program is stored; the computer program is executed by a processor to implement any of the graph neural network-based document processing methods provided above.
Those of ordinary skill in the art will understand that: the above-described method according to an embodiment of the present invention may be implemented in hardware, firmware, or as software or computer code storable in a recording medium such as a CD ROM, a RAM, a floppy disk, a hard disk, or a magneto-optical disk, or as computer code originally stored in a remote recording medium or a non-transitory machine-readable medium downloaded through a network and to be stored in a local recording medium, so that the method described herein may be stored in such software processing on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware such as an ASIC, an FPGA, or an SoC. It will be appreciated that the computer, processor, microprocessor controller or programmable hardware includes memory components (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code that, when accessed and executed by the computer, processor or hardware, implements the processing methods described herein. Further, when a general-purpose computer accesses code for implementing the processes shown herein, execution of the code transforms the general-purpose computer into a special-purpose computer for performing the processes shown herein.
Those of ordinary skill in the art will appreciate that the various illustrative elements and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present embodiments.
Finally, it should be understood that the above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way. Those skilled in the art can make many changes and simple substitutions to the technical solution of the present invention without departing from the technical solution of the present invention, and the technical solution of the present invention is protected by the following claims.
Claims (10)
1. A document processing method based on a graph neural network is characterized by comprising the following steps:
extracting a group of keywords representing document semantics from the document, calculating context co-occurrence relations among the keywords, and generating a semantic word graph of the document;
the semantic word graph is sequentially input to a graph convolution neural network and a binarization encoder, and the semantic word graph of the document is converted into a depth semantic vector and a binarization vector code;
grouping the binary vector codes to generate a group of character feature strings, storing the obtained character feature strings as a character feature document in a full-text search engine, and establishing an index of the document in the full-text search engine.
2. The method of claim 1, wherein extracting a set of keywords representing the document semantics from the document comprises: searching for words appearing in the document based on a word library corresponding to a dictionary, calculating the weight of each word according to at least one of its part of speech, number of occurrences, position, and frequency in historical documents, and selecting the several words with the highest weights as keywords.
3. The method of claim 2, wherein calculating the context co-occurrence relationships between the keywords and generating the semantic word graph of the document comprises: taking the extracted keywords as nodes of the semantic word graph, and then constructing edges between the nodes through context adjacency or window co-occurrence relationships between the keywords; for short texts, edges are established between the nodes of adjacent words; for long texts, edges are established between the nodes of words that appear within a fixed context window.
4. The method of claim 1, further comprising: training the graph convolutional neural network and the neural network corresponding to the binarization encoder, wherein the training process comprises: the semantic word graphs of the query document and the comparison document are respectively fed into a semantic word graph encoder based on the graph convolutional neural network, and the semantic vectors generated by the semantic word graph encoder are fed into the binarization encoder; the binarization encoder comprises an encoder and a decoder, wherein the encoder binarizes the semantic vector X generated by the graph convolutional neural network to produce Φ(X) and generate the binary code, and the decoder reconstructs the semantic embedding vector Y from Φ(X); and the semantic embedding vectors Y reconstructed for the query document and the comparison document are used to predict the semantic relationship between the document pair through matching functions.
5. The method of claim 4, wherein sequentially inputting the semantic word graph into the graph convolutional neural network and converting it into the depth semantic vector comprises: the graph convolutional neural network comprises two layers of neural networks, and the vector of each node is derived from the attribute information of its neighboring nodes; the inputs of the semantic word graph encoder are the semantic word graph and a pre-trained model; and the binarization encoder is configured to convert the real-valued semantic embedding vector into binary-coded form while keeping the semantic information essentially unchanged or with minimal loss.
6. The method of claim 1, wherein storing the obtained character feature string as a character feature document in the full-text search engine and establishing the index of the document in the full-text search engine specifically comprises: the character feature codes obtained after encoding are regarded as a passage of character feature text; each character feature is separated by a specific symbol and stored in the full-text retrieval engine as a feature field of the input document, and each feature code is treated as an independent word and indexed separately.
7. The method of any one of claims 1-6, further comprising: when performing semantic retrieval based on the character feature codes with the full-text retrieval engine, first generating a semantic word graph for the query document, inputting it into the neural network encoder model to generate character feature codes, then constructing a semantic query statement, and submitting it to the full-text retrieval engine to obtain the retrieval results.
8. A graph neural network-based document processing apparatus, comprising:
a semantic word graph generating unit for extracting a set of keywords representing the document semantics from a document, calculating the context co-occurrence relationships between the keywords, and generating the semantic word graph of the document;
wherein the semantic word graph is sequentially input into the graph convolutional neural network and the binarization encoder, and the semantic word graph of the document is converted into a depth semantic vector and a binarized vector code; and
a character feature document processing part for grouping the binarized vector codes into a set of character feature strings, storing the obtained character feature strings as a character feature document in a full-text search engine, and establishing an index of the document in the full-text search engine.
9. An electronic device, comprising: a memory, a processor, and a computer program; wherein the computer program is stored in the memory and configured to be executed by the processor to implement the graph neural network-based document processing method of any one of claims 1-7.
10. A non-volatile computer storage medium having a computer program stored thereon; the computer program is executed by a processor to implement the graph neural network-based document processing method of any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010916293.3A CN112214993B (en) | 2020-09-03 | 2020-09-03 | File processing method, device and storage medium based on graphic neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112214993A true CN112214993A (en) | 2021-01-12 |
CN112214993B CN112214993B (en) | 2024-02-06 |
Family
ID=74049139
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010916293.3A Active CN112214993B (en) | 2020-09-03 | 2020-09-03 | File processing method, device and storage medium based on graphic neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112214993B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103838833A (en) * | 2014-02-24 | 2014-06-04 | 华中师范大学 | Full-text retrieval system based on semantic analysis of relevant words |
CN110222160A (en) * | 2019-05-06 | 2019-09-10 | 平安科技(深圳)有限公司 | Intelligent semantic document recommendation method, device and computer readable storage medium |
CN110705260A (en) * | 2019-09-24 | 2020-01-17 | 北京工商大学 | Text vector generation method based on unsupervised graph neural network structure |
US20200065389A1 (en) * | 2017-10-10 | 2020-02-27 | Tencent Technology (Shenzhen) Company Limited | Semantic analysis method and apparatus, and storage medium |
Non-Patent Citations (1)
Title |
---|
Liang Yao et al., "Graph Convolutional Networks for Text Classification", arXiv, pp. 1-9 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113158643A (en) * | 2021-04-27 | 2021-07-23 | 广东外语外贸大学 | Novel text readability assessment method and system |
CN113158643B (en) * | 2021-04-27 | 2024-05-28 | 广东外语外贸大学 | Novel text readability evaluation method and system |
CN113282726A (en) * | 2021-05-27 | 2021-08-20 | 成都数之联科技有限公司 | Data processing method, system, device, medium and data analysis method |
CN117496542A (en) * | 2023-12-29 | 2024-02-02 | 恒生电子股份有限公司 | Document information extraction method, device, electronic equipment and storage medium |
CN117496542B (en) * | 2023-12-29 | 2024-03-15 | 恒生电子股份有限公司 | Document information extraction method, device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||