CN112214993A - Graph neural network-based document processing method and device and storage medium


Info

Publication number
CN112214993A
Authority
CN
China
Prior art keywords
document, semantic, graph, neural network, encoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010916293.3A
Other languages
Chinese (zh)
Other versions
CN112214993B (en)
Inventor
王洪俊 (Wang Hongjun)
肖诗斌 (Xiao Shibin)
施水才 (Shi Shuicai)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tols Information Technology Co., Ltd.
Original Assignee
Tols Information Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tols Information Technology Co ltd filed Critical Tols Information Technology Co ltd
Priority to CN202010916293.3A
Publication of CN112214993A
Application granted
Publication of CN112214993B
Legal status: Active
Anticipated expiration

Classifications

    • G06F40/242 Handling natural language data > Natural language analysis > Lexical tools > Dictionaries
    • G06F40/30 Handling natural language data > Semantic analysis
    • G06N3/045 Neural networks > Architecture > Combinations of networks
    • G06N3/08 Neural networks > Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention belongs to the technical field of document processing and retrieval, and aims to solve two shortcomings of conventional keyword retrieval: it cannot mine the semantic relations among words, sentences, and documents, and its retrieval quality is poor. The invention relates to a graph neural network-based document processing method, apparatus, electronic device, and non-volatile computer storage medium. The method uses supervised graph neural network technology to generate a deep semantic vector from a semantic word graph, applies a binarization encoder to convert the semantic vector into binary codes, generates character feature strings from them, and builds an inverted index. A processed document can then be retrieved with high performance and matched semantically using character-feature indexing and retrieval, effectively improving the relevance of semantic retrieval results.

Description

Graph neural network-based document processing method and device and storage medium
Technical Field
The invention relates to the technical field of document processing and retrieval, and in particular to a graph neural network-based document processing method and apparatus, an electronic device, and a non-volatile computer storage medium.
Background
Currently, deep learning techniques have made great progress in the field of information retrieval. Models such as Word2vec can capture semantic relations among different words and effectively address the matching of near-synonymous terms; models such as BERT can semantically encode and retrieve sentence-level and paragraph-level texts (within a thousand characters) well.
The open-source Elasticsearch retrieval system has a built-in semantic indexing and retrieval technique: continuous text is encoded with encoder models such as averaged word2vec, LSA, InferSent, Universal Sentence Encoder, ELMo, and BERT to form dense vectors; the dense vectors are then encoded as character strings, which converts dense-vector computation into a string retrieval problem, so that a traditional engine can retrieve the deep dense vectors.
For example, the patent application with Chinese publication No. CN101576904A discloses a system and method for computing text content similarity based on a weighted graph. The system includes an input unit for inputting the document set whose similarities are to be computed; a construction unit for building the weighted graph; a computing unit for computing the similarity between any two nodes of the graph from the weighted graph produced by the construction unit; and an output unit for returning the similarity results to the user. However, that invention must construct semantic relations between documents from a document set, and it neither considers nor resolves the semantic relations between synonyms.
Although deep learning can encode and retrieve words, sentences, and paragraphs semantically, current deep neural network techniques cannot handle the encoding and retrieval of long texts such as patents and papers well. Dense vector representations of long texts can be obtained by directly applying word-vector averaging, BERT encoding, and similar techniques, but the results are poor.
Disclosure of Invention
The invention aims to solve the technical problems that conventional keyword retrieval cannot mine the semantic relations among words, sentences, and documents, and that its retrieval quality is poor. To this end, the invention provides a graph neural network-based document processing method, apparatus, electronic device, and non-volatile computer storage medium, so that processed documents can be retrieved with high performance and matched semantically using character-feature indexing and retrieval, effectively improving the relevance of semantic retrieval results.
A first aspect of the invention provides a document processing method based on a graph neural network, characterized by comprising the following steps:
extracting a group of keywords representing document semantics from the document, calculating context co-occurrence relations among the keywords, and generating a semantic word graph of the document;
inputting the semantic word graph successively into a graph convolutional neural network and a binarization encoder, converting the semantic word graph of the document into a deep semantic vector and a binarized vector code;
grouping the binary vector codes to generate a group of character feature strings, storing the obtained character feature strings as a character feature document in a full-text search engine, and establishing an index of the document in the full-text search engine.
In a preferred implementation of the embodiment of the invention, extracting a group of keywords representing document semantics from the document includes: matching the words appearing in the document against a word library corresponding to a dictionary, computing each word's weight from at least one of its part of speech, occurrence count, position, and historical document frequency, and selecting the several highest-weighted words as keywords.
In a further preferred implementation of the embodiment of the invention, calculating the context co-occurrence relations between the keywords and generating the semantic word graph of the document includes: taking the extracted keywords as the nodes of the semantic word graph, and constructing edges between nodes through context-adjacency or window co-occurrence relations between keywords; for short texts, edges are established between nodes of adjacent words; for long texts, edges are established between nodes of words that co-occur within a fixed context window.
In a preferred implementation of the embodiment of the invention, the method further includes training the neural networks corresponding to the graph convolution and the binarization encoder, where the training process includes: the semantic word graphs of the query document and the comparison document are respectively fed into a semantic word graph encoder based on a graph convolutional neural network, and the semantic vectors generated by the semantic word graph encoder are passed into the binarization encoder; the binarization encoder comprises an encoder and a decoder, the encoder binarizing the semantic vector X generated by the graph convolutional neural network to produce Φ(X) and generate the binary code, and the decoder reconstructing a semantic embedding vector Y from Φ(X); the semantic embedding vectors Y reconstructed for the query document and the comparison document are then used to predict the semantic relation between the document pair through matching functions.
In a further preferred implementation of the embodiment of the invention, inputting the semantic word graph successively into the graph convolutional neural network and converting it into the deep semantic vector includes: the graph convolutional neural network comprises two layers, and each node's vector incorporates the attribute information of its neighboring nodes; the semantic word graph encoder takes as input a semantic word graph and a pre-training model; the binarization encoder converts the real-valued semantic embedding vector into binary form while keeping the semantic information essentially unchanged, with at most a small loss.
In a preferred implementation of the embodiment of the invention, storing the obtained character feature strings as a character feature document in the full-text search engine and establishing an index of the document in the full-text search engine specifically includes: the character feature codes obtained after encoding are treated as a passage of character feature text, the individual character features are separated by a designated symbol and stored in the full-text retrieval engine as a feature field of the input document, and each feature code is treated as an independent word and indexed separately.
In a preferred implementation of the embodiment of the invention, the method further includes: when the full-text retrieval engine performs semantic retrieval based on the character feature codes, a semantic word graph is first generated for the query document and input into the neural network encoder model to generate character feature codes; a semantic query statement is then constructed and submitted to the full-text retrieval engine to obtain the retrieval results.
A second aspect of the invention provides a graph neural network-based document processing apparatus, including:
a semantic word graph generating unit for extracting a group of keywords representing document semantics from the document, calculating the context co-occurrence relations between the keywords, and generating a semantic word graph of the document;
an encoding unit for sequentially inputting the semantic word graph into the graph convolutional neural network and the binarization encoder, converting the semantic word graph of the document into a deep semantic vector and a binarized vector code;
and a character feature document processing unit for grouping the binarized vector codes to generate a group of character feature strings, storing the obtained character feature strings as one character feature document in a full-text search engine, and establishing an index of the document in the full-text search engine.
The third aspect of the present invention also provides an electronic device, including: a memory, a processor, and a computer program; wherein the computer program is stored in the memory and configured to be executed by the processor to implement any of the graph neural network-based document processing methods provided in the first aspect.
A fourth aspect of the invention provides a non-volatile computer storage medium, characterized in that a computer program is stored thereon; the computer program is executed by a processor to implement the graph neural network-based document processing method of any one of the first aspect.
In summary, the method generates a deep semantic vector from the semantic word graph using supervised graph neural network technology, applies a binarization encoder to convert the semantic vector into binary codes, generates character feature strings from them, and builds an inverted index; the processed documents can then be retrieved with high performance and matched semantically using character-feature indexing and retrieval, effectively improving the relevance of semantic retrieval results.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention may be realized and attained by the structure and/or process particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
Fig. 1 is a flowchart corresponding to a document processing method based on a graph neural network according to an embodiment of the present invention.
FIG. 2 is a flowchart corresponding to another method for processing documents based on a graph neural network according to an embodiment of the present invention.
Fig. 3 is a system architecture diagram of semantic retrieval based on deep learning according to an embodiment of the present invention.
FIG. 4 is a semantic word graph of a case document corresponding to a document processing method based on a graph neural network according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of a model training process of a deep neural network according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of a graph convolution neural network according to an embodiment of the present invention.
Fig. 7 is a schematic diagram of a binarization encoder according to an embodiment of the present invention.
Fig. 8 is a flowchart of fast semantic retrieval based on character feature codes according to an embodiment of the present invention.
FIG. 9 is a block diagram of a document processing apparatus based on a graph neural network according to an embodiment of the present invention.
Fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the invention.
Detailed Description
The embodiments of the invention are described in detail below with reference to the drawings and examples, so that the reader can fully understand and implement how technical means are applied to solve the technical problems and achieve the technical effects. It should be noted that the detailed description is intended only to make the invention easier and clearer for those skilled in the art, not to limit it; moreover, as long as there is no conflict, the embodiments of the invention and the features within them may be combined with one another, and the resulting technical solutions all fall within the scope of the invention.
Additionally, the steps illustrated in the flowcharts of the drawings may be performed in a computer system as a set of computer-executable instructions, and although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from the one given here.
The technical solution of the invention is described in detail below through the figures and specific embodiments.
Embodiments
As shown in fig. 1, the present embodiment provides a document processing method based on a graph neural network, the document processing method including:
s110, extracting a group of keywords representing document semantics from the document, calculating context co-occurrence relations among the keywords, and generating a semantic word graph of the document;
s120, the semantic word graph is sequentially input into a graph convolution neural network and a binarization encoder, and the semantic word graph of the document is converted into a depth semantic vector and a binarization vector code;
and S130, grouping the binarized vector codes to generate a group of character feature strings, storing the obtained character feature strings as one character feature document in a full-text retrieval engine, and establishing an index of the document in the full-text retrieval engine.
In a preferred implementation of this embodiment, extracting a group of keywords representing document semantics from the document in step S110 includes: matching the words appearing in the document against a word library corresponding to a dictionary, computing each word's weight from at least one of its part of speech, occurrence count, position, and historical document frequency, and selecting the several highest-weighted words as keywords.
In a further preferred implementation of this embodiment, calculating the context co-occurrence relations between the keywords and generating the semantic word graph of the document in step S110 includes: taking the extracted keywords as the nodes of the semantic word graph, and constructing edges between nodes through context-adjacency or window co-occurrence relations between keywords; for short texts, edges are established between nodes of adjacent words; for long texts, edges are established between nodes of words that co-occur within a fixed context window.
Thus, in the technical solution provided by this embodiment, a deep semantic vector is generated from the semantic word graph using supervised graph neural network technology, a binarization encoder converts the semantic vector into binary codes, character feature strings are generated from them, and an inverted index is built; the processed document can then be retrieved with high performance and matched semantically using character-feature indexing and retrieval, effectively improving the relevance of semantic retrieval results.
As shown in fig. 2, the present embodiment provides a graph neural network-based document processing method, which, in addition to the steps S110, S120, and S130 mentioned in the corresponding document processing method of fig. 1, further includes:
s140, training the neural network corresponding to the graph convolution neural network and the binarization encoder, wherein the training process comprises the following steps: semantic word graphs between the query document and the comparison document are respectively transmitted into a semantic word graph encoder based on a graph convolution neural network, and semantic vectors generated by the semantic word graph encoder are transmitted into a binarization encoder; the binarization encoder comprises an encoder and a decoder, wherein the encoder is used for carrying out binarization processing on a semantic vector X generated by the graph convolution neural network to generate phi (X) and generate binarization encoding, and the decoder is used for reconstructing a semantic embedded vector Y from the phi (X); and reconstructing a semantic embedded vector Y based on the query document and the comparison document, and predicting the semantic relation between the document pairs through a matching function.
In a preferred embodiment, inputting the semantic word graph successively into the graph convolutional neural network and converting it into the deep semantic vector includes: the graph convolutional neural network comprises two layers, and each node's vector incorporates the attribute information of its neighboring nodes; the semantic word graph encoder takes as input a semantic word graph and a pre-training model; the binarization encoder converts the real-valued semantic embedding vector into binary form while keeping the semantic information essentially unchanged, with at most a small loss.
In a preferred embodiment, storing the character feature strings in the full-text search engine and creating the document index includes the following: the character feature codes obtained after encoding are treated as a passage of character feature text, the individual character features are separated by a designated symbol and stored in the full-text retrieval engine as a feature field of the input document, and each feature code is treated as an independent word and indexed separately.
In addition, the graph neural network-based document processing method provided by this embodiment is chiefly an improvement to document processing; it can also improve the retrieval that is subsequently executed, for example:
in a preferred implementation manner of the embodiment of the present invention, the method further includes: when the full text retrieval engine is used for semantic retrieval based on the character feature codes, firstly, a semantic word graph is generated for a query document, a neural network encoder model is input to generate character feature codes, then, semantic query sentences are constructed, the full text retrieval engine is submitted, and retrieval results are obtained.
Because the graph neural network-based document processing of figs. 1 and 2 converts the semantic vector into binary codes, generating character feature vectors and building an inverted index, retrieval can rely on character-feature indexing and retrieval to achieve high performance and semantic matching, effectively improving the relevance of semantic retrieval results.
In order to make the technical solution of the present embodiment easier to understand by those skilled in the art, the above-mentioned graph neural network-based document processing method is further explained with reference to specific embodiments in conjunction with fig. 3-8.
As shown in fig. 3, the system architecture for semantic retrieval based on deep learning provided by this embodiment includes:
the method comprises the steps of inputting a document library 110 to be retrieved and a query document 120 into a semantic word graph building module 130, and a deep neural network model 140 connected with the semantic word graph building module 130, wherein the deep neural network model 140 also receives a related document training set 150 with labels; the deep neural network model 140 processes and inputs the contents of the document library 110 to be retrieved and the query document 120 to the corresponding document semantic vector character feature coding module 160 and the corresponding document semantic vector character feature coding module 170, and then outputs the retrieval result through the deep meaning retrieval module 190 after the full-text retrieval database 180 is retrieved. More specifically: 1.
1. Construction of the semantic word graph. 1) First, extract a group of keywords representing the document's semantics from the document;
the extraction method of the keywords comprises the steps of preparing a large dictionary, matching words appearing in the documents in the dictionary, calculating the weights of the words according to information such as the parts of speech, the appearance times, the positions and the frequency of historical documents of the words, and selecting TOP N words with the highest weights as the keywords.
For short texts such as titles, at most 10 keywords may be selected; for texts such as abstracts, 25 keywords; for long texts, 50 keywords.
The following example shows a patent abstract text and its keywords:
key words:
horizontal plate shape; a parallel transistor; a base plate; an insulating elastic block; an upper cover plate; a vertical support plate; connecting blocks; an outer sidewall; an inner sidewall; clamping the ball plunger; a transistor; clamping the block; a steel ball; concave holes; a bottom surface; a top surface; fixing; nesting; two sides; a middle part; a box cover; disassembling; clinging to
Abstract:
the invention discloses a box cover type horizontal plate-shaped parallel transistor device which comprises a connecting bottom plate, wherein vertical supporting plates are fixed in the middle of two sides of the top surface of the connecting bottom plate, a plurality of lower insulating elastic blocks are fixed on the connecting bottom plate between the two vertical supporting plates, a connecting block is fixed in the middle of the outer side walls of the two vertical supporting plates, a clamping ball plunger is screwed on the outer side wall of the connecting block, clamping blocks are arranged on two sides of the bottom surface of an upper cover plate and tightly attached to the outer side wall of the connecting block, a steel ball of the clamping ball plunger is nested in a concave hole formed in the inner side wall of the clamping block, the upper cover plate is positioned above all the lower insulating elastic blocks and the two vertical supporting plates, and upper insulating elastic blocks corresponding to the lower insulating elastic blocks are fixed on the. The transistor can be horizontally placed at a position needing to be placed, so that the placing requirement is met, meanwhile, the transistor can be installed or replaced only by upwards pulling the upper cover plate, and the transistor is convenient to install and detach.
2) Calculate the context co-occurrence relations among the keywords and generate the document's semantic word graph.
The semantic word graph is generated as follows: first take the extracted keywords as the nodes of the semantic word graph, then construct edges between nodes through context-adjacency or window co-occurrence relations between keywords;
for short texts, edges are established between nodes of adjacent words;
for long texts, edges are established between nodes of words that co-occur within a fixed context window; a context window size of 3 to 5 is preferred. A schematic example is shown in fig. 4.
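As a concrete illustration of the construction just described, the following sketch builds the node and edge sets from the keyword sequence with the window co-occurrence rule; setting window=2 reduces it to the adjacency rule used for short texts. The function name and the weighting of edges by co-occurrence count are assumptions.

```python
# Sketch of semantic word graph construction: nodes are the keywords, edges
# link keywords that co-occur within a fixed context window.
def build_word_graph(keyword_sequence, window=4):
    """keyword_sequence: the document's tokens filtered to the extracted
    keywords, in their original order. Returns (nodes, edge-weight dict)."""
    nodes = sorted(set(keyword_sequence))
    edges = {}
    for i, a in enumerate(keyword_sequence):
        # link a to every keyword within the next (window - 1) positions
        for b in keyword_sequence[i + 1:i + window]:
            if a != b:
                key = tuple(sorted((a, b)))
                edges[key] = edges.get(key, 0) + 1  # co-occurrence count as weight
    return nodes, edges
```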
2. The model training process of the deep neural network: the goal of model training is to learn a document semantic encoding model that maps a semantic word graph to a fixed-length vector representation. This vector encodes the document's semantic information and can be used to compute similarity between documents or to search for similar documents.
The whole neural network is a supervised learning architecture for the text relevance analysis task; the input training corpus consists of document pairs carrying semantic labels of related or unrelated.
The neural network comprises a semantic word graph encoder based on a graph convolutional network (GCN) and a binarization encoder based on an autoencoder structure.
The whole training process, as shown in fig. 5, includes:
1) The semantic word graphs of the query document and the comparison document are each fed into a parameter-sharing semantic word graph encoder based on a graph convolutional neural network (GCN), and the semantic vectors generated by the encoder (128- or 256-dimensional) are passed into the binarization encoder;
2) The binarization encoder comprises an encoder and a decoder: the encoder binarizes the semantic vector X generated by the GCN to produce Φ(x), from which the final binary code is generated; the decoder reconstructs the semantic embedding vector Y from Φ(x).
3) Let u and v be the semantic vectors Y reconstructed through the binarization encoder for the query document and the comparison document, respectively. u and v are combined through three matching functions: concatenation, element-wise (absolute) vector difference, and element-wise vector product. The result then passes through a fully connected layer to a classification function (e.g., a 3-way softmax) that predicts the semantic relation between the document pair.
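A minimal PyTorch sketch of this matching step follows. The feature combination [u; v; |u-v|; u*v] and the 3-way softmax come from the description above; the hidden width and the exact layer stack are assumptions.

```python
# Sketch of the document-pair matching head described in step 3).
import torch
import torch.nn as nn

class MatchingHead(nn.Module):
    def __init__(self, dim=128, n_classes=3):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(4 * dim, dim),  # [u; v; |u-v|; u*v] has width 4*dim
            nn.ReLU(),
            nn.Linear(dim, n_classes),
        )

    def forward(self, u, v):
        # three matching functions: concatenation, |u - v|, u * v
        features = torch.cat([u, v, torch.abs(u - v), u * v], dim=-1)
        return torch.softmax(self.fc(features), dim=-1)  # 3-way relation prediction
```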
3. The semantic word graph encoder based on a graph convolutional neural network (GCN) is shown in FIG. 6. The GCN used is a two-layer neural network in which each node's embedding incorporates the attribute information of its neighboring nodes.
The input of the GCN encoder is a semantic word graph and a pre-training model. Pre-training models here include, but are not limited to, word embeddings, BERT, and the like.
The output of the GCN encoder is a 128-dimensional semantic embedding vector, which serves as the input of the binarization encoder.
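A pure-PyTorch sketch of such a two-layer GCN is given below, following the standard symmetric-normalization formulation of Kipf and Welling; the hidden width and the mean-pooling used to obtain a single graph-level vector are assumptions, since the patent fixes only the 128-dimensional output.

```python
# Sketch of a two-layer graph convolutional encoder for the semantic word graph.
import torch
import torch.nn as nn

class TwoLayerGCN(nn.Module):
    def __init__(self, in_dim, hidden_dim=256, out_dim=128):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hidden_dim)
        self.w2 = nn.Linear(hidden_dim, out_dim)

    def forward(self, x, adj):
        """x: (n_nodes, in_dim) node features from the pre-training model
        (e.g. word embeddings or BERT); adj: (n_nodes, n_nodes) edge weights."""
        a_hat = adj + torch.eye(adj.size(0))        # add self-loops
        d_inv_sqrt = torch.diag(a_hat.sum(dim=1).pow(-0.5))
        a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt    # symmetric normalization
        h = torch.relu(self.w1(a_norm @ x))         # layer 1: neighbor aggregation
        h = self.w2(a_norm @ h)                     # layer 2
        return h.mean(dim=0)                        # 128-dim document embedding
```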
4. In the binarization encoder model, shown in fig. 7, the semantic embedding vector generated by the graph neural network encoder is a dense vector stored as real numbers, a representation that is not suitable for processing by a traditional full-text retrieval engine.
The goal of the binarization encoder is to convert the real-valued semantic embedding vector into binary form while keeping the semantic information substantially unchanged, with little loss; each vector then requires only 128 or 256 bits.
The model is based on an encoder-decoder architecture and consists of two parts:
the encoder, responsible for binarizing the semantic embedding vector X to generate the binary code Φ(x);
the decoder, responsible for reconstructing the semantic embedding vector Y from the binary code Φ(x).
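The encoder-decoder pair can be sketched as below. The patent does not say how gradients pass through the hard binarization, so the straight-through estimator used here is an assumption, as are the layer shapes.

```python
# Sketch of the binarization encoder-decoder (straight-through training assumed).
import torch
import torch.nn as nn

class BinarizationAutoencoder(nn.Module):
    def __init__(self, dim=128, n_bits=128):
        super().__init__()
        self.encoder = nn.Linear(dim, n_bits)
        self.decoder = nn.Linear(n_bits, dim)

    def forward(self, x):
        soft = torch.tanh(self.encoder(x))
        hard = (soft > 0).float()          # Phi(x): the 0/1 binary code
        # straight-through estimator: forward pass uses the hard bits, gradients
        # flow through the soft activation (an assumption, not in the patent)
        code = soft + (hard - soft).detach()
        y = self.decoder(code)             # reconstructed semantic embedding Y
        return hard, y
```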
5. From binary codes to character feature codes: the binary code must be processed in groups before a full-text engine can store and retrieve it well.
The processing method divides the 128-bit (or 256-bit) binary code into groups of 4 bits, giving 32 (or 64) groups.
Each group is assigned a subscript from 1 to 32 (or 64) in order. Each group's value ranges from 0 to 15 and is encoded in hexadecimal form.
Thus the character encoding format is "subscript-hexadecimal value", e.g., 1-0F, 32-00.
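This grouping is mechanical and can be written directly; only the separator character is an assumption (the patent says "a specific symbol" without naming one):

```python
# Turn a 128-bit (or 256-bit) binary code into character feature codes of the
# form "<group subscript>-<hex value>", e.g. "1-0F" or "32-00".
def to_character_features(bits, group_size=4, sep=" "):
    """bits: sequence of 0/1 ints whose length is a multiple of group_size."""
    assert len(bits) % group_size == 0
    features = []
    for i in range(0, len(bits), group_size):
        value = int("".join(str(b) for b in bits[i:i + group_size]), 2)
        features.append(f"{i // group_size + 1}-{value:02X}")  # subscript-hex
    return sep.join(features)

# Example: the first two groups of [0,0,0,0, 1,1,1,1, ...] encode as "1-00 2-0F".
```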
6. Storing the character feature codes: the character feature codes obtained after encoding can be treated as a passage of character feature text; the individual character features are separated by a designated symbol, stored in the full-text retrieval engine as a feature field of the input document, and each feature code is treated as an independent word and indexed separately.
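A sketch of this storage step against Elasticsearch follows (Elasticsearch is mentioned in the background; the patent itself names no engine). The index name, field names, placeholder bits, and the use of the whitespace analyzer to keep each feature code as an independent token are all assumptions.

```python
# Hypothetical indexing sketch, assuming the official elasticsearch Python client.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
es.indices.create(index="documents", body={
    "mappings": {"properties": {
        "title":    {"type": "text"},
        # whitespace analyzer indexes every "<subscript>-<hex>" code as one term
        "semantic": {"type": "text", "analyzer": "whitespace"},
    }},
})

doc_bits = [0, 1] * 64  # placeholder 128-bit code from the binarization encoder
es.index(index="documents", id="doc-1", body={
    "title": "box-cover-type horizontal plate-shaped parallel transistor device",
    "semantic": to_character_features(doc_bits),  # from the grouping sketch above
})
```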
7. Fast semantic retrieval based on character feature codes
Fast semantic retrieval based on character feature codes works as follows: first generate a semantic word graph for the query document and input it into the neural network encoder model to generate character feature codes; then construct a semantic query statement combined with any other retrieval conditions and submit it to the full-text retrieval engine to obtain the retrieval results. The specific flow, shown in fig. 8, includes:
and inputting a document to be queried.
And (4) executing semantic word graph construction on the document to be queried according to the method.
And inputting the constructed semantic word graph into the (trained) deep neural network model.
And after the deep neural network model is processed, the semantic vector character features of the document are coded.
And (3) encoding the encoded semantic vector character features, and constructing a semantic query statement by combining the attachment retrieval conditions (such as merging semantic query conditions).
And comparing the semantic query statement with data in the full-text retrieval database.
And inputting a semantic retrieval result based on the comparison in the full-text retrieval database.
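Continuing the Elasticsearch sketch above, the query side can be written as follows. Matching the query's feature codes against the indexed field makes the engine score documents by how many 4-bit groups they share with the query, which approximates Hamming similarity over the binary codes; the plain match query and the empty filter slot for extra conditions are assumptions.

```python
# Hypothetical retrieval sketch, continuing the indexing example above.
query_bits = [1, 0] * 64  # placeholder 128-bit code produced for the query document
query_codes = to_character_features(query_bits)

results = es.search(index="documents", body={
    "query": {"bool": {
        "must":   [{"match": {"semantic": query_codes}}],
        "filter": [],   # merge any additional retrieval conditions here
    }},
    "size": 10,
})
for hit in results["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```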
As shown in fig. 9, the present embodiment further provides a graph neural network-based document processing apparatus 200, where the document processing apparatus 200 includes:
a semantic word graph generating unit 210 for extracting a set of keywords representing document semantics from a document, calculating the context co-occurrence relations between the keywords, and generating a semantic word graph of the document;
an encoding unit for sequentially inputting the semantic word graph into the graph convolutional neural network and the binarization encoder, converting the semantic word graph of the document into a deep semantic vector and a binarized vector code;
the character feature document processing unit 230 is configured to group the binarized vector codes to generate a group of character feature strings, store the obtained character feature strings as one character feature document in the full-text search engine, and create an index of the document in the full-text search engine.
It should be noted that each module of the document processing apparatus 200 provided in this embodiment can perform the functions described for the graph neural network-based document processing method of figs. 1 to 8; for the specific processes and technical effects, refer to the description above, which is not repeated here.
As shown in fig. 10, the present embodiment further provides an electronic device 300, where the electronic device 300 includes: memory 310, processor 320, and computer programs; wherein the computer program is stored in the memory 310 and configured to be executed by the processor 320 to implement any of the graph neural network-based document processing methods provided above.
In addition, the present embodiment also provides a nonvolatile computer storage medium on which a computer program is stored; the computer program is executed by a processor to implement any of the graph neural network-based document processing methods provided above.
Those of ordinary skill in the art will understand that the above method according to an embodiment of the invention may be implemented in hardware or firmware, or as software or computer code that can be stored in a recording medium such as a CD-ROM, a RAM, a floppy disk, a hard disk, or a magneto-optical disk, or as computer code originally stored in a remote recording medium or a non-transitory machine-readable medium and downloaded through a network to be stored in a local recording medium, so that the method described here can be processed by such software on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware such as an ASIC, an FPGA, or an SoC. It will be appreciated that the computer, processor, microprocessor controller, or programmable hardware includes memory components (e.g., RAM, ROM, flash memory, and so on) that can store or receive software or computer code that, when accessed and executed by the computer, processor, or hardware, implements the processing methods described here. Further, when a general-purpose computer accesses code for implementing the processing shown here, executing that code transforms the general-purpose computer into a special-purpose computer that performs the processing shown here.
Those of ordinary skill in the art will appreciate that the various illustrative elements and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present embodiments.
Finally, it should be understood that the above description is only a preferred embodiment of the invention and is not intended to limit it in any way. Those skilled in the art may make many changes and simple substitutions to the technical solution of the invention without departing from it; the technical solution of the invention is protected by the following claims.

Claims (10)

1. A document processing method based on a graph neural network, characterized by comprising the following steps:
extracting a group of keywords representing document semantics from the document, calculating context co-occurrence relations among the keywords, and generating a semantic word graph of the document;
inputting the semantic word graph successively into a graph convolutional neural network and a binarization encoder, converting the semantic word graph of the document into a deep semantic vector and a binarized vector code;
grouping the binary vector codes to generate a group of character feature strings, storing the obtained character feature strings as a character feature document in a full-text search engine, and establishing an index of the document in the full-text search engine.
2. The method of claim 1, wherein extracting a set of keywords representing document semantics from the document comprises: matching the words appearing in the document against a word library corresponding to a dictionary, computing each word's weight from at least one of its part of speech, occurrence count, position, and historical document frequency, and selecting the several highest-weighted words as keywords.
3. The method of claim 2, wherein calculating the context co-occurrence relations between the keywords and generating the semantic word graph of the document comprises: taking the extracted keywords as the nodes of the semantic word graph, and constructing edges between nodes through context-adjacency or window co-occurrence relations between keywords; for short texts, edges are established between nodes of adjacent words; for long texts, edges are established between nodes of words that co-occur within a fixed context window.
4. The method of claim 1, further comprising: training the neural networks corresponding to the graph convolution and the binarization encoder, wherein the training process comprises: the semantic word graphs of the query document and the comparison document are respectively fed into a semantic word graph encoder based on a graph convolutional neural network, and the semantic vectors generated by the semantic word graph encoder are passed into the binarization encoder; the binarization encoder comprises an encoder and a decoder, the encoder binarizing the semantic vector X generated by the graph convolutional neural network to produce Φ(X) and generate the binary code, and the decoder reconstructing a semantic embedding vector Y from Φ(X); and the semantic embedding vectors Y reconstructed for the query document and the comparison document are used to predict the semantic relation between the document pair through matching functions.
5. The method of claim 4, wherein inputting the semantic word graph successively into the graph convolutional neural network and converting it into the deep semantic vector comprises: the graph convolutional neural network comprises two layers, and each node's vector incorporates the attribute information of its neighboring nodes; the semantic word graph encoder takes as input a semantic word graph and a pre-training model; and the binarization encoder converts the real-valued semantic embedding vector into binary form while keeping the semantic information essentially unchanged, with at most a small loss.
6. The method of claim 1, wherein storing the obtained character feature strings as a character feature document in the full-text search engine and establishing an index of the document in the full-text search engine specifically comprises: the character feature codes obtained after encoding are treated as a passage of character feature text, the individual character features are separated by a designated symbol and stored in the full-text retrieval engine as a feature field of the input document, and each feature code is treated as an independent word and indexed separately.
7. The method of any one of claims 1-6, further comprising: when the full-text retrieval engine performs semantic retrieval based on the character feature codes, first generating a semantic word graph for the query document and inputting it into the neural network encoder model to generate character feature codes, then constructing a semantic query statement and submitting it to the full-text retrieval engine to obtain the retrieval results.
8. A graph neural network-based document processing apparatus, comprising:
a semantic word graph generating unit for extracting a group of keywords representing document semantics from the document, calculating the context co-occurrence relations between the keywords, and generating a semantic word graph of the document;
an encoding unit for sequentially inputting the semantic word graph into the graph convolutional neural network and the binarization encoder, converting the semantic word graph of the document into a deep semantic vector and a binarized vector code;
and a character feature document processing unit for grouping the binarized vector codes to generate a group of character feature strings, storing the obtained character feature strings as a character feature document in a full-text search engine, and establishing an index of the document in the full-text search engine.
9. An electronic device, comprising: a memory, a processor, and a computer program; wherein the computer program is stored in the memory and configured to be executed by the processor to implement the graph neural network-based document processing method of any one of claims 1-7.
10. A non-volatile computer storage medium having a computer program stored thereon; the computer program is executed by a processor to implement the graph neural network-based document processing method of any one of claims 1-7.
CN202010916293.3A 2020-09-03 2020-09-03 File processing method, device and storage medium based on graphic neural network Active CN112214993B (en)

Priority Applications (1)

CN202010916293.3A · Priority date: 2020-09-03 · Filing date: 2020-09-03 · Title: File processing method, device and storage medium based on graphic neural network

Applications Claiming Priority (1)

CN202010916293.3A · Priority date: 2020-09-03 · Filing date: 2020-09-03 · Title: File processing method, device and storage medium based on graphic neural network

Publications (2)

Publication Number Publication Date
CN112214993A true CN112214993A (en) 2021-01-12
CN112214993B CN112214993B (en) 2024-02-06

Family

ID=74049139

Family Applications (1)

CN202010916293.3A · File processing method, device and storage medium based on graphic neural network · Active · CN112214993B (en)

Country Status (1)

Country Link
CN (1) CN112214993B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103838833A (en) * 2014-02-24 2014-06-04 华中师范大学 Full-text retrieval system based on semantic analysis of relevant words
US20200065389A1 (en) * 2017-10-10 2020-02-27 Tencent Technology (Shenzhen) Company Limited Semantic analysis method and apparatus, and storage medium
CN110222160A (en) * 2019-05-06 2019-09-10 平安科技(深圳)有限公司 Intelligent semantic document recommendation method, device and computer readable storage medium
CN110705260A (en) * 2019-09-24 2020-01-17 北京工商大学 Text vector generation method based on unsupervised graph neural network structure

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Liang Yao et al., "Graph Convolutional Networks for Text Classification", arXiv, pages 1-9 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113158643A (en) * 2021-04-27 2021-07-23 广东外语外贸大学 Novel text readability assessment method and system
CN113158643B (en) * 2021-04-27 2024-05-28 广东外语外贸大学 Novel text readability evaluation method and system
CN113282726A (en) * 2021-05-27 2021-08-20 成都数之联科技有限公司 Data processing method, system, device, medium and data analysis method
CN117496542A (en) * 2023-12-29 2024-02-02 恒生电子股份有限公司 Document information extraction method, device, electronic equipment and storage medium
CN117496542B (en) * 2023-12-29 2024-03-15 恒生电子股份有限公司 Document information extraction method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112214993B (en) 2024-02-06

Similar Documents

Publication Publication Date Title
CN108829722B (en) Remote supervision Dual-Attention relation classification method and system
CN109582789B (en) Text multi-label classification method based on semantic unit information
CN111309971B (en) Multi-level coding-based text-to-video cross-modal retrieval method
CN110275936B (en) Similar legal case retrieval method based on self-coding neural network
CN112214993B (en) File processing method, device and storage medium based on graphic neural network
CN110688854B (en) Named entity recognition method, device and computer readable storage medium
CN109359297B (en) Relationship extraction method and system
CN111291188B (en) Intelligent information extraction method and system
CN111639171A (en) Knowledge graph question-answering method and device
CN111858932A (en) Multiple-feature Chinese and English emotion classification method and system based on Transformer
CN111160031A (en) Social media named entity identification method based on affix perception
CN113011189A (en) Method, device and equipment for extracting open entity relationship and storage medium
CN111738007A (en) Chinese named entity identification data enhancement algorithm based on sequence generation countermeasure network
CN110390049B (en) Automatic answer generation method for software development questions
CN113157886B (en) Automatic question and answer generation method, system, terminal and readable storage medium
CN112632224B (en) Case recommendation method and device based on case knowledge graph and electronic equipment
CN112380319A (en) Model training method and related device
CN113569001A (en) Text processing method and device, computer equipment and computer readable storage medium
CN113190656A (en) Chinese named entity extraction method based on multi-label framework and fusion features
CN114298287A (en) Knowledge distillation-based prediction method and device, electronic equipment and storage medium
CN116303977B (en) Question-answering method and system based on feature classification
CN113656561A (en) Entity word recognition method, apparatus, device, storage medium and program product
CN111814477A (en) Dispute focus discovery method and device based on dispute focus entity and terminal
CN113705191A (en) Method, device and equipment for generating sample statement and storage medium
CN117609421A (en) Electric power professional knowledge intelligent question-answering system construction method based on large language model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant