CN112214993B - Document processing method, device and storage medium based on graph neural network - Google Patents

Document processing method, device and storage medium based on graph neural network

Info

Publication number
CN112214993B
Authority
CN
China
Prior art keywords
semantic
document
graph
neural network
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010916293.3A
Other languages
Chinese (zh)
Other versions
CN112214993A (en)
Inventor
王洪俊
肖诗斌
施水才
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tols Information Technology Co ltd
Original Assignee
Tols Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tols Information Technology Co ltd filed Critical Tols Information Technology Co ltd
Priority to CN202010916293.3A priority Critical patent/CN112214993B/en
Publication of CN112214993A publication Critical patent/CN112214993A/en
Application granted granted Critical
Publication of CN112214993B publication Critical patent/CN112214993B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention belongs to the technical field of document processing and retrieval, and aims to solve the technical problem that traditional keyword retrieval cannot mine the semantic relations among words, sentences and documents, resulting in poor retrieval effectiveness. The invention relates to a document processing method, device, electronic device and non-volatile computer storage medium based on a graph neural network. The method uses a graph neural network trained with supervised learning to generate a deep semantic vector from a semantic word graph, applies a binarization encoder to convert the semantic vector into binary-coded form, then generates character feature codes and constructs an inverted index. During retrieval, processed documents support high-performance search and semantic matching based on the character feature index, effectively improving the relevance of semantic retrieval results.

Description

Document processing method, device and storage medium based on graph neural network
Technical Field
The present invention relates to the field of document processing and retrieval technologies, and in particular, to a document processing method and apparatus based on a graph neural network, an electronic device, and a non-volatile computer storage medium.
Background
Currently, deep learning techniques have made great progress in the field of information retrieval. Models such as Word2vec can capture the semantic relations among different words, effectively solving the problem of matching synonyms; the BERT model can perform better semantic encoding and retrieval on sentence-level and paragraph-level texts (within a thousand words).
The open-source Elasticsearch retrieval system has built-in semantic indexing and retrieval technology: it can encode continuous text with encoder models such as word2vec averaging, LSA, InferSent, Universal Sentence Encoder, ELMo and BERT to form dense vectors, then encode the dense vectors into character string form, converting dense-vector similarity computation into a character string retrieval problem, so that deep dense-vector retrieval can be realized with a traditional engine.
For example, Chinese patent publication No. CN101576904A discloses a system and method for calculating the similarity of text contents based on a weighted graph. The system comprises an input unit for inputting the document set whose similarities need to be calculated; a construction unit for constructing the weighted graph; a calculation unit for calculating the similarity between any two nodes according to the weighted graph obtained by the construction unit; and an output unit for returning the similarity results to the user. However, that invention requires constructing semantic relationships between documents from a document collection, and does not consider or resolve the semantic relationships between synonyms.
Although deep learning can perform better semantic encoding and retrieval on words, sentences and paragraphs, current deep neural network technology still cannot solve the encoding and retrieval of long texts such as patents and papers. Directly applying techniques such as word-vector averaging or BERT encoding to long text does yield a dense vector representation, but the effect is poor.
Disclosure of Invention
The invention aims to solve the technical problem that existing keyword retrieval technology cannot mine the semantic relations among words, sentences and documents and has poor retrieval effectiveness. With the document processing method, device, electronic device and non-volatile computer storage medium based on a graph neural network according to the invention, processed documents support high-performance retrieval and semantic matching based on the character feature index during retrieval, effectively improving the relevance of semantic retrieval results.
The first aspect of the present invention provides a document processing method based on a graph neural network, which is characterized by comprising:
extracting a group of keywords representing the semantics of the document from the document, calculating the context co-occurrence relationship among the keywords, and generating a semantic word graph of the document;
the semantic word graph is sequentially input to a graph convolutional neural network and a binarization encoder, and the semantic word graph of the document is converted into a deep semantic vector and a binary vector code;
grouping the binary vector code to generate a set of character feature strings, storing the obtained character feature strings as a character feature document in a full-text search engine, and establishing an index of the document in the full-text search engine.
In a preferred implementation of the embodiment of the present invention, extracting a set of keywords representing the semantics of the document from the document includes: based on a dictionary and its corresponding word library, finding the words appearing in the document, calculating the weight of each word according to at least one of its part of speech, number of occurrences, position and historical document frequency, and selecting the several words with the highest weights as keywords.
In a further preferred implementation of the embodiment of the present invention, calculating the context co-occurrence relations among the keywords and generating the semantic word graph of the document includes: using the extracted keywords as the nodes of the semantic word graph, and constructing edges between nodes from the context adjacency or window co-occurrence relations among the keywords; for short text, edges are established between the nodes of adjacent words; for long text, edges are established between the nodes of words that appear within a fixed context window.
In a preferred implementation of the embodiment of the present invention, the method further includes training the neural networks corresponding to the graph convolutional neural network and the binarization encoder, wherein the training process includes: the semantic word graphs of the query document and the comparison document are respectively passed into semantic word graph encoders based on graph convolutional neural networks, and the semantic vectors generated by the semantic word graph encoders are passed into the binarization encoder; the binarization encoder comprises an encoder and a decoder, where the encoder binarizes the semantic vector X generated by the graph convolutional neural network to produce the binary code Φ(X), and the decoder is responsible for reconstructing a semantic embedded vector Y from Φ(X); the semantic embedded vectors Y reconstructed for the query document and the comparison document are then used to predict the semantic relationship between the document pair through a matching function.
In a further preferred implementation of the embodiment of the present invention, sequentially inputting the semantic word graph into the graph convolutional neural network and converting it into a deep semantic vector includes: the graph convolutional neural network comprises two neural network layers, and each node's vector is derived from the attribute information of its neighboring nodes; the semantic word graph encoder takes as input a semantic word graph and a pre-trained model; the binarization encoder is configured to convert the real-valued semantic embedded vector into binary-coded form with little or no loss of semantic information.
In a preferred implementation of the embodiment of the present invention, storing the obtained character feature strings as a character feature document in a full-text search engine and establishing an index of the document in the full-text search engine specifically includes: the character feature codes obtained after encoding are treated as a segment of character feature text, each character feature is separated by a specific symbol and stored as a feature field of the input document in the full-text search engine, and each feature code is indexed as an independent word.
In a preferred implementation of the embodiment of the present invention, the method further includes: when performing semantic retrieval based on character feature codes in the full-text search engine, first generating a semantic word graph for the query document and inputting it into the neural network encoder model to generate character feature codes, then constructing a semantic query statement and submitting it to the full-text search engine to obtain retrieval results.
The second aspect of the present invention also provides a document processing device based on a graph neural network, characterized by comprising:
a semantic word graph generating part for extracting a group of keywords representing the semantics of the document from the document, calculating the context co-occurrence relationship among the keywords, and generating a semantic word graph of the document;
a graph convolutional neural network and a binarization encoder, into which the semantic word graph is sequentially input, converting the semantic word graph of the document into a deep semantic vector and a binary vector code;
and a character feature document processing part for grouping the binary vector code to generate a set of character feature strings, storing the obtained character feature strings as a character feature document in a full-text search engine, and establishing an index of the document in the full-text search engine.
The third aspect of the present invention also provides an electronic device, characterized by comprising: a memory, a processor, and a computer program; wherein the computer program is stored in the memory and configured to be executed by the processor to implement any one of the graph neural network-based document processing methods provided in the first aspect.
The fourth aspect of the present invention also provides a nonvolatile computer storage medium, characterized in that a computer program is stored thereon; the computer program is executed by a processor to implement any one of the graph neural network-based document processing methods as provided in the first aspect.
The invention adopts a graph neural network trained with supervised learning to generate a deep semantic vector from a semantic word graph, applies a binarization encoder to convert the semantic vector into binary-coded form, then generates character feature codes, and constructs an inverted index; during retrieval, processed documents support high-performance search and semantic matching based on the character feature index, effectively improving the relevance of semantic retrieval results.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention may be realized and attained by the structure and/or process particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
Fig. 1 is a flowchart corresponding to a document processing method based on a graph neural network according to an embodiment of the present invention.
Fig. 2 is a flowchart corresponding to another document processing method based on a graph neural network according to an embodiment of the present invention.
Fig. 3 is a system architecture diagram for semantic retrieval based on deep learning according to an embodiment of the present invention.
Fig. 4 is a semantic word graph corresponding to a use case document in a document processing method based on a graph neural network according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of a model training process of a deep neural network according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of a graph convolutional neural network according to an embodiment of the present invention.
Fig. 7 is a schematic diagram of a binarization encoder according to an embodiment of the present invention.
Fig. 8 is a flowchart of a quick semantic search based on character feature codes according to an embodiment of the present invention.
Fig. 9 is a structural block diagram corresponding to a document processing device based on a graph neural network according to an embodiment of the present invention.
Fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the invention.
Detailed Description
Embodiments of the present invention are described in detail below with reference to the drawings and examples, so that the way the invention applies technical means to solve the technical problems and achieve the technical effects can be fully understood and implemented. It should be noted that these specific descriptions are only intended to make the present invention easy for those skilled in the art to understand clearly, and are not meant to be limiting; as long as no conflict arises, the embodiments of the present invention and the features of each embodiment may be combined with each other, and the resulting technical solutions all fall within the protection scope of the present invention.
Additionally, the steps illustrated in the flowcharts of the figures may be performed in a computer system, such as a set of computer-executable instructions, and although a logical order is illustrated in the flowcharts, in some cases the steps shown or described may be performed in an order different from that herein.
The following describes the technical scheme of the invention in detail through the attached drawings and specific embodiments:
Embodiment
As shown in fig. 1, the present embodiment provides a document processing method based on a graph neural network, the document processing method including:
s110, extracting a group of keywords representing the semantics of the document from the document, calculating the context co-occurrence relationship among the keywords, and generating a semantic word graph of the document;
s120, sequentially inputting the semantic word graphs into a graph convolution neural network and a binary encoder, and converting the semantic word graphs of the documents into deep semantic vectors and binary vector codes;
s130, grouping the binary vector codes to generate a group of character feature strings, storing the obtained character feature strings serving as a character feature document into a full-text search engine, and establishing an index of the document in the full-text search engine.
In a preferred implementation of this embodiment, in S110, extracting a set of keywords representing the semantics of the document from the document includes: based on a dictionary and its corresponding word library, finding the words appearing in the document, calculating the weight of each word according to at least one of its part of speech, number of occurrences, position and historical document frequency, and selecting the several words with the highest weights as keywords.
In a further preferred implementation of this embodiment, in S110, calculating the context co-occurrence relations among the keywords and generating the semantic word graph of the document includes: using the extracted keywords as the nodes of the semantic word graph, and constructing edges between nodes from the context adjacency or window co-occurrence relations among the keywords; for short text, edges are established between the nodes of adjacent words; for long text, edges are established between the nodes of words that appear within a fixed context window.
Therefore, in the technical scheme provided by this embodiment, a graph neural network trained with supervised learning generates a deep semantic vector from the semantic word graph, a binarization encoder converts the semantic vector into binary-coded form, character feature codes are then generated, and an inverted index is constructed; during retrieval, processed documents support high-performance search and semantic matching based on the character feature index, effectively improving the relevance of semantic retrieval results.
As shown in fig. 2, the present embodiment provides a document processing method based on a graph neural network, which includes, in addition to S110, S120, S130 mentioned in the document processing method corresponding to fig. 1:
s140, training the graph convolution neural network and the neural network corresponding to the binary encoder, wherein the training process comprises the following steps: semantic word graphs between the query document and the comparison document are respectively transmitted into a semantic word graph encoder based on a graph convolution neural network, and semantic vectors generated by the semantic word graph encoder are transmitted into a binarization encoder; the binarization encoder comprises an encoder and a decoder, wherein the encoder is used for binarizing a semantic vector X generated by the graph convolution neural network to generate phi (X), generating a binarization code and reconstructing a semantic embedded vector Y from the phi (X); and reconstructing a semantic embedded vector Y based on the query document and the comparison document, and predicting the semantic relationship between the document pairs through a matching function.
In a preferred embodiment, sequentially inputting the semantic word graph into the graph convolutional neural network and converting it into a deep semantic vector includes: the graph convolutional neural network comprises two neural network layers, and each node's vector is derived from the attribute information of its neighboring nodes; the semantic word graph encoder takes as input a semantic word graph and a pre-trained model; the binarization encoder is configured to convert the real-valued semantic embedded vector into binary-coded form with little or no loss of semantic information.
In a preferred embodiment, storing the obtained character feature strings as a character feature document in a full-text search engine and building an index of the document in the full-text search engine specifically includes: the character feature codes obtained after encoding are treated as a segment of character feature text, each character feature is separated by a specific symbol and stored as a feature field of the input document in the full-text search engine, and each feature code is indexed as an independent word.
In addition, the document processing method based on the graph neural network provided in this embodiment mainly improves document processing; it may also improve the retrieval performed subsequently, for example:
in a preferred implementation of this embodiment, the method further includes: when performing semantic retrieval based on character feature codes in the full-text retrieval engine, first generating a semantic word graph for the query document and inputting it into the neural network encoder model to generate character feature codes, then constructing a semantic query statement and submitting it to the full-text retrieval engine to obtain retrieval results.
Because the preceding document processing based on the graph neural network corresponding to figs. 1 and 2 converts the semantic vector into binary-coded form, generates character feature codes, and constructs an inverted index, high-performance retrieval and semantic matching can be performed based on the character feature index during retrieval, effectively improving the relevance of semantic retrieval results.
In order to make it easier for the person skilled in the art to understand the technical solution of the present embodiment, the above document processing method based on the graph neural network is further explained below with reference to fig. 3 to 8 in conjunction with the specific embodiment.
As shown in fig. 3, the system architecture for deep learning-based semantic retrieval provided in this embodiment includes:
the document library 110 to be searched and the query document 120, both input to the semantic word graph construction module 130, and the deep neural network model 140 connected with the semantic word graph construction module 130, where the deep neural network model 140 also receives a labeled relevant-document training set 150; the deep neural network model 140 processes the contents of the document library 110 to be searched and the query document 120, inputs the processed contents to the corresponding document semantic vector character feature encoding modules 160 and 170, and then performs retrieval against the full-text retrieval database 180 and outputs the results through the deep semantic retrieval module 190. More specifically:
1. Generating the semantic word graph of a document. 1) First, extract a group of keywords representing the semantics of the document from the document;
the keyword extraction method comprises the steps of firstly preparing a large dictionary, then matching words appearing in the documents in the dictionary, calculating word weights according to the information such as word parts, the number of occurrences, the position, the historical document frequency and the like of the words, and selecting TOP N words with highest weights as keywords.
For short texts such as titles, at most 10 keywords may be selected; for texts such as abstracts, 25 keywords; for long texts, 50 keywords.
The following example is a patent abstract text and its keywords:
Keywords:
horizontal plate shape; parallel transistor; bottom plate; insulating elastic block; upper cover plate; vertical support plate; connecting block; outer sidewall; inner sidewall; clamping ball plunger; transistor; clamping block; steel ball; concave hole; bottom surface; top surface; fix; nest; two sides; middle part; box cover; disassemble; cling to
Abstract text:
the invention discloses a box cover type horizontal platy parallel transistor device which comprises a connecting bottom plate, wherein vertical supporting plates are fixed in the middle of two sides of the top surface of the connecting bottom plate, a plurality of lower insulating elastic blocks are fixed on the connecting bottom plate between the two vertical supporting plates, connecting blocks are fixed in the middle of the outer side walls of the two vertical supporting plates, clamping ball plungers are connected to the outer side walls of the connecting blocks in a threaded mode, clamping blocks are arranged on two sides of the bottom surface of an upper cover plate, the clamping blocks are clung to the outer side walls of the connecting blocks, steel balls for clamping the ball plungers are nested in concave holes formed in the inner side walls of the clamping blocks, the upper cover plate is positioned above all the lower insulating elastic blocks and the two vertical supporting plates, and upper insulating elastic blocks corresponding to the lower insulating elastic blocks are fixed on the bottom surface of the upper cover plate. The invention can be horizontally placed at the position to be placed, meets the placement requirement, and can be used for installing or replacing the transistor by only pulling the upper cover plate upwards, and is convenient to install and detach.
2) Calculate the context co-occurrence relations among the keywords and generate the semantic word graph of the document.
The semantic word graph is generated as follows: first, the extracted keywords are used as the nodes of the semantic word graph; then edges between nodes are constructed from the context adjacency or window co-occurrence relations among the keywords:
for short text, edges are established between the nodes of adjacent words;
for long text, edges are established between the nodes of words that appear within a fixed context window; the window size is preferably 3-5. A concrete example is shown in fig. 4.
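A minimal sketch of this graph construction, assuming networkx for the graph structure and co-occurrence counts as edge weights (the text specifies only the nodes, the adjacency/window edges, and the 3-5 window size); with window=2 the same routine reduces to the adjacent-word case for short text.

```python
# Sketch of semantic word-graph construction: keywords become nodes; an edge
# links two keywords that co-occur within a sliding context window.
import networkx as nx

def build_word_graph(tokens, keywords, window=4):   # window of 3-5 per the text
    kw = set(keywords)
    g = nx.Graph()
    g.add_nodes_from(kw)
    for i, w in enumerate(tokens):
        if w not in kw:
            continue
        for v in tokens[i + 1 : i + window]:        # words inside the window
            if v in kw and v != w:
                prev = g.get_edge_data(w, v, default={"weight": 0})["weight"]
                g.add_edge(w, v, weight=prev + 1)   # accumulate co-occurrence count
    return g
```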
2. The model training process of the deep neural network: the goal of model training is to learn a document semantic encoding model that can map a semantic word graph to a fixed-length vector representation. This vector is an encoding of the document's semantic information and can be used to calculate the similarity between documents or to search for similar documents.
The whole neural network is a supervised learning architecture oriented to text relevance analysis tasks; the input training corpus consists of document pairs labeled as semantically related or unrelated.
The neural network includes a semantic word graph encoder based on a graph convolutional network (GCN) and a binarization encoder based on an autoencoder structure.
The whole training process, as shown in fig. 5, includes:
1) The semantic word graphs of the query document and the comparison document are respectively passed into parameter-sharing semantic word graph encoders based on graph convolutional networks (GCNs), and the semantic vectors (128- or 256-dimensional) generated by the encoders are passed into a binarization encoder;
2) The binarization encoder comprises an encoder and a decoder: the encoder is responsible for binarizing the semantic vector X generated by the GCN to produce Φ(X), the final binary code; the decoder is responsible for reconstructing the semantic embedded vector Y from Φ(X).
3) The semantic vectors Y reconstructed by the binarization encoder for the query document and the comparison document are denoted U and V respectively. U and V pass through three matching functions: concatenation, element-wise vector difference (absolute value), and element-wise vector product. The result then passes through a fully connected layer to a prediction function (e.g., a 3-way softmax) that predicts the semantic relationship between the document pair.
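A sketch of this matching layer, assuming PyTorch; the single fully connected layer and its width are assumptions, while the three matching functions and the 3-way softmax follow the description above.

```python
# Sketch of the matching layer: combine U and V by concatenation, element-wise
# absolute difference, and element-wise product, then predict the
# document-pair relationship with a 3-way softmax.
import torch
import torch.nn as nn

class MatchingHead(nn.Module):
    def __init__(self, dim=128, num_classes=3):
        super().__init__()
        self.fc = nn.Linear(dim * 4, num_classes)   # input is [U; V; |U-V|; U*V]

    def forward(self, u, v):
        feats = torch.cat([u, v, torch.abs(u - v), u * v], dim=-1)
        return torch.softmax(self.fc(feats), dim=-1)
```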
3. Semantic word graph encoder based on a graph convolutional network (GCN): as shown in fig. 6, the GCN is a two-layer neural network in which each node's embedding is derived from the attribute information of its neighboring nodes.
The input of the GCN encoder is a semantic word graph and a pre-trained model. Pre-trained models here include, but are not limited to, word embeddings, BERT, and the like.
The output of the GCN encoder is a 128-dimensional semantic embedded vector, which serves as the input to the binarization encoder.
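A minimal sketch of such a two-layer GCN encoder, assuming the standard propagation rule over a normalized adjacency matrix of the semantic word graph; the hidden width and the mean-pooling readout to a single document vector are assumptions not fixed by the text.

```python
# Sketch of the two-layer GCN encoder: node features (e.g. pre-trained word
# embeddings) are propagated over the normalized adjacency matrix of the
# semantic word graph, then pooled into one 128-dimensional document vector.
import torch
import torch.nn as nn

class TwoLayerGCN(nn.Module):
    def __init__(self, in_dim, hidden_dim=256, out_dim=128):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hidden_dim)
        self.w2 = nn.Linear(hidden_dim, out_dim)

    def forward(self, x, adj_norm):
        # x: [num_nodes, in_dim] node features; adj_norm: normalized adjacency
        h = torch.relu(self.w1(adj_norm @ x))   # layer 1: aggregate neighbor info
        h = self.w2(adj_norm @ h)               # layer 2
        return h.mean(dim=0)                    # pool nodes into a document vector
```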
4. The binarization encoder model: as shown in fig. 7, the semantic embedded vectors generated by the graph neural network encoder are dense, real-valued vector representations, which are not suitable for processing by conventional full-text search engines.
The goal of the binarization encoder is to convert the real-valued semantic embedded vectors into binary-coded form, with little or no loss of semantic information, requiring only 128 or 256 bits per vector.
The model is based on an encoder-decoder architecture, consisting of two parts:
the encoder is responsible for binarizing the semantic embedded vector X to generate the binary code Φ(X);
the decoder is responsible for reconstructing the semantic embedded vector Y from the binary code Φ(X).
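A sketch of this encoder-decoder pair, assuming a straight-through estimator to train through the non-differentiable binarization step; the text specifies only the X → Φ(X) → Y structure and the 128/256-bit code size.

```python
# Sketch of the binarization encoder: the encoder maps the real-valued
# semantic vector X to a binary code phi(X); the decoder reconstructs a
# semantic embedded vector Y from that code.
import torch
import torch.nn as nn

class BinaryAutoencoder(nn.Module):
    def __init__(self, dim=128, code_bits=128):
        super().__init__()
        self.enc = nn.Linear(dim, code_bits)
        self.dec = nn.Linear(code_bits, dim)

    def binarize(self, z):
        hard = (z > 0).float()                   # 0/1 bits used for indexing
        soft = torch.sigmoid(z)                  # differentiable surrogate
        return soft + (hard - soft).detach()     # straight-through estimator

    def forward(self, x):
        code = self.binarize(self.enc(x))        # phi(X): 128 or 256 bits
        y = self.dec(code)                       # reconstructed vector Y
        return code, y
```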
5. Generating character feature codes from the binary encoding: after grouping, the binary codes can be better stored and retrieved by the full-text engine.
The processing method divides the 128-bit (or 256-bit) binary code into groups of 4 bits, yielding 32 (or 64) groups in total.
Each group is assigned an index value from 1 to 32 (or 64) in order. Each group's value ranges from 0 to 15 and is encoded in hexadecimal form.
Thus, each character code takes the form index + hexadecimal value, such as 1-0F or 32-00.
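The grouping rule can be stated directly in code; the two-hex-digit rendering is taken from the 1-0F / 32-00 examples above.

```python
# Split a 128-bit binary code into 32 groups of 4 bits and render each group
# as "<index>-<hex value>", e.g. "1-0F" ... "32-00".
def code_to_features(bits):
    assert len(bits) % 4 == 0, "expects a 128- or 256-bit code"
    out = []
    for i in range(0, len(bits), 4):
        value = int("".join(str(b) for b in bits[i:i + 4]), 2)  # 0..15
        out.append(f"{i // 4 + 1}-{value:02X}")
    return out

print(code_to_features([1, 1, 1, 1] + [0] * 124)[:2])  # ['1-0F', '2-00']
```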
6. The character feature codes obtained after encoding can be regarded as a segment of character feature text; each character feature is separated by a specific symbol and stored as a feature field of the input document in the full-text retrieval engine, and each feature code is indexed as an independent word.
7. Quick semantic retrieval based on character feature codes
Quick semantic retrieval based on character feature codes first generates a semantic word graph for the query document and inputs it into the neural network encoder model to generate character feature codes, then constructs a semantic query statement combined with other retrieval conditions and submits it to the full-text retrieval engine to obtain retrieval results. The specific flow, shown in fig. 8, includes the following steps:
and inputting the document to be queried.
And executing the semantic word graph construction of the document to be queried according to the mentioned method.
And inputting the constructed semantic word graph into a (trained) deep neural network model.
And after processing based on the deep neural network model, encoding the semantic vector character characteristics of the document.
And encoding the character characteristics of the encoded semantic vector, and combining the search conditions of the attachment (for example, combining the search conditions of the semantic query) to construct a semantic query statement.
And comparing the semantic query statement with the data in the full-text retrieval database.
Based on the comparison in the full text search database, a semantic search result is input.
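A sketch of the query-construction step; the field name sem_features and the OR/AND query syntax are assumptions, since the text does not fix a query language. Any full-text engine that indexes the feature codes as independent terms will, in effect, rank documents by how many codes they share with the query.

```python
# Build a semantic query: OR together the query document's character feature
# codes, optionally AND-ed with other retrieval conditions.
def build_semantic_query(feature_codes, extra_conditions=None):
    semantic = " OR ".join(f'sem_features:"{c}"' for c in feature_codes)
    if extra_conditions:
        return f"({semantic}) AND ({extra_conditions})"
    return semantic

print(build_semantic_query(["1-0F", "2-07", "32-00"], "date:[2020 TO 2021]"))
```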
As shown in fig. 9, the present embodiment further provides a document processing apparatus 200 based on a graph neural network, the document processing apparatus 200 including:
a semantic word graph generating unit 210 that extracts a set of keywords representing the semantics of a document from the document, calculates the context co-occurrence relationship between the keywords, and generates a semantic word graph of the document;
a graph convolutional neural network and binarization encoder 220, into which the semantic word graph is sequentially input, converting the semantic word graph of the document into a deep semantic vector and a binary vector code;
the character feature document processing part 230 is configured to group binary vector codes to generate a set of character feature strings, store the obtained character feature strings as one character feature document in the full-text search engine, and build an index of the document in the full-text search engine.
It should be noted that each module mentioned in the document processing device 200 provided in this embodiment can perform the functions mentioned in the document processing method based on the graph neural network corresponding to figs. 1 to 8; the specific processes and technical effects can refer to the above description and are not repeated here.
As shown in fig. 10, the present embodiment further provides an electronic device 300, where the electronic device 300 includes: memory 310, processor 320, and computer programs; wherein the computer program is stored in the memory 310 and configured to be executed by the processor 320 to implement any of the graph neural network based document processing methods as provided above.
In addition, the present embodiment also provides a non-volatile computer storage medium having a computer program stored thereon; the computer program is executed by the processor to implement any of the graph neural network-based document processing methods provided above.
Those of ordinary skill in the art will appreciate that: the above-described methods according to embodiments of the present invention may be implemented in hardware, firmware, or as software or computer code storable in a recording medium such as a CD ROM, RAM, floppy disk, hard disk, or magneto-optical disk, or as computer code originally stored in a remote recording medium or a non-transitory machine-readable medium and to be stored in a local recording medium downloaded through a network, so that the methods described herein may be processed by such software on a recording medium using a general purpose computer, a special purpose processor, or programmable or special purpose hardware such as an ASIC, FPGA, or SoC. It is understood that a computer, processor, microprocessor controller, or programmable hardware includes a memory component (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code that, when accessed and executed by the computer, processor, or hardware, implements the processing methods described herein. Further, when the general-purpose computer accesses code for implementing the processes shown herein, execution of the code converts the general-purpose computer into a special-purpose computer for executing the processes shown herein.
Those of ordinary skill in the art will appreciate that the elements and method steps of the examples described in connection with the embodiments disclosed herein can be implemented as electronic hardware, or as a combination of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the embodiments of the present invention.
Finally, it should be noted that the above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way. Any person skilled in the art can make many possible variations and simple substitutions to the technical solution of the present invention by using the methods and technical matters disclosed above without departing from the scope of the technical solution of the present invention, and these all fall into the scope of protection of the technical solution of the present invention.

Claims (9)

1. A document processing method based on a graph neural network, comprising:
extracting a group of keywords representing the semantics of the document from the document, calculating the context co-occurrence relations among the keywords, and generating a semantic word graph of the document, wherein generating the semantic word graph comprises: using the extracted keywords as the nodes of the semantic word graph, and constructing edges between nodes from the context adjacency or window co-occurrence relations among the keywords; for short text, edges are established between the nodes of adjacent words; for long text, edges are established between the nodes of words that appear within a fixed context window;
the semantic word graph is sequentially input to a graph convolutional neural network and a binarization encoder, and the semantic word graph of the document is converted into a deep semantic vector and a binary vector code;
grouping the binary vector code to generate a set of character feature strings, storing the obtained character feature strings as a character feature document in a full-text search engine, and establishing an index of the document in the full-text search engine.
2. The method of claim 1, wherein extracting a set of keywords from the document that represent document semantics comprises: based on the dictionary corresponding word library, searching the words appearing in the document, calculating the weights of the words according to at least one of the word parts of speech, the number of occurrences, the position and the frequency of the historical document, and selecting a plurality of words with the highest weights as keywords.
3. The method as recited in claim 1, further comprising: training the neural networks corresponding to the graph convolutional neural network and the binarization encoder, wherein the training process comprises: the semantic word graphs of the query document and the comparison document are respectively passed into semantic word graph encoders based on graph convolutional neural networks, and the semantic vectors generated by the semantic word graph encoders are passed into the binarization encoder; the binarization encoder comprises an encoder and a decoder, wherein the encoder binarizes the semantic vector X generated by the graph convolutional neural network to produce the binary code Φ(X), and the decoder is responsible for reconstructing a semantic embedded vector Y from Φ(X); and the semantic embedded vectors Y reconstructed for the query document and the comparison document are used to predict the semantic relationship between the document pair through a matching function.
4. A method according to claim 3, wherein the semantic word graph is sequentially input to the graph convolutional neural network and converted into the deep semantic vector, comprising: the graph convolutional neural network comprises two neural network layers, and each node's vector is derived from the attribute information of its neighboring nodes; the semantic word graph encoder takes as input a semantic word graph and a pre-trained model; wherein the binarization encoder is configured to convert the real-valued semantic embedded vector into binary-coded form with little or no loss of semantic information.
5. The method according to claim 1, wherein storing the obtained character feature strings as a character feature document in a full-text search engine and creating an index of the document in the full-text search engine specifically comprises: the character feature codes obtained after encoding are treated as a segment of character feature text, each character feature is separated by a specific symbol and stored as a feature field of the input document in the full-text search engine, and each feature code is indexed as an independent word.
6. The method of any one of claims 1-5, further comprising: when performing semantic retrieval based on character feature codes in the full-text search engine, first generating a semantic word graph for the query document and inputting it into the neural network encoder model to generate character feature codes, then constructing a semantic query statement and submitting it to the full-text search engine to obtain retrieval results.
7. A graph neural network-based document processing apparatus, comprising:
a semantic word graph generating unit for extracting a group of keywords representing the semantics of a document from the document, calculating the context co-occurrence relations among the keywords, and generating the semantic word graph of the document, wherein generating the semantic word graph comprises: using the extracted keywords as the nodes of the semantic word graph, and constructing edges between nodes from the context adjacency or window co-occurrence relations among the keywords; for short text, edges are established between the nodes of adjacent words; for long text, edges are established between the nodes of words that appear within a fixed context window;
a graph convolutional neural network and a binarization encoder, into which the semantic word graph is sequentially input, converting the semantic word graph of the document into a deep semantic vector and a binary vector code;
and a character feature document processing part for grouping the binary vector code to generate a set of character feature strings, storing the obtained character feature strings as a character feature document in a full-text search engine, and establishing an index of the document in the full-text search engine.
8. An electronic device, comprising: a memory, a processor, and a computer program; wherein the computer program is stored in the memory and configured to be executed by the processor to implement the graph neural network based document processing method of any one of claims 1-6.
9. A non-volatile computer storage medium having a computer program stored thereon; the computer program is executed by a processor to implement the graph neural network-based document processing method of any one of claims 1 to 6.
CN202010916293.3A 2020-09-03 2020-09-03 Document processing method, device and storage medium based on graph neural network Active CN112214993B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010916293.3A CN112214993B (en) 2020-09-03 2020-09-03 Document processing method, device and storage medium based on graph neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010916293.3A CN112214993B (en) 2020-09-03 2020-09-03 Document processing method, device and storage medium based on graph neural network

Publications (2)

Publication Number Publication Date
CN112214993A CN112214993A (en) 2021-01-12
CN112214993B true CN112214993B (en) 2024-02-06

Family

ID=74049139

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010916293.3A Active CN112214993B (en) 2020-09-03 2020-09-03 File processing method, device and storage medium based on graphic neural network

Country Status (1)

Country Link
CN (1) CN112214993B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113158643B (en) * 2021-04-27 2024-05-28 广东外语外贸大学 Novel text readability evaluation method and system
CN113282726B (en) * 2021-05-27 2022-05-17 成都数之联科技股份有限公司 Data processing method, system, device, medium and data analysis method
CN117496542B (en) * 2023-12-29 2024-03-15 恒生电子股份有限公司 Document information extraction method, device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103838833A (en) * 2014-02-24 2014-06-04 华中师范大学 Full-text retrieval system based on semantic analysis of relevant words
CN110222160A (en) * 2019-05-06 2019-09-10 平安科技(深圳)有限公司 Intelligent semantic document recommendation method, device and computer readable storage medium
CN110705260A (en) * 2019-09-24 2020-01-17 北京工商大学 Text vector generation method based on unsupervised graph neural network structure

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108509411B (en) * 2017-10-10 2021-05-11 腾讯科技(深圳)有限公司 Semantic analysis method and device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103838833A (en) * 2014-02-24 2014-06-04 华中师范大学 Full-text retrieval system based on semantic analysis of relevant words
CN110222160A (en) * 2019-05-06 2019-09-10 平安科技(深圳)有限公司 Intelligent semantic document recommendation method, device and computer readable storage medium
CN110705260A (en) * 2019-09-24 2020-01-17 北京工商大学 Text vector generation method based on unsupervised graph neural network structure

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Graph Convolutional Networks for Text Classification; Liang Yao et al.; arXiv; pp. 1-9 *

Also Published As

Publication number Publication date
CN112214993A (en) 2021-01-12

Similar Documents

Publication Publication Date Title
CN112214993B (en) Document processing method, device and storage medium based on graph neural network
CN108829722B (en) Remote supervision Dual-Attention relation classification method and system
CN109033068B (en) Method and device for reading and understanding based on attention mechanism and electronic equipment
CN109359297B (en) Relationship extraction method and system
CN110275936B (en) Similar legal case retrieval method based on self-coding neural network
CN110688854B (en) Named entity recognition method, device and computer readable storage medium
CN111639171A (en) Knowledge graph question-answering method and device
CN110969020A (en) CNN and attention mechanism-based Chinese named entity identification method, system and medium
CN111160031A (en) Social media named entity identification method based on affix perception
CN112632224B (en) Case recommendation method and device based on case knowledge graph and electronic equipment
CN113157886B (en) Automatic question and answer generation method, system, terminal and readable storage medium
CN113190656A (en) Chinese named entity extraction method based on multi-label framework and fusion features
CN114881042B (en) Chinese emotion analysis method based on graph-convolution network fusion of syntactic dependency and part of speech
CN114757184B (en) Method and system for realizing knowledge question and answer in aviation field
CN113836896A (en) Patent text abstract generation method and device based on deep learning
CN112860930A (en) Text-to-commodity image retrieval method based on hierarchical similarity learning
CN115687571A (en) Depth unsupervised cross-modal retrieval method based on modal fusion reconstruction hash
CN116258137A (en) Text error correction method, device, equipment and storage medium
CN110298046B (en) Translation model training method, text translation method and related device
CN115906857A (en) Chinese medicine text named entity recognition method based on vocabulary enhancement
CN115019142A (en) Image title generation method and system based on fusion features and electronic equipment
CN111597816A (en) Self-attention named entity recognition method, device, equipment and storage medium
CN111581964A (en) Theme analysis method for Chinese ancient books
CN114356990A (en) Base named entity recognition system and method based on transfer learning
CN116680575B (en) Model processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant