CN112597277A

CN112597277A - Document query method and device, storage medium and electronic equipment

Info

Publication number: CN112597277A
Application number: CN202011570077.4A
Authority: CN
Inventors: 俞宣伊; 黄荣; 刘俊峰; 谭文静; 孙丽黎; 初娜; 熊浩
Original assignee: Agricultural Bank of China
Current assignee: Agricultural Bank of China
Priority date: 2020-12-26
Filing date: 2020-12-26
Publication date: 2021-04-02

Abstract

The embodiment of the invention provides a document query method, a document query device, a storage medium and electronic equipment. The method obtains a target phrase input by a user; obtains similar phrases of the target phrase, and determines the similar phrases and the target phrase as phrases to be queried; queries a pre-constructed knowledge graph for a keyword node corresponding to a phrase to be queried and, when such a keyword node is found, obtains the document nodes that have a direct connection relationship with it; and determines the documents corresponding to the at least one obtained document node as the query result. The invention does not need to perform a full-text query and therefore queries faster.

Description

Document query method and device, storage medium and electronic equipment

Technical Field

The present invention relates to the field of document query technologies, and in particular, to a document query method, an apparatus, a storage medium, and an electronic device.

Background

With the popularization of electronic offices, various documents are increasing. Users often need to query for certain documents.

The current query to the document is generally to directly perform full text query in the document according to the search term input by the user, and when a certain document includes the search term, the document is output as a query result.

However, full-text queries are slow.

Disclosure of Invention

The embodiment of the invention aims to provide a document query method, a document query device, a storage medium and electronic equipment, so as to improve the query speed. The specific technical scheme is as follows:

a document query method, comprising:

obtaining a target phrase input by a user;

obtaining similar phrases of the target phrase, and determining the similar phrases and the target phrase as phrases to be queried;

querying a keyword node corresponding to the phrase to be queried in a pre-constructed knowledge graph, and when the keyword node corresponding to the phrase to be queried is found, obtaining a document node that has a direct connection relationship with the found keyword node;

and determining the document corresponding to the at least one obtained document node as a query result.

Optionally, the obtaining of the similar phrases of the target phrase includes:

obtaining a word vector of the target phrase;

and in a preset word vector dictionary of the field corresponding to the target phrase, obtaining a phrase whose word vector similarity to the word vector of the target phrase meets a preset similarity requirement, and determining the phrase that meets the preset similarity requirement as a similar phrase of the target phrase.

Optionally, the pre-constructed knowledge graph is a knowledge graph of a field corresponding to the target phrase, and/or the keyword node is located in a document corresponding to a document node having a direct connection relationship with the keyword node.

Optionally, the querying a keyword node corresponding to the phrase to be queried in a pre-constructed knowledge graph, and when the keyword node corresponding to the phrase to be queried is queried, obtaining a document node having a direct connection relationship with the queried keyword node, includes:

using the phrases to be queried to construct a knowledge graph query statement, and executing the knowledge graph query statement, wherein the knowledge graph query statement is used for:

querying a keyword node corresponding to the phrase to be queried in a pre-constructed knowledge graph, and when the keyword node corresponding to the phrase to be queried is found, obtaining a document node that has a direct connection relationship with the found keyword node.

Optionally, the process of constructing the pre-constructed knowledge graph includes:

obtaining a plurality of documents;

performing word segmentation processing on the document to obtain a plurality of phrases;

removing stop words in the plurality of phrases;

extracting keywords from the plurality of phrases from which the stop words are removed through a preset keyword extraction algorithm;

establishing a triple according to the inclusion relation between the plurality of documents and the keyword;

and establishing the keyword nodes, the document nodes and the direct connection relation in a knowledge graph according to the triples.

A document querying device, comprising: a target phrase obtaining unit, a similar phrase obtaining unit, a node inquiring unit and a result determining unit,

the target phrase obtaining unit is used for obtaining a target phrase input by a user;

the similar phrase obtaining unit is used for obtaining similar phrases of the target phrase, and determining the similar phrases and the target phrase as phrases to be queried;

the node query unit is used for querying keyword nodes corresponding to the phrases to be queried in a pre-constructed knowledge graph, and when a keyword node corresponding to a phrase to be queried is found, obtaining the document nodes that have a direct connection relationship with the found keyword node;

and the result determining unit is used for determining the document corresponding to the at least one obtained document node as a query result.

Optionally, the similar phrase obtaining unit obtains a similar phrase of the target phrase, and is specifically configured to:

obtaining a word vector of the target phrase; and in a preset word vector dictionary of the field corresponding to the target phrase, obtaining a phrase whose word vector similarity to the word vector of the target phrase meets a preset similarity requirement, and determining the phrase that meets the preset similarity requirement as a similar phrase of the target phrase.

A storage medium having stored thereon a program which, when executed by a processor, implements any of the above-described document querying methods.

An electronic device comprising at least one processor, and at least one memory, bus connected with the processor; the processor and the memory complete mutual communication through the bus; the processor is used for calling the program instructions in the memory to execute any one of the document query methods.

The document query method, the document query device, the storage medium and the electronic equipment provided by the embodiment of the invention can obtain a target phrase input by a user; obtain similar phrases of the target phrase, and determine the similar phrases and the target phrase as phrases to be queried; query a pre-constructed knowledge graph for a keyword node corresponding to a phrase to be queried and, when such a keyword node is found, obtain the document nodes that have a direct connection relationship with it; and determine the documents corresponding to the at least one obtained document node as the query result. The invention does not need to perform a full-text query and therefore queries faster. Of course, it is not necessary for any product or method practicing the invention to achieve all of the above-described advantages at the same time.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a flowchart of a document query method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a knowledge-graph provided by an embodiment of the present invention;

FIG. 3 is a schematic diagram of a knowledge-graph building process according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a Python-based implementation of the document query method according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a document querying device according to an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

As shown in fig. 1, a document query method provided in an embodiment of the present invention may include:

S100, obtaining a target phrase input by a user.

The target phrase input by the user may be one or more phrases. When there are a plurality of target phrases, the user may separate them with a delimiter (e.g., a space, an enumeration comma, a semicolon, a comma, etc.).

Of course, in other embodiments, the present invention may also automatically recognize the content input by the user and perform segmentation to obtain at least one target phrase. Optionally, the invention can cut the content input by the user through word segmentation technology.
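As an illustration of the delimiter-based case in S100, the following is a minimal Python sketch. The function name split_target_phrases and the exact delimiter set are assumptions made for this example and are not specified by the embodiment; automatic word segmentation of free-form input is not shown.

```python
import re
from typing import List

# Delimiters assumed for illustration: whitespace, the enumeration comma,
# and half-width or full-width commas and semicolons.
_DELIMITERS = re.compile(r"[\s、,，;；]+")


def split_target_phrases(user_input: str) -> List[str]:
    """Split raw user input into target phrases, dropping empty pieces."""
    return [piece for piece in _DELIMITERS.split(user_input) if piece]


if __name__ == "__main__":
    print(split_target_phrases("router; switch,firewall"))
```

Because a space is itself treated as a delimiter here, a multi-word input such as "routing protocol" would be split into two target phrases.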

S200, obtaining similar phrases of the target phrase, and determining the similar phrases and the target phrase as phrases to be queried.

It can be understood that, if the query is performed only with the target phrase input by the user, the query range is narrow and the documents the user needs may not be effectively covered. By using both the target phrase and its similar phrases, the invention can find not only documents that include the target phrase but also documents that include phrases similar to it, which effectively improves the coverage and the accuracy of the query result.

Optionally, obtaining a similar phrase of the target phrase may specifically include:

obtaining a word vector of a target phrase;

and in a preset word vector dictionary in the field corresponding to the target phrase, obtaining a phrase of which the similarity with the word vector of the target phrase meets the preset similarity requirement, and determining the phrase of which the similarity requirement is met as the similar phrase of the target phrase.

Optionally, the invention may obtain the word vector of the target phrase through a Word2vec model. Word2vec is a model for generating word vectors: a neural network is trained on a corpus so that words are mapped to word vectors.

Optionally, the invention can train the Word2vec model to obtain the Word2vec model corresponding to a certain field. Optionally, the present invention may obtain a plurality of corpora in the field (e.g., obtaining corpora from encyclopedia, network data, textbooks, and papers in the field), obtain keywords in the field according to the obtained corpora, and train a Word2vec model according to the keywords.

Optionally, after obtaining the word vector of the target phrase, the present invention may convert the word vector into a csv format, so as to be loaded as a dictionary when in use, thereby improving the access efficiency.
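Loading word vectors from csv into a dictionary might look like the sketch below. The row layout (phrase in the first column, vector components in the remaining columns) and the name load_vector_dictionary are assumptions for illustration; the embodiment does not specify the csv layout.

```python
import csv
import io
from typing import Dict, List, TextIO


def load_vector_dictionary(csv_file: TextIO) -> Dict[str, List[float]]:
    """Read rows of the assumed form `phrase,v1,v2,...` into a dict."""
    vectors: Dict[str, List[float]] = {}
    for row in csv.reader(csv_file):
        if not row:
            continue  # skip blank lines
        phrase, *components = row
        vectors[phrase] = [float(value) for value in components]
    return vectors


if __name__ == "__main__":
    # An in-memory file stands in for a csv file on disk.
    sample = io.StringIO("router,0.9,0.1\nswitch,0.8,0.2\n")
    print(load_vector_dictionary(sample))
```

Once loaded, looking up the vector of a phrase is a single dictionary access, which is the access-efficiency benefit referred to above.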

The preset word vector dictionary can correspond to the field, can effectively improve the pertinence to the field, and further improves the query accuracy. The similarity of the word vectors may be cosine similarity. The preset similarity requirement may be: the similarity is higher than a preset threshold, or the similarity ranking is positioned at the top N.
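The similarity requirement described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the word vector dictionary is a plain dict from phrase to vector, and the names cosine_similarity and similar_phrases, together with the threshold and top_n parameters, are chosen for this example only.

```python
import math
from typing import Dict, List, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_phrases(
    target: str,
    vectors: Dict[str, Sequence[float]],
    threshold: Optional[float] = None,
    top_n: Optional[int] = None,
) -> List[str]:
    """Return phrases whose similarity to `target` meets the requirement.

    The requirement is "similarity above `threshold`", "ranked in the
    top `top_n`", or both when both arguments are given.
    """
    if target not in vectors:
        return []
    target_vec = vectors[target]
    scored = [
        (cosine_similarity(target_vec, vec), phrase)
        for phrase, vec in vectors.items()
        if phrase != target
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # most similar first
    if threshold is not None:
        scored = [pair for pair in scored if pair[0] > threshold]
    if top_n is not None:
        scored = scored[:top_n]
    return [phrase for _, phrase in scored]
```

The phrases to be queried are then the target phrase together with the returned list.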

S300, searching keyword nodes corresponding to the phrases to be searched in the pre-constructed knowledge graph, and when the keyword nodes corresponding to the phrases to be searched are searched, obtaining document nodes which have direct connection relation with the searched keyword nodes.

The pre-constructed knowledge graph of the invention can comprise: document nodes and keyword nodes. Fig. 2 is a schematic diagram of a knowledge graph according to an embodiment of the present invention. As shown in fig. 2, the document nodes and the keyword nodes may have a direct connection relationship as shown by the arrows in fig. 2, such as an arrow between the node corresponding to the first document and the node corresponding to the first keyword. The meaning of the arrow is: the first keyword is located in a first document. Optionally, one document node may be connected to a plurality of keyword nodes, and one keyword node may also be connected to a plurality of document nodes.
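The structure described above, with document nodes and keyword nodes joined by direct connections in a many-to-many fashion, can be modeled in memory as in the following Python sketch. The class KnowledgeGraph and its method names are hypothetical; the embodiment itself stores the graph in a graph database rather than in Python objects.

```python
from collections import defaultdict
from typing import Dict, Set


class KnowledgeGraph:
    """Bipartite graph of document nodes and keyword nodes."""

    def __init__(self) -> None:
        self._docs_by_keyword: Dict[str, Set[str]] = defaultdict(set)
        self._keywords_by_doc: Dict[str, Set[str]] = defaultdict(set)

    def add_edge(self, document: str, keyword: str) -> None:
        """Record the direct connection "keyword is located in document"."""
        self._docs_by_keyword[keyword].add(document)
        self._keywords_by_doc[document].add(keyword)

    def documents_for(self, keyword: str) -> Set[str]:
        """Document nodes directly connected to a keyword node."""
        return set(self._docs_by_keyword.get(keyword, set()))

    def keywords_for(self, document: str) -> Set[str]:
        """Keyword nodes directly connected to a document node."""
        return set(self._keywords_by_doc.get(document, set()))
```

Adding a new document or keyword only requires further add_edge calls, which reflects the extensibility of the graph.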

The invention can embody the relation between the document and the key words through the knowledge graph. Meanwhile, the expansibility of the knowledge graph is strong, so that the knowledge graph can be expanded by adding key word nodes, document nodes and connection relations.

Optionally, in the knowledge graph of the present invention, document nodes may have connection relationships with one another, and keyword nodes may have connection relationships with one another. The connection relationships between document nodes can be of various kinds, such as document similarity relationships and document inclusion relationships. The connection relationships between keyword nodes can also be of various kinds, such as similar-meaning relationships, opposite-meaning relationships, and relationships in which one meaning includes or is included in another.

The querying of the keyword node corresponding to the phrase to be queried in the pre-constructed knowledge graph may include:

and searching keywords similar to the phrase to be inquired in keywords corresponding to the keyword nodes contained in the pre-constructed knowledge graph, and determining the keyword nodes corresponding to the searched keywords as the keyword nodes corresponding to the phrase to be inquired.

Optionally, step S300 may include:

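using the phrases to be queried to construct a knowledge graph query statement and executing the knowledge graph query statement, wherein the knowledge graph query statement is used for: querying a keyword node corresponding to the phrase to be queried in the pre-constructed knowledge graph, and when the keyword node corresponding to the phrase to be queried is found, obtaining a document node that has a direct connection relationship with the found keyword node.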

Optionally, the knowledge graph query statement in the present invention may be a SPARQL query statement.

SPARQL is a query language and data acquisition protocol that can be used for knowledge graphs, provides functionality similar to SQL statements, and implements a variety of operations based on graph algorithms.

Through the SPARQL query statement, the method and the system can quickly find the corresponding keyword node, so that the corresponding document node is found.
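Constructing such a SPARQL query statement from the phrases to be queried could look like the Python sketch below. The namespace http://example.org/kg# and the predicates ex:name and ex:contains are hypothetical; the embodiment does not disclose its graph vocabulary. Only the construction of the statement is shown, not its execution against a database.

```python
from typing import Iterable

# Hypothetical namespace for illustration only.
NAMESPACE = "http://example.org/kg#"


def _escape_literal(text: str) -> str:
    """Escape backslashes and double quotes for a SPARQL string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def build_query(phrases: Iterable[str]) -> str:
    """Build a query for documents directly connected to matching keywords."""
    values = " ".join(f'"{_escape_literal(p)}"' for p in phrases)
    return (
        f"PREFIX ex: <{NAMESPACE}>\n"
        "SELECT DISTINCT ?doc WHERE {\n"
        f"  VALUES ?name {{ {values} }}\n"
        "  ?keyword ex:name ?name .\n"
        "  ?doc ex:contains ?keyword .\n"
        "}"
    )


if __name__ == "__main__":
    print(build_query(["router", "switch"]))
```

The VALUES clause lets one statement cover the target phrase and all of its similar phrases, so a single round trip to the graph database is enough.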

S400, determining the document corresponding to the at least one obtained document node as a query result.

The invention can inquire through the target phrase and the similar phrase without full text retrieval, thereby effectively improving the inquiry speed.
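Why no full-text scan is needed can be seen from the following Python sketch of S300 and S400, which assumes the knowledge graph has already been reduced to a mapping from keyword to the set of directly connected documents. The function name query_documents and this mapping are illustrative assumptions.

```python
from typing import Dict, Iterable, List, Set


def query_documents(
    phrases_to_query: Iterable[str],
    docs_by_keyword: Dict[str, Set[str]],
) -> List[str]:
    """Collect documents directly connected to any matching keyword node.

    Each phrase costs one dictionary lookup; document bodies are never read.
    """
    results: List[str] = []
    seen: Set[str] = set()
    for phrase in phrases_to_query:
        # A phrase without a keyword node contributes nothing.
        for doc in sorted(docs_by_keyword.get(phrase, ())):
            if doc not in seen:
                seen.add(doc)
                results.append(doc)
    return results


if __name__ == "__main__":
    graph = {"router": {"doc1", "doc2"}, "switch": {"doc1", "doc3"}}
    print(query_documents(["router", "switch", "firewall"], graph))
```

The cost of a query depends on the number of phrases and matched edges rather than on the total length of the documents.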

A search term may be only semantically similar to the keywords present in a document, without matching them character for character. Computing the similarity between terms effectively supports fuzzy queries, and training on a large corpus allows most likely query inputs to be covered.

As shown in fig. 3, an embodiment of the present invention further provides a process for constructing a pre-constructed knowledge-graph, which may include:

S001, obtaining a plurality of documents;

S002, performing word segmentation processing on the documents to obtain a plurality of phrases;

S003, removing stop words from the plurality of phrases;

S004, extracting keywords, through a preset keyword extraction algorithm, from the plurality of phrases from which the stop words have been removed;

The preset keyword extraction algorithm may be: TF-IDF (term frequency-inverse document frequency), the document topic generation model LDA (Latent Dirichlet Allocation), TextRank, etc.

S005, establishing triples according to the inclusion relation between the plurality of documents and the keywords;

S006, establishing the keyword nodes, the document nodes and the direct connection relations in the knowledge graph according to the triples.
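Steps S002 to S005 can be sketched in Python as follows. This is a simplified illustration under stated assumptions: whitespace splitting stands in for word segmentation, the stop word list is a small English sample, TF-IDF is chosen from the algorithms listed above, and the names extract_keywords and build_triples and the relation label "contains" are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Set, Tuple

# Illustrative stop word list; a real list would be far larger.
STOP_WORDS: Set[str] = {"the", "a", "of", "and", "is", "to"}


def extract_keywords(documents: Dict[str, str], top_k: int = 2) -> Dict[str, List[str]]:
    """S002 to S004: segment, drop stop words, keep the top_k TF-IDF terms."""
    tokenized = {
        name: [w for w in text.lower().split() if w not in STOP_WORDS]
        for name, text in documents.items()
    }
    # Document frequency: in how many documents each word occurs.
    doc_freq: Counter = Counter()
    for words in tokenized.values():
        doc_freq.update(set(words))
    n_docs = len(tokenized)
    keywords: Dict[str, List[str]] = {}
    for name, words in tokenized.items():
        counts = Counter(words)
        scores = {
            w: (c / len(words)) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        keywords[name] = ranked[:top_k]
    return keywords


def build_triples(keywords: Dict[str, List[str]]) -> List[Tuple[str, str, str]]:
    """S005: one (document, "contains", keyword) triple per inclusion."""
    return [(doc, "contains", kw) for doc, kws in keywords.items() for kw in kws]
```

Each resulting triple corresponds to one document node, one keyword node and the direct connection between them, which is what S006 writes into the knowledge graph.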

FIG. 4 is a schematic diagram of an optional Python-based implementation of the document query method provided by an embodiment of the present invention.

In FIG. 4, the text in both dashed boxes needs to be preprocessed. The jieba tool is used to perform Chinese word segmentation on the corpus text, and jieba can be configured so that certain terms are not split during segmentation, for example yielding "routing switch/protocol" instead of "routing/switch/protocol". The Word2vec text preprocessing stage does not need to process stop words, whereas the LDA text preprocessing stage needs to remove stop words (high-frequency words that carry little meaning, such as conjunctions or modal particles) by comparison against a common Chinese stop word list. In the Python script that performs the query, the input words are used to find, through the word vector dictionary, the words corresponding to the n most similar vectors, and the keywords among those words are then identified by referring to the keyword set table. These keywords are then used to create a SPARQL query statement that queries the Fuseki database and returns the results.

Corresponding to the document query method, the embodiment of the invention also provides a document query device.

As shown in fig. 5, a document querying device provided in an embodiment of the present invention may include: a target phrase obtaining unit 100, a similar phrase obtaining unit 200, a node query unit 300 and a result determination unit 400,

a target phrase obtaining unit 100, configured to obtain a target phrase input by a user;

a similar phrase obtaining unit 200, configured to obtain a similar phrase of the target phrase, and determine the similar phrase and the target phrase as a phrase to be queried;

optionally, the similar phrase obtaining unit 200 obtains the similar phrase of the target phrase, and may be specifically configured to:

obtaining a word vector of a target phrase; and in a preset word vector dictionary in the field corresponding to the target phrase, obtaining a phrase of which the similarity with the word vector of the target phrase meets the preset similarity requirement, and determining the phrase of which the similarity requirement is met as the similar phrase of the target phrase.

The node query unit 300 is configured to query a keyword node corresponding to a phrase to be queried in a pre-constructed knowledge graph, and when the keyword node corresponding to the phrase to be queried is queried, obtain a document node having a direct connection relationship with the queried keyword node;

and a result determining unit 400, configured to determine the document corresponding to the at least one obtained document node as a query result.

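Optionally, the node querying unit 300 may specifically be configured to: use the phrases to be queried to construct a knowledge graph query statement and execute the knowledge graph query statement, wherein the knowledge graph query statement is used for querying a keyword node corresponding to the phrase to be queried in the pre-constructed knowledge graph, and when the keyword node corresponding to the phrase to be queried is found, obtaining a document node that has a direct connection relationship with the found keyword node.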

Optionally, the document querying device shown in fig. 5 may further include: the map building unit is used for building the knowledge map, and comprises: a first obtaining unit, a word segmentation unit, a word removing unit, an extraction unit, a triple establishing unit and a node establishing unit,

a first obtaining unit configured to obtain a plurality of documents;

the word segmentation unit is used for carrying out word segmentation processing on the document to obtain a plurality of word groups;

the word removing unit is used for removing stop words in the plurality of word groups;

the extraction unit is used for extracting keywords from the plurality of phrases from which the stop words are removed through a preset keyword extraction algorithm;

the triple establishing unit is used for establishing a triple according to the inclusion relation between the plurality of documents and the keyword;

and the node establishing unit is used for establishing the keyword nodes, the document nodes and the direct connection relation in the knowledge graph according to the triples.

The document inquiry apparatus includes a processor and a memory, the target phrase obtaining unit 100, the similar phrase obtaining unit 200, the node inquiry unit 300, the result determining unit 400, and the like are all stored in the memory as program units, and the processor executes the program units stored in the memory to implement corresponding functions.

The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels may be provided, and document querying is performed by adjusting the kernel parameters.

An embodiment of the present invention provides a storage medium on which a program is stored, the program implementing the document query method when executed by a processor.

The embodiment of the invention provides a processor, which is used for running a program, wherein the document query method is executed when the program runs.

As shown in fig. 6, an embodiment of the present invention provides an electronic device 70, where the electronic device 70 includes at least one processor 701, at least one memory 702 connected to the processor 701, and a bus 703; the processor 701 and the memory 702 complete mutual communication through a bus 703; the processor 701 is configured to call program instructions in the memory 702 to perform the document query method described above. The device herein may be a server, a PC, a PAD, a mobile phone, etc.

The present application also provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the steps included in the document query method described above.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In a typical configuration, a device includes one or more processors (CPUs), memory, and a bus. The device may also include input/output interfaces, network interfaces, and the like.

The memory may include volatile memory in a computer readable medium, Random Access Memory (RAM) and/or nonvolatile memory such as Read Only Memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip. The memory is an example of a computer-readable medium.

Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.

The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims

1. A document query method, comprising:

obtaining a target phrase input by a user;

2. The method of claim 1, wherein the obtaining similar phrases of the target phrase comprises:

obtaining a word vector of the target phrase;

3. The method according to claim 1, wherein the pre-constructed knowledge graph is a knowledge graph of a domain corresponding to the target phrase, and/or the keyword node is located in a document corresponding to a document node having a direct connection relationship with the keyword node.

4. The method according to claim 1, wherein the querying a keyword node corresponding to the phrase to be queried in a pre-constructed knowledge graph, and when the keyword node corresponding to the phrase to be queried is queried, obtaining a document node having a direct connection relationship with the queried keyword node, comprises:

5. The method of claim 1, wherein the pre-constructed knowledge-graph is constructed by a process comprising:

obtaining a plurality of documents;

removing stop words in the plurality of phrases;

6. A document querying device, comprising: a target phrase obtaining unit, a similar phrase obtaining unit, a node inquiring unit and a result determining unit,

7. The document querying device according to claim 6, wherein the similar phrase obtaining unit obtains the similar phrases of the target phrase, and is specifically configured to:

8. The apparatus according to claim 6, wherein the pre-constructed knowledge graph is a knowledge graph of a domain corresponding to the target phrase, and/or the keyword node is located in a document corresponding to a document node having a direct connection relationship with the keyword node.

9. A storage medium on which a program is stored, the program implementing the document query method of any one of claims 1 to 5 when executed by a processor.

10. An electronic device comprising at least one processor, and at least one memory, bus connected with the processor; the processor and the memory complete mutual communication through the bus; the processor is configured to invoke program instructions in the memory to perform the document query method of any of claims 1 to 5.