CN111949679A - Document retrieval system and method

Info

Publication number: CN111949679A
Application number: CN201910413341.4A
Authority: CN
Inventors: 万江; 王小乐
Original assignee: Shanghai Geji Network Technology Co ltd
Current assignee: Shanghai Geji Network Technology Co ltd
Priority date: 2019-05-17
Filing date: 2019-05-17
Publication date: 2020-11-17

Abstract

The invention belongs to the technical field of computers, and particularly relates to a document retrieval system and a document retrieval method. The system comprises: an input unit for inputting a query sentence; a word segmentation unit for segmenting the query sentence into a plurality of mutually distinct words; a first retrieval unit for performing a first retrieval according to the words produced by the word segmentation unit to obtain a first retrieval result; a second retrieval unit for performing a second retrieval on the basis of the first retrieval to obtain a second retrieval result; and an output unit for outputting the second retrieval result as the final result. The invention offers accurate retrieval results and high retrieval efficiency.

Description

Document retrieval system and method

Technical Field

The invention belongs to the technical field of computers, and particularly relates to a document retrieval system and a document retrieval method.

Background

With the advent of the information age, the number of documents that can be retrieved has grown. How to effectively find useful information in a large number of documents becomes critical.

Information retrieval (IR) techniques may be used to search a collection of documents for particular information. The task may be further subdivided into: searching for information contained in documents; searching for the documents themselves; searching for metadata describing documents; and searching for text, sound, images, or data in a database, whether a standalone relational database or a hypertext-networked database such as the Internet or a content/document management system.

In performing document retrieval, a document retrieval system has two main tasks: first, finding the documents relevant to a user query; second, evaluating the matching results and ranking the documents according to their relevance. Many conventional document retrieval systems rely on keyword searching. These systems primarily take into account several specific factors, such as the frequency and location of occurrences of the query terms in documents, hyperlinks to documents, document access information, and so forth.

Disclosure of Invention

In view of the above, the main objective of the present invention is to provide a document retrieval system and method that offer accurate retrieval results and high retrieval efficiency.

To achieve this objective, the technical solution of the invention is as follows:

A document retrieval system, the system comprising:

an input unit, which inputs a query sentence;

a word segmentation unit, which performs word segmentation on the query sentence to obtain a plurality of mutually distinct words;

a first retrieval unit, which performs a first retrieval according to the plurality of words produced by the word segmentation unit to obtain a first retrieval result;

a second retrieval unit, which performs a second retrieval on the basis of the first retrieval to obtain a second retrieval result;

and an output unit, which outputs the second retrieval result as the final result.
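By way of illustration only, the cooperation of the five units may be sketched in Python as follows. The function names (`run_pipeline`, `toy_segment`, `toy_first`, `toy_second`), the in-memory `DOCS` collection, and the simple term-overlap scoring are all assumptions introduced for this sketch; the invention as described does not prescribe them.

```python
from typing import Callable, Dict, List

def run_pipeline(
    query: str,
    segment: Callable[[str], List[str]],
    first_retrieval: Callable[[List[str]], List[str]],
    second_retrieval: Callable[[List[str], List[str]], List[str]],
) -> List[str]:
    """Chain the units: input -> segmentation -> first -> second -> output."""
    words = segment(query)                        # word segmentation unit
    candidates = first_retrieval(words)           # first retrieval unit
    ranked = second_retrieval(words, candidates)  # second retrieval unit
    return ranked                                 # output unit

# Hypothetical toy collection and stand-in units, for illustration only.
DOCS: Dict[str, str] = {
    "d1": "hypergraph ranking of documents",
    "d2": "cooking recipes",
    "d3": "document retrieval with hypergraph and ontology",
}

def toy_segment(query: str) -> List[str]:
    # Split on whitespace and drop repeats while keeping first-seen order.
    return list(dict.fromkeys(query.lower().split()))

def toy_first(words: List[str]) -> List[str]:
    # Keep every document that contains at least one query word.
    return [d for d, text in DOCS.items() if any(w in text.split() for w in words)]

def toy_second(words: List[str], candidates: List[str]) -> List[str]:
    # Re-rank the candidates by the number of query words they contain.
    def score(doc_id: str) -> int:
        return sum(w in DOCS[doc_id].split() for w in words)
    return sorted(candidates, key=score, reverse=True)

result = run_pipeline("hypergraph document retrieval", toy_segment, toy_first, toy_second)
```

In this sketch the second retrieval only reorders the candidates found by the first retrieval, which is one possible reading of "on the basis of the first retrieval".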

Further, before retrieval, each document is preprocessed as follows:

a user enters key information based on an understanding of the document content, or the document is automatically scanned and the key information is extracted;

the key information is stored in the document in the form of additional data;

when the document is opened or edited, the additional data is skipped, and reading and writing start from the initial position of the real document data; when the document is saved, the additional data is retained and remains editable.
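One way such additional data could be carried inside a document file is sketched below. The layout is an assumption made for illustration and is not specified by the invention: a hypothetical 4-byte marker `KINF`, a 4-byte big-endian length, the key information encoded as JSON, and then the real document data.

```python
import json
import struct
from typing import Optional, Tuple

MAGIC = b"KINF"  # hypothetical marker identifying the additional data

def pack_document(key_info: dict, real_data: bytes) -> bytes:
    """Store the key information in front of the real document data."""
    extra = json.dumps(key_info, ensure_ascii=False).encode("utf-8")
    return MAGIC + struct.pack(">I", len(extra)) + extra + real_data

def split_document(blob: bytes) -> Tuple[Optional[dict], int]:
    """Return (key information or None, offset where the real data starts)."""
    if blob[:4] != MAGIC:
        return None, 0  # no additional data present
    (length,) = struct.unpack(">I", blob[4:8])
    start = 8 + length
    return json.loads(blob[8:start].decode("utf-8")), start

def read_real_data(blob: bytes) -> bytes:
    """Open the document: skip the additional data, read from the real start."""
    _, start = split_document(blob)
    return blob[start:]

def save_real_data(blob: bytes, new_data: bytes) -> bytes:
    """Save edited content while keeping the additional data in place."""
    key_info, _ = split_document(blob)
    if key_info is None:
        return new_data
    return pack_document(key_info, new_data)

blob = pack_document({"keywords": ["hypergraph", "ontology"]}, b"hello")
saved = save_real_data(blob, b"hello world")
```

Because the additional data is length-prefixed, a reader can skip it without parsing it, and an editor can rewrite the real data without disturbing the key information.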

Further, the first retrieval unit includes:

a key information extraction subunit, which automatically scans the additional data of each document, using the plurality of words produced by the word segmentation unit as search terms;

a document information retrieval subunit, which judges whether the document has additional data matching the search terms and, if so, performs content-based retrieval on the additional data; if no additional data is present, the document is either retrieved in binary mode or skipped.
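The decision logic of the first retrieval unit can be sketched as follows. This is a minimal illustration under stated assumptions: each document is modelled as a pair of (key information list or `None`, raw bytes), "content-based retrieval" is reduced to a case-insensitive substring test, and "binary mode" is read as a byte-level scan of the raw data. The name `first_retrieval` and the `binary_fallback` flag are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# (key information or None if the document carries no additional data, raw bytes)
Document = Tuple[Optional[List[str]], bytes]

def first_retrieval(
    words: List[str],
    docs: Dict[str, Document],
    binary_fallback: bool = True,
) -> List[str]:
    """Return the ids of documents matched by the first retrieval."""
    hits = []
    for doc_id, (key_info, data) in docs.items():
        if key_info is not None:
            # Additional data present: match the search terms against it.
            text = " ".join(key_info).lower()
            if any(w.lower() in text for w in words):
                hits.append(doc_id)
        elif binary_fallback:
            # No additional data: scan the raw bytes instead.
            if any(w.encode("utf-8") in data for w in words):
                hits.append(doc_id)
        # Otherwise the document is skipped.
    return hits

docs: Dict[str, Document] = {
    "a": (["hypergraph", "ontology"], b"..."),
    "b": (None, b"raw bytes that mention ontology"),
    "c": (["cooking"], b"ontology appears only in the body"),
}
with_fallback = first_retrieval(["ontology"], docs)
without_fallback = first_retrieval(["ontology"], docs, binary_fallback=False)
```

Document `c` is not returned in either case, because in this sketch a document that carries additional data is judged on that data alone.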

Further, the second retrieval unit includes:

a hypergraph construction subunit configured to construct a hypergraph for each document in a target document set to describe the implicit semantic information contained in the document;

a document sorting subunit configured to search the target document set for a specific query, based on the hypergraph constructed by the hypergraph construction subunit, and to sort the search results.

Further, the hypergraph construction subunit includes:

a concept extraction submodule configured to extract concepts from the document using domain ontology information and to calculate weights of the concepts;

a hypergraph construction submodule configured to construct an initial hypergraph for the document;

a hypergraph improvement submodule configured to improve the initial hypergraph using the domain ontology information; and

a weight assignment submodule configured to assign weights to nodes and edges in the improved hypergraph.
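The four submodules can be illustrated with the following Python sketch. The concrete choices are assumptions, since the text does not fix them: the domain ontology is a hypothetical pair of dictionaries (`TERM_TO_CONCEPT`, `PARENT`), concept weights are relative frequencies, each sentence mentioning at least two concepts yields one hyperedge, improvement adds the parent concept of any concept in a hyperedge, and an edge weight is the mean weight of its nodes.

```python
from collections import Counter
from typing import Dict, FrozenSet, List, Set, Tuple

# Hypothetical toy ontology: surface term -> concept, and concept -> parent.
TERM_TO_CONCEPT = {"hypergraph": "Hypergraph", "graph": "Graph", "ontology": "Ontology"}
PARENT = {"Hypergraph": "Graph"}

def extract_concepts(sentences: List[str]) -> Tuple[List[List[str]], Dict[str, float]]:
    """Concept extraction: map terms to concepts, weight by relative frequency."""
    per_sentence = [
        [TERM_TO_CONCEPT[t] for t in s.lower().split() if t in TERM_TO_CONCEPT]
        for s in sentences
    ]
    counts = Counter(c for concepts in per_sentence for c in concepts)
    total = sum(counts.values()) or 1
    return per_sentence, {c: n / total for c, n in counts.items()}

def build_hypergraph(
    sentences: List[str],
) -> Tuple[Dict[str, float], Dict[FrozenSet[str], float]]:
    per_sentence, weights = extract_concepts(sentences)
    # Initial hypergraph: one hyperedge per sentence with two or more concepts.
    initial: Set[FrozenSet[str]] = {
        frozenset(cs) for cs in per_sentence if len(set(cs)) >= 2
    }
    # Improvement: extend each hyperedge with the parents of its concepts.
    improved = {
        frozenset(edge | {PARENT[c] for c in edge if c in PARENT}) for edge in initial
    }
    # Weight assignment: nodes keep their concept weight (0.0 if never seen in
    # the text), and an edge gets the mean weight of its nodes.
    node_w = dict(weights)
    for edge in improved:
        for c in edge:
            node_w.setdefault(c, 0.0)
    edge_w = {edge: sum(node_w[c] for c in edge) / len(edge) for edge in improved}
    return node_w, edge_w

node_w, edge_w = build_hypergraph(
    ["A hypergraph uses ontology", "ontology helps", "graph theory"]
)
```

A hyperedge, unlike an ordinary graph edge, may join more than two nodes, which is why each edge is represented here as a `frozenset` of concepts.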

A document retrieval method, the method performing the following steps:

inputting a query sentence;

performing word segmentation on the query sentence to obtain a plurality of mutually distinct words;

performing a first retrieval according to the plurality of words obtained by the word segmentation to obtain a first retrieval result;

performing a second retrieval on the basis of the first retrieval to obtain a second retrieval result;

and outputting the second retrieval result as the final result.
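The word segmentation step is not tied to a particular algorithm in the text. As one possible illustration, the sketch below uses forward maximum matching, a simple dictionary-based technique for Chinese, followed by order-preserving removal of repeated words so that the resulting words are mutually distinct. The vocabulary, the `max_len` limit, and the function names are assumptions made for this example.

```python
from typing import List, Set

def forward_max_match(text: str, vocab: Set[str], max_len: int = 4) -> List[str]:
    """Greedily take the longest dictionary word starting at each position."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan can advance.
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words

def segment_query(query: str, vocab: Set[str]) -> List[str]:
    """Segment the query and keep each word once, in first-seen order."""
    tokens = [w for w in forward_max_match(query, vocab) if not w.isspace()]
    return list(dict.fromkeys(tokens))

vocab = {"文档", "检索", "系统"}
words = segment_query("文档检索系统文档", vocab)
```

Here the second occurrence of 文档 is dropped, so each search term is issued only once in the subsequent retrieval.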

Further, performing the first retrieval according to the plurality of words obtained by the word segmentation, to obtain the first retrieval result, comprises the following steps:

storing the key information in the document in the form of additional data;

automatically scanning the additional data of the document, using the plurality of words obtained by the word segmentation as search terms; judging whether the document has additional data matching the search terms and, if so, performing content-based retrieval on the additional data; if no additional data is present, the document is either retrieved in binary mode or skipped.

Further, performing the second retrieval on the basis of the first retrieval, to obtain the second retrieval result, comprises the following steps:

constructing a hypergraph for each document in the target document set to describe the implicit semantic information contained in the document; and

based on the constructed hypergraph, searching the target document set for the specific query and sorting the search results;

wherein the step of constructing the hypergraph comprises the following steps:

extracting concepts from the document using domain ontology information and calculating weights of the concepts;

constructing an initial hypergraph for the document;

improving the initial hypergraph using the domain ontology information; and

assigning weights to nodes and edges in the improved hypergraph.
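How the weighted hypergraph is used to sort the search results is not detailed in the text. The following sketch shows one plausible scoring rule, offered purely as an assumption: a document's score is the sum of the node weights of the query concepts it contains, plus each hyperedge weight scaled by the fraction of that hyperedge's nodes that are query concepts. The names `score` and `rank` and the sample graphs are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# Per-document hypergraph: (node weights, hyperedge weights).
Hypergraph = Tuple[Dict[str, float], Dict[FrozenSet[str], float]]

def score(query_concepts: Set[str], hg: Hypergraph) -> float:
    """Score one document's hypergraph against the query concepts."""
    node_w, edge_w = hg
    node_part = sum(node_w.get(c, 0.0) for c in query_concepts)
    edge_part = sum(
        w * len(edge & query_concepts) / len(edge) for edge, w in edge_w.items()
    )
    return node_part + edge_part

def rank(query_concepts: Set[str], graphs: Dict[str, Hypergraph]) -> List[str]:
    """Return matching document ids, highest score first, ties broken by id."""
    scored = [(score(query_concepts, hg), doc_id) for doc_id, hg in graphs.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for s, doc_id in ordered if s > 0]

graphs: Dict[str, Hypergraph] = {
    "d1": ({"Graph": 0.5, "Ontology": 0.5}, {frozenset({"Graph", "Ontology"}): 1.0}),
    "d2": ({"Graph": 1.0}, {}),
    "d3": ({"Cooking": 1.0}, {}),
}
ranking = rank({"Graph", "Ontology"}, graphs)
```

Under this rule a document whose concepts co-occur in a hyperedge is rewarded over one that merely mentions a single query concept, which is one way of exploiting the implicit semantic information the hypergraph is meant to capture.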

The document retrieval system and the document retrieval method have the following beneficial effect: by performing two rounds of retrieval, the accuracy of document retrieval is improved, so that the actual retrieval needs of users can be better met.

Drawings

FIG. 1 is a system configuration diagram of a document retrieval system according to the present invention.

Detailed Description

The method of the present invention will be described in further detail below with reference to the accompanying drawings and embodiments of the invention.

As shown in FIG. 1, the document retrieval system comprises an input unit, a word segmentation unit, a first retrieval unit, a second retrieval unit, and an output unit. These units, the preprocessing that stores key information in each document in the form of additional data, the subunits of the first and second retrieval units, and the submodules of the hypergraph construction subunit are as set out in the Disclosure of Invention above.

The document retrieval method performs the corresponding steps set out above, from inputting a query sentence through outputting the second retrieval result as the final result, including constructing an initial hypergraph for each document, improving the initial hypergraph using domain ontology information, and assigning weights to nodes and edges in the improved hypergraph.

It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related description of the system described above may refer to the corresponding process in the foregoing method embodiments, and will not be described herein again.

It should be noted that, the system provided in the foregoing embodiment is only illustrated by dividing the foregoing functional sub-units, and in practical applications, the foregoing functional allocation may be completed by different functional sub-units according to needs, that is, sub-units or steps in the embodiment of the present invention are further decomposed or combined, for example, the sub-units in the foregoing embodiment may be combined into one sub-unit, or may be further split into multiple sub-units, so as to complete all or part of the functions described above. The names of the sub-units and the steps involved in the embodiments of the present invention are only for distinguishing the sub-units or the steps, and are not to be construed as unduly limiting the present invention.

The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.

The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.

So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention.

Claims

1. A document retrieval system, the system comprising:

an input unit which inputs a query sentence;

2. The document retrieval system of claim 1, wherein, before retrieval, the document is preprocessed as follows:

storing the key information into the document in an additional data form;

when the document is opened/edited, skipping over the additional data, and reading and writing from the initial position of the real document data; when the document is saved, the additional data still exists and can be edited.

3. The document retrieval system according to claim 2, wherein the first retrieval unit includes:

4. The document retrieval system of claim 3, wherein the second retrieval unit includes:

and the document sorting subunit is configured to search in the target document set aiming at the specific query and sort the search result based on the hypergraph constructed by the hypergraph construction unit.

5. The document retrieval system of claim 4, wherein the hypergraph construction subunit comprises:

6. A method of document retrieval, the method comprising the steps of:

inputting a query statement;

and outputting the second retrieval result to obtain a final result.

7. The document retrieval method of claim 6, wherein performing the first retrieval according to the plurality of words obtained by the word segmentation, to obtain the first retrieval result, comprises the steps of:

storing the key information into the document in an additional data form;

8. The document retrieval method of claim 7, wherein performing the second retrieval on the basis of the first retrieval, to obtain the second retrieval result, comprises the steps of:

constructing an initial hypergraph for the document;

improving the initial hypergraph using domain ontology information; and

assigning weights to nodes and edges in the improved hypergraph.