CN111949679A - Document retrieval system and method - Google Patents
- Publication number
- CN111949679A (application number CN201910413341.4A)
- Authority
- CN
- China
- Prior art keywords
- document
- retrieval
- hypergraph
- additional data
- unit
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
- G06F16/243—Natural language query formulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2453—Query optimisation
- G06F16/24534—Query rewriting; Transformation
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention belongs to the field of computer technology and relates in particular to a document retrieval system and a document retrieval method. The system comprises: an input unit that receives a query sentence; a word segmentation unit that segments the query sentence into a plurality of mutually distinct words; a first retrieval unit that performs a first retrieval using the words produced by the word segmentation unit to obtain a first retrieval result; a second retrieval unit that performs a second retrieval on the basis of the first retrieval to obtain a second retrieval result; and an output unit that outputs the second retrieval result as the final result. The invention offers accurate retrieval results and high retrieval efficiency.
Description
Technical Field
The invention belongs to the field of computer technology and relates in particular to a document retrieval system and a document retrieval method.
Background
With the advent of the information age, the number of documents that can be retrieved has grown. How to effectively find useful information in a large number of documents becomes critical.
Information Retrieval (IR) techniques search a collection of documents for particular information. The task can be further subdivided into searching for information contained in documents, searching for the documents themselves, searching for metadata describing documents, and searching for text, sound, images, or data in a database (whether a standalone relational database or a hypertext networked database, such as the Internet or a content/document management system).
A document retrieval system has two main tasks: first, finding the documents relevant to a user query; second, evaluating the matching results and ranking the documents by relevance. Many conventional document retrieval systems rely on keyword searching. These systems retrieve documents mainly by considering a few specific factors, such as the frequency and position of the query terms in a document, hyperlinks pointing to the document, and document access information.
Disclosure of Invention
In view of the above, the main objective of the present invention is to provide a document retrieval system and method that deliver accurate retrieval results and high retrieval efficiency.
To achieve this objective, the technical solution of the invention is as follows:
A document retrieval system, comprising:
an input unit that receives a query sentence;
a word segmentation unit that segments the query sentence into a plurality of mutually distinct words;
a first retrieval unit that performs a first retrieval using the words produced by the word segmentation unit to obtain a first retrieval result;
a second retrieval unit that performs a second retrieval on the basis of the first retrieval to obtain a second retrieval result; and
an output unit that outputs the second retrieval result as the final result.
Further, before retrieval, each document is processed as follows:
key information is either entered by a user based on an understanding of the document content, or extracted by automatically scanning the document;
the key information is stored in the document in the form of additional data;
when the document is opened or edited, the additional data is skipped, and reading and writing begin at the start position of the actual document data; when the document is saved, the additional data is retained and remains editable.
Further, the first retrieval unit includes:
a key information extraction subunit that automatically scans the additional data of each document, using the words produced by the word segmentation unit as search terms; and
a document information retrieval subunit that determines whether the document has additional data matching the search terms and, if so, performs content-based retrieval on the additional data; if the document has no additional data, it is either searched in binary form or skipped.
Further, the second retrieval unit includes:
a hypergraph construction subunit configured to construct a hypergraph for each document in a target document set to describe the implicit semantic information contained in the document; and
a document sorting subunit configured to search the target document set for a specific query, based on the hypergraph constructed by the hypergraph construction subunit, and to sort the search results.
Further, the hypergraph construction subunit includes:
a concept extraction submodule configured to extract concepts from the document using domain ontology information and to calculate the weights of the concepts;
a hypergraph construction submodule configured to construct an initial hypergraph for the document;
a hypergraph improvement submodule configured to improve the initial hypergraph using the domain ontology information; and
a weight assignment submodule configured to assign weights to the nodes and edges of the improved hypergraph.
A document retrieval method, comprising the following steps:
inputting a query sentence;
performing word segmentation on the query sentence to obtain a plurality of mutually distinct words;
performing a first retrieval using the segmented words to obtain a first retrieval result;
performing a second retrieval on the basis of the first retrieval to obtain a second retrieval result; and
outputting the second retrieval result as the final result.
Further, performing the first retrieval using the segmented words to obtain the first retrieval result comprises the following steps:
key information is either entered by a user based on an understanding of the document content, or extracted by automatically scanning the document;
the key information is stored in the document in the form of additional data;
when the document is opened or edited, the additional data is skipped, and reading and writing begin at the start position of the actual document data; when the document is saved, the additional data is retained and remains editable;
the additional data of each document is automatically scanned using the segmented words as search terms; it is determined whether the document has additional data matching the search terms and, if so, content-based retrieval is performed on the additional data; if the document has no additional data, it is either searched in binary form or skipped.
Further, performing the second retrieval on the basis of the first retrieval to obtain the second retrieval result comprises the following steps:
constructing a hypergraph for each document in the target document set to describe the implicit semantic information contained in the document; and
based on the constructed hypergraph, searching the target document set for a specific query and sorting the search results, wherein constructing the hypergraph comprises the following steps:
extracting concepts from the document using domain ontology information and calculating the weights of the concepts;
constructing an initial hypergraph for the document;
improving the initial hypergraph using the domain ontology information; and
assigning weights to the nodes and edges of the improved hypergraph.
The document retrieval system and method have the following beneficial effect: performing two rounds of retrieval improves the accuracy of document retrieval and therefore better meets users' actual retrieval needs.
Drawings
FIG. 1 is a system configuration diagram of a document retrieval system according to the present invention.
Detailed Description
The method of the present invention will be described in further detail below with reference to the accompanying drawings and embodiments of the invention.
As shown in FIG. 1, a document retrieval system comprises:
an input unit that receives a query sentence;
a word segmentation unit that segments the query sentence into a plurality of mutually distinct words;
a first retrieval unit that performs a first retrieval using the words produced by the word segmentation unit to obtain a first retrieval result;
a second retrieval unit that performs a second retrieval on the basis of the first retrieval to obtain a second retrieval result; and
an output unit that outputs the second retrieval result as the final result.
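The chain of units listed above can be sketched as a small pipeline. This is only an illustrative sketch: the function names, the toy `DOCS` collection, and the two stand-in retrieval passes are assumptions introduced here, and the regular-expression tokenizer is a placeholder, since the text does not name a segmentation algorithm (Chinese queries would need a real word segmenter).

```python
import re
from typing import Callable, Dict, List

def segment(query: str) -> List[str]:
    """Word segmentation unit: split the query and keep each word once, in order."""
    words = re.findall(r"\w+", query.lower())
    # dict.fromkeys keeps first occurrences in order, so the words are mutually distinct.
    return list(dict.fromkeys(words))

def retrieve(query: str,
             first_retrieval: Callable[[List[str]], List[str]],
             second_retrieval: Callable[[List[str], List[str]], List[str]]) -> List[str]:
    """Input unit -> segmentation -> first retrieval -> second retrieval -> output unit."""
    words = segment(query)
    first_result = first_retrieval(words)
    second_result = second_retrieval(words, first_result)
    return second_result  # the second retrieval result is the final result

# Hypothetical toy collection and stand-in retrieval passes (assumptions for illustration).
DOCS: Dict[str, str] = {
    "a.txt": "hypergraph ranking of documents",
    "b.txt": "keyword search",
    "c.txt": "document ranking with keyword search",
}

def first_pass(words: List[str]) -> List[str]:
    # Keep every document that mentions at least one query word.
    return [name for name, text in DOCS.items() if any(w in text for w in words)]

def second_pass(words: List[str], candidates: List[str]) -> List[str]:
    # Re-order the candidates by how many query words each one mentions.
    return sorted(candidates, key=lambda name: -sum(w in DOCS[name] for w in words))

print(retrieve("keyword ranking", first_pass, second_pass))
```

Passing the two retrieval passes in as callables keeps the sketch neutral about how each retrieval unit is actually implemented.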
Further, before retrieval, each document is processed as follows:
key information is either entered by a user based on an understanding of the document content, or extracted by automatically scanning the document;
the key information is stored in the document in the form of additional data;
when the document is opened or edited, the additional data is skipped, and reading and writing begin at the start position of the actual document data; when the document is saved, the additional data is retained and remains editable.
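One possible way to store key information as additional data, and to skip it when the document is opened, is sketched below. The byte layout (a 4-byte marker, a 4-byte length, then JSON) is purely an assumption made for this example; the text does not specify how the additional data is encoded or where it sits in the file. The names `MAGIC`, `pack_document`, `split_document`, and `save_document` are hypothetical.

```python
import json
import struct
from typing import Optional, Tuple

MAGIC = b"KINF"  # hypothetical marker announcing an additional-data block

def pack_document(key_info: dict, body: bytes) -> bytes:
    """Prepend the key information to the real document data as additional data."""
    extra = json.dumps(key_info, ensure_ascii=False).encode("utf-8")
    # Layout (assumed): marker, big-endian 4-byte length, JSON payload, real document data.
    return MAGIC + struct.pack(">I", len(extra)) + extra + body

def split_document(blob: bytes) -> Tuple[Optional[dict], bytes]:
    """Return (key_info, body); key_info is None when no additional data is present."""
    if not blob.startswith(MAGIC):
        return None, blob
    (length,) = struct.unpack(">I", blob[4:8])
    start = 8 + length  # start position of the real document data
    key_info = json.loads(blob[8:start].decode("utf-8"))
    return key_info, blob[start:]

def save_document(key_info: Optional[dict], new_body: bytes) -> bytes:
    """On save, re-attach the (possibly edited) additional data to the edited body."""
    return new_body if key_info is None else pack_document(key_info, new_body)

blob = pack_document({"keywords": ["hypergraph", "retrieval"]}, b"Real document text")
info, body = split_document(blob)   # an editor would work on `body` only
blob = save_document(info, body + b" (edited)")
print(split_document(blob))
```

A length-prefixed block lets a reader jump straight to the real document data without parsing the key information. A file that happens to begin with the marker bytes would be misread, so a production format would need a more robust signature.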
Further, the first retrieval unit includes:
a key information extraction subunit that automatically scans the additional data of each document, using the words produced by the word segmentation unit as search terms; and
a document information retrieval subunit that determines whether the document has additional data matching the search terms and, if so, performs content-based retrieval on the additional data; if the document has no additional data, it is either searched in binary form or skipped.
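The behaviour of the first retrieval unit can be sketched as follows, under several assumptions: each document is modelled as a pair of (additional-data keywords or `None`, raw bytes), "content-based retrieval on the additional data" is reduced to a case-insensitive keyword match, and the `scan_binary` flag chooses between the binary search and skipping. None of these details are fixed by the text.

```python
from typing import Dict, List, Optional, Tuple

# (keywords stored as additional data, or None when absent; raw document bytes)
Document = Tuple[Optional[List[str]], bytes]

def first_retrieval(words: List[str],
                    docs: Dict[str, Document],
                    scan_binary: bool = True) -> List[str]:
    """Return the names of documents matched by the first retrieval."""
    hits = []
    for name, (key_info, body) in docs.items():
        if key_info is not None:
            # Additional data exists: match the search terms against it only.
            keys = {k.lower() for k in key_info}
            if any(w.lower() in keys for w in words):
                hits.append(name)
        elif scan_binary:
            # No additional data: fall back to searching the raw bytes.
            lowered = body.lower()
            if any(w.lower().encode("utf-8") in lowered for w in words):
                hits.append(name)
        # Otherwise the document is skipped.
    return hits

docs: Dict[str, Document] = {
    "a": (["Hypergraph", "ranking"], b"binary payload"),
    "b": (None, b"plain text about Ranking"),
    "c": (["finance"], b"ranking appears only in the body"),
}
print(first_retrieval(["ranking"], docs))
print(first_retrieval(["ranking"], docs, scan_binary=False))
```

Matching against the short additional data instead of the full content is what makes this pass cheap; the binary fallback trades that speed for coverage of documents that were never annotated.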
Further, the second retrieval unit includes:
a hypergraph construction subunit configured to construct a hypergraph for each document in a target document set to describe the implicit semantic information contained in the document; and
a document sorting subunit configured to search the target document set for a specific query, based on the hypergraph constructed by the hypergraph construction subunit, and to sort the search results.
Further, the hypergraph construction subunit includes:
a concept extraction submodule configured to extract concepts from the document using domain ontology information and to calculate the weights of the concepts;
a hypergraph construction submodule configured to construct an initial hypergraph for the document;
a hypergraph improvement submodule configured to improve the initial hypergraph using the domain ontology information; and
a weight assignment submodule configured to assign weights to the nodes and edges of the improved hypergraph.
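The four submodules can be illustrated with a toy construction. Everything specific in this sketch is an assumption: the text does not define the ontology format, what a hyperedge groups, how the hypergraph is improved, or the weighting formulas. Here the ontology is a term-to-concept map plus a related-concepts map, each sentence that mentions two or more concepts becomes a hyperedge, improvement adds a hyperedge linking a concept to its ontology-related concepts found in the same document, and weights are relative frequencies.

```python
import re
from collections import Counter
from typing import Dict, FrozenSet, Tuple

# Hypothetical toy domain ontology: surface term -> concept, and concept -> related concepts.
TERM_TO_CONCEPT = {"search": "retrieval", "retrieval": "retrieval", "query": "query",
                   "hypergraph": "hypergraph", "graph": "hypergraph", "ranking": "ranking"}
RELATED = {"retrieval": {"query", "ranking"}}

def build_hypergraph(text: str) -> Tuple[Dict[str, float], Dict[FrozenSet[str], float]]:
    # Concept extraction: map ontology terms in each sentence to concepts.
    sentences = [s for s in re.split(r"[.!?]+", text.lower()) if s.strip()]
    per_sentence = [[TERM_TO_CONCEPT[w] for w in re.findall(r"\w+", s) if w in TERM_TO_CONCEPT]
                    for s in sentences]
    counts = Counter(c for concepts in per_sentence for c in concepts)
    total = sum(counts.values())
    # Node weight (assumed): relative frequency of the concept in the document.
    node_w = {c: n / total for c, n in counts.items()}

    # Initial hypergraph: one hyperedge per sentence that mentions at least two concepts.
    edges = Counter(frozenset(cs) for cs in per_sentence if len(set(cs)) >= 2)

    # Improvement: link each concept to its ontology-related concepts present in the document.
    for concept, related in RELATED.items():
        if concept in counts:
            members = frozenset({concept} | (related & set(counts)))
            if len(members) >= 2:
                edges[members] += 1

    # Edge weight (assumed): occurrence count times the mean weight of the member nodes.
    edge_w = {e: n * sum(node_w[c] for c in e) / len(e) for e, n in edges.items()}
    return node_w, edge_w

nodes, hyperedges = build_hypergraph("Hypergraph ranking improves retrieval. A query drives search.")
print(nodes)
print(hyperedges)
```

In this example "search" and "retrieval" collapse into one concept, and the improvement step adds an edge joining retrieval, query, and ranking even though no single sentence mentions all three, which is one way a hypergraph can carry semantic links that are only implicit in the text.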
A document retrieval method, comprising the following steps:
inputting a query sentence;
performing word segmentation on the query sentence to obtain a plurality of mutually distinct words;
performing a first retrieval using the segmented words to obtain a first retrieval result;
performing a second retrieval on the basis of the first retrieval to obtain a second retrieval result; and
outputting the second retrieval result as the final result.
Further, performing the first retrieval using the segmented words to obtain the first retrieval result comprises the following steps:
key information is either entered by a user based on an understanding of the document content, or extracted by automatically scanning the document;
the key information is stored in the document in the form of additional data;
when the document is opened or edited, the additional data is skipped, and reading and writing begin at the start position of the actual document data; when the document is saved, the additional data is retained and remains editable;
the additional data of each document is automatically scanned using the segmented words as search terms; it is determined whether the document has additional data matching the search terms and, if so, content-based retrieval is performed on the additional data; if the document has no additional data, it is either searched in binary form or skipped.
Further, performing the second retrieval on the basis of the first retrieval to obtain the second retrieval result comprises the following steps:
constructing a hypergraph for each document in the target document set to describe the implicit semantic information contained in the document; and
based on the constructed hypergraph, searching the target document set for a specific query and sorting the search results, wherein constructing the hypergraph comprises the following steps:
extracting concepts from the document using domain ontology information and calculating the weights of the concepts;
constructing an initial hypergraph for the document;
improving the initial hypergraph using the domain ontology information; and
assigning weights to the nodes and edges of the improved hypergraph.
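Once each document has a weighted hypergraph, the search-and-sort step could look like the sketch below. The scoring rule is an assumption chosen for illustration (the text does not give one): a document scores the weights of the query concepts it contains, plus the weights of every hyperedge that touches a query concept. The `Hypergraph` alias and the sample weights are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# Per-document hypergraph: (node weights, hyperedge weights), both assumed precomputed.
Hypergraph = Tuple[Dict[str, float], Dict[FrozenSet[str], float]]

def score(query_concepts: Set[str], graph: Hypergraph) -> float:
    node_w, edge_w = graph
    # Direct evidence: query concepts that appear as nodes.
    direct = sum(node_w.get(c, 0.0) for c in query_concepts)
    # Contextual evidence: hyperedges that contain at least one query concept.
    via_edges = sum(w for edge, w in edge_w.items() if edge & query_concepts)
    return direct + via_edges

def rank(query_concepts: Set[str], graphs: Dict[str, Hypergraph]) -> List[Tuple[str, float]]:
    """Search the target document set and sort the matches by descending score."""
    scored = [(name, score(query_concepts, g)) for name, g in graphs.items()]
    return sorted([s for s in scored if s[1] > 0], key=lambda s: s[1], reverse=True)

graphs: Dict[str, Hypergraph] = {
    "a": ({"retrieval": 0.5, "ranking": 0.5}, {frozenset({"retrieval", "ranking"}): 0.5}),
    "b": ({"retrieval": 0.3}, {}),
    "c": ({"finance": 1.0}, {}),
}
print(rank({"retrieval"}, graphs))
```

Document "a" outranks "b" here because the query concept also participates in a weighted hyperedge, so context contributes to the score in addition to the bare occurrence.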
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related description of the system described above may refer to the corresponding process in the foregoing method embodiments, and will not be described herein again.
It should be noted that the system of the foregoing embodiment is illustrated only by the division into the functional sub-units described above. In practical applications, the functions may be allocated to different functional sub-units as needed; that is, the sub-units or steps of the embodiments of the invention may be further decomposed or combined. For example, the sub-units of the foregoing embodiment may be combined into one sub-unit, or further split into multiple sub-units, to perform all or part of the functions described above. The names of the sub-units and steps in the embodiments of the invention serve only to distinguish them and are not to be construed as unduly limiting the invention.
The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention.
Claims (8)
1. A document retrieval system, comprising:
an input unit that receives a query sentence;
a word segmentation unit that segments the query sentence into a plurality of mutually distinct words;
a first retrieval unit that performs a first retrieval using the words produced by the word segmentation unit to obtain a first retrieval result;
a second retrieval unit that performs a second retrieval on the basis of the first retrieval to obtain a second retrieval result; and
an output unit that outputs the second retrieval result as the final result.
2. The document retrieval system of claim 1, wherein, before retrieval, each document is processed as follows:
key information is either entered by a user based on an understanding of the document content, or extracted by automatically scanning the document;
the key information is stored in the document in the form of additional data;
when the document is opened or edited, the additional data is skipped, and reading and writing begin at the start position of the actual document data; when the document is saved, the additional data is retained and remains editable.
3. The document retrieval system of claim 2, wherein the first retrieval unit includes:
a key information extraction subunit that automatically scans the additional data of each document, using the words produced by the word segmentation unit as search terms; and
a document information retrieval subunit that determines whether the document has additional data matching the search terms and, if so, performs content-based retrieval on the additional data; if the document has no additional data, it is either searched in binary form or skipped.
4. The document retrieval system of claim 3, wherein the second retrieval unit includes:
a hypergraph construction subunit configured to construct a hypergraph for each document in a target document set to describe the implicit semantic information contained in the document; and
a document sorting subunit configured to search the target document set for a specific query, based on the hypergraph constructed by the hypergraph construction subunit, and to sort the search results.
5. The document retrieval system of claim 4, wherein the hypergraph construction subunit includes:
a concept extraction submodule configured to extract concepts from the document using domain ontology information and to calculate the weights of the concepts;
a hypergraph construction submodule configured to construct an initial hypergraph for the document;
a hypergraph improvement submodule configured to improve the initial hypergraph using the domain ontology information; and
a weight assignment submodule configured to assign weights to the nodes and edges of the improved hypergraph.
6. A document retrieval method, comprising the following steps:
inputting a query sentence;
performing word segmentation on the query sentence to obtain a plurality of mutually distinct words;
performing a first retrieval using the segmented words to obtain a first retrieval result;
performing a second retrieval on the basis of the first retrieval to obtain a second retrieval result; and
outputting the second retrieval result as the final result.
7. The document retrieval method of claim 6, wherein performing the first retrieval using the segmented words to obtain the first retrieval result comprises the following steps:
key information is either entered by a user based on an understanding of the document content, or extracted by automatically scanning the document;
the key information is stored in the document in the form of additional data;
when the document is opened or edited, the additional data is skipped, and reading and writing begin at the start position of the actual document data; when the document is saved, the additional data is retained and remains editable;
the additional data of each document is automatically scanned using the segmented words as search terms; it is determined whether the document has additional data matching the search terms and, if so, content-based retrieval is performed on the additional data; if the document has no additional data, it is either searched in binary form or skipped.
8. The document retrieval method of claim 7, wherein performing the second retrieval on the basis of the first retrieval to obtain the second retrieval result comprises the following steps:
constructing a hypergraph for each document in the target document set to describe the implicit semantic information contained in the document; and
based on the constructed hypergraph, searching the target document set for a specific query and sorting the search results, wherein constructing the hypergraph comprises the following steps:
extracting concepts from the document using domain ontology information and calculating the weights of the concepts;
constructing an initial hypergraph for the document;
improving the initial hypergraph using the domain ontology information; and
assigning weights to the nodes and edges of the improved hypergraph.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910413341.4A CN111949679A (en) | 2019-05-17 | 2019-05-17 | Document retrieval system and method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111949679A true CN111949679A (en) | 2020-11-17 |
Family
ID=73336110
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910413341.4A Pending CN111949679A (en) | 2019-05-17 | 2019-05-17 | Document retrieval system and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111949679A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20201117 |