CN111949679A - Document retrieval system and method - Google Patents

Info

Publication number
CN111949679A
CN111949679A CN201910413341.4A CN201910413341A CN111949679A CN 111949679 A CN111949679 A CN 111949679A CN 201910413341 A CN201910413341 A CN 201910413341A CN 111949679 A CN111949679 A CN 111949679A
Authority
CN
China
Prior art keywords
document
retrieval
hypergraph
additional data
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910413341.4A
Other languages
Chinese (zh)
Inventor
万江
王小乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Geji Network Technology Co ltd
Original Assignee
Shanghai Geji Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-05-17
Filing date
2019-05-17
Publication date
2020-11-17
Application filed by Shanghai Geji Network Technology Co ltd
Priority to CN201910413341.4A
Publication of CN111949679A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/243 Natural language query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G06F16/24534 Query rewriting; Transformation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of computers, and particularly relates to a document retrieval system and a document retrieval method. The system comprises: an input unit, which receives a query sentence; a word segmentation unit, which segments the query sentence into a plurality of mutually distinct words; a first retrieval unit, which performs a first retrieval using the words produced by the word segmentation unit to obtain a first retrieval result; a second retrieval unit, which performs a second retrieval on the basis of the first retrieval result to obtain a second retrieval result; and an output unit, which outputs the second retrieval result as the final result. The invention has the advantages of accurate retrieval results and high retrieval efficiency.

Description

Document retrieval system and method
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a document retrieval system and a document retrieval method.
Background
With the advent of the information age, the number of documents that can be retrieved has grown. How to effectively find useful information in a large number of documents becomes critical.
Information Retrieval (IR) techniques can be used to search a collection of documents for particular information. The task can be subdivided into: searching for information contained in documents; searching for the documents themselves; searching for metadata describing documents; and searching for text, sound, images, or data in a database, whether a standalone relational database or a hypertext networked database such as the Internet or a content/document management system.
When performing document retrieval, a document retrieval system has two main tasks: first, finding the documents relevant to a user query; second, evaluating the matching results and ranking the documents by relevance. Many conventional document retrieval systems rely on keyword searching. These systems perform retrieval mainly by considering a few specific factors, such as the frequency and location of query terms in documents, hyperlinks to documents, and document access information.
Disclosure of Invention
In view of the above, the main objective of the present invention is to provide a document retrieval system and method that deliver accurate retrieval results and high retrieval efficiency.
To achieve this objective, the technical solution of the invention is as follows:
A document retrieval system, the system comprising:
an input unit, which receives a query sentence;
a word segmentation unit, which segments the query sentence to obtain a plurality of mutually distinct words;
a first retrieval unit, which performs a first retrieval using the words produced by the word segmentation unit to obtain a first retrieval result;
a second retrieval unit, which performs a second retrieval on the basis of the first retrieval result to obtain a second retrieval result; and
an output unit, which outputs the second retrieval result as the final result.
Further, before retrieval, each document is preprocessed as follows:
key information is entered by a user based on an understanding of the document content, or is extracted by automatically scanning the document;
the key information is stored in the document in the form of additional data;
when the document is opened or edited, the additional data is skipped, and reading and writing start from the initial position of the real document data; when the document is saved, the additional data is preserved and remains editable.
Further, the first retrieval unit includes:
a key information extraction subunit, which automatically scans the additional data of the document, using the words produced by the word segmentation unit as search terms; and
a document information retrieval subunit, which determines whether the document has additional data matching the search terms and, if so, performs content-based retrieval on the additional data; if the document has no additional data, the document is either searched in binary mode or skipped.
Further, the second retrieval unit includes:
a hypergraph construction subunit configured to construct a hypergraph for the documents in a target document set to describe the implicit semantic information contained in the documents; and
a document sorting subunit configured to search the target document set for a specific query, based on the hypergraph constructed by the hypergraph construction subunit, and to sort the search results.
Further, the hypergraph construction subunit includes:
a concept extraction submodule configured to extract concepts from the document using the domain ontology information and to calculate weights of the concepts;
a hypergraph construction sub-module configured to construct an initial hypergraph for the document;
a hypergraph improvement submodule configured to improve the initial hypergraph using the domain ontology information; and
a weight assignment submodule configured to assign weights to nodes and edges in the improved hypergraph.
A document retrieval method, the method comprising the following steps:
inputting a query sentence;
performing word segmentation on the query sentence to obtain a plurality of mutually distinct words;
performing a first retrieval using the words obtained by word segmentation to obtain a first retrieval result;
performing a second retrieval on the basis of the first retrieval result to obtain a second retrieval result; and
outputting the second retrieval result as the final result.
Further, performing the first retrieval using the words obtained by word segmentation to obtain the first retrieval result comprises the following steps:
receiving key information entered by a user based on an understanding of the document content, or automatically scanning the document and extracting the key information;
storing the key information in the document in the form of additional data;
when the document is opened or edited, skipping the additional data and reading and writing from the initial position of the real document data; when the document is saved, preserving the additional data, which remains editable;
automatically scanning the additional data of the document, using the words obtained by word segmentation as search terms; determining whether the document has additional data matching the search terms and, if so, performing content-based retrieval on the additional data; if the document has no additional data, either searching the document in binary mode or skipping it.
Further, performing the second retrieval on the basis of the first retrieval result to obtain the second retrieval result comprises the following steps:
constructing a hypergraph for the documents in the target document set to describe the implicit semantic information contained in the documents; and
based on the constructed hypergraph, searching the target document set for a specific query and sorting the search results, wherein constructing the hypergraph comprises the following steps:
extracting concepts from the document using domain ontology information and calculating weights of the concepts;
constructing an initial hypergraph for the document;
improving the initial hypergraph using the domain ontology information; and
assigning weights to the nodes and edges in the improved hypergraph.
The document retrieval system and method have the following beneficial effect: performing two rounds of retrieval improves the accuracy of document retrieval, so that users' actual retrieval needs are better met.
Drawings
FIG. 1 is a system configuration diagram of a document retrieval system according to the present invention.
Detailed Description
The method of the present invention will be described in further detail below with reference to the accompanying drawings and embodiments of the invention.
As shown in FIG. 1, a document retrieval system comprises:
an input unit, which receives a query sentence;
a word segmentation unit, which segments the query sentence to obtain a plurality of mutually distinct words;
a first retrieval unit, which performs a first retrieval using the words produced by the word segmentation unit to obtain a first retrieval result;
a second retrieval unit, which performs a second retrieval on the basis of the first retrieval result to obtain a second retrieval result; and
an output unit, which outputs the second retrieval result as the final result.
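The flow through the five units can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy corpus `DOCS`, the function names, whitespace splitting as the segmentation step, and term counting as the second-stage ordering are stand-ins and are not specified by the invention.

```python
from typing import Callable, Dict, List

# Hypothetical toy corpus: document id -> text.
DOCS: Dict[str, str] = {
    "d1": "hypergraph ranking of documents",
    "d2": "cooking recipes",
    "d3": "document ranking",
}


def run_pipeline(
    query: str,
    segment: Callable[[str], List[str]],
    first_retrieval: Callable[[List[str]], List[str]],
    second_retrieval: Callable[[List[str], List[str]], List[str]],
) -> List[str]:
    """Chain the units: input -> segmentation -> first -> second -> output."""
    words = segment(query)                       # word segmentation unit
    candidates = first_retrieval(words)          # first retrieval unit
    return second_retrieval(words, candidates)   # second retrieval unit


def toy_segment(query: str) -> List[str]:
    """Split on whitespace and keep each word once, in order of appearance."""
    words: List[str] = []
    for word in query.lower().split():
        if word not in words:
            words.append(word)
    return words


def toy_first_retrieval(words: List[str]) -> List[str]:
    """Keep documents that contain at least one query word."""
    return [doc_id for doc_id, text in DOCS.items()
            if any(w in text.split() for w in words)]


def toy_second_retrieval(words: List[str], candidates: List[str]) -> List[str]:
    """Re-rank the first-stage candidates by total query-word count."""
    def score(doc_id: str) -> int:
        return sum(DOCS[doc_id].split().count(w) for w in words)
    return sorted(candidates, key=score, reverse=True)


result = run_pipeline("document ranking ranking",
                      toy_segment, toy_first_retrieval, toy_second_retrieval)
print(result)  # ['d3', 'd1']
```

Passing the stages in as callables mirrors the statement that the subunits may be combined or split as needed: any stage can be swapped without touching the others.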
Further, before retrieval, each document is preprocessed as follows:
key information is entered by a user based on an understanding of the document content, or is extracted by automatically scanning the document;
the key information is stored in the document in the form of additional data;
when the document is opened or edited, the additional data is skipped, and reading and writing start from the initial position of the real document data; when the document is saved, the additional data is preserved and remains editable.
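One possible way to store key information as additional data is sketched below. The byte layout is purely an assumption: a hypothetical 4-byte marker `KINF`, a 4-byte big-endian length, a JSON payload, and then the real document data. The description does not define a concrete format.

```python
import json
import struct
from typing import List, Optional, Tuple

MAGIC = b"KINF"            # hypothetical marker announcing additional data
HEADER = len(MAGIC) + 4    # marker plus a 4-byte big-endian payload length


def attach_key_info(body: bytes, key_info: List[str]) -> bytes:
    """Prefix the real document data with the key information."""
    payload = json.dumps(key_info, ensure_ascii=False).encode("utf-8")
    return MAGIC + struct.pack(">I", len(payload)) + payload + body


def split_document(blob: bytes) -> Tuple[Optional[List[str]], bytes]:
    """Return (key_info, real document data); key_info is None if absent."""
    if not blob.startswith(MAGIC):
        return None, blob
    (length,) = struct.unpack(">I", blob[len(MAGIC):HEADER])
    start = HEADER + length          # initial position of the real data
    key_info = json.loads(blob[HEADER:start].decode("utf-8"))
    return key_info, blob[start:]


def open_body(blob: bytes) -> bytes:
    """Opening or editing skips the additional data."""
    return split_document(blob)[1]


def save_body(blob: bytes, new_body: bytes) -> bytes:
    """Saving keeps whatever additional data the document already carried."""
    key_info, _ = split_document(blob)
    return new_body if key_info is None else attach_key_info(new_body, key_info)


doc = attach_key_info(b"real content", ["contract", "2019"])
print(open_body(doc))                              # b'real content'
print(split_document(save_body(doc, b"edited")))   # (['contract', '2019'], b'edited')
```

A length-prefixed header lets a reader jump straight to the real data without parsing the key information, which matches the "skip on open" behaviour described above.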
Further, the first retrieval unit includes:
a key information extraction subunit, which automatically scans the additional data of the document, using the words produced by the word segmentation unit as search terms; and
a document information retrieval subunit, which determines whether the document has additional data matching the search terms and, if so, performs content-based retrieval on the additional data; if the document has no additional data, the document is either searched in binary mode or skipped.
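The branching performed by the first retrieval unit might look like the following sketch. Two points are assumptions: documents are modelled as a pair of optional key information and raw bytes, and "binary mode" is read here as a substring search over the raw bytes. The `scan_binary` flag is a hypothetical name for the choice between searching and skipping.

```python
from typing import Dict, List, Optional, Tuple

# (key information or None, real document data); a hypothetical representation.
Document = Tuple[Optional[List[str]], bytes]


def first_retrieval(words: List[str],
                    docs: Dict[str, Document],
                    scan_binary: bool = True) -> List[str]:
    """Return ids of documents selected by the first retrieval."""
    hits: List[str] = []
    for doc_id, (key_info, body) in docs.items():
        if key_info is not None:
            # Additional data present: match the search terms against it only.
            if any(w in entry for w in words for entry in key_info):
                hits.append(doc_id)
        elif scan_binary:
            # No additional data: fall back to searching the raw bytes.
            if any(w.encode("utf-8") in body for w in words):
                hits.append(doc_id)
        # Otherwise the document is skipped.
    return hits


docs: Dict[str, Document] = {
    "a": (["contract law"], b"..."),
    "b": (None, b"a lease contract"),
    "c": (["recipes"], b"contract"),
}
print(first_retrieval(["contract"], docs))                     # ['a', 'b']
print(first_retrieval(["contract"], docs, scan_binary=False))  # ['a']
```

Document `c` is not returned even though its body contains the term, because in this sketch a document that carries additional data is judged on that data alone.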
Further, the second retrieval unit includes:
a hypergraph construction subunit configured to construct a hypergraph for the documents in a target document set to describe the implicit semantic information contained in the documents; and
a document sorting subunit configured to search the target document set for a specific query, based on the hypergraph constructed by the hypergraph construction subunit, and to sort the search results.
Further, the hypergraph construction subunit includes:
a concept extraction submodule configured to extract concepts from the document using the domain ontology information and to calculate weights of the concepts;
a hypergraph construction sub-module configured to construct an initial hypergraph for the document;
a hypergraph improvement submodule configured to improve the initial hypergraph using the domain ontology information; and
a weight assignment submodule configured to assign weights to nodes and edges in the improved hypergraph.
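The four submodules can be illustrated with a small sketch. Every concrete choice in it is an assumption, since the description fixes none of them: the toy ontology tables, relative frequency as the concept weight, one hyperedge per sentence, and expansion with a parent concept as the improvement step.

```python
from collections import Counter
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical domain ontology: surface term -> concept, and concept -> parent.
ONTOLOGY_TERMS = {"hypergraph": "Hypergraph", "graph": "Graph",
                  "ranking": "Ranking", "retrieval": "Retrieval"}
ONTOLOGY_PARENT = {"Hypergraph": "Graph"}

Hypergraph = Tuple[Dict[str, float], Dict[FrozenSet[str], float]]


def build_hypergraph(sentences: List[str]) -> Hypergraph:
    # 1. Concept extraction: map ontology terms to concepts and count them.
    per_sentence = [[ONTOLOGY_TERMS[t] for t in s.lower().split()
                     if t in ONTOLOGY_TERMS] for s in sentences]
    counts = Counter(c for concepts in per_sentence for c in concepts)
    total = sum(counts.values()) or 1

    # 2. Initial hypergraph: one hyperedge per sentence joining its concepts.
    edges = Counter(frozenset(cs) for cs in per_sentence if len(set(cs)) >= 2)

    # 3. Improvement: extend each hyperedge with ontology parent concepts.
    improved: Counter = Counter()
    for edge, n in edges.items():
        parents = {ONTOLOGY_PARENT[c] for c in edge if c in ONTOLOGY_PARENT}
        improved[frozenset(set(edge) | parents)] += n

    # 4. Weight assignment: relative frequencies for nodes and hyperedges.
    node_w = {c: counts[c] / total for c in counts}
    for edge in improved:
        for c in edge:
            node_w.setdefault(c, 0.0)   # concepts added only by the ontology
    n_edges = sum(improved.values()) or 1
    edge_w = {edge: n / n_edges for edge, n in improved.items()}
    return node_w, edge_w


node_w, edge_w = build_hypergraph(
    ["hypergraph ranking improves retrieval", "ranking matters"])
print(node_w["Ranking"])   # 0.5
print(len(edge_w))         # 1
```

A hyperedge joins any number of concepts at once, so a sentence mentioning three concepts yields one edge rather than three pairwise links; that is what distinguishes the structure from an ordinary graph.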
A document retrieval method, the method comprising the following steps:
inputting a query sentence;
performing word segmentation on the query sentence to obtain a plurality of mutually distinct words;
performing a first retrieval using the words obtained by word segmentation to obtain a first retrieval result;
performing a second retrieval on the basis of the first retrieval result to obtain a second retrieval result; and
outputting the second retrieval result as the final result.
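The word segmentation step is not tied to any particular algorithm. As one illustration, the sketch below uses forward maximum matching against a small dictionary, a common baseline for Chinese text, and removes repeated words so that the result contains mutually distinct words. The dictionary `vocab` and the `max_len` limit are hypothetical.

```python
from typing import List, Set


def segment(query: str, vocab: Set[str], max_len: int = 4) -> List[str]:
    """Forward maximum matching that returns each word once, in order."""
    words: List[str] = []
    i = 0
    while i < len(query):
        # Try the longest candidate first; a single character always matches.
        for size in range(min(max_len, len(query) - i), 0, -1):
            piece = query[i:i + size]
            if size == 1 or piece in vocab:
                break
        if not piece.isspace() and piece not in words:
            words.append(piece)      # keep only words not seen before
        i += size
    return words


vocab = {"文档", "检索", "系统"}
print(segment("文档检索系统文档", vocab))  # ['文档', '检索', '系统']
```

The second occurrence of "文档" is dropped, so later stages receive each search term only once.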
Further, performing the first retrieval using the words obtained by word segmentation to obtain the first retrieval result comprises the following steps:
receiving key information entered by a user based on an understanding of the document content, or automatically scanning the document and extracting the key information;
storing the key information in the document in the form of additional data;
when the document is opened or edited, skipping the additional data and reading and writing from the initial position of the real document data; when the document is saved, preserving the additional data, which remains editable;
automatically scanning the additional data of the document, using the words obtained by word segmentation as search terms; determining whether the document has additional data matching the search terms and, if so, performing content-based retrieval on the additional data; if the document has no additional data, either searching the document in binary mode or skipping it.
Further, performing the second retrieval on the basis of the first retrieval result to obtain the second retrieval result comprises the following steps:
constructing a hypergraph for the documents in the target document set to describe the implicit semantic information contained in the documents; and
based on the constructed hypergraph, searching the target document set for a specific query and sorting the search results, wherein constructing the hypergraph comprises the following steps:
extracting concepts from the document using domain ontology information and calculating weights of the concepts;
constructing an initial hypergraph for the document;
improving the initial hypergraph using the domain ontology information; and
assigning weights to the nodes and edges in the improved hypergraph.
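Sorting with the weighted hypergraphs could be sketched as follows. The scoring rule is an assumption chosen for illustration: the sum of the weights of matched nodes, plus each hyperedge weight scaled by the fraction of that edge covered by the query when at least two query concepts fall on it. The description does not give a ranking formula.

```python
from typing import Dict, FrozenSet, List, Tuple

Hypergraph = Tuple[Dict[str, float], Dict[FrozenSet[str], float]]


def score(query_concepts: List[str], graph: Hypergraph) -> float:
    """Hypothetical relevance score of one document hypergraph."""
    node_w, edge_w = graph
    q = set(query_concepts)
    node_score = sum(node_w.get(c, 0.0) for c in q)
    # Reward hyperedges that tie together two or more query concepts.
    edge_score = sum(w * len(q & edge) / len(edge)
                     for edge, w in edge_w.items() if len(q & edge) >= 2)
    return node_score + edge_score


def rank(query_concepts: List[str],
         graphs: Dict[str, Hypergraph]) -> List[str]:
    """Sort document ids by descending score."""
    return sorted(graphs, key=lambda d: score(query_concepts, graphs[d]),
                  reverse=True)


graphs: Dict[str, Hypergraph] = {
    "d1": ({"Ranking": 0.5, "Retrieval": 0.5},
           {frozenset({"Ranking", "Retrieval"}): 1.0}),
    "d2": ({"Ranking": 1.0}, {}),
    "d3": ({"Cooking": 1.0}, {}),
}
print(rank(["Ranking", "Retrieval"], graphs))  # ['d1', 'd2', 'd3']
```

Document `d1` outranks `d2` because its hyperedge links both query concepts, which is the kind of implicit semantic relation the hypergraph is meant to capture.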
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related description of the system described above may refer to the corresponding process in the foregoing method embodiments, and will not be described herein again.
It should be noted that the system provided in the foregoing embodiment is illustrated only by the above division into functional subunits. In practical applications, the functions may be allocated to different functional subunits as needed; that is, the subunits or steps in the embodiments of the present invention may be further decomposed or combined. For example, the subunits in the foregoing embodiment may be combined into one subunit, or further split into multiple subunits, to complete all or part of the functions described above. The names of the subunits and steps involved in the embodiments of the present invention serve only to distinguish them and are not to be construed as unduly limiting the present invention.
The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention.

Claims (8)

1. A document retrieval system, the system comprising:
an input unit, which receives a query sentence;
a word segmentation unit, which segments the query sentence to obtain a plurality of mutually distinct words;
a first retrieval unit, which performs a first retrieval using the words produced by the word segmentation unit to obtain a first retrieval result;
a second retrieval unit, which performs a second retrieval on the basis of the first retrieval result to obtain a second retrieval result; and
an output unit, which outputs the second retrieval result as the final result.
2. The document retrieval system of claim 1, wherein, before retrieval, each document is preprocessed as follows:
key information is entered by a user based on an understanding of the document content, or is extracted by automatically scanning the document;
the key information is stored in the document in the form of additional data;
when the document is opened or edited, the additional data is skipped, and reading and writing start from the initial position of the real document data; when the document is saved, the additional data is preserved and remains editable.
3. The document retrieval system according to claim 2, wherein the first retrieval unit includes:
a key information extraction subunit, which automatically scans the additional data of the document, using the words produced by the word segmentation unit as search terms; and
a document information retrieval subunit, which determines whether the document has additional data matching the search terms and, if so, performs content-based retrieval on the additional data; if the document has no additional data, the document is either searched in binary mode or skipped.
4. The document retrieval system of claim 3, wherein the second retrieval unit includes:
a hypergraph construction subunit configured to construct a hypergraph for the documents in a target document set to describe the implicit semantic information contained in the documents; and
a document sorting subunit configured to search the target document set for a specific query, based on the hypergraph constructed by the hypergraph construction subunit, and to sort the search results.
5. The document retrieval system of claim 4, wherein the hypergraph construction subunit comprises:
a concept extraction submodule configured to extract concepts from the document using the domain ontology information and to calculate weights of the concepts;
a hypergraph construction sub-module configured to construct an initial hypergraph for the document;
a hypergraph improvement submodule configured to improve the initial hypergraph using the domain ontology information; and
a weight assignment submodule configured to assign weights to nodes and edges in the improved hypergraph.
6. A method of document retrieval, the method comprising the steps of:
inputting a query sentence;
performing word segmentation on the query sentence to obtain a plurality of mutually distinct words;
performing a first retrieval using the words obtained by word segmentation to obtain a first retrieval result;
performing a second retrieval on the basis of the first retrieval result to obtain a second retrieval result; and
outputting the second retrieval result as the final result.
7. The document retrieval method of claim 6, wherein performing the first retrieval using the words obtained by word segmentation to obtain the first retrieval result comprises the steps of:
receiving key information entered by a user based on an understanding of the document content, or automatically scanning the document and extracting the key information;
storing the key information in the document in the form of additional data;
when the document is opened or edited, skipping the additional data and reading and writing from the initial position of the real document data; when the document is saved, preserving the additional data, which remains editable;
automatically scanning the additional data of the document, using the words obtained by word segmentation as search terms; determining whether the document has additional data matching the search terms and, if so, performing content-based retrieval on the additional data; if the document has no additional data, either searching the document in binary mode or skipping it.
8. The document retrieval method of claim 7, wherein performing the second retrieval on the basis of the first retrieval result to obtain the second retrieval result comprises the steps of:
constructing a hypergraph for the documents in the target document set to describe the implicit semantic information contained in the documents; and
based on the constructed hypergraph, searching the target document set for a specific query and sorting the search results, wherein constructing the hypergraph comprises the following steps:
extracting concepts from the document using domain ontology information and calculating weights of the concepts;
constructing an initial hypergraph for the document;
improving the initial hypergraph using the domain ontology information; and
assigning weights to the nodes and edges in the improved hypergraph.
CN201910413341.4A 2019-05-17 2019-05-17 Document retrieval system and method Pending CN111949679A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910413341.4A CN111949679A (en) 2019-05-17 2019-05-17 Document retrieval system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910413341.4A CN111949679A (en) 2019-05-17 2019-05-17 Document retrieval system and method

Publications (1)

Publication Number Publication Date
CN111949679A true CN111949679A (en) 2020-11-17

Family

ID=73336110

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910413341.4A Pending CN111949679A (en) 2019-05-17 2019-05-17 Document retrieval system and method

Country Status (1)

Country Link
CN (1) CN111949679A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101777046A (en) * 2009-01-09 2010-07-14 佳能株式会社 Searching method and system
CN102915304A (en) * 2011-08-01 2013-02-06 日电(中国)有限公司 Document retrieval device and document retrieval method
CN103678412A (en) * 2012-09-21 2014-03-26 北京大学 Document retrieval method and device
CN103838833A (en) * 2014-02-24 2014-06-04 华中师范大学 Full-text retrieval system based on semantic analysis of relevant words
CN107145530A (en) * 2017-04-18 2017-09-08 北京明朝万达科技股份有限公司 A kind of document retrieval method and system based on additional data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
胡川洌等: "基于领域本体的语义查询扩展", 《计算机系统应用》 *

Similar Documents

Publication Publication Date Title
US8341112B2 (en) Annotation by search
US20040249808A1 (en) Query expansion using query logs
US8510312B1 (en) Automatic metadata identification
US20050102251A1 (en) Method of document searching
CN107844493B (en) File association method and system
US10482146B2 (en) Systems and methods for automatic customization of content filtering
CN111125086B (en) Method, device, storage medium and processor for acquiring data resources
CN108875065B (en) Indonesia news webpage recommendation method based on content
CN111026710A (en) Data set retrieval method and system
CN106844482B (en) Search engine-based retrieval information matching method and device
CN115563313A (en) Knowledge graph-based document book semantic retrieval system
US20040122660A1 (en) Creating taxonomies and training data in multiple languages
KR101472451B1 (en) System and Method for Managing Digital Contents
CN110990003B (en) API recommendation method based on word embedding technology
CN111104437A (en) Test data unified retrieval method and system based on object model
US20120239657A1 (en) Category classification processing device and method
CN102314464B (en) Lyrics searching method and lyrics searching engine
JP4426041B2 (en) Information retrieval method by category factor
CN108345694B (en) Document retrieval method and system based on theme database
Wu et al. Searching online book documents and analyzing book citations
Zhang et al. Semantic image retrieval using region based inverted file
CN115687579B (en) Document tag generation and matching method, device and computer equipment
Aquino et al. Analysis on the use of Latent Semantic Indexing (LSI) for document classification and retrieval system of PNP files
CN114706938A (en) Document tag determination method and device, electronic equipment and storage medium
CN111949679A (en) Document retrieval system and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20201117