CN116186133A

CN116186133A - Electronic document management method integrating forward index and backward index

Info

Publication number: CN116186133A
Application number: CN202211729747.1A
Authority: CN
Inventors: 任岩; 顾爽; 潘月浩; 张露; 徐夏; 陶昊然; 金晨; 蒙森荣
Original assignee: Suzhou Aerospace Information Research Institute
Current assignee: Suzhou Aerospace Information Research Institute
Priority date: 2022-08-29
Filing date: 2022-12-30
Publication date: 2023-05-30
Also published as: CN116127190B; CN116542326A; CN115858168B; CN116127190A; CN115858168A; CN115442242A

Abstract

The invention provides an electronic document management method that fuses a forward index and an inverted index. A database whose search engine uses a forward index and a database whose search engine uses an inverted index are selected, and a unified database API is encapsulated to fuse and connect the two databases. When an electronic document is stored, its structured data is stored in the forward index database, its text data is stored in the inverted index database, and the data in the two databases is associated through the ID of the electronic document. When a document is searched for, depending on the requirement, either the forward index database is searched by the structural information of the document, or efficient full-text retrieval is performed in the inverted index database by keyword. The invention thus provides the structured management and storage functions of electronic document management while also enabling efficient retrieval over massive text content.

Description

Electronic document management method integrating forward index and backward index

Technical Field

The invention relates to the field of computer software, and in particular to an electronic document management method integrating forward and inverted indexes.

Background

With the development of information technology, more and more enterprises are adopting electronic document management systems as their primary means of managing documents. However, current electronic document management systems focus mainly on management functions and pay little attention to efficient retrieval of massive text content. Although many of these systems offer retrieval functions, they generally rely on relational structured databases (whose search engines use forward indexes), which makes efficient retrieval from massive text difficult. Conversely, a simple management system built on an inverted index database can retrieve massive text efficiently, but has difficulty managing documents in an effective, structured way.

Disclosure of Invention

The invention aims to provide an electronic document management method integrating forward and inverted indexes.

The technical solution for realizing the purpose of the invention is as follows: an electronic document management method integrating forward and inverted indexes includes the following steps:

step 1, selecting a database whose search engine uses a forward index and a database whose search engine uses an inverted index, and designing and coding a unified access interface that supports uniform access operations on both databases, thereby fusing and connecting the two databases;

step 2, when an electronic document is stored, storing the structured data of the electronic document in the forward index database, storing the text data of the electronic document in the inverted index database, and associating the data in the forward index database with the data in the inverted index database through the ID of the electronic document;

step 3, when a document is searched for, depending on the requirement, either searching the forward index database by the structural information of the document, or performing efficient full-text retrieval of the document in the inverted index database by keyword.
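The three steps can be pictured with a minimal in-memory sketch. It is only an illustration under stated assumptions: the class names `RelationalBackend`, `InvertedBackend` and `DocumentStore`, their methods, and the sample IDs are hypothetical stand-ins, not the interface defined by the invention, and a real system would wrap an actual relational database and an actual inverted index database.

```python
import re
from collections import defaultdict


class RelationalBackend:
    """Hypothetical stand-in for the forward index (relational) database."""

    def __init__(self):
        self.rows = {}  # doc_id -> (title, parent_directory_id)

    def insert(self, doc_id, title, parent_id):
        self.rows[doc_id] = (title, parent_id)

    def find(self, parent_id, title):
        return [d for d, row in self.rows.items() if row == (title, parent_id)]


class InvertedBackend:
    """Hypothetical stand-in for the inverted index database."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> IDs of documents containing it
        self.texts = {}                   # doc_id -> full text

    def insert(self, doc_id, title, text):
        self.texts[doc_id] = text
        for term in re.findall(r"\w+", f"{title} {text}".lower()):
            self.postings[term].add(doc_id)

    def search(self, keyword):
        return sorted(self.postings.get(keyword.lower(), set()))


class DocumentStore:
    """Unified access interface: one call site, two databases linked by doc_id."""

    def __init__(self):
        self.relational = RelationalBackend()
        self.inverted = InvertedBackend()

    def save(self, doc_id, title, parent_id, text):
        self.relational.insert(doc_id, title, parent_id)  # structured data
        self.inverted.insert(doc_id, title, text)         # text data

    def find_by_structure(self, parent_id, title):
        return self.relational.find(parent_id, title)

    def find_by_keyword(self, keyword):
        return self.inverted.search(keyword)

    def full_text(self, doc_id):
        return self.inverted.texts[doc_id]


store = DocumentStore()
store.save(1001, "Java technical specification", 20,
           "No magic value may appear directly in the code.")
```

Because both backends are keyed by the same `doc_id`, a hit from either lookup can be followed into the other store.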

Further, in step 2, when the electronic document is stored, the structured data of the electronic document is stored in the forward index database, the text data of the electronic document is stored in the inverted index database, and the data in the forward index database is associated with the data in the inverted index database through the ID of the electronic document; the specific method is as follows:

(1) Before data is entered, initializing the table structure of the database whose search engine uses a forward index, the table structure comprising a directory table and an electronic document table, wherein the directory table is a self-association table whose parent directory attribute references the primary key of the same table, and the parent directory attribute of the electronic document table is a foreign key referencing the primary key of the directory table;

(2) Determining the category of the document to be stored, including its primary directory, secondary directory and own name; uploading and parsing the document to obtain its title and full-text content; and generating a global ID for the document;

(3) Querying the directory table for the ID of the document's direct parent directory and, if it does not exist, creating the relevant directory data in the directory table; entering the ID, title and parent directory ID of the document into the forward index database; and entering the ID, title and full-text content of the document, after word segmentation, into the inverted index database, so that the data in the two databases is associated through the ID of the electronic document.
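As a rough sketch of the table structure and the parent-directory lookup described above, the following uses Python's built-in `sqlite3` module as a stand-in relational database. The table and column names, the helper functions `get_or_create_directory` and `store_document`, and the sample ID 1001 are assumptions made for illustration; the text does not prescribe them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE IF NOT EXISTS directory (
    id        INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    parent_id INTEGER REFERENCES directory(id)  -- self-association; NULL for a root
);
CREATE TABLE IF NOT EXISTS document (
    id        INTEGER PRIMARY KEY,              -- global document ID
    title     TEXT NOT NULL,
    parent_id INTEGER NOT NULL REFERENCES directory(id)
);
""")


def get_or_create_directory(name, parent_id=None):
    """Return the ID of the named directory under parent_id, creating it if absent."""
    row = conn.execute(
        "SELECT id FROM directory WHERE name = ? AND parent_id IS ?",
        (name, parent_id),
    ).fetchone()
    if row:
        return row[0]
    return conn.execute(
        "INSERT INTO directory (name, parent_id) VALUES (?, ?)", (name, parent_id)
    ).lastrowid


def store_document(doc_id, title, primary, secondary):
    """Insert the structured record; the same doc_id would key the inverted index entry."""
    parent = get_or_create_directory(secondary, get_or_create_directory(primary))
    conn.execute(
        "INSERT INTO document (id, title, parent_id) VALUES (?, ?, ?)",
        (doc_id, title, parent),
    )
    return parent


parent_id = store_document(1001, "Java technical specification",
                           "Industrial technology", "Computer technology")
```

The `IS` comparison is used instead of `=` so that a root directory, whose `parent_id` is NULL, can be matched by the same query.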

Further, in step 3, when a document is searched for, depending on the requirement, either the forward index database is searched by the structural information of the document, or efficient full-text retrieval of the document is performed in the inverted index database by keyword; the specific method is as follows:

(1) If the specific name and category information of the document are known, finding the document level by level according to its category, that is, searching in the forward index database;

(2) If the specific name and category information of the document are not known, retrieving the document through the inverted index database according to a keyword contained in the document.

An electronic document management system integrating forward and inverted indexes, which realizes electronic document management integrating forward and inverted indexes based on the above electronic document management method.

A computer device comprises a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, realizes electronic document management integrating forward and inverted indexes based on the above electronic document management method.

A computer-readable storage medium has stored thereon a computer program which, when executed by a processor, realizes electronic document management integrating forward and inverted indexes based on the above electronic document management method.

Compared with the prior art, the invention has the notable advantage that it provides the structured management and storage functions of electronic document management while also enabling efficient retrieval over massive text content.

Drawings

FIG. 1 is a schematic diagram of a forward index.

FIG. 2 is a schematic diagram of an inverted index.

FIG. 3 is a schematic diagram of an electronic document management method integrating forward and inverted indexes.

FIG. 4 is a flow chart of an electronic document management method integrating forward and inverted indexes.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.

A forward index finds a value by its key, as shown in FIG. 1. When a new document is added, a new block of space is allocated for it and appended to the end of the existing index file; when a document is deleted, the index information corresponding to the document is located directly and removed. Search engines of conventional relational databases typically use forward indexes. An inverted index, by contrast, finds the key by a value, that is, it finds documents by an attribute value, as shown in FIG. 2. Given a keyword, the documents containing that word can be obtained quickly through the inverted index.
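The difference between the two lookup directions can be illustrated with two plain Python dictionaries. This is only a toy sketch: the function names and sample documents are invented for illustration, and whitespace splitting stands in for real word segmentation.

```python
from collections import defaultdict

# Forward index: document ID (key) -> document content (value).
forward_index = {}
# Inverted index: word found in documents (value) -> IDs of documents containing it.
inverted_index = defaultdict(set)


def add_document(doc_id, text):
    """Register a document in both indexes."""
    forward_index[doc_id] = text
    for word in text.lower().split():
        inverted_index[word].add(doc_id)


def delete_document(doc_id):
    """Remove a document and its postings from both indexes."""
    text = forward_index.pop(doc_id)
    for word in set(text.lower().split()):
        inverted_index[word].discard(doc_id)
        if not inverted_index[word]:
            del inverted_index[word]  # drop words no document contains any more


add_document(1, "magic values are not allowed in code")
add_document(2, "code review checklist")

by_key = forward_index[1]                  # forward lookup: key -> value
by_word = sorted(inverted_index["code"])   # inverted lookup: value -> keys
```

Finding every document that contains "code" through the forward index alone would require scanning all stored texts, whereas the inverted index answers it with a single dictionary lookup.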

Accordingly, the present invention proposes an electronic document management method that fuses a forward index and an inverted index, as shown in FIG. 3, comprising the following steps:

Step 1, a database using a forward index search engine (which can be understood as a conventional relational database) and a database using an inverted index search engine are selected. The calls to both databases are encapsulated so that a single application system connects to both types of database.

Step 2, when a document is stored, the data to be entered is divided, and each type of data is entered into the corresponding database as required. Specifically:

(1) Before data is entered, it is first confirmed whether the table structure has been initialized in the database whose search engine uses a forward index, i.e., the relational database. The table structure mainly comprises a directory table and an electronic document table; if they do not exist, they must be created first. The directory table is a self-association table whose parent directory attribute references the primary key of the same table; the parent directory attribute of the electronic document table is a foreign key that references the primary key of the directory table.

(2) The category of the document to be stored is determined, including its primary directory, secondary directory, own name, and the like. The file is uploaded and parsed to obtain its title and full-text content, and a global ID is generated for the document.

(3) The directory table is queried for the ID of the document's direct parent directory; if it does not exist, the relevant directory data is created in the directory table. The ID, title and parent directory ID of the document are entered into the forward index database; the ID, title and full-text content of the document are entered, after word segmentation, into the inverted index database. In this way, the data in the two types of database is associated through the ID of the electronic document.

Step 3, when a document is searched for, depending on the requirement, the forward index database can be searched by the structural information of the document, or efficient full-text retrieval can be performed in the inverted index database by keywords in the text. This includes the following cases:

(1) If the summary information of the document is known, the document can be conveniently found level by level through its category; this search is performed in the forward index database.

(2) If the specific name and category information of the document are not known, the document can be quickly retrieved from the inverted index database according to a keyword contained in the document.

(3) A more typical application of inverted index retrieval is finding all documents in the system that contain a certain keyword: entering the keyword quickly retrieves all relevant documents from a massive body of text.
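The level-by-level search of case (1) might look like the following sketch, again using `sqlite3` as a stand-in relational database. The schema, the function `find_document`, and the sample rows (borrowed from the embodiment's category names, with invented IDs) are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE directory (id INTEGER PRIMARY KEY, name TEXT,
                        parent_id INTEGER REFERENCES directory(id));
CREATE TABLE document  (id INTEGER PRIMARY KEY, title TEXT,
                        parent_id INTEGER REFERENCES directory(id));
INSERT INTO directory VALUES (1, 'Industrial technology', NULL),
                             (2, 'Computer technology', 1);
INSERT INTO document  VALUES (1001, 'Java technical specification', 2);
""")


def find_document(categories, title):
    """Walk the directory tree one level at a time, then look up the title."""
    parent_id = None
    for name in categories:
        row = conn.execute(
            "SELECT id FROM directory WHERE name = ? AND parent_id IS ?",
            (name, parent_id),
        ).fetchone()
        if row is None:
            return None  # no such category at this level
        parent_id = row[0]
    row = conn.execute(
        "SELECT id FROM document WHERE title = ? AND parent_id IS ?",
        (title, parent_id),
    ).fetchone()
    return row[0] if row else None


doc_id = find_document(["Industrial technology", "Computer technology"],
                       "Java technical specification")
```

The returned ID is the link to the inverted index database, where the full text of the same document is stored under that ID.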

Examples

To verify the effectiveness of the inventive scheme, the following experiment was performed.

Step 1, a relational database whose search engine uses a forward index is selected, and then a database whose underlying layer is an inverted index is selected. A program is written to access both databases and to implement the corresponding storage and retrieval methods.

Step 2, a technical document, "Java technical specification", is to be entered into the electronic document management system. Its content comprises 100,000 words and includes the following sentence: "No magic value may appear directly in the code." The procedure is as follows:

(1) First, it is confirmed whether the tables have been created in the relational database. If not, the directory table is created first; it is a self-association table whose parent directory attribute references the primary key of the same table. The electronic document table is then created; its parent directory attribute is a foreign key that references the primary key of the directory table.

(2) The classification of the document is determined: the primary directory is "industrial technology class" and the secondary directory is "computer technology class". Its name is determined to be "Java technical specification". The file is uploaded and parsed to obtain the 100,000 words of full-text content. A globally unique ID is generated for the document using the snowflake algorithm.

(3) If no ID exists for the computer technology class in the directory table, a new entry is created in the directory table with the industrial technology class as its parent directory; if the industrial technology class directory does not exist either, it must be created first. After the ID of the computer technology class directory is obtained, the ID and title of "Java technical specification" and the ID of its parent directory (the computer technology class directory) are stored in the forward index database; the ID, title and full-text content of "Java technical specification" are stored in the inverted index database. Thus, once the ID of "Java technical specification" has been obtained from either database, the same document can be queried in the other.
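The embodiment names the snowflake algorithm without giving its parameters. The sketch below assumes the commonly used 64-bit layout (41-bit millisecond timestamp, 10-bit worker ID, 12-bit sequence); the class name, the custom epoch and the worker ID are illustrative assumptions rather than details from the text.

```python
import threading
import time


class SnowflakeGenerator:
    """64-bit IDs: 41-bit millisecond timestamp | 10-bit worker ID | 12-bit sequence."""

    EPOCH_MS = 1_640_995_200_000  # 2022-01-01T00:00:00Z, an arbitrary custom epoch

    def __init__(self, worker_id):
        if not 0 <= worker_id < 1024:
            raise ValueError("worker_id must fit in 10 bits")
        self.worker_id = worker_id
        self.last_ms = -1
        self.sequence = 0
        self.lock = threading.Lock()

    @staticmethod
    def _now_ms():
        return time.time_ns() // 1_000_000

    def next_id(self):
        with self.lock:
            now = self._now_ms()
            if now <= self.last_ms:
                # Same millisecond (or the clock stepped back): bump the sequence.
                now = self.last_ms
                self.sequence = (self.sequence + 1) & 0xFFF
                if self.sequence == 0:
                    # Sequence exhausted: wait until the clock passes last_ms.
                    while now <= self.last_ms:
                        now = self._now_ms()
            else:
                self.sequence = 0
            self.last_ms = now
            return ((now - self.EPOCH_MS) << 22) | (self.worker_id << 12) | self.sequence


generator = SnowflakeGenerator(worker_id=1)
doc_id = generator.next_id()  # would serve as the shared key in both databases
```

IDs produced this way are unique per worker and increase over time, which suits their use as the primary key that associates the records in the two databases.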

Step 3, when the electronic document "Java technical specification" needs to be found, different retrieval methods can be used for different scenarios, including the following:

(1) If its specific category is known, the document summary information can be found directly via "industrial technology class - computer technology class - Java technical specification"; this retrieval is performed by the relational database, whose underlying search engine uses a forward index. Once the summary information is found, the document ID is obtained, and the full-text content of the document is then fetched from the inverted index database by that ID.

(2) If the specific category of the electronic document is not known, related documents can be retrieved by searching for "Java", "technical specification" or "magic value", and the document can then be located among them. This retrieval is performed by the database whose underlying layer is an inverted index, because searching titles and full text with a relational database is very time-consuming when the volume of text is massive.

(3) If the goal is not to find a specific document but to find all documents related to a certain field, full-text retrieval can be performed directly. For example, entering "code" finds documents related to programming: because "Java technical specification" contains the word "code", it is retrieved, along with other documents containing "code" in their title or content. This retrieval is performed in the inverted index database, which is very efficient for large amounts of text.
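Cases (2) and (3) can be sketched with a small inverted index that matches any of the query terms. Everything here is illustrative: the sample documents other than "Java technical specification" are invented, the regular-expression tokenizer stands in for real word segmentation, and ranking by number of matched terms is an added assumption, not something the embodiment specifies.

```python
import re
from collections import Counter, defaultdict

documents = {
    1001: ("Java technical specification",
           "No magic value may appear directly in the code."),
    1002: ("Code review checklist", "Check naming and test coverage in the code."),
    1003: ("Travel expense policy", "Submit receipts within thirty days."),
}


def tokenize(text):
    """Lowercase word segmentation; a real deployment would use a proper analyzer."""
    return re.findall(r"\w+", text.lower())


# Build the inverted index over title plus full text: term -> set of document IDs.
index = defaultdict(set)
for doc_id, (title, body) in documents.items():
    for term in tokenize(f"{title} {body}"):
        index[term].add(doc_id)


def search_any(query):
    """Return IDs of documents matching any query term, most matched terms first."""
    hits = Counter()
    for term in set(tokenize(query)):
        for doc_id in index.get(term, ()):
            hits[doc_id] += 1
    return [d for d, _ in sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))]


related = search_any("code")         # every document mentioning "code"
ranked = search_any("code review")   # the checklist matches both terms
```

Each returned ID can then be looked up in the relational database to recover the document's title and directory.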

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to be within the scope of this description.

The above examples represent only a few embodiments of the present application; they are described in specific detail but are not to be construed as limiting the scope of the present application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the spirit of the present application, all of which fall within the scope of protection of the present application. Accordingly, the scope of protection of the present application shall be subject to the appended claims.

Claims

1. An electronic document management method integrating forward and inverted indexes, characterized by comprising the following steps:

step 1, selecting a database whose search engine uses a forward index and a database whose search engine uses an inverted index, and encapsulating a unified database API to fuse and connect the two databases;

2. The electronic document management method integrating forward and inverted indexes as claimed in claim 1, wherein in step 2, when the electronic document is stored, the structured data of the electronic document is stored in the forward index database, the text data of the electronic document is stored in the inverted index database, and the data in the forward index database is associated with the data in the inverted index database through the ID of the electronic document, specifically comprising:

(2) Determining the category of the document to be stored, including its primary directory, secondary directory and own name; uploading and parsing the document to obtain its title and full-text content; and generating a global ID for the document;

(3) Querying the directory table for the ID of the document's direct parent directory and, if it does not exist, creating the relevant directory data in the directory table; entering the ID, title and parent directory ID of the document into the forward index database; and entering the ID, title and full-text content of the document, after word segmentation, into the inverted index database, so that the data in the two databases is associated through the ID of the electronic document.

3. The electronic document management method integrating forward and inverted indexes as claimed in claim 1, wherein in step 3, when a document is searched for, depending on the requirement, either the forward index database is searched by the structural information of the document, or efficient full-text retrieval of the document is performed in the inverted index database by keyword, the specific method comprising:

(2) If the specific name and category information of the document are not known, retrieving the document through the inverted index database according to a keyword contained in the document.

4. An electronic document management system integrating forward and inverted indexes, wherein electronic document management integrating forward and inverted indexes is realized based on the electronic document management method integrating forward and inverted indexes as claimed in any one of claims 1 to 3.

5. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, realizes electronic document management integrating forward and inverted indexes based on the electronic document management method integrating forward and inverted indexes as claimed in any one of claims 1 to 3.

6. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, realizes electronic document management integrating forward and inverted indexes based on the electronic document management method integrating forward and inverted indexes as claimed in any one of claims 1 to 3.