CN101853288A - Configurable full-text retrieval service system based on document real-time monitoring - Google Patents

Configurable full-text retrieval service system based on document real-time monitoring

Info

Publication number
CN101853288A
Authority
CN
China
Prior art keywords
index
module
document
retrieval
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 201010181321
Other languages
Chinese (zh)
Inventor
马晓普
张振莲
梁晶晶
李争艳
刘妍
汤澹
董勐
Original Assignee
马晓普
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 马晓普 filed Critical 马晓普
Priority to CN 201010181321 priority Critical patent/CN101853288A/en
Publication of CN101853288A publication Critical patent/CN101853288A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the field of computer retrieval technology and provides a configurable full-text retrieval service system based on document real-time monitoring. The system comprises a document real-time monitoring module, an index building module, an index optimization module, an index change module, a query retrieval module, a log recording module, a data backup module and a data recovery module. Because it adopts a framework structure, the system supports rapid application development, is easy to deploy and maintain, offers a high recall ratio and is convenient to use.

Description

Configurable full-text retrieval service system based on document real-time monitoring
Technical Field
The invention belongs to the technical field of computer retrieval and relates to a configurable full-text retrieval service system based on document real-time monitoring. The system is designed and implemented mainly from a framework perspective and provides real-time monitoring and full-text retrieval services for documents.
Background
With the rapid development of network technology and the growing maturity of information technology, the convenience and speed of the network have become increasingly prominent. More and more enterprises, public institutions and individuals publish and acquire information through the network, and people find information in the vast sea of network knowledge mainly through search engines. Existing search engines mainly include Google, Baidu, Yahoo, Zhongcao and Sogou. As the volume of network information grows exponentially, websites and Web services of all kinds need to add a search function to meet user requirements.
The indexing service provided by the full-text retrieval system is based on a powerful indexing mechanism, and the creation and management of the index are crucial to a search engine. The development process of the indexing process control part in the traditional full-text retrieval function comprises the following steps:
(1) Defining and describing the attributes of the document data and the related parameters of the indexing process, and selecting a suitable word segmenter.
(2) Adding data information to the index: parsing the file, extracting index items, building the index table, optimizing and merging the index, and finally flushing the index file from memory to disk.
(3) Providing an index update mechanism for CRUD operations on document information, and also considering index backup.
(4) Providing strategy code for index construction and management of a full-text retrieval system in a distributed environment, and for synchronization, mutual exclusion and similar problems of index access and update in a multi-threaded environment.
This part of the development is relatively independent of the business flow of the whole system, yet existing Web service systems with a full-text retrieval function and document information management software develop it again and again, because the control logic of the indexing process has not been summarized and abstracted from a framework perspective. In addition, the large number of parameter settings and the differences between application environments cause developers to spend much time debugging, maintaining and upgrading the system. This approach has many shortcomings compared with the RAD (Rapid Application Development) idea promoted at present.
Disclosure of Invention
The invention aims to provide a configurable full-text retrieval service system based on document real-time monitoring, which can dynamically record the change of documents and automatically perform the index management of the documents.
The technical scheme adopted by the invention is a configurable full-text retrieval service system based on document real-time monitoring, comprising: a file real-time monitoring module, an index building module, an index optimization module, an index change module, a query retrieval module, a log recording module, a data backup module and a data recovery module. Wherein,
the file real-time monitoring module is used for monitoring the change of the file information of the server side in real time and starting the log recording module and the index changing module;
the index construction module is used for analyzing and processing unstructured documents under the specified directory, generating intermediate format files, extracting index items from the files through a word segmentation device, representing the documents, generating an index table of a document set, starting the index change module, and storing the generated indexes into an index library;
the index optimization module is used for improving the efficiency of the system during full-text indexing by combining and optimizing indexes, and acts on the indexes generated by the index construction module;
the index changing module is used for receiving changed document information, including addition, deletion and modification of documents, so as to dynamically update the index, and the index changing module acts on the index building module;
the query retrieval module is used for providing a search interface for full-text retrieval for a user, receiving query keywords from the user, segmenting the query keywords, submitting the query keywords to an index library for retrieval, and returning a result set;
the log recording module is used for recording CRUD operation of a manager on document data in a server directory and is started by the document real-time monitoring module;
the data backup module is used for backing up data information and generating backup data so as to prevent the data from being damaged or lost due to the fault of a system or malicious attack, ensure that the data information is recovered to a certain known correct state from an error state, and is started after the operation of the log recording module is finished;
and the data recovery module is used for recovering the data to a consistent state before the fault by using the data backup and the log file when the system fails during operation.
The file real-time monitoring module comprises:
the initialization system engine module is used for recursively traversing the file directory during system initialization and waking up the data backup module to copy data, after which the index change module triggers the initial construction and optimization of the index;
the captured document adding module is used for monitoring the operation of adding the document by an administrator in real time;
the captured document deleting module is used for monitoring the operation of deleting the expired document by the administrator in real time;
and the captured document updating module is used for monitoring the operation of updating the document by the administrator in real time.
The index building module comprises:
a text parser for analyzing the contents of the web page and the document, unifying into a plain document or an intermediate document,
and the indexer is used for reprocessing the results of processing and analyzing the unstructured documents by the text parser, sequentially reading and analyzing the index items, establishing a linked list arranged according to the index items by using a preset index item dictionary, dynamically changing the index dictionary, and finally completing the index list, the index dictionary and the document index organized according to the index items.
The index optimization module comprises:
the offline index optimization module is used for acting on the index construction module offline when a document is updated, calling the corresponding optimization functions to adjust and optimize the offline index;
the offline index merging module is used for merging a series of small index files into one index file so as to improve retrieval efficiency; together with the offline index optimization module it rebuilds and optimizes the offline index to generate a final optimized index;
and the index service switching module is used for switching the offline index and the current service index, so that the offline index can be optimized normally without affecting the retrieval efficiency of the service index.
The index change module includes:
the document adding processing module is used for receiving the added document information so as to dynamically update the index;
the document deleting processing module is used for receiving the deleted document information so as to dynamically update the index;
and the document modification processing module is used for receiving the modified document information so as to dynamically update the index.
The query retrieval module comprises:
the keyword submitting module is used for providing a full-text search keyword submitting page for a user, providing an interface for a full-text search system to acquire a query request submitted by the user, and performing word segmentation processing on the received keywords according to a certain word segmentation strategy;
the background retrieval module is used for submitting the received keywords and the user group information to an index database for retrieval, and then acquiring and sequencing retrieval results from the index database;
and the snapshot generating module is used for generating a result snapshot according to the returned result set and is responsible for generating a result page and displaying the result to the user.
Because it provides the various parameters, service forms and corresponding deployment schemes required for the indexing process and for data backup, the configurable full-text retrieval service system based on document real-time monitoring basically covers the development requirements of common full-text retrieval systems. It gives developers an efficient and complete full-text retrieval service framework that integrates real-time monitoring and data recovery, and it supports rapid application development, is easy to deploy and maintain, offers a high recall ratio and is convenient to use.
Drawings
FIG. 1 is a block diagram showing the system architecture of the present invention;
FIG. 2 is a block diagram of a real-time document monitoring module according to the present invention;
FIG. 3 is a block diagram illustrating the structure of an index building block according to the present invention;
FIG. 4 is a block diagram illustrating the structure of an index optimization module according to the present invention;
FIG. 5 is a block diagram illustrating the structure of an index change module according to the present invention;
FIG. 6 is a block diagram illustrating the structure of a query retrieval module according to the present invention;
FIG. 7 is a block diagram illustrating an indexer context environment of the present invention;
FIG. 8 is a block diagram illustrating an index hierarchy in the system of the present invention;
FIG. 9 is a basic flow diagram illustrating index initialization construction in the system of the present invention;
FIG. 10 is a basic flow diagram of index management in the system of the present invention;
FIG. 11 is a block diagram showing the basic structure of a multi-searcher cross-index search in the system of the present invention;
FIG. 12 is a basic flow diagram of a search query in the system of the present invention;
FIG. 13 is a schematic diagram illustrating the principle of data information backup in the system of the present invention.
Detailed Description
The present invention will be described in detail with reference to the accompanying drawings and examples.
Referring to FIG. 1, which is a block diagram of the configurable full-text retrieval service system based on document real-time monitoring according to the present invention, the system includes a file real-time monitoring module 100, an index construction module 200, an index optimization module 300, an index change module 400, a query retrieval module 500, a log recording module 600, a data backup module 700, and a data recovery module 800. When the system is initialized, an administrator configures the system attributes and the parameters required by the actual application, for example the index configuration file and the data backup configuration file. After configuration is finished, the original document data of the system is analyzed, an index database is constructed, and the indexes are optimized and merged into a total index.
In use, when the file real-time monitoring module 100 captures a change of document information (a CRUD operation by the administrator on the document information), it starts the log recording module 600 to record the operation and then hands over to the data backup module 700, which copies the changed document information to update the backup data. After the backup is completed, the index change module 400 is woken up and executes the strategy flow corresponding to the operation type; the index construction module 200 then updates and rebuilds the index, while the index optimization module 300 optimizes the new index by means of optimization functions, index merging and the like, and the result is finally imported into the index database. A user enters keywords through the foreground query interface; the keywords are submitted to the query retrieval module 500, a result set is obtained from the index database, and a Web page is rendered and returned to the user.
When the system fails or is maliciously attacked so that the original data is damaged or lost, the administrator may start the data recovery module 800 to load the log records and backup data and recover the original data.
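As a minimal sketch of the runtime flow described above (log first, then back up, then change the index), and not as the implementation of the invention, the ordering can be illustrated in Python. The names `ChangeEvent`, `Pipeline` and `handle_change` are hypothetical, the in-memory containers merely stand in for modules 600, 700 and 200/400, and whitespace splitting stands in for a real word segmenter.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeEvent:
    op: str            # "add", "delete" or "update"
    path: str
    content: str = ""


@dataclass
class Pipeline:
    log: list = field(default_factory=list)      # stands in for log recording module 600
    backup: dict = field(default_factory=dict)   # stands in for data backup module 700
    index: dict = field(default_factory=dict)    # path -> set of terms (modules 200/400)

    def handle_change(self, event):
        if event.op not in ("add", "delete", "update"):
            raise ValueError(f"unknown operation: {event.op}")
        # Step 1: record the administrator's operation in the log.
        self.log.append((event.op, event.path))
        # Step 2: bring the backup copy up to date (assumed here to mirror the documents).
        if event.op == "delete":
            self.backup.pop(event.path, None)
        else:
            self.backup[event.path] = event.content
        # Step 3: change the index according to the operation type.
        if event.op in ("delete", "update"):
            self.index.pop(event.path, None)
        if event.op in ("add", "update"):
            self.index[event.path] = set(event.content.lower().split())
```

In this sketch an update is handled as a removal followed by an addition, which matches the delete-then-rebuild strategy the description uses for index updates.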
The file real-time monitoring module 100 is a core module of the whole system. It runs as a background service that monitors changes of document information on the server side in real time, captures every document change, and starts the task module corresponding to each kind of operation. The modules scheduled by the file real-time monitoring module 100 fall into two groups by function: the indexing process and the data backup process. The indexing process comprises the index construction module 200, the index optimization module 300 and the index change module 400; the data backup process comprises the log recording module 600, the data backup module 700 and the data recovery module 800. Referring to FIG. 2, the file real-time monitoring module 100 includes an initialization system engine module 210, a captured document adding module 220, a captured document deleting module 230, and a captured document updating module 240. Together these provide a complete real-time monitoring mechanism that performs the initial construction of the index and triggers the corresponding strategy flow for each captured operation. When the system is initialized, the file real-time monitoring module 100 obtains the file directory of the document information and starts the initialization system engine module 210, which recursively traverses the directory and reads all documents in turn while waking up the data backup module 700 to copy the data; the index change module 400 then triggers the initial construction and optimization of the index. Because no log records need to be created at this stage, the initialization system engine module 210 does not interact with the log recording module 600.
During later use and maintenance, all CRUD operations of the administrator are captured by the file real-time monitoring module 100, divided into three types (uploading and adding documents, deleting expired documents, and updating documents), and handed to the captured document adding module 220, the captured document deleting module 230 and the captured document updating module 240 respectively. Before processing is executed, the details of the operation are written to the log and the changed document information is backed up, which facilitates data recovery.
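One simple way to capture the three kinds of change, offered only as an illustrative sketch, is to compare two recursive snapshots of the monitored directory. The text does not state how the monitoring module detects changes; polling with `os.walk` and the function names `snapshot` and `diff_snapshots` are assumptions made for this example.

```python
import os


def snapshot(root):
    """Recursively map each file path under root to (modification time, size)."""
    state = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # the file vanished between listing and stat
            state[path] = (st.st_mtime_ns, st.st_size)
    return state


def diff_snapshots(old, new):
    """Classify changes into added, deleted and updated paths."""
    added = sorted(p for p in new if p not in old)
    deleted = sorted(p for p in old if p not in new)
    updated = sorted(p for p in new if p in old and new[p] != old[p])
    return added, deleted, updated
```

A monitoring loop would take a snapshot periodically, diff it against the previous one, and hand the three lists to the adding, deleting and updating handlers respectively.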
The index construction module 200 is a core module of the real-time full-text retrieval service system. It analyzes and processes the unstructured documents under a specified directory, generates intermediate format files, extracts index items from them through a word segmenter to represent the documents, and generates an index table of the document set. Referring to FIG. 3, the index construction module 200 includes a text parser 310 and an indexer 320, and completes construction of the index from the document change information delivered by the index change module 400. The context of the indexer is shown in FIG. 7. The original text library 710 holds the original copies of web pages obtained from the web by a web spider together with locally stored texts in various formats; it retains the complete original format of the texts and mainly serves as the data source for the subsequent parsers. The text parser 310 analyzes the content of web pages and documents and unifies it into plain documents or intermediate documents; it is usually a series of analysis and processing programs for different document formats, and typically also performs Chinese word segmentation and spam filtering so as to provide uniform, analyzable data for the indexer. The text intermediate format library 720 is a database of plain text or unified intermediate format produced by the text parser 310; noise and other spam have been removed and word segmentation and similar processing have been completed as required, so it can provide input data for the indexer 320. The text index database 730 organizes the stored data files in index form; in a full-text retrieval system an inverted index is usually adopted. The index dictionary 740 is a database of (word, code) pairs in which each word corresponds to a unique code; index items are converted to these codes while the index is built so as to reduce disk and memory usage.
The document content used for full-text retrieval usually includes both attached information about the document and the content of the document itself. Before being indexed, documents are generally converted into an intermediate document format by a preprocessing program and a word segmentation program; the intermediate documents are then sent to the indexer 320, which generates the index dictionary 740 and the text index database 730 and writes them into the index database. The indexer 320 is a core component of the system. Its main functions are to reprocess the results produced by the text parser 310 from unstructured documents, read and analyze the index items in turn, build a linked list arranged by index item using a preset index item dictionary, dynamically update the index dictionary 740, and finally complete the index list, the index dictionary and the document index organized by index item, which represent the documents and form the index table of the document library. The index change module 400 preprocesses each operation by means of the index locks, including write.lock. The text parser 310 and the indexer 320 work together so that the creation, maintenance and management of indexes are handled uniformly, and the text index library stores the corresponding document indexes uniformly.
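The relationship between the index dictionary and the inverted index can be sketched as follows. This is a simplified illustration under stated assumptions rather than the indexer of the invention: the class name `InvertedIndex`, the use of consecutive integers as codes, and the in-memory dictionaries are all hypothetical choices.

```python
class InvertedIndex:
    def __init__(self):
        self.dictionary = {}   # index item -> integer code (the "index dictionary")
        self.postings = {}     # code -> sorted list of document ids

    def _code(self, term):
        # Extend the dictionary dynamically the first time a term is seen.
        return self.dictionary.setdefault(term, len(self.dictionary))

    def add(self, doc_id, terms):
        """Register every distinct index item of a document under its code."""
        for term in set(terms):
            plist = self.postings.setdefault(self._code(term), [])
            if doc_id not in plist:
                plist.append(doc_id)
                plist.sort()

    def lookup(self, term):
        """Return the ids of the documents that contain the term."""
        code = self.dictionary.get(term)
        return [] if code is None else list(self.postings.get(code, []))
```

Storing postings under short codes instead of the full term text is what lets the dictionary reduce disk and memory usage, as the description notes.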
FIG. 8 illustrates the index hierarchy of the present invention. The system index uses a set of very efficient data structures and is usually stored in a system directory as a single index file or a series of index files, which may reside on hard disk or in memory. The index structure is stored in file form and does not depend on a database or a specific platform. It supports block indexing: a new index is built for newly added files, which shortens the time before the index takes effect, and the overall index is then produced by index merging. The levels are described as follows:
Index: the index structure is ultimately stored as disk files in a specific format. Indexes held in memory and on disk use the same logical structure, and each index consists of one or more index segments. The disk files comprise the currently active index segments and newly built index files, and the segments can be merged into a single index segment with a maintenance tool.
Index segment: an index typically contains one or more index segments. During each creation process, documents are added to a particular segment, and the index segments are then merged according to the configured parameters. An index segment is equivalent to a sub-index; a newly created index usually appears in a new segment, and after a merge operation the index usually contains only one index segment.
Index document: the index document is the object that the indexer 320 can add directly. Each index may contain many different documents, and each document in turn manages a varying number of fields; a document here is a logical concept. Documents are ultimately added to the index and stored in the corresponding index files, ready for retrieval. Any actual file that is to be added to the index must first be turned into an index document.
Index field: the index field is the basic unit of which an index document object is composed. Each field stores the actual index text data, which internally uses the index items produced by the parser. Retrieval of the data in a field is ultimately based on index items, and data cannot be retrieved in units smaller than an index item. In general, an English index uses a word as the retrieval unit, and a Chinese index uses the result of Chinese word segmentation as the retrieval unit.
Index item: the index item is the smallest unit of index management. It is obtained by automatically splitting the value of a field in the background with the text parser 310, and each resulting independent element serves as an index item for building the index.
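The hierarchy described above (index, segment, document, field, index item) can be modelled with a few small data classes. The sketch below is illustrative only; the class names and the `merge` method are assumptions and do not reflect the on-disk format.

```python
from dataclasses import dataclass, field


@dataclass
class IndexField:
    name: str
    terms: list            # the index items obtained by splitting the field value


@dataclass
class Document:
    fields: list           # a logical document is a collection of fields


@dataclass
class Segment:
    documents: list = field(default_factory=list)


@dataclass
class Index:
    segments: list = field(default_factory=list)

    def add_documents(self, docs):
        # Newly indexed documents land in a new segment (a sub-index).
        self.segments.append(Segment(list(docs)))

    def merge(self):
        # Merging leaves a single segment that holds every document.
        merged = Segment([d for s in self.segments for d in s.documents])
        self.segments = [merged]
```

Writing new documents to a fresh segment is what lets a newly added file become searchable quickly, while the later merge restores a compact single-segment index.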
According to the form in which the index is generated, the system's indexes can be divided into a composite index format and a multi-file index format. The composite format generally suits static indexes, while the multi-file format is more convenient for dynamic indexes.
FIG. 9 shows the basic flow of initial index construction according to the present invention. The index library does not yet exist when the full-text retrieval function is used for the first time, so during initialization the system must load the existing document information and construct an initial index library.
First, a FileReader object is generated (S901) to recursively traverse the existing documents in the directory and read them in turn; the index directory is prepared (S902) and the parameters set in the configuration files are loaded; a standard text analyzer is created (S903); an empty document object is created (S904); the document content is parsed, index items are extracted from it to represent the document, and an index table of the document set and a file name field are generated (S905); and the file name field is added to the document (S906). Second, a file content field is generated (S907) and added to the document (S908); the new index document is then added (S909), and the system judges whether all index content has been added (S911). If not, the flow returns to creating an empty document object (S904) and loops until the document traversal is finished. Finally, the index optimization module 300 is invoked to complete the related index optimization, and the index is closed (S910) and flushed from memory into the index library on disk.
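The initialization loop of FIG. 9 can be approximated by the following sketch, given as an assumption-laden illustration rather than the actual flow. The function names `read_documents` and `build_initial_index` are hypothetical, a regular expression stands in for the standard text analyzer, and a list of dictionaries stands in for the index library; step numbers in the comments refer to the flow described above.

```python
import os
import re


def read_documents(root):
    """Recursively traverse root and yield (file name, text) pairs (S901)."""
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="ignore") as handle:
                yield name, handle.read()


def build_initial_index(documents):
    """Build one index document per input document."""
    index = []
    for name, text in documents:
        doc = {}                                             # empty document object (S904)
        doc["filename"] = name                               # file name field (S905, S906)
        doc["content"] = re.findall(r"\w+", text.lower())    # content field (S907, S908)
        index.append(doc)                                    # add the index document (S909)
    return index                                             # loop ends when traversal ends (S911)
```

A call such as `build_initial_index(read_documents(directory))` would perform the whole traversal; optimization and flushing to disk are left out of the sketch.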
Referring to FIG. 4, the index optimization module 300 includes an offline index optimization module 410, an offline index merging module 420, and an index service switching module 430. During index construction it is necessary to consider how to increase the indexing speed, how to reduce the resources occupied by the index, how to allocate limited memory reasonably while the index is in use, and how to increase the access speed of the resources. All of these index performance problems are handled by the index optimization module 300.
According to the form in which the index is generated, indexes can be divided into a composite index format and a multi-file index format. Index construction may process massive amounts of data, and the resulting index segments and index files can be very large, so in a concrete implementation the index form must be chosen according to the actual application requirements. The composite format generally suits static indexes, while the multi-file format is more convenient for dynamic indexes. The system supports a choice of index file formats to suit different application requirements. To ensure that index updates do not affect the response efficiency of the retrieval service, the system provides an indexing mechanism that combines an offline index with a service index: updates are processed on the offline index, and the index service is switched over directly once the update is finished, which preserves retrieval efficiency.
When a document is updated, the offline index optimization module 410 acts on the index construction module 200 offline and invokes the corresponding optimization functions to adjust and optimize the offline index, while the offline index merging module 420 merges a series of small index files into one index file to improve retrieval efficiency. Together the two modules rebuild and optimize the offline index to generate the final optimized index. Once both have finished, the index service switching module 430 is woken up to switch the offline index and the current service index, so that the offline index can be optimized normally without affecting the retrieval efficiency of the service index.
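The offline-index and service-index arrangement can be sketched as a build-then-swap pattern. The example below is a minimal illustration under assumptions: the class `IndexService`, its method names, and the use of a `threading.Lock` to guard the swap are hypothetical and are not taken from the text.

```python
import threading


class IndexService:
    def __init__(self, index):
        self._service = index            # the index currently answering queries
        self._lock = threading.Lock()

    def search(self, term):
        with self._lock:
            current = self._service      # take a reference, then search outside the lock
        return sorted(current.get(term, ()))

    def rebuild_and_switch(self, documents):
        # Build the new index offline; queries keep using the old one meanwhile.
        offline = {}
        for doc_id, text in documents.items():
            for term in text.lower().split():
                offline.setdefault(term, set()).add(doc_id)
        # Switch: replace the service index in one short critical section.
        with self._lock:
            self._service = offline
```

Because the expensive work happens before the lock is taken, a query is delayed at most by the reference swap, which is the property the description attributes to the index service switching module.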
Referring to FIG. 5, the index change module 400 includes a document addition processing module 510, a document deletion processing module 520, and a document modification processing module 530. Since all three modules act on the index construction module 200, dynamic and incremental indexing in a multi-threaded environment inevitably raises synchronization problems. The system provides a set of concurrent access control mechanisms for full-text retrieval which ensure that an index file cannot be operated on by two objects at the same time, and so guarantee the consistency and integrity of the index under synchronization and concurrency. The system provides two kinds of index locks, each implemented as a lock file; one of them is write.lock. Before performing an operation, the indexer 320 checks whether the lock file exists; if it does, the operation must wait for the previous operation to complete. write.lock prevents several threads from modifying an index document at the same time and is used when the index is built, when documents are added and when documents are deleted. The second lock is mainly used when index segments are created, merged or read, and is deleted automatically when the index or segment merge is completed.
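A lock of the kind described, where the existence of a file signals that an operation is in progress, can be sketched as follows. This is an illustration under assumptions and not the locking code of the system: the class `IndexLock`, the default file name `write.lock`, the temporary-directory default and the timeout behaviour are all choices made for the example. The sketch relies on `os.open` with `O_CREAT | O_EXCL`, which fails if the file already exists.

```python
import os
import tempfile
import time


class IndexLock:
    def __init__(self, directory=None, name="write.lock", timeout=1.0, poll=0.01):
        # The lock file would normally live in the index directory.
        self.path = os.path.join(directory or tempfile.gettempdir(), name)
        self.timeout = timeout
        self.poll = poll

    def acquire(self):
        """Create the lock file, waiting up to `timeout` seconds; return success."""
        deadline = time.monotonic() + self.timeout
        while True:
            try:
                fd = os.open(self.path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            except FileExistsError:
                if time.monotonic() >= deadline:
                    return False         # another operation still holds the lock
                time.sleep(self.poll)
            else:
                os.close(fd)
                return True

    def release(self):
        """Delete the lock file so that a waiting operation can proceed."""
        try:
            os.remove(self.path)
        except FileNotFoundError:
            pass
```

Creating the file with an exclusive flag makes the check and the creation a single step, which avoids the race that a separate "does the file exist" test followed by a write would introduce.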
FIG. 10 shows a basic flow diagram of index management according to the present invention. The index of the system adopts the document as a logic unit, the management of the index also corresponds to the management of the document, and the document management function is mainly embodied in various main operations of the index, including the addition of the document, the deletion of the document and the modification and the update of the document.
The creation, deletion, and updating of the index are all accomplished by indexer 320. When the document information is updated, the indexer 320 changes the index accordingly, and refreshes the index database, and the index updating process is realized by a mode of deleting the index firstly and then constructing the index. The specific process is as follows:
the index building module 200 learns the operation captured by the file real-time monitoring module 100, and determines which index operation to select S1000, which is specifically divided into: add, delete, and modify indexes.
(1) Adding an index: an index analyzer is first created (S1001); the system provides a variety of analyzers to choose from for different texts and application environments, commonly including SimpleAnalyzer and StandardAnalyzer. An index generator is then created (S1002), an index document is generated (S1003), index fields are generated and added (S1004), and the index document is added (S1005). Finally, the index optimization module 300 is woken up to optimize the index, and the index is closed (S1006) and written from memory into the index library on disk.
(2) Deleting an index: an index manager is first created (S1007) and an index item for deletion is created (S1008); the deleted document information is analyzed and the matching indexes are deleted (S1009); the index is then closed (S1010) and written from memory into the index library on disk.
(3) Modifying an index: an index modifier is first created (S1011); for the modified document information, the modification of the index document is implemented by first deleting and then adding the index (S1012); index optimization is then completed (S1013), and the index is closed (S1014) and written from memory into the index library on disk.
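The three index operations, and in particular modification as deletion followed by addition, can be sketched in a few lines. The class `DocumentIndex` and its in-memory structures are hypothetical and serve only to illustrate the strategy; whitespace splitting stands in for the analyzer, and optimization and flushing to disk are omitted.

```python
class DocumentIndex:
    def __init__(self):
        self.postings = {}    # term -> set of document ids
        self.doc_terms = {}   # document id -> terms, kept so deletion is cheap

    def add(self, doc_id, text):
        if doc_id in self.doc_terms:
            raise ValueError("document already indexed; use update()")
        terms = set(text.lower().split())
        self.doc_terms[doc_id] = terms
        for term in terms:
            self.postings.setdefault(term, set()).add(doc_id)

    def delete(self, doc_id):
        for term in self.doc_terms.pop(doc_id, set()):
            ids = self.postings[term]
            ids.discard(doc_id)
            if not ids:
                del self.postings[term]   # drop terms that no document uses

    def update(self, doc_id, text):
        # Modification is realised as delete-then-add.
        self.delete(doc_id)
        self.add(doc_id, text)
```

Implementing update in terms of the other two operations keeps a single code path for writing postings, at the cost of re-indexing the whole document on every change.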
FIG. 11 is a block diagram of the basic structure of multi-searcher cross-index retrieval according to the present invention. In some applications, a full-text retrieval system must search jointly across different index files. If the data volume is small, the data can be merged into a single index; if the index data is large or other retrieval requirements exist, joint retrieval over multiple indexes must be implemented for reasons of efficiency and storage. To meet this requirement, the system supports cross-index retrieval through MultiSearcher, which is enabled through the corresponding configuration and makes the system suitable for a distributed index environment. MultiSearcher retrieves the required results from different index files, sorts them according to a sorting rule, and returns them to the user as a single unified result set. MultiSearcher makes distributed storage of index files possible and avoids the storage and management difficulties caused by an excessively large single index file.
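The idea of cross-index retrieval can be sketched as follows. This is not the MultiSearcher API; it assumes each index is a plain dictionary mapping a term to `{doc_id: score}` and that scores from different indexes are directly comparable. The function names are hypothetical.

```python
import heapq


def search_one(index, term):
    """Return (score, doc_id) hits from one index, best score first."""
    hits = index.get(term, {})
    return sorted(
        ((score, doc) for doc, score in hits.items()),
        key=lambda h: (-h[0], h[1]),
    )


def multi_search(indexes, term, limit=10):
    """Query every index and merge the hit lists into one ranked result set."""
    per_index = [search_one(ix, term) for ix in indexes]
    # Each list is already sorted by the same key, so a k-way merge suffices.
    merged = heapq.merge(*per_index, key=lambda h: (-h[0], h[1]))
    return [doc for _, doc in merged][:limit]
```

Because each per-index list is sorted before merging, the unified result set is produced without re-sorting all hits together.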
Referring to FIG. 6, the query retrieval module 500 includes a keyword submission module 610, a background retrieval module 620, and a snapshot generation module 630. The keyword submission module 610 provides the user with a page for submitting full-text search keywords, provides the interface through which the full-text retrieval system obtains the user's query request, and segments the received keywords according to a word segmentation strategy. After acquiring the keywords entered by the user, it splits them into meaningful words and passes the segmented keywords, together with the user group information, to the background retrieval module 620. The background retrieval module 620 submits the received keywords and user group information to the index library for retrieval, then obtains and sorts the retrieval results. The snapshot generation module 630 generates a result snapshot from the returned result set so that the search results are easy to read, and is responsible for generating the result page and displaying the results to the user.
The data flow between the modules of the query retrieval module 500 is as follows. The keyword submission module 610 segments the search keywords received from the user using an appropriate strategy, which makes the search results more accurate, and passes the segmented search terms and the user group information to the background retrieval module 620. The background retrieval module 620 submits each segmented search term to the index library to be matched against the corresponding fields in the index, receives the hit records, sorts the returned result set, and passes the result set and the keywords to the snapshot generation module 630. Each record in the result set contains three fields: the web page URL, the title, and the content. The snapshot generation module 630 highlights the keyword-related text in the title and content fields and extracts the most relevant paragraph of the content field for display on the result page, so that the user can read the result set more intuitively. The background retrieval module 620 sorts all returned links according to the sorting strategy, placing the links most important to the user at the front of the list, and the sorted links and snapshots are returned to the user, which completes the retrieval service.
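A minimal sketch of the snapshot step is given below. It assumes that "most relevant paragraph" means the paragraph with the most keyword occurrences and that highlighting wraps matches in `<em>` tags; both choices, and all function names, are illustrative assumptions rather than the method of the snapshot generation module 630. HTML escaping is omitted for brevity.

```python
import re


def best_paragraph(content, keywords):
    """Pick the paragraph containing the most keyword occurrences."""
    paragraphs = [p.strip() for p in content.split("\n") if p.strip()]
    if not paragraphs:
        return ""

    def hits(paragraph):
        lowered = paragraph.lower()
        return sum(lowered.count(k.lower()) for k in keywords)

    return max(paragraphs, key=hits)


def highlight(text, keywords, open_tag="<em>", close_tag="</em>"):
    """Wrap each case-insensitive keyword match in highlight tags."""
    if not keywords:
        return text
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return pattern.sub(lambda m: f"{open_tag}{m.group(0)}{close_tag}", text)


def make_snapshot(record, keywords):
    """Build a result snapshot from a record with url, title, and content fields."""
    return {
        "url": record["url"],
        "title": highlight(record["title"], keywords),
        "snippet": highlight(best_paragraph(record["content"], keywords), keywords),
    }
```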
This process is the basic flow of a search query: query keyword preprocessing, text library matching, similarity and ranking calculation, document deduplication, and result page generation. The overall flow is shown in FIG. 12. The query retrieval module 500 directly uses the dictionary and the document index library produced during indexing, and all result content is provided by the index. After the user's query request and keywords are input S1201, query word preprocessing S1202 cleans and screens the keywords: stop words are generally filtered out, overlong queries are truncated, and a word segmentation program finally produces the combination of search terms. Query word formatting S1203 converts the search terms through the index dictionary into their final word index codes, which simplifies subsequent processing. Text library index matching S1204 uses the formatted query terms to obtain matching results from the inverted index of the text library. Similarity and ranking calculation S1205 determines the order of the result documents according to the calculation formula of the full-text retrieval. Result deduplication and generation S1206 determines, based on the content and the document number of each document, whether a result is duplicated, so that the same document does not appear more than once. These functional modules work together to complete retrieval and result display.
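Steps S1202 to S1206 can be sketched as a small pipeline. The sketch makes several simplifying assumptions that are not in the source: whitespace splitting stands in for the word segmentation program (and runs before stop-word filtering), the score is simply the number of matched query terms, and duplicates are detected by identical content. The stop-word list, the length limit, and all function names are hypothetical.

```python
STOP_WORDS = {"the", "a", "of", "and"}   # hypothetical stop-word list
MAX_TERMS = 8                             # hypothetical query length limit


def preprocess(query):
    """S1202: segment, drop stop words, and truncate an overlong query."""
    terms = [t for t in query.lower().split() if t not in STOP_WORDS]
    return terms[:MAX_TERMS]


def format_terms(terms, dictionary):
    """S1203: map each term to its code in the index dictionary; skip unknown terms."""
    return [dictionary[t] for t in terms if t in dictionary]


def match(term_ids, inverted_index):
    """S1204: count, per document, how many distinct query terms it contains."""
    scores = {}
    for tid in set(term_ids):
        for doc_id in inverted_index.get(tid, ()):
            scores[doc_id] = scores.get(doc_id, 0) + 1
    return scores


def rank(scores):
    """S1205: order documents by score (highest first), then by id."""
    return sorted(scores, key=lambda d: (-scores[d], d))


def deduplicate(doc_ids, contents):
    """S1206: keep only the first document for each distinct content."""
    seen, unique = set(), []
    for doc_id in doc_ids:
        body = contents[doc_id]
        if body not in seen:
            seen.add(body)
            unique.append(doc_id)
    return unique


def search(query, dictionary, inverted_index, contents):
    term_ids = format_terms(preprocess(query), dictionary)
    return deduplicate(rank(match(term_ids, inverted_index)), contents)
```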
The user's query request is usually entered through the Web, and the full-text retrieval service obtains the search terms submitted remotely by the user. In specific cases such as local search and hard disk search, the request can instead be submitted through a program interface.
The log recording module 600 records the CRUD operations performed by the administrator on document data in the server directory. Each log record mainly contains: the transaction identifier, the operation type (add, modify, delete), the operation object, the old value of the data before the update, and the new value of the data after the update. The log recording module follows two principles: first, records are registered strictly in the time order in which concurrent transactions execute; second, the log file must be written before the document information is modified.
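The log record layout and the write-the-log-first principle can be illustrated with a short sketch. The JSON-lines file format, the field names, and the helper functions are assumptions made for illustration; the source specifies only which items a record contains and the order of writing.

```python
import json
import os
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class LogRecord:
    txn_id: int
    op: str                      # "add", "modify", or "delete"
    target: str                  # operation object, here a file path
    old_value: Optional[str]     # value before the update
    new_value: Optional[str]     # value after the update


def append_log(log_path, record):
    """Append one record as a JSON line and force it to disk."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(asdict(record)) + "\n")
        log.flush()
        os.fsync(log.fileno())


def logged_write(log_path, txn_id, target, new_value):
    """Write the log record first, and only then modify the document."""
    old_value = None
    if os.path.exists(target):
        with open(target, encoding="utf-8") as f:
            old_value = f.read()
    op = "add" if old_value is None else "modify"
    append_log(log_path, LogRecord(txn_id, op, target, old_value, new_value))
    with open(target, "w", encoding="utf-8") as f:
        f.write(new_value)


def read_log(log_path):
    """Return the records in the order they were registered."""
    with open(log_path, encoding="utf-8") as log:
        return [LogRecord(**json.loads(line)) for line in log if line.strip()]
```

Appending to a single file preserves registration order, and forcing the record to disk before touching the document means that a crash between the two writes still leaves enough information to undo or redo the change.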
Referring to FIG. 13, the data backup module 700 automatically copies the entire document information to another disk according to the administrator's settings. Whenever the main data is updated, the data backup module 700 automatically copies the updated data to the backup, so the system itself keeps the backup data consistent with the main data, as shown in FIG. 13(a). If a media failure occurs, the backup disk continues to serve the data, and the system automatically uses the backup disk to recover the data without shutting down and reloading a copy, as shown in FIG. 13(b). Because backup is implemented by copying data, frequent copying reduces the operating efficiency of the system; in practice, therefore, only key data and log files may be backed up rather than the entire original document library.
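A minimal sketch of copy-based backup follows. It assumes the main data and the backup are two directory trees and that the caller supplies the list of changed relative paths; `backup_changed` and `restore` are hypothetical helpers, not part of the described module. `shutil.copytree(..., dirs_exist_ok=True)` requires Python 3.8 or later.

```python
import os
import shutil


def backup_changed(src_root, dst_root, changed_paths):
    """Copy each changed file to the same relative path under the backup root."""
    copied = []
    for rel in changed_paths:
        src = os.path.join(src_root, rel)
        dst = os.path.join(dst_root, rel)
        if os.path.exists(src):
            os.makedirs(os.path.dirname(dst) or ".", exist_ok=True)
            shutil.copy2(src, dst)      # copy content and metadata
            copied.append(rel)
        elif os.path.exists(dst):
            os.remove(dst)              # mirror a deletion in the backup
    return copied


def restore(dst_root, src_root):
    """Rebuild the main directory from the backup copy, overwriting existing files."""
    shutil.copytree(dst_root, src_root, dirs_exist_ok=True)
```

Copying only the changed paths, rather than the whole tree, is one way to limit the efficiency cost of frequent copying noted above.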
The data recovery module 800 is used, when a failure occurs while the system is running, to restore the data to a consistent state before the failure using the full data backup and the log file, which ensures that the foreground system remains in a relatively stable state. The recovery steps are as follows: 1. Scan the log file forward, find the transactions committed before the failure and record their identifiers in the redo queue, and find the transactions not yet finished at the time of the failure and record their identifiers in the undo queue. 2. Perform UNDO processing on each transaction in the undo queue: scan the log file backward and reverse each update operation of each undo transaction, that is, write the "value before update" in the log record back to the corresponding file directory of the server. 3. Perform REDO processing on each transaction in the redo queue: scan the log file forward and re-execute the logged operations of each redo transaction, that is, write the "value after update" in the log record to the corresponding file directory of the server.
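The three recovery steps can be sketched as follows. The sketch assumes an in-memory log (a list of dictionaries) and an in-memory store standing in for the server file directory, and it assumes that a committed transaction is marked by an explicit commit record; the source does not say how commitment is recorded, so that marker and all field names are hypothetical.

```python
def recover(log, store):
    """Bring `store` (path -> content) to a consistent state using `log`.

    `log` is a list of dicts in write order. Update records carry the keys
    txn, target, old, new; a record {"txn": id, "commit": True} marks a commit.
    """
    # Step 1: forward scan to build the redo and undo queues.
    committed = {r["txn"] for r in log if r.get("commit")}
    seen = []
    for r in log:
        if r["txn"] not in seen:
            seen.append(r["txn"])
    redo = [t for t in seen if t in committed]
    undo = [t for t in seen if t not in committed]

    # Step 2: backward scan, writing back the value before update.
    for r in reversed(log):
        if r["txn"] in undo:
            if r["old"] is None:
                store.pop(r["target"], None)   # the document did not exist before
            else:
                store[r["target"]] = r["old"]

    # Step 3: forward scan, writing the value after update.
    for r in log:
        if r["txn"] in redo and not r.get("commit"):
            if r["new"] is None:
                store.pop(r["target"], None)   # the update was a deletion
            else:
                store[r["target"]] = r["new"]
    return redo, undo
```

Undo runs backward so that the oldest "value before update" is the one that finally remains, and redo runs forward so that the newest "value after update" wins.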
The working process of the configurable full-text retrieval service system based on document real-time monitoring is described as follows:
When the system is put into use, the administrator configures the system attributes and the parameters required by the actual application. The configuration files provide the parameters and service forms involved in common retrieval services, including: 1) configuring the index configuration document to customize the indexing service; and 2) configuring the data backup configuration document to customize the data backup service. The full-text retrieval service relies on a complete indexing mechanism, so after system configuration is completed the original document data of the system must be analyzed to build the service index database. In daily use, after the file real-time monitoring module 100 captures a change in document information, the log recording module 600 is started to record the administrator's operation, and control then passes to the data backup module 700, which copies the changed document information to update the backup data. After the data backup is completed, the index change module 400 is woken up and executes the processing policy corresponding to the operation type, then hands over to the index building module 200 to update and rebuild the index. Meanwhile, the index optimization module 300 optimizes the new index by means of optimization functions, index merging, and the like, and finally imports the new index into the index library. The user enters keywords through the foreground query interface; the keywords are submitted to the query retrieval module 500, which obtains a result set from the index library, renders it as a Web page, and returns it to the user.
When the system fails or a malicious attack damages or destroys the original data, the administrator can start the data recovery module 800, which loads the log records and the backup data to recover the original data. The system of the invention allows a full-text retrieval function to be added to an existing system without changing that system's code or document information.
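The order of operations in daily use (log, then back up, then change and rebuild the index) can be sketched with in-memory stand-ins for each module. The event format, the dictionaries used for the backup and the index, and the function name `handle_change` are all hypothetical; index optimization is omitted.

```python
def handle_change(event, log, backup, index):
    """Process one captured change in the described order: log, back up, index.

    `event` is a dict with op ("add", "modify", "delete"), path, and,
    for add and modify, the new text.
    """
    # 1. Log recording: register the operation before anything else.
    log.append({"op": event["op"], "path": event["path"]})

    # 2. Data backup: bring the backup copy in line with the change.
    if event["op"] == "delete":
        backup.pop(event["path"], None)
    else:
        backup[event["path"]] = event["text"]

    # 3. Index change: pick the policy for the operation type.
    if event["op"] in ("modify", "delete"):
        index.pop(event["path"], None)          # remove the old entry
    if event["op"] in ("add", "modify"):
        # 4. Index building: (re)build the entry for the document.
        index[event["path"]] = sorted(set(event["text"].lower().split()))
    return index
```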

Claims (6)

1. A configurable full-text retrieval service system based on document real-time monitoring, the system comprising: a file real-time monitoring module, an index construction module, an index optimization module, an index change module, a query retrieval module, a log recording module, a data backup module and a data recovery module, wherein,
the file real-time monitoring module is used for monitoring the change of the file information of the server side in real time and starting the log recording module and the index changing module;
the index construction module is used for analyzing and processing unstructured documents under the specified directory, generating intermediate format files, extracting index items from the files through a word segmentation device, representing the documents, generating an index table of a document set, starting the index change module, and storing the generated indexes into an index library;
the index optimization module is used for improving the efficiency of the system during full-text indexing by merging and optimizing indexes, and acts on the indexes generated by the index construction module;
the index changing module is used for receiving changed document information, including addition, deletion and modification of documents, so as to dynamically update the index, and the index changing module acts on the index building module;
the query retrieval module is used for providing a search interface for full-text retrieval for a user, receiving query keywords from the user, segmenting the query keywords, submitting the query keywords to an index library for retrieval, and returning a result set;
the log recording module is used for recording the CRUD operations of an administrator on document data in the server directory, and is started by the file real-time monitoring module;
the data backup module is used for backing up data information and generating backup data, so as to prevent data from being damaged or lost due to a system fault or malicious attack and to ensure that the data information can be recovered from an error state to a known correct state, and is started after the operation of the log recording module is finished;
and the data recovery module is used for recovering the data to a consistent state before the fault by using the full data backup and the log file when the system fails during operation.
2. The configurable full-text retrieval service system based on document real-time monitoring of claim 1, wherein the file real-time monitoring module comprises:
the system engine initialization module is used for recursively traversing the file directory during system initialization, waking up the data backup module to copy data, and then triggering the index change module to perform the initial construction and optimization of the index;
the document addition capture module is used for monitoring in real time the operation of adding a document by an administrator;
the document deletion capture module is used for monitoring in real time the operation of deleting an expired document by the administrator;
and the document update capture module is used for monitoring in real time the operation of updating a document by the administrator.
3. The configurable full-text retrieval service system based on document real-time monitoring of claim 1, wherein the index building module comprises:
a text parser for analyzing the contents of the web page and the document, unifying into a plain document or an intermediate document,
and the indexer is used for reprocessing the results of processing and analyzing the unstructured documents by the text parser, sequentially reading and analyzing the index items, establishing a linked list arranged according to the index items by using a preset index item dictionary, dynamically changing the index dictionary, and finally completing the index list, the index dictionary and the document index organized according to the index items.
4. The configurable full-text retrieval service system based on document real-time monitoring of claim 1, wherein the index optimization module comprises:
the offline index optimization module is used for acting on the index construction module offline when a document is updated, and for calling the corresponding optimization functions to adjust and optimize the offline index;
the offline index merging module is used for merging a series of small index files into one index file so as to improve retrieval efficiency, wherein the offline index merging module and the offline index optimization module jointly rebuild and optimize the offline index to generate the final optimized index;
and the index service switching module is used for switching between the offline index and the current service index, so that the offline index can be optimized normally without affecting the retrieval efficiency of the service index.
5. The configurable full-text retrieval service system based on document real-time monitoring of claim 1, wherein the index change module comprises:
the document adding processing module is used for receiving the added document information so as to dynamically update the index;
the document deleting processing module is used for receiving the deleted document information so as to dynamically update the index;
and the document modification processing module is used for receiving the modified document information so as to dynamically update the index.
6. The configurable full-text retrieval service system based on document real-time monitoring of claim 1, wherein the query retrieval module comprises:
the keyword submission module is used for providing a full-text search keyword submission page for a user, providing an interface for the full-text retrieval system to acquire a query request submitted by the user, and performing word segmentation on the received keywords according to a word segmentation strategy;
the background retrieval module is used for submitting the received keywords and the user group information to the index library for retrieval, and then acquiring and sorting the retrieval results from the index library;
and the snapshot generation module is used for generating a result snapshot according to the returned result set, and is responsible for generating a result page and displaying the result to the user.
CN 201010181321 2010-05-19 2010-05-19 Configurable full-text retrieval service system based on document real-time monitoring Pending CN101853288A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010181321 CN101853288A (en) 2010-05-19 2010-05-19 Configurable full-text retrieval service system based on document real-time monitoring

Publications (1)

Publication Number Publication Date
CN101853288A true CN101853288A (en) 2010-10-06

Family

ID=42804780

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201010181321 Pending CN101853288A (en) 2010-05-19 2010-05-19 Configurable full-text retrieval service system based on document real-time monitoring

Country Status (1)

Country Link
CN (1) CN101853288A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020198875A1 (en) * 2001-06-20 2002-12-26 Masters Graham S. System and method for optimizing search results
CN1731398A (en) * 2004-08-06 2006-02-08 佳能株式会社 Information processing apparatus, document search method, program, and storage medium
CN101059811A (en) * 2006-03-14 2007-10-24 佳能株式会社 Document retrieving system, document retrieving apparatus, and method thereof
CN101246492A (en) * 2008-02-26 2008-08-20 华中科技大学 Full text retrieval system based on natural language

Cited By (40)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102004800A (en) * 2010-12-28 2011-04-06 北京数码大方科技有限公司 Data query method and device of PDM (Product Data Management) system
CN102591859B (en) * 2011-12-28 2014-11-05 华为技术有限公司 Method and relevant device for reusing industrial standard formatted files
CN102591859A (en) * 2011-12-28 2012-07-18 华为技术有限公司 Method and relevant device for reusing industrial standard formatted files
CN102662794A (en) * 2012-03-09 2012-09-12 无锡华御信息技术有限公司 System and method for document data backup
CN102831253A (en) * 2012-09-25 2012-12-19 北京科东电力控制系统有限责任公司 Distributed full-text retrieval system
CN102831253B (en) * 2012-09-25 2015-01-21 北京科东电力控制系统有限责任公司 Distributed full-text retrieval system
CN103678597A (en) * 2013-12-13 2014-03-26 北京奇虎科技有限公司 Optimization method and device of model essay webpage database
CN103955500A (en) * 2014-04-22 2014-07-30 广州杰赛科技股份有限公司 Cloud storage-based massive document data information structural display method and system
CN104166734A (en) * 2014-09-05 2014-11-26 上海海事大学 SVN full-text retrieval system and method
CN104166734B (en) * 2014-09-05 2018-04-20 上海海事大学 A kind of SVN text retrieval systems and search method
CN105589881B (en) * 2014-10-23 2020-01-24 大唐软件技术股份有限公司 Data processing method and device
CN105589881A (en) * 2014-10-23 2016-05-18 大唐软件技术股份有限公司 Data processing method and device
CN104750801A (en) * 2015-03-24 2015-07-01 华迪计算机集团有限公司 Generation method and system of structured document
CN105045684B (en) * 2015-07-16 2018-06-15 北京京东尚科信息技术有限公司 Index switching and the method and device of index control
CN105045684A (en) * 2015-07-16 2015-11-11 北京京东尚科信息技术有限公司 Method and device for switching and controlling indexes
WO2016180164A1 (en) * 2015-09-29 2016-11-17 中兴通讯股份有限公司 Method and apparatus for rolling back distributed transaction
CN106557514A (en) * 2015-09-29 2017-04-05 中兴通讯股份有限公司 A kind of distributed transaction rollback method and device
CN105677746A (en) * 2015-12-29 2016-06-15 上海爱数信息技术股份有限公司 Database transaction operation based duplicate files merging system and method
CN107451176A (en) * 2016-05-30 2017-12-08 恩芬森株式会社 Data copy method and its device
CN107861712A (en) * 2016-09-26 2018-03-30 平安科技(深圳)有限公司 Develop the generation method and system of daily record
CN107103075A (en) * 2017-04-24 2017-08-29 广东浪潮大数据研究有限公司 The text searching method and device of a kind of ftp file
CN109145077A (en) * 2017-06-19 2019-01-04 核工业北京地质研究院 A kind of facilitation text searching method based on Open Source Framework
CN107341203A (en) * 2017-06-22 2017-11-10 北京北信源软件股份有限公司 The access control and optimiged index method and apparatus of a kind of distributed search engine
CN108416264A (en) * 2018-01-29 2018-08-17 山东汇贸电子口岸有限公司 A kind of searching method and search module of supporting OCR to input
CN108804592A (en) * 2018-05-28 2018-11-13 山东浪潮商用系统有限公司 Knowledge library searching implementation method
CN110609844B (en) * 2018-05-29 2022-05-13 优信拍(北京)信息科技有限公司 Data updating method, device and system
CN110609844A (en) * 2018-05-29 2019-12-24 优信拍(北京)信息科技有限公司 Data updating method, device and system
CN108804642A (en) * 2018-06-05 2018-11-13 中国平安人寿保险股份有限公司 Search method, device, computer equipment and storage medium
CN108959199B (en) * 2018-06-28 2022-08-16 武汉斗鱼网络科技有限公司 Log highlighting method and device, storage medium and android terminal
CN108959199A (en) * 2018-06-28 2018-12-07 武汉斗鱼网络科技有限公司 A kind of log highlights method, apparatus, storage medium and android terminal
CN111221814A (en) * 2018-11-27 2020-06-02 阿里巴巴集团控股有限公司 Secondary index construction method, device and equipment
CN111221814B (en) * 2018-11-27 2023-06-27 阿里巴巴集团控股有限公司 Method, device and equipment for constructing secondary index
CN109885654A (en) * 2019-02-01 2019-06-14 天津字节跳动科技有限公司 Online document modifies treating method and apparatus
CN109815194A (en) * 2019-02-01 2019-05-28 北京沃东天骏信息技术有限公司 Indexing means, indexing unit, computer readable storage medium and electronic equipment
CN110096636A (en) * 2019-05-08 2019-08-06 上海泰豪迈能能源科技有限公司 Search engine optimization method, apparatus and electronic equipment
CN110297829A (en) * 2019-06-26 2019-10-01 重庆紫光华山智安科技有限公司 A kind of text searching method and system towards specific industry structuring business datum
CN111242559A (en) * 2019-12-20 2020-06-05 南京南瑞信息通信科技有限公司 Data resource management platform and method
CN111388996A (en) * 2020-04-10 2020-07-10 网易(杭州)网络有限公司 Three-dimensional virtual object display method, device and system, storage medium and equipment
CN113779349A (en) * 2021-08-11 2021-12-10 中央广播电视总台 Data retrieval system, apparatus, electronic device, and readable storage medium
CN114706941A (en) * 2022-03-03 2022-07-05 广州万辉信息科技有限公司 Patent monitoring platform and method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20101006