CN104391941A - Method for rapidly establishing full-text retrieval tool for common files

Info

Publication number: CN104391941A
Application number: CN201410684418.9A
Authority: CN
Inventors: 刘粉粉
Original assignee: Inspur Electronic Information Industry Co Ltd
Current assignee: Inspur Electronic Information Industry Co Ltd
Priority date: 2014-11-25
Filing date: 2014-11-25
Publication date: 2015-03-04

Abstract

The invention discloses a method for rapidly building a full-text retrieval tool for common files, belonging to the field of retrieval tools. The method comprises the following steps: (1) a document parsing module reads and parses all files, composes HTTP requests, and sends the requests to a Chinese word segmentation module; (2) the Chinese word segmentation module segments the attribute content in the received HTTP requests; (3) a full-text index building module customizes the index service type; (4) a retrieval module parses a retrieval command and performs the corresponding operation, completing the construction of the retrieval tool; (5) after a user submits search terms, the retrieval module segments the search terms, generates a query request, queries the index library, and presents the results to the user. With this method, individuals and enterprises can build their own dedicated search engine, meet their own retrieval needs with relatively little time and effort, and easily manage a large number of internal files.

Description

A method for rapidly building a full-text retrieval tool for common files

Technical field

The present invention discloses a method for rapidly building a retrieval tool and belongs to the field of retrieval tools; specifically, it is a method for rapidly building a full-text retrieval tool for common files.

Background technology

Full-text search retrieves any piece of content from a stored book or article. It can return the relevant chapters, paragraphs, sentences, and words of the full text as required, much as if a label had been attached to every word of the book, and it also supports various kinds of statistics and analysis. Solr is a standalone enterprise-level search application server that exposes an API similar to a web service. Users can submit XML files in a prescribed format to the search engine server over HTTP to generate an index; they can also issue search requests via HTTP GET and receive the results in XML format.
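As a rough illustration of the query side of this HTTP interface, the following sketch builds a GET URL for Solr's /select handler and reads documents out of an XML result. The host, port, core name "files", and the field names "filename" and "content" are assumptions made for the example, and the sample response is a hand-written, trimmed stand-in for what a server might return, not captured output.

```python
from urllib.parse import urlencode
from xml.etree import ElementTree as ET

# Assumed address of a local Solr core named "files" (hypothetical).
SOLR_SELECT = "http://localhost:8983/solr/files/select"


def build_select_url(query, rows=10):
    """Return an HTTP GET URL that asks Solr for results in XML format."""
    return SOLR_SELECT + "?" + urlencode({"q": query, "wt": "xml", "rows": rows})


def parse_results(xml_text):
    """Turn each <doc> element of an XML result into a {field name: text} dict."""
    root = ET.fromstring(xml_text)
    hits = []
    for doc in root.findall("./result/doc"):
        hits.append({child.get("name"): child.text for child in doc})
    return hits


# Hand-written sample shaped like a Solr XML response (header omitted).
SAMPLE_RESPONSE = """<response>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="filename">/docs/report.txt</str>
      <str name="content">quarterly report</str>
    </doc>
  </result>
</response>"""

url = build_select_url("content:report")
hits = parse_results(SAMPLE_RESPONSE)
```

In a running deployment the URL would be fetched with an HTTP client and the response body passed to parse_results; the sketch stops short of any network call so that it runs without a server.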

The search needs of many users still rest at the database stage, but when the search workload is very large, the performance of a database becomes a limitation. Searching the content of a large number of files is something a database can hardly accomplish, or can accomplish only with considerable difficulty. Choosing a mature open-source search engine as the core and building a usable retrieval tool around it is a good option. However, building a practical text search tool is very complicated, and there is no unified and simple construction method. The invention provides a method for rapidly building a full-text retrieval tool for common files. The retrieval tool is based on the open-source search engine Solr: files are stored in the search engine, a full-text index is built over them, and all related content can be retrieved quickly by search keyword and finally presented to the user. With this method, individuals and enterprises can build their own dedicated search engine, meet their own search needs with relatively little time and effort, and easily manage a large number of internal files.

Summary of the invention

In view of the deficiencies and problems of the prior art, the present invention provides a method for rapidly building a full-text retrieval tool for common files. It is suitable for individuals who want a tool that can quickly search the various files they have accumulated over time, and even more suitable for enterprises that need to manage a large number of internal files and quickly find the files they need.

The specific scheme proposed by the method for rapidly building a full-text retrieval tool for common files according to the present invention is as follows:

A system for rapidly building a full-text retrieval tool for common files, implemented on Solr, comprises a document parsing module, a Chinese word segmentation module, a full-text index building module, a full-text index library, and a retrieval module;

The document parsing module is responsible for parsing files;

The Chinese word segmentation module uses a Chinese word segmentation algorithm to segment the full text of the file content so that a full-text index can be built;

The full-text index building module builds a full-text index over the words produced by the Chinese word segmentation module;

The full-text index library is responsible for data storage;

The retrieval module is responsible for implementing the user's various retrieval operations.

A method for rapidly building a full-text retrieval tool for common files, implemented on Solr, comprises the following specific steps:

1. The document parsing module reads and parses all files and converts them into XML format, parsing each file into two attributes, then composes an HTTP request and sends it to the Chinese word segmentation module;

2. The Chinese word segmentation module segments the attribute content received in the HTTP request; after all attributes have been segmented, the full-text index building module builds the index. The segmentation algorithm is set in a configuration file;

3. The full-text index building module customizes the index service type, plans in a configuration file the fields to be stored and the fields to be saved, and then stores all the built indexes and data in the full-text index library;

4. After parsing a retrieval command, the retrieval module obtains the index from the full-text index library and performs the corresponding retrieve, delete, or modify index operation, completing the construction of the retrieval tool;

5. After a query term is submitted, the retrieval module segments and otherwise processes the query term, generates a query request, queries the index library, and presents the results to the user.
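The flow of the steps above can be sketched end to end in a few lines. The code below is only an in-memory stand-in: in the described method Solr performs the indexing and querying, and the segmentation algorithm is whatever the configuration file selects. Forward maximum matching is used here purely as one simple, well-known segmentation technique, and the vocabulary, file paths, and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical dictionary used by the toy segmenter.
VOCABULARY = {"全文", "检索", "工具", "文件", "管理"}


def segment(text, vocabulary, max_len=4):
    """Forward maximum matching: at each position take the longest dictionary word."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when no dictionary word matches.
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return [t for t in tokens if not t.isspace()]


def build_index(documents, vocabulary):
    """Steps 2 and 3: segment each file's content and record which files hold each word."""
    index = defaultdict(set)
    for filename, content in documents.items():
        for token in segment(content, vocabulary):
            index[token].add(filename)
    return index


def search(index, query, vocabulary):
    """Step 5: segment the query the same way, then keep files containing every word."""
    tokens = segment(query, vocabulary)
    if not tokens:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in tokens))
    return sorted(hits)


documents = {"/docs/a.txt": "全文检索工具", "/docs/b.txt": "文件管理工具"}
index = build_index(documents, VOCABULARY)
results = search(index, "检索工具", VOCABULARY)
```

Applying the same segmenter at index time and at query time is what lets the query words line up with the indexed words; this is why step 5 segments the query term before generating the query request.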

In step 1, the two attributes into which each file is parsed are the file name and the entire content of the file, where the file name includes the absolute path at which the file is stored.
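A minimal sketch of this parsing step for a plain-text file is given below. It reads the file, keeps the absolute path as the file name attribute, and wraps both attributes in Solr's XML update format. The field names "filename" and "content" and the helper names are assumptions; Word and PDF files would need a format-specific text extractor, which is not shown.

```python
import os
import tempfile
from xml.etree import ElementTree as ET


def parse_text_file(path):
    """Parse one TXT file into the two attributes: absolute file name and full content."""
    absolute = os.path.abspath(path)
    with open(absolute, encoding="utf-8") as handle:
        content = handle.read()
    return {"filename": absolute, "content": content}


def to_update_xml(attributes):
    """Wrap the two attributes in a Solr XML update message (<add><doc>...)."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name in ("filename", "content"):
        ET.SubElement(doc, "field", {"name": name}).text = attributes[name]
    # ElementTree escapes characters such as < and > in the field text.
    return ET.tostring(add, encoding="unicode")


# Demonstration on a temporary file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as folder:
    sample = os.path.join(folder, "notes.txt")
    with open(sample, "w", encoding="utf-8") as handle:
        handle.write("meeting notes <draft>")
    attributes = parse_text_file(sample)
    message = to_update_xml(attributes)
```

The resulting message would form the body of the HTTP request that step 1 sends onward; storing the absolute path means a search hit can be traced straight back to the file on disk.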

In step 2, the full-text index building module builds an index with an inverted data structure.
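To make the inverted structure concrete, the sketch below maps each word to the files that contain it, together with the positions at which it occurs. It is a simplified illustration, not Solr's actual on-disk format; the file names and tokens are made up, and the input is assumed to have been segmented already.

```python
from collections import defaultdict


def build_inverted_index(tokenized_docs):
    """Map each term to {file name: [positions where the term occurs]}."""
    index = defaultdict(dict)
    for filename, tokens in tokenized_docs.items():
        for position, term in enumerate(tokens):
            index[term].setdefault(filename, []).append(position)
    return index


# Hypothetical, already-segmented documents.
tokenized_docs = {
    "/docs/a.txt": ["full", "text", "search", "tool"],
    "/docs/b.txt": ["file", "search", "search", "log"],
}

index = build_inverted_index(tokenized_docs)
postings = index["search"]

# Term frequency per file falls out of the postings for free.
frequencies = {name: len(positions) for name, positions in postings.items()}
```

A lookup by word is a single dictionary access regardless of how many files were indexed, which is the property that makes keyword retrieval fast compared with scanning every file.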

In step 4, after parsing the retrieval command, the retrieval module can also sort the retrieval results, highlight keywords, and weight search keywords.
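One possible way to express these three features as Solr request parameters is sketched below. The source does not say which parameters the retrieval module uses, so this mapping is an assumption: sorting via "sort", highlighting via "hl" and "hl.fl", and weighting via per-field boosts in "qf" under the edismax query parser. The field names and boost values are hypothetical.

```python
from urllib.parse import urlencode, parse_qs


def build_query_params(query, boosts, highlight_field="content"):
    """Assemble Solr parameters for sorting, highlighting, and field weighting."""
    # qf takes space-separated "field^boost" pairs, e.g. "filename^2 content^1".
    qf = " ".join(f"{field}^{weight}" for field, weight in boosts.items())
    return {
        "q": query,
        "defType": "edismax",      # query parser that honours qf boosts
        "qf": qf,
        "sort": "score desc",      # most relevant results first
        "hl": "true",              # turn highlighting on
        "hl.fl": highlight_field,  # field whose matches are highlighted
        "hl.simple.pre": "<em>",
        "hl.simple.post": "</em>",
        "wt": "xml",
    }


params = build_query_params("retrieval", {"filename": 2, "content": 1})
query_string = urlencode(params)
```

Here a match in the file name counts twice as much as a match in the content; individual keywords can also be weighted inside the query text itself with the same caret syntax, for example "retrieval^2 tool".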

The common files are in Word, PDF, or TXT format.

The beneficial effects of the present invention are as follows: the retrieval tool for common files is based on the open-source search engine Solr; files are stored in the search engine, a full-text index is built over them, and all related content can be retrieved quickly by search keyword and finally presented to the user. With this method, individuals and enterprises can build their own dedicated search engine, meet their own search needs with relatively little time and effort, and easily manage a large number of internal files.

Brief description of the drawings:

Fig. 1 is a schematic flow diagram of the method for rapidly building a full-text retrieval tool for common files.

Embodiment

The present invention is further described below with reference to the accompanying drawing:

Embodiment 1

Based on the search engine Solr, a system for rapidly building a full-text retrieval tool for common files is constructed, comprising a document parsing module, a Chinese word segmentation module, a full-text index building module, a full-text index library, and a retrieval module; the Chinese word segmentation module, the full-text index building module, the full-text index library, and the retrieval module all operate on top of the search engine Solr;

The document parsing module is responsible for parsing files;

The full-text index library is responsible for data storage;

The retrieval module is responsible for implementing the user's various retrieval operations.

A method for rapidly building a full-text retrieval tool for common files comprises the following specific steps:

1. The document parsing module reads and parses Word documents and converts them into XML format, parsing each document into two attributes, namely the file name and the entire content of the file, where the file name includes the absolute path at which the file is stored; it then composes an HTTP request and sends it to the Chinese word segmentation module;

2. The Chinese word segmentation module segments the attribute content received in the HTTP request; after all attributes have been segmented, the full-text index building module builds an index with an inverted data structure. The segmentation algorithm is set in a configuration file;

Embodiment 2

The document parsing module is responsible for parsing files;

The full-text index library is responsible for data storage;

The retrieval module is responsible for implementing the user's various retrieval operations.

1. The document parsing module reads and parses PDF documents and converts them into XML format, parsing each document into two attributes, namely the file name and the entire content of the file, where the file name includes the absolute path at which the file is stored; it then composes an HTTP request and sends it to the Chinese word segmentation module;

4. After parsing a retrieval command, the retrieval module obtains the index from the full-text index library and performs the corresponding retrieve, delete, or modify index operation; it can also sort the retrieval results, highlight keywords, and weight search keywords, completing the construction of the retrieval tool;

Claims

1. A system for rapidly building a full-text retrieval tool for common files, implemented on Solr, characterized by comprising a document parsing module, a Chinese word segmentation module, a full-text index building module, a full-text index library, and a retrieval module;

The document parsing module is responsible for parsing files;

The full-text index library is responsible for data storage;

The retrieval module is responsible for implementing the user's various retrieval operations.

2. A method for rapidly building a full-text retrieval tool for common files, using the system for rapidly building a full-text retrieval tool for common files as claimed in claim 1, characterized in that the specific steps are

3. The method for rapidly building a full-text retrieval tool for common files according to claim 2, characterized in that in step 1 the two attributes into which each file is parsed are the file name and the entire content of the file, where the file name includes the absolute path at which the file is stored.

4. The method for rapidly building a full-text retrieval tool for common files according to claim 2 or 3, characterized in that in step 2 the full-text index building module builds an index with an inverted data structure.

5. The method for rapidly building a full-text retrieval tool for common files according to claim 4, characterized in that in step 4, after parsing the retrieval command, the retrieval module can also sort the retrieval results, highlight keywords, and weight search keywords.

6. The method for rapidly building a full-text retrieval tool for common files according to any one of claims 2, 3, or 5, characterized in that the common files are in Word, PDF, or TXT format.